Today I finally got to Machine Learning.
Machine learning:
Algorithms that can learn from observational data and make predictions based on it.
Machine learning is divided into unsupervised learning and supervised learning.
The difference: supervised learning has "reference answers" (labels), while unsupervised learning does not.
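A minimal sketch of that difference, using illustrative scikit-learn models and toy data that are my own assumptions, not from the course: a supervised model is fit with both features and labels, an unsupervised model with features only.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                # features
y = X[:, 0] * 2.0 + X[:, 1]               # labels: the "reference answers"

LinearRegression().fit(X, y)              # supervised: needs X and y
KMeans(n_clusters=3, n_init=10).fit(X)    # unsupervised: X only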
Train/Test:
Need to ensure both sets are large enough to contain representatives of all the variations and outliers in the data you care about.
The data sets must be selected randomly.
Train/Test is a great way to guard against overfitting.
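A minimal sketch of a random split using scikit-learn's train_test_split, which shuffles by default (the toy data, variable names, and the 80/20 ratio are my own illustration):

import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.normal(3.0, 1.0, 100)
labels = np.random.normal(50.0, 30.0, 100)

# test_size=0.2 holds out 20% of the rows, selected at random.
trainX, testX, trainY, testY = train_test_split(
    data, labels, test_size=0.2, random_state=0)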
K-fold Cross Validation:
One way to further protect against overfitting is k-fold cross validation: split your data into k randomly-assigned segments, then (see the sketch after this list):
- Reserve one segment as your test data.
- Train on the remaining k-1 segments combined, and measure the model's performance against the held-out segment.
- Repeat so that each of the k segments takes a turn as the test set, then take the average of the k r-squared scores.
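A minimal sketch of k-fold cross validation with scikit-learn's cross_val_score (the LinearRegression model, k=5, and the toy data are illustrative assumptions, not from the course):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 1)
y = 2.0 * X[:, 0] + np.random.normal(0, 0.1, 100)

# cv=5 splits the data into 5 folds; each fold takes a turn as the test set.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores.mean())   # the average of the 5 r-squared scores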
Below is the actual code.
Train/Test practice: preventing overfitting of a polynomial regression.
First, generate some fake data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

np.random.seed(2)
pageSpeeds = np.random.normal(3.0, 1.0, 100)
purchaseAmount = np.random.normal(50.0, 30.0, 100) / pageSpeeds
plt.scatter(pageSpeeds, purchaseAmount)
plt.show()

# Now split the data into two parts: 80% for training, 20% for testing.
# This way we can guard against overfitting.
# In the real world, shuffle the data before splitting it.
trainX = pageSpeeds[:80]
testX = pageSpeeds[80:]
trainY = purchaseAmount[:80]
testY = purchaseAmount[80:]

plt.scatter(trainX, trainY)
plt.show()
plt.scatter(testX, testY)
plt.show()

# Try to fit an 8th-degree polynomial to this data (certainly overfitting).
x = np.array(trainX)
y = np.array(trainY)
p8 = np.poly1d(np.polyfit(x, y, 8))

# Plot the polynomial against the training data.
xp = np.linspace(0, 7, 100)
axes = plt.axes()
axes.set_xlim([0, 7])
axes.set_ylim([0, 200])
plt.scatter(x, y)
plt.plot(xp, p8(xp), c='r')
plt.show()

# Plot the polynomial against the test data.
testx = np.array(testX)
testy = np.array(testY)
axes = plt.axes()
axes.set_xlim([0, 7])
axes.set_ylim([0, 200])
plt.scatter(testx, testy)
plt.plot(xp, p8(xp), c='r')
plt.show()

# The r-squared score on the test data is terrible (about 0.30),
# even though the curve fits the training data.
r2 = r2_score(testy, p8(testx))
print(r2)

# The r-squared score on the training data is better: about 0.64.
r3 = r2_score(y, p8(x))
print(r3)
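The comment above says real-world data should be shuffled before splitting, but the code here skips that step. A minimal sketch with np.random.permutation (my own addition, reusing the arrays defined above):

# Shuffle both arrays with the same random index permutation,
# so each page speed stays paired with its purchase amount.
shuffle = np.random.permutation(len(pageSpeeds))
pageSpeeds_shuffled = pageSpeeds[shuffle]
purchaseAmount_shuffled = purchaseAmount[shuffle]

trainX = pageSpeeds_shuffled[:80]
testX = pageSpeeds_shuffled[80:]
trainY = purchaseAmount_shuffled[:80]
testY = purchaseAmount_shuffled[80:]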
Finally, after some experimentation: with a 6th-degree polynomial, the r-squared score on the test data is highest, about 0.60.

p4 = np.poly1d(np.polyfit(x, y, 4))
p5 = np.poly1d(np.polyfit(x, y, 5))
p6 = np.poly1d(np.polyfit(x, y, 6))
p7 = np.poly1d(np.polyfit(x, y, 7))
p8 = np.poly1d(np.polyfit(x, y, 8))

print(r2_score(testy, p5(testx)))   # 0.504072389719
print(r2_score(testy, p6(testx)))   # 0.605011947036
print(r2_score(testy, p7(testx)))   # 0.546145145284
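The same experiment can be condensed into a loop over polynomial degrees (a sketch reusing the variables defined above):

# Fit each degree on the training data, score it on the test data.
for degree in range(1, 9):
    p = np.poly1d(np.polyfit(x, y, degree))
    print(degree, r2_score(testy, p(testx)))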