Polynomial regression
-
Not all relationships are linear.
Linear formula: y = mx + b
This is a “first order” or “first degree” polynomial, as the power of x is 1.
Second order polynomial: y = ax**2 + bx + c
Third order polynomial: y = ax**3 + bx**2 + cx + d
Higher-order polynomials produce more complex curves.
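As a quick illustration of what these orders look like in code, numpy's poly1d takes the coefficients from the highest power down and returns a callable polynomial; the coefficient values below are arbitrary, chosen only for the demo:

import numpy as np

first_order = np.poly1d([2.0, 1.0])               # y = 2x + 1
second_order = np.poly1d([1.0, -3.0, 2.0])        # y = x**2 - 3x + 2
third_order = np.poly1d([0.5, 0.0, -1.0, 4.0])    # y = 0.5x**3 - x + 4

# evaluate each polynomial at x = 2
print(first_order(2.0), second_order(2.0), third_order(2.0))   # 5.0 0.0 6.0
-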
Beware overfitting:
Don’t use more degrees than you need.
Visualize your data first to see how complex a curve it might really call for.
Visualize the fit - is your curve going out of its way to accommodate outliers?
A high r-squared simply means your curve fits your training data well, but it may not be a good predictor (see the sketch below).
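One way to check whether a high-degree curve actually generalizes is to fit on a training portion and score r-squared on held-out points. This is a minimal sketch, assuming sklearn's train_test_split; the split is not part of the original example, but the fabricated data mirrors the code section below:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# same fabricated page-speed / purchase data as the code section below
np.random.seed(2)
pageSpeeds = np.random.normal(3.0, 1.0, 1000)
purchaseAmount = np.random.normal(50.0, 10.0, 1000) / pageSpeeds

# hold out 25% of the points and fit an 8th-degree polynomial on the rest
xTrain, xTest, yTrain, yTest = train_test_split(
    pageSpeeds, purchaseAmount, test_size=0.25, random_state=0)
p8 = np.poly1d(np.polyfit(xTrain, yTrain, 8))

print("train r2:", r2_score(yTrain, p8(xTrain)))
print("test r2:", r2_score(yTest, p8(xTest)))
# if the test r-squared is much lower than the training r-squared, the curve is overfitting
-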
Code
# fabricate some data
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)
pageSpeeds = np.random.normal(3.0, 1.0, 1000)
purchaseAmount = np.random.normal(50.0, 10.0, 1000) / pageSpeeds

plt.scatter(pageSpeeds, purchaseAmount)
plt.show()

# numpy has a handy polyfit function we can use to construct an nth-degree
# polynomial model of our data that minimizes squared error.
# Let's try it with a 4th-degree polynomial.
x = np.array(pageSpeeds)
y = np.array(purchaseAmount)
p4 = np.poly1d(np.polyfit(x, y, 4))

# visualize the fit against the scatter plot
xp = np.linspace(0, 7, 100)
plt.scatter(x, y)
plt.plot(xp, p4(xp), c='r')
plt.show()

# measure the r-squared score: 0 means the model explains none of the variance, 1 is a perfect fit
from sklearn.metrics import r2_score
r2 = r2_score(y, p4(x))
print(r2)
# output, pretty good:
# 0.82937663963

# change the order to 8
p8 = np.poly1d(np.polyfit(x, y, 8))
xp = np.linspace(0, 7, 100)
plt.scatter(x, y)
plt.plot(xp, p8(xp), c='r')
plt.show()

r2 = r2_score(y, p8(x))
print(r2)
# higher r-squared than the 4th-degree fit, but only on the training data (see the overfitting warning above)
# 0.881439566368

# change the order to 1; this is just linear regression
p1 = np.poly1d(np.polyfit(x, y, 1))
xp = np.linspace(0, 7, 100)
plt.scatter(x, y)
plt.plot(xp, p1(xp), c='r')
plt.show()

r2 = r2_score(y, p1(x))
print(r2)
# r-squared is only 0.50
# 0.502494130455
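The r-squared score used above can also be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. This short sketch (not part of the original example) recomputes it for the 4th-degree fit and should agree with r2_score:

import numpy as np
from sklearn.metrics import r2_score

np.random.seed(2)
pageSpeeds = np.random.normal(3.0, 1.0, 1000)
purchaseAmount = np.random.normal(50.0, 10.0, 1000) / pageSpeeds

p4 = np.poly1d(np.polyfit(pageSpeeds, purchaseAmount, 4))

# r-squared by hand: 1 - (sum of squared residuals / total sum of squares)
residuals = purchaseAmount - p4(pageSpeeds)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((purchaseAmount - np.mean(purchaseAmount)) ** 2)
print(1.0 - ss_res / ss_tot)
print(r2_score(purchaseAmount, p4(pageSpeeds)))   # same value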