# Regression Packages
There are many Python packages to compute and analyze linear regression models. Let's look at the packages scipy.stats, sklearn.linear_model, and statsmodels.
import numpy as np
import matplotlib.pyplot as plt
import scipy.linalg as la
## Example: REALTOR.ca
Let’s compute linear regression models for the same example using each of the different Python packages and compare the results. The data below was taken from REALTOR.ca for listings in the Kitsilano neighborhood of Vancouver. The variables \(x_1,x_2,x_3\) correspond to the size (in square feet, sqft), the number of bedrooms and the number of bathrooms respectively. The dependent variable \(y\) is the listing price.
x1 = np.array([1050,943,1557,662,829,724,482,702,733,637,819,802,771,779,823,924,1088,1018])
x2 = np.array([2,2,2,1,2,1,1,1,2,1,2,2,1,2,2,2,2,2])
x3 = np.array([2,2,2,1,2,1,1,1,1,1,1,1,1,2,2,2,2,2])
y = np.array([1.39,1.079,1.398,0.859,1.098,0.619,0.625,0.639,0.728888,0.778,0.749888,0.825,0.858,0.899,0.8999,0.999888,1.099,1.149])
In a previous section, we computed optimal parameters for the linear regression model by explicitly constructing and solving the normal equations \(X^T X \boldsymbol{\beta} = X^T \mathbf{y}\).
Let’s construct the linear regression model \(y = \beta_0 + \beta_1 x_1 + \varepsilon\) which just uses the size of the property.
N = len(y)
X = np.column_stack([np.ones(N),x1])
beta = la.solve(X.T@X,X.T@y)
print(beta)
[0.18918767 0.00086607]
residuals = y - (beta[0] + beta[1]*x1)
R2 = 1 - np.sum(residuals**2)/np.sum((y - np.mean(y))**2)
print('R2 =',R2)
R2 = 0.7235562224127696
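Since matplotlib.pyplot is imported above, we can also visualize this single-feature model. Here is a minimal plotting sketch (added for illustration; the axis labels are our own choices):
plt.scatter(x1,y)                         # observed data: size vs. listing price
xs = np.linspace(x1.min(),x1.max(),100)   # grid of sizes for the fitted line
plt.plot(xs,beta[0] + beta[1]*xs,'r')     # fitted line from the parameters above
plt.xlabel('Size (sqft)')
plt.ylabel('Price')
plt.show()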
Let’s also construct the linear regression model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\).
N = len(y)
X = np.column_stack([np.ones(N),x1,x2,x3])
beta = la.solve(X.T@X,X.T@y)
print(beta)
[ 0.15375002 0.00059475 -0.0341029 0.21569858]
residuals = y - (beta[0] + beta[1]*x1 + beta[2]*x2 + beta[3]*x3)
R2 = 1 - np.sum(residuals**2)/np.sum((y - np.mean(y))**2)
R2adj = 1 - (1 - R2)*(N - 1)/(N - 3 - 1)
print('R2 =',R2,'R2adj =',R2adj)
R2 = 0.8302699983155557 R2adj = 0.7938992836688891
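As a quick sanity check (a sketch added here; the name x_new is ours), we can use these parameters directly to predict the listing price of a 1000 sqft property with 2 bedrooms and 1 bathroom. We will repeat this prediction with each package below.
x_new = np.array([1,1000,2,1])   # leading 1 multiplies the intercept
print(x_new @ beta)              # predicted listing price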
## scipy.stats
The package scipy.stats includes a function linregress which computes a linear regression for only a single feature \(p=1\). The function returns 5 values:

- beta1 is the slope
- beta0 is the intercept
- r is the square root of \(R^2\) (called the Pearson correlation coefficient)
- pvalue is the \(p\)-value for a hypothesis test whose null hypothesis is that the slope is zero
- stderr is the standard error of the estimated slope
from scipy.stats import linregress
beta1, beta0, r, pvalue, stderr = linregress(x1,y)
print(beta0,beta1)
print(r**2)
0.18918767130612613 0.0008660748169516868
0.7235562224127696
The results match our explicit calculations above exactly!
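The remaining return values unpacked above can be inspected directly as well; a small sketch (outputs omitted here):
print('p-value =',pvalue)
print('standard error of the slope =',stderr)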
See also
Check out the documentation to learn more about scipy.stats.linregress.
## sklearn.linear_model
The package sklearn includes objects and methods for all kinds of machine learning models. The subpackage sklearn.linear_model includes an object LinearRegression to compute linear regression models.
from sklearn.linear_model import LinearRegression
X = np.column_stack([x1,x2,x3])
Y = y.reshape((len(y),1))
model = LinearRegression().fit(X,Y)
The result is an object that we have named model which has several attributes. Note that LinearRegression includes an intercept in the model by default, so we do not add a column of ones to X. The attribute model.intercept_ corresponds to \(\beta_0\).
model.intercept_
array([0.15375002])
The attribute model.coef_ corresponds to the vector of parameters \([\beta_1,\beta_2,\beta_3]\).
model.coef_
array([[ 0.00059475, -0.0341029 , 0.21569858]])
The method model.score(X,Y) computes \(R^2\).
model.score(X,Y)
0.8302699983155556
The results match our explicit computations above exactly!
Use the method model.predict() to predict the listing price of a 1000 sqft property with 2 bedrooms and 1 bathroom.
model.predict([[1000,2,1]])
array([[0.89599539]])
This is the same as computing explicitly:
model.intercept_ + model.coef_[0,0]*1000 + model.coef_[0,1]*2 + model.coef_[0,2]*1
array([0.89599539])
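For comparison with linregress above, the same object also handles the single-feature model. A minimal sketch (added here as a cross-check; the name model1 is ours):
model1 = LinearRegression().fit(x1.reshape(-1,1),y)   # sklearn expects a 2D feature array
print(model1.intercept_,model1.coef_)                 # should match beta0 and beta1 from linregress
print(model1.score(x1.reshape(-1,1),y))               # should match R2 computed earlier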
See also
Check out the documentation to learn more about sklearn.linear_model.
## statsmodels
The package statsmodels includes many functions for statistical analysis. Let's use the function statsmodels.api.OLS to compute linear regression models.
import statsmodels.api as sm
X = np.column_stack([np.ones(N),x1,x2,x3])
model = sm.OLS(y,X).fit()
The result is an object that we have named model which has several attributes. The attribute model.params corresponds to the vector of parameters \([\beta_0,\beta_1,\beta_2,\beta_3]\).
model.params
array([ 0.15375002, 0.00059475, -0.0341029 , 0.21569858])
The attributes model.rsquared and model.rsquared_adj correspond to \(R^2\) and the adjusted \(R^2\) respectively.
model.rsquared
0.8302699983155556
model.rsquared_adj
0.793899283668889
Use the method model.predict() to predict the listing price of a 1000 sqft property with 2 bedrooms and 1 bathroom.
model.predict([1,1000,2,1])
array([0.89599539])
Note that we have to include the value \(x_0 = 1\) as if the model has the form \(y = \beta_0 x_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\).
This is the same as computing explicitly:
model.params[0]*1 + model.params[1]*1000 + model.params[2]*2 + model.params[3]*1
0.895995393498798
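statsmodels also produces a full statistical report with coefficient estimates, standard errors, \(p\)-values and more via the summary method (output omitted here):
print(model.summary())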
See also
Check out the documentation to learn more about statsmodels.