Regression Packages#

There are many Python packages to compute and analyze linear regression models. Let’s look at the packages scipy.stats, sklearn.linear_model, and statsmodels.

import numpy as np
import matplotlib.pyplot as plt
import scipy.linalg as la

Example: REALTOR.ca#

Let’s compute linear regression models for the same example using each of the different Python packages and compare the results. The data below was taken from REALTOR.ca for listings in the Kitsilano neighborhood of Vancouver. The variables \(x_1,x_2,x_3\) correspond to the size (in square feet, sqft), the number of bedrooms and the number of bathrooms respectively. The dependent variable \(y\) is the listing price.

x1 = np.array([1050,943,1557,662,829,724,482,702,733,637,819,802,771,779,823,924,1088,1018])
x2 = np.array([2,2,2,1,2,1,1,1,2,1,2,2,1,2,2,2,2,2])
x3 = np.array([2,2,2,1,2,1,1,1,1,1,1,1,1,2,2,2,2,2])
y  = np.array([1.39,1.079,1.398,0.859,1.098,0.619,0.625,0.639,0.728888,0.778,0.749888,0.825,0.858,0.899,0.8999,0.999888,1.099,1.149])

In a previous section, we computed optimal parameters for the linear regression model by explicitly constructing and solving the normal equations \(X^T X \boldsymbol{\beta} = X^T \mathbf{y}\).

Let’s construct the linear regression model \(y = \beta_0 + \beta_1 x_1 + \varepsilon\) which uses only the size of the property.

N = len(y)
X = np.column_stack([np.ones(N),x1])
beta = la.solve(X.T@X,X.T@y)
print(beta)
[0.18918767 0.00086607]
residuals = y - (beta[0] + beta[1]*x1)
R2 = 1 - np.sum(residuals**2)/np.sum((y - np.mean(y))**2)
print('R2 =',R2)
R2 = 0.7235562224127696
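
Since matplotlib is already imported, we can check the fit visually. Here is a quick sketch of the data together with the fitted line (the axis labels are our own choice):

# plot the data and overlay the fitted regression line
plt.scatter(x1,y)
xs = np.linspace(x1.min(),x1.max(),100)
plt.plot(xs,beta[0] + beta[1]*xs,'r')
plt.xlabel('Size (sqft)')
plt.ylabel('Price')
plt.show()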

Let’s also construct the linear regression model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\).

N = len(y)
X = np.column_stack([np.ones(N),x1,x2,x3])
beta = la.solve(X.T@X,X.T@y)
print(beta)
[ 0.15375002  0.00059475 -0.0341029   0.21569858]
residuals = y - (beta[0] + beta[1]*x1 + beta[2]*x2 + beta[3]*x3)
R2 = 1 - np.sum(residuals**2)/np.sum((y - np.mean(y))**2)
R2adj = 1 - (1 - R2)*(N - 1)/(N - 3 - 1)
print('R2 =',R2,'R2adj =',R2adj)
R2 = 0.8302699983155557 R2adj = 0.7938992836688891
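
Explicitly forming \(X^TX\) squares the condition number of the problem. As an alternative sketch, scipy.linalg also provides lstsq, which solves the least squares problem directly from \(X\) and should reproduce the same parameters:

# solve the least squares problem min ||X @ beta - y|| without forming the normal equations
beta_ls, res, rank, sv = la.lstsq(X,y)
print(beta_ls)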

scipy.stats#

The package scipy.stats includes a function linregress which computes a linear regression for a single feature only (\(p=1\)). The function returns 5 values:

  • beta1 is the slope

  • beta0 is the intercept

  • r is the Pearson correlation coefficient, whose square is \(R^2\)

  • pvalue is the \(p\)-value for a hypothesis test whose null hypothesis is that the slope is zero

  • stderr is the standard error of the estimated slope

from scipy.stats import linregress

beta1, beta0, r, pvalue, stderr = linregress(x1,y)
print(beta0,beta1)
print(r**2)
0.18918767130612613 0.0008660748169516868
0.7235562224127696

The results match our explicit calculations above exactly!
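
The remaining outputs are useful too. For example, stderr can be combined with a critical value of the \(t\)-distribution (with \(N-2\) degrees of freedom) to form a confidence interval for the slope. A minimal sketch of a 95% interval, using scipy.stats.t:

from scipy.stats import t

# 95% confidence interval for the slope: beta1 plus/minus tstar*stderr
N = len(y)
tstar = t.ppf(0.975,N - 2)
print(beta1 - tstar*stderr,beta1 + tstar*stderr)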

See also

Check out the documentation to learn more about scipy.stats.linregress.

sklearn.linear_model#

The package sklearn includes objects and methods for all kinds of machine learning models. The subpackage sklearn.linear_model includes an object LinearRegression to compute linear regression models.

from sklearn.linear_model import LinearRegression

X = np.column_stack([x1,x2,x3])
Y = y.reshape((len(y),1))
model = LinearRegression().fit(X,Y)

The result is an object that we have named model which has several attributes. The attribute model.intercept_ corresponds to \(\beta_0\).

model.intercept_
array([0.15375002])

The attribute model.coef_ corresponds to the vector of parameters \([\beta_1,\beta_2,\beta_3]\).

model.coef_
array([[ 0.00059475, -0.0341029 ,  0.21569858]])

The method model.score(X,Y) computes \(R^2\).

model.score(X,Y)
0.8302699983155556

The results match our explicit computations above exactly!

Use the method model.predict() to predict the listing price of a 1000 sqft property with 2 bedrooms and 1 bathroom.

model.predict([[1000,2,1]])
array([[0.89599539]])

This is the same as computing explicitly:

model.intercept_ + model.coef_[0,0]*1000 + model.coef_[0,1]*2 + model.coef_[0,2]*1
array([0.89599539])
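
Note that model.predict() accepts any number of rows at once. For example, to price two hypothetical properties in a single call:

# predict prices for several hypothetical properties at once
X_new = np.array([[1000,2,1],[800,1,1]])
model.predict(X_new)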

See also

Check out the documentation to learn more about sklearn.linear_model.

statsmodels#

The package statsmodels includes many functions for statistical analysis. Let’s use the function statsmodels.api.OLS to compute linear regression models.

import statsmodels.api as sm

X = np.column_stack([np.ones(N),x1,x2,x3])
model = sm.OLS(y,X).fit()

The result is an object that we have named model which has several attributes. The attribute model.params corresponds to the vector of parameters \([\beta_0,\beta_1,\beta_2,\beta_3]\).

model.params
array([ 0.15375002,  0.00059475, -0.0341029 ,  0.21569858])
model.rsquared
0.8302699983155556
model.rsquared_adj
0.793899283668889
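
The fitted model also provides a complete statistical report. The method model.summary() prints a table with the parameter estimates, their standard errors, \(t\)-statistics, \(p\)-values and more:

print(model.summary())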

Use the method model.predict() to predict the listing price of a 1000 sqft property with 2 bedrooms and 1 bathroom.

model.predict([1,1000,2,1])
array([0.89599539])

Note that we have to include the value \(x_0 = 1\), as if the model has the form \(y = \beta_0 x_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\).
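
Rather than building the column of ones by hand with np.column_stack, we can use the helper function sm.add_constant, which prepends it to the feature matrix. A minimal sketch that reproduces the same model:

# sm.add_constant prepends a column of ones to the design matrix
X_design = sm.add_constant(np.column_stack([x1,x2,x3]))
model2 = sm.OLS(y,X_design).fit()
print(model2.params)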

This is the same as computing explicitly:

model.params[0]*1 + model.params[1]*1000 + model.params[2]*2 + model.params[3]*1
0.895995393498798

See also

Check out the documentation to learn more about statsmodels.