Residuals and $R^2$#

import numpy as np
import matplotlib.pyplot as plt
import scipy.linalg as la
import pandas as pd

Residuals#

Let $(x_1,y_1),\dots,(x_N,y_N)$ be a data set of $N$ observations where each $\mathbf{x}_i = (x_{i,1},\dots,x_{i,p})$ is a vector of values of the independent variables $x_1,\dots,x_p$. Consider a linear regression model

$$ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon $$

where $\varepsilon$ is the residual. Use the notation $\varepsilon_i$ for the residual of the $i$th observation in the data set:

$$ \varepsilon_i = y_i - (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}) $$

The parameters $\beta_0,\beta_1,\dots,\beta_p$ are computed to minimize the residual sum of squares

$$ RSS = \sum_{i=1}^N \varepsilon_i^2 $$

Once we compute the parameters $\beta_0,\dots,\beta_p$ which best fit the data, we would like to validate the model. We present two methods: the coefficient of determination $R^2$ (and $R^2_{adj}$), and graphical regression diagnostics.
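For example, here is a minimal sketch of this computation on a small synthetic data set (the data and values below are made up for illustration): build the matrix $X$ with a column of ones and a column of $x$ values, and solve the normal equations $X^T X \boldsymbol{\beta} = X^T y$.

# Synthetic data for illustration: y = 1 + 2x plus normally distributed noise
rng = np.random.default_rng(0)
x = np.linspace(0,1,20)
y = 1 + 2*x + 0.1*rng.standard_normal(20)

# Solve the normal equations X^T X beta = X^T y
X = np.column_stack([np.ones(len(x)),x])
beta = la.solve(X.T@X,X.T@y)

# Residual sum of squares for the fitted parameters
residuals = y - X@beta
RSS = np.sum(residuals**2)
print('beta =',beta,', RSS =',RSS)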

Coefficient of Determination $R^2$#

The quantity $RSS$ measures the total error of the linear regression model, but the value on its own does not tell us how well the model fits the data. The coefficient of determination is given by

$$ R^2 = 1 - \frac{\sum_{i=1}^N \varepsilon_i^2}{\sum_{i=1}^N (y_i - \bar{y})^2} $$

The coefficient $R^2$ compares the error $RSS$ to the error of a model consisting of simply the mean value

$$ \bar{y} = \frac{1}{N} \sum_{i=1}^N y_i $$

The value $R^2$ is near 1 when the model fits the data very closely, and near 0 when the linear regression model has nearly the same error as simply drawing a horizontal line through the data. The definition of $R^2$ applies to any linear regression model; however, the value $R^2$ usually increases with the number of features $p$ in the model even when the extra features add little. Therefore we define the adjusted coefficient $R^2_{adj}$ as

$$ R^2_{adj} = 1 - \frac{N-1}{N-p-1} \cdot \frac{\sum_{i=1}^N \varepsilon_i^2}{\sum_{i=1}^N (y_i - \bar{y})^2} $$

The factor $\frac{N-1}{N-p-1}$ takes into account the number of features in the model.
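For instance, here is a small helper function (a sketch; the name rsquared and its signature are our own) which computes both coefficients from the observations $y$, the fitted values, and the number of features $p$:

def rsquared(y,yfit,p):
    # Residuals of the fitted model
    residuals = y - yfit
    # Compare RSS to the total sum of squares about the mean
    ybar = np.mean(y)
    R2 = 1 - np.sum(residuals**2)/np.sum((y - ybar)**2)
    # Adjust for the number of features p
    N = len(y)
    R2adj = 1 - (N - 1)/(N - p - 1)*(1 - R2)
    return R2,R2adj

After fitting any of the models below, we could call rsquared(y,X@beta,p) to get both values at once.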

See also

Check out Wikipedia: Coefficient of Determination for more information.

Graphical Regression Diagnostics#

The linear regression model is a good representation of the relationship between $y$ and $x_1,\dots,x_p$ if the residual $\varepsilon$ satisfies the following properties:

  1. The residuals $\varepsilon_i$ are normally distributed with mean 0.

  2. The distribution of the residuals $\varepsilon_i$ does not depend on $x_1,\dots,x_p$.

  3. The distribution of the residuals $\varepsilon_i$ does not depend on the index $i$ (or time $t_i$).

We can verify these properties graphically by producing the following figures:

  1. Plot the histogram of the residuals $\varepsilon_i$.

  2. Plot $\varepsilon_i$ versus $x_{i,j}$ for each feature $x_j$.

  3. Plot $\varepsilon_i$ versus the index $i$ (or time $t_i$).

Plot 1 should be a familiar bell curve centered at 0, and plots 2 and 3 should be random scatter plots with mean 0 and constant variance across the horizontal axis.
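For example, here is a sketch of the three diagnostic plots for a synthetic data set where the model assumptions hold by construction (all data below is made up for illustration):

# Synthetic data: y = 1 + 2x with normally distributed errors
rng = np.random.default_rng(1)
x = np.linspace(0,1,200)
y = 1 + 2*x + 0.5*rng.standard_normal(200)

# Fit the model and compute the residuals
X = np.column_stack([np.ones(len(x)),x])
beta = la.solve(X.T@X,X.T@y)
residuals = y - X@beta

# Plot 1: histogram, Plot 2: residuals vs x, Plot 3: residuals vs index i
plt.subplot(1,3,1), plt.hist(residuals,bins=20), plt.title('Histogram')
plt.subplot(1,3,2), plt.plot(x,residuals,'.'), plt.title('vs $x$')
plt.subplot(1,3,3), plt.plot(residuals,'.'), plt.title('vs index $i$')
plt.tight_layout()
plt.show()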

See also

Check out Wikipedia: Regression Validation for more information.

Example: REALTOR.ca#

The data below was taken from REALTOR.ca for listings in the Kitsilano neighborhood of Vancouver. The variables $x_1, x_2, x_3$ correspond to the size (in square feet, sqft), the number of bedrooms and the number of bathrooms, respectively. The dependent variable $y$ is the listing price (in millions of dollars).

x1 = np.array([1050,943,1557,662,829,724,482,702,733,637,819,802,771,779,823,924,1088,1018])
x2 = np.array([2,2,2,1,2,1,1,1,2,1,2,2,1,2,2,2,2,2])
x3 = np.array([2,2,2,1,2,1,1,1,1,1,1,1,1,2,2,2,2,2])
y  = np.array([1.39,1.079,1.398,0.859,1.098,0.619,0.625,0.639,0.728888,0.778,0.749888,0.825,0.858,0.899,0.8999,0.999888,1.099,1.149])

Let’s construct different linear regression models and compare the $R^2$ and $R^2_{adj}$ values. We’ll start with the model which includes only the size:

$$ y = \beta_0 + \beta_1 x_1 + \varepsilon $$

Compute the parameters $\beta_0,\beta_1$ and plot the result.

N = len(y)
X = np.column_stack([np.ones(N),x1])
beta = la.solve(X.T@X,X.T@y)
x1s = np.linspace(500,1600,50)
ys = beta[0] + beta[1]*x1s

plt.plot(x1,y,'.')
plt.plot(x1s,ys)
plt.title('Listing Price on REALTOR.ca')
plt.xlabel('Size (sqft)'), plt.ylabel('Price (Millions $)'), plt.grid(True)
plt.show()

Compute the residuals and the coefficient $R^2$.

residuals = y - (beta[0] + beta[1]*x1)
ybar = np.mean(y)
R2x1 = 1 - np.sum(residuals**2)/np.sum((y - ybar)**2)
print('R2 =',R2x1)
R2 = 0.7235562224127696
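For a fair comparison with the larger models below, we can also compute $R^2_{adj}$ for this single-feature model (here $p = 1$):

R2adjx1 = 1 - (1 - R2x1)*(N - 1)/(N - 1 - 1)
print('R2adj =',R2adjx1)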

Repeat the procedure for the linear regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$.

X = np.column_stack([np.ones(N),x1,x2])
beta = la.solve(X.T@X,X.T@y)
residuals = y - (beta[0] + beta[1]*x1 + beta[2]*x2)
R2x1x2 = 1 - np.sum(residuals**2)/np.sum((y - ybar)**2)
R2adjx1x2 = 1 - (1 - R2x1x2)*(N - 1)/(N - 2 - 1)
print('R2 =',R2x1x2,'R2adj =',R2adjx1x2)
R2 = 0.7400376017871559 R2adj = 0.7053759486921101

Repeat the procedure again for the linear regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$.

X = np.column_stack([np.ones(N),x1,x2,x3])
beta = la.solve(X.T@X,X.T@y)
residuals = y - (beta[0] + beta[1]*x1 + beta[2]*x2 + beta[3]*x3)
R2x1x2x3 = 1 - np.sum(residuals**2)/np.sum((y - ybar)**2)
R2adjx1x2x3 = 1 - (1 - R2x1x2x3)*(N - 1)/(N - 3 - 1)
print('R2 =',R2x1x2x3,'R2adj =',R2adjx1x2x3)
R2 = 0.8302699983155557 R2adj = 0.7938992836688891

Both $R^2$ and $R^2_{adj}$ increase when we include the number of bedrooms and bathrooms!

Example: Temperature#

Let’s again look at daily average temperature in Vancouver (see vancouver.weatherstats.ca).

df = pd.read_csv('temperature.csv')
df.head()
day month year dayofyear avg_temperature
0 13 4 2023 103 7.10
1 12 4 2023 102 5.19
2 11 4 2023 101 8.00
3 10 4 2023 100 7.69
4 9 4 2023 99 9.30

Let $T$ be the temperature and let $d$ be the day of the year. Define a regression model of the form

$$ T = \beta_0 + \beta_1 \cos(2\pi d/365) + \beta_2 \sin(2\pi d/365) + \varepsilon $$

Note that if we rewrite the variables as $y = T$, $x_1 = \cos(2\pi d/365)$ and $x_2 = \sin(2\pi d/365)$, then we see that this is indeed a linear regression model

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon $$

Construct the corresponding data matrix $X$ and compute the coefficient vector $\boldsymbol{\beta}$.

N = len(df)
d = df['dayofyear']
T = df['avg_temperature']
X = np.column_stack([np.ones(N),np.cos(2*np.pi*d/365),np.sin(2*np.pi*d/365)])
beta = la.solve(X.T@X,X.T@T)

ds = np.linspace(0,365,500)
x1 = np.cos(2*np.pi*ds/365)
x2 = np.sin(2*np.pi*ds/365)
Ts = beta[0] + beta[1]*x1 + beta[2]*x2

plt.plot(df['dayofyear'],df['avg_temperature'],'.',alpha=0.1,lw=0)
plt.plot(ds,Ts,'r'), plt.grid(True)

plt.title('Daily Average Temperature 1995-2023')
plt.xlabel('Day of the Year'), plt.ylabel('Temperature (Celsius)')
plt.show()

Compute the residuals and plot the distribution:

f = lambda d,beta: beta[0] + beta[1]*np.cos(2*np.pi*d/365) + beta[2]*np.sin(2*np.pi*d/365)
residuals = T - f(d,beta)
plt.hist(residuals,bins=50,density=True), plt.grid(True)
plt.title('Distribution of Residuals')
plt.xlabel('Residual'), plt.ylabel('Relative Frequency')
plt.show()

The distribution definitely looks like a normal distribution. Compute the mean and variance:

mean = np.mean(residuals)
var = np.var(residuals)
print('mean =',mean,', variance =',var)
mean = 2.8549607122840824e-14 , variance = 6.773167782960342
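As a visual check (a sketch; scipy.stats is not imported at the top of this page), we could overlay the normal density with this mean and variance on the histogram of residuals:

from scipy.stats import norm

# Histogram of residuals with the fitted normal density on top
plt.hist(residuals,bins=50,density=True,alpha=0.5), plt.grid(True)
ts = np.linspace(residuals.min(),residuals.max(),200)
plt.plot(ts,norm.pdf(ts,loc=mean,scale=np.sqrt(var)),'r')
plt.title('Residuals with Fitted Normal Density')
plt.xlabel('Residual'), plt.ylabel('Relative Frequency')
plt.show()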

Compute the coefficients $R^2$ and $R^2_{adj}$:

Tbar = np.mean(T)
R2 = 1 - np.sum(residuals**2)/np.sum((T - Tbar)**2)
R2adj = 1 - (N-1)/(N-2-1)*np.sum(residuals**2)/np.sum((T - Tbar)**2)
print('R2 =',R2,'R2adj =',R2adj)
R2 = 0.8035962434723157 R2adj = 0.8035569509332485

Next, let’s plot the residuals $\varepsilon_i$ versus each of the input variables $x_1$ and $x_2$.

plt.plot(np.cos(2*np.pi*d/365),residuals,'.',alpha=0.1), plt.grid(True)
plt.xlabel('$x_1$'), plt.ylabel('Residual')
plt.show()
plt.plot(np.sin(2*np.pi*d/365),residuals,'.',alpha=0.1), plt.grid(True)
plt.xlabel('$x_2$'), plt.ylabel('Residual')
plt.show()

Both plots suggest that the distribution of the residuals does not depend on $x_1$ or $x_2$.

Now let’s plot $\varepsilon_i$ versus the day of the year $d$.

plt.plot(d,residuals,'.',alpha=0.1), plt.grid(True)
plt.xlabel('Day of the Year'), plt.ylabel('Residual')
plt.show()

We can see there is some variation in the error! This means that our model is missing something. It looks like there is a variation with a period of 6 months. Let’s build that into the model!

Example: Temperature with Biannual Variation#

We saw in the previous example that the linear model doesn’t quite fit the data. Let’s modify our linear regression model to include periodic functions with periods of 1 year (365 days) and 6 months (365/2 days):

$$ T = \beta_0 + \beta_1 \cos(2\pi d/365) + \beta_2 \sin(2\pi d/365) + \beta_3 \cos(4\pi d/365) + \beta_4 \sin(4\pi d/365) + \varepsilon $$

Construct the corresponding data matrix $X$ and compute the coefficient vector $\boldsymbol{\beta}$.

N = len(df)
d = df['dayofyear']
T = df['avg_temperature']
X = np.column_stack([np.ones(N),
                     np.cos(2*np.pi*d/365),np.sin(2*np.pi*d/365),
                     np.cos(4*np.pi*d/365),np.sin(4*np.pi*d/365)])
beta = la.solve(X.T@X,X.T@T)

ds = np.linspace(0,365,500)
x1 = np.cos(2*np.pi*ds/365)
x2 = np.sin(2*np.pi*ds/365)
x3 = np.cos(4*np.pi*ds/365)
x4 = np.sin(4*np.pi*ds/365)
Ts = beta[0] + beta[1]*x1 + beta[2]*x2 + beta[3]*x3 + beta[4]*x4

plt.plot(df['dayofyear'],df['avg_temperature'],'.',alpha=0.1,lw=0)
plt.plot(ds,Ts,'r'), plt.grid(True)
plt.title('Daily Average Temperature 1995-2023')
plt.xlabel('Day of the Year'), plt.ylabel('Temperature (Celsius)')
plt.show()

Analyze the model. Plot the distribution of residuals:

f = lambda d,beta: beta[0] + beta[1]*np.cos(2*np.pi*d/365) + beta[2]*np.sin(2*np.pi*d/365) + beta[3]*np.cos(4*np.pi*d/365) + beta[4]*np.sin(4*np.pi*d/365)
residuals = T - f(d,beta)
plt.hist(residuals,bins=50,density=True), plt.grid(True)
plt.title('Distribution of Residuals')
plt.xlabel('Residual'), plt.ylabel('Relative Frequency')
plt.show()

The distribution definitely looks like a normal distribution. Compute the mean and variance:

mean = np.mean(residuals)
var = np.var(residuals)
print('mean =',mean,', variance =',var)
mean = 2.6413715659145965e-14 , variance = 6.20105106178613

Compute the coefficients $R^2$ and $R^2_{adj}$:

Tbar = np.mean(T)
R2 = 1 - np.sum(residuals**2)/np.sum((T - Tbar)**2)
R2adj = 1 - (N-1)/(N-4-1)*np.sum(residuals**2)/np.sum((T - Tbar)**2)
print('R2 =',R2,'R2adj =',R2adj)
R2 = 0.8201860987382082 R2adj = 0.8201141371969329

Both coefficients increase when we include the variables $x_3$ and $x_4$.

Next, let’s plot the residuals $\varepsilon_i$ versus each of the input variables $x_1,x_2,x_3,x_4$.

plt.subplot(2,2,1)
plt.plot(np.cos(2*np.pi*d/365),residuals,'.',alpha=0.1), plt.grid(True)
plt.xlabel('$x_1$'), plt.ylabel('Residual')

plt.subplot(2,2,2)
plt.plot(np.sin(2*np.pi*d/365),residuals,'.',alpha=0.1), plt.grid(True)
plt.xlabel('$x_2$'), plt.ylabel('Residual')

plt.subplot(2,2,3)
plt.plot(np.cos(4*np.pi*d/365),residuals,'.',alpha=0.1), plt.grid(True)
plt.xlabel('$x_3$'), plt.ylabel('Residual')

plt.subplot(2,2,4)
plt.plot(np.sin(4*np.pi*d/365),residuals,'.',alpha=0.1), plt.grid(True)
plt.xlabel('$x_4$'), plt.ylabel('Residual')

plt.tight_layout()
plt.show()

Finally, plot $\varepsilon_i$ versus the day of the year $d$.

plt.plot(d,residuals,'.',alpha=0.1), plt.grid(True)
plt.xlabel('Day of the Year'), plt.ylabel('Residual')
plt.show()

All the plots suggest the mean of the residuals is 0 independent of each variable and time; however, the variance seems to increase in the winter. How can we account for this?
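One way to quantify the effect (a sketch; it uses the month column of the data frame loaded above) is to plot the standard deviation of the residuals grouped by month:

# Standard deviation of the residuals for each month of the year
monthly_std = residuals.groupby(df['month']).std()
plt.plot(monthly_std.index,monthly_std.values,'.-'), plt.grid(True)
plt.title('Residual Standard Deviation by Month')
plt.xlabel('Month'), plt.ylabel('Std of Residuals')
plt.show()

A larger spread in the winter months would confirm that the error variance changes with the season.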