Multiple Linear Regression: A Practical Reference
2025-02-08 — 5 min read
This page is a quick reference for multiple regression: model fitting, the covariance matrix and standard errors on the coefficients, and confidence and prediction intervals for new observations.
Multiple linear regression is the extension of simple linear regression to data involving two or more predictor variables. It fits models of the form $$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_m x_{im} + \varepsilon_i, $$ or in vector form $$ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}. $$ Writing this out explicitly, the vector form is the following: $$ \left[\begin{matrix}y_1\\y_2\\y_3\\\vdots\\y_n \end{matrix}\right] = \left[\begin{matrix} 1 & x_{11} & x_{12} & \cdots & x_{1m}\\ 1 & x_{21} & x_{22} & \cdots & x_{2m}\\ 1 & x_{31} & x_{32} & \cdots & x_{3m}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & x_{n2} & \cdots & x_{nm} \end{matrix}\right] \left[\begin{matrix}\beta_0\\\beta_1\\\beta_2\\\vdots\\\beta_m \end{matrix}\right] + \left[\begin{matrix}\varepsilon_1\\\varepsilon_2\\\varepsilon_3\\\vdots\\\varepsilon_n \end{matrix}\right]. $$
The model is linear in the coefficients, \(\beta_i\) — hence the name. The errors are assumed to be independent and identically distributed (IID) Gaussian random variables.
The coefficients may be estimated by least-squares regression, with solution $$ \mathbf{\hat\beta} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y}. $$
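As a concrete sketch in Python (using NumPy, with simulated data and all variable names illustrative), the design matrix and least-squares estimate might be computed as follows:

```python
import numpy as np

# Simulated data: n = 50 observations of m = 2 predictors (illustrative only).
rng = np.random.default_rng(0)
n, m = 50, 2
predictors = rng.normal(size=(n, m))
X = np.column_stack([np.ones(n), predictors])   # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -0.5])          # [beta_0, beta_1, beta_2]
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least-squares estimate. np.linalg.lstsq is numerically preferable to
# forming (X^T X)^{-1} explicitly, but solves the same problem.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to beta_true
```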
Here and throughout, a hat (^) denotes an estimated quantity. The standard errors on the estimated coefficients, \(\mathbf{\hat\beta}\), are given by $$ \text{se}(\hat\beta_i) = \sqrt{\hat\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}_{ii}} $$ (where \(ii\) indicates the diagonal terms), where $$ \hat\sigma^2 = \frac{\mathbf{\hat\varepsilon}^T\mathbf{\hat\varepsilon}}{n-p-1} $$ is the fitted model variance, and $$ \mathbf{\hat\varepsilon} = \mathbf{y} - \mathbf{X}\mathbf{\hat\beta} $$ are the residuals after fitting the model.
\(n\) is the total number of observations used in fitting the model, and \(p\) is the number of predictor variables. The extra \(-1\) in the denominator of \(\hat\sigma^2\) accounts for the intercept term, \(\beta_0\).
\(\hat\sigma^2\) is the unbiased estimator of \(\sigma^2\); dividing by \(n-p-1\) (the residual degrees of freedom) rather than \(n\) is the correction that makes it unbiased.
The matrix \(\hat\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\) is the covariance matrix of the estimated coefficients: its diagonal holds the variances of the \(\hat\beta_i\), and its off-diagonal terms hold the covariances between pairs of coefficient estimates.
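Continuing the simulated-data sketch above, the residuals, model variance, covariance matrix, and standard errors could be computed as:

```python
# Residuals, fitted model variance, and coefficient covariance matrix
# (continuing the sketch above; p is the number of predictors).
p = m
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p - 1)
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
se_beta = np.sqrt(np.diag(cov_beta))    # standard errors on beta_hat
```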
The estimated coefficients \(\hat\beta_i\) are considered significant if they are not consistent with \(0\). A p-value can be calculated for each parameter using a t-test with \(n-p-1\) degrees of freedom: $$ \text{t-value} = \frac{\hat\beta_i}{\text{se}(\hat\beta_i)};\ \ \ \ \ \text{d.o.f.}=n-p-1 $$ If the sample size is large (typically \(n\gtrsim 30\)), a z-test can be used instead, computed almost identically: $$ \text{z-score} = \frac{\hat\beta_i}{\text{se}(\hat\beta_i)} $$ The z-test does not require a degrees-of-freedom parameter.
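For example, continuing the sketch with scipy.stats, two-sided p-values for each coefficient might be obtained as:

```python
from scipy import stats

# Two-sided t-test of each coefficient against zero.
t_values = beta_hat / se_beta
p_values = 2 * stats.t.sf(np.abs(t_values), df=n - p - 1)
```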
Confidence and prediction intervals for a new observation
Assume you have fitted a model to some data and then make a new observation, \(\mathbf{x_0}\). The confidence interval is the expected range of the mean response of the model, whereas the prediction interval is the expected range of the response for any single new observation. Subtly different!
- Confidence interval: \( \mathbb{E}(\hat{y}|\mathbf{x_0})\)
- Prediction interval: \( \hat{y}|\mathbf{x_0}\)
Both the confidence interval and the prediction interval are centred on the value predicted by the model, \(\hat{y}\), with widths set by the corresponding standard errors: $$ \text{se}( \mathbb{E}(\hat{y}|\mathbf{x_0}) ) = \sqrt{\hat\sigma^2\cdot \mathbf{x_0}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x_0}} $$ $$ \text{se}(\hat{y}|\mathbf{x_0}) =\sqrt{\hat\sigma^2\cdot [1+\mathbf{x_0}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x_0}]} $$
The additional \(1+\) term in the latter accounts for the noise on the new observation itself; it vanishes from the former because the expected value of random noise is zero, so it does not contribute to the mean response.
Overall, the confidence interval (CI) and prediction interval (PI) are therefore given by $$ \text{CI:}\ \ \ \ \hat{y} \pm t_{(\alpha/2,\ n-p-1)}\cdot\text{se}( \mathbb{E}(\hat{y}|\mathbf{x_0}) ) $$ $$ \text{PI:}\ \ \ \ \hat{y} \pm t_{(\alpha/2,\ n-p-1)}\cdot\text{se}( \hat{y}|\mathbf{x_0} ) $$
\( t_{(\alpha/2,\ n-p-1)} \) is the t-value corresponding to the \(100(1-\alpha)\)% confidence interval for a t-distribution with \(n-p-1\) degrees of freedom. Again, for large \(n\), a z-value can be used instead (note: in both cases this is the value for a two-tailed distribution).
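A minimal sketch of both intervals, continuing the simulated example above (the new observation x0 is made up, and includes a leading 1 matching the intercept column of X):

```python
# Intervals for a new observation x0 (values are illustrative).
x0 = np.array([1.0, 0.5, -1.2])
y0_hat = x0 @ beta_hat

XtX_inv = np.linalg.inv(X.T @ X)
h = x0 @ XtX_inv @ x0                      # x0^T (X^T X)^{-1} x0
se_mean = np.sqrt(sigma2_hat * h)          # se of the mean response
se_pred = np.sqrt(sigma2_hat * (1 + h))    # se of a single new response

alpha = 0.05                               # 95% intervals
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p - 1)
ci = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)
pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
```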
Typical confidence intervals, z-values, and associated p-values:
| Confidence | z-value | p-value |
|---|---|---|
| 90% | 1.644854 | 0.1 |
| 95% | 1.959964 | 0.05 |
| 99% | 2.575829 | 0.01 |
| 99.9% | 3.290527 | 0.001 |
Since t-values depend on the degrees of freedom, they are typically looked up in a table; many such tables are available online.
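Alternatively, both z- and t-values can be computed directly, e.g. with scipy.stats (the 20 degrees of freedom below is just an illustrative choice):

```python
from scipy import stats

# Critical values for a two-tailed test: z reproduces the table above,
# while t additionally needs the degrees of freedom.
for conf in (0.90, 0.95, 0.99, 0.999):
    alpha = 1 - conf
    print(f"{conf:.1%}:  z = {stats.norm.ppf(1 - alpha / 2):.6f},  "
          f"t(20 d.o.f.) = {stats.t.ppf(1 - alpha / 2, df=20):.6f}")
```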