ANOVA for Regression (2024)

Analysis of Variance (ANOVA) consists of calculations that provide information about levels of variability within a regression model and form a basis for tests of significance. The basic regression line concept, DATA = FIT + RESIDUAL, is rewritten as follows:
$(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$,
where $\bar{y}$ is the sample mean of the responses and $\hat{y}_i$ is the fitted value for observation i.
The first term is the total variation in the response y, the second term is the variation in mean response, and the third term is the residual value. Squaring each of these terms and adding over all of the n observations gives the equation

$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$.
This equation may also be written as SST = SSM + SSE, where SS is notation for sum of squares and T, M, and E are notation for total, model, and error, respectively.
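
The decomposition can be checked numerically. The following is a minimal sketch, assuming a least-squares fit of a straight line; the function name `sums_of_squares` and the data values are made up for illustration and do not come from the text.

```python
def sums_of_squares(x, y):
    """Return (SST, SSM, SSE) for the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Least-squares slope and intercept.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
    ssm = sum((fi - y_bar) ** 2 for fi in fitted)            # model
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # error
    return sst, ssm, sse


# Made-up illustrative data (an assumption, not from the source).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
sst, ssm, sse = sums_of_squares(x, y)
print(sst, ssm + sse)  # the two values agree up to floating-point rounding
```

The identity holds exactly only for the least-squares fit; for an arbitrary line the cross-product term does not vanish.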

The square of the sample correlation is equal to the ratio of the model sum of squares to the total sum of squares: r² = SSM/SST.
This formalizes the interpretation of r² as the fraction of variability in the data explained by the regression model.
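
As a quick check of this identity, the sketch below computes the sample correlation r and the ratio SSM/SST separately and compares them. The helper name `r_and_ss_ratio` and the data are hypothetical, chosen only for illustration.

```python
import math


def r_and_ss_ratio(x, y):
    """Return (r, SSM/SST) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)  # this is SST
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)            # sample correlation
    b1 = sxy / sxx                            # least-squares slope
    b0 = y_bar - b1 * x_bar                   # least-squares intercept
    ssm = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)
    return r, ssm / syy


# Made-up illustrative data (an assumption, not from the source).
r, ratio = r_and_ss_ratio([1, 2, 3, 4, 5, 6], [3.0, 2.5, 4.1, 4.0, 5.6, 5.2])
print(r ** 2, ratio)
```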

The sample variance $s_y^2$ is equal to $\sum (y_i - \bar{y})^2/(n - 1)$ = SST/DFT, the total sum of squares divided by the total degrees of freedom (DFT).
For simple linear regression, the MSM (mean square model) = $\sum (\hat{y}_i - \bar{y})^2/1$ = SSM/DFM, since the simple linear regression model has one explanatory variable x.
The corresponding MSE (mean square error) = $\sum (y_i - \hat{y}_i)^2/(n - 2)$ = SSE/DFE, the estimate of the variance about the population regression line ($\sigma^2$).

ANOVA calculations are displayed in an analysis of variance table, which has the following format for simple linear regression:

Source    Degrees of Freedom    Sum of Squares                    Mean Square    F
Model     1                     $\sum (\hat{y}_i - \bar{y})^2$    SSM/DFM        MSM/MSE
Error     n - 2                 $\sum (y_i - \hat{y}_i)^2$        SSE/DFE
Total     n - 1                 $\sum (y_i - \bar{y})^2$          SST/DFT

The "F" column provides a statistic for testing the hypothesis that $\beta_1 \neq 0$ against the null hypothesis that $\beta_1 = 0$. The test statistic is the ratio MSM/MSE, the mean square model term divided by the mean square error term. When the MSM term is large relative to the MSE term, then the ratio is large and there is evidence against the null hypothesis.

For simple linear regression, the statistic MSM/MSE has an F distribution with degrees of freedom (DFM, DFE) = (1, n - 2).
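
To illustrate how a P-value follows from this distribution, here is a rough sketch using only the Python standard library. It relies on the standard fact that an F(1, DFE) variable is the square of a t variable with DFE degrees of freedom, and it integrates the t density numerically with Simpson's rule. The function names are hypothetical, and a statistics package would normally be used instead of this hand-rolled integration.

```python
import math


def t_density(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def f_test_p_value(f_stat, dfe, steps=2000):
    """Approximate P(F(1, dfe) >= f_stat), using F(1, dfe) = t(dfe)^2.

    steps must be even (composite Simpson's rule).
    """
    upper = math.sqrt(f_stat)
    h = upper / steps
    # Simpson's rule for the integral of the t density over [0, upper].
    total = t_density(0.0, dfe) + t_density(upper, dfe)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, dfe)
    central_half = total * h / 3
    # Two-sided t tail equals the upper F tail; clamp round-off below zero.
    return max(0.0, 1.0 - 2.0 * central_half)


# 2.228139 is the familiar two-sided 5% critical value of t with 10 df.
p_check = f_test_p_value(2.228139 ** 2, dfe=10)
print(p_check)
```

Because the result is computed as one minus an integral, very small P-values are only resolved to about machine precision, which is enough to confirm statements such as "less than 0.001".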

Example

The dataset "Healthy Breakfast" contains, among other variables, the Consumer Reports ratings of 77 cereals and the number of grams of sugar contained in each serving. (Data source: Free publication available in many grocery stores. Dataset available through the Statlib Data and Story Library (DASL).)

Considering "Sugars" as the explanatory variable and "Rating" as the response variable generated the following regression line:
Rating = 59.3 - 2.40 Sugars (see Inference in Linear Regression for more information about this example).

The "Analysis of Variance" portion of the MINITAB output is shown below. The degrees of freedom are provided in the "DF" column, the calculated sum of squares terms are provided in the "SS" column, and the mean square terms are provided in the "MS" column.

Analysis of Variance

Source        DF         SS         MS         F        P
Regression     1     8654.7     8654.7    102.35    0.000
Error         75     6342.1       84.6
Total         76    14996.8

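In the ANOVA table for the "Healthy Breakfast" example, the F statistic is equal to MSM/MSE = 102.35. (Dividing the rounded table entries, 8654.7/84.6, gives about 102.30; the reported value is consistent with using the unrounded MSE, 6342.1/75 ≈ 84.56.) The distribution is F(1, 75), and the probability of observing a value greater than or equal to 102.35 is less than 0.001. There is strong evidence that $\beta_1$ is not equal to zero.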

The r² term is equal to 0.577, indicating that 57.7% of the variability in the response is explained by the explanatory variable.
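
The table entries can be reproduced from the reported sums of squares alone. The short sketch below uses only the numbers printed in the MINITAB output; the variable names are illustrative choices.

```python
# Values reported in the MINITAB "Analysis of Variance" output.
n = 77                                   # number of cereals
ssm, sse, sst = 8654.7, 6342.1, 14996.8  # model, error, total sums of squares

dfm = 1          # one explanatory variable (Sugars)
dfe = n - 2      # error degrees of freedom for simple linear regression

msm = ssm / dfm
mse = sse / dfe
f_stat = msm / mse
r_squared = ssm / sst

print(round(mse, 1), round(f_stat, 2), round(r_squared, 3))
```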

ANOVA for Multiple Linear Regression

Multiple linear regression attempts to fit a regression line for a response variable using more than one explanatory variable. The ANOVA calculations for multiple regression are nearly identical to the calculations for simple linear regression, except that the degrees of freedom are adjusted to reflect the number of explanatory variables included in the model.

For p explanatory variables, the model degrees of freedom (DFM) are equal to p, the error degrees of freedom (DFE) are equal to (n - p - 1), and the total degrees of freedom (DFT) are equal to (n - 1), the sum of DFM and DFE.
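
The bookkeeping for the degrees of freedom can be written as a small helper. This is only a sketch; the function name and the input check are assumptions added for illustration.

```python
def regression_degrees_of_freedom(n, p):
    """Return (DFM, DFE, DFT) for n observations and p explanatory variables."""
    if n <= p + 1:
        # Assumed guard: the error term needs at least one degree of freedom.
        raise ValueError("need n > p + 1 observations")
    dfm = p
    dfe = n - p - 1
    dft = n - 1
    return dfm, dfe, dft


print(regression_degrees_of_freedom(77, 1))  # simple linear regression
print(regression_degrees_of_freedom(77, 2))  # two explanatory variables
```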

Example

The "Healthy Breakfast" dataset contains, among other variables, the Consumer Reports ratings of 77 cereals, the number of grams of sugar contained in each serving, and the number of grams of fat contained in each serving. (Data source: Free publication available in many grocery stores. Dataset available through the Statlib Data and Story Library (DASL).)

As a simple linear regression model, we previously considered "Sugars" as the explanatory variable and "Rating" as the response variable. How do the ANOVA results change when "Fat" is added as a second explanatory variable?

The regression line generated by the inclusion of "Sugars" and "Fat" is the following:
Rating = 61.1 - 2.21 Sugars - 3.07 Fat (see Multiple Linear Regression for more information about this example).

Analysis of Variance

Source        DF         SS         MS         F        P
Regression     2     9325.3     4662.6     60.84    0.000
Error         74     5671.5       76.6
Total         76    14996.8

Source        DF     Seq SS
Sugars         1     8654.7
Fat            1      670.5

The mean square error term is smaller with "Fat" included (76.6 versus 84.6), indicating less deviation between the observed and fitted values. The P-value for the F test statistic is less than 0.001, providing strong evidence against the null hypothesis. The squared multiple correlation R² = SSM/SST = 9325.3/14996.8 = 0.622, indicating that 62.2% of the variability in the "Rating" variable is explained by the "Sugars" and "Fat" variables. This is an improvement over the simple linear model including only the "Sugars" variable.
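
The comparison between the two models can be reproduced from the reported sums of squares. The helper `anova_summary` below is a hypothetical name used for this sketch; its inputs are the SS values printed in the two MINITAB tables.

```python
def anova_summary(ssm, sse, n, p):
    """Mean squares, F statistic, and R^2 from the model and error sums of squares."""
    dfm = p
    dfe = n - p - 1
    msm = ssm / dfm
    mse = sse / dfe
    return {"MSM": msm, "MSE": mse, "F": msm / mse, "R2": ssm / (ssm + sse)}


# Sums of squares as reported for the "Healthy Breakfast" models (n = 77).
sugars_only = anova_summary(8654.7, 6342.1, n=77, p=1)
sugars_and_fat = anova_summary(9325.3, 5671.5, n=77, p=2)

for name, result in [("Sugars", sugars_only), ("Sugars + Fat", sugars_and_fat)]:
    print(name, round(result["MSE"], 1), round(result["F"], 2), round(result["R2"], 3))
```

Note that R² never decreases when an explanatory variable is added, so the drop in MSE, which accounts for the lost degree of freedom, is the more informative part of the comparison.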
