lectur17 (2024)

Lecture 17
Simple Regression

Prediction
When I discussed correlation, I noted that a perfect correlation between two variables produces a line when plotted in a bivariate scatterplot.

In this figure, every increase in the value of X is associated with an increase in Y without any exceptions. If we wanted to predict values of Y based on a certain value of X, we would have no problem in doing so with this figure. A value of 2 for X should be associated with a value of 10 on the Y variable, as indicated by this graph. Many times in scientific studies, we are interested in predicting values of Y based on X. For example, we may be interested in the blood cholesterol level of patients based on their daily intake of fat. We would be predicting values of cholesterol based on how much fat is in the diet. In the above picture, X is a perfect predictor of Y. We know exactly what the value of Y will be for a given value of X, because they all fall on a perfect line.

Error of Prediction--the "Unexplained Variance"
Usually, prediction won't be so perfect. Most often, not all the points will fall perfectly on the line. There will be some error in the prediction. For each value of X, we know the approximate value of Y but not the exact value.

In the above figure, blood cholesterol and fat intake are related, but we are not able to predict cholesterol levels perfectly based on the fat intake. The points in the graph that fall off the line are not perfectly predicted. We can look at how much each point falls off the line by drawing a little line straight from the point to the line as shown below (sorry, my little lines are not perfectly aligned with the points).

If we wanted to summarize how much error in prediction we had overall, we could sum up the distances (or deviations) represented by all those little lines. The middle line is called the regression line. Summing up the deviations of the points gives us an overall idea of how much error in prediction there is. Unfortunately, this method does not work very well. If we choose a line that goes exactly through the middle of the points, about half of the points that fall off the line should be below the line and about half should be above. Some of the deviations will be negative and some will be positive, and thus the sum of all of them will equal 0. Remember that when we added all the deviations of the scores from the mean, we also got 0? So, we pick a similar solution to the problem here. If we want to summarize the overall error in prediction, we sum up all the squared deviations from the regression line.

To calculate the deviations from the line we need a little notation. In regression analyses, we label each point by its $x$ and $y$ scores. We try to predict the scores on the Y variable, given certain values of the X variable. So, we are primarily concerned with the $y$ scores. The (imaginary) scores that fall exactly on the regression line are called the predicted scores, and there is a predicted score for each value of X. The predicted scores are represented by $\hat{y}$ (sometimes referred to as "y-hat", because of the little hat; or as "y-predict"). So the sum of the squared deviations from the predicted scores is represented by

$$\sum (y - \hat{y})^2$$

in which the predicted score (the point on the line) is subtracted from each y score and the difference is then squared. Then all the squared deviations are summed up. Notice that this is a type of variation. It is the unexplained variation in the prediction of y when x is used to predict the y scores. Some books refer to this as the "sum of squares residual" because it is a measure of the residual variation (like a "residue" left over). Whatever we decide to call it, the sum of the squared deviations from the regression line (or the predicted points) is a summary of the error of prediction.
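As a rough illustration of why the deviations are squared before they are summed, here is a minimal Python sketch. The five data points are made up for illustration, and the line with intercept 2.2 and slope 0.6 is simply taken as given (it happens to be the least squares line for these particular points); none of this comes from the cholesterol example.

```python
# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Intercept and slope of a line through the middle of these points
# (assumed here rather than computed).
a, b = 2.2, 0.6

# Predicted score (y-hat) for each x, and each point's deviation from the line.
predicted = [a + b * xi for xi in x]
deviations = [yi - y_hat for yi, y_hat in zip(y, predicted)]

# Signed deviations cancel out; squared deviations do not.
sum_of_deviations = sum(deviations)
sum_of_squared_deviations = sum(d ** 2 for d in deviations)

print("sum of deviations:", round(sum_of_deviations, 10))
print("sum of squared deviations:", round(sum_of_squared_deviations, 10))
```

The signed deviations cancel to zero (apart from rounding), while the squared deviations add up to about 2.4, which is the unexplained variation for this made-up data.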

The Regression Line and Ordinary Least Squares Criterion
Conversely, if we want to draw a line that runs right through the middle of the points, we would choose the line that had the smallest deviations of the points from the line. Actually, we would use the smallest squared deviations. This criterion for the best line is called the "Least Squares" criterion or Ordinary Least Squares (OLS).

We use the least squares criterion to pick the regression line. The regression line is sometimes called the "line of best fit" because it is the line that fits best when drawn through the points. It is the line that minimizes the sum of the squared distances of the actual scores from the predicted scores.
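The least squares idea can be checked by brute force on a small example. The sketch below uses made-up data and a helper named `sum_squared_error` (a hypothetical name chosen for this illustration). It assumes that intercept 2.2 and slope 0.6 are the least squares values for these points, then compares that line against several nearby lines.

```python
def sum_squared_error(a, b, x, y):
    """Sum of squared deviations of the y scores from the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Squared error for the least squares line (values assumed for this data).
ols_error = sum_squared_error(2.2, 0.6, x, y)

# Squared error for other lines, made by nudging the intercept and slope.
other_errors = {
    (2.2 + da, 0.6 + db): sum_squared_error(2.2 + da, 0.6 + db, x, y)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.2, 0.0, 0.2)
    if (da, db) != (0.0, 0.0)
}

for (a, b), error in other_errors.items():
    print(f"a={a:.1f}, b={b:.1f}: squared error {error:.2f}")
print(f"least squares line: squared error {ols_error:.2f}")
```

Every nudged line ends up with a larger sum of squared deviations than the least squares line, which is what "best fit" means under this criterion.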

Sum of Squares Regression--the Explained Variance
When we studied correlation, we saw that a linear relationship between two variables could be seen as a stream of points when plotted. When the correlation between x and y equals zero (no tendency for y to be large, or small, when x is large), there is just a group of random points. The graphs below show an example of each.

On the left, there is no relationship between fat intake and cholesterol. On the right, there is a relationship. The regression line is flat when there is no ability to predict whatsoever. The regression line is sloped at an angle when there is a relationship.

The extent to which the regression line is sloped, however, represents the degree to which we are able to predict the y scores with the x scores. When I discussed the mean, I said the mean offered a way to summarize a group of scores. The scores are most likely to be close to the mean, because it is the middle. Therefore, if you wanted to guess at any one of the scores, the best guess would be the mean. When there is no relationship between x and y, the values of x are of no help in predicting the y scores, so we might as well use the mean of y, or $\bar{y}$, to predict y scores. On the left-hand side of the above figure, there is a flat line drawn at the mean. The best way to predict the y scores is with the mean of y. To the extent that there is a relationship between x and y, there will be some slope in the line of prediction. So, the degree to which the regression line is sloped compared to the flat line at the mean represents the amount we can predict y scores.
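To make the "best guess is the mean" point concrete, here is a small Python sketch with made-up y scores. It tries a handful of constant guesses and measures each one by its sum of squared deviations; the helper name `squared_error_of_guess` is hypothetical.

```python
from statistics import mean

# Hypothetical y scores, made up for illustration only.
y = [2, 4, 5, 4, 5]
y_bar = mean(y)

def squared_error_of_guess(guess, scores):
    """Sum of squared deviations when every score is predicted by one guess."""
    return sum((score - guess) ** 2 for score in scores)

# Compare the mean against a few other constant guesses.
guesses = [3, 3.5, y_bar, 4.5, 5]
errors = {guess: squared_error_of_guess(guess, y) for guess in guesses}
best_guess = min(errors, key=errors.get)

for guess, error in errors.items():
    print(f"guess {guess}: squared error {error}")
print("best guess:", best_guess)
```

Among these guesses, the mean gives the smallest squared error, which is why a flat line at the mean is the fallback prediction when x offers no help.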

The extent to which the regression line is sloped represents the amount we can predict y scores based on x scores, and the extent to which the regression line is beneficial in predicting y scores over and above the mean of the y scores. To represent this, we could look at how much the predicted points (which fall on the regression line) deviate from the mean. This deviation is represented by the little vertical lines I've drawn in the above figure. If we want to quantify this distance, we could use a similar method as before. The sum of the squared deviations of the predicted scores from the mean score, or

$$\sum (\hat{y} - \bar{y})^2$$

represents the amount of variation in the y scores explained by the x scores.

Total Variation
We may also want to know about the total variation in the y scores. That is pretty easy to represent, because we have done the same thing elsewhere. The total variation is measured simply by the sum of the squared deviations of the y scores from the mean:

$$\sum (y - \bar{y})^2$$

It turns out that the explained sum of squares and the unexplained sum of squares add up to equal the total sum of squares. The variation of the scores is either explained by x or not.

Total sum of squares = explained sum of squares + unexplained sum of squares.
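The partition can be verified numerically. The sketch below uses made-up data and assumes intercept 2.2 and slope 0.6, which are the least squares values for these particular points. The partition holds exactly only for the least squares line, so a different line would not add up this way.

```python
# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least squares intercept and slope for this data (assumed, not computed here).
a, b = 2.2, 0.6

y_bar = sum(y) / len(y)
y_hat = [a + b * xi for xi in x]

# Total: deviations of the actual scores from the mean.
ss_total = sum((yi - y_bar) ** 2 for yi in y)
# Explained: deviations of the predicted scores from the mean.
ss_explained = sum((yh - y_bar) ** 2 for yh in y_hat)
# Unexplained: deviations of the actual scores from the predicted scores.
ss_unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

print(f"total       = {ss_total:.2f}")
print(f"explained   = {ss_explained:.2f}")
print(f"unexplained = {ss_unexplained:.2f}")
print(f"explained + unexplained = {ss_explained + ss_unexplained:.2f}")
```

For this example the explained and unexplained pieces come to about 3.6 and 2.4, which together match the total of 6.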

The Regression Equation
The regression equation is simply a mathematical equation for a line. It is the equation that describes the regression line. In algebra, we represent the equation for a line with something like this:

$$y = a + bx$$

Here, a is the intercept, or the point at which the line travels through the y-axis (sometimes called the y-intercept), and b is the slope of the line. One can think of the y-intercept as the value of y when x is equal to 0. With a grid, we could find the slope of the line by counting how many points we have to go up to meet the line again after we have gone over one point to the right (remember "rise over run"). So the slope is a ratio of the increase in y with every point increase in x. With regression analysis, we need to find out what the equation of the line is for the best fitting line. What are the slope and intercept for the regression line? If the slope is zero, there is no relationship between x and y. If the slope is larger than 0 (or smaller, if the relationship is negative), there is a relationship.

To figure out the equation for the regression line, we first want to figure out the slope, b. Here is the formula for that:

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

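Pretty simple, we've done similar formulas before. Notice that on the top of the formula, we compute the deviations of the x's from the mean of x and the deviations of the y's from the mean of y and multiply them. We do not square them. This top part of the equation can be called the covariance of x and y (strictly, it is the sum of cross-products, which becomes the covariance when divided by n - 1). The slope then represents the amount that x and y covary together relative to the overall variation of x. And sometimes the equation for b is written in an equivalent form, such as:

$$b = r \frac{s_y}{s_x}$$

where r is the correlation between x and y, and $s_x$ and $s_y$ are their standard deviations.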

The intercept, a, can then be obtained using b:

$$a = \bar{y} - b\bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the means of x and y respectively. In regression analysis, we are attempting to predict y based on x scores, so we represent the regression equation with a symbol to indicate a predicted score:

$$\hat{y} = a + bx$$
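The slope and intercept calculations described above can be sketched in a few lines of Python. The function name `fit_line` and the data are made up for illustration; the slope is the sum of cross-products divided by the sum of squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x.

```python
def fit_line(x, y):
    """Return (a, b): the least squares intercept and slope for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Top of the slope formula: products of paired deviations (not squared).
    cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Bottom of the slope formula: squared deviations of x from its mean.
    ss_x = sum((xi - x_bar) ** 2 for xi in x)

    b = cross_products / ss_x
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
print(f"y-hat = {a:.2f} + {b:.2f}x")
```

Note that `fit_line` would divide by zero if every x value were the same, since there would then be no variation in x to predict from.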

I used the example of exam scores and time on the exam from the correlation lecture to compute the regression equation. I'm not going to go through the steps here, because you should be getting the hang of this stuff. You might want to check my work as an exercise. If we solve the above equations for b and a, we would then wind up with an equation that looked like this:

$$\hat{y} = -13.24 + .69x$$

meaning that the y-intercept is a value of -13.24 on the y axis, and the slope of the line is .69. The slope represents the amount of increase in y scores with a one-unit change in x scores. As we increase the time on the exam by 1 minute (x), we expect scores on the exam (y) to increase by .69 points. The y-intercept is not always meaningful. For instance, here, it means that when 0 minutes are spent on the exam, the predicted score on the test is about -13.
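Once the intercept and slope are known, prediction is just plugging in x. The sketch below uses the intercept (-13.24) and slope (.69) reported above; the function name `predict_exam_score` and the example minute values are made up for illustration.

```python
INTERCEPT = -13.24  # predicted score at 0 minutes (not a meaningful value here)
SLOPE = 0.69        # expected change in score per additional minute

def predict_exam_score(minutes):
    """Predicted exam score (y-hat) for a given time on the exam in minutes."""
    return INTERCEPT + SLOPE * minutes

# Hypothetical times on the exam, chosen only to show the pattern.
for minutes in (0, 30, 31):
    print(minutes, "minutes ->", round(predict_exam_score(minutes), 2))

# One extra minute changes the prediction by the slope.
one_minute_gain = predict_exam_score(31) - predict_exam_score(30)
print("gain for one more minute:", round(one_minute_gain, 2))
```

Predictions for x values far outside the range of the original data, such as 0 minutes, should be read with caution, which is why the intercept here is not meaningful.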