Common pitfalls in statistical analysis: Linear regression analysis


Perspect Clin Res. 2017 Apr-Jun; 8(2): 100–102.

PMCID: PMC5384397

PMID: 28447022


Abstract

In a previous article in this series, we explained correlation analysis, which describes the strength of the relationship between two continuous variables. In this article, we deal with linear regression analysis, which predicts the value of one continuous variable from another. We also discuss the assumptions and pitfalls associated with this analysis.

Keywords: Biostatistics, linear model, regression analysis

We often have information on two numeric characteristics for each member of a group and believe that these are related to each other – i.e. values of one characteristic vary depending on the values of the other. For instance, in a recent study, researchers had data on body mass index (BMI) and mid-upper arm circumference (MUAC) on 1373 hospitalized patients, and they decided to determine whether there was a relationship between BMI and MUAC.[1] In such a situation, as we discussed in a recent piece on “Correlation” in this series,[2] the researchers would plot the data on a scatter diagram. If the dots fall roughly along a straight line, sloping either upwards or downwards, they would conclude that a relationship exists. As a next step, they may be tempted to ask whether, knowing the value of one variable (MUAC), it is possible to predict the value of the other variable (BMI) in the study group. This can be done using “simple linear regression” analysis, also sometimes referred to as “linear regression.” The variable whose value is known (MUAC here) is referred to as the independent (or predictor or explanatory) variable, and the variable whose value is being predicted (BMI here) is referred to as the dependent (or outcome or response) variable. The independent and dependent variables are, by convention, referred to as “x” and “y” and are plotted on horizontal and vertical axes, respectively.

At times, one is interested in predicting the value of a numerical response variable based on the values of more than one numeric predictors. For instance, one study found that whole-body fat content in men could be predicted using information on thigh circumference, triceps and thigh skinfold thickness, biceps muscle thickness, weight, and height.[3] This is done using “multiple linear regression.” We will not discuss this more complex form of regression.

Although the concepts of “correlation” and “linear regression” are somewhat related and share some assumptions, these also have some important differences, as we discuss later in this piece.

THE REGRESSION LINE

Linear regression analysis of observations on two variables (x and y) in a sample can be looked upon as plotting the data and drawing a best fit line through these. This “best fit” line is so chosen that the sum of squares of all the residuals (the vertical distance of each point from the line) is a minimum – the so-called “least squares line” [Figure 1]. This line can be mathematically defined by an equation of the form:


Figure 1

Data from a sample and estimated linear regression line for these data. Each dot corresponds to a data point, i.e., an individual pair of values for x and y, and the vertical dashed lines from each dot represent residuals. The capital letters (Y) are used to indicate predicted values and lowercase letters (x and y) for known values. Intercept is shown as “a” and slope or regression coefficient as “b”

Y = a + bx

where “x” is the known value of the independent (or predictor or explanatory) variable, “Y” is the predicted (or fitted) value of “y” (the dependent, outcome, or response variable) for the given value of “x”, “a” is called the “intercept” of the estimated line and represents the value of Y when x = 0, and “b” is called the “slope” of the estimated line and represents the amount by which Y changes on average as “x” increases by one unit. The slope is also referred to as the “coefficient,” “regression coefficient,” or “gradient.” Note that lowercase letters (x and y) denote the actual values and the capital letter (Y) the predicted value.

The value of “b” is positive when the value of Y increases with each unit increase in x and is negative if the value of Y decreases with each unit increase in x [Figure 2]. If the value of Y does not change with x, the value of “b” would be expected to be 0. Furthermore, the higher the magnitude of “b,” the steeper is the change in Y with change in x.


Figure 2

Relationships between two quantitative variables and their regression coefficients (“b”). “b” represents the predicted change in the value of the dependent variable (on the Y axis) for each one-unit increase in the value of the independent variable (on the X axis). “b” is positive, zero, or negative, depending on whether, as the independent variable increases, the value of the dependent variable is predicted to increase (panels i and ii), remain unchanged (iii), or decrease (iv). A higher absolute value of “b” indicates that the dependent variable changes more for each unit increase in the predictor (ii vs. i)

In the example of BMI and MUAC,[1] the linear regression equation was: BMI = –0.042 + 0.972 × MUAC (in cm). Here, +0.972 is the slope or coefficient and indicates that, on average, BMI is expected to be higher by 0.972 units for each unit (cm) increase in MUAC. The first term in the equation (i.e., –0.042) represents the intercept and would be the expected BMI if a person had a MUAC of 0 (a zero or negative value of BMI may appear unusual, but more on this later).
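Using the published equation quoted above, prediction is a direct substitution of the MUAC value into the fitted line:

```python
# Predict BMI from MUAC using the fitted line reported in the text:
# BMI = -0.042 + 0.972 * MUAC (MUAC in cm).
def predict_bmi(muac_cm: float) -> float:
    return -0.042 + 0.972 * muac_cm

# A MUAC of 25 cm predicts a BMI of about -0.042 + 0.972 * 25 = 24.26.
print(round(predict_bmi(25.0), 2))
```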

ASSUMPTIONS

Regression analysis makes several assumptions, which are quite akin to those for correlation analysis, as we discussed in a recent issue of the journal.[2] To recapitulate, first, the relationship between x and y should be linear. Second, all the observations in a sample must be independent of each other; thus, this method should not be used if the data include more than one observation on any individual. Furthermore, the data must not include one or a few extreme values, since these may create a false sense of relationship even when none exists. If these assumptions are not met, the results of linear regression analysis may be misleading.

CORRELATION VERSUS REGRESSION

Correlation and regression analyses are similar in that both assess the linear relationship between two quantitative variables. However, they look at different aspects of this relationship. Simple linear regression (through its coefficient, “b”) describes the nature of the association – it provides a means of predicting the value of the dependent variable from the value of the predictor variable, and indicates how much, and in which direction, the dependent variable changes on average for a unit increase in the predictor. By contrast, correlation (i.e., the correlation coefficient, “r”) provides a measure of the strength of the linear association – of how closely the individual data points lie to the regression line. The values of “b” and “r” always carry the same sign – either both are positive or both are negative. However, their magnitudes can vary widely: for the same value of “b,” the magnitude of “r” can range from close to 0 up to 1.0.

ADDITIONAL CONSIDERATIONS

Some points must be kept in mind when interpreting the results of regression analysis. The absolute value of the regression coefficient (“b”) depends on the units used to measure the two variables. For instance, in a linear regression equation of BMI (dependent) versus MUAC (independent), the value of “b” will be 2.54-fold higher if the MUAC is expressed in inches instead of in centimeters (1 inch = 2.54 cm); alternatively, if the MUAC is expressed in millimeters, the regression coefficient will become one-tenth of the original value (1 mm = 1/10 cm). A change in the unit of “y” will also change the value of the regression coefficient. This must be kept in mind when interpreting the absolute value of a regression coefficient.

Similarly, the value of the “intercept” depends on the unit used to measure the dependent variable. Another important point to remember about the “intercept” is that its value may not be biologically or clinically interpretable. For instance, in the MUAC-BMI example above, the intercept was −0.042, a negative value for BMI, which is clearly implausible. This typically happens when the value of the independent variable cannot realistically be 0, as is true here: a MUAC of 0 simply cannot occur in real life.

Furthermore, a regression equation should be used for prediction only for values of the independent variable that lie within the range observed in the data originally used to develop the equation.
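In code, this extrapolation rule can be enforced by refusing predictions outside the fitted range. The helper below is hypothetical, and the MUAC range of 15-40 cm is an assumption for illustration, not a figure from the cited study:

```python
# A prediction helper that refuses to extrapolate beyond the range of
# the predictor values used to fit the line (names are hypothetical).
def predict_within_range(x_new, a, b, x_min, x_max):
    if not (x_min <= x_new <= x_max):
        raise ValueError(
            f"x = {x_new} is outside the fitted range [{x_min}, {x_max}]; "
            "the regression equation should not be used here."
        )
    return a + b * x_new

# Example with the BMI-MUAC equation from the text; the 15-40 cm range
# is an assumed illustration, not the range reported in the study.
print(predict_within_range(25.0, a=-0.042, b=0.972, x_min=15.0, x_max=40.0))
```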

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.

REFERENCES

1. Benítez Brito N, Suárez Llanos JP, Fuentes Ferrer M, Oliva García JG, Delgado Brito I, Pereyra-García Castro F, et al. Relationship between mid-upper arm circumference and body mass index in inpatients. PLoS One. 2016;11:e0160480. [PMC free article] [PubMed] [Google Scholar]

2. Aggarwal R, Ranganathan P. Common pitfalls in statistical analysis: The use of correlation techniques. Perspect Clin Res. 2016;7:187–90. [PMC free article] [PubMed] [Google Scholar]

3. Bielemann RM, Gonzalez MC, Barbosa-Silva TG, Orlandi SP, Xavier MO, Bergmann RB, et al. Estimation of body fat in adults using a portable A-mode ultrasound. Nutrition. 2016;32:441–6. [PubMed] [Google Scholar]

Articles from Perspectives in Clinical Research are provided here courtesy of Wolters Kluwer -- Medknow Publications

