Learn how to select the best performing linear regression for univariate models (2024)

By Björn Hartmann

Find out which linear regression model is the best fit for your data

Inspired by a question after my previous article, I want to tackle an issue that often comes up after trying different linear models: you need to choose which model you want to use. More specifically, Khalifa Ardi Sidqi asked:

“How to determine which model suits best to my data? Do I just look at the R square, SSE, etc.?

As the interpretation of that model (quadratic, root, etc.) will be very different, won’t it be an issue?”

The second part of the question can be answered easily. First, find the model that best fits your data, and then interpret its results. It is good to have ideas about how your data might be explained. However, interpret only the best model.

The rest of this article will address the first part of his question. Please note that I will share my approach on how to select a model. There are multiple ways, and others might do it differently. But I will describe the way that works best for me.

In addition, this approach only applies to univariate models. Univariate models have just one input variable. I am planning a further article, where I will show you how to assess multivariate models with more input variables. For today, however, let us focus on the basics and univariate models.

To practice and get a feeling for this, I wrote a small ShinyApp. Use it and play around with different datasets and models. Notice how the parameters change, and become more confident in assessing simple linear models. Finally, you can also use the app as a framework for your own data. Just copy it from GitHub.

[Figure: screenshot of the app comparing two models]

Use the Adjusted R2 for univariate models

If you only use one input variable, the adjusted R2 value gives you a good indication of how well your model performs. It illustrates how much variation is explained by your model.

In contrast to the simple R2, the adjusted R2 takes the number of input factors into account. It penalizes too many input factors and favors parsimonious models.

In the screenshot above, you can see two models with values of 71.3% and 84.32%. Clearly, the second model is better than the first one. Models with low values, however, can still be useful because the adjusted R2 is sensitive to the amount of noise in your data. As such, only compare this indicator between models for the same dataset rather than across different datasets.
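To make this comparison concrete, here is a minimal Python sketch (it is not the code behind the ShinyApp). It fits three univariate models to the same made-up dataset and compares their adjusted R2. The data, the helper names `fit_line` and `adjusted_r2`, and the choice of transformations are assumptions for illustration only.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def adjusted_r2(y, y_hat, n_inputs=1):
    """Adjusted R2 from actual and predicted values."""
    n = len(y)
    my = sum(y) / n
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat))  # sum of squared errors
    sst = sum((yi - my) ** 2 for yi in y)                  # total sum of squares
    return 1 - (sse / (n - n_inputs - 1)) / (sst / (n - 1))


# Made-up data: roughly quadratic in x, with a deterministic wobble as "noise".
x = [float(i) for i in range(1, 21)]
y = [2.0 + 0.5 * xi ** 2 + 3.0 * math.sin(xi) for xi in x]

# Each candidate model is y = a + b * f(x) for a different transformation f.
transforms = {
    "linear": lambda v: v,
    "quadratic": lambda v: v ** 2,
    "root": math.sqrt,
}

scores = {}
for name, f in transforms.items():
    z = [f(xi) for xi in x]
    a, b = fit_line(z, y)
    y_hat = [a + b * zi for zi in z]
    scores[name] = adjusted_r2(y, y_hat)

best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name:>10}: adjusted R2 = {score:.4f}")
print("best model:", best)
```

All three scores come from the same dataset, which is what makes them comparable.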

Usually, there is little need for the SSE

Before you read on, let’s make sure we are talking about the same SSE. On Wikipedia, SSE refers to the sum of squared errors. In some statistics textbooks, however, SSE can refer to the explained sum of squares (the exact opposite). So for now, suppose SSE refers to the sum of squared errors.

With that definition, the adjusted R2 is approximately 1 - SSE/SST, with SST referring to the total sum of squares.

I do not want to dive deeper into the math behind this. What I want to show you is that the adjusted R2 is computed with the SSE. So the SSE usually does not give you any additional information.
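For readers who do want the exact relationship, the standard textbook definitions are given below. The symbols n (number of observations) and p (number of input variables, so p = 1 for a univariate model) are my notation, not the article's.

```latex
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}},
\qquad
R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{SSE} / (n - p - 1)}{\mathrm{SST} / (n - 1)}
```

For a large n and a single input variable, the two correction terms nearly cancel, which is why the adjusted R2 is approximately 1 - SSE/SST.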

Furthermore, the adjusted R2 is normalized: it never exceeds one and, for any reasonable model, lies between zero and one. So it is easier for you and others to interpret an unfamiliar model with an adjusted R2 of 75% than one with an SSE of 394, even though both figures might describe the same model.
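The following small Python sketch illustrates why the SSE is harder to interpret on its own. The same made-up dataset is fitted twice, once in its original units and once with the output multiplied by 100 (think metres versus centimetres). The data and the helper names are assumptions for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def sse_and_adjusted_r2(x, y):
    """Return (SSE, adjusted R2) for a univariate linear fit."""
    a, b = fit_line(x, y)
    n = len(y)
    my = sum(y) / n
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    # One input variable, so the residual degrees of freedom are n - 2.
    adj = 1 - (sse / (n - 2)) / (sst / (n - 1))
    return sse, adj


# Made-up data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
y_scaled = [100 * yi for yi in y]  # same measurements in a smaller unit

sse_1, adj_1 = sse_and_adjusted_r2(x, y)
sse_2, adj_2 = sse_and_adjusted_r2(x, y_scaled)

print(f"original units: SSE = {sse_1:.4f}, adjusted R2 = {adj_1:.4f}")
print(f"scaled units:   SSE = {sse_2:.4f}, adjusted R2 = {adj_2:.4f}")
```

The SSE changes with the unit of measurement while the adjusted R2 stays the same, so a bare SSE value means little to someone who does not know the scale of the data.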

Have a look at the residuals or error terms!

What is often ignored are error terms or so-called residuals. They often tell you more than what you might think.

The residuals are the differences between the actual values and your predicted values (actual minus predicted).

Their benefit is that they can show you both the magnitude as well as the direction of your errors. Let’s have a look at an example:

[Figure: We do not want residuals to vary like this around zero]

Here, I tried to predict a polynomial dataset with a linear function. Analyzing the residuals shows that there are areas where the model has an upward or downward bias.

For 50 < x < 100, the residuals are above zero. So in this area, the actual values have been higher than the predicted values: our model has a downward bias.

For 100 < x < 150, however, the residuals are below zero. Thus, the actual values have been lower than the predicted values: the model has an upward bias.

It is always good to know whether your model predicts values that are too high or too low. But you usually do not want to see patterns like this.

The residuals should be zero on average (as indicated by the mean), and they should be evenly distributed around zero. Predicting the same dataset with a polynomial of degree 3 suggests a much better fit:

[Figure: Here the residuals are evenly distributed around zero, suggesting a much better fit]
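If you want to check for such patterns without a plot, one rough option is to average the residuals over consecutive ranges of x. The sketch below does this for a straight-line fit and for a fit on x squared. The dataset is made up, and the helper names `residuals` and `mean_by_segment` are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def residuals(x, y, transform):
    """Residuals (actual minus predicted) of the model y = a + b * transform(x)."""
    z = [transform(xi) for xi in x]
    a, b = fit_line(z, y)
    return [yi - (a + b * zi) for yi, zi in zip(y, z)]


def mean_by_segment(res, n_segments=3):
    """Average residual in consecutive, equally sized chunks (x must be sorted)."""
    size = len(res) // n_segments
    return [sum(res[i * size:(i + 1) * size]) / size for i in range(n_segments)]


# Made-up curved data with a small deterministic wobble.
x = [float(i) for i in range(1, 31)]
y = [5.0 + 0.2 * xi ** 2 + 0.5 * ((7 * i) % 5 - 2) for i, xi in enumerate(x, start=1)]

res_linear = residuals(x, y, lambda v: v)
res_quadratic = residuals(x, y, lambda v: v ** 2)

segments_linear = mean_by_segment(res_linear)
segments_quadratic = mean_by_segment(res_quadratic)

print("linear fit, mean residual per third:   ", [round(m, 2) for m in segments_linear])
print("quadratic fit, mean residual per third:", [round(m, 2) for m in segments_quadratic])
```

For the straight line, the segment means change sign from one range to the next, which is the same kind of local bias as in the first residual plot. For the quadratic fit they stay close to zero.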

In addition, you can observe whether the variance of your errors increases. In statistics, this is called heteroscedasticity. You can account for it with robust standard errors; otherwise, your hypothesis tests are likely to be wrong.
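As a rough illustration of what "robust" means here, the sketch below computes the classical standard error of the slope next to a heteroscedasticity-consistent one (White's HC0 estimator) for a univariate model. It is hand-rolled to show the idea; in practice you would probably rely on a statistics package. The function name and the data are assumptions.

```python
import math


def slope_standard_errors(x, y):
    """Return (slope, classical SE, HC0 robust SE) for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    res = [yi - (a + b * xi) for xi, yi in zip(x, y)]

    # Classical SE: assumes every error has the same variance.
    classical = math.sqrt(sum(e ** 2 for e in res) / (n - 2) / sxx)
    # HC0 SE: weights each squared residual by its own leverage on the slope.
    robust = math.sqrt(sum((xi - mx) ** 2 * e ** 2 for xi, e in zip(x, res))) / sxx
    return b, classical, robust


# Made-up "funnel" data: the spread of the errors grows with x.
x = [float(i) for i in range(1, 21)]
y = [1.0 + 2.0 * xi + (-1) ** i * 0.05 * xi ** 2 for i, xi in enumerate(x)]

slope, se_classical, se_robust = slope_standard_errors(x, y)
print(f"slope = {slope:.3f}")
print(f"classical SE = {se_classical:.4f}, robust (HC0) SE = {se_robust:.4f}")
```

When the two standard errors differ noticeably, tests based on the classical one should be treated with caution.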

Histogram of residuals

Finally, the histogram summarizes the magnitude of your error terms. It provides information about the bandwidth of errors and indicates how often which errors occurred.

[Figure: The right histogram indicates a smaller bandwidth of errors than the left one, so it seems to be a better fit]

The above screenshots show two models for the same dataset. In the left histogram, errors range from -338 to 520.

In the right histogram, errors range from -293 to 401. So the outliers are much less extreme. Furthermore, most errors in the model of the right histogram are closer to zero. So I would favor the right model.
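The same comparison can be sketched without a plot by computing the range of the residuals and counting them per bin. The residual values below are invented and are not the ones from the screenshots; `summarize` is a hypothetical helper.

```python
import math
from collections import Counter


def summarize(res, bin_width):
    """Return (smallest residual, largest residual, {bin start: count})."""
    bins = Counter(math.floor(e / bin_width) * bin_width for e in res)
    return min(res), max(res), dict(sorted(bins.items()))


# Invented residuals of two models fitted to the same dataset.
residuals_model_a = [-310, -150, -90, -40, -10, 20, 60, 130, 240, 480]
residuals_model_b = [-120, -60, -30, -15, -5, 5, 10, 25, 70, 110]

for name, res in [("model A", residuals_model_a), ("model B", residuals_model_b)]:
    lo, hi, bins = summarize(res, bin_width=100)
    print(f"{name}: errors range from {lo} to {hi}")
    for start, count in bins.items():
        # One '#' per residual that falls into [start, start + bin_width).
        print(f"  {start:>5} | {'#' * count}")
```

A narrower range and taller bars near zero point to the model with the smaller errors.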

Summary

When choosing a linear model, these are factors to keep in mind:

  • Only compare linear models for the same dataset.
  • Find a model with a high adjusted R2.
  • Make sure this model has evenly distributed residuals around zero.
  • Make sure the errors of this model are within a small bandwidth.


If you have any questions, write a comment below or contact me. I appreciate your feedback.
