According to a Forrester report, the machine learning industry is set to grow to about 60 billion dollars by 2025, up from 1.4 billion dollars in 2016. By 2030, the projected contribution of AI to global GDP is estimated at a staggering 15.7 trillion dollars. It is no secret that Machine Learning is disrupting almost every sector, from healthcare to automation. To make matters more interesting, a study released by Tencent in 2017 claimed that there were only 300,000 AI and machine learning practitioners worldwide, while another study based on LinkedIn data put the number of AI experts at only 22,000 as of 2019. Clearly, there is a huge shortage of talent for the millions of machine learning jobs expected to open in the next few years.
If you are thinking of dipping your toes into machine learning, this is the place to start. We are going to start with the most basic machine learning algorithm — Linear Regression.
Although basic, Linear Regression is a very powerful statistical technique. It can be used to generate key insights into consumer behaviour, evaluate trends, make estimates and forecasts, and identify factors influencing profitability. For example, if a company’s sales have increased steadily every month for the past few years, conducting a linear analysis on the monthly sales data would let the company forecast sales in future months.
So what is Linear Regression? Linear Regression is a method of finding the relationship between two variables, x and y, where x is the independent variable, also called the predictor, and y is the dependent variable, also known as the response. In this article, I will only be focusing on the most basic form of Linear Regression — I will cover multivariable linear regression in another post.
I believe the best way to learn is through examples, so let’s imagine you are a recruiter at a company and you are trying to gauge how much to pay new hires and how to scale the pay of existing employees. You have two pieces of data about your existing employees: the number of years of experience they have and their salary. Since salary depends on years of experience, employee salary will be the y-variable and years of experience will be the x-variable.
The above plot shows our data. So let’s say you are looking to hire a new employee who has five years of experience — how much should you pay them? This is where Linear Regression comes in: we want to find the “best-fit” line for our current data points. We can then use that best-fit line to estimate salaries for new employees based on their years of experience.
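To make the idea concrete, here is a minimal sketch of finding a best-fit line in Python with NumPy. The experience and salary numbers below are made-up placeholder data, not the actual dataset from this article — they just illustrate the shape of the problem.

```python
import numpy as np

# Hypothetical data: years of experience (x) and salary in dollars (y)
years = np.array([1.1, 2.0, 3.2, 4.0, 5.3, 6.8, 8.2, 9.5])
salary = np.array([39000, 43000, 60000, 56000, 67000, 91000, 101000, 116000])

# Fit a degree-1 polynomial (a straight line) to the points
m, b = np.polyfit(years, salary, deg=1)

# Use the fitted line to estimate a salary for 5 years of experience
estimate = m * 5 + b
print(f"gradient m = {m:.2f}, intercept b = {b:.2f}")
print(f"estimated salary for 5 years of experience: {estimate:.0f}")
```

The fitted gradient tells you roughly how much salary increases per additional year of experience, and the intercept is the model’s baseline salary at zero experience.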
Understanding Linearity
In school you’ve most probably learnt the following equation to describe a straight line:

y = mx + b

where m is the gradient (slope) of the line and b is the y-intercept.
One method to get the best-fit line is to draw it by hand, making sure that roughly equal numbers of points lie above and below your line. After drawing the straight line, you can read off its gradient and y-intercept and write down the equation of the line. But this method is riddled with errors and is not feasible if you are working with a huge dataset.
There are a multitude of ways to carry out linear regression. In this instance, I’ll be showing you how to get the best-fit line using plot.ly. It’s a pretty great tool that has both an offline and an online version.
Firstly, download the dataset from the github link below:
After you’ve done that, use plot.ly to plot all the points. If you need help with this step, just follow my tutorial here. Once you have also plotted a “best-fit” line for these points, you will have a graph which looks like the image below.
Plot.ly will also give you an estimate for m (the gradient) and b (the y-intercept).
Take these values of m and b, substitute them back into the straight-line equation y = mx + b, and you have a linear regression model that can predict employee salaries based on years of experience.
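Once you have m and b, prediction is just plugging x into the equation. The values below are hypothetical placeholders — substitute whatever gradient and intercept plot.ly reports for your own fit.

```python
# Hypothetical values read off a plot.ly fit; replace with your own m and b
m = 9450.0   # gradient: salary increase per year of experience
b = 25790.0  # y-intercept: baseline salary at zero experience

def predict_salary(years_experience):
    """Straight-line model: y = m*x + b."""
    return m * years_experience + b

# Salary estimate for a candidate with 5 years of experience
print(predict_salary(5))
```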
This method is an easy way to get an equation for the best-fit line. However, it is important to understand how these tools derive that equation. They use something called the Cost Function, which measures how well the best-fit line fits the data points. To learn more about the cost function, check out my article here.
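As a taste of what the cost function measures, here is a sketch of one common choice, the mean squared error: for a candidate line y = mx + b, it averages the squared vertical distances between the line and the data points, so a lower cost means a better fit. (The cost-function article may use a slightly different variant; this is just the core idea.)

```python
def mse_cost(m, b, xs, ys):
    """Mean squared error between the line y = m*x + b and the data points."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3]
ys = [3, 5, 7]            # these points lie exactly on y = 2x + 1

print(mse_cost(2, 1, xs, ys))  # perfect fit: cost is 0.0
print(mse_cost(1, 1, xs, ys))  # worse line: cost is higher
```

Fitting a line then amounts to searching for the m and b that make this cost as small as possible — which is exactly what tools like plot.ly do behind the scenes.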