Linear Regression — Basics, Assumptions and Evaluation

Divij Sharma
Jan 28, 2023 · 8 min read

Linear Regression (LR) was the first algorithm I learnt when starting my journey in Data Science. In a sense, Linear Regression is the “Hello World” of algorithms: it is the most basic one. Linear equations are part of the school curriculum from early on, so we already have an introduction to the idea, which makes the Linear Regression algorithm much easier to understand.

In this article, I will go into the details of Linear Regression from a Data Science perspective: what LR is, in what situations it is used, and what assumptions and data pre-processing it needs.

What IS Linear Regression?

According to Wikipedia¹

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables. The most common form of regression analysis is linear regression, in which one finds the line (or a more complex linear combination) that most closely fits the data according to a specific mathematical criterion.

So what does this mean? Let us take this in parts.

Processes for estimating the relationships between a dependent variable and one or more independent variables — This means that both dependent and independent variables must be present in the data. In other words, the data should be pre-labeled: the value of the target variable (the one we are trying to predict) is known for every observation alongside the other variables. LR is, therefore, a supervised machine learning technique.

In which one finds the line that most closely fits the data according to a specific mathematical criterion — This means that the relationship between the target and the independent variables is treated as LINEAR only. The linear function is usually of the form f(x) = mx + b, which means the target variable depends only on the first degree of the independent variables.

Assumptions for Linear Regression

Since LR specifically looks for a linear function, i.e. tries to fit a line across the data points, the data has to satisfy some assumptions. In addition to the basic assumption that “The sample is representative of the population at large”², the other assumptions are as follows³ —

  1. Linearity — The relationship between the independent variable (x) and the mean of the dependent variable (y) is linear. Explanation — As stated above, the relationship between x and y is represented by a straight-line (linear) function f(x) = mx + b, so the target variable depends only on the first degree of the independent variables.
  2. Homoscedasticity — The variance of the residuals is the same for any value of x. Explanation — A residual is the error term, the “noise” or random disturbance in the data. Homoscedasticity means that this random disturbance has the same spread across all observations of the independent variables. It is easier to understand homoscedasticity through its opposite, heteroscedasticity. Say we have data on household income and the amount spent on dining out, with a strong positive correlation between the two. Low-income families all spend very little on dining out, so their residuals are small and roughly equal in size (which is what LR needs). High-income families behave differently: some spend a lot on dining out, some a moderate amount and some very little. The size of the error therefore grows across the income range, i.e. it varies with the value of the independent variable. Under homoscedasticity the scatter plot of residuals against fitted values forms an even band around zero; under heteroscedasticity it fans out into a cone-shaped pattern (a quick way to check this is sketched after this list). Why is this a problem for LR? The answer lies in how Linear Regression works: LR uses Ordinary Least Squares (OLS) as its cost function, which minimizes the sum of squared residuals and gives equal weight to every observation, so under heteroscedasticity the observations with larger residuals have more “influence” or “pull” on the fitted line than the others.
  3. Independence — Observations are independent of each other. Explanation — This has two aspects: A. each observation, i.e. each row of the data, and B. each independent variable. The independence assumption applies to both. It is assumed that every row of the data is collected independently of the others. It is also assumed that the independent variables are not functions of one another: if x₁ and x₂ are two variables in the data, then x₁ ≠ f(x₂) and x₂ ≠ f(x₁).
  4. Normality — For any fixed value of x, y is normally distributed. Explanation — This is one of the most discussed and most wrongly interpreted assumptions of Linear Regression. A classic example used against the naive reading is that x has a Bernoulli distribution and y = x + C: neither x nor y is normally distributed on its own, yet regressing y on x still works. What the assumption really means is that the residuals of y are normally distributed. What happens if this assumption is violated? We can apply a nonlinear transformation to the independent and/or dependent variable.
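
To make the last three assumptions a little more concrete, here is a minimal sketch (my own illustration, assuming numpy, scipy and matplotlib are available; the data is synthetic) that fits a line and then inspects the residuals for an even band around zero and for normality:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.5, size=x.size)  # linear signal + Gaussian noise

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
fitted = slope * x + intercept
residuals = y - fitted

# Homoscedasticity: residuals vs fitted values should form an even band around zero,
# not a cone that widens as the fitted values grow.
plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()

# Normality of the residuals: a Shapiro-Wilk p-value well above 0.05 is
# consistent with normally distributed residuals.
print(stats.shapiro(residuals))
```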

Data cleanup before Linear Regression

As Linear Regression is a fairly simple machine learning model, it needs the data prepared in a certain way. In addition to the data conforming with the above-mentioned assumptions, the observations themselves need some cleanup. The steps to take before feeding the data into linear regression are as follows —

  1. Feature scaling — As the target variable is linearly dependent on the feature variables, the scale of each feature plays a very important role in Linear Regression. Consider a dataset in which one feature is in the range 1–10 (say weight in kg) and another in the range 1,000–10,000 (length in meters). Because the target variable is linearly dependent on the feature variables, the features on a larger scale will influence the target variable more. If the model follows the equation y = x₁ + 1000x₂ + C, then any change in x₂ will affect y 1000 times more than the same change in x₁. To avoid this issue, feature scaling is a very important preprocessing step for Linear Regression. The common ways of scaling features are listed below (a short scikit-learn sketch follows the list) —
  • Absolute Maximum Scaling — all the values of the feature are divided by the absolute maximum value of that feature.
  • Min-Max Scaling — subtract the minimum value of the feature from all the values and then divide by the range of the feature. Range = (maximum − minimum)
  • Normalization (mean normalization) — subtract the mean of the feature from all the values and then divide by the range of the feature. Range = (maximum − minimum)
  • Standardization — compute the z-value of each value of the feature and replace the original values with it. This makes sure that every feature is centered at zero (its mean) with a standard deviation of 1.
  • Robust Scaling — subtract the median of the feature from each value and then divide by the Inter Quartile Range (IQR).
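
Most of the scalers above have ready-made counterparts in scikit-learn (assumed available here); the toy feature matrix below is made up, and mean normalization as described above has no dedicated scikit-learn class, so it would need to be written by hand. A minimal sketch of the API:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler

# Two toy features on very different scales: weight in kg and length in meters.
X = np.array([[2.0, 1_000.0],
              [5.0, 4_000.0],
              [9.0, 9_500.0]])

for scaler in (MaxAbsScaler(),    # absolute maximum scaling
               MinMaxScaler(),    # min-max scaling
               StandardScaler(),  # standardization (z-values)
               RobustScaler()):   # robust scaling with median and IQR
    print(type(scaler).__name__)
    print(scaler.fit_transform(X))
```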

2. Outliers — As is the case with scale, outliers also play a significant role in “skewing” the prediction of the target variable. Firstly, outliers affect the scaling itself: in all the scaling methods above the maximum or minimum value plays a very important role, so outliers can “skew” the scaling. Secondly, since outliers are either very large or very small compared to the rest of the data points, and in LR the target variable is linearly dependent on the feature variables, outliers can “skew” the fitted model. A common way to flag them is sketched below.
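
A minimal sketch of one common way to flag outliers before fitting, the 1.5 × IQR rule, assuming pandas and a made-up income column:

```python
import pandas as pd

# Toy income column; 400 is an obvious outlier.
df = pd.DataFrame({"income": [30, 35, 32, 40, 38, 400]})

q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
within_fences = df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[within_fences]  # drop (or cap) the flagged rows before scaling and fitting
print(clean)
```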

3. Missing values — The default behaviour in linear regression analysis is to eliminate any data point with a missing value on any of the variables (listwise deletion). As a result, the sample size can shrink significantly if many observations have missing values, which affects the model. The sketch below contrasts this with a simple imputation.
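
A minimal sketch, assuming pandas and scikit-learn and a toy data frame, contrasting listwise deletion with mean imputation:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"x1": [1.0, 2.0, np.nan, 4.0],
                   "x2": [10.0, np.nan, 30.0, 40.0],
                   "y":  [5.0, 7.0, 9.0, 11.0]})

dropped = df.dropna()  # listwise deletion: only 2 of the 4 rows survive here
imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                       columns=df.columns)  # every row kept, NaNs replaced by column means
print(dropped)
print(imputed)
```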

Cost Function

Now, with multiple features, the actual regression line may look like —

y = a₁x₁ + a₂x₂ + a₃x₃ + … + aₙxₙ + C

Different values of the coefficients a₁, a₂, a₃, …, aₙ give different regression lines. The main goal of Linear Regression is to find the best-fit line, which means the error between the predicted and actual values should be minimum. The distance between an actual value and the corresponding predicted value is called the residual. To minimize the residuals we use a cost function: if the residuals are large the cost function is high, and if they are small the cost function is low.

In Linear Regression, we generally use Mean Squared Error (MSE) as the cost function. MSE is the average of the squared residuals. A low value of MSE implies a better fit.

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)², where yᵢ is the actual value, ŷᵢ the predicted value and n the number of observations.
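
For concreteness, a tiny sketch with made-up numbers (assuming numpy and scikit-learn) that computes MSE both by hand and with scikit-learn's helper:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # actual values
y_pred = np.array([2.8, 5.4, 7.0, 10.3])   # predicted values

mse_manual = np.mean((y_true - y_pred) ** 2)      # (1/n) * sum of squared residuals
mse_sklearn = mean_squared_error(y_true, y_pred)  # same number
print(mse_manual, mse_sklearn)
```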

Another way to understand “fit” is the R-squared (R²) measure. R² determines the goodness of fit⁴ and is also called the coefficient of determination. R² measures the strength of the relationship between the predicted and actual values on a scale of 0–100%. Statistically, R² indicates the percentage of variance in the dependent variable that the independent variables explain collectively. A higher value of R² indicates smaller differences between predicted and observed values.

R² = 0 represents a model that does not explain any of the variation in the response variable around its mean; the mean of the dependent variable predicts the dependent variable as well as the regression model does.

R² = 100% represents a model that explains all the variation in the response variable around its mean.

A larger value of R² generally indicates a better regression model, with the caveat below.
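
A minimal sketch on synthetic data (assuming scikit-learn is available) showing that LinearRegression.score and r2_score report the same R²:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1.0, size=100)

model = LinearRegression().fit(X, y)
print(model.score(X, y))              # R² on the training data
print(r2_score(y, model.predict(X)))  # same value
```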

Caveat

Higher values of R² or lower values of MSE can be a result of overfitting. Therefore, to evaluate the goodness of fit, we should look at the residual plot in addition to R² and MSE. It is much easier to identify a biased model from a residual plot than by hunting for problematic patterns numerically. In a good model the residuals are randomly scattered around zero for every predicted value; residuals centered on zero indicate that the model's predictions are correct on average rather than systematically too high or too low. Regression also assumes that the residuals follow a normal distribution and that the degree of scatter is the same for all predicted values⁵.
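
On the overfitting point, a standard sanity check (not specific to this article) is to compare R² on held-out data with R² on the training data; a minimal sketch with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 3))
y = 1.5 * X[:, 0] + rng.normal(0, 1.0, size=200)  # only the first feature matters

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("train R²:", model.score(X_train, y_train))
print("test R²:", model.score(X_test, y_test))  # a large gap would hint at overfitting
```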

Conclusion

Every data scientist must understand Linear Regression thoroughly. LR forms the foundation of many other aspects of data science.

References

¹https://en.wikipedia.org/wiki/Regression_analysis

²https://en.wikipedia.org/wiki/Regression_analysis#Underlying_assumptions

³https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R5_Correlation-Regression/R5_Correlation-Regression7.html

⁴https://statisticsbyjim.com/regression/interpret-r-squared-regression/

⁵https://statisticsbyjim.com/regression/check-residual-plots-regression-analysis/
