Linear Regression from Scratch: Understanding the Mathematics Behind the Model¶
Introduction¶
What is Linear Regression?
Linear regression is one of the simplest yet most powerful algorithms in machine learning.
It is used to model the relationship between an independent variable (X) and a dependent variable (Y) by fitting a linear equation.
If you have ever tried to learn anything about data science and machine learning, one concept you must have come across at least once is "Linear Regression". For those of us who still remember our high school arithmetic, the easiest way to understand it is as "a model that draws the line of best fit for a given set of points."
Almost all data scientists use Scikit-Learn's LinearRegression model, but understanding how it works under the hood is essential for mastering the fundamentals of data science.
The main objective of this project is to walk you through the mathematics behind linear regression and to implement it from scratch using the closed-form solution, also known as the normal equation.
In this notebook, we will:
- Explore the mathematical foundation of linear regression.
- Derive the closed-form solution (also known as the normal equation).
- Implement linear regression from scratch using NumPy and Pandas.
- Evaluate the model's performance by implementing the metrics using only NumPy and Pandas.
- Discuss the drawbacks of the closed-form solution.
The Mathematical Foundation of Linear Regression¶
As you probably already know, the goal of linear regression is to find the best-fitting straight line (or hyperplane in higher dimensions) that describes how the dependent variable changes as the independent variables change.
In simple linear regression, we only have one independent variable, so the aim is to fit a straight line.
The linear regression equation is given by the general equation of a straight line:
$$ y = mx + c $$
Where:
- $ y $ is the dependent variable.
- $ x $ is the independent variable.
- $ m $ is the gradient.
- $ c $ is the y-intercept.
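As a quick refresher, this simple one-variable form can be fitted directly with the classic least-squares formulas for the slope and intercept. A minimal sketch on made-up numbers:

```python
import numpy as np

# Made-up points lying roughly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.0, 9.1])

# Least-squares slope and intercept for a single feature
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

print(m, c)  # slope close to 2, intercept close to 1
```

The rest of this notebook works with the general matrix form, which handles any number of features at once.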
In machine learning, however, it is not so simplified because:
- $ y $ is a vector of the target variables (the variable we want to predict).
- $ X $ is the feature matrix (a matrix of all input features) - we can have one or more independent features.
- $ m $ becomes $ \beta $ (beta), the vector of model coefficients/weights, one per independent feature.
- $ c $ is generalized to an intercept term (introduced below), and an error term $ \epsilon $ (noise) is added.
The linear regression equation therefore becomes:
$$ y = X\beta + \epsilon $$
The error term $ \epsilon $ represents the portion of $ y $ that cannot be explained by the linear relationship with $ X $. It accounts for noise in the data, measurement errors, and unmodeled influences.
To simplify the linear regression model, we introduce an intercept term $ \beta_0 $ (bias), which shifts the regression line so it best fits the data. We rewrite the equation as:
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_d x_d + \epsilon $$
where:
- $ \beta_0 $ is the intercept term.
- $ \beta_1, \beta_2, \dots, \beta_d $ are the coefficients of the independent variables.
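To make the notation concrete, here is a minimal sketch (all numbers made up) showing how prepending a column of ones to $ X $ folds the intercept $ \beta_0 $ into a single matrix product:

```python
import numpy as np

# Made-up example: 3 samples, 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
beta0 = 0.5                    # intercept term
betas = np.array([2.0, -1.0])  # one coefficient per feature

# Prepend a column of ones so the intercept rides along in the product
X_b = np.c_[np.ones(X.shape[0]), X]
b = np.r_[beta0, betas]

y_hat = X_b @ b  # identical to beta0 + X @ betas
print(y_hat)
```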
Deriving The Closed Form Equation¶
The whole idea of a best-fit line is to minimize the errors between the predicted y-value (on the regression line) and the actual y-value for every value of x. This is also how we measure the performance of a linear regression model: we calculate the absolute differences (residuals) between the predicted and actual y-values and take their mean. This is called the Mean Absolute Error (MAE). We also use the Mean Squared Error (MSE), which, as the name implies, is the mean of the squared residuals.
It follows that the lower our MAE or MSE, the better our model performs, because the differences between the predicted and actual values are smaller.
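Both metrics take only a couple of lines with NumPy; a minimal sketch on made-up actual and predicted values:

```python
import numpy as np

# Hypothetical actual and predicted values, purely for illustration
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 2.0, 8.0])

residuals = y_true - y_pred
mae = np.mean(np.abs(residuals))  # Mean Absolute Error
mse = np.mean(residuals ** 2)     # Mean Squared Error

print(mae, mse)  # 0.375 0.3125
```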
In a linear regression model, we aim to fit the line that minimizes this error. To do this, we must find the value (or vector of values) of $ \beta $ that minimizes it.
To estimate $ \beta $, we minimize the Mean Squared Error (MSE) by differentiating with respect to $ \beta $ and equating the derivative to 0:
Recall, from the definition above,
$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
Where:
- $ n $ = number of data points
- $ y_i $ = actual y-values
- $ \hat{y}_i $ = predicted y-values
In the simple one-feature case, $ \hat{y}_i = \beta_0 + \beta_1 x_i $, so
$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2 $$
In matrix form, we define:
- $ y $ as an $ n \times 1 $ vector of observed values.
- $ X $ as an $ n \times (d+1) $ matrix where each row corresponds to a sample, with the first column being ones for the intercept.
- $ \beta $ as a $ (d+1) \times 1 $ vector of coefficients.
Transforming and expressing the MSE equation in matrix form,
$$ \text{MSE} = \frac{1}{n} (y - X\beta)^T (y - X\beta) $$
Expanding the MSE expression,
$$ \text{MSE} = \frac{1}{n} \left[ y^T y - y^T X\beta - \beta^T X^T y + \beta^T X^T X \beta \right] $$
Since $ y^T X\beta $ and $ \beta^T X^T y $ are scalars (each is the transpose of the other), they are equal, so the expression simplifies to:
$$ \text{MSE} = \frac{1}{n} \left[ y^T y - 2y^T X\beta + \beta^T X^T X \beta \right] $$
Taking the derivative with respect to $ \beta $,
$$ \frac{\partial \text{MSE}}{\partial \beta} = \frac{1}{n} \left[ -2X^T y + 2X^T X \beta \right] $$
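This differentiation step relies on two standard matrix-calculus identities, stated here for reference (the second uses the fact that $ X^T X $ is symmetric):

$$ \frac{\partial}{\partial \beta} \left( y^T X \beta \right) = X^T y, \qquad \frac{\partial}{\partial \beta} \left( \beta^T X^T X \beta \right) = 2 X^T X \beta $$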
Setting the derivative to zero to find the minimum,
$$ -2X^T y + 2X^T X \beta = 0 $$
Solving for $ \beta $,
$$ X^T X \beta = X^T y $$
$$ \beta = (X^T X)^{-1} X^T y $$
This is known as the Normal Equation.
By minimizing the MSE, we derived the closed-form solution for the linear regression model. This solution allows us to find the optimal coefficients $ \beta $ that best fit the data.
This is the equation we will use to train the model in this project instead of the Scikit-Learn LinearRegression model.
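Before using it on real data, it is worth sanity-checking the normal equation on synthetic data with known coefficients (a minimal sketch; the coefficients and noise level below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: y = 1.5 + 3*x1 - 2*x2 + small noise
n = 200
X = rng.normal(size=(n, 2))
true_beta = np.array([1.5, 3.0, -2.0])
X_b = np.c_[np.ones(n), X]  # add intercept column
y = X_b @ true_beta + rng.normal(scale=0.01, size=n)

# Normal equation: beta = (X^T X)^-1 X^T y
beta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(beta_hat)  # close to [1.5, 3.0, -2.0]
```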
Implementing Linear Regression from Closed Form Equation¶
We will now implement the Linear Regression on a dataset using the closed form equation we derived above.
Let's start by importing all the libraries we will use for this project
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Load the Dataset¶
For this project, we will use the Energy Consumption Dataset from kaggle.
This dataset is all about predicting how much energy buildings use based on different features and environmental factors. It includes information on various types of buildings, their square footage, the number of people inside, the appliances they use, the average temperature, and even the day of the week. The aim is to create a linear regression model that can estimate energy consumption using these details.
Data Dictionary:
- Building Type - The type of building (residential, commercial, etc.)
- Square Footage - The total size of the building in square feet
- Number of Occupants - The number of people occupying the building
- Appliances Used - The number of appliances used in the building
- Average Temperature - The average temperature of the building or climate area (in Celsius)
- Day of Week - Indicates whether the data point corresponds to a weekday or weekend
- Energy Consumption - The energy consumption of the building in kWh (kilowatt-hours). This is the target variable
The dataset is originally split into two files (train and test), so we will load both files and merge them.
# For downloading datasets from Kaggle
import kaggle
# Download the dataset
kaggle.api.dataset_download_files("govindaramsriram/energy-consumption-dataset-linear-regression", path="./datasets/", unzip=True)
Dataset URL: https://www.kaggle.com/datasets/govindaramsriram/energy-consumption-dataset-linear-regression
# Load train and test datasets
df_train = pd.read_csv('./datasets/train_energy_data.csv')
df_test = pd.read_csv('./datasets/test_energy_data.csv')
# merge both datasets
df = pd.concat([df_train, df_test], axis=0)
# Preview the data
df.head()
| | Building Type | Square Footage | Number of Occupants | Appliances Used | Average Temperature | Day of Week | Energy Consumption |
|---|---|---|---|---|---|---|---|
| 0 | Residential | 7063 | 76 | 10 | 29.84 | Weekday | 2713.95 |
| 1 | Commercial | 44372 | 66 | 45 | 16.72 | Weekday | 5744.99 |
| 2 | Industrial | 19255 | 37 | 17 | 14.30 | Weekend | 4101.24 |
| 3 | Residential | 13265 | 14 | 41 | 32.82 | Weekday | 3009.14 |
| 4 | Commercial | 13375 | 26 | 18 | 11.92 | Weekday | 3279.17 |
Explore the dataset¶
Checking for null values ...
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Building Type        1100 non-null   object
 1   Square Footage       1100 non-null   int64
 2   Number of Occupants  1100 non-null   int64
 3   Appliances Used      1100 non-null   int64
 4   Average Temperature  1100 non-null   float64
 5   Day of Week          1100 non-null   object
 6   Energy Consumption   1100 non-null   float64
dtypes: float64(2), int64(3), object(2)
memory usage: 68.8+ KB
✓ The dataset contains no null values.
Inspecting the summary statistics of the numerical columns ...
df.describe()
| | Square Footage | Number of Occupants | Appliances Used | Average Temperature | Energy Consumption |
|---|---|---|---|---|---|
| count | 1100.000000 | 1100.000000 | 1100.000000 | 1100.000000 | 1100.000000 |
| mean | 25500.527273 | 48.268182 | 25.730000 | 22.559745 | 4168.191273 |
| std | 14236.955632 | 29.127624 | 14.116209 | 7.122357 | 924.278723 |
| min | 560.000000 | 1.000000 | 1.000000 | 10.050000 | 1683.950000 |
| 25% | 13203.750000 | 22.000000 | 13.000000 | 16.365000 | 3510.460000 |
| 50% | 25785.500000 | 47.000000 | 26.000000 | 22.810000 | 4189.690000 |
| 75% | 37536.750000 | 73.000000 | 38.000000 | 28.760000 | 4859.510000 |
| max | 49997.000000 | 99.000000 | 49.000000 | 34.990000 | 6530.600000 |
✓ The quantiles seem to be evenly spaced around the median, which is close to the mean for all features. This suggests there are no significant outliers.
We can therefore use the standard scaling technique to scale the values.
Inspecting the unique values in each categorical column ...
# Check unique values in each categorical column
for col in df.select_dtypes(include='object').columns:
print(f"{col}: {df[col].unique()}")
Building Type: ['Residential' 'Commercial' 'Industrial']
Day of Week: ['Weekday' 'Weekend']
Modify column names for uniformity and to avoid errors (replace spaces with underscores and unify case)
new_cols = ['building_type', 'square_footage', 'number_of_occupants', 'appliances_used', 'average_temperature', 'day_of_week', 'energy_consumption']
df.columns = new_cols
df.head()
| | building_type | square_footage | number_of_occupants | appliances_used | average_temperature | day_of_week | energy_consumption |
|---|---|---|---|---|---|---|---|
| 0 | Residential | 7063 | 76 | 10 | 29.84 | Weekday | 2713.95 |
| 1 | Commercial | 44372 | 66 | 45 | 16.72 | Weekday | 5744.99 |
| 2 | Industrial | 19255 | 37 | 17 | 14.30 | Weekend | 4101.24 |
| 3 | Residential | 13265 | 14 | 41 | 32.82 | Weekday | 3009.14 |
| 4 | Commercial | 13375 | 26 | 18 | 11.92 | Weekday | 3279.17 |
Data Pre-Processing¶
Since the data is already clean, getting this dataset ready for the model just requires making sure it is in the right format.
We will perform the following pre-processing operations:
- Scaling: to ensure fairness so that the model does not disproportionately assign weight to some features just because they generally have higher values
- Encoding: to convert categorical values to numerical values, because linear regression using the closed form will only accept numerical values (because of the matrix operations)
- Splitting: to divide the data into training and testing sets
Note: All these operations will be performed using Numpy and Pandas (No Scikit-Learn)
Encoding: Using pd.get_dummies()¶
# Encoding for `building_type` column
df = pd.get_dummies(df, columns=['building_type'], drop_first=True)
# Encoding for `day_of_week` column
df = pd.get_dummies(df, columns=['day_of_week'], drop_first=True)
# convert to integer
df[['building_type_Industrial', 'building_type_Residential', 'day_of_week_Weekend']] = df[['building_type_Industrial', 'building_type_Residential', 'day_of_week_Weekend']].astype(int)
df.head()
| | square_footage | number_of_occupants | appliances_used | average_temperature | energy_consumption | building_type_Industrial | building_type_Residential | day_of_week_Weekend |
|---|---|---|---|---|---|---|---|---|
| 0 | 7063 | 76 | 10 | 29.84 | 2713.95 | 0 | 1 | 0 |
| 1 | 44372 | 66 | 45 | 16.72 | 5744.99 | 0 | 0 | 0 |
| 2 | 19255 | 37 | 17 | 14.30 | 4101.24 | 1 | 0 | 1 |
| 3 | 13265 | 14 | 41 | 32.82 | 3009.14 | 0 | 1 | 0 |
| 4 | 13375 | 26 | 18 | 11.92 | 3279.17 | 0 | 0 | 0 |
Scaling: Using Standard Scaling Method¶
X = df.drop(columns=['energy_consumption'])
y = df['energy_consumption']
# Perform scaling for all numerical columns
num_cols = X.select_dtypes(include='number').columns
for col in num_cols:
X[col] = (X[col] - X[col].mean()) / X[col].std()
# Preview the data
X.head()
| | square_footage | number_of_occupants | appliances_used | average_temperature | building_type_Industrial | building_type_Residential | day_of_week_Weekend |
|---|---|---|---|---|---|---|---|
| 0 | -1.295047 | 0.952080 | -1.114322 | 1.022169 | -0.684251 | 1.356725 | -0.997730 |
| 1 | 1.325527 | 0.608763 | 1.365097 | -0.819918 | -0.684251 | -0.736399 | -0.997730 |
| 2 | -0.438684 | -0.386856 | -0.618438 | -1.159693 | 1.460124 | -0.736399 | 1.001364 |
| 3 | -0.859420 | -1.176484 | 1.081735 | 1.440570 | -0.684251 | 1.356725 | -0.997730 |
| 4 | -0.851694 | -0.764504 | -0.547597 | -1.493852 | -0.684251 | -0.736399 | -0.997730 |
Implementing the Model¶
Now that we have preprocessed the data, we will implement the Linear Regression model using the closed-form equation
# Split the data into train and test sets
# Get index for splitting the data 80% train
train_split_index = int(0.8 * len(df))
# Split data into train and test
train_data_X = X.iloc[:train_split_index]
test_data_X = X.iloc[train_split_index:]
train_data_y = y.iloc[:train_split_index]
test_data_y = y.iloc[train_split_index:]
# Get X and y arrays for the train and test sets
X_train = train_data_X[X.columns]
y_train = train_data_y.values
X_test = test_data_X[X.columns]
y_test = test_data_y.values
Implement the closed-form solution to linear regression: $ \beta = (X^T X)^{-1} X^T y $
- a. Where $ \beta $ is the vector of weights (including the bias $ \beta_0 $), $ X $ is the feature matrix with a column of ones added to represent the intercept, and $ y $ is the vector of target values.
- b. We use this implementation to compute the coefficients of the linear regression model on the training split.
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((880, 7), (880,), (220, 7), (220,))
# Add a column of ones to X_train for the bias term
X_train = np.c_[np.ones(X_train.shape[0]), X_train]
X_test = np.c_[np.ones(X_test.shape[0]), X_test]
# Compute different parts of closed form Solution
XTX_inv = np.linalg.inv(X_train.T @ X_train)
XTy = X_train.T @ y_train
# Compute beta (coefficients)
coefficients = XTX_inv @ XTy
Printing the learned coefficients (weights) of the model.
features_intercept_list = ['intercept'] + list(df.columns.drop('energy_consumption'))
# Print the computed coefficients
for feature, weight in zip(features_intercept_list, coefficients):
print(f'{feature} : {weight}')
intercept : 4168.191390734602
square_footage : 711.8474075727273
number_of_occupants : 291.276209188346
appliances_used : 282.324251173425
average_temperature : -35.61210470642426
building_type_Industrial : 233.1683008733712
building_type_Residential : -238.8776550204007
day_of_week_Weekend : -25.011398286890056
Evaluating the model's performance¶
Utilize the learned coefficients to generate predictions on the test dataset split, where:¶
$$ \hat{y} = X\beta $$
# Compute y_pred using the equation
y_pred = X_test @ coefficients
# Preview y_pred
y_pred[:10]
array([2924.84992404, 2495.95032149, 4873.00038804, 5088.20019472,
5518.15034471, 3542.20047749, 3372.40042206, 4895.60064123,
3771.10044665, 1915.34997022])
We will apply the following evaluation metrics using NumPy functions only:
- a. Mean Absolute Error (MAE)
- b. Mean Squared Error (MSE)
Mean Absolute Error (MAE)
# Calculate MAE
mae = np.mean(np.abs(y_test - y_pred))
mae
np.float64(0.011712122589022494)
Mean Squared Error (MSE)
# Calculate MSE
mse = np.mean((y_test - y_pred)**2)
mse
np.float64(0.00018989334638738073)
These low values indicate that the model is making very accurate predictions, with average prediction errors close to zero. The small MSE also means that large errors are rare.
In short: The model is performing very well. 🎯👏
Now, we'll calculate R² (coefficient of determination).
R² (Coefficient of Determination)
# Calculate R²
ss_res = np.sum((y_test - y_pred)**2) # Residual Sum of Squares
ss_tot = np.sum((y_test - np.mean(y_test))**2) # Total Sum of Squares
r2_score = 1 - (ss_res / ss_tot)
r2_score
np.float64(0.9999999997701359)
The R² score is ~0.999999997, which is extremely close to 1.
This means that the model explains almost 100% of the variance in the target variable.
The predictions are highly accurate, with almost negligible error.
It's an excellent fit, suggesting that the closed-form linear regression model performed exceptionally well.
Drawbacks of the Closed-Form Equation¶
While the closed-form solution (also known as the normal equation) provides an exact answer to the linear regression problem, it comes with several important drawbacks that make it less ideal in many real-world scenarios:
1. Poor Scalability with Large Datasets:
The closed-form approach requires forming $ X^T X $ (cost O(nd²) for n samples and d features) and inverting it (cost O(d³)). This becomes computationally expensive and inefficient as the number of features grows, making it unsuitable for high-dimensional or large-scale data.
2. Inversion Issues with Singular Matrices:
If $ X^T X $ is singular or nearly singular—often caused by multicollinearity (highly correlated features) or having more features than samples—the matrix cannot be inverted, causing the closed-form solution to break down.
3. No Built-in Regularization:
Unlike techniques such as Ridge or Lasso regression, the closed-form method does not include any form of regularization. As a result, it may overfit the training data, leading to poor performance on unseen data.
4. Memory Intensive:
The process of calculating and storing large matrices, especially for datasets with many features, can be very memory-demanding. On typical machines, this could lead to slow performance or memory errors.
5. Numerical Instability:
Computing the inverse of a matrix can introduce floating-point precision errors, particularly if the matrix is ill-conditioned. This can result in unreliable or unstable predictions.
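In practice, the inversion and stability issues above are usually side-stepped by not forming the inverse at all. A minimal sketch (on made-up data) of two common alternatives: `np.linalg.lstsq`, which solves the least-squares problem via an SVD, and the ridge-regularized normal equation $ \beta = (X^T X + \lambda I)^{-1} X^T y $, solved with `np.linalg.solve` rather than an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 100 samples, intercept column plus 3 features
X = np.c_[np.ones(100), rng.normal(size=(100, 3))]
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 1) SVD-based least squares: stable even when X^T X is singular
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) Ridge closed form: the lambda*I term keeps the matrix invertible
#    and shrinks the coefficients slightly
lam = 1.0
I = np.eye(X.shape[1])
I[0, 0] = 0.0  # by convention, do not penalize the intercept
beta_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

print(beta_lstsq)
print(beta_ridge)
```

Both approaches avoid `np.linalg.inv` entirely, which is the usual recommendation whenever $ X^T X $ may be ill-conditioned.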