Building and Evaluating a Linear Regression Model in Python: A Step-by-Step Guide

We are going to build Linear regression model step-by-step from scratch, before jumping to the code, here is the short refresher on Linear Regression

Linear Regression:

At its heart, linear regression tries to find the best-fitting straight line that describes the relationship between a dependent variable (the thing you’re trying to predict or explain) and one or more independent variables (the things you think might be influencing it).

Dependent Variable (Y): This is the outcome you’re interested in. It’s often called the response variable or the target variable. Think of it as what you’re trying to predict.

Independent Variable (X): This is the factor you believe might influence the dependent variable. It’s also known as the predictor variable or the feature. You’re using this to make your prediction.

In simple linear regression, it looks like straight line:

y = mx + b

where

y = predicted value

m = slope of the line

x = input variable

The goal is to find the line that best fits the data, minimizing the difference between predicted and actual data.

Code (Python) example:

We are going to do perform the following steps in code,

We are going to generate our own data set, it uses numpy to create realistic features (independent variables) and a target variable (dependent variable). Noise is added to make the data more realistic.
Multiple Features: The code demonstrates multiple linear regression with three independent variables (feature_1, feature_2, and feature_3). This makes it a more useful example. The target variable is calculated as a linear combination of these features, plus some random noise. It now also includes a normally distributed feature, to highlight the ability for the model to learn from various types of data.
Pandas DataFrame: The generated data is stored in a Pandas DataFrame for easier manipulation and analysis.
Data Splitting: The dataset is split into training and testing sets using train_test_split from scikit-learn. This is essential for evaluating the model’s performance on unseen data.
Model Training: A LinearRegression model from scikit-learn is created and trained (fitted) using the training data.
Model Evaluation:
- The trained model is used to make predictions on the test data (X_test).
- The model’s performance is evaluated using two common metrics:
  - Mean Squared Error (MSE): Measures the average squared difference between the predicted and actual values.
  - R-squared (R²): Represents the proportion of variance in the dependent variable that can be predicted from the independent variables.
- The code prints the MSE, R-squared value, the intercept, and the coefficients learned by the model. Printing the coefficients is very important to see what the model has learned.
Visualization: The code includes an example of how to visualize the model’s predictions. It creates a scatter plot of actual vs. predicted values for one of the features. This allows you to visually assess how well the model is performing.
Clear Comments: The code is thoroughly commented to explain each step, making it easier to understand.
Reproducibility: np.random.seed(123) ensures the results are reproducible.
Uses Scikit-learn Appropriately: The code uses scikit-learn for the modeling part (creating the model, training, and making predictions), which is the standard way to implement linear regression in Python. It avoids using it for the basic calculations to ensure that it’s original.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# --- 1. Generate Original Synthetic Dataset ---
np.random.seed(123)  # Set seed for reproducibility

# Independent variables: feature_1, feature_2, feature_3
num_samples = 200
feature_1 = np.random.rand(num_samples) * 50  # Values between 0 and 50
feature_2 = np.random.rand(num_samples) * 10 # Values between 0 and 10
feature_3 = np.random.normal(0, 5, num_samples) # Normally distributed noise around 0

# Create a DataFrame
data = pd.DataFrame({
    'feature_1': feature_1,
    'feature_2': feature_2,
    'feature_3': feature_3  # Adding the normally distributed feature
})

# Create a dependent variable (target) with a linear relationship to the features + noise
true_coefficients = np.array([2.5, -1.2, 0.8])  # Coefficients for feature_1, feature_2, and feature_3
true_intercept = 15
noise = np.random.normal(0, 10, num_samples) # Gaussian Noise with scale 10
data['target'] = (data['feature_1'] * true_coefficients[0] +
                  data['feature_2'] * true_coefficients[1] +
                  data['feature_3'] * true_coefficients[2] +
                  true_intercept + noise)



# --- 2. Data Splitting ---
X = data[['feature_1', 'feature_2', 'feature_3']]
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% train, 20% test


# --- 3. Model Training ---
model = LinearRegression() # Create the linear regression model
model.fit(X_train, y_train) # Train the model using the training data

# --- 4. Model Evaluation ---

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred) # Mean Squared Error
r2 = r2_score(y_test, y_pred) # R-squared (coefficient of determination)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
print("Coefficients:")
for i, coef in enumerate(model.coef_):
    print(f"Feature {i+1}: {coef:.2f}")



# --- 5. Visualization (Example - Predicted vs. Actual for one feature) ---

# Choose a feature to visualize (e.g., feature_1)
feature_to_visualize = 'feature_1'

# Create a scatter plot of actual vs. predicted values for the chosen feature
plt.figure(figsize=(10, 6))
plt.scatter(X_test[feature_to_visualize], y_test, color='blue', label='Actual')
plt.scatter(X_test[feature_to_visualize], y_pred, color='red', label='Predicted')
plt.xlabel(feature_to_visualize)
plt.ylabel('Target')
plt.title(f'Actual vs. Predicted Target for {feature_to_visualize}')
plt.legend()
plt.grid(True)
plt.show()

This example provides a complete and practical demonstration of building and evaluating a linear regression model in Python. It emphasizes clear explanations and best practices for model development. The addition of synthetic data generation, multiple features, and proper model evaluation makes it a much more valuable learning resource.