Multiple Regression with NumPy

NumPy's array and linear algebra routines are enough to fit a multiple linear regression, a statistical method that models the relationship between a dependent variable and two or more independent variables. This guide explains the key concepts of multiple regression and shows how to implement it with NumPy.

Understanding Multiple Regression

Multiple regression seeks to find a linear equation that best describes the relationship between a dependent variable (Y) and multiple independent variables (X₁, X₂, …, X_n). The general form of the multiple regression equation is:

Y = β₀ + β₁X₁ + β₂X₂ + … + β_nX_n + ε

Where:

Y: The dependent variable (the variable you want to predict).
X₁, X₂, …, X_n: The independent variables (predictor variables or features).
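β₀: The intercept (the expected value of Y when all X variables are zero).
β₁, β₂, …, β_n: The coefficients (weights) of the independent variables. Each one represents the expected change in Y for a one-unit change in the corresponding X variable, with the other X variables held constant.
ε: The error term (the part of Y that the linear combination of the X variables does not explain). Its sample estimate, the residual, is the difference between the observed and predicted values of Y.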
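
To make the equation concrete, here is a minimal sketch that evaluates it for a single observation with two independent variables. The coefficient values and the observation are hypothetical, chosen only for illustration; they are not fitted from any data in this guide.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
beta_0 = 1.0                    # intercept
betas = np.array([2.0, 0.5])    # beta_1, beta_2

# One hypothetical observation with X1 = 3 and X2 = 4
x = np.array([3.0, 4.0])

# Predicted Y = beta_0 + beta_1*X1 + beta_2*X2
# (the error term is unknown for a new observation, so it is left out)
y_hat = beta_0 + betas @ x
print(y_hat)  # 9.0
```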

Performing Multiple Regression with NumPy

Here’s a step-by-step guide to performing multiple regression using NumPy’s linear algebra capabilities:

1. Import NumPy

import numpy as np

2. Prepare Your Data

Organize your data into NumPy arrays: a 1-D array for the dependent variable (Y) and a 2-D array for the independent variables (X), with one row per observation and one column per variable.

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]) # Independent variables (features) in columns
Y = np.array([3, 5, 7, 8, 10]) # Dependent variable
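
If the independent variables start out as separate 1-D arrays, np.column_stack is one way to arrange them as columns. The sketch below reuses the same values as X above; the names x1, x2, and X_stacked are illustrative only.

```python
import numpy as np

# The two features held as separate 1-D arrays (same values as the columns of X)
x1 = np.array([1, 2, 3, 4, 5])
x2 = np.array([2, 3, 4, 5, 6])

# column_stack places each 1-D array in its own column:
# one row per observation, one column per variable
X_stacked = np.column_stack((x1, x2))
print(X_stacked.shape)  # (5, 2)
```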

3. Add the Intercept Term

To include the intercept (β₀) in the model, add a column of ones to the X matrix:

X = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
print(X)

4. Calculate the Coefficients (Using the Normal Equation)

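The coefficients (β) can be calculated using the normal equation:

β = (XᵀX)⁻¹XᵀY

This formula requires XᵀX to be invertible, which fails when one column of X is an exact linear combination of the others. That is the case for the sample data in step 2: the second feature always equals the first plus 1, so np.linalg.inv(X.T @ X) would raise a singular-matrix error or return meaningless values. The pseudo-inverse, np.linalg.pinv, keeps the same formula and still returns a least-squares solution. Because the features are collinear, the individual coefficients are not unique, although the predictions are.

coefficients = np.linalg.pinv(X.T @ X) @ X.T @ Y  # pinv tolerates a singular X^T X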
beta_0 = coefficients[0]
beta_1 = coefficients[1]
beta_2 = coefficients[2]

print(f"Intercept (beta_0): {beta_0}")
print(f"Coefficient for X1 (beta_1): {beta_1}")
print(f"Coefficient for X2 (beta_2): {beta_2}")
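
Before trusting individual coefficients, it can help to check whether the design matrix has full column rank. The following is a minimal sketch, not the only way to do this: it rebuilds the sample data under the illustrative names X_raw and X_design, checks the rank with np.linalg.matrix_rank, and solves the least-squares problem with np.linalg.lstsq, which does not require forming or inverting XᵀX.

```python
import numpy as np

# Sample data from step 2, rebuilt here so the sketch is self-contained
X_raw = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]], dtype=float)
Y = np.array([3, 5, 7, 8, 10], dtype=float)

# Design matrix: a column of ones (intercept) followed by the features
X_design = np.column_stack((np.ones(len(X_raw)), X_raw))

# A rank below the number of columns means X^T X is singular
rank = np.linalg.matrix_rank(X_design)
print(f"rank = {rank}, columns = {X_design.shape[1]}")  # rank = 2, columns = 3

# lstsq solves the least-squares problem directly and tolerates rank deficiency
coef_lstsq, _, rank_lstsq, _ = np.linalg.lstsq(X_design, Y, rcond=None)
fitted = X_design @ coef_lstsq
print(fitted)
```

Here the rank is 2 while the matrix has 3 columns, which confirms that the features are collinear. The fitted values are still well defined even though the coefficients are not unique.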

5. Make Predictions

Once you have the coefficients, you can use them to make predictions:

Y_pred = X @ coefficients  # One matrix-vector product computes every prediction at once
print(f"Predictions: {Y_pred}")
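
The quality of the fit can also be measured with NumPy alone. The sketch below computes the residuals and the coefficient of determination, R² = 1 − SS_res / SS_tot, for the sample data. The variable names are illustrative, and the fit uses np.linalg.lstsq as one reasonable choice so that the block runs on its own.

```python
import numpy as np

# Sample data from step 2, rebuilt here so the sketch is self-contained
X_raw = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]], dtype=float)
Y = np.array([3, 5, 7, 8, 10], dtype=float)
X_design = np.column_stack((np.ones(len(X_raw)), X_raw))

# Least-squares fit and predictions
coef = np.linalg.lstsq(X_design, Y, rcond=None)[0]
Y_hat = X_design @ coef

# Residuals: observed minus predicted
residuals = Y - Y_hat

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R^2: {r_squared:.4f}")  # R^2: 0.9897
```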

6. Using scikit-learn (Alternative)

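While NumPy can be used for multiple regression, the scikit-learn library provides a more convenient and feature-rich approach. LinearRegression fits the intercept itself by default, so pass the original feature matrix without the column of ones: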

from sklearn.linear_model import LinearRegression
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([3, 5, 7, 8, 10])
model = LinearRegression().fit(X, y)
r_sq = model.score(X, y)
print(f"Coefficient of determination (R^2): {r_sq:.2f}") # R^2 score

print(f"intercept: {model.intercept_}")
print(f"coefficients: {model.coef_}")

y_pred = model.predict(X)
print(f"predicted response:\n{y_pred}")