NumPy provides powerful tools for performing multiple linear regression, a statistical method used to model the relationship between a dependent variable and two or more independent variables. This guide will explain the key concepts of multiple regression and demonstrate how to implement it efficiently using NumPy.
Understanding Multiple Regression
Multiple regression seeks to find a linear equation that best describes the relationship between a dependent variable (Y) and multiple independent variables (X1, X2, …, Xn). The general form of the multiple regression equation is:
Y = β0 + β1X1 + β2X2 + … + βnXn + ε
Where:
- Y: The dependent variable (the variable you want to predict).
- X1, X2, …, Xn: The independent variables (predictor variables or features).
- β0: The intercept (the value of Y when all X variables are zero).
- β1, β2, …, βn: The coefficients (weights) of the independent variables, each representing the change in Y for a one-unit change in the corresponding X variable, holding the other variables constant.
- ε: The error term (the difference between the observed and predicted values of Y).
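In code, the model is just an intercept plus a weighted sum of the features. A minimal sketch (the `predict` function and the coefficient values here are illustrative, not part of any library):

```python
import numpy as np

def predict(x, beta):
    """Evaluate Y = beta_0 + beta_1*X1 + ... + beta_n*Xn for one observation.

    x    : feature vector [X1, ..., Xn]
    beta : coefficient vector [beta_0, beta_1, ..., beta_n]
    """
    return beta[0] + np.dot(beta[1:], x)

# With beta_1 = 2.0, a one-unit increase in X1 raises the prediction by 2.0
print(predict(np.array([3.0, 4.0]), np.array([1.0, 2.0, 0.5])))  # 1 + 6 + 2 = 9.0
```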
Performing Multiple Regression with NumPy
Here’s a step-by-step guide to performing multiple regression using NumPy’s linear algebra capabilities:
1. Import NumPy
import numpy as np
2. Prepare Your Data
Organize your data into NumPy arrays: a one-dimensional array for the dependent variable (Y) and a two-dimensional matrix for the independent variables (X), with one row per observation and one column per feature.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]]) # Independent variables (features) in columns
Y = np.array([3, 5, 7, 8, 10]) # Dependent variable
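Before fitting, it is worth sanity-checking the array shapes; a mismatch between the number of rows in X and the length of Y is a common source of cryptic linear algebra errors. A quick check, using illustrative data in the same layout:

```python
import numpy as np

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]])
Y = np.array([3, 5, 7, 8, 10])

# X is (n_samples, n_features); Y is (n_samples,)
print(X.shape, Y.shape)  # (5, 2) (5,)
assert X.shape[0] == Y.shape[0], "each row of X needs a matching Y value"
```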
3. Add the Intercept Term
To include the intercept (β0) in the model, add a column of ones to the X matrix:
X = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
print(X)
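As an aside, np.column_stack builds the same design matrix a little more tersely than np.concatenate, since it accepts the 1-D array of ones directly:

```python
import numpy as np

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]], dtype=float)

# Prepend a column of ones for the intercept term;
# equivalent to concatenating np.ones((n, 1)) along axis=1
X_design = np.column_stack((np.ones(X.shape[0]), X))
print(X_design)
```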
4. Calculate the Coefficients (Using the Normal Equation)
The coefficients (β) can be calculated using the normal equation:
β = (XᵀX)⁻¹XᵀY
# Solve the normal equations (X.T @ X) @ beta = X.T @ Y directly;
# np.linalg.solve is faster and more numerically stable than forming the inverse
coefficients = np.linalg.solve(X.T @ X, X.T @ Y)
beta_0 = coefficients[0]
beta_1 = coefficients[1]
beta_2 = coefficients[2]
print(f"Intercept (beta_0): {beta_0}")
print(f"Coefficient for X1 (beta_1): {beta_1}")
print(f"Coefficient for X2 (beta_2): {beta_2}")
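When the features are nearly collinear, forming (XᵀX)⁻¹ explicitly can be numerically fragile. np.linalg.lstsq solves the same least-squares problem via the SVD and handles even rank-deficient matrices gracefully; a sketch with the same data:

```python
import numpy as np

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]], dtype=float)
Y = np.array([3, 5, 7, 8, 10], dtype=float)
X = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)

# lstsq minimizes ||X @ beta - Y||^2 via SVD;
# rcond=None selects the current default cutoff for small singular values
coefficients, residuals, rank, singular_values = np.linalg.lstsq(X, Y, rcond=None)
print(coefficients)  # [beta_0, beta_1, beta_2]
```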
5. Make Predictions
Once you have the coefficients, you can use them to make predictions:
Y_pred = X @ coefficients # One matrix-vector product computes all predictions at once
print(f"Predictions: {Y_pred}")
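To judge how well the model fits, the coefficient of determination (R²) can be computed directly from the residual and total sums of squares; a sketch reusing the same data and normal-equation solution:

```python
import numpy as np

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]], dtype=float)
Y = np.array([3, 5, 7, 8, 10], dtype=float)
X = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
coefficients = np.linalg.solve(X.T @ X, X.T @ Y)
Y_pred = X @ coefficients

# R^2 = 1 - SS_res / SS_tot: the fraction of variance in Y explained by the model
ss_res = np.sum((Y - Y_pred) ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R^2: {r_squared:.4f}")
```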
6. Using scikit-learn (Alternative)
While NumPy can be used for multiple regression, the scikit-learn library provides a more convenient and feature-rich approach:
from sklearn.linear_model import LinearRegression
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]])
y = np.array([3, 5, 7, 8, 10])
model = LinearRegression().fit(X, y)
r_sq = model.score(X, y)
print(f"Coefficient of determination (R^2): {r_sq:.2f}") # R^2 score
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
y_pred = model.predict(X)
print(f"predicted response:\n{y_pred}")