In [ ]:
# Install necessary libraries (if not already present in Colab environment)
!pip install numpy pandas matplotlib seaborn scikit-learn
In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn for Linear Regression models, preprocessing, and metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer # For combining transformers
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Set a consistent plotting style
sns.set_theme(style="whitegrid")

Part 1: Understanding Linear Regression - The Basics

Linear Regression is a fundamental statistical model and a workhorse in Machine Learning for predicting a continuous target variable. The core idea is to model the relationship between a dependent variable (the target) and one or more independent variables (features) by fitting a linear equation to the observed data.

1.1 The Goal: To find the "best-fit" straight line (or hyperplane in higher dimensions) that minimizes the overall discrepancy between the observed data points and the line, measured as the sum of squared vertical distances.

1.2 Simple Linear Regression Equation: When you have one independent variable ($x$) and one dependent variable ($y$), the relationship is modeled by a straight line: $$y = \beta_0 + \beta_1 x + \epsilon$$ Where:

  • $y$: The dependent variable (what we want to predict).
  • $x$: The independent variable (feature).
  • $\beta_0$: The y-intercept (the value of $y$ when $x = 0$).
  • $\beta_1$: The slope of the line (how much $y$ changes for a one-unit change in $x$).
  • $\epsilon$: The error term or residual (the difference between the actual value and the predicted value, representing noise or unmodeled factors).

1.3 The Least Squares Method: How do we find the "best-fit" line? Linear Regression typically uses the Ordinary Least Squares (OLS) method. OLS aims to minimize the Sum of Squared Residuals (SSR) or Mean Squared Error (MSE). A residual is the vertical distance between an actual data point and the regression line (the predicted value). By minimizing the sum of these squared distances, we find the line that best approximates the overall trend of the data.

Intuition: Imagine plotting data points on a graph. The goal of linear regression is to draw a line through these points such that the total "vertical distance" from each point to the line is as small as possible, where "distance" is measured as squared errors to penalize larger errors more heavily.
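
To make the least-squares idea concrete, here is a minimal, self-contained sketch (the tiny dataset and the two candidate lines below are purely illustrative): it computes the Sum of Squared Residuals for two candidate lines, and OLS simply prefers whichever line has the smaller SSR.

In [ ]:
import numpy as np

# Tiny illustrative dataset (values chosen by hand for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.9, 5.1, 6.8, 9.2, 10.9])

def ssr(intercept, slope):
    """Sum of squared residuals for the candidate line y = intercept + slope * x."""
    residuals = y - (intercept + slope * x)
    return np.sum(residuals ** 2)

# OLS picks the line with the smallest SSR over all possible intercepts and slopes
print(f"SSR for y = 1.0 + 2.0x: {ssr(1.0, 2.0):.3f}")
print(f"SSR for y = 0.0 + 2.5x: {ssr(0.0, 2.5):.3f}")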

In [3]:
# Let's create a simple conceptual dataset to visualize
np.random.seed(42) # for reproducibility
X_concept = np.random.rand(50) * 10 # 50 random values between 0 and 10
# True relationship: y = 2 * x + 5 + some noise
y_concept = 2 * X_concept + 5 + np.random.randn(50) * 2

plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_concept, y=y_concept, color='blue', alpha=0.7)
plt.title('Conceptual Data for Simple Linear Regression')
plt.xlabel('Independent Variable (X)')
plt.ylabel('Dependent Variable (Y)')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
print("\nVisualizing the conceptual goal: finding a line that best fits these points.")
[Figure: 'Conceptual Data for Simple Linear Regression' (scatter plot)]
Visualizing the conceptual goal: finding a line that best fits these points.

Discussion Point:

  • In the simple linear regression equation, what does a positive $\beta_1$ (slope) indicate about the relationship between $x$ and $y$? What about a negative $\beta_1$?
  • Why do we minimize the squared errors instead of just the absolute errors? (Hint: consider positive and negative errors, and penalizing large errors.)

Part 2: Simple Linear Regression - Manual Calculation (Conceptual)

To truly understand how linear regression works, let's manually calculate the slope and intercept for a very small dataset. This demonstrates the core idea behind the Least Squares Method.

The formulas for the OLS coefficients ($\beta_0$ and $\beta_1$) are:

$$\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$

Where:

  • $x_i, y_i$: Individual data points.
  • $\bar{x}, \bar{y}$: Means of $x$ and $y$ respectively.
  • $n$: Number of data points.

Tasks:

  • Create a small, synthetic dataset.
  • Manually calculate $\bar{x}$ and $\bar{y}$.
  • Manually calculate $\beta_1$ and $\beta_0$.
  • Plot the original data points and the manually calculated regression line.
  • Calculate and visualize residuals.
In [4]:
# Small synthetic dataset
x_manual = np.array([1, 2, 3, 4, 5])
y_manual = np.array([3, 5, 4, 7, 6]) # Slightly noisy linear relationship

print(f"X data: {x_manual}")
print(f"Y data: {y_manual}")

# 2.1 Calculate means
x_bar = np.mean(x_manual)
y_bar = np.mean(y_manual)
print(f"\nMean of X (x_bar): {x_bar}")
print(f"Mean of Y (y_bar): {y_bar}")

# 2.2 Calculate Beta_1 (slope)
numerator = np.sum((x_manual - x_bar) * (y_manual - y_bar))
denominator = np.sum((x_manual - x_bar)**2)
beta_1_manual = numerator / denominator
print(f"\nCalculated Slope (Beta_1): {beta_1_manual:.4f}")

# 2.3 Calculate Beta_0 (intercept)
beta_0_manual = y_bar - beta_1_manual * x_bar
print(f"Calculated Intercept (Beta_0): {beta_0_manual:.4f}")

print(f"\nManually calculated Regression Line: y = {beta_0_manual:.4f} + {beta_1_manual:.4f} * x")

# 2.4 Plot original data and the manually calculated regression line
plt.figure(figsize=(8, 6))
sns.scatterplot(x=x_manual, y=y_manual, color='blue', s=100, label='Actual Data Points')

# Generate points on the regression line for plotting
x_line = np.array([x_manual.min(), x_manual.max()])
y_line_manual = beta_0_manual + beta_1_manual * x_line
plt.plot(x_line, y_line_manual, color='red', linestyle='-', linewidth=2, label='Manual Regression Line')

plt.title('Simple Linear Regression: Manual Calculation')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# 2.5 Calculate and visualize residuals
y_pred_manual = beta_0_manual + beta_1_manual * x_manual
residuals_manual = y_manual - y_pred_manual

print(f"\nPredicted Y values (manual): {y_pred_manual}")
print(f"Residuals (manual): {residuals_manual}")
print(f"Sum of Squared Residuals (manual): {np.sum(residuals_manual**2):.4f}")

plt.figure(figsize=(8, 6))
sns.scatterplot(x=x_manual, y=y_manual, color='blue', s=100, label='Actual Data Points')
plt.plot(x_line, y_line_manual, color='red', linestyle='-', linewidth=2, label='Regression Line')
# Plot residuals as vertical dashed lines
for i in range(len(x_manual)):
    plt.plot([x_manual[i], x_manual[i]], [y_manual[i], y_pred_manual[i]], 'g--', alpha=0.7)
plt.title('Simple Linear Regression: Residuals')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
X data: [1 2 3 4 5]
Y data: [3 5 4 7 6]

Mean of X (x_bar): 3.0
Mean of Y (y_bar): 5.0

Calculated Slope (Beta_1): 0.8000
Calculated Intercept (Beta_0): 2.6000

Manually calculated Regression Line: y = 2.6000 + 0.8000 * x
[Figure: 'Simple Linear Regression: Manual Calculation' (data points with fitted line)]
Predicted Y values (manual): [3.4 4.2 5.  5.8 6.6]
Residuals (manual): [-0.4  0.8 -1.   1.2 -0.6]
Sum of Squared Residuals (manual): 3.6000
[Figure: 'Simple Linear Regression: Residuals' (residuals drawn as vertical dashed lines)]

Discussion Point:

  • Why are the residuals important? What does it mean if a residual is positive or negative?
  • Could you apply the manual calculation method to a dataset with 1000 data points? Why or why not? What does this imply about the need for libraries?

Part 3: Simple Linear Regression with Scikit-learn

While manual calculation is good for understanding, in practice, we use libraries like scikit-learn which provide optimized and robust implementations.

Tasks:

  • Re-use the X_concept, y_concept dataset from Part 1.
  • Use LinearRegression from sklearn.linear_model.
  • Train the model (.fit()).
  • Access the coef_ (slope) and intercept_ (intercept) attributes.
  • Make predictions (.predict()).
  • Plot the data and the scikit-learn regression line.
  • Compare results with the manual calculation's intuition.
In [5]:
# Re-use the conceptual data
# X needs to be 2D for scikit-learn (e.g., (n_samples, n_features))
X_concept_reshaped = X_concept.reshape(-1, 1) # Convert 1D array to 2D column vector

# 3.1 Create and train the Linear Regression model
model_sklearn_simple = LinearRegression()
model_sklearn_simple.fit(X_concept_reshaped, y_concept)

print("Scikit-learn Simple Linear Regression model trained.")

# 3.2 Access coefficients and intercept
beta_1_sklearn = model_sklearn_simple.coef_[0] # Coefficient for the single feature
beta_0_sklearn = model_sklearn_simple.intercept_
print(f"\nScikit-learn Slope (Beta_1): {beta_1_sklearn:.4f}")
print(f"Scikit-learn Intercept (Beta_0): {beta_0_sklearn:.4f}")

print(f"\nScikit-learn Regression Line: y = {beta_0_sklearn:.4f} + {beta_1_sklearn:.4f} * x")

# 3.3 Make predictions
y_pred_sklearn_simple = model_sklearn_simple.predict(X_concept_reshaped)

# 3.4 Plot data and the scikit-learn regression line
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_concept, y=y_concept, color='blue', alpha=0.7, label='Actual Data Points')
plt.plot(X_concept, y_pred_sklearn_simple, color='green', linestyle='-', linewidth=2, label='Scikit-learn Regression Line')
plt.title('Simple Linear Regression with Scikit-learn')
plt.xlabel('Independent Variable (X)')
plt.ylabel('Dependent Variable (Y)')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

print("\nNotice how the scikit-learn line also closely fits the data, similar to the manual intuition.")
Scikit-learn Simple Linear Regression model trained.

Scikit-learn Slope (Beta_1): 1.9553
Scikit-learn Intercept (Beta_0): 5.1934

Scikit-learn Regression Line: y = 5.1934 + 1.9553 * x
[Figure: 'Simple Linear Regression with Scikit-learn' (data points with fitted line)]
Notice how the scikit-learn line also closely fits the data, similar to the manual intuition.

Discussion Point:

  • Compare the slope and intercept from your manual calculation (Part 2) with the scikit-learn results. Are they similar? What factors might cause slight differences (if any)?
  • What is the advantage of using scikit-learn for linear regression compared to manual calculation, especially for larger datasets?

Part 4: Multiple Linear Regression

Most real-world problems involve more than one independent variable. Multiple Linear Regression extends the concept to handle multiple features.

The equation becomes: $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + \epsilon$$ Where:

  • $x_1, x_2, \ldots, x_k$: The multiple independent variables (features).
  • $\beta_1, \beta_2, \ldots, \beta_k$: The coefficients for each feature, indicating the change in $y$ for a one-unit change in that feature, holding all other features constant.

Tasks:

  • Load a more complex, real-world regression dataset (e.g., Abalone dataset for predicting age from physical measurements).
  • Perform initial data exploration and preprocessing (missing values, categorical features).
  • Define features (X) and target (y).
  • Split data into training and testing sets.
  • Apply feature scaling to numerical features.
  • Train a LinearRegression model with multiple features.
  • Access coefficients for each feature and the intercept. Interpret the meaning of these coefficients.
  • Make predictions on the test set.
In [42]:
# --- 4.1 Load a more complex, real-world regression dataset ---
# We'll use the Abalone dataset, predicting age from physical measurements.
# It's a common dataset for regression tasks.
# Data source: https://archive.ics.uci.edu/ml/datasets/Abalone
# For simplicity, we'll use a version readily available from a GitHub raw URL.

abalone_url = "https://raw.githubusercontent.com/TheBabu/Abalone-Machine-Learning/master/abalone.csv"

column_names = ['Sex', 'Length', 'Diameter', 'Height', 'Whole_weight', 'Shucked_weight',
                'Viscera_weight', 'Shell_weight', 'Rings']

try:
    # header=0 consumes the file's own header row; names= then assigns our column names
    df_abalone = pd.read_csv(abalone_url, header=0, names=column_names)
    print(f"\nSuccessfully loaded Abalone data from: {abalone_url}")
except Exception as e:
    print(f"Error loading Abalone data: {e}")
    df_abalone = pd.DataFrame() # Empty DataFrame to prevent errors later
Successfully loaded Abalone data from: https://raw.githubusercontent.com/TheBabu/Abalone-Machine-Learning/master/abalone.csv
In [44]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression

# --- 4.1 Initial inspection ---
print("\n--- Abalone DataFrame Head ---")
print(df_abalone.head())

print("\n--- Abalone DataFrame Info ---")
df_abalone.info()

print("\n--- Abalone DataFrame Description ---")
print(df_abalone.describe())
--- Abalone DataFrame Head ---
  Sex  Length  Diameter  Height  Whole_weight  Shucked_weight  Viscera_weight  \
0   M   0.455     0.365   0.095        0.5140          0.2245          0.1010   
1   M   0.350     0.265   0.090        0.2255          0.0995          0.0485   
2   F   0.530     0.420   0.135        0.6770          0.2565          0.1415   
3   M   0.440     0.365   0.125        0.5160          0.2155          0.1140   
4   I   0.330     0.255   0.080        0.2050          0.0895          0.0395   

   Shell_weight   Age  
0         0.150  16.5  
1         0.070   8.5  
2         0.210  10.5  
3         0.155  11.5  
4         0.055   8.5  

--- Abalone DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole_weight    4177 non-null   float64
 5   Shucked_weight  4177 non-null   float64
 6   Viscera_weight  4177 non-null   float64
 7   Shell_weight    4177 non-null   float64
 8   Age             4177 non-null   float64
dtypes: float64(8), object(1)
memory usage: 293.8+ KB

--- Abalone DataFrame Description ---
            Length     Diameter       Height  Whole_weight  Shucked_weight  \
count  4177.000000  4177.000000  4177.000000   4177.000000     4177.000000   
mean      0.523992     0.407881     0.139516      0.828742        0.359367   
std       0.120093     0.099240     0.041827      0.490389        0.221963   
min       0.075000     0.055000     0.000000      0.002000        0.001000   
25%       0.450000     0.350000     0.115000      0.441500        0.186000   
50%       0.545000     0.425000     0.140000      0.799500        0.336000   
75%       0.615000     0.480000     0.165000      1.153000        0.502000   
max       0.815000     0.650000     1.130000      2.825500        1.488000   

       Viscera_weight  Shell_weight          Age  
count     4177.000000   4177.000000  4177.000000  
mean         0.180594      0.238831    11.433684  
std          0.109614      0.139203     3.224169  
min          0.000500      0.001500     2.500000  
25%          0.093500      0.130000     9.500000  
50%          0.171000      0.234000    10.500000  
75%          0.253000      0.329000    12.500000  
max          0.760000      1.005000    30.500000  
In [43]:
# --- 4.2 Data Preprocessing for Abalone Dataset ---
# Convert numerical columns to numeric, coercing errors
for col in ['Length', 'Diameter', 'Height', 'Whole_weight',
            'Shucked_weight', 'Viscera_weight', 'Shell_weight', 'Rings']:
    # Columns should already be numeric (header=0 consumed the header row); coerce guards against stray non-numeric values
    df_abalone[col] = pd.to_numeric(df_abalone[col], errors='coerce')

# Drop rows with NaN values that resulted from coercion (e.g., any true non-numeric)
df_abalone.dropna(inplace=True)

# Now 'Rings' should be numeric, so we can calculate 'Age'
# Age is Rings + 1.5 according to dataset documentation
df_abalone['Age'] = df_abalone['Rings'] + 1.5
df_abalone = df_abalone.drop('Rings', axis=1)

# Check for missing values again after dropping NaNs
print("\nMissing values in Abalone dataset after cleaning:\n", df_abalone.isnull().sum())

# Define features
numerical_features = ['Length', 'Diameter', 'Height', 'Whole_weight',
                      'Shucked_weight', 'Viscera_weight', 'Shell_weight']
categorical_features = ['Sex']

# Define the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

Missing values in Abalone dataset after cleaning:
 Sex               0
Length            0
Diameter          0
Height            0
Whole_weight      0
Shucked_weight    0
Viscera_weight    0
Shell_weight      0
Age               0
dtype: int64

In [45]:
# --- 4.3 Define Features (X) and Target (y) ---
X = df_abalone.drop('Age', axis=1)
y = df_abalone['Age']

# --- 4.4 Split Data into Training and Testing Sets ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nTraining set X shape: {X_train.shape}, y shape: {y_train.shape}")
print(f"Testing set X shape: {X_test.shape}, y shape: {y_test.shape}")
Training set X shape: (3341, 8), y shape: (3341,)
Testing set X shape: (836, 8), y shape: (836,)
In [46]:
# --- 4.5 Apply Feature Scaling and One-Hot Encoding ---
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Get feature names
num_feat_names = numerical_features
# Ensure we get the correct feature names from the OneHotEncoder
cat_feat_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features)
all_feature_names = num_feat_names + list(cat_feat_names)


print(f"\nProcessed X_train shape: {X_train_processed.shape}")
print(f"Processed X_test shape: {X_test_processed.shape}")
print(f"All feature names after preprocessing: {all_feature_names}")
Processed X_train shape: (3341, 10)
Processed X_test shape: (836, 10)
All feature names after preprocessing: ['Length', 'Diameter', 'Height', 'Whole_weight', 'Shucked_weight', 'Viscera_weight', 'Shell_weight', 'Sex_F', 'Sex_I', 'Sex_M']
In [47]:
# --- 4.6 Train a Linear Regression Model ---
model_sklearn_multiple = LinearRegression()
model_sklearn_multiple.fit(X_train_processed, y_train)

print("\nMultiple Linear Regression model trained successfully!")

# --- 4.7 Access coefficients and intercept ---
print(f"\nModel Intercept (Beta_0): {model_sklearn_multiple.intercept_:.4f}")
print("\nModel Coefficients (Beta_i for each feature):")
for i, coef in enumerate(model_sklearn_multiple.coef_):
    print(f"  {all_feature_names[i]}: {coef:.4f}")
Multiple Linear Regression model trained successfully!

Model Intercept (Beta_0): 11.4330

Model Coefficients (Beta_i for each feature):
  Length: -0.0240
  Diameter: 1.0976
  Height: 0.4440
  Whole_weight: 4.3903
  Shucked_weight: -4.5169
  Viscera_weight: -1.0460
  Shell_weight: 1.2302
  Sex_F: 0.2052
  Sex_I: -0.5137
  Sex_M: 0.3085
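
A note on interpretation: because StandardScaler was applied, each numerical coefficient above is the estimated change in Age for a one-standard-deviation change in that feature (not one original unit), while the Sex_* coefficients are offsets for the corresponding category. The optional sketch below (not part of the original tasks) reuses model_sklearn_multiple and all_feature_names from the cell above and collects the coefficients into a Series sorted by magnitude for easier scanning.

In [ ]:
import pandas as pd

# Pair each coefficient with its feature name
coef_series = pd.Series(model_sklearn_multiple.coef_, index=all_feature_names)

# Sort by absolute value so the most influential (standardized) features come first
print(coef_series.reindex(coef_series.abs().sort_values(ascending=False).index))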
In [48]:
# --- 4.8 Make predictions on the test set ---
y_pred_sklearn_multiple = model_sklearn_multiple.predict(X_test_processed)

print(f"\nFirst 5 true Age values (test set): {y_test.tolist()[:5]}")
print(f"First 5 predicted Age values: {[f'{val:.2f}' for val in y_pred_sklearn_multiple.tolist()[:5]]}")
First 5 true Age values (test set): [10.5, 9.5, 17.5, 10.5, 15.5]
First 5 predicted Age values: ['13.26', '11.74', '15.50', '13.50', '12.66']
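
As an optional aside (not part of the original tasks), the preprocessing and modelling steps can be bundled into a single scikit-learn Pipeline, which keeps the fit/transform logic tied together and avoids accidentally fitting the scaler on test data. A minimal sketch, assuming preprocessor, X_train, X_test, y_train, and y_test from the earlier cells are still in memory:

In [ ]:
from sklearn.pipeline import Pipeline

# One object that scales/encodes and then fits the linear model
pipe = Pipeline(steps=[('preprocess', preprocessor), ('model', LinearRegression())])
pipe.fit(X_train, y_train)  # refits the preprocessor on X_train internally

# For regressors, .score() returns R^2 on the given data
print(f"Pipeline R^2 on the test set: {pipe.score(X_test, y_test):.4f}")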

Discussion Point:

  • In multiple linear regression, what does it mean to say that a coefficient indicates the change in $y$ for a one-unit change in a feature, holding all other features constant?
  • Why is it important to apply StandardScaler to numerical features before training a LinearRegression model when multiple features are involved (especially if they have different scales)?

Part 5: Model Evaluation for Regression

Evaluating regression models requires specific metrics that measure the difference between predicted and actual continuous values.

Common Regression Metrics:

  • Mean Absolute Error (MAE):

    • $MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
    • The average of the absolute differences between predictions and actual values.
    • Interpretation: Easy to understand, directly measures the average magnitude of the errors in the same units as the target variable. Less sensitive to outliers than MSE.
  • Mean Squared Error (MSE):

    • $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
    • The average of the squared differences between predictions and actual values.
    • Interpretation: Penalizes larger errors more heavily than MAE due to squaring. Units are squared.
  • Root Mean Squared Error (RMSE):

    • $RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = \sqrt{MSE}$
    • The square root of MSE.
    • Interpretation: One of the most commonly reported regression metrics. It is in the same units as the target variable, making it more interpretable than MSE, and it still penalizes larger errors.
  • R-squared ($R^2$) - Coefficient of Determination:

    • $R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} = 1 - \frac{MSE(model)}{Variance(actual\_y)}$
    • Interpretation: Represents the proportion of the variance in the dependent variable that can be predicted from the independent variables. Values range from 0 to 1 (or sometimes negative if the model is worse than predicting the mean).
      • $R^2 = 1$: Perfect fit.
      • $R^2 = 0$: The model explains none of the variance in the target variable (it's as good as just predicting the mean of the target).
      • $R^2 < 0$: The model is worse than simply predicting the mean of the target.
    • Caution: $R^2$ always increases or stays the same when you add more features, even if they are irrelevant. Adjusted R-squared (sketched briefly below) addresses this by penalizing the addition of unnecessary features.
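
As flagged in the caution above, here is a minimal, self-contained sketch of Adjusted R-squared. The formula is standard; the numbers passed in below are purely illustrative and are not results from this notebook.

In [ ]:
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
# where n = number of observations and k = number of features.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Illustrative values only: adding 10 extra features barely raises R^2,
# so Adjusted R^2 goes down instead of up.
print(f"Adjusted R^2 with  5 features: {adjusted_r2(r2=0.550, n=1000, k=5):.4f}")
print(f"Adjusted R^2 with 15 features: {adjusted_r2(r2=0.551, n=1000, k=15):.4f}")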

Tasks:

  • Calculate MAE, MSE, RMSE, and R-squared for the Multiple Linear Regression model using sklearn.metrics.
  • Discuss the interpretation of each metric in the context of the Abalone age prediction.
In [49]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# --- 5 Calculate Evaluation Metrics ---
mae = mean_absolute_error(y_test, y_pred_sklearn_multiple)
mse = mean_squared_error(y_test, y_pred_sklearn_multiple)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_sklearn_multiple)

print(f"\nMean Absolute Error (MAE): {mae:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R-squared (R2): {r2:.4f}")

# --- Plot: Actual vs. Predicted Values ---
plt.figure(figsize=(10, 7))
sns.scatterplot(x=y_test, y=y_pred_sklearn_multiple, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2, label='Perfect Prediction Line')
plt.xlabel('Actual Age')
plt.ylabel('Predicted Age')
plt.title('Actual vs. Predicted Age (Multiple Linear Regression)')
plt.grid(True, linestyle='--', alpha=0.6)
plt.legend()
plt.show()

# --- Plot: Distribution of Residuals ---
residuals = y_test - y_pred_sklearn_multiple
plt.figure(figsize=(10, 7))
sns.histplot(residuals, kde=True, bins=30)
plt.title('Distribution of Residuals')
plt.xlabel('Residuals (Actual - Predicted)')
plt.ylabel('Frequency')
plt.show()

# --- Plot: Residuals vs. Predicted Values ---
plt.figure(figsize=(10, 7))
sns.scatterplot(x=y_pred_sklearn_multiple, y=residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residuals vs. Predicted Values')
plt.xlabel('Predicted Age')
plt.ylabel('Residuals')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Mean Absolute Error (MAE): 1.5931
Mean Squared Error (MSE): 4.8912
Root Mean Squared Error (RMSE): 2.2116
R-squared (R2): 0.5482
[Figures: 'Actual vs. Predicted Age (Multiple Linear Regression)', 'Distribution of Residuals', 'Residuals vs. Predicted Values']

Discussion Point:

  • In the context of the Abalone dataset, what does an MAE of, say, 1.5 mean? And an RMSE of 2.0?
  • If a model has an R-squared of 0.85, what does that tell you about its performance? What if it's 0.10?

Part 6: Assumptions of Linear Regression & Diagnostics

Linear Regression relies on several key assumptions about the data and the error term for its statistical inferences (like p-values, confidence intervals) to be valid, and for the model to perform optimally. While scikit-learn's LinearRegression will run even if assumptions are violated, interpreting the coefficients and evaluating performance requires checking these.

Key Assumptions (LINE):

  1. Linearity: The relationship between each independent variable and the dependent variable is linear.

    • Check: Scatter plots of features vs. target. Look for non-linear patterns (e.g., curves).
    • Implication if violated: Model will poorly capture the true relationship, leading to higher error.
    • Remedy: Transform variables (e.g., log, square root), add polynomial features (e.g., $x^2$; a short sketch follows this list), or use non-linear models.
  2. Independence of Errors/Residuals: The residuals (errors) are independent of each other. This is especially important in time series data, where errors might be correlated over time.

    • Check: Plot residuals against time (if applicable) or against previous residuals. Look for patterns.
    • Implication if violated: Inaccurate standard errors and p-values, leading to incorrect conclusions about feature significance.
    • Remedy: Use time-series specific models (e.g., ARIMA), consider lagged variables.
  3. Normality of Residuals: The residuals are approximately normally distributed.

    • Check: Histogram of residuals (should be bell-shaped), Q-Q plot (points should lie along a straight line).
    • Implication if violated: Affects confidence intervals and p-values. Not as critical for large sample sizes due to the Central Limit Theorem.
    • Remedy: Data transformations, using robust regression methods.
  4. Equal Variance of Errors (Homoscedasticity): The variance of residuals is constant across all levels of the independent variables (or predicted values). This means the spread of residuals should be roughly the same across the range of predictions.

    • Check: Scatter plot of residuals vs. predicted values. Look for a "fan" or "cone" shape (heteroscedasticity) vs. a consistent, random band (homoscedasticity).
    • Implication if violated: Leads to inefficient parameter estimates and incorrect standard errors.
    • Remedy: Data transformations (e.g., log transform on target), Weighted Least Squares.
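
As mentioned in the linearity remedy above, a standard fix is to add polynomial terms while keeping the model linear in its coefficients. A minimal, self-contained sketch using scikit-learn's PolynomialFeatures on synthetic data (illustrative only; it is not applied to the Abalone pipeline here):

In [ ]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic data with a genuinely quadratic relationship
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = 1.0 + 0.5 * x[:, 0] + 0.3 * x[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

# Expand x into [x, x^2]; the regression is still linear in the coefficients
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)
print("Estimated coefficients for [x, x^2]:", model.coef_)   # true values: 0.5 and 0.3
print("Estimated intercept:", model.intercept_)              # true value: 1.0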

Additional Important Assumption:

  • No Multicollinearity: Independent variables should not be highly correlated with each other. This applies to Multiple Linear Regression.
    • Check: Correlation matrix/heatmap of features. Variance Inflation Factor (VIF) scores (VIF > 5-10 indicates problematic multicollinearity).
    • Implication if violated: Coefficients become unstable and difficult to interpret (e.g., large changes in coefficients with small changes in data, opposite signs than expected). Does not necessarily affect the predictive power of the model.
    • Remedy: Remove one of the highly correlated variables, combine them into a new feature, use dimensionality reduction (e.g., PCA), or use regularization techniques (Ridge, Lasso regression).
In [50]:
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# --- Residuals ---
residuals = y_test - y_pred_sklearn_multiple

# --- 6.1 Check Linearity ---
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df_abalone['Length'], y=df_abalone['Age'], alpha=0.6)
plt.title('Linearity Check: Length vs. Age')
plt.xlabel('Length')
plt.ylabel('Age')
plt.show()

# --- 6.2 Check Homoscedasticity ---
plt.figure(figsize=(10, 7))
sns.scatterplot(x=y_pred_sklearn_multiple, y=residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--', linewidth=2)
plt.title('Homoscedasticity Check: Residuals vs. Predicted Values')
plt.xlabel('Predicted Age')
plt.ylabel('Residuals')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# --- 6.3 Check Normality of Residuals ---
plt.figure(figsize=(8, 6))
sns.histplot(residuals, kde=True, bins=30)
plt.title('Normality Check: Histogram of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()

# Q-Q Plot for Normality
plt.figure(figsize=(8, 6))
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Normality Check: Q-Q Plot of Residuals')
plt.show()

# --- 6.4 Check Multicollinearity ---
# Numerical features for abalone (excluding categorical 'Sex')
numerical_abalone_features = df_abalone[numerical_features]
plt.figure(figsize=(10, 8))
sns.heatmap(numerical_abalone_features.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Multicollinearity Check: Correlation Matrix of Numerical Features')
plt.show()

print("\nNote on Multicollinearity: Check the heatmap above. Strong correlations (e.g., between Length, Diameter, Whole_weight) suggest multicollinearity, which can make coefficients less reliable.")
[Figures: 'Linearity Check: Length vs. Age', 'Homoscedasticity Check: Residuals vs. Predicted Values', 'Normality Check: Histogram of Residuals', 'Normality Check: Q-Q Plot of Residuals', 'Multicollinearity Check: Correlation Matrix of Numerical Features']
Note on Multicollinearity: Check the heatmap above. Strong correlations (e.g., between Length, Diameter, Whole_weight) suggest multicollinearity, which can make coefficients less reliable.
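
The heatmap above only shows pairwise correlations; the Variance Inflation Factor (VIF) mentioned in the assumption list measures how well each feature is explained by all the others together. A minimal sketch, assuming the statsmodels package is available in the environment and that df_abalone and numerical_features from the earlier cells are still in memory:

In [ ]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Design matrix with an explicit intercept column, as VIF expects
X_vif = sm.add_constant(df_abalone[numerical_features])

vif_scores = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)

# Drop the intercept row; values above roughly 5-10 signal problematic multicollinearity
print(vif_scores.drop('const').sort_values(ascending=False))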

Discussion Point:

  • Describe what "heteroscedasticity" means in the context of residuals. What does it look like on a residuals vs. predicted values plot?
  • If two of your features are highly correlated (multicollinearity), how might that affect the interpretation of their individual coefficients in a multiple linear regression model?

Part 7: Advantages and Disadvantages of Linear Regression

Linear Regression is a powerful and widely used algorithm, but like any model, it has its strengths and weaknesses.

Advantages:

  1. Simplicity and Interpretability:
    • The model is straightforward to understand and explain. The coefficients directly show the estimated change in the dependent variable for a one-unit change in each independent variable (holding others constant). This makes it easy to communicate insights.
  2. Computational Efficiency:
    • It has a closed-form solution (the OLS normal equations; see the sketch after this list) and is computationally inexpensive to train, even for large datasets.
  3. Strong Theoretical Foundation:
    • Based on well-established statistical principles, which allows for statistical inference (e.g., hypothesis testing on coefficients, confidence intervals).
  4. Good Baseline Model:
    • Often serves as a simple, quick-to-implement baseline model against which more complex models can be compared. If a complex model doesn't significantly outperform linear regression, it might not be worth the added complexity.
  5. Handles Linearity Well:
    • If the true underlying relationship between variables is linear, it performs very well and is often the best choice.
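
To illustrate the closed-form solution mentioned in Advantage 2, here is a minimal sketch that solves the OLS normal equations, $\hat{\beta} = (X^T X)^{-1} X^T y$, directly with NumPy. It reuses X_concept and y_concept from Part 1 (assuming they are still in memory); the coefficients should agree with the scikit-learn fit from Part 3 up to numerical precision.

In [ ]:
import numpy as np

# Design matrix with an explicit intercept column of ones
X_design = np.column_stack([np.ones_like(X_concept), X_concept])

# Solve (X^T X) beta = X^T y rather than forming the matrix inverse explicitly
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_concept)

print(f"Closed-form intercept: {beta_hat[0]:.4f}")  # compare with Part 3's intercept
print(f"Closed-form slope:     {beta_hat[1]:.4f}")  # compare with Part 3's slope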

Disadvantages:

  1. Assumes Linearity:
    • Its biggest limitation. It assumes a linear relationship between features and the target. If the relationship is genuinely non-linear, a linear model will provide a poor fit and inaccurate predictions unless appropriate feature transformations (e.g., polynomial features, log transforms) are applied.
  2. Sensitive to Outliers:
    • The "least squares" method involves squaring the errors. Large errors (from outliers) are heavily penalized, which can significantly pull the regression line towards them, leading to a distorted model.
  3. Assumes Independence of Errors:
    • Violated in time series data or hierarchical data where observations are not independent, leading to biased standard errors and invalid statistical tests.
  4. Assumes Homoscedasticity:
    • If the variance of errors is not constant across all levels of predictors (heteroscedasticity), the model's coefficients are still unbiased but their standard errors are inaccurate, affecting confidence intervals and p-values.
  5. Assumes Normality of Residuals:
    • Primarily affects the validity of statistical inference (p-values, confidence intervals), especially with small sample sizes. For prediction accuracy, this assumption is less critical for larger datasets (due to the Central Limit Theorem).
  6. Multicollinearity Issues:
    • When independent variables are highly correlated with each other, it can make the individual coefficients unstable and difficult to interpret. This doesn't necessarily reduce the model's predictive accuracy but makes it hard to understand the individual impact of correlated features.
  7. Does Not Automatically Handle Feature Scaling or Categorical Features:
    • Requires manual preprocessing steps such as feature scaling (not strictly required for plain OLS, but good practice for numerical stability and essential once you move to regularized variants like Ridge or Lasso) and one-hot encoding for categorical variables.

Prepared By

Md. Atikuzzaman
Lecturer
Department of Computer Science and Engineering
Green University of Bangladesh
Email: atik@cse.green.edu.bd