Dimensionality Reduction with PCA and Visualization with t-SNE

Curse of Dimensionality, Principal Component Analysis (PCA), Variance Maximization, Eigenvectors/Eigenvalues, Explained Variance, t-Distributed Stochastic Neighbor Embedding (t-SNE), Manifold Learning, Perplexity, KL Divergence.

Objectives:

  • Students will understand the "Curse of Dimensionality" and the critical need for dimensionality reduction techniques in machine learning.
  • Students will grasp the theoretical basis of PCA as a linear technique that finds orthogonal components of maximum variance.
  • Students will implement PCA using scikit-learn to reduce the number of features and interpret the results using explained variance.
  • Students will understand the intuition behind t-SNE as a non-linear technique for visualizing the local structure and neighborhoods of high-dimensional data.
  • Students will implement t-SNE to create meaningful 2D embeddings and interpret the resulting visualizations.
  • Students will critically compare PCA and t-SNE, highlighting their different goals (preserving global variance vs. preserving local similarities) and appropriate use cases.
  • Students will apply these techniques to a real dataset and complete tasks to solidify their understanding of key hyperparameters.

Setup: Install and Import Libraries

In [1]:
# Install necessary libraries if not already present in Colab environment
!pip install pandas numpy scikit-learn matplotlib seaborn -q

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import time # To compare computation times

# Scikit-learn for PCA, t-SNE, datasets, and models
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Set plot style for better aesthetics
sns.set_theme(style="whitegrid")
sns.set_context("notebook", font_scale=1.2)


# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print(f"Pandas Version: {pd.__version__}")
print(f"Seaborn Version: {sns.__version__}")
print(f"Scikit-learn Version: {sklearn.__version__}")
Pandas Version: 2.2.2
Seaborn Version: 0.13.2
Scikit-learn Version: 1.6.1

Part 1: The Curse of Dimensionality & Our Dataset

As the number of features (dimensions) in a dataset grows, the data becomes increasingly sparse. This phenomenon, known as the "Curse of Dimensionality," makes it harder for machine learning algorithms to find patterns, as the distance between any two points becomes less meaningful. Dimensionality reduction aims to solve this by projecting the data onto a lower-dimensional subspace while retaining as much relevant information as possible.
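
To make the sparsity argument concrete, here is a small illustrative sketch (our addition, using synthetic uniform data rather than the lab dataset): as the number of dimensions grows, the gap between the nearest and farthest point shrinks relative to the average distance, so distance-based reasoning loses power.

# Illustrative sketch (toy data, not the digits dataset): distances concentrate
# as dimensionality grows, so "near" and "far" become harder to distinguish.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    points = rng.uniform(size=(500, d))                      # 500 random points in [0, 1]^d
    dists = np.linalg.norm(points[1:] - points[0], axis=1)   # distances from the first point
    contrast = (dists.max() - dists.min()) / dists.mean()
    print(f"d = {d:4d} | relative spread of distances: {contrast:.3f}")

As d increases, the printed spread shrinks, which is the quantitative face of the Curse of Dimensionality.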

Dataset: Handwritten Digits

We will use the scikit-learn Digits dataset. Each data point is an 8x8 pixel image of a handwritten digit, flattened into a 64-dimensional vector. Our goal is to see if we can simplify these 64 features and visualize the natural groupings of the 10 different digits (0-9).

In [2]:
print("--- Part 1: Dataset Loading and Inspection ---")

# 1. Load the Digits dataset
digits = load_digits()
X = digits.data
y = digits.target

print(f"Data shape (X): {X.shape}")
print(f"Target shape (y): {y.shape}")
print(f"Number of features (dimensions): {X.shape[1]}")
print(f"Number of unique classes (digits): {len(np.unique(y))}")

# 2. Visualize a few sample digits to understand the data
fig, axes = plt.subplots(2, 5, figsize=(12, 5), subplot_kw={'xticks':[], 'yticks':[]})
for i, ax in enumerate(axes.flat):
    ax.imshow(X[i].reshape(8, 8), cmap='binary')
    ax.set_title(f"Label: {y[i]}")
plt.suptitle("Sample Handwritten Digits")
plt.show()
--- Part 1: Dataset Loading and Inspection ---
Data shape (X): (1797, 64)
Target shape (y): (1797,)
Number of features (dimensions): 64
Number of unique classes (digits): 10
[Figure: Sample Handwritten Digits, a 2x5 grid of 8x8 grayscale digit images with their labels]

Part 2: Principal Component Analysis (PCA) for Dimensionality Reduction

PCA is a linear technique that transforms the data into a new coordinate system. The new axes, called principal components, are orthogonal and are ordered by the amount of variance they capture from the original data. The first principal component is the direction of highest variance, the second is the next highest (orthogonal to the first), and so on.

Key Idea: We can often describe the vast majority of the data's structure (variance) using just a few principal components, allowing us to discard the rest with minimal information loss.

Important: PCA is highly sensitive to the scale of the features, so we must standardize the data first.

Principal Component Analysis (PCA) is a linear transformation technique that reduces dimensionality while preserving as much variance (information) as possible.
It identifies principal components (PCs), new orthogonal axes that capture the directions of maximum variance in the data.

1. Data Preprocessing (Standardization)

Since PCA is sensitive to scale, we standardize each feature:

$$ x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j} $$

  • $x_{ij}$: value of feature $j$ for observation $i$
  • $\mu_j$: mean of feature $j$
  • $\sigma_j$: standard deviation of feature $j$

Each feature then has zero mean and unit variance.
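
As a quick check (our addition, reusing the digits matrix X from Part 1 and the imports from the setup cell), the formula above reproduces StandardScaler's output. Constant features, whose standard deviation is zero, are left unscaled, which matches StandardScaler's own behavior.

# Sketch: manual standardization versus StandardScaler (assumes X from Part 1).
mu = X.mean(axis=0)
sigma = X.std(axis=0)
sigma_safe = np.where(sigma == 0, 1.0, sigma)   # constant pixels: avoid division by zero
X_manual = (X - mu) / sigma_safe

X_check = StandardScaler().fit_transform(X)
print("Manual standardization matches StandardScaler:", np.allclose(X_manual, X_check))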

2. Covariance Matrix

For standardized data matrix $X$ ($n \times d$, with $n$ samples and $d$ features):

$$ \Sigma = \frac{1}{n-1} X^\top X $$

Each entry:

$$ \Sigma_{jk} = \frac{1}{n-1} \sum_{i=1}^n (x'_{ij})(x'_{ik}) $$
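
A brief numerical check (our addition, again assuming X from Part 1): the formula above agrees with np.cov on the standardized data.

# Sketch: covariance matrix from the formula versus np.cov (columns = features).
Xs = StandardScaler().fit_transform(X)          # standardized copy of the digits data
n_samples = Xs.shape[0]

cov_manual = (Xs.T @ Xs) / (n_samples - 1)      # Sigma = (1 / (n - 1)) X^T X
cov_numpy = np.cov(Xs, rowvar=False)            # np.cov also divides by n - 1

print("Covariance matrices match:", np.allclose(cov_manual, cov_numpy))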

3. Eigen Decomposition

We solve the eigenvalue problem:

$$ \Sigma v = \lambda v $$

  • $v$: eigenvector (direction of a PC)
  • $\lambda$: eigenvalue (variance explained by that PC)

Sort eigenvalues:

$$ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d $$
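
In code, this step might look like the sketch below (our addition; scikit-learn's PCA actually uses an SVD internally rather than an explicit eigen decomposition). np.linalg.eigh is appropriate because the covariance matrix is symmetric; it returns eigenvalues in ascending order, so we re-sort them in descending order.

# Sketch: eigen decomposition of the covariance matrix, sorted by descending eigenvalue.
Xs = StandardScaler().fit_transform(X)
Sigma = np.cov(Xs, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(Sigma)   # eigh: for symmetric matrices, ascending order
order = np.argsort(eigenvalues)[::-1]               # indices for descending order
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]               # column k holds the eigenvector v_k

print("Five largest eigenvalues (variance along each PC):", np.round(eigenvalues[:5], 2))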

4. Principal Components

  • First PC:

$$ \text{PC}_1 = X v_1 $$

  • k-th PC:

$$ \text{PC}_k = X v_k $$

  • Projection matrix with top $k$ eigenvectors:

$$ W_k = [ v_1, v_2, \dots, v_k ] $$

  • Reduced representation:

$$ Z = X W_k $$

5. Explained Variance Ratio

How much variance each PC captures:

$$ \text{Explained Variance Ratio for PC}_j = \frac{\lambda_j}{\sum_{k=1}^d \lambda_k} $$

Cumulative explained variance:

$$ \text{Cumulative Explained Variance}(k) = \frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^d \lambda_j} $$

6. PCA Algorithm Steps

  1. Standardize the dataset.
  2. Compute covariance matrix $\Sigma$.
  3. Find eigenvalues and eigenvectors of $\Sigma$.
  4. Sort eigenvectors by descending eigenvalues.
  5. Select the top $k$ eigenvectors to form the projection matrix $W_k$.
  6. Transform data:

$$ Z = X W_k $$
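
Putting the six steps together, the following sketch (our addition) implements PCA with NumPy on the digits data and checks that the 2-component projection matches scikit-learn's PCA up to the sign of each component, which is arbitrary.

# Sketch: from-scratch PCA (steps 1-6) compared against sklearn.decomposition.PCA.
k = 2

# Steps 1-2: standardize and compute the covariance matrix
Xs = StandardScaler().fit_transform(X)
Sigma = np.cov(Xs, rowvar=False)

# Steps 3-4: eigen decomposition, sorted by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Steps 5-6: projection matrix W_k from the top-k eigenvectors, then Z = X W_k
W_k = eigvecs[:, :k]
Z_manual = Xs @ W_k

# Reference projection from scikit-learn on the same standardized data
Z_sklearn = PCA(n_components=k).fit_transform(Xs)

print("Projections agree up to sign:", np.allclose(np.abs(Z_manual), np.abs(Z_sklearn)))
print("Explained variance ratio of the top 2 PCs:", np.round(eigvals[:k] / eigvals.sum(), 3))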

In [3]:
print("\n--- Part 2: Principal Component Analysis (PCA) ---")

# 1. Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Data has been standardized.")

# 2. Fit PCA to find the explained variance of all components
pca_full = PCA()
pca_full.fit(X_scaled)

# 3. Plot the cumulative explained variance
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by Number of Principal Components')
plt.grid(True)
plt.axhline(y=0.95, color='r', linestyle='--', label='95% Explained Variance')
plt.legend()
plt.show()

# INSIGHT: Roughly 40 components are enough to capture 95% of the total variance in the original 64-feature dataset.

# 4. Perform PCA again, but this time to reduce dimensionality
# We'll choose n_components to capture 95% of the variance
pca_95 = PCA(n_components=0.95)
X_pca = pca_95.fit_transform(X_scaled)

print(f"\nOriginal number of features: {X_scaled.shape[1]}")
print(f"Reduced number of features (for 95% variance): {X_pca.shape[1]}")
--- Part 2: Principal Component Analysis (PCA) ---
Data has been standardized.
[Figure: cumulative explained variance versus number of principal components, reaching the 95% threshold at about 40 components]
Original number of features: 64
Reduced number of features (for 95% variance): 40

Part 3: t-SNE for High-Dimensional Visualization

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear technique designed specifically for visualizing high-dimensional data in 2 or 3 dimensions.

Key Idea: Unlike PCA, which focuses on preserving global variance, t-SNE's goal is to preserve local structure. It models the similarity between high-dimensional points as a probability distribution and then optimizes a similar distribution for the corresponding low-dimensional points. In simple terms, it tries to keep points that are close in high dimensions close in the low-dimensional map.

The most important hyperparameter is perplexity, which can be thought of as a soft measure of the number of nearest neighbors each point considers.
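
To build intuition for what the perplexity number means, here is a small illustrative sketch (our addition, on toy data; it is not how scikit-learn implements t-SNE internally). For a single point we form a Gaussian similarity distribution over the other points and compute its perplexity as 2 raised to the distribution's entropy; t-SNE tunes each point's Gaussian bandwidth so that this quantity equals the perplexity you pass in, which is why it behaves like an effective neighbor count.

# Sketch: perplexity of one point's Gaussian neighbor distribution (toy data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))              # 200 toy points in 10 dimensions
i = 0                                            # reference point
sq_dists = np.sum((X_demo - X_demo[i]) ** 2, axis=1)
sq_dists[i] = np.inf                             # a point is not its own neighbor

sigma = 1.5                                      # bandwidth; t-SNE searches this per point
p = np.exp(-sq_dists / (2 * sigma ** 2))
p /= p.sum()                                     # conditional distribution p_{j|i}

entropy_bits = -np.sum(p[p > 0] * np.log2(p[p > 0]))
print(f"Perplexity of point {i}: {2 ** entropy_bits:.1f} (an effective neighbor count)")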

In [4]:
print("\n--- Part 3: t-SNE for Visualization ---")

# 1. Initialize and run t-SNE
# Perplexity is typically between 5 and 50. max_iter (named n_iter before scikit-learn 1.5) controls the number of optimization iterations.
tsne = TSNE(n_components=2, perplexity=30, max_iter=1000, random_state=42)
start_time = time.time()
X_tsne = tsne.fit_transform(X_scaled) # Apply t-SNE to the original scaled data
end_time = time.time()

print(f"t-SNE completed in {end_time - start_time:.2f} seconds.")
print(f"Shape of t-SNE embedding: {X_tsne.shape}")

# 2. Visualize the t-SNE embedding
plt.figure(figsize=(12, 10))
sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=y, palette=sns.color_palette("hsv", 10), legend="full")
plt.title('t-SNE Visualization of Digits Dataset')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.legend(title='Digit Label')
plt.show()

# INSIGHT: t-SNE produces clear, well-separated clusters for each of the 10 digits,
# revealing structure in the data that would be impossible to see directly in 64 dimensions.
--- Part 3: t-SNE for Visualization ---
t-SNE completed in 21.33 seconds.
Shape of t-SNE embedding: (1797, 2)
[Figure: 2D t-SNE embedding of the digits data, colored by digit label, showing 10 well-separated clusters]

Part 4: Comparing PCA and t-SNE Visualizations

It's crucial to understand that PCA and t-SNE are designed for different tasks. PCA's 2D plot is a projection onto the two axes of greatest variance. t-SNE's plot is an embedding optimized to preserve local neighborhoods.

Let's visualize the data using only the first two principal components and compare it directly to the t-SNE plot.

In [5]:
print("\n--- Part 4: Comparing PCA and t-SNE Visualizations ---")

# 1. Get the 2D projection from PCA
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_scaled)

# 2. Create side-by-side plots for comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 9))

# PCA Plot
sns.scatterplot(x=X_pca_2d[:, 0], y=X_pca_2d[:, 1], hue=y, palette=sns.color_palette("hsv", 10), legend="full", ax=ax1)
ax1.set_title('PCA 2D Visualization')
ax1.set_xlabel('Principal Component 1')
ax1.set_ylabel('Principal Component 2')

# t-SNE Plot
sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=y, palette=sns.color_palette("hsv", 10), legend="full", ax=ax2)
ax2.set_title('t-SNE 2D Visualization')
ax2.set_xlabel('t-SNE Component 1')
ax2.set_ylabel('t-SNE Component 2')

plt.show()

# INSIGHT: The difference is clear. PCA shows some separation, but the clusters overlap heavily, because PCA
# preserves global variance rather than local neighborhood structure. t-SNE gives a much clearer view of the
# distinct groupings, which is exactly the task it is designed for.
--- Part 4: Comparing PCA and t-SNE Visualizations ---
[Figure: side-by-side comparison of the PCA 2D projection (left, overlapping clusters) and the t-SNE embedding (right, well-separated clusters)]

Lab Tasks & Exercises

Now, apply what you've learned. The following code cell contains three tasks to help you explore the concepts of PCA and t-SNE further.

In [7]:
# # --- TASK 1: PCA Variance Threshold ---
# # Re-run the PCA analysis to find the number of components needed to capture 80% of the variance.
# # How many components are needed? How does this compare to the 95% threshold?

# # YOUR CODE HERE
# pca_80 = PCA(n_components=0.80)
# X_pca_80 = pca_80.fit_transform(X_scaled)
# print(f"Components for 80% variance: {X_pca_80.shape[1]}")


# # --- TASK 2: t-SNE Perplexity ---
# # Re-run the t-SNE visualization twice, first with a very low perplexity (e.g., 5) and then with a
# # very high one (e.g., 50 or 100). Plot the results.
# # How does the visualization change? What does this tell you about the role of perplexity?

# # YOUR CODE HERE for perplexity=5
# tsne_5 = TSNE(n_components=2, perplexity=5, random_state=42)
# X_tsne_5 = tsne_5.fit_transform(X_scaled)
# plt.figure(figsize=(8,6))
# sns.scatterplot(x=X_tsne_5[:,0], y=X_tsne_5[:,1], hue=y, palette=sns.color_palette("hsv", 10)).set_title('Perplexity = 5')
# plt.show()

# # YOUR CODE HERE for perplexity=50
# tsne_50 = TSNE(n_components=2, perplexity=50, random_state=42)
# X_tsne_50 = tsne_50.fit_transform(X_scaled)
# plt.figure(figsize=(8,6))
# sns.scatterplot(x=X_tsne_50[:,0], y=X_tsne_50[:,1], hue=y, palette=sns.color_palette("hsv", 10)).set_title('Perplexity = 50')
# plt.show()


# # --- TASK 3: PCA for Model Performance ---
# # PCA is not just for visualization; it's a powerful preprocessing step.
# # 1. Train a Logistic Regression classifier on the ORIGINAL scaled data (X_scaled, y).
# # 2. Train another Logistic Regression classifier on the PCA-REDUCED data (X_pca, y).
# # 3. Compare their accuracy and training time. What do you observe?

# # Split the data
# X_train_orig, X_test_orig, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# X_train_pca, X_test_pca, _, _ = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# # YOUR CODE HERE for the original data
# log_reg_orig = LogisticRegression(max_iter=1000, random_state=42)
# start = time.time()
# log_reg_orig.fit(X_train_orig, y_train)
# end = time.time()
# y_pred_orig = log_reg_orig.predict(X_test_orig)
# acc_orig = accuracy_score(y_test, y_pred_orig)
# print(f"Original Data -> Accuracy: {acc_orig:.4f}, Time: {end-start:.4f}s")

# # YOUR CODE HERE for the PCA-reduced data
# log_reg_pca = LogisticRegression(max_iter=1000, random_state=42)
# start = time.time()
# log_reg_pca.fit(X_train_pca, y_train)
# end = time.time()
# y_pred_pca = log_reg_pca.predict(X_test_pca)
# acc_pca = accuracy_score(y_test, y_pred_pca)
# print(f"PCA-Reduced Data -> Accuracy: {acc_pca:.4f}, Time: {end-start:.4f}s")
Components for 80% variance: 21
[Figure: t-SNE embedding with perplexity = 5]
[Figure: t-SNE embedding with perplexity = 50]
Original Data -> Accuracy: 0.9722, Time: 0.7514s
PCA-Reduced Data -> Accuracy: 0.9611, Time: 0.5276s

Part 5: Advanced Topics & Discussion

  • Linear vs. Non-linear Methods: PCA is a linear method, meaning it can only capture linear relationships in the data. t-SNE is a non-linear (or manifold learning) method, allowing it to uncover complex, curved structures. For highly non-linear data, powerful alternatives include UMAP (Uniform Manifold Approximation and Projection), which is often faster than t-SNE and better at preserving global structure, and Kernel PCA.

  • PCA for Preprocessing vs. t-SNE for Visualization: This is the most critical takeaway. PCA is an excellent preprocessing step to reduce features before feeding them into a supervised model, as it can reduce noise, prevent overfitting, and speed up training (as seen in Task 3). t-SNE should not be used for this; it is almost exclusively a tool for exploratory data analysis and visualization.

  • Interpreting PCA Components: While PCA components are mathematically abstract, you can inspect the components_ attribute of the fitted PCA object. Each component is a vector where the values correspond to the original features. A large absolute value means that the original feature contributes strongly to that component, allowing for some level of interpretation (see the short sketch after this list).

  • Limitations of t-SNE: While powerful for visualization, t-SNE results must be interpreted with caution:

    • Cluster sizes are not meaningful: A cluster that looks larger on a t-SNE plot does not necessarily contain more points or have more variance.
    • Distances between clusters are not meaningful: You cannot conclude that two clusters that are far apart are "more different" than two clusters that are close together. t-SNE preserves local neighborhoods, not global distances.
    • It's a visualization tool, not a clustering algorithm: While it reveals clusters, you should use formal algorithms like K-Means or DBSCAN to assign points to clusters.
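
As a pointer for the "Interpreting PCA Components" bullet above, here is a small sketch (our addition, reusing the 2-component pca_2d object fitted in Part 4): each row of components_ holds one loading per original pixel, so reshaping it to 8x8 shows which pixels drive each principal component.

# Sketch: visualize the loadings of the first two principal components as 8x8 images.
fig, axes = plt.subplots(1, 2, figsize=(8, 4), subplot_kw={'xticks': [], 'yticks': []})
for idx, ax in enumerate(axes):
    ax.imshow(pca_2d.components_[idx].reshape(8, 8), cmap='coolwarm')
    ax.set_title(f"Loadings of PC{idx + 1}")
plt.suptitle("Pixel contributions to the first two principal components")
plt.show()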

Prepared By

Md. Atikuzzaman
Lecturer
Department of Computer Science and Engineering
Green University of Bangladesh
Email: atik@cse.green.edu.bd