In [ ]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, load_wine, load_iris # For synthetic and real datasets
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, homogeneity_score, completeness_score, adjusted_rand_score
from sklearn.decomposition import PCA # For visualization of high-dimensional data

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

Part 1: Understanding K-Means on a Simple Synthetic Dataset

We'll start with a simple, synthetically generated 2D dataset to clearly visualize how K-means works.

Tasks:

  • Generate a dataset with a known number of distinct clusters using make_blobs.
  • Apply the KMeans algorithm with the correct number of clusters.
  • Visualize the data points, assigned clusters, and final centroids.
  • Calculate and display the inertia_ (Sum of Squared Distances to the closest cluster center).

1. Generate a synthetic dataset with clear clusters

In [ ]:
n_samples = 300
n_features = 2
n_clusters_true = 3 # We know the true number of clusters for this synthetic data
random_state = 42

X_synth, y_true_synth = make_blobs(n_samples=n_samples, n_features=n_features,
                                   centers=n_clusters_true, cluster_std=0.8,
                                   random_state=random_state)

print(f"Synthetic dataset shape: {X_synth.shape}")
Synthetic dataset shape: (300, 2)

2. Apply the KMeans algorithm

In [ ]:
# n_init='auto' is the default in newer versions, specifying how many times to run with different centroid seeds.
kmeans_synth = KMeans(n_clusters=n_clusters_true, random_state=random_state, n_init='auto')
kmeans_synth.fit(X_synth)

# Get cluster labels and centroids
labels_synth = kmeans_synth.labels_
centroids_synth = kmeans_synth.cluster_centers_
inertia_synth = kmeans_synth.inertia_

print(f"\nK-means (K={n_clusters_true}) completed.")
print(f"Inertia (Sum of Squared Distances): {inertia_synth:.2f}")
print(f"Final Centroids:\n{centroids_synth}")

# 3. Visualize the clustered data and centroids
plt.figure(figsize=(10, 7))
sns.scatterplot(x=X_synth[:, 0], y=X_synth[:, 1], hue=labels_synth, palette='viridis', legend='full', s=80, alpha=0.7)
plt.scatter(centroids_synth[:, 0], centroids_synth[:, 1], marker='X', s=200, color='red', label='Centroids', edgecolor='black')
plt.title(f'K-means Clustering with K={n_clusters_true} on Synthetic Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
K-means (K=3) completed.
Inertia (Sum of Squared Distances): 362.79
Final Centroids:
[[-2.60842567  9.03771305]
 [-6.88302287 -6.96320924]
 [ 4.72565847  2.00310936]]
[Figure: K-means clustering (K=3) on the synthetic data, coloured by assigned cluster, with final centroids marked.]

Discussion Point:

  • What does the inertia_ value represent, and why might a lower inertia generally be desirable (with caveats)?
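
To ground this question, inertia can be recomputed by hand as the sum of squared Euclidean distances from each point to its assigned centroid. A minimal sketch reusing X_synth, labels_synth, centroids_synth, and inertia_synth from the cells above:

In [ ]:
# Recompute inertia manually: sum of squared distances from each point
# to the centroid of the cluster it was assigned to.
diffs = X_synth - centroids_synth[labels_synth]   # (n_samples, 2) point-to-centroid offsets
manual_inertia = np.sum(diffs ** 2)               # sum of squared Euclidean distances

print(f"Manual inertia:   {manual_inertia:.2f}")
print(f"sklearn inertia_: {inertia_synth:.2f}")   # should agree up to floating-point error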

Part 2: Determining Optimal K - Elbow Method & Silhouette Score

A crucial step in K-means is determining the "optimal" number of clusters, K, when it's not known beforehand. We'll explore two popular methods: the Elbow Method and the Silhouette Score.

Tasks:

  • Apply K-means for a range of k values (e.g., 2 to 10) on the synthetic dataset.
  • For each k, record the inertia_ and calculate the silhouette_score.
  • Plot the inertia_ values against k (Elbow Method). Identify the "elbow point."
  • Plot the silhouette_score values against k. Identify the k with the highest score.
  • Discuss the interpretations and potential limitations of each method.
In [ ]:
# Define a range of K values to test
k_range = range(2, 11)
inertias = []
silhouette_scores = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=random_state, n_init='auto')
    kmeans.fit(X_synth)
    inertias.append(kmeans.inertia_)
    score = silhouette_score(X_synth, kmeans.labels_)
    silhouette_scores.append(score)
    print(f"  K={k}: Inertia={kmeans.inertia_:.2f}, Silhouette Score={score:.4f}")

# Plotting the Elbow Method
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(k_range, inertias, marker='o', linestyle='-')
plt.title('Elbow Method for Optimal K (Inertia)')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.xticks(k_range)
plt.grid(True, linestyle='--', alpha=0.6)

# Plotting the Silhouette Score
plt.subplot(1, 2, 2)
plt.plot(k_range, silhouette_scores, marker='o', linestyle='-', color='orange')
plt.title('Silhouette Score for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.xticks(k_range)
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
  K=2: Inertia=5526.51, Silhouette Score=0.7206
  K=3: Inertia=362.79, Silhouette Score=0.8781
  K=4: Inertia=320.34, Silhouette Score=0.6929
  K=5: Inertia=276.63, Silhouette Score=0.4982
  K=6: Inertia=233.60, Silhouette Score=0.3268
  K=7: Inertia=206.92, Silhouette Score=0.3327
  K=8: Inertia=181.80, Silhouette Score=0.3403
  K=9: Inertia=164.57, Silhouette Score=0.3266
  K=10: Inertia=153.53, Silhouette Score=0.3268
[Figure: Elbow method (inertia vs. K) and silhouette score vs. K for the synthetic data, K=2 to 10.]

Discussion Points:

  • Where would you identify the "elbow" in the inertia plot? Does it align with the true number of clusters (K=3) for our synthetic data?
  • Which K value yielded the highest Silhouette Score? How does this compare to the Elbow Method and the true K?
  • What are the strengths and weaknesses of each method (Elbow vs. Silhouette)? When might one be preferred over the other?
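
If you prefer a numerical heuristic to eyeballing the plots, a rough (and by no means definitive) rule is to take the largest second difference of the inertia curve as the elbow candidate and the argmax of the silhouette scores as the silhouette pick. A sketch reusing k_range, inertias, and silhouette_scores from the cell above:

In [ ]:
# Heuristic 1: largest second difference of inertia (a discrete-curvature proxy for the "sharpest bend")
inertia_arr = np.array(inertias)
second_diff = inertia_arr[:-2] - 2 * inertia_arr[1:-1] + inertia_arr[2:]
elbow_k = list(k_range)[1:-1][int(np.argmax(second_diff))]   # candidate elbow (interior K values only)

# Heuristic 2: K with the highest silhouette score
best_sil_k = list(k_range)[int(np.argmax(silhouette_scores))]

print(f"Elbow candidate (second-difference heuristic): K={elbow_k}")
print(f"Best K by silhouette score: K={best_sil_k}")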

Part 3: K-Means on a Real-World Dataset with Preprocessing

Real-world datasets often have features with different scales, which can heavily influence distance-based algorithms like K-means. We'll use the Wine dataset to demonstrate the importance of scaling and then evaluate clustering results. The Wine dataset has known classes, allowing us to use extrinsic evaluation metrics.

Tasks:

  • Load the Wine dataset.
  • Crucially: Perform feature scaling using StandardScaler. Explain why this is necessary.
  • Apply K-means to the scaled data.
  • Evaluate the clustering performance using:
    • silhouette_score (unsupervised)
    • homogeneity_score (supervised - measures if each cluster contains only members of a single class)
    • completeness_score (supervised - measures if all members of a given class are assigned to the same cluster)
    • adjusted_rand_score (supervised - measures similarity between two clusterings, correcting for chance)
  • Visualize the clusters using PCA for dimensionality reduction, as the Wine dataset has 13 features.
In [ ]:
# 1. Load the Wine dataset
wine = load_wine()
X_wine = pd.DataFrame(wine.data, columns=wine.feature_names)
y_true_wine = wine.target # True labels (for evaluation)

print(f"Wine dataset shape: {X_wine.shape}")
print(f"True number of classes in Wine dataset: {len(np.unique(y_true_wine))}")
print("\nOriginal Wine Data (first 5 rows):")
print(X_wine.head())

# 2. Feature Scaling
# K-means is a distance-based algorithm. Features with larger scales will
# dominate the distance calculations. Scaling ensures all features contribute equally.
scaler_wine = StandardScaler()
X_wine_scaled = scaler_wine.fit_transform(X_wine)
X_wine_scaled_df = pd.DataFrame(X_wine_scaled, columns=wine.feature_names)

print("\nScaled Wine Data (first 5 rows):")
print(X_wine_scaled_df.head())

# 3. Determine Optimal K for Wine dataset (using Elbow and Silhouette)
# The true number of classes is 3, so we test a range of K values around it.
k_range_wine = range(2, 7) # Test K from 2 to 6
inertias_wine = []
silhouette_scores_wine = []

for k in k_range_wine:
    kmeans_wine_opt = KMeans(n_clusters=k, random_state=random_state, n_init='auto')
    kmeans_wine_opt.fit(X_wine_scaled)
    inertias_wine.append(kmeans_wine_opt.inertia_)
    score_wine = silhouette_score(X_wine_scaled, kmeans_wine_opt.labels_)
    silhouette_scores_wine.append(score_wine)
    print(f"  Wine Data K={k}: Inertia={kmeans_wine_opt.inertia_:.2f}, Silhouette Score={score_wine:.4f}")

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(k_range_wine, inertias_wine, marker='o', linestyle='-')
plt.title('Elbow Method for Optimal K (Wine Data)')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.xticks(k_range_wine)
plt.grid(True, linestyle='--', alpha=0.6)

plt.subplot(1, 2, 2)
plt.plot(k_range_wine, silhouette_scores_wine, marker='o', linestyle='-', color='orange')
plt.title('Silhouette Score for Optimal K (Wine Data)')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.xticks(k_range_wine)
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

# Based on elbow/silhouette (and knowing true K is 3), let's proceed with K=3
optimal_k_wine = 3
kmeans_wine = KMeans(n_clusters=optimal_k_wine, random_state=random_state, n_init='auto')
kmeans_wine.fit(X_wine_scaled)
labels_wine = kmeans_wine.labels_

# 4. Evaluate the clustering performance
print(f"\n--- K-means Evaluation for Wine Dataset (K={optimal_k_wine}) ---")
print(f"Silhouette Score (unsupervised): {silhouette_score(X_wine_scaled, labels_wine):.4f}")
print(f"Homogeneity Score (supervised): {homogeneity_score(y_true_wine, labels_wine):.4f}")
print(f"Completeness Score (supervised): {completeness_score(y_true_wine, labels_wine):.4f}")
print(f"Adjusted Rand Score (supervised): {adjusted_rand_score(y_true_wine, labels_wine):.4f}")

# 5. Visualize clusters using PCA
# PCA reduces the dimensionality for visualization while retaining as much variance as possible.
pca = PCA(n_components=2)
X_wine_pca = pca.fit_transform(X_wine_scaled)

plt.figure(figsize=(10, 7))
sns.scatterplot(x=X_wine_pca[:, 0], y=X_wine_pca[:, 1], hue=labels_wine, palette='viridis', legend='full', s=80, alpha=0.7)
# Project the 13-dimensional centroids into the same 2-D PCA space before plotting
centroids_wine_pca = pca.transform(kmeans_wine.cluster_centers_)
plt.scatter(centroids_wine_pca[:, 0], centroids_wine_pca[:, 1], marker='X', s=200, color='red', label='Centroids', edgecolor='black')
plt.title(f'K-means Clusters (K={optimal_k_wine}) on Wine Data (PCA-reduced)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# Optional: Visualize true labels for comparison
plt.figure(figsize=(10, 7))
sns.scatterplot(x=X_wine_pca[:, 0], y=X_wine_pca[:, 1], hue=y_true_wine, palette='viridis', legend='full', s=80, alpha=0.7)
plt.title('True Classes of Wine Data (PCA-reduced)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Wine dataset shape: (178, 13)
True number of classes in Wine dataset: 3

Original Wine Data (first 5 rows):
   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   od280/od315_of_diluted_wines  proline  
0                          3.92   1065.0  
1                          3.40   1050.0  
2                          3.17   1185.0  
3                          3.45   1480.0  
4                          2.93    735.0  

Scaled Wine Data (first 5 rows):
    alcohol  malic_acid       ash  alcalinity_of_ash  magnesium  \
0  1.518613   -0.562250  0.232053          -1.169593   1.913905   
1  0.246290   -0.499413 -0.827996          -2.490847   0.018145   
2  0.196879    0.021231  1.109334          -0.268738   0.088358   
3  1.691550   -0.346811  0.487926          -0.809251   0.930918   
4  0.295700    0.227694  1.840403           0.451946   1.281985   

   total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  \
0       0.808997    1.034819             -0.659563         1.224884   
1       0.568648    0.733629             -0.820719        -0.544721   
2       0.808997    1.215533             -0.498407         2.135968   
3       2.491446    1.466525             -0.981875         1.032155   
4       0.808997    0.663351              0.226796         0.401404   

   color_intensity       hue  od280/od315_of_diluted_wines   proline  
0         0.251717  0.362177                      1.847920  1.013009  
1        -0.293321  0.406051                      1.113449  0.965242  
2         0.269020  0.318304                      0.788587  1.395148  
3         1.186068 -0.427544                      1.184071  2.334574  
4        -0.319276  0.362177                      0.449601 -0.037874  
  Wine Data K=2: Inertia=1661.68, Silhouette Score=0.2650
  Wine Data K=3: Inertia=1277.93, Silhouette Score=0.2849
  Wine Data K=4: Inertia=1211.75, Silhouette Score=0.2542
  Wine Data K=5: Inertia=1123.16, Silhouette Score=0.1836
  Wine Data K=6: Inertia=1079.54, Silhouette Score=0.1690
[Figure: Elbow method and silhouette score vs. K for the scaled Wine data, K=2 to 6.]
--- K-means Evaluation for Wine Dataset (K=3) ---
Silhouette Score (unsupervised): 0.2849
Homogeneity Score (supervised): 0.8788
Completeness Score (supervised): 0.8730
Adjusted Rand Score (supervised): 0.8975
[Figure: K-means clusters (K=3) on PCA-reduced Wine data, with centroids projected into PCA space.]
[Figure: True classes of the Wine data on the same PCA projection.]

Discussion Points:

  • Why was scaling particularly important for the Wine dataset before applying K-means?
  • Compare the Silhouette Score to the supervised metrics (Homogeneity, Completeness, Adjusted Rand Score). How do these metrics complement each other?
  • How well did K-means recover the true underlying structure of the Wine dataset based on your evaluation and visualizations? What challenges might arise in evaluating clustering without ground truth?
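
To make the first discussion point concrete, the sketch below fits K-means on both the raw and the scaled Wine features and compares the Adjusted Rand Index against the true labels; exact values may vary slightly with the scikit-learn version and seed:

In [ ]:
# Compare K-means on unscaled vs. scaled Wine features using the Adjusted Rand Index.
for name, X_in in [("unscaled", X_wine.values), ("scaled", X_wine_scaled)]:
    km = KMeans(n_clusters=3, random_state=random_state, n_init='auto')
    labels = km.fit_predict(X_in)
    ari = adjusted_rand_score(y_true_wine, labels)
    print(f"{name:>8}: ARI = {ari:.4f}")
# The unscaled run is typically dominated by large-scale features such as 'proline',
# so its ARI is usually much lower than that of the scaled run.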

Part 4: Initialization Strategies and Limitations

The initial placement of centroids can significantly impact the final clustering result, especially for complex datasets. K-means also has inherent limitations.

Tasks:

  • Generate another synthetic dataset that might highlight K-means' limitations (e.g., non-spherical clusters, varying densities, or outliers).
  • Compare k-means++ (default) vs. random initialization. Run K-means multiple times with n_init=1 for 'random' to see variability.
  • Visualize the results of different initialization strategies.
  • Discuss how K-means handles (or struggles with) non-spherical clusters, clusters of varying densities, and outliers.
In [ ]:
# 1. Generate a synthetic dataset to highlight limitations (e.g., non-spherical)
# Make two crescent shapes for non-spherical clusters (using make_moons)
from sklearn.datasets import make_moons
X_moons, y_true_moons = make_moons(n_samples=200, noise=0.05, random_state=random_state)
# Add some "outliers" manually
outliers = np.array([[2.0, 1.0], [-1.5, 1.5], [0.5, -1.0]])
X_moons = np.vstack([X_moons, outliers])
y_true_moons = np.hstack([y_true_moons, [2, 2, 2]]) # Assign a dummy label for outliers

print(f"Moons dataset shape (with outliers): {X_moons.shape}")

# 2. Compare 'k-means++' (default) vs. 'random' initialization
n_clusters_moons = 2 # We aim for the 2 main crescent clusters, ignoring the outliers for now

print("\n--- Comparing Initialization Methods for Moons Data (K=2) ---")

# k-means++ initialization
print("\nK-means with 'k-means++' initialization:")
kmeans_plusplus = KMeans(n_clusters=n_clusters_moons, init='k-means++', n_init=10, random_state=random_state)
kmeans_plusplus.fit(X_moons)
labels_plusplus = kmeans_plusplus.labels_
print(f"  Inertia (k-means++): {kmeans_plusplus.inertia_:.2f}")
print(f"  Silhouette Score (k-means++): {silhouette_score(X_moons, labels_plusplus):.4f}")

# Random initialization (each run keeps only a single random start, so results can vary from run to run)
# For demonstration, we run it a few times with n_init=1 and a different seed per run.
print("\nK-means with 'random' initialization (multiple runs):")
random_inits_labels = []
random_inits_inertia = []
random_inits_silhouette = []

# Perform multiple runs with random initialization (setting n_init=1 explicitly overrides auto for demonstration)
for i in range(3): # Run 3 times
    kmeans_random = KMeans(n_clusters=n_clusters_moons, init='random', n_init=1, random_state=random_state + i) # Varying seed
    kmeans_random.fit(X_moons)
    random_inits_labels.append(kmeans_random.labels_)
    random_inits_inertia.append(kmeans_random.inertia_)
    random_inits_silhouette.append(silhouette_score(X_moons, kmeans_random.labels_))
    print(f"  Run {i+1} (random): Inertia={random_inits_inertia[-1]:.2f}, Silhouette={random_inits_silhouette[-1]:.4f}")

# Visualize results of k-means++ (primary result)
plt.figure(figsize=(10, 7))
sns.scatterplot(x=X_moons[:, 0], y=X_moons[:, 1], hue=labels_plusplus, palette='viridis', legend='full', s=80, alpha=0.7)
plt.scatter(kmeans_plusplus.cluster_centers_[:, 0], kmeans_plusplus.cluster_centers_[:, 1], marker='X', s=200, color='red', label='Centroids', edgecolor='black')
plt.title(f'K-means with k-means++ Init on Moons Data (K={n_clusters_moons})')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# Visualize one of the random initialization runs (e.g., the first one)
plt.figure(figsize=(10, 7))
sns.scatterplot(x=X_moons[:, 0], y=X_moons[:, 1], hue=random_inits_labels[0], palette='viridis', legend='full', s=80, alpha=0.7)
plt.title(f'K-means with Random Init (Run 1) on Moons Data (K={n_clusters_moons})')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
Moons dataset shape (with outliers): (203, 2)

--- Comparing Initialization Methods for Moons Data (K=2) ---

K-means with 'k-means++' initialization:
  Inertia (k-means++): 86.98
  Silhouette Score (k-means++): 0.4813

K-means with 'random' initialization (multiple runs):
  Run 1 (random): Inertia=86.98, Silhouette=0.4813
  Run 2 (random): Inertia=86.98, Silhouette=0.4813
  Run 3 (random): Inertia=86.98, Silhouette=0.4813
[Figure: K-means with k-means++ initialization on the moons data (K=2), with centroids marked.]
[Figure: K-means with random initialization (run 1) on the moons data (K=2).]

Discussion Points:

  • How do the results from k-means++ compare to random initialization? What is the purpose of n_init in KMeans?
  • Based on the visualization of the "moons" dataset, how well did K-means perform? Why did it struggle to correctly identify the crescent-shaped clusters?
  • How might outliers affect K-means clustering? What characteristic of K-means makes it sensitive to outliers?
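
To make the outlier question concrete, the sketch below refits K-means on the moons data with and without the three hand-added outliers and compares the resulting centroids; because each centroid is a cluster mean, even a few extreme points can pull it toward them:

In [ ]:
# Fit K-means with and without the manually added outliers and compare the centroids.
X_moons_clean = X_moons[:-3]   # the last 3 rows are the outliers appended above

km_with = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(X_moons)
km_without = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(X_moons_clean)

print("Centroids with outliers:\n", km_with.cluster_centers_)
print("Centroids without outliers:\n", km_without.cluster_centers_)
# Centroids are means, so extreme points drag them; a median/medoid-based method
# (e.g., k-medoids) or prior outlier removal reduces this sensitivity.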

Part 5: Advanced Topics & Discussion

This section is for broader discussion and conceptual understanding, extending beyond direct coding.

Discussion Topics:

  • Comparison with other Clustering Algorithms: Briefly compare K-means to Hierarchical Clustering (e.g., Agglomerative Clustering) and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). When would you choose one over K-means?
  • Interpreting Clusters: Once clusters are formed, how would you go about interpreting their meaning? What techniques or analyses would you use to characterize each cluster?
  • K-means in Practice: What are some common challenges encountered when applying K-means to real-world data, and how might these be addressed?
  • Beyond Euclidean Distance: Discuss situations where Euclidean distance might not be the most appropriate similarity metric for K-means. What other distance metrics could be used? (e.g., Manhattan, Cosine, Mahalanobis).

1. Comparison with other Clustering Algorithms:

  • Hierarchical Clustering (e.g., Agglomerative):

    • Pros: Does not require pre-specifying K; produces a dendrogram (tree structure) revealing hierarchical relationships; can identify clusters of varying sizes and shapes (depending on linkage).
    • Cons: Computationally more expensive (especially for large datasets, $O(N^3)$ or $O(N^2)$); difficult to decide where to 'cut' the dendrogram; can be sensitive to noise.
    • When to choose over K-means: When the number of clusters is unknown or when a hierarchical structure is expected/desired; for smaller datasets where computational cost is less of an issue.
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

    • Pros: Can find arbitrarily shaped clusters; can identify outliers (noise points); does not require pre-specifying K (though requires eps and min_samples).
    • Cons: Sensitive to parameter choice (eps, min_samples); a single global eps struggles when clusters have widely varying densities; less effective on high-dimensional data without dimensionality reduction.
    • When to choose over K-means: When clusters are expected to be non-spherical or when identifying outliers is important; when density variations are distinct.
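
As a quick, untuned illustration of these trade-offs on the moons data from Part 4 (the eps value and linkage choice below are assumptions, not recommended settings):

In [ ]:
from sklearn.cluster import AgglomerativeClustering, DBSCAN

# DBSCAN on the full moons data (including the hand-added outliers):
# arbitrarily shaped clusters can be recovered, and outliers receive the label -1.
db = DBSCAN(eps=0.3, min_samples=5).fit(X_moons)
print("DBSCAN labels found:", np.unique(db.labels_))   # expect clusters 0, 1 plus noise label -1

# Single-linkage agglomerative clustering on the clean moons (outliers removed):
# chain-like merging can follow the crescent shapes that K-means cannot.
agg = AgglomerativeClustering(n_clusters=2, linkage='single').fit(X_moons[:-3])
print("Agglomerative cluster sizes:", np.bincount(agg.labels_))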

2. Interpreting Clusters:

  • Once clusters are formed, how do you derive meaning from them?
  • Techniques:
    • Descriptive Statistics: Calculate mean/median/mode of features for each cluster. Compare these statistics across clusters to see what differentiates them (e.g., 'Cluster 1 has high income, low age').
    • Feature Importance: Analyze which features contributed most to the clustering (e.g., by looking at centroid values or feature ranges within clusters).
    • Visualization: Create visualizations (box plots, bar charts, scatter plots) for key features broken down by cluster.
    • Domain Expertise: Bring in domain experts to validate and name the clusters based on the characteristics.
    • Qualitative Analysis: If data points have associated text or images, sample and review them.
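
A concrete starting point for interpreting the Wine clusters from Part 3 is to profile each cluster with per-feature means on the original (unscaled) scale. A minimal sketch reusing X_wine and labels_wine:

In [ ]:
# Profile each K-means cluster by the mean of every feature in original units.
wine_profile = X_wine.copy()
wine_profile['cluster'] = labels_wine
cluster_means = wine_profile.groupby('cluster').mean()

print("Per-cluster feature means (original units):")
print(cluster_means.round(2))

# Features whose means differ most across clusters are good starting points
# for naming/characterizing the clusters.
print("\nFeatures with the largest spread across cluster means:")
print((cluster_means.max() - cluster_means.min()).sort_values(ascending=False).head(5))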

3. K-means in Practice: Common Challenges and Solutions:

  • Challenges:
    • Sensitivity to Initial Centroids: Can converge to suboptimal local minima.
    • Requires Pre-defined K: Often difficult to know the optimal number of clusters.
    • Sensitivity to Outliers: Outliers can disproportionately influence centroid positions.
    • Assumes Spherical Clusters: Struggles with non-globular or irregularly shaped clusters.
    • Assumes Equal Cluster Variance/Density: Performs poorly if clusters have vastly different sizes or densities.
    • Scaling: Features on different scales can bias results.
  • Solutions:
    • Initialization: Use k-means++ (default in scikit-learn) or run K-means multiple times with different random initializations (n_init parameter) and pick the best result (lowest inertia).
    • Optimal K: Use Elbow Method, Silhouette Score, Gap Statistic, or domain knowledge.
    • Outliers: Pre-process by detecting and removing/transforming outliers; consider more robust clustering algorithms like DBSCAN.
    • Cluster Shape/Density: Use other algorithms like DBSCAN, Hierarchical Clustering with different linkage methods, or Gaussian Mixture Models.
    • Scaling: Always scale features (e.g., StandardScaler, MinMaxScaler) before applying K-means.

4. Beyond Euclidean Distance:

  • When might Euclidean distance not be appropriate for K-means? What alternatives exist?
  • When Euclidean is Not Appropriate:
    • Categorical Data: Euclidean distance is meaningless for purely categorical features.
    • High-Dimensional Data (Curse of Dimensionality): Distances become less meaningful in very high dimensions.
    • Text Data/Sparse Data: Where feature vectors are very sparse or represent word counts/frequencies.
    • Rotated/Scaled Clusters: If clusters are elongated or rotated, Euclidean distance might perform poorly unless features are transformed (e.g., PCA).
  • Alternative Distance Metrics (and when to use them):
    • Manhattan Distance (L1 norm): Sum of absolute differences. More robust to outliers than Euclidean; useful in grid-like movements.
    • Cosine Distance (computed as 1 − cosine similarity): Measures dissimilarity by the angle between two vectors rather than their magnitude. Commonly used for text analysis, recommendation systems, or high-dimensional sparse data where direction (pattern) matters more than magnitude (length).
    • Mahalanobis Distance: Accounts for correlations between variables and differences in scale. More complex, requires covariance matrix.
    • Hamming Distance: For binary or categorical data, counting positions where symbols differ.
    • Jaccard Distance: For sets or binary data, useful for presence/absence information.
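
scikit-learn's KMeans only supports Euclidean distance, but a cosine-style clustering can be approximated by L2-normalizing the rows first, since for unit vectors $\|a-b\|^2 = 2 - 2\cos(a,b)$. A minimal sketch of this approximation on the scaled Wine data (illustrative only; the centroid update still uses plain means, so this is not exact spherical K-means):

In [ ]:
from sklearn.preprocessing import normalize

# L2-normalize rows so that Euclidean distance between points is monotonically
# related to cosine distance (||a - b||^2 = 2 - 2*cos(a, b) for unit vectors).
X_wine_unit = normalize(X_wine_scaled, norm='l2')

kmeans_cosine = KMeans(n_clusters=3, random_state=random_state, n_init='auto')
labels_cosine = kmeans_cosine.fit_predict(X_wine_unit)

print(f"ARI (Euclidean on scaled data):        {adjusted_rand_score(y_true_wine, labels_wine):.4f}")
print(f"ARI (cosine-style on normalized data): {adjusted_rand_score(y_true_wine, labels_cosine):.4f}")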

Prepared By

Md. Atikuzzaman
Lecturer
Department of Computer Science and Engineering
Green University of Bangladesh
Email: atik@cse.green.edu.bd