Efficient Frequent Itemset Mining with the FP-Growth Algorithm¶
Key Topics: Frequent Pattern Mining, FP-Tree (Frequent Pattern Tree), Conditional Pattern Base, Conditional FP-Tree, "Divide and Conquer" Strategy, Performance Comparison with Apriori.
Objectives:¶
- Students will understand the primary limitations of the Apriori algorithm, specifically its costly candidate generation process.
- Students will grasp the theoretical basis of FP-Growth as a faster alternative that avoids candidate generation.
- Students will understand the two-step "divide and conquer" strategy of FP-Growth:
- Building a compact FP-Tree data structure.
- Recursively mining the tree to find frequent itemsets.
- Students will implement the FP-Growth algorithm using the mlxtend library.
- Students will conduct a direct performance comparison to empirically verify the speed advantage of FP-Growth over Apriori on the same real-world dataset.
- Students will derive and interpret association rules from the results generated by FP-Growth.
Setup: Install and Import Libraries¶
We will use mlxtend for both the FP-Growth and Apriori implementations to ensure a fair comparison.
# Install the mlxtend library
!pip install mlxtend -q
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import mlxtend
import time # To measure execution time
# MLxtend for FP-Growth, Apriori, and association rules
from mlxtend.frequent_patterns import fpgrowth, apriori, association_rules
# Set plot style for better aesthetics
sns.set_theme(style="whitegrid")
sns.set_context("notebook", font_scale=1.2)
# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')
print(f"Pandas Version: {pd.__version__}")
print(f"MLxtend Version: {mlxtend.__version__}")
print(f"Scikit-learn Version: {sklearn.__version__}")
Pandas Version: 2.2.2
MLxtend Version: 0.23.4
Scikit-learn Version: 1.6.1
Part 1: The Problem with Apriori & The FP-Growth Solution¶
In the previous lab, we explored the Apriori algorithm. Its main weakness is the candidate generation step. To find frequent itemsets of size k, it must first generate all possible candidate itemsets of size k and then scan the entire dataset to count their support. For large datasets with many unique items, this process can be incredibly slow and memory-intensive.
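To get a feel for the scale, here is a quick back-of-the-envelope calculation; the item count (1,562) is the number of unique items in this lab's France basket (see Part 2):

# Rough scale of Apriori's candidate space for pairs alone.
from math import comb

n_items = 1562  # unique items in the one-hot basket built in Part 2
print(f"Possible 2-itemset candidates: {comb(n_items, 2):,}")
# -> 1,219,141 pairs, each needing a support count over every transaction,
#    and each surviving level of the search spawns a new candidate set.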
The FP-Growth Solution

The FP-Growth algorithm was designed to overcome this bottleneck. It's a "divide and conquer" algorithm that is often significantly faster than Apriori. It works in two main steps and avoids explicit candidate generation entirely:

1. Build the FP-Tree: FP-Growth scans the dataset once to find all frequent 1-itemsets. It then scans the data a second time to build a highly compact tree structure called the FP-Tree. This tree stores the entire transactional database in a compressed form, with more frequent items placed closer to the root.
2. Mine the FP-Tree: The algorithm then recursively mines this tree to find frequent itemsets. It starts with the least frequent items and explores their "conditional pattern bases" (the prefix paths in which they appear) to build small, conditional FP-Trees, from which it extracts frequent itemsets. This recursive process is far more efficient than generating and counting an exponential number of candidate itemsets.
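To see the API in action before the real dataset, here is a minimal, self-contained sketch on a hypothetical five-transaction toy basket (the transactions and the 0.4 support threshold are made up for illustration):

# A minimal toy run of fpgrowth on five hypothetical transactions.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

toy_transactions = [
    ['milk', 'bread', 'butter'],
    ['milk', 'bread'],
    ['milk', 'eggs'],
    ['bread', 'butter'],
    ['milk', 'bread', 'butter', 'eggs'],
]

# One-hot encode into the boolean DataFrame format that fpgrowth expects.
te = TransactionEncoder()
toy_df = pd.DataFrame(te.fit(toy_transactions).transform(toy_transactions),
                      columns=te.columns_)

# fpgrowth scans the data twice (item counts, then tree construction) and
# mines the FP-Tree recursively -- there is no candidate-generation step.
print(fpgrowth(toy_df, min_support=0.4, use_colnames=True))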
Part 2: Dataset & Preprocessing¶
To directly compare FP-Growth with Apriori, we will use the exact same real-world Online Retail dataset and apply the identical preprocessing steps from the previous lab. This ensures our performance comparison is fair and accurate.
Recap of Preparation Steps:
- Load the data from the UCI repository.
- Clean the data by removing nulls and credit/return transactions.
- Transform the sales log for a single country (France) into a one-hot encoded transactional format, where each row is an invoice and each column is an item.
print("--- Part 2: Dataset & Preprocessing ---")
# 1. Load the dataset
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx'
df = pd.read_excel(url)
# 2. Clean the data (same steps as the Apriori lab)
df.dropna(axis=0, subset=['InvoiceNo', 'Description'], inplace=True)
df['Description'] = df['Description'].str.strip()
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]
# 3. Transform the data into a one-hot encoded transactional format for France
basket_sets = (df[df['Country'] == "France"]
.groupby(['InvoiceNo', 'Description'])['Quantity']
.sum().unstack().reset_index().fillna(0)
.set_index('InvoiceNo'))
def encode_units(x):
    # Any positive quantity counts as a purchase; zero or negative does not.
    return 1 if x >= 1 else 0

# Note: in newer pandas (>= 2.1), DataFrame.map is the preferred equivalent of applymap.
basket_sets = basket_sets.applymap(encode_units)
if 'POSTAGE' in basket_sets.columns:
basket_sets.drop('POSTAGE', inplace=True, axis=1)
print("Preprocessing complete. The data is in the same format as the Apriori lab.")
print("Final data shape for analysis:", basket_sets.shape)
print(basket_sets.head())
--- Part 2: Dataset & Preprocessing ---
Preprocessing complete. The data is in the same format as the Apriori lab.
Final data shape for analysis: (392, 1562)
[basket_sets.head(): a 5 x 1562 one-hot preview indexed by InvoiceNo, one column per item description (from '10 COLOUR SPACEBOY PEN' through 'ZINC T-LIGHT HOLDER STARS SMALL'); all displayed entries are 0]
Part 3: Implementation of FP-Growth¶
Now we will apply the fpgrowth function from mlxtend to our prepared data. The function takes the same primary arguments as apriori, making it easy to use and compare. We will record the execution time.
print("\n--- Part 3: Running the FP-Growth Algorithm ---")
# Define the support threshold
min_support_threshold = 0.07
# 1. Run the FP-Growth algorithm and measure the time
start_time_fp = time.time()
frequent_itemsets_fp = fpgrowth(basket_sets, min_support=min_support_threshold, use_colnames=True)
end_time_fp = time.time()
time_fp = end_time_fp - start_time_fp
print(f"FP-Growth found {len(frequent_itemsets_fp)} frequent itemsets.")
print(f"Execution time: {time_fp:.4f} seconds.")
# Display the top frequent itemsets
print("\nTop 10 Frequent Itemsets found by FP-Growth:")
print(frequent_itemsets_fp.sort_values(by='support', ascending=False).head(10))
--- Part 3: Running the FP-Growth Algorithm ---
FP-Growth found 51 frequent itemsets.
Execution time: 0.0486 seconds.
Top 10 Frequent Itemsets found by FP-Growth:
support itemsets
39 0.188776 (RABBIT NIGHT LIGHT)
0 0.181122 (RED TOADSTOOL LED NIGHT LIGHT)
12 0.170918 (PLASTERS IN TIN WOODLAND ANIMALS)
23 0.168367 (PLASTERS IN TIN CIRCUS PARADE)
1 0.158163 (ROUND SNACK BOXES SET OF4 WOODLAND)
7 0.153061 (LUNCH BAG RED RETROSPOT)
8 0.142857 (LUNCH BOX WITH CUTLERY RETROSPOT)
13 0.137755 (PLASTERS IN TIN SPACEBOY)
20 0.137755 (SET/6 RED SPOTTY PAPER CUPS)
9 0.137755 (RED RETROSPOT MINI CASES)
Part 4: Performance Comparison: FP-Growth vs. Apriori¶
This is the key part of the lab. We will now run the Apriori algorithm on the exact same data with the exact same support threshold to directly compare computational performance. Keep in mind that FP-Growth's advantage is most pronounced on large, dense datasets and at low support thresholds; on a small basket like ours (392 transactions), the fixed cost of building the FP-Tree can actually make it slower than Apriori, as the recorded run below shows. Task 1 explores how the gap shifts when the support threshold is lowered.
print("\n--- Part 4: Performance Comparison vs. Apriori ---")
# 1. Run the Apriori algorithm and measure the time
start_time_ap = time.time()
frequent_itemsets_ap = apriori(basket_sets, min_support=min_support_threshold, use_colnames=True)
end_time_ap = time.time()
time_ap = end_time_ap - start_time_ap
print(f"Apriori found {len(frequent_itemsets_ap)} frequent itemsets.")
print(f"Execution time: {time_ap:.4f} seconds.")
# 2. Compare the results
print("\n--- Performance Summary ---")
print(f"FP-Growth Execution Time: {time_fp:.4f} seconds")
print(f"Apriori Execution Time: {time_ap:.4f} seconds")
if time_fp > 0 and time_ap > 0:
    ratio = time_ap / time_fp
    if ratio >= 1:
        print(f"\nFP-Growth was {ratio:.2f} times faster than Apriori.")
    else:
        print(f"\nApriori was {1 / ratio:.2f} times faster than FP-Growth on this run.")
# Sanity check: both algorithms must discover the exact same frequent itemsets.
# Sorting by support alone is not reliable (tied supports can order rows
# differently), so we compare the (itemset, support) pairs as sets instead.
pairs_fp = set(zip(frequent_itemsets_fp['itemsets'], frequent_itemsets_fp['support'].round(10)))
pairs_ap = set(zip(frequent_itemsets_ap['itemsets'], frequent_itemsets_ap['support'].round(10)))
are_equal = pairs_fp == pairs_ap
print(f"Do both algorithms produce the same result? {are_equal}")
--- Part 4: Performance Comparison vs. Apriori ---
Apriori found 51 frequent itemsets.
Execution time: 0.0172 seconds.

--- Performance Summary ---
FP-Growth Execution Time: 0.0486 seconds
Apriori Execution Time: 0.0172 seconds

Apriori was 2.83 times faster than FP-Growth on this run.
Do both algorithms produce the same result? True
Part 5: Generating and Interpreting Association Rules¶
Once the frequent itemsets have been identified, the process for generating association rules is identical regardless of whether you used FP-Growth or Apriori. We simply feed the resulting DataFrame into the association_rules function.
print("\n--- Part 5: Generating Association Rules from FP-Growth Results ---")
# 1. Generate association rules from the FP-Growth frequent itemsets
# The result should be identical to the rules from the Apriori lab
rules = association_rules(frequent_itemsets_fp, metric="lift", min_threshold=1)
# 2. Filter for strong and interesting rules
strong_rules = rules[(rules['lift'] >= 4) & (rules['confidence'] >= 0.8)].sort_values(by='lift', ascending=False)
print("\nTop Strong Association Rules found:")
print(strong_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
--- Part 5: Generating Association Rules from FP-Growth Results ---
Top Strong Association Rules found:
    antecedents                                          consequents                            support  confidence      lift
2   (ALARM CLOCK BAKELIKE GREEN)                         (ALARM CLOCK BAKELIKE RED)            0.079082    0.815789  8.642959
3   (ALARM CLOCK BAKELIKE RED)                           (ALARM CLOCK BAKELIKE GREEN)          0.079082    0.837838  8.642959
14  (SET/6 RED SPOTTY PAPER CUPS, SET/20 RED RETRO...)   (SET/6 RED SPOTTY PAPER PLATES)       0.099490    0.975000  7.644000
16  (SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...)   (SET/6 RED SPOTTY PAPER CUPS)         0.099490    0.975000  7.077778
11  (SET/6 RED SPOTTY PAPER PLATES)                      (SET/6 RED SPOTTY PAPER CUPS)         0.122449    0.960000  6.968889
10  (SET/6 RED SPOTTY PAPER CUPS)                        (SET/6 RED SPOTTY PAPER PLATES)       0.122449    0.888889  6.968889
15  (SET/6 RED SPOTTY PAPER CUPS, SET/6 RED SPOTTY...)   (SET/20 RED RETROSPOT PAPER NAPKINS)  0.099490    0.812500  6.125000
13  (SET/6 RED SPOTTY PAPER PLATES)                      (SET/20 RED RETROSPOT PAPER NAPKINS)  0.102041    0.800000  6.030769
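The metrics are easy to verify by hand: confidence(A -> B) = support(A and B) / support(A), and lift(A -> B) = confidence(A -> B) / support(B). The snippet below re-derives the lift of the top rule purely from the numbers in the table above:

# Hand-check the metrics for (ALARM CLOCK BAKELIKE GREEN) -> (ALARM CLOCK BAKELIKE RED).
support_pair = 0.079082   # support of the itemset {GREEN, RED}
conf_g_to_r  = 0.815789   # confidence of GREEN -> RED (row 2)
conf_r_to_g  = 0.837838   # confidence of RED -> GREEN (row 3)

support_red = support_pair / conf_r_to_g   # support(RED), ~0.0944
lift = conf_g_to_r / support_red           # ~8.6430, matching the table (up to rounding)
print(f"support(RED) = {support_red:.6f}, lift = {lift:.6f}")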
Part 6: Lab Tasks & Exercises¶
Now, apply what you've learned. The following tasks will help you explore the performance characteristics and practical application of FP-Growth.
# --- TASK 1: The Impact of Support on Performance ---
# Lowering the support threshold creates an exponentially larger number of frequent itemsets, which is where
# Apriori's candidate generation really struggles.
# Run both FP-Growth and Apriori with a LOWER `min_support` of 0.04.
# How much wider does the performance gap become?
# YOUR CODE HERE
# low_support = 0.04
#
# # Time FP-Growth
# start_fp_low = time.time()
# fpgrowth(basket_sets, min_support=low_support, use_colnames=True)
# end_fp_low = time.time()
# time_fp_low = end_fp_low - start_fp_low
#
# # Time Apriori
# start_ap_low = time.time()
# apriori(basket_sets, min_support=low_support, use_colnames=True)
# end_ap_low = time.time()
# time_ap_low = end_ap_low - start_ap_low
#
# print("--- Task 1: Performance with min_support=0.04 ---")
# print(f"FP-Growth Time (low support): {time_fp_low:.4f}s")
# print(f"Apriori Time (low support): {time_ap_low:.4f}s")
# print(f"New Speedup: {time_ap_low / time_fp_low:.2f}x")
# --- TASK 2: Find Rules with a Specific Consequent ---
# The marketing team wants to know: what items lead customers to buy an 'ALARM CLOCK BAKELIKE GREEN'?
# Filter the original `rules` DataFrame to find all rules where the *consequent* is {'ALARM CLOCK BAKELIKE GREEN'}.
# Sort them by confidence to see the strongest predictors.
# YOUR CODE HERE
# target_consequent = frozenset({'ALARM CLOCK BAKELIKE GREEN'})
# consequent_rules = rules[rules['consequents'] == target_consequent].sort_values(by='confidence', ascending=False)
# print("\n--- Task 2: Rules that lead to buying a GREEN ALARM CLOCK ---")
# print(consequent_rules)
# --- TASK 3: Business Interpretation ---
# From the results of Task 2, choose the rule with the highest confidence.
# Write a short, actionable recommendation for an e-commerce manager based on this rule.
# print("\n--- Task 3: Business Recommendation ---")
# print("Recommendation: The analysis shows that customers who buy the 'ALARM CLOCK BAKELIKE RED' are extremely")
# print("likely (with 90% confidence) to also purchase the 'ALARM CLOCK BAKELIKE GREEN'.")
# print("Action: On the product page for the red alarm clock, we should actively recommend the green alarm clock")
# print("as a complementary item under a 'Customers Also Bought' or 'Complete the Set' section. This could significantly boost sales of the green clock.")
Part 7: Advanced Topics & Discussion¶
How the FP-Tree Works (Intuition): The FP-Tree is the core of the algorithm's efficiency. Imagine a tree where each node is an item, and the path from the root to a node represents a transaction. More frequent items are placed closer to the root. Each node also contains a counter for how many transactions share that path. Crucially, a "header table" stores pointers to all nodes of a particular item, allowing the algorithm to quickly traverse all transactions containing that item without rescanning the database.
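To make the structure concrete, below is a deliberately simplified sketch of the tree-building step. This is illustrative toy code, not mlxtend's internal implementation, and the recursive conditional mining is omitted:

# A simplified, illustrative FP-Tree builder.
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item name (None for the root)
        self.count = 1          # number of transactions sharing this path
        self.parent = parent
        self.children = {}      # item -> FPNode

def build_fp_tree(transactions, min_count):
    # Pass 1: count item frequencies and keep only the frequent items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    root = FPNode(None, None)
    header_table = defaultdict(list)  # item -> all nodes holding that item

    # Pass 2: insert each transaction, most frequent items first,
    # so that common prefixes share nodes near the root.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1
            else:
                child = FPNode(item, node)
                node.children[item] = child
                header_table[item].append(child)
            node = node.children[item]
    return root, header_table

root, header = build_fp_tree([['milk', 'bread', 'butter'],
                              ['milk', 'bread'],
                              ['bread', 'butter']], min_count=2)
# The header table lets the miner jump straight to every occurrence of an item.
print({item: [n.count for n in nodes] for item, nodes in header.items()})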
FP-Growth vs. Apriori - When to Choose Which?
- FP-Growth: This should be your default choice. It is generally much faster, especially on large datasets or when using low support thresholds. Its main potential drawback is memory usage, as the FP-Tree must be held in memory.
- Apriori: While slower, its implementation can be more straightforward if building from scratch, and it may be memory-efficient for certain sparse datasets where the number of candidates generated at each step is not overwhelming. For most practical applications using libraries like mlxtend, FP-Growth is superior.
Modern Frequent Pattern Mining: For truly massive datasets ("Big Data") that do not fit into the memory of a single machine, frequent pattern mining is performed on distributed computing platforms. Apache Spark has a popular, built-in parallelized implementation of FP-Growth that can scale to terabytes of data by distributing the construction and mining of the FP-Tree across a cluster of computers.
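As an illustrative sketch of Spark's built-in implementation (assuming a local pyspark installation; the three-transaction dataset is made up for the example):

# A minimal PySpark FP-Growth sketch (pip install pyspark).
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fp-growth-demo").getOrCreate()

# Each row holds one transaction as an array of items.
data = spark.createDataFrame([
    (0, ['milk', 'bread', 'butter']),
    (1, ['milk', 'bread']),
    (2, ['bread', 'butter']),
], ['id', 'items'])

fp = FPGrowth(itemsCol='items', minSupport=0.5, minConfidence=0.6)
model = fp.fit(data)          # builds and mines the FP-Tree across the cluster
model.freqItemsets.show()     # frequent itemsets with their counts
model.associationRules.show() # rules with confidence and lift
spark.stop()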
Prepared By
Md. Atikuzzaman
Lecturer
Department of Computer Science and Engineering
Green University of Bangladesh
Email: atik@cse.green.edu.bd