Efficient Frequent Itemset Mining with the FP-Growth Algorithm¶

Frequent Pattern Mining, FP-Tree (Frequent Pattern Tree), Conditional Pattern Base, Conditional FP-Tree, "Divide and Conquer" Strategy, Performance Comparison with Apriori.

Objectives:¶

  • Students will understand the primary limitations of the Apriori algorithm, specifically its costly candidate generation process.
  • Students will grasp the theoretical basis of FP-Growth as a faster alternative that avoids candidate generation.
  • Students will understand the two-step "divide and conquer" strategy of FP-Growth:
    1. Building a compact FP-Tree data structure.
    2. Recursively mining the tree to find frequent itemsets.
  • Students will implement the FP-Growth algorithm using the mlxtend library.
  • Students will conduct a direct performance comparison to empirically verify the speed advantage of FP-Growth over Apriori on the same real-world dataset.
  • Students will derive and interpret association rules from the results generated by FP-Growth.

Setup: Install and Import Libraries¶

We will use mlxtend for both the FP-Growth and Apriori implementations to ensure a fair comparison.

In [1]:
# Install the mlxtend library
!pip install mlxtend -q

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import mlxtend
import time # To measure execution time

# MLxtend for FP-Growth, Apriori, and association rules
from mlxtend.frequent_patterns import fpgrowth, apriori, association_rules

# Set plot style for better aesthetics
sns.set_theme(style="whitegrid")
sns.set_context("notebook", font_scale=1.2)


# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print(f"Pandas Version: {pd.__version__}")
print(f"MLxtend Version: {mlxtend.__version__}")
print(f"Scikit-learn Version: {sklearn.__version__}")
Pandas Version: 2.2.2
MLxtend Version: 0.23.4
Scikit-learn Version: 1.6.1

Part 1: The Problem with Apriori & The FP-Growth Solution¶

In the previous lab, we explored the Apriori algorithm. Its main weakness is the candidate generation step. To find frequent itemsets of size k, it must first generate candidate itemsets of size k (by joining the frequent itemsets of size k-1) and then scan the entire dataset to count their support. For large datasets with many unique items, the number of candidates can still explode, making the process slow and memory-intensive.
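To get a feel for the scale of the problem, here is a small illustrative snippet (not part of the lab pipeline; it only uses Python's built-in math.comb and the 1,562 unique items we will see in the Part 2 data) that counts how many candidate itemsets of each size are possible in the worst case:

from math import comb

n_items = 1562  # number of unique items in the France basket matrix built in Part 2
for k in range(1, 4):
    print(f"Possible {k}-itemsets over {n_items} items: {comb(n_items, k):,}")

# Apriori prunes candidates aggressively using the downward-closure property,
# but in the worst case it must still generate, store, and count support for a
# very large candidate set, scanning the data once per itemset size.

Even with pruning, every surviving candidate requires a pass over the one-hot matrix to count its support; this repeated generate-and-count cycle is exactly what FP-Growth avoids.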

The FP-Growth Solution

The FP-Growth algorithm was designed to overcome this bottleneck. It is a "divide and conquer" algorithm that is often significantly faster than Apriori. It works in two main steps and avoids explicit candidate generation entirely:

  1. Build the FP-Tree: FP-Growth scans the dataset once to find all frequent 1-itemsets. It then scans the data a second time to build a highly compact tree structure called the FP-Tree. This tree stores the entire transactional database in a compressed form, with more frequent items closer to the root.

  2. Mine the FP-Tree: The algorithm then recursively mines this tree to find frequent itemsets. It starts with the least frequent items and explores their "conditional pattern bases" (the prefix paths in which they appear) to build small, conditional FP-Trees, from which it extracts frequent itemsets. This recursion over a compact tree is far cheaper than generating and support-counting an enormous number of candidate itemsets.
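To make these two steps concrete, below is a minimal, hypothetical sketch on a toy five-transaction dataset. The item names and the nested-dict "tree" are purely illustrative; this is not mlxtend's internal implementation or data structure.

from collections import Counter

min_count = 2
transactions = [
    ['bread', 'milk'],
    ['bread', 'diapers', 'beer'],
    ['milk', 'diapers', 'beer'],
    ['bread', 'milk', 'diapers'],
    ['bread', 'milk', 'beer'],
]

# Scan 1: count item frequencies and keep only items meeting min_count
item_counts = Counter(item for t in transactions for item in t)
frequent = {item: c for item, c in item_counts.items() if c >= min_count}

# Scan 2: insert each transaction with its items sorted by descending frequency,
# so transactions that share a frequent prefix share a path in the tree
fp_tree = {}  # each node is stored as item -> [count, children-dict]
for t in transactions:
    ordered = sorted((i for i in t if i in frequent),
                     key=lambda i: (-frequent[i], i))
    node = fp_tree
    for item in ordered:
        entry = node.setdefault(item, [0, {}])
        entry[0] += 1
        node = entry[1]

print(fp_tree)
# Step 2 (mining) would start from the least frequent item, collect the prefix
# paths that lead to it (its "conditional pattern base"), build a small
# conditional FP-Tree from those paths, and recurse.

All four toy transactions that contain bread share the single bread node directly under the root (count 4); this prefix sharing is the compression that makes the recursive mining step cheap.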

Part 2: Dataset & Preprocessing¶

To directly compare FP-Growth with Apriori, we will use the exact same real-world Online Retail dataset and apply the identical preprocessing steps from the previous lab. This ensures our performance comparison is fair and accurate.

Recap of Preparation Steps:

  1. Load the data from the UCI repository.
  2. Clean the data by removing nulls and credit/return transactions.
  3. Transform the sales log for a single country (France) into a one-hot encoded transactional format, where each row is an invoice and each column is an item.
In [2]:
print("--- Part 2: Dataset & Preprocessing ---")

# 1. Load the dataset
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx'
df = pd.read_excel(url)

# 2. Clean the data (same steps as the Apriori lab)
df.dropna(axis=0, subset=['InvoiceNo', 'Description'], inplace=True)
df['Description'] = df['Description'].str.strip()
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]

# 3. Transform the data into a one-hot encoded transactional format for France
basket_sets = (df[df['Country'] == "France"]
               .groupby(['InvoiceNo', 'Description'])['Quantity']
               .sum().unstack().reset_index().fillna(0)
               .set_index('InvoiceNo'))

def encode_units(x):
    # Flag the item as purchased (1) for any positive quantity, otherwise 0
    return 1 if x >= 1 else 0

# Note: DataFrame.applymap is deprecated in recent pandas versions in favor of
# DataFrame.map; both perform the same element-wise transformation here.
basket_sets = basket_sets.applymap(encode_units)
if 'POSTAGE' in basket_sets.columns:
    basket_sets.drop('POSTAGE', inplace=True, axis=1)

print("Preprocessing complete. The data is in the same format as the Apriori lab.")
print("Final data shape for analysis:", basket_sets.shape)
print(basket_sets.head())
--- Part 2: Dataset & Preprocessing ---
Preprocessing complete. The data is in the same format as the Apriori lab.
Final data shape for analysis: (392, 1562)
Description  10 COLOUR SPACEBOY PEN  12 COLOURED PARTY BALLOONS  \
InvoiceNo                                                         
536370                            0                           0   
536852                            0                           0   
536974                            0                           0   
537065                            0                           0   
537463                            0                           0   

Description  12 EGG HOUSE PAINTED WOOD  12 MESSAGE CARDS WITH ENVELOPES  \
InvoiceNo                                                                 
536370                               0                                0   
536852                               0                                0   
536974                               0                                0   
537065                               0                                0   
537463                               0                                0   

Description  12 PENCIL SMALL TUBE WOODLAND  \
InvoiceNo                                    
536370                                   0   
536852                                   0   
536974                                   0   
537065                                   0   
537463                                   0   

Description  12 PENCILS SMALL TUBE RED RETROSPOT  12 PENCILS SMALL TUBE SKULL  \
InvoiceNo                                                                       
536370                                         0                            0   
536852                                         0                            0   
536974                                         0                            0   
537065                                         0                            0   
537463                                         0                            0   

Description  12 PENCILS TALL TUBE POSY  12 PENCILS TALL TUBE RED RETROSPOT  \
InvoiceNo                                                                    
536370                               0                                   0   
536852                               0                                   0   
536974                               0                                   0   
537065                               0                                   0   
537463                               0                                   0   

Description  12 PENCILS TALL TUBE WOODLAND  ...  WRAP VINTAGE PETALS  DESIGN  \
InvoiceNo                                   ...                                
536370                                   0  ...                            0   
536852                                   0  ...                            0   
536974                                   0  ...                            0   
537065                                   0  ...                            0   
537463                                   0  ...                            0   

Description  YELLOW COAT RACK PARIS FASHION  YELLOW GIANT GARDEN THERMOMETER  \
InvoiceNo                                                                      
536370                                    0                                0   
536852                                    0                                0   
536974                                    0                                0   
537065                                    0                                0   
537463                                    0                                0   

Description  YELLOW SHARK HELICOPTER  ZINC  STAR T-LIGHT HOLDER  \
InvoiceNo                                                         
536370                             0                          0   
536852                             0                          0   
536974                             0                          0   
537065                             0                          0   
537463                             0                          0   

Description  ZINC FOLKART SLEIGH BELLS  ZINC HERB GARDEN CONTAINER  \
InvoiceNo                                                            
536370                               0                           0   
536852                               0                           0   
536974                               0                           0   
537065                               0                           0   
537463                               0                           0   

Description  ZINC METAL HEART DECORATION  ZINC T-LIGHT HOLDER STAR LARGE  \
InvoiceNo                                                                  
536370                                 0                               0   
536852                                 0                               0   
536974                                 0                               0   
537065                                 0                               0   
537463                                 0                               0   

Description  ZINC T-LIGHT HOLDER STARS SMALL  
InvoiceNo                                     
536370                                     0  
536852                                     0  
536974                                     0  
537065                                     0  
537463                                     0  

[5 rows x 1562 columns]

Part 3: Implementation of FP-Growth¶

Now we will apply the fpgrowth function from mlxtend to our prepared data. The function takes the same primary arguments as apriori, making it easy to use and compare. We will record the execution time.

In [3]:
print("\n--- Part 3: Running the FP-Growth Algorithm ---")

# Define the support threshold
min_support_threshold = 0.07

# 1. Run the FP-Growth algorithm and measure the time
start_time_fp = time.time()
frequent_itemsets_fp = fpgrowth(basket_sets, min_support=min_support_threshold, use_colnames=True)
end_time_fp = time.time()

time_fp = end_time_fp - start_time_fp

print(f"FP-Growth found {len(frequent_itemsets_fp)} frequent itemsets.")
print(f"Execution time: {time_fp:.4f} seconds.")

# Display the top frequent itemsets
print("\nTop 10 Frequent Itemsets found by FP-Growth:")
print(frequent_itemsets_fp.sort_values(by='support', ascending=False).head(10))
--- Part 3: Running the FP-Growth Algorithm ---
FP-Growth found 51 frequent itemsets.
Execution time: 0.0486 seconds.

Top 10 Frequent Itemsets found by FP-Growth:
     support                              itemsets
39  0.188776                  (RABBIT NIGHT LIGHT)
0   0.181122       (RED TOADSTOOL LED NIGHT LIGHT)
12  0.170918    (PLASTERS IN TIN WOODLAND ANIMALS)
23  0.168367       (PLASTERS IN TIN CIRCUS PARADE)
1   0.158163  (ROUND SNACK BOXES SET OF4 WOODLAND)
7   0.153061             (LUNCH BAG RED RETROSPOT)
8   0.142857    (LUNCH BOX WITH CUTLERY RETROSPOT)
13  0.137755            (PLASTERS IN TIN SPACEBOY)
20  0.137755         (SET/6 RED SPOTTY PAPER CUPS)
9   0.137755            (RED RETROSPOT MINI CASES)

Part 4: Performance Comparison: FP-Growth vs. Apriori¶

This is the key part of the lab. We will now run the Apriori algorithm on the exact same data with the exact same support threshold to directly compare their computational performance. FP-Growth's design pays off most on large or dense datasets and at low support thresholds; on a small basket matrix like this one (392 invoices), the overhead of building the FP-Tree can make the two algorithms comparable, or even leave Apriori ahead. Task 1 at the end of the lab examines how the picture changes when the support threshold is lowered.

In [4]:
print("\n--- Part 4: Performance Comparison vs. Apriori ---")

# 1. Run the Apriori algorithm and measure the time
start_time_ap = time.time()
frequent_itemsets_ap = apriori(basket_sets, min_support=min_support_threshold, use_colnames=True)
end_time_ap = time.time()

time_ap = end_time_ap - start_time_ap

print(f"Apriori found {len(frequent_itemsets_ap)} frequent itemsets.")
print(f"Execution time: {time_ap:.4f} seconds.")

# 2. Compare the results
print("\n--- Performance Summary ---")
print(f"FP-Growth Execution Time: {time_fp:.4f} seconds")
print(f"Apriori Execution Time:   {time_ap:.4f} seconds")
if time_fp > 0 and time_ap > 0:
    if time_ap >= time_fp:
        print(f"\nFP-Growth was {time_ap / time_fp:.2f}x faster than Apriori on this run.")
    else:
        print(f"\nApriori was {time_fp / time_ap:.2f}x faster than FP-Growth on this run.")

# Sanity Check: both algorithms should find the exact same set of frequent itemsets.
# Note: this row-by-row DataFrame comparison is sensitive to row order. Itemsets with
# tied support values come back in different orders from the two functions, so the
# check can print False even though the itemsets and their supports are identical.
# A more robust check is to compare set(frequent_itemsets_fp['itemsets']) with
# set(frequent_itemsets_ap['itemsets']).
are_equal = frequent_itemsets_fp.sort_values(by='support').reset_index(drop=True).equals(
            frequent_itemsets_ap.sort_values(by='support').reset_index(drop=True))
print(f"Do both algorithms produce the same result? {are_equal}")
--- Part 4: Performance Comparison vs. Apriori ---
Apriori found 51 frequent itemsets.
Execution time: 0.0172 seconds.

--- Performance Summary ---
FP-Growth Execution Time: 0.0486 seconds
Apriori Execution Time:   0.0172 seconds

Apriori was 2.83x faster than FP-Growth on this run.
Do both algorithms produce the same result? False

Part 5: Generating and Interpreting Association Rules¶

Once the frequent itemsets have been identified, the process for generating association rules is identical regardless of whether you used FP-Growth or Apriori. We simply feed the resulting DataFrame into the association_rules function.

In [5]:
print("\n--- Part 5: Generating Association Rules from FP-Growth Results ---")

# 1. Generate association rules from the FP-Growth frequent itemsets
# The result should be identical to the rules from the Apriori lab
rules = association_rules(frequent_itemsets_fp, metric="lift", min_threshold=1)

# 2. Filter for strong and interesting rules
strong_rules = rules[(rules['lift'] >= 4) & (rules['confidence'] >= 0.8)].sort_values(by='lift', ascending=False)

print("\nTop Strong Association Rules found:")
print(strong_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
--- Part 5: Generating Association Rules from FP-Growth Results ---

Top Strong Association Rules found:
                                          antecedents  \
2                        (ALARM CLOCK BAKELIKE GREEN)   
3                          (ALARM CLOCK BAKELIKE RED)   
14  (SET/6 RED SPOTTY PAPER CUPS, SET/20 RED RETRO...   
16  (SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...   
11                    (SET/6 RED SPOTTY PAPER PLATES)   
10                      (SET/6 RED SPOTTY PAPER CUPS)   
15  (SET/6 RED SPOTTY PAPER CUPS, SET/6 RED SPOTTY...   
13                    (SET/6 RED SPOTTY PAPER PLATES)   

                             consequents   support  confidence      lift  
2             (ALARM CLOCK BAKELIKE RED)  0.079082    0.815789  8.642959  
3           (ALARM CLOCK BAKELIKE GREEN)  0.079082    0.837838  8.642959  
14       (SET/6 RED SPOTTY PAPER PLATES)  0.099490    0.975000  7.644000  
16         (SET/6 RED SPOTTY PAPER CUPS)  0.099490    0.975000  7.077778  
11         (SET/6 RED SPOTTY PAPER CUPS)  0.122449    0.960000  6.968889  
10       (SET/6 RED SPOTTY PAPER PLATES)  0.122449    0.888889  6.968889  
15  (SET/20 RED RETROSPOT PAPER NAPKINS)  0.099490    0.812500  6.125000  
13  (SET/20 RED RETROSPOT PAPER NAPKINS)  0.102041    0.800000  6.030769  

Part 6: Lab Tasks & Exercises¶

Now, apply what you've learned. The following tasks will help you explore the performance characteristics and practical application of FP-Growth.

In [6]:
# --- TASK 1: The Impact of Support on Performance ---
# Lowering the support threshold creates an exponentially larger number of frequent itemsets, which is where
# Apriori's candidate generation really struggles.
# Run both FP-Growth and Apriori with a LOWER `min_support` of 0.04.
# Does FP-Growth pull ahead of Apriori at this threshold, and by how much?

# YOUR CODE HERE
# low_support = 0.04
#
# # Time FP-Growth
# start_fp_low = time.time()
# fpgrowth(basket_sets, min_support=low_support, use_colnames=True)
# end_fp_low = time.time()
# time_fp_low = end_fp_low - start_fp_low
#
# # Time Apriori
# start_ap_low = time.time()
# apriori(basket_sets, min_support=low_support, use_colnames=True)
# end_ap_low = time.time()
# time_ap_low = end_ap_low - start_ap_low
#
# print("--- Task 1: Performance with min_support=0.04 ---")
# print(f"FP-Growth Time (low support): {time_fp_low:.4f}s")
# print(f"Apriori Time (low support):   {time_ap_low:.4f}s")
# print(f"New Speedup: {time_ap_low / time_fp_low:.2f}x")


# --- TASK 2: Find Rules with a Specific Consequent ---
# The marketing team wants to know: what items lead customers to buy an 'ALARM CLOCK BAKELIKE GREEN'?
# Filter the original `rules` DataFrame to find all rules where the *consequent* is {'ALARM CLOCK BAKELIKE GREEN'}.
# Sort them by confidence to see the strongest predictors.

# YOUR CODE HERE
# target_consequent = frozenset({'ALARM CLOCK BAKELIKE GREEN'})
# consequent_rules = rules[rules['consequents'] == target_consequent].sort_values(by='confidence', ascending=False)
# print("\n--- Task 2: Rules that lead to buying a GREEN ALARM CLOCK ---")
# print(consequent_rules)


# --- TASK 3: Business Interpretation ---
# From the results of Task 2, choose the rule with the highest confidence.
# Write a short, actionable recommendation for an e-commerce manager based on this rule.

# print("\n--- Task 3: Business Recommendation ---")
# print("Recommendation: The analysis shows that customers who buy the 'ALARM CLOCK BAKELIKE RED' are extremely")
# print("likely (with roughly 84% confidence) to also purchase the 'ALARM CLOCK BAKELIKE GREEN'.")
# print("Action: On the product page for the red alarm clock, we should actively recommend the green alarm clock")
# print("as a complementary item under a 'Customers Also Bought' or 'Complete the Set' section. This could significantly boost sales of the green clock.")

Part 7: Advanced Topics & Discussion¶

  • How the FP-Tree Works (Intuition): The FP-Tree is the core of the algorithm's efficiency. Each transaction is inserted into the tree as a path from the root, with its items ordered by descending frequency, so more frequent items sit closer to the root and transactions that share a prefix share the same path. Each node carries a counter of how many transactions pass through it. Crucially, a "header table" stores pointers to every node of a particular item, allowing the algorithm to quickly gather all paths containing that item without rescanning the database.

  • FP-Growth vs. Apriori - When to Choose Which?

    • FP-Growth: This is usually the better default choice. It is generally much faster on large or dense datasets and at low support thresholds (Task 1 illustrates this); on very small datasets, like the France basket used here, the tree-building overhead can cancel out its advantage. Its main potential drawback is memory usage, as the FP-Tree must be held in memory.
    • Apriori: While slower, its implementation can be more straightforward if building from scratch, and it may be memory-efficient for certain sparse datasets where the number of candidates generated at each step is not overwhelming. For most practical applications using libraries like mlxtend, FP-Growth is superior.
  • Modern Frequent Pattern Mining: For truly massive datasets ("Big Data") that do not fit into the memory of a single machine, frequent pattern mining is performed on distributed computing platforms. Apache Spark ships a parallelized implementation of FP-Growth that can scale to terabytes of data by distributing the construction and mining of the FP-Tree across a cluster of machines; a minimal PySpark sketch follows below.
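For reference, here is a minimal, hypothetical sketch of Spark's FP-Growth API. It assumes pyspark is installed, uses a tiny made-up transaction list, and is not part of this lab's mlxtend workflow.

from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fpgrowth-demo").getOrCreate()

# Spark expects each row to hold one transaction as an array of items,
# rather than the one-hot matrix used by mlxtend
data = spark.createDataFrame([
    (0, ["bread", "milk"]),
    (1, ["bread", "diapers", "beer"]),
    (2, ["milk", "diapers", "beer"]),
], ["id", "items"])

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(data)

model.freqItemsets.show()       # frequent itemsets and their counts
model.associationRules.show()   # antecedent, consequent, confidence, ...
spark.stop()

Both tree construction and mining are distributed across the cluster, so the same two-step strategy described above scales far beyond a single machine's memory.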


Prepared By

Md. Atikuzzaman
Lecturer
Department of Computer Science and Engineering
Green University of Bangladesh
Email: atik@cse.green.edu.bd