Workshop: On a Parallel Spark Workflow for Frequent Itemset Mining Based on Array Prefix-Tree
Abstract: Frequent Itemset Mining (FIM) is a fundamental procedure in various data mining techniques such as association rule mining. Among many existing algorithms, FP-Growth is considered as a milestone achievement that discovers frequent itemsets without generating candidates. However, due to the high complexity of its mining process and the high cost of its memory usage, FP-Growth still suffers from a performance bottleneck when dealing with large datasets. In this paper, we design a new Array Prefix-Tree structure, and based on that, propose an Array Prefix-Tree Growth (APT-Growth) algorithm, which explicitly obviates the need of recursively constructing conditional FP-Tree as required by FP-Growth. To support big data analytics, we further design and implement a parallel version of APTGrowth, referred to as PAPT-Growth, as a Spark workflow. We conduct FIM workflow experiments on both real-life and synthetic datasets for performance evaluation, and extensive results show that PAPT-Growth outperforms other representative parallel FIM algorithms in terms of execution time, which sheds light on its potential applications to big data mining.