On a Parallel Spark Workflow for Frequent Itemset Mining Based on Array Prefix-Tree

SC19 Proceedings

Workshop: On a Parallel Spark Workflow for Frequent Itemset Mining Based on Array Prefix-Tree

Abstract: Frequent Itemset Mining (FIM) is a fundamental procedure in various data mining techniques such as association rule mining. Among the many existing algorithms, FP-Growth is considered a milestone because it discovers frequent itemsets without generating candidates. However, due to the high complexity of its mining process and its high memory cost, FP-Growth still suffers from a performance bottleneck when dealing with large datasets. In this paper, we design a new Array Prefix-Tree structure and, based on it, propose an Array Prefix-Tree Growth (APT-Growth) algorithm, which explicitly obviates the need to recursively construct conditional FP-Trees as FP-Growth requires. To support big data analytics, we further design and implement a parallel version of APT-Growth, referred to as PAPT-Growth, as a Spark workflow. We conduct FIM workflow experiments on both real-life and synthetic datasets for performance evaluation, and extensive results show that PAPT-Growth outperforms other representative parallel FIM algorithms in terms of execution time, which suggests its potential for big data mining applications.
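
For readers unfamiliar with the FIM problem itself, the following minimal Python sketch illustrates what a frequent itemset miner is expected to return. It is a naive level-wise enumeration written for illustration only; it is not APT-Growth, PAPT-Growth, or FP-Growth, and it does not use Spark. The function name `frequent_itemsets`, the `min_support` parameter (an absolute count), and the sample transactions are all assumptions introduced here, not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for itemsets with count >= min_support.

    Naive enumeration for illustration only (hypothetical helper, not the
    paper's algorithm): every k-subset of every transaction is counted.
    """
    transactions = [frozenset(t) for t in transactions]
    result = {}
    max_len = max((len(t) for t in transactions), default=0)
    for k in range(1, max_len + 1):
        counts = Counter()
        for t in transactions:
            # Count every k-item subset contained in this transaction.
            for combo in combinations(sorted(t), k):
                counts[frozenset(combo)] += 1
        level = {s: c for s, c in counts.items() if c >= min_support}
        if not level:
            # No frequent k-itemsets means no frequent (k+1)-itemsets exist.
            break
        result.update(level)
    return result


# Made-up sample data, purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
found = frequent_itemsets(transactions, min_support=3)
```

With a minimum support of 3, the four single items and the pair {bread, milk} qualify on this sample. Enumerating all subsets grows exponentially with transaction length, which is the kind of cost that tree-based methods such as FP-Growth are designed to avoid.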

Venue: The 14th Workshop on Workflows in Support of Large-Scale Science (WORKS19)
