Apriori Algorithm in Apache Flink
This project implements the Apriori algorithm as described in the 1994 paper "Fast Algorithms for Mining Association Rules" by Rakesh Agrawal and Ramakrishnan Srikant.
Agrawal, Rakesh, and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In: Proc. 20th Int. Conf. Very Large Data Bases (VLDB). 1994, pp. 487-499.
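For orientation, here is a minimal single-machine sketch of the level-wise Apriori idea in plain Java. It is not the Flink implementation in this repository. The class name `AprioriSketch` and all method names are hypothetical, the join step uses a simple pairwise union rather than the prefix-based join from the paper, and it assumes that itemset-size means the size k of the itemsets to report.

```java
import java.util.*;

public class AprioriSketch {

    /** Counts the transactions that contain every item of the candidate. */
    static int supportCount(List<Set<Integer>> transactions, Set<Integer> candidate) {
        int count = 0;
        for (Set<Integer> t : transactions) {
            if (t.containsAll(candidate)) count++;
        }
        return count;
    }

    /** Prune check: every (k-1)-subset of a k-candidate must itself be frequent. */
    static boolean allSubsetsFrequent(Set<Integer> candidate, Set<Set<Integer>> frequent) {
        for (Integer item : candidate) {
            Set<Integer> subset = new TreeSet<>(candidate);
            subset.remove(item);
            if (!frequent.contains(subset)) return false;
        }
        return true;
    }

    /** Join step (simplified): union pairs of frequent (k-1)-itemsets that form a k-itemset. */
    static Set<Set<Integer>> generateCandidates(Set<Set<Integer>> frequent, int k) {
        Set<Set<Integer>> candidates = new HashSet<>();
        for (Set<Integer> a : frequent) {
            for (Set<Integer> b : frequent) {
                Set<Integer> union = new TreeSet<>(a);
                union.addAll(b);
                if (union.size() == k && allSubsetsFrequent(union, frequent)) {
                    candidates.add(union);
                }
            }
        }
        return candidates;
    }

    /** Returns the frequent itemsets of exactly the given size (assumed meaning of itemset-size). */
    static Set<Set<Integer>> frequentItemsets(List<Set<Integer>> transactions,
                                              double minSupport, int itemsetSize) {
        // Relative support is turned into an absolute transaction count.
        int minCount = (int) Math.ceil(minSupport * transactions.size());

        // Level 1: frequent single items.
        Set<Integer> items = new TreeSet<>();
        for (Set<Integer> t : transactions) items.addAll(t);
        Set<Set<Integer>> current = new HashSet<>();
        for (Integer item : items) {
            Set<Integer> single = new TreeSet<>(Collections.singleton(item));
            if (supportCount(transactions, single) >= minCount) current.add(single);
        }

        // Levels 2..itemsetSize: generate candidates, then keep the frequent ones.
        for (int k = 2; k <= itemsetSize && !current.isEmpty(); k++) {
            Set<Set<Integer>> next = new HashSet<>();
            for (Set<Integer> c : generateCandidates(current, k)) {
                if (supportCount(transactions, c) >= minCount) next.add(c);
            }
            current = next;
        }
        return current;
    }
}
```

The sketch rescans all transactions for every candidate, which is fine for illustration but far too slow for a dataset the size of BMS-POS; distributing that counting step is where a framework such as Flink comes in.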
Build the jar file using the following command:
mvn clean package -Pbuild-jar
This should produce a file called flink-apriori-java-1.0-SNAPSHOT.jar in the target directory.
Parameters:
- input: location of the BMS-POS.dat file
- output: prints to stdout if not set
- min-support: a real number in the range (0,1]
- itemset-size: an integer in the range (1, Infinity]
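As an illustration of the min-support range, the sketch below validates a value in (0,1] and converts it into an absolute transaction count. The class `SupportThreshold` and the method `minCount` are hypothetical names, and rounding up with `Math.ceil` is an assumption; the project may treat the threshold differently.

```java
public class SupportThreshold {

    /**
     * Converts a relative min-support in (0, 1] into the smallest absolute
     * number of transactions an itemset must appear in to be frequent.
     */
    static long minCount(double minSupport, long transactionCount) {
        // The negated form also rejects NaN.
        if (!(minSupport > 0.0 && minSupport <= 1.0)) {
            throw new IllegalArgumentException(
                "min-support must be in (0, 1], got " + minSupport);
        }
        return (long) Math.ceil(minSupport * transactionCount);
    }
}
```

For example, with a min-support of 0.25 and 1000 transactions, an itemset would need to occur in at least 250 transactions.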
Libraries used:
- Google Guava 19.0 (Apache License 2.0)
- Apache Commons Lang 3.4 (Apache License 2.0)
Download the KDD Cup 2000 Dataset. More info about the data here.
After downloading the data, unpack the BMS-POS.dat file. Included in this repository is a checksum file for verifying the integrity of the file.
Steps:
unzip -j KDDCup2000.zip assoc/BMS-POS.dat.gz
gunzip BMS-POS.dat.gz
sha1sum -c BMS-POS.dat.sha1
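On systems without sha1sum, the same check could be approximated in Java. This is a sketch, not part of the project: the class `Sha1Check` and the method `sha1Hex` are hypothetical names. It computes the SHA-1 digest of a stream as lowercase hex, which can then be compared by hand with the hash stored in BMS-POS.dat.sha1.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Check {

    /** Reads the stream to the end and returns its SHA-1 digest as lowercase hex. */
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

To use it, open the unpacked file with `java.nio.file.Files.newInputStream(java.nio.file.Paths.get("BMS-POS.dat"))` and pass the stream to `sha1Hex`.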
TODO:
- Tests
- Implement the ItemSetCalculateFrequency RichMapFunction in a more efficient manner
Apache License 2.0
This project uses libraries licensed under the Apache License 2.0.