Apache-Kafka-and-Frequent-Itemsets

Overview

This project implements a Kafka-based streaming pipeline for the Amazon Metadata dataset. It samples and preprocesses the data, mines frequent itemsets from the stream, and stores the results in MongoDB.

Why this project?

This project is designed to:

Process and analyze the Amazon Metadata dataset in real-time using a streaming pipeline
Discover frequent itemsets in the dataset using the Apriori and PCY algorithms
Optimize data processing using a Bloom filter
Store the results in MongoDB and view them in MongoDB Compass
Handle large datasets and scale the processing using Kafka
Perform real-time analytics and gain insights from the dataset
Preprocess the dataset to prepare it for analysis
Use multiple consumer applications to perform different tasks and analyses on the data stream

Files

sampling.py: Python script for sampling the Amazon Metadata dataset.
pre-processing.py: Python script for preprocessing the sampled dataset.
producer.py: Python script for the producer application in the streaming pipeline setup.
consumer1.py, consumer2.py, consumer3.py: Python scripts for the consumer applications subscribing to the producer's data stream.
Apriori.py: Python script implementing the Apriori algorithm for frequent itemset mining.
PCY.py: Python script implementing the PCY algorithm for frequent itemset mining.
Bloomfilter.py: Python script implementing the Bloom filter data structure.
Database_Apriori.py: Python script that stores the Apriori results in a MongoDB database.
Database_PCY.py: Python script that stores the PCY results in a MongoDB database.
Database_Bloomfilter.py: Python script that stores the Bloom filter results in a MongoDB database.

Description

Sampling and Preprocessing

The sampling.py script samples the Amazon Metadata dataset, followed by preprocessing using pre-processing.py.
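As an illustration of this step, here is a minimal sketch of sampling followed by preprocessing. It is not the code in sampling.py or pre-processing.py. It assumes the dataset is stored as one JSON object per line, and the field names asin and also_buy, the function names, and the use of reservoir sampling are all assumptions made for the example.

```python
import json
import random
from typing import Dict, Iterable, Iterator, List, Tuple


def reservoir_sample(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Keep a uniform random sample of k lines while reading the input once."""
    rng = random.Random(seed)
    sample: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = line     # replace an earlier pick
    return sample


def preprocess(lines: Iterable[str],
               keep: Tuple[str, ...] = ("asin", "also_buy")) -> Iterator[Dict]:
    """Parse JSON lines, skip malformed rows, keep only the chosen fields."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                 # drop rows that are not valid JSON
        if not isinstance(record, dict) or "asin" not in record:
            continue                 # drop rows without a product id (assumed field)
        yield {key: record[key] for key in keep if key in record}
```

Passing an open file object as lines lets both functions work on a dataset that does not fit in memory, since each line is read only once.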

Streaming Pipeline Setup

The producer.py script generates a data stream, while consumer1.py, consumer2.py, and consumer3.py subscribe to this stream to perform various tasks.
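The producer and consumer roles can be sketched as follows. This is a hedged example, not the contents of producer.py or the consumer scripts. It assumes the kafka-python client, a broker at localhost:9092, and a topic called amazon-metadata; the topic name, the broker address, and the helper names are all hypothetical.

```python
import json
from typing import Any, Dict, Iterable

TOPIC = "amazon-metadata"      # hypothetical topic name
BOOTSTRAP = "localhost:9092"   # assumed local broker address


def encode(record: Dict[str, Any]) -> bytes:
    """Serialize one record as UTF-8 JSON for the Kafka message value."""
    return json.dumps(record).encode("utf-8")


def decode(payload: bytes) -> Dict[str, Any]:
    """Inverse of encode(), used on the consumer side."""
    return json.loads(payload.decode("utf-8"))


def publish(producer, topic: str, records: Iterable[Dict[str, Any]]) -> int:
    """Send every record to the topic; producer only needs send() and flush()."""
    sent = 0
    for record in records:
        producer.send(topic, encode(record))
        sent += 1
    producer.flush()           # block until buffered messages are delivered
    return sent


def make_producer():
    from kafka import KafkaProducer  # lazy import; requires kafka-python
    return KafkaProducer(bootstrap_servers=BOOTSTRAP)


def make_consumer(group_id: str):
    from kafka import KafkaConsumer  # lazy import; requires kafka-python
    return KafkaConsumer(
        TOPIC,
        bootstrap_servers=BOOTSTRAP,
        group_id=group_id,             # use a distinct group per consumer script
        auto_offset_reset="earliest",  # start from the oldest retained message
        value_deserializer=decode,
    )
```

A consumer would then iterate over make_consumer("consumer1") and read message.value from each message. Giving each of the three consumers its own group_id means every consumer receives the full stream; consumers that share a group_id split the topic's partitions between them instead.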

Frequent Itemset Mining

Apriori.py and PCY.py implement different algorithms for frequent itemset mining, while Bloomfilter.py provides support for efficient data processing.
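The README does not describe how Bloomfilter.py is built, so the following is only a minimal sketch of the data structure itself. The class name, the default sizes, and the choice of SHA-256 with a seed prefix as the hash family are assumptions made for the example.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size bit array with k hash functions; no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive k bit positions by hashing the item with k different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

In a streaming setting, a filter like this can answer "has this product ID been seen before?" in constant memory, at the cost of occasional false positives. A larger num_bits lowers the false-positive rate.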

Database Integration

The Database_Apriori.py, Database_PCY.py, and Database_Bloomfilter.py scripts connect to a MongoDB database and store the results of the analysis.
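Storing mined itemsets could look roughly like the sketch below. It assumes the pymongo driver and a MongoDB instance at localhost:27017. The database name, collection name, document layout, and helper names are hypothetical and are not taken from the Database_*.py scripts.

```python
from typing import Dict, FrozenSet, List

MONGO_URI = "mongodb://localhost:27017"  # assumed local MongoDB instance


def to_documents(counts: Dict[FrozenSet[str], int], algorithm: str) -> List[dict]:
    """Turn {itemset: support} into one MongoDB document per itemset."""
    return [
        {"algorithm": algorithm, "items": sorted(itemset), "support": count}
        for itemset, count in counts.items()
    ]


def store(collection, documents: List[dict]) -> int:
    """Insert the documents; collection only needs insert_many()."""
    if not documents:
        return 0                 # pymongo rejects an empty insert_many() call
    collection.insert_many(documents)
    return len(documents)


def get_collection(db_name: str = "frequent_itemsets", coll_name: str = "apriori"):
    from pymongo import MongoClient  # lazy import; requires pymongo
    return MongoClient(MONGO_URI)[db_name][coll_name]
```

Storing the items as a sorted list keeps the documents easy to read and filter in MongoDB Compass, since a Python frozenset cannot be stored directly.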

MongoDB Compass

Once Database_Apriori.py, Database_PCY.py, and Database_Bloomfilter.py have written their output to MongoDB, the stored results can be viewed in MongoDB Compass.

Analysis

This project performs simple analysis on the dataset, including:

Frequent itemset mining using Apriori and PCY algorithms
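To make the difference between the two algorithms concrete, here is a small sketch that finds frequent pairs both ways. It is not the code in Apriori.py or PCY.py. It works on an in-memory list of baskets rather than a Kafka stream, stops at pairs, and uses CRC32 as the PCY bucket hash; the function names and those choices are assumptions made for the example.

```python
import zlib
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List


def apriori_pairs(baskets: List[List[str]], min_support: int) -> Dict[FrozenSet[str], int]:
    """Pass 1 counts items; pass 2 counts only pairs of frequent items."""
    item_counts = Counter(item for basket in baskets for item in set(basket))
    keep = {item for item, c in item_counts.items() if c >= min_support}
    pair_counts: Counter = Counter()
    for basket in baskets:
        items = sorted(set(basket) & keep)
        pair_counts.update(frozenset(p) for p in combinations(items, 2))
    return {p: c for p, c in pair_counts.items() if c >= min_support}


def _bucket(a: str, b: str, num_buckets: int) -> int:
    # Deterministic hash of an ordered pair into one of num_buckets buckets.
    return zlib.crc32(f"{a}|{b}".encode("utf-8")) % num_buckets


def pcy_pairs(baskets: List[List[str]], min_support: int,
              num_buckets: int = 101) -> Dict[FrozenSet[str], int]:
    """Like Apriori, but pass 1 also hashes every pair into a bucket count."""
    item_counts: Counter = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for a, b in combinations(items, 2):
            bucket_counts[_bucket(a, b, num_buckets)] += 1

    keep = {item for item, c in item_counts.items() if c >= min_support}
    bitmap = [c >= min_support for c in bucket_counts]  # frequent buckets

    pair_counts: Counter = Counter()
    for basket in baskets:
        items = sorted(set(basket) & keep)
        for a, b in combinations(items, 2):
            if bitmap[_bucket(a, b, num_buckets)]:      # extra PCY filter
                pair_counts[frozenset((a, b))] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}
```

Both functions return the same frequent pairs. PCY only reduces how many candidate pairs are counted in the second pass, because a pair whose bucket total is below min_support cannot itself reach min_support.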

Usage

Run sampling.py followed by pre-processing.py to sample and preprocess the dataset.
Run producer.py to start the data stream.
Run consumer1.py, consumer2.py, and consumer3.py to subscribe to the data stream and perform analysis.
Optionally, run Apriori.py, PCY.py, and Bloomfilter.py for additional analysis.
Run Database_Apriori.py, Database_PCY.py, and Database_Bloomfilter.py to integrate with the MongoDB database.

Requirements

Python 3.x
Kafka
MongoDB Compass & MongoDB Connector
Other dependencies as specified in the code

Note

Before running the scripts, remove the numeric prefix from each file name (e.g., rename "3) producer.py" to "producer.py"). The prefixes are there only to show the intended order.
To run the bonus script, execute ./BONUS.sh from a terminal.

Contributors

M.Tashfeen Abbasi
Laiba Mazhar
Rafia Khan

Thank you!

