Reddit Data Pipeline

Overview

This project sets up a real-time data pipeline to fetch data from Reddit, transform it using AWS Glue, and store it in Amazon S3. This involves data streaming, cloud storage, ETL (Extract, Transform, Load) processes, and orchestration using Apache Airflow.

Prerequisites

Python: Version 3.x
Docker: For local development and orchestration
AWS Account: For S3 and AWS Glue
Reddit API Credentials: For accessing Reddit data
Aws Iam policy user key: For writing and reading data from data source

Setup

1. Clone the Repository

git clone https://github.com/olusimeon/reddit-data-pipeline.git
cd reddit-data-pipeline

2. Configure Environment Variables

REDDIT_CLIENT_ID=your_reddit_client_id
REDDIT_CLIENT_SECRET=your_reddit_client_secret
REDDIT_USER_AGENT=your_reddit_user_agent
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key

3 Install Dependencies

pip install -r requiremnts.txt

4 Set up Docker

You can download docker.compose.yaml from the offical apache airflow website apache airflow. or download the docker-compose.yaml provided in this folder

5. Deploy and Run the Pipeline

Airflow: Access the Airflow web interface at http://localhost:8080 and trigger the DAG named pull_comments_and_load.py. AWS Glue: Ensure the Glue job script (glue/api transform.py) is configured and deployed to your AWS Glue environment using the script and python shell environment.

6. Usage

ETL Process: The Airflow DAG (airflow/dags/pullcomments.py) schedules the ETL pipeline, fetching data from Reddit, triggering AWS Glue for transformation, and storing results in S3. Transformation: The AWS Glue job script (glue/api transform.py) processes and transforms the raw data from S3. S3 Bucket Structure Refer to s3/bucket_structure.txt for details on the organization of S3 buckets and directories used in this project.

Additional Notes

Ensure your AWS IAM roles and permissions are correctly configured for accessing S3 and running Glue jobs. Regularly monitor Airflow logs and AWS Glue job logs for any issues or errors in the pipeline execution.

Contributing

Feel free to submit issues or pull requests. Contributions are welcome!

Contact

For any questions or comments, please reach out to olusimeon4@gmail.com.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
airflow/dags		airflow/dags
glue_script		glue_script
s3		s3
Dockerfile		Dockerfile
Readme.Md		Readme.Md
docker-compose.yaml		docker-compose.yaml
requirements.txt		requirements.txt
scripts.txt		scripts.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Reddit Data Pipeline

Overview

Prerequisites

Setup

1. Clone the Repository

2. Configure Environment Variables

3 Install Dependencies

4 Set up Docker

5. Deploy and Run the Pipeline

6. Usage

Additional Notes

Contributing

Contact

About

Releases

Packages

Languages

olusimeon/reddit-sentimentanalyses-pipeline

Folders and files

Latest commit

History

Repository files navigation

Reddit Data Pipeline

Overview

Prerequisites

Setup

1. Clone the Repository

2. Configure Environment Variables

3 Install Dependencies

4 Set up Docker

5. Deploy and Run the Pipeline

6. Usage

Additional Notes

Contributing

Contact

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages