An API service for a distributed data processing pipeline on CloudLab that analyzes data streamed from three biodiversity data aggregators: iDigBio (https://idigbio.org/), GBIF (https://gbif.org/), and OBIS (https://obis.org/). [ Apache Kafka, Apache Spark, CloudLab, Flask, Python ]

The project creates an Apache Spark cluster (master-worker) alongside an Apache Kafka deployment running the KRaft consensus protocol.

Apache Kafka manages the streams of data from the three aggregators, and Spark runs queries over those streams. The final result of each Spark query is kept in memory and published to a Kafka topic.
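As an illustration, a Structured Streaming job of this shape might look like the sketch below. The topic names (`biodiversity-stream`, `query-results`), the record schema, the broker address, and the checkpoint path are assumptions for illustration, not the repo's actual values.

```python
# Sketch of a streaming query: consume aggregator records from Kafka,
# aggregate them, keep the result in memory, and publish updates to Kafka.
# Run with the Kafka connector, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<version> job.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_json, struct
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName("biodiversity-stream-query").getOrCreate()

# Assumed record shape; the real aggregator payloads have many more fields.
schema = (StructType()
          .add("source", StringType())
          .add("scientificName", StringType())
          .add("country", StringType()))

records = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")
           .option("subscribe", "biodiversity-stream")
           .load()
           .select(from_json(col("value").cast("string"), schema).alias("r"))
           .select("r.*"))

# Example query: occurrence counts per source and country.
counts = records.groupBy("source", "country").count()

# Keep the running result in an in-memory table the API can query ...
(counts.writeStream.format("memory")
       .queryName("query_results").outputMode("complete").start())

# ... and publish each update to a results topic.
(counts.select(to_json(struct("*")).alias("value"))
       .writeStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("topic", "query-results")
       .option("checkpointLocation", "/tmp/query-results-ckpt")
       .outputMode("update")
       .start())

spark.streams.awaitAnyTermination()
```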

To start a stream from any of the available sources, list the currently active sources, and access the results, we deploy a Flask API server.

Flask uses Python's subprocess package to create Kafka topics, start the streams, and submit jobs to the Spark cluster.
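For example, an endpoint wiring these together might look like the sketch below; the route names, producer script paths, topic names, and Spark master URL are hypothetical, assuming the Kafka and Spark CLI tools are on the server's PATH.

```python
# Sketch of the control-plane API: create a topic, launch a producer,
# and submit the streaming query, all via subprocess.
import subprocess
from flask import Flask, jsonify

app = Flask(__name__)
active_sources = {}  # source name -> producer PID

@app.route("/stream/<source>", methods=["POST"])
def start_stream(source):
    if source not in ("idigbio", "gbif", "obis"):
        return jsonify(error="unknown source"), 404
    # Create the source's topic (idempotent thanks to --if-not-exists).
    subprocess.run(
        ["kafka-topics.sh", "--create", "--if-not-exists",
         "--topic", f"{source}-stream",
         "--bootstrap-server", "localhost:9092"],
        check=True)
    # Launch the producer that polls the aggregator's API into the topic ...
    producer = subprocess.Popen(["python3", f"producers/{source}_producer.py"])
    active_sources[source] = producer.pid
    # ... and submit the streaming query to the Spark cluster.
    subprocess.Popen(["spark-submit", "--master", "spark://master:7077",
                      "jobs/stream_query.py", f"{source}-stream"])
    return jsonify(source=source, producer_pid=producer.pid), 202

@app.route("/streams", methods=["GET"])
def list_streams():
    # Report which sources currently have a running producer.
    return jsonify(active=sorted(active_sources))
```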

The server is secured with a self-signed certificate and a basic (dummy) token-based auth setup. Port forwarding maps the node's default HTTPS port (443) to the Flask app's port.
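A minimal sketch of that setup, with a placeholder token and certificate paths (the real values live in the deployment):

```python
# Sketch of the (dummy) bearer-token check and self-signed TLS serving.
# Generate the certificate once, e.g.:
#   openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
#       -keyout key.pem -out cert.pem
from functools import wraps
from flask import Flask, request, abort

app = Flask(__name__)
API_TOKEN = "dummy-token"  # placeholder, not the deployed value

def require_token(view):
    @wraps(view)
    def wrapped(*args, **kwargs):
        if request.headers.get("Authorization") != f"Bearer {API_TOKEN}":
            abort(401)
        return view(*args, **kwargs)
    return wrapped

@app.route("/streams")
@require_token
def list_streams():
    return {"active": []}

if __name__ == "__main__":
    # Serve HTTPS directly; the node forwards port 443 to this port.
    app.run(host="0.0.0.0", port=5000, ssl_context=("cert.pem", "key.pem"))
```

A client would then call, e.g., `curl -k https://<node>/streams -H "Authorization: Bearer dummy-token"` (`-k` because the certificate is self-signed).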

To set up the cluster and start a job, follow the steps below:
