Iris Dataset Analysis with PySpark

This project implements K-means, Bisecting K-means, and Decision Tree algorithms in PySpark on the Iris dataset.

Introduction

This project demonstrates the use of PySpark to perform clustering and classification on the Iris dataset. The Iris dataset is a classic dataset in machine learning and statistics. It contains 150 samples, each with four measurements (sepal length, sepal width, petal length, and petal width) and the species of the flower (setosa, versicolor, or virginica).

Description

The project consists of three main components:

K-means Clustering: A traditional clustering algorithm that partitions data into k distinct clusters.
Bisecting K-means Clustering: A hierarchical variation of K-means that starts with all points in one cluster and repeatedly splits a cluster in two until k clusters are reached.
Decision Tree Classification: A classification algorithm that uses a tree-like model to make decisions based on input features.
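
To make the difference between the two clustering approaches concrete, here is a minimal pure-Python sketch of the bisecting idea. It is not the PySpark implementation used in this project. The helper names (`two_means`, `bisecting_kmeans`) and the rule of always splitting the cluster with the largest sum of squared errors are assumptions chosen for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points (tuples of numbers)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def two_means(points, rng, iters=20):
    """Split points into two groups with plain K-means (k=2)."""
    # Start from two distinct points so that both groups are non-empty.
    centers = rng.sample(sorted(set(points)), 2)
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        right = [p for p in points if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        new_centers = [mean(left), mean(right)]
        if new_centers == centers:
            break  # assignments have stopped changing
        centers = new_centers
    return left, right


def bisecting_kmeans(points, k, seed=0):
    """Start with one cluster and bisect until there are k clusters."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        # Only clusters with at least two distinct points can be split.
        splittable = [c for c in clusters if len(set(c)) > 1]
        if not splittable:
            break
        # Assumed selection rule: split the cluster with the largest SSE.
        target = max(splittable, key=sse)
        clusters.remove(target)
        clusters.extend(two_means(target, rng))
    return clusters
```

Calling `bisecting_kmeans(points, k=3)` on a list of coordinate tuples returns a list of up to three clusters. Plain K-means, by contrast, places all k centers at once and refines them together.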

Installation

To run this project, you need to have PySpark installed. You can install it using pip:

pip install pyspark

Additionally, you need to have Matplotlib and Seaborn for data visualization:

pip install matplotlib seaborn

Usage

Initialize Spark Session: The Spark session is initialized with the name "IrisAnalysis".
Load Data: The Iris dataset is loaded from a CSV file.
Prepare Features: Features are prepared using VectorAssembler.
K-means Clustering: The K-means algorithm is applied to the data, and the results are visualized.
Bisecting K-means Clustering: The Bisecting K-means algorithm is applied, and the results are visualized.
Decision Tree Classification: A decision tree classifier is trained, evaluated, and results are visualized.

Results

K-means Silhouette Score: The silhouette score for the K-means clustering model is 0.7482.
Bisecting K-means Silhouette Score: The silhouette score for the Bisecting K-means clustering model is 0.6682.
Decision Tree Accuracy: The accuracy of the Decision Tree classifier is 1.0.
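
For readers unfamiliar with the silhouette score, the following pure-Python sketch shows how it is computed. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the smallest mean distance to the points of any other cluster, and the point's score is `(b - a) / max(a, b)`. The overall score is the average over all points and lies between -1 and 1, with higher values meaning tighter, better-separated clusters. The function name `silhouette_score` and the use of plain Euclidean distance are assumptions for illustration. PySpark's `ClusteringEvaluator` uses squared Euclidean distance by default, so this sketch is not expected to reproduce the scores above exactly.

```python
import math


def silhouette_score(points, labels, dist=math.dist):
    """Mean silhouette over all points; needs at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # dist(point, point) is 0, so dividing by len(own) - 1 excludes the point itself.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)
```

For example, labelling two well-separated pairs of points as two clusters gives a score close to 1, while mixing the pairs across clusters gives a negative score.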

Visualizations for each clustering method and the decision tree classification are generated using Matplotlib and Seaborn.

Technologies Used

PySpark: For data processing and machine learning.
Matplotlib: For data visualization.
Seaborn: For enhanced data visualization.

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or additions.

License

This project is licensed under the MIT License.

burhanahmed1/Iris-Dataset-Analysis-with-PySpark
