Thyroid-Detection

Problem Statement

Build a classification methodology to predict the type of thyroid condition from the given training data.

Architecture

Data Description

The client will send data as multiple sets of files, in batches, to a given location. The data will contain different classes of thyroid and 30 columns of different values. The "Class" column will have four unique values: "negative", "compensated_hypothyroid", "primary_hypothyroid", and "secondary_hypothyroid". Apart from the training files, we also require a "schema" file from the client, which contains all the relevant information about the training files, such as: name of the files, length of the date value in the file name, length of the time value in the file name, number of columns, names of the columns, and their datatypes.
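
As an illustration of what such a schema file might hold, here is a minimal sketch that parses a schema and checks it for internal consistency. The key names and values are assumptions made up for this example (the client's actual schema file may be laid out differently), and only three columns are listed instead of 30 to keep it short.

```python
import json

# Hypothetical schema contents: key names and values are illustrative
# assumptions, not the client's actual schema file.
schema_text = """
{
  "FileNamePrefix": "thyroid",
  "LengthOfDateStamp": 8,
  "LengthOfTimeStamp": 6,
  "NumberOfColumns": 3,
  "Columns": {"age": "int", "sex": "str", "Class": "str"}
}
"""

REQUIRED_KEYS = {
    "FileNamePrefix",
    "LengthOfDateStamp",
    "LengthOfTimeStamp",
    "NumberOfColumns",
    "Columns",
}


def load_schema(text):
    """Parse the schema and fail early if it is incomplete or inconsistent."""
    schema = json.loads(text)
    missing = REQUIRED_KEYS - schema.keys()
    if missing:
        raise KeyError(f"schema is missing keys: {sorted(missing)}")
    if len(schema["Columns"]) != schema["NumberOfColumns"]:
        raise ValueError("NumberOfColumns does not match the Columns mapping")
    return schema


schema = load_schema(schema_text)
print(schema["NumberOfColumns"], list(schema["Columns"]))
```

Checking the schema once at load time means the later validation steps can rely on every expected key being present.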

Data Validation

In this step, we perform different sets of validation on the given set of training files.

Name Validation
Number of Columns
Name of Columns
The datatype of columns
Null values in columns
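
The checks listed above could be sketched roughly as follows. The file-name pattern, the expected column list, and the rule that a column counts as "null" only when every value in it is empty are all assumptions for illustration; in the pipeline the expectations would come from the schema file. The datatype check is left out to keep the sketch short.

```python
import csv
import io
import re

# Hypothetical expectations; in practice these would be read from the schema.
PREFIX = "thyroid"
DATE_LEN, TIME_LEN = 8, 6
EXPECTED_COLUMNS = ["age", "sex", "TSH", "Class"]


def is_valid_name(filename):
    """Name validation: <prefix>_<date digits>_<time digits>.csv (assumed pattern)."""
    pattern = rf"{re.escape(PREFIX)}_\d{{{DATE_LEN}}}_\d{{{TIME_LEN}}}\.csv"
    return re.fullmatch(pattern, filename) is not None


def validate_columns(csv_text):
    """Return a list of problems found in the header and the column contents."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    problems = []
    if len(header) != len(EXPECTED_COLUMNS):
        problems.append("wrong number of columns")
    elif header != EXPECTED_COLUMNS:
        problems.append("unexpected column names")
    # Assumed rule: flag a column only when every one of its values is empty.
    for index, name in enumerate(header):
        if body and all(index >= len(row) or row[index].strip() == "" for row in body):
            problems.append(f"column {name!r} is entirely null")
    return problems


good = "age,sex,TSH,Class\n41,F,1.3,negative\n23,F,4.1,negative\n"
bad = "age,sex,TSH,Class\n41,F,,negative\n23,F,,negative\n"

print(is_valid_name("thyroid_08012020_120000.csv"))
print(validate_columns(good))
print(validate_columns(bad))
```

Returning a list of problems, rather than stopping at the first one, makes it easier to log why a file was rejected.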

Data Insertion in Database

Database Creation and connection - Create a database with the given name passed. If the database is already created, open the connection to the database.
Table creation in the database - Table with name
Insertion of files in the table
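
A minimal sketch of these three steps, assuming SQLite as the database (the document does not name the database engine). The table name `good_raw_data` and the column types are hypothetical placeholders chosen for the example.

```python
import sqlite3


def open_database(path):
    """sqlite3.connect creates the database if it is absent, otherwise opens it."""
    return sqlite3.connect(path)


def create_table(conn, table, columns):
    """Create the table if it does not exist; `columns` maps name -> SQL type."""
    column_sql = ", ".join(f'"{name}" {sql_type}' for name, sql_type in columns.items())
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_sql})')


def insert_rows(conn, table, rows):
    """Insert already-validated rows using parameter placeholders."""
    if not rows:
        return
    placeholders = ", ".join("?" for _ in rows[0])
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()


# An in-memory database keeps the example self-contained.
conn = open_database(":memory:")
columns = {"age": "INTEGER", "sex": "TEXT", "Class": "TEXT"}  # hypothetical columns
create_table(conn, "good_raw_data", columns)  # hypothetical table name
insert_rows(
    conn,
    "good_raw_data",
    [(41, "F", "negative"), (23, "M", "primary_hypothyroid")],
)
row_count = conn.execute('SELECT COUNT(*) FROM "good_raw_data"').fetchone()[0]
print(row_count)
```

`CREATE TABLE IF NOT EXISTS` lets the same code run for the first batch and for later batches without a separate existence check.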

Model Training

Data Export from Db
Data Preprocessing
a) Drop columns not useful for training the model. Such columns were selected while doing the EDA.
b) Replace the invalid values with numpy "nan" so we can use imputer on such values.
c) Encode the categorical values.
d) Check for null values in the columns. If present, impute the null values using the KNN imputer.
e) After imputing, handle the imbalanced dataset by using RandomOverSampler.
Clustering - The KMeans algorithm is used to create clusters in the preprocessed data. The optimum number of clusters is selected from the elbow plot, and for dynamic selection of the number of clusters we use the "KneeLocator" function. The idea behind clustering is to train a separate model on each cluster, so that different algorithms can be used for different clusters. The KMeans model is trained on the preprocessed data and saved for use in prediction.
Model Selection - After the clusters are created, we find the best model for each cluster. We use two algorithms, "Random Forest" and "KNN". For each cluster, both algorithms are trained with the best parameters found by GridSearch. We calculate the AUC score of both models and select the model with the better score. The same selection is repeated for every cluster, and the selected models are saved for use in prediction.
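
The clustering and model selection steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's code: the data is synthetic, the target is binary rather than four-class, the number of clusters is fixed by hand instead of being chosen with KneeLocator, and the parameter grids are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed data: two well-separated groups,
# each containing both classes (binary target for brevity).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(8, 1, (200, 4))])
y = (X[:, 0] + X[:, 1] > X[:, 2] + X[:, 3]).astype(int)

# WCSS (inertia) for k = 1..6: these are the values an elbow plot displays.
wcss = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 7)
]

# Assumption: k is fixed by hand here; the document selects it with KneeLocator.
n_clusters = 2
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Candidate algorithms with small, arbitrary parameter grids.
candidates = {
    "RandomForest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
}

best_models = {}
for cluster in range(n_clusters):
    X_c, y_c = X[labels == cluster], y[labels == cluster]
    X_train, X_test, y_train, y_test = train_test_split(
        X_c, y_c, test_size=0.3, stratify=y_c, random_state=0
    )
    scores = {}
    for name, (estimator, grid) in candidates.items():
        # Grid search tunes each algorithm; AUC is measured on held-out data.
        search = GridSearchCV(estimator, grid, cv=3).fit(X_train, y_train)
        auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
        scores[name] = (auc, search.best_estimator_)
    best_name = max(scores, key=lambda n: scores[n][0])
    best_models[cluster] = (best_name, scores[best_name][1])

for cluster, (name, _) in best_models.items():
    print(cluster, name)
```

With the four-class target described in this document, the AUC would need a multiclass variant (for example `roc_auc_score(..., multi_class="ovr")` on the full probability matrix) rather than the binary form used here.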

Prediction Data Description

The client will send the data as multiple sets of files, in batches, to a given location. Apart from the prediction files, we also require a "schema" file from the client, which contains all the relevant information about the prediction files, such as: name of the files, length of the date value in the file name, length of the time value in the file name, number of columns, names of the columns, and their datatypes.

Data Validation - For Prediction Data

In this step, we perform different sets of validation on the given set of prediction files.

Name Validation
Number of Columns
Name of Columns
The datatype of columns
Null values in columns

Data Insertion in Database - For Prediction Data

Database Creation and connection - Create a database with the given name passed. If the database is already created, open the connection to the database.
Table creation in the database - Table with name
Insertion of files in the table

Prediction

Data Export from Db
Data Preprocessing
a) Drop columns not useful for training the model. Such columns were selected while doing the EDA.
b) Replace the invalid values with numpy "nan" so we can use imputer on such values.
c) Encode the categorical values.
d) Check for null values in the columns. If present, impute the null values using the KNN imputer.
Clustering - The KMeans model created during training is loaded, and clusters for the preprocessed prediction data are predicted.
Prediction - Based on the cluster number, the respective model is loaded and is used to predict the data for that cluster.
Once the predictions are made for all the clusters, the predictions, mapped back to the original class names from before label encoding, are saved in a CSV file at a given location, and the location is returned to the client.
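
The prediction flow above can be sketched as follows. Everything here is a synthetic stand-in: the KMeans model, the per-cluster models, and the label encoder are fitted inline instead of being loaded from disk, "?" is assumed to be the invalid-value marker, the column names are hypothetical, and the result is rendered as CSV text rather than written to a file.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)

# Stand-ins for the artifacts saved during training (all synthetic).
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
y_train = (X_train[:, 0] > X_train[:, 1]).astype(int)
encoder = LabelEncoder().fit(["negative", "primary_hypothyroid"])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
models = {
    c: KNeighborsClassifier(n_neighbors=3).fit(
        X_train[kmeans.labels_ == c], y_train[kmeans.labels_ == c]
    )
    for c in range(2)
}

# Prediction batch; "?" is assumed to mark an invalid value.
batch = pd.DataFrame(
    {"TSH": ["0.5", "?", "8.2", "7.9"], "T3": ["-0.3", "0.1", "?", "8.4"]}
)

# Unparseable entries become NaN, then the KNN imputer fills them in.
features = batch.apply(pd.to_numeric, errors="coerce")
imputed = KNNImputer(n_neighbors=1).fit_transform(features)

# Route each row to the model that belongs to its predicted cluster.
clusters = kmeans.predict(imputed)
encoded = np.empty(len(imputed), dtype=int)
for c in np.unique(clusters):
    mask = clusters == c
    encoded[mask] = models[int(c)].predict(imputed[mask])

# Map encoded predictions back to the original class names.
result = batch.assign(Cluster=clusters, Prediction=encoder.inverse_transform(encoded))
csv_text = result.to_csv(index=False)
print(csv_text)
```

Predicting cluster by cluster with a boolean mask keeps the output rows in the same order as the input batch, which matters when the predictions are written next to the original records.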

Deployment

We will deploy the model to a cloud-based service platform, for example AWS, Azure, GCP, Heroku, or Pivotal Web Services. This is a workflow diagram for prediction using the trained model.



sandeepyadav10011995/Thyroid-Detection
