This project collects big data analysis practices in Python. They cover a range of techniques and models commonly used in machine learning.
Implementation of handling missing data, standardization, normalization, handling non-numerical data, etc.
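As a rough illustration of these preprocessing steps, here is a minimal sketch using pandas and scikit-learn. The toy DataFrame and its column names (`age`, `city`) are hypothetical and not taken from the project datasets; the code in the folders may handle these steps differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Hypothetical toy data: a numeric column with a missing value
# and a non-numerical (categorical) column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 45.0],
    "city": ["Boston", "Austin", "Boston", "Denver"],
})

# Numeric branch: fill missing values with the column mean, then standardize
# to zero mean and unit variance.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical branch: one-hot encode the non-numerical column.
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(df)

# Normalization as an alternative to standardization: rescale to [0, 1].
age_filled = df[["age"]].fillna(df["age"].mean())
age_normalized = MinMaxScaler().fit_transform(age_filled)
```

Here `features` has one standardized numeric column followed by one column per city, and `age_normalized` holds the same ages rescaled to the range 0 to 1.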
Implementation and research of clustering models such as KMeans++, Hierarchical, and GMM; methods for determining the optimal number of clusters, such as the Elbow Method for KMeans++, dendrograms for Hierarchical, and the Silhouette Coefficient for GMM. Implementation of data visualization using heatmaps, Folium, scatter plots, seaborn, etc. Implementation of image manipulation and compression.
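Two of the cluster-number methods can be sketched as follows, assuming scikit-learn and synthetic data. The blob centers are made up for illustration, and the dendrogram approach is not shown. The loop records KMeans++ inertia for the Elbow Method and the Silhouette Coefficient for GMM.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic data with three well-separated groups (hypothetical centers).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Elbow Method: record the within-cluster sum of squares (inertia) for each k.
# The "elbow" is the k after which the inertia stops dropping sharply.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

# Silhouette Coefficient: fit a GMM for each k and score its hard assignments.
# A higher score means tighter, better-separated clusters.
silhouettes = {}
for k in range(2, 7):
    labels = GaussianMixture(n_components=k, random_state=0).fit_predict(X)
    silhouettes[k] = silhouette_score(X, labels)

best_k = max(silhouettes, key=silhouettes.get)
```

Plotting `inertias` against k gives the elbow curve, while `best_k` is the number of components with the highest silhouette score.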
Implementation and research of classification models such as Logistic Regression and kNN. Implementation of PCA to reduce data dimensionality.
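A minimal sketch of combining PCA with these classifiers, assuming scikit-learn and its bundled Iris dataset (the project itself uses other datasets). The choice of 2 components and 5 neighbors is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, clf in models.items():
    # Standardize, project the 4 features onto 2 principal components,
    # then classify in the reduced space.
    pipe = make_pipeline(StandardScaler(), PCA(n_components=2), clf)
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)  # accuracy on held-out data
```

Wrapping the scaler and PCA in the pipeline means both are fitted on the training split only, so no information from the test split leaks into the projection.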
Use of graphs (networks) to build a recommendation system.
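One simple way to recommend from a graph is to treat users and items as the two sides of a bipartite graph and score unseen items by counting two-hop paths through shared items. The sketch below uses only the standard library; the `likes` data and the `recommend` helper are hypothetical and are not the method used in the project folder.

```python
from collections import Counter

# Hypothetical bipartite graph: each user is linked to the items they liked.
likes = {
    "alice": {"Inception", "Interstellar"},
    "bob": {"Inception", "Interstellar", "Tenet"},
    "carol": {"Inception", "Up"},
    "dave": {"Up", "Coco"},
}


def recommend(user, likes, top_n=3):
    """Rank items the user has not seen by overlap with similar users."""
    seen = likes[user]
    scores = Counter()
    for other, items in likes.items():
        if other == user:
            continue
        # Number of items shared with this neighbor (paths user-item-other).
        overlap = len(seen & items)
        if overlap == 0:
            continue
        # Each unseen item of the neighbor is weighted by that overlap.
        for item in items - seen:
            scores[item] += overlap
    # Sort by score (descending), breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("alice", likes))  # ['Tenet', 'Up']
```

Here "Tenet" ranks first for alice because bob shares two items with her, while carol, who contributes "Up", shares only one.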
This is an independent project in which I developed a pipeline to predict movie star ratings from real Amazon movie data, including movie info and text reviews. I used TfidfVectorizer to convert the text reviews into a feature matrix, then implemented and compared several classification models (Ridge Regression, Perceptron, Passive Aggressive, KNN, Random Forest) to make the prediction.
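The overall shape of such a pipeline might look like the sketch below, assuming scikit-learn. The reviews and star labels are invented toy data rather than the Amazon dataset, all hyperparameters are defaults or arbitrary, and the Passive Aggressive model is left out for brevity.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron, RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Invented toy reviews with star ratings (not the real Amazon data).
reviews = [
    "a wonderful film with great acting",
    "great story and wonderful music",
    "moving, beautiful and great fun",
    "boring plot and terrible acting",
    "a terrible waste of time",
    "dull, boring and far too long",
]
stars = [5, 5, 5, 1, 1, 1]
new_reviews = ["a wonderful and moving film", "boring and terrible"]

models = {
    "ridge": RidgeClassifier(),
    "perceptron": Perceptron(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

predictions = {}
for name, clf in models.items():
    # TfidfVectorizer turns each review into a sparse TF-IDF vector;
    # the classifier is then trained on that matrix.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(reviews, stars)
    predictions[name] = pipe.predict(new_reviews).tolist()
```

On real data the models would be compared on a held-out split or with cross-validation instead of a couple of hand-written reviews.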
The practices are separated by topic. Each folder contains a PDF file named “Problems.pdf”, which describes the topic of that folder and lists all the problems to be solved. The code for the solutions is included in the same folder, while summaries, charts, and results can be found in “Solutions.pdf”. To run the code, you need to download the dataset from the links in “Problems.pdf”.