Text Classification for Research Papers

Problem Statement

Develop a text classification model to predict the relevant terms or labels associated with research papers based on their titles and summaries.

Objective

The objective of this project is to accurately classify research papers into relevant categories or terms, enabling better organization and retrieval of information.

Data Description

The dataset consists of research papers along with their titles, summaries, and associated terms. Each term represents a category or label relevant to the paper. The dataset is in tabular format, stored as a pandas DataFrame.

Methodology

Data Preprocessing

Lowercasing: Converted all text to lowercase to ensure uniformity.
Tokenization: Split the text into individual tokens (words) to facilitate further processing.
Special Character Removal: Removed special characters and other non-alphanumeric characters from the text.
Stopword Removal: Eliminated common stopwords (e.g., "and," "the," "is") from the text as they do not contribute significantly to the classification task.
Lemmatization: Reduced words to their base or dictionary form to normalize the text and reduce dimensionality.
Data Filtering: Narrowed down the dataset to focus on a subset of labels with significant occurrences to address class imbalance.
Data Visualization: Visualized the distribution of label frequencies and label combination frequencies to gain insights into the dataset's characteristics.
Text Embeddings: Embedded the text using Word2Vec and encoded the target variable with a MultiLabelBinarizer to handle multi-label classification.
Saving Data: Saved the preprocessed data for model training.
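
The text-cleaning steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the code from data_preprocessing.ipynb: the STOPWORDS set, the preprocess function, and the sample title and summary are assumptions made for the example, and lemmatization is only marked as a comment because it needs an NLP library.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a
# full list from an NLP library.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "for", "on", "with"}


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, strip special characters, and drop stopwords."""
    # Lowercasing
    text = text.lower()
    # Tokenization on whitespace
    tokens = text.split()
    # Special character removal: keep only letters and digits in each token
    tokens = [re.sub(r"[^a-z0-9]", "", tok) for tok in tokens]
    # Drop tokens that became empty, then remove stopwords
    tokens = [tok for tok in tokens if tok and tok not in STOPWORDS]
    # Lemmatization would be applied here; omitted to avoid extra dependencies.
    return tokens


# Assumed example input: a title and a summary joined into one string
title = "Attention Is All You Need!"
summary = "The Transformer, a model based on attention, is proposed."
tokens = preprocess(title + " " + summary)
print(tokens)
```

The cleaned tokens would then be passed to the embedding step, while the labels are encoded separately.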

Model Selection

SVM-Label Powerset: Used a Support Vector Machine (SVM) with the Label Powerset approach to handle multilabel classification. This approach transforms the multilabel problem into a single multi-class problem, where each distinct combination of labels is treated as a separate class.
Random Forest-MultiOutputClassifier: Employed Random Forest with the MultiOutputClassifier wrapper to handle multilabel classification. Random Forest builds an ensemble of decision trees, and the wrapper fits one such classifier per label so that all labels are predicted for each paper.
KNN-Classifier Chains: Applied k-Nearest Neighbors (KNN) with the Classifier Chains approach for multilabel classification. In this approach, multiple binary classifiers are trained in a chain, where each classifier predicts one label and uses the predictions of previous classifiers as additional features.
LSTM and BiLSTM: Implemented Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) neural network architectures for sequence modeling and text classification. These models are well-suited for processing sequential data like text and can capture long-range dependencies effectively.
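
As a rough illustration of the three classical strategies listed above, the sketch below uses scikit-learn on synthetic data. It is not the code from the notebooks: the data from make_multilabel_classification stands in for the Word2Vec features and binarized labels, all hyperparameters are arbitrary, and Label Powerset is written by hand here because scikit-learn has no built-in wrapper for it.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for document embeddings (X) and multi-hot labels (Y)
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, n_labels=2, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# 1) Label Powerset: every distinct label combination becomes one class
combos = sorted({tuple(int(v) for v in row) for row in Y_train})
combo_to_class = {combo: idx for idx, combo in enumerate(combos)}
y_train_lp = np.array(
    [combo_to_class[tuple(int(v) for v in row)] for row in Y_train]
)
svm = SVC(kernel="rbf").fit(X_train, y_train_lp)
# Map each predicted class id back to its label combination
Y_pred_svm = np.array([combos[c] for c in svm.predict(X_test)])

# 2) MultiOutputClassifier: one independent Random Forest per label
rf = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=50, random_state=0)
).fit(X_train, Y_train)
Y_pred_rf = rf.predict(X_test)

# 3) Classifier Chain: each KNN also sees the labels earlier in the chain
chain = ClassifierChain(KNeighborsClassifier(n_neighbors=5)).fit(X_train, Y_train)
Y_pred_knn = chain.predict(X_test).astype(int)

print(Y_pred_svm.shape, Y_pred_rf.shape, Y_pred_knn.shape)
```

One consequence of Label Powerset worth noting is that it can only predict label combinations that occurred in the training data, whereas the other two wrappers can produce any combination.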

Training

Trained each model on the preprocessed data, using cross-validation and hyperparameter tuning to optimize performance.
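
A minimal sketch of cross-validated hyperparameter tuning, assuming scikit-learn and using KNN with Classifier Chains as the example model. The synthetic data, the candidate values for n_neighbors, the 5-fold split, and the micro-averaged F1 score are all assumptions for illustration, not the settings used in the notebooks.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import cross_val_score
from sklearn.multioutput import ClassifierChain
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed features and multi-hot labels
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, random_state=0
)

candidate_k = [3, 5, 9]  # assumed search space for n_neighbors
mean_scores = {}
for k in candidate_k:
    model = ClassifierChain(KNeighborsClassifier(n_neighbors=k))
    # 5-fold cross-validation scored with micro-averaged F1
    scores = cross_val_score(model, X, Y, cv=5, scoring="f1_micro")
    mean_scores[k] = scores.mean()

# Keep the value with the best mean cross-validated score
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_k)
```

The same loop could be extended to other hyperparameters or replaced by a grid-search utility; the explicit loop is used here only to keep each step visible.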

Evaluation

Evaluated the trained models on a separate validation set using metrics appropriate for multilabel classification, such as F1-score, precision, recall, accuracy, and ROC curves. Analyzed each model's performance and iteratively refined the models.
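
To make the multilabel metrics concrete, the sketch below computes several of them with scikit-learn on a small hand-made example. The Y_true and Y_pred arrays are invented for illustration and are not results from this project. Hamming loss is added as a common companion metric, and ROC curves are left out because they need continuous scores rather than hard 0/1 predictions.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    hamming_loss,
    precision_score,
    recall_score,
)

# Invented example: 4 papers, 3 labels, multi-hot encoded
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

# Micro averaging pools true/false positives over all labels
micro_precision = precision_score(Y_true, Y_pred, average="micro")
micro_recall = recall_score(Y_true, Y_pred, average="micro")
micro_f1 = f1_score(Y_true, Y_pred, average="micro")
# Macro averaging computes F1 per label, then takes the unweighted mean
macro_f1 = f1_score(Y_true, Y_pred, average="macro")
# For multilabel input, accuracy_score is subset accuracy:
# a paper counts as correct only if every label matches
subset_accuracy = accuracy_score(Y_true, Y_pred)
# Fraction of individual label decisions that are wrong
h_loss = hamming_loss(Y_true, Y_pred)

print(micro_precision, micro_recall, micro_f1, macro_f1, subset_accuracy, h_loss)
```

Subset accuracy is a strict measure, so it is usually read alongside the averaged F1 scores rather than on its own.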

Conclusion

KNN with Classifier Chains was the most promising approach, as it trained quickly and gave the highest overall performance.

The developed text classification model demonstrates promising performance in accurately categorizing research papers into relevant terms or labels based on their titles and summaries. By effectively organizing and classifying research papers, the model enhances the accessibility and retrieval of information, facilitating knowledge discovery and research exploration in the domain. Further enhancements and refinements can be explored to improve the model's robustness and applicability in real-world scenarios.
