The data is split into train and test datasets. Each row has the following format: id,text,label
The dataset is obtained from the SuDer Turkish News Collections.
- Lowercase conversion
- Category to integer conversion
- Tokenization
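
The three preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `turkish_lower` and `preprocess`, the whitespace tokenizer, and the sample rows are all assumptions. The explicit handling of `I` and `İ` is also an assumption, added because Python's `str.lower()` is not locale-aware and would otherwise map the Turkish `I` to `i` instead of `ı`.

```python
def turkish_lower(text):
    """Lowercase text, mapping the Turkish dotless/dotted I pair by hand.

    Assumption: the input is Turkish, so "I" -> "ı" and "İ" -> "i".
    """
    return text.replace("I", "ı").replace("İ", "i").lower()


def preprocess(rows):
    """Turn (id, text, label) rows into token lists and integer labels.

    Hypothetical helper: tokenization here is a plain whitespace split.
    """
    rows = list(rows)
    # Assign integers to categories in sorted order so the mapping is stable.
    categories = sorted({label for _, _, label in rows})
    label_to_int = {label: i for i, label in enumerate(categories)}
    tokens = [turkish_lower(text).split() for _, text, _ in rows]
    y = [label_to_int[label] for _, _, label in rows]
    return tokens, y, label_to_int


# Made-up sample rows in the id,text,label layout.
sample_rows = [
    (1, "Galatasaray maçı KAZANDI", "spor"),
    (2, "Borsa İstanbul yükseldi", "ekonomi"),
]
tokens, y, label_to_int = preprocess(sample_rows)
print(tokens)
print(y, label_to_int)
```

Sorting the categories before numbering them keeps the label mapping identical between the train and test sets, provided both contain the same categories.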
Term Frequency - Inverse Document Frequency (TF-IDF) is a text representation that weights each term by how often it occurs in a document and by how rare it is across all documents. It converts documents to numerical vectors. Each vector represents a document, and each dimension corresponds to a term. Therefore we obtain a vector space in which documents can be compared. For more information, click here. The package is also accessible here.
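A small sketch of TF-IDF vectorization, assuming scikit-learn's `TfidfVectorizer` is the package in question (the text does not name it). The three documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents; in the project these would be the news texts.
docs = [
    "borsa bugün yükseldi",
    "takım bugün maçı kazandı",
    "borsa düştü",
]

vectorizer = TfidfVectorizer()
# X is a sparse matrix with one row per document and one column per term.
X = vectorizer.fit_transform(docs)

print(X.shape)
# vocabulary_ maps each term to its column index in X.
print(sorted(vectorizer.vocabulary_))
```

With the default settings each row is L2-normalized, and terms that appear in many documents (here `bugün` and `borsa`) receive a lower inverse document frequency than terms that appear in only one.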
GridSearchCV exhaustively searches the given parameter grid and selects the combination with the best cross-validated score. It is used to tune both Naive Bayes and Logistic Regression. For more information, click here.
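The following sketch shows how such a search might look for both models. Several parts are assumptions rather than the project's actual setup: `MultinomialNB` as the Naive Bayes variant, the parameter grids, the 2-fold cross-validation, the hypothetical `tune` helper, and the toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data: 0 = ekonomi, 1 = spor (made up for illustration).
texts = [
    "borsa bugün yükseldi", "dolar kuru düştü",
    "borsa haftayı kayıpla kapattı", "dolar ve borsa yükseldi",
    "takım maçı kazandı", "takım maçı kaybetti",
    "maç berabere bitti", "takım yeni maç için hazır",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def tune(estimator, param_grid, texts, labels):
    """Grid-search a TF-IDF + classifier pipeline (hypothetical helper)."""
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", estimator)])
    # cv=2 only because the toy data is tiny; real data allows more folds.
    search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
    search.fit(texts, labels)
    return search


# Parameter names are prefixed with the pipeline step name ("clf__").
nb_search = tune(MultinomialNB(), {"clf__alpha": [0.1, 1.0]}, texts, labels)
lr_search = tune(
    LogisticRegression(max_iter=1000),
    {"clf__C": [0.1, 1.0, 10.0]},
    texts,
    labels,
)

print(nb_search.best_params_, nb_search.best_score_)
print(lr_search.best_params_, lr_search.best_score_)
```

Placing the vectorizer inside the pipeline means it is refit on each training fold, so the validation fold never influences the TF-IDF weights.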
Results are measured on the test data. Naive Bayes reaches an accuracy of 0.702, and Logistic Regression reaches an accuracy of 0.824.