Reddit Comments Classification - Kaggle Competition

My Latest Projects | 2019 | Links: Source

Motivation

Here, my colleague Ilyas Amaniss and I analyzed the different techniques used in text classification, notably the various ways of performing feature extraction and the performance of the different algorithms used in this task. To test and compare these techniques, we entered the 2019 Kaggle Competition on Reddit comments classification where the goal was to design a machine learning algorithm to automatically sort short texts into a pre-determined set of topics. These texts were extracted from raw posts and comments written by Reddit users.

In most classification tasks, working on raw data can prove to be tedious. Therefore, it is beneficial to process and treat the data before starting any analysis. This process is the feature design and feature extraction. While they comprise of countless methods, some may speed up the learning process, reduce the size of the dataset or increase the accuracy of the model. However, it is important to carefully choose the feature extraction methods as they may also decrease the model’s efficiency. For the Reddit comments classification competition, some of the techniques we decided to use are, but not limited to:

Feature design phase

Tokenization
Removal of stop words
Stemming and Lemming
Term Frequency-Inverse Document Frequency (TF-IDF)

After processing the data, it is then time to choose the model to categorize the different text corpus into topics. Obviously, depending on the feature extractions methods used, some models may before better than others. For this project, we focused on 3 methods:

Algorithms

Naive Bayes
Support Vector Machine (SVM)
Mulilayer Perceptron/Feedforward Neural Network

Article

For the complete methodology, results, analysis and the rest of our findings see the full article below.PDF.js Example

Link to the article.

Reddit Comments Classification - Kaggle Competition

Motivation

Article

Arnold Kokoroko

Error

Motivation

Article

Templates (for web app):

Error