A comparative analysis of text categorization methods
The rapid growth of textual data available in digital form has created a need to extract and summarize the useful information it contains. Text categorization using machine learning methods has become one of the key techniques for this purpose.
The goal of text categorization is to classify documents into a fixed number of predefined categories. Today it is part of most business applications that need to manage textual content based on its semantics. Topic classification, sentiment analysis and spam filtering are only a few of its many use cases.
At Bravo Systems, dedicated teams of data scientists work on various machine learning problems, including text categorization, which is used in our solutions for finding the best recommendations, optimal prices, value estimations, etc.
METHODOLOGY
Over the years, many machine learning approaches have been proposed for solving the problem of text categorization. They can broadly be divided into supervised, semi-supervised and unsupervised approaches.
The main goal of this research was a comparative analysis of supervised and unsupervised machine learning methods for text categorization.
All approaches were tested on five standard datasets for text categorization and compared in terms of Precision, Recall and F1 measure.
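As a reminder of how these three measures relate, here is a minimal sketch in plain Python that computes them per category and then macro-averages them. The function names and the toy labels are hypothetical, and macro-averaging is an assumption made for illustration; the averaging scheme used in the research is described in the paper.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted category was wrong
            fn[truth] += 1  # the true category was missed
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def macro_average(metrics):
    """Unweighted mean of precision, recall and F1 over all categories."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


# Toy example (hypothetical labels, not taken from the datasets in the study).
y_true = ["sport", "sport", "politics", "tech"]
y_pred = ["sport", "politics", "politics", "tech"]
scores = per_class_metrics(y_true, y_pred)
print(scores)
print(macro_average(scores))
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.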
Before any of the approaches was applied, the text was pre-processed. The steps included tokenization, removal of punctuation marks, other special characters, numerical characters and stop-list words, followed by lemmatization and lowercasing.
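The pre-processing steps above can be sketched roughly as follows. This is a simplified illustration, not the pipeline used in the research: the stop list and the lemma table are tiny hypothetical stand-ins for real resources, the tokenizer handles ASCII letters only, and lowercasing is applied first so that the stop-list lookup is case-insensitive.

```python
import re

# Toy resources for illustration only. A real pipeline would use a full
# stop list and a dictionary-based lemmatizer instead of these stand-ins.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}
LEMMAS = {"cats": "cat", "dogs": "dog", "running": "run", "prices": "price"}


def preprocess(text):
    """Lowercase, tokenize, drop non-letters and stop words, then lemmatize."""
    text = text.lower()
    # Keeping only runs of letters tokenizes the text and, at the same time,
    # drops punctuation marks, special characters and digits.
    tokens = re.findall(r"[a-z]+", text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Words missing from the toy lemma table are kept unchanged.
    return [LEMMAS.get(t, t) for t in tokens]


print(preprocess("The 3 cats are running, and 2 dogs!"))
# prints ['cat', 'run', 'dog']
```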
We selected four commonly used supervised machine learning algorithms: Naïve Bayes, Support Vector Machine, Decision Tree and Random Forest. We also selected one unsupervised machine learning approach, based on the exploitation of the WordNet lexical database for English and the Word2Vec technique. It models text categorization as a problem of measuring the similarity of textual documents, where one document comes from the dataset and the other is a category document. The category document is generated through several steps, which are probably the most important part of the whole approach.
All the details about the applied approaches, the datasets used, the results and the discussion can be found in the research paper by Ana Bojanić, Data Science Engineer, and Zoran Đurić, PhD, Data Science Lead at Bravo Systems, available at: https://infom.fon.bg.ac.rs/index.php/infom/article/view/2459/2391
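To make the similarity-based idea concrete, here is a minimal sketch that assigns a document to the category whose category document it resembles most. It is only an illustration under strong assumptions: plain word counts and cosine similarity are swapped in for the Word2Vec-based representation, the category documents are written by hand rather than generated through the steps described in the paper, and all names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(doc_tokens, category_docs):
    """Return the best-matching category name and all similarity scores."""
    doc_vec = Counter(doc_tokens)
    scores = {
        name: cosine_similarity(doc_vec, Counter(tokens))
        for name, tokens in category_docs.items()
    }
    return max(scores, key=scores.get), scores


# Hand-written category documents (hypothetical, for illustration only).
category_docs = {
    "sport": ["match", "team", "goal", "player", "league"],
    "finance": ["price", "market", "stock", "bank", "loan"],
}

label, scores = categorize(["team", "score", "goal", "late", "match"], category_docs)
print(label, scores)
```

No labeled training documents are needed here. Adding or changing a category only requires adding or editing its category document.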
RESULTS
For the majority of the applied algorithms, on all datasets, the achieved precision and recall are in the range of 70-90%.
In terms of the metrics mentioned above, the supervised algorithms perform better on four datasets, while the unsupervised approach shows better results on one dataset. There is one key difference, however: the unsupervised approach does not rely on labeled training datasets, while for the supervised approaches they are mandatory. Dataset labeling can be very costly and time-consuming, especially with very large datasets and a large number of predefined categories, so this is one of the biggest advantages of the unsupervised approach. Furthermore, the unsupervised approach allows relatively easy modification of the set of predefined categories, which is a common request in many applications.
FUTURE WORK
Finally, suggestions for further research in this area include using other word embedding techniques instead of Word2Vec, other techniques for comparing textual documents, other lexical resources, etc.
Written by
Ana Bojanić
Data Science Engineer/Software Developer