{"id":2104,"date":"2021-07-29T08:16:53","date_gmt":"2021-07-29T08:16:53","guid":{"rendered":"https:\/\/bravosystems.com\/blog\/?p=2104"},"modified":"2021-07-29T13:25:47","modified_gmt":"2021-07-29T13:25:47","slug":"a-comparative-analysis-of-text-categorization-methods","status":"publish","type":"post","link":"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/","title":{"rendered":"A comparative analysis of text categorization methods"},"content":{"rendered":"\n<p>The rapid growth of textual data available in digital form has created a need to extract and summarize the useful information it contains. Text categorization using machine learning methods has become one of the key techniques for this purpose.<\/p>\n\n\n\n<p>The goal of text categorization is to classify documents into a fixed number of predefined categories. Today it is part of most business applications that need to manage textual content based on its semantics. 
Topic classification, sentiment analysis, and spam filtering are just a few of its many use cases.<\/p>\n\n\n\n<p>At Bravo Systems, dedicated teams of Data Scientists work on a range of machine learning problems, including text categorization, which is used in our solutions for finding the best recommendations, optimal prices, value estimations, etc.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"825\" height=\"400\" src=\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/blog.jpg\" alt=\"\" class=\"wp-image-2121\" srcset=\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/blog.jpg 825w, https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/blog-300x145.jpg 300w, https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/blog-768x372.jpg 768w\" sizes=\"auto, (max-width: 825px) 100vw, 825px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>METHODOLOGY<\/strong><\/h2>\n\n\n\n<p>Over the years, many machine learning approaches have been proposed for text categorization. They can broadly be divided into supervised, semi-supervised and unsupervised approaches.<\/p>\n\n\n\n<p>The main goal of this research was a comparative analysis of supervised and unsupervised machine learning methods for text categorization.<\/p>\n\n\n\n<p>All approaches were tested on five standard datasets for text categorization and compared in terms of Precision, Recall and the F1 measure.<\/p>\n\n\n\n<p>Before any of the approaches were applied, the text was pre-processed: tokenization; removal of punctuation marks and other special characters, numerical characters, and stop-list words; and then lemmatization and lowercasing.<\/p>\n\n\n\n<p>We chose four commonly used supervised machine learning algorithms (Na\u00efve Bayes, Support Vector Machine, Decision Tree and Random Forest) and one unsupervised machine learning approach, based on 
the exploitation of the WordNet Lexical Database for English and the Word2Vec technique. This approach models text categorization as a problem of measuring the similarity between two textual documents: one is a document from the dataset, and the other is a category document generated through several steps, which are probably the most important part of the whole approach.<\/p>\n\n\n\n<p>Full details of the approaches, datasets, results and discussion can be found in the research paper by Ana Bojani\u0107, Data Science Engineer, and Zoran \u0110uri\u0107, PhD, Data Science Lead at Bravo Systems, available at: <a href=\"https:\/\/infom.fon.bg.ac.rs\/index.php\/infom\/article\/view\/2459\/2391\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/infom.fon.bg.ac.rs\/index.php\/infom\/article\/view\/2459\/2391<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>RESULTS<\/strong><\/h2>\n\n\n\n<p>For the majority of the applied algorithms, on all datasets, the achieved precision and recall are in the range of 70-90%.<\/p>\n\n\n\n<p>In terms of the metrics mentioned above, the supervised algorithms perform better on four datasets, while the unsupervised approach shows better results on one. There is one key difference, however: the unsupervised approach doesn\u2019t rely on labeled training datasets, whereas supervised approaches require them. Since labeling a dataset can be a very costly and time-consuming operation, especially with very large datasets and a large number of predefined categories, this is clearly one of the biggest advantages of the unsupervised approach. 
Furthermore, the unsupervised approach makes it relatively easy to modify the set of predefined categories, which is a common requirement in many applications.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>FUTURE WORK<\/strong><\/h2>\n\n\n\n<p>Finally, suggestions for further research in this area include using other word embedding techniques instead of Word2Vec, other techniques for comparing textual documents, other lexical resources, etc.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The rapid growth of textual data available in digital form has created a need to extract and summarize the useful information it contains. <\/p>\n","protected":false},"author":11,"featured_media":2106,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[71],"tags":[],"class_list":["post-2104","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.1 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>A comparative analysis of text categorization methods - Bravo Systems d.o.o.<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"A comparative analysis of text categorization methods - Bravo Systems d.o.o.\" \/>\n<meta property=\"og:description\" content=\"Rapid growth of available textual data in digital form led to the emergence of need for extraction and summarization of contained useful information.\" \/>\n<meta property=\"og:url\" 
content=\"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/\" \/>\n<meta property=\"og:site_name\" content=\"Bravo Systems d.o.o.\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/bravo.systems.doo\/\" \/>\n<meta property=\"article:published_time\" content=\"2021-07-29T08:16:53+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2021-07-29T13:25:47+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/naslovna.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1620\" \/>\n\t<meta property=\"og:image:height\" content=\"700\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Ana Bojani\u0107\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Ana Bojani\u0107\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/\",\"url\":\"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/\",\"name\":\"A comparative analysis of text categorization methods - Bravo Systems d.o.o.\",\"isPartOf\":{\"@id\":\"https:\/\/bravosystems.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/naslovna.jpg\",\"datePublished\":\"2021-07-29T08:16:53+00:00\",\"dateModified\":\"2021-07-29T13:25:47+00:00\",\"author\":{\"@id\":\"https:\/\/bravosystems.com\/blog\/#\/schema\/person\/0185c04a07039bff067f48f44be82963\"},\"breadcrumb\":{\"@id\":\"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/#primaryimage\",\"url\":\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/naslovna.jpg\",\"contentUrl\":\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/naslovna.jpg\",\"width\":1620,\"height\":700},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/#breadcrumb\",\"ite
mListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/bravosystems.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"A comparative analysis of text categorization methods\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/bravosystems.com\/blog\/#website\",\"url\":\"https:\/\/bravosystems.com\/blog\/\",\"name\":\"Bravo Systems d.o.o.\",\"description\":\"Blog\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/bravosystems.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/bravosystems.com\/blog\/#\/schema\/person\/0185c04a07039bff067f48f44be82963\",\"name\":\"Ana Bojani\u0107\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/bravosystems.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/anabojanic.jpg\",\"contentUrl\":\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/anabojanic.jpg\",\"caption\":\"Ana Bojani\u0107\"},\"url\":\"https:\/\/bravosystems.com\/blog\/author\/ana-bojanic\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"A comparative analysis of text categorization methods - Bravo Systems d.o.o.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/","og_locale":"en_US","og_type":"article","og_title":"A comparative analysis of text categorization methods - Bravo Systems d.o.o.","og_description":"Rapid growth of available textual data in digital form led to the emergence of need for extraction and summarization of contained useful information.","og_url":"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/","og_site_name":"Bravo Systems d.o.o.","article_publisher":"https:\/\/www.facebook.com\/bravo.systems.doo\/","article_published_time":"2021-07-29T08:16:53+00:00","article_modified_time":"2021-07-29T13:25:47+00:00","og_image":[{"width":1620,"height":700,"url":"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/naslovna.jpg","type":"image\/jpeg"}],"author":"Ana Bojani\u0107","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Ana Bojani\u0107","Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/","url":"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/","name":"A comparative analysis of text categorization methods - Bravo Systems d.o.o.","isPartOf":{"@id":"https:\/\/bravosystems.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/#primaryimage"},"image":{"@id":"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/#primaryimage"},"thumbnailUrl":"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/naslovna.jpg","datePublished":"2021-07-29T08:16:53+00:00","dateModified":"2021-07-29T13:25:47+00:00","author":{"@id":"https:\/\/bravosystems.com\/blog\/#\/schema\/person\/0185c04a07039bff067f48f44be82963"},"breadcrumb":{"@id":"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/#primaryimage","url":"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/naslovna.jpg","contentUrl":"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/naslovna.jpg","width":1620,"height":700},{"@type":"BreadcrumbList","@id":"https:\/\/bravosystems.com\/blog\/a-comparative-analysis-of-text-categorization-methods\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/bravosystems.com\/blog\/"},{"@type":"ListItem","position":2,"name":"A comparative analysis of text categorization 
methods"}]},{"@type":"WebSite","@id":"https:\/\/bravosystems.com\/blog\/#website","url":"https:\/\/bravosystems.com\/blog\/","name":"Bravo Systems d.o.o.","description":"Blog","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/bravosystems.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/bravosystems.com\/blog\/#\/schema\/person\/0185c04a07039bff067f48f44be82963","name":"Ana Bojani\u0107","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/bravosystems.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/anabojanic.jpg","contentUrl":"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2021\/07\/anabojanic.jpg","caption":"Ana Bojani\u0107"},"url":"https:\/\/bravosystems.com\/blog\/author\/ana-bojanic\/"}]}},"_links":{"self":[{"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/posts\/2104","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/comments?post=2104"}],"version-history":[{"count":8,"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/posts\/2104\/revisions"}],"predecessor-version":[{"id":2124,"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/posts\/2104\/revisions\/2124"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/media\/2106"}],"wp:attachment":[{"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/media?parent=2104"}],"wp:term":[{"taxonomy":"category","embeddable":tr
ue,"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/categories?post=2104"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/tags?post=2104"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}