{"id":1846,"date":"2020-09-03T09:22:00","date_gmt":"2020-09-03T09:22:00","guid":{"rendered":"https:\/\/bravosystems.com\/blog\/?p=1846"},"modified":"2020-09-22T10:52:16","modified_gmt":"2020-09-22T10:52:16","slug":"using-machine-learning-to-detect-malicious-urls","status":"publish","type":"post","link":"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/","title":{"rendered":"Using Machine Learning to Detect Malicious URLs"},"content":{"rendered":"\n<p>Malicious URLs are a common and serious threat to cybersecurity. Attackers use them to target end users in many ways, such as hacking attempts, drive-by downloads, denial of service, phishing, social engineering, and many others. One example we often see is the <em>Technical Support Scam<\/em> (TSS), which combines online abuse with social engineering over the phone. In this way, scammers trick users into paying for fake technical support services.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"315\" src=\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/picture1-3-1-1024x315.png\" alt=\"\" class=\"wp-image-1870\" srcset=\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/picture1-3-1-1024x315.png 1024w, https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/picture1-3-1-300x92.png 300w, https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/picture1-3-1-768x236.png 768w, https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/picture1-3-1-1536x472.png 1536w, https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/picture1-3-1-2048x629.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><\/div>\n\n\n\n<p class=\"has-text-align-center\"><em><strong>Left:<\/strong> Example of a passive TSS page that appears professional. 
<strong>Right<\/strong>: Example of an aggressive TSS page that tries to create a sense of urgency through audio messages, continuous pop-ups, browser blocking, and warning messages.<\/em><\/p>\n\n\n\n<p><\/p>\n\n\n\n<p>By far the most common technique for protecting users is <em>blacklisting<\/em>. Lookups against a blacklist are extremely fast, but it is almost impossible to keep the list of malicious URLs up to date: users may click on a malicious URL before it appears in the <em>blacklist<\/em>. Therefore, most recent research explores machine learning approaches that learn as much as possible about the behavior of malicious URLs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Architecture and research<\/strong><\/h2>\n\n\n\n<p>Our machine learning solution uses lexical and host-based features to show that malicious URLs can be predicted efficiently without analyzing page content.<\/p>\n\n\n\n<p>Lexical features are statistical properties of the URL string, such as the length of the URL, the number of special characters, the lengths of its parts (hostname, top-level domain, path, query), the number of delimiters, the length of the longest word, and the number of different tokens. Host-based features used by our solution include IP address properties, WHOIS, and DNS information. We implemented automated services that collect WHOIS and DNS data, as well as services that extract feature values from the collected data. During implementation, we encountered technical challenges such as preparing large datasets and labeling them properly.<\/p>\n\n\n\n<p>We explored six binary classifiers: Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, LightGBM, and Multi-layer Perceptron. Two datasets were used for training and evaluation. The first dataset contains 210 million records; data manipulation and model training were done on a Spark cluster using <em>Apache Spark MLlib<\/em>. 
The second dataset contains 2 million records, so it was possible to train the model with the <em>Scikit-learn<\/em> library.<\/p>\n\n\n\n<p>Experimental results show that combining the proposed URL features with these classifiers achieves an accuracy of 96% to 99%. LightGBM and Random Forest perform best.<\/p>\n\n\n\n<p class=\"break_all\">The full research details, with various examples, can be found in the following paper: <a href=\"https:\/\/infom.fon.bg.ac.rs\/index.php\/infom\/article\/view\/2427\/2359\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/infom.fon.bg.ac.rs\/index.php\/infom\/article\/view\/2427\/2359<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Production<\/strong><\/h2>\n\n\n\n<p>In providing malicious URL detection as a production service, we set out to accomplish several goals. First, the service should achieve high precision with the highest possible recall. Second, detection for online systems has to meet a predefined low response time. Third, the model has to be retrained continuously so that the classifier stays highly accurate. Finally, the system must scale to train models on millions of new data records.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Next steps<\/strong><\/h2>\n\n\n\n<p>Future work includes adding new features, such as NLP (<em>Natural Language Processing<\/em>) categories determined from URLs and <em>bag-of-words<\/em> and <em>n-gram<\/em> representations of keywords, in order to achieve higher accuracy.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Malicious URLs are a common and serious threat to cybersecurity. Attackers use them to target end users in many ways, such as hacking attempts, drive-by downloads, denial of service, phishing, social engineering, and many others. 
<\/p>\n","protected":false},"author":6,"featured_media":1858,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[71],"tags":[],"class_list":["post-1846","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.1 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Using Machine Learning to Detect Malicious URLs - Bravo Systems d.o.o.<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Using Machine Learning to Detect Malicious URLs - Bravo Systems d.o.o.\" \/>\n<meta property=\"og:description\" content=\"Malicious URLs are a common and serious threat to cybersecurity. 
There are many ways for malicious attackers to try to cheat end user such as hacking attempts, drive-by-download, denial of service, phishing, social engineering, and many others.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/\" \/>\n<meta property=\"og:site_name\" content=\"Bravo Systems d.o.o.\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/bravo.systems.doo\/\" \/>\n<meta property=\"article:published_time\" content=\"2020-09-03T09:22:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-09-22T10:52:16+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/2nd_article.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1620\" \/>\n\t<meta property=\"og:image:height\" content=\"700\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Jelena Joki\u0107\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Jelena Joki\u0107\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/\",\"url\":\"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/\",\"name\":\"Using Machine Learning to Detect Malicious URLs - Bravo Systems d.o.o.\",\"isPartOf\":{\"@id\":\"https:\/\/bravosystems.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/2nd_article.jpg\",\"datePublished\":\"2020-09-03T09:22:00+00:00\",\"dateModified\":\"2020-09-22T10:52:16+00:00\",\"author\":{\"@id\":\"https:\/\/bravosystems.com\/blog\/#\/schema\/person\/31d9b4f55c130956d11a56e8a8ff5838\"},\"breadcrumb\":{\"@id\":\"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/#primaryimage\",\"url\":\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/2nd_article.jpg\",\"contentUrl\":\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/2nd_article.jpg\",\"width\":1620,\"height\":700},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"pos
ition\":1,\"name\":\"Home\",\"item\":\"https:\/\/bravosystems.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Using Machine Learning to Detect Malicious URLs\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/bravosystems.com\/blog\/#website\",\"url\":\"https:\/\/bravosystems.com\/blog\/\",\"name\":\"Bravo Systems d.o.o.\",\"description\":\"Blog\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/bravosystems.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/bravosystems.com\/blog\/#\/schema\/person\/31d9b4f55c130956d11a56e8a8ff5838\",\"name\":\"Jelena Joki\u0107\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/bravosystems.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/jelena_jokic.jpg\",\"contentUrl\":\"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/jelena_jokic.jpg\",\"caption\":\"Jelena Joki\u0107\"},\"url\":\"https:\/\/bravosystems.com\/blog\/author\/jelena-jokic\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Using Machine Learning to Detect Malicious URLs - Bravo Systems d.o.o.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/","og_locale":"en_US","og_type":"article","og_title":"Using Machine Learning to Detect Malicious URLs - Bravo Systems d.o.o.","og_description":"Malicious URLs are a common and serious threat to cybersecurity. 
There are many ways for malicious attackers to try to cheat end user such as hacking attempts, drive-by-download, denial of service, phishing, social engineering, and many others.","og_url":"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/","og_site_name":"Bravo Systems d.o.o.","article_publisher":"https:\/\/www.facebook.com\/bravo.systems.doo\/","article_published_time":"2020-09-03T09:22:00+00:00","article_modified_time":"2020-09-22T10:52:16+00:00","og_image":[{"width":1620,"height":700,"url":"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/2nd_article.jpg","type":"image\/jpeg"}],"author":"Jelena Joki\u0107","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Jelena Joki\u0107","Est. reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/","url":"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/","name":"Using Machine Learning to Detect Malicious URLs - Bravo Systems 
d.o.o.","isPartOf":{"@id":"https:\/\/bravosystems.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/#primaryimage"},"image":{"@id":"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/#primaryimage"},"thumbnailUrl":"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/2nd_article.jpg","datePublished":"2020-09-03T09:22:00+00:00","dateModified":"2020-09-22T10:52:16+00:00","author":{"@id":"https:\/\/bravosystems.com\/blog\/#\/schema\/person\/31d9b4f55c130956d11a56e8a8ff5838"},"breadcrumb":{"@id":"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/#primaryimage","url":"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/2nd_article.jpg","contentUrl":"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/2nd_article.jpg","width":1620,"height":700},{"@type":"BreadcrumbList","@id":"https:\/\/bravosystems.com\/blog\/using-machine-learning-to-detect-malicious-urls\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/bravosystems.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Using Machine Learning to Detect Malicious URLs"}]},{"@type":"WebSite","@id":"https:\/\/bravosystems.com\/blog\/#website","url":"https:\/\/bravosystems.com\/blog\/","name":"Bravo Systems 
d.o.o.","description":"Blog","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/bravosystems.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/bravosystems.com\/blog\/#\/schema\/person\/31d9b4f55c130956d11a56e8a8ff5838","name":"Jelena Joki\u0107","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/bravosystems.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/jelena_jokic.jpg","contentUrl":"https:\/\/bravosystems.com\/blog\/wp-content\/uploads\/2020\/09\/jelena_jokic.jpg","caption":"Jelena Joki\u0107"},"url":"https:\/\/bravosystems.com\/blog\/author\/jelena-jokic\/"}]}},"_links":{"self":[{"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/posts\/1846","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/comments?post=1846"}],"version-history":[{"count":24,"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/posts\/1846\/revisions"}],"predecessor-version":[{"id":1898,"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/posts\/1846\/revisions\/1898"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/media\/1858"}],"wp:attachment":[{"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/media?parent=1846"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/categories?post=1846"},{"taxonomy":"post_tag","embeddable":true,"href":"https:
\/\/bravosystems.com\/blog\/wp-json\/wp\/v2\/tags?post=1846"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}