Detecting malicious URLs : a machine learning approach

Weedon, Martyn Andrew (2018) Detecting malicious URLs : a machine learning approach. (MSc(R) thesis), Kingston University.


Security is a major concern on the Internet today. Phishing and malware attacks are among those to which many users fall victim, because criminals use deceitful tactics to lure users to malicious websites (URLs). The number of malicious URLs is growing: the Anti-Phishing Working Group (2017) reported that over 1.2 million malicious URLs were detected between October and December 2016. Blacklisting is still the most common defence users have against such URLs, but it is failing to cope with the increasing number. In recent years, researchers have devised new detection methods based on machine learning. The aim of this research is to build a classifier that uses only the lexical features of a URL to determine whether it is malicious or benign. A review of the literature reveals that classifiers trained on the lexical features of a URL achieve high accuracy. Because the aim of this work is to detect malicious URLs, it is important to keep the false negative rate as low as possible. False negatives are URLs that are classed as benign when they are in fact malicious. A misclassification of this nature potentially puts the user at greater risk than a benign URL being classed as malicious. In this thesis, the Random Forest algorithm is shown to outperform three other machine learning algorithms (Naive Bayes, Logistic Regression and J48), yielding an accuracy of 86.9% with a false negative rate of 20.6%. The results are shown to be consistent with those published in the literature. Further experiments on the Random Forest algorithm reveal that applying a cost matrix improves the accuracy by over 2.5%. In addition, applying a bag-of-words technique to the data also has a positive impact on the Random Forest algorithm, producing an overall accuracy of 90.2% and a false negative rate of 15.6%.
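
The terms used in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the feature set in `lexical_features`, the tokenisation rule in `bag_of_words`, and the helper name `evaluate` are all assumptions chosen for illustration. The sketch shows what it means to derive features from the URL string alone, and how accuracy and the false negative rate (malicious URLs classed as benign, as a fraction of all malicious URLs) are computed from predictions.

```python
import re
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Hypothetical lexical features, computed from the URL string only
    (no page content, DNS or WHOIS lookups)."""
    # Add a scheme if missing so that urlsplit can locate the host.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        # Simplified check for a dotted-quad IPv4 address used as the host.
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
    }


def bag_of_words(url: str) -> list:
    """Split a URL into lowercase tokens on non-alphanumeric characters
    (an assumed tokenisation rule)."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok]


def evaluate(y_true, y_pred):
    """Return (accuracy, false_negative_rate); label 1 = malicious, 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # malicious, flagged
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # malicious, missed
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # benign, passed
    accuracy = (tp + tn) / len(pairs)
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, fnr


if __name__ == "__main__":
    example = "http://paypal-login.example.com/verify?id=123"  # made-up URL
    print(lexical_features(example))
    print(bag_of_words(example))
    print(evaluate([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1]))
```

In a full pipeline, feature dictionaries and token counts of this kind would be passed to a classifier such as Random Forest. A cost matrix would then penalise false negatives more heavily than false positives, trading some false alarms for fewer missed malicious URLs.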
