Bjerkestrand, Therese Haalen (2016) Intrusion detection decision tree classifier from imbalanced datasets using feature selection. (MSc(R) thesis), Kingston University.
Abstract
The work described within this dissertation addresses two questions: when building decision trees, is there any benefit in terms of accuracy from dealing with imbalanced datasets, and from using feature selection schemes? The KDD99 dataset provided our training and testing sets, and a number of experiments were conducted on the WEKA platform employing decision tree learning algorithms. We examined how to deal with the imbalanced KDD99 dataset and how different types of feature selection algorithms affect the accuracy of the learning system. Both filter feature selection and wrapper feature selection were used in the experiments, as well as a cost-sensitive learning system using different cost matrices. Similar work has been reported in the literature. Most of the reported work, however, used Denial-of-Service (DoS) attacks, which have time dependencies and are therefore not well suited to knowledge representation with decision trees. Our work differs in that we concentrated on attacks with no time dependency, such as User-to-Root (U2R). The findings from this work illustrate that cleaning and preparation of the dataset is a vital part of the knowledge discovery process. It is crucial to have a complete understanding of the data when attempting to create a more evenly distributed dataset. The results also support the prediction that selecting the most relevant features, and by doing so reducing the time and cost spent on analysis, is beneficial in terms of accuracy. However, the benefit will vary depending on the type of feature selection algorithm applied, and in this case, wrapper feature selection proved more beneficial than filter feature selection. Although wrapper feature selection significantly improved the true positive (TP) rate and the recall value for the attack instances, the best results were produced using a cost-sensitive classifier and a cost matrix containing different penalty points for the different classifications.
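The following is a minimal sketch, not taken from the thesis, of how the two techniques mentioned in the abstract (wrapper feature selection around a decision tree, and cost-sensitive classification with a cost matrix) can be combined in WEKA. The input file name "kdd99.arff", the assumption that the class attribute is last, the two-class setup and the penalty values in the cost matrix are all illustrative assumptions; the thesis's actual preprocessing, class labels and cost matrices may differ.

```java
// Illustrative sketch only: cost-sensitive J48 decision tree wrapped around
// wrapper-based feature selection in WEKA. File name, class layout and cost
// values are assumptions, not the thesis's actual experimental settings.
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveKddSketch {
    public static void main(String[] args) throws Exception {
        // Assumed preprocessed KDD99 subset in ARFF format, class attribute last.
        Instances data = DataSource.read("kdd99.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Wrapper feature selection: candidate feature subsets are scored by
        // the performance of the target learner (J48) itself.
        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new J48());

        AttributeSelectedClassifier selected = new AttributeSelectedClassifier();
        selected.setEvaluator(wrapper);
        selected.setSearch(new BestFirst());
        selected.setClassifier(new J48());

        // Cost matrix for an assumed two-class problem (normal vs. attack):
        // a missed attack is penalised more heavily than a false alarm.
        CostMatrix costs = new CostMatrix(2);
        costs.setCell(0, 1, 1.0);  // normal misclassified as attack
        costs.setCell(1, 0, 5.0);  // attack misclassified as normal (illustrative penalty)

        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(selected);
        csc.setCostMatrix(costs);
        csc.setMinimizeExpectedCost(true);

        // 10-fold cross-validation; the class-details output reports the
        // per-class TP rate and recall referred to in the abstract.
        Evaluation eval = new Evaluation(data, costs);
        eval.crossValidateModel(csc, data, 10, new java.util.Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
    }
}
```

Swapping WrapperSubsetEval for a filter evaluator such as CfsSubsetEval, or removing the CostSensitiveClassifier wrapper, gives the baseline configurations against which the abstract's comparisons are made.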