Separation of Benign and Malicious Network Events for Accurate Malware Family Classification
Cornell University, Ithaca, United States
Labeling malware samples with their appropriate malware family helps to understand and track malware evolution and to develop mitigation techniques. Current malware analysis techniques that use supervised machine learning rely on classification models trained on malware traffic generated in a sandbox environment. These models are then used to classify future, unseen observations. In practice, however, malware traffic comes mixed with legitimate background traffic from host machines, such as user browsing and software update traffic. Hence, the classifier's accuracy in predicting the correct malware label on unseen mixed traffic is low. We propose a novel classification system that uses an Independent Component Analysis (ICA) module, which applies distribution decomposition to separate the observed traffic into two components: malware traffic and background traffic. We also use a random forest classifier module to learn a classification model for every malware family, and then use it to predict malware family labels from the output of the ICA module. The system is thus capable of labeling malware traffic after removing background noise artifacts, which makes it more efficient and accurate than current classification methods. Our experiments on three malware family datasets show that the performance of our system improves significantly after the background traffic artifacts are removed.
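The two-stage idea described above (separate the mixed traffic, then classify the malware component) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the report: the synthetic beaconing signals, the two hypothetical family labels, the fixed mixing matrix, the spectral features, and the heuristic for choosing which separated component is the malware one. scikit-learn's `FastICA` stands in for the ICA module, and a single multi-class `RandomForestClassifier` stands in for the per-family models.

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.ensemble import RandomForestClassifier

FAMILIES = ["family_a", "family_b"]  # hypothetical labels, not from the report


def malware_signal(family, n, rng):
    """Toy 'sandbox' trace: square-wave beaconing with a family-specific period."""
    t = np.arange(n)
    period = 20 if family == "family_a" else 50  # assumed periods
    return np.sign(np.sin(2 * np.pi * t / period)) + 0.05 * rng.standard_normal(n)


def features(signal):
    """Normalized low-frequency magnitude spectrum.

    ICA recovers sources only up to sign and scale, so the features are
    chosen to be invariant to both.
    """
    spec = np.abs(np.fft.rfft(signal - signal.mean()))[:64]
    return spec / (spec.sum() + 1e-12)


def separate(observed, seed=0):
    """Split two observed mixtures into two estimated independent components."""
    ica = FastICA(n_components=2, random_state=seed, max_iter=1000)
    return ica.fit_transform(observed)  # shape: (n_samples, 2)


def pick_malware_features(components):
    """Assumed heuristic: the periodic (malware) component has the peakiest spectrum."""
    feats = [features(components[:, i]) for i in range(components.shape[1])]
    return max(feats, key=lambda f: f.max())


rng = np.random.default_rng(0)
n = 512

# Train on clean, sandbox-like traces only.
X_train, y_train = [], []
for family in FAMILIES:
    for _ in range(30):
        X_train.append(features(malware_signal(family, n, rng)))
        y_train.append(family)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(np.array(X_train), y_train)

# Build a mixed observation: malware plus heavy-tailed background noise,
# seen through two linearly mixed channels (assumed mixing matrix).
sources = np.column_stack([malware_signal("family_b", n, rng), rng.laplace(size=n)])
mixing = np.array([[1.0, 0.8], [0.6, 1.0]])
observed = sources @ mixing.T

# Separate first, then classify the component judged to be malware.
components = separate(observed)
predicted = clf.predict([pick_malware_features(components)])[0]
print("predicted family:", predicted)
```

The classifier never sees mixed traffic during training, mirroring the sandbox-trained setting in the abstract; separation happens only at prediction time. Real traffic would need a feature representation and a component-selection rule appropriate to the data, neither of which the abstract specifies.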