Accession Number:



Frequency-Based Feature Extraction for Malware Classification

Descriptive Note:

[Technical Report, Master's Thesis]

Corporate Author:


Personal Author(s):

Report Date:


Pagination or Media Count:



Traditional signature-based malware detection is effective, but it can identify only known malicious programs. This thesis uses machine-learning techniques to identify previously unknown malware in a set of Windows executable programs. We computed histograms of the 4-, 8-, and 16-bit sequence values contained in each program, then evaluated how well these histograms, used in part or in full, serve as feature vectors for machine-learning experiments. We also examined how skipping an offset at the beginning of each program affects classifier performance. We show that a machine-learning classifier can be learned from these features, with an F-measure in excess of 90 attained in one of our experiments. Using only part of the histogram as the feature vector did not significantly affect classifier performance up to a point, and neither did including an offset. Our results also suggest that features derived from histograms are better suited to tree-based algorithms than to Bayesian methods.
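
The feature extraction described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function name `bit_sequence_histogram` and its parameters `bits`, `offset`, and `keep` are hypothetical, and several details are assumptions that the abstract does not state: sequences are read as non-overlapping units, 16-bit values are big-endian, counts are normalized to relative frequencies, and a "partial" histogram means the first `keep` bins.

```python
from typing import List, Optional


def bit_sequence_histogram(
    data: bytes,
    bits: int = 8,
    offset: int = 0,
    keep: Optional[int] = None,
) -> List[float]:
    """Return a normalized histogram of n-bit values found in `data`.

    Assumptions (not taken from the thesis): values are non-overlapping,
    16-bit values are big-endian, and `keep` truncates to the first bins.
    """
    if bits not in (4, 8, 16):
        raise ValueError("bits must be 4, 8, or 16")

    # Skip the first `offset` bytes of the program.
    payload = data[offset:]

    if bits == 4:
        # Each byte yields a high nibble and a low nibble.
        values = [nib for b in payload for nib in (b >> 4, b & 0x0F)]
    elif bits == 8:
        values = list(payload)
    else:
        # Pair consecutive bytes; a trailing odd byte is ignored.
        usable = len(payload) - (len(payload) % 2)
        values = [(payload[i] << 8) | payload[i + 1] for i in range(0, usable, 2)]

    # One bin per possible value: 16, 256, or 65536 bins.
    counts = [0] * (1 << bits)
    for v in values:
        counts[v] += 1

    total = len(values)
    hist = [c / total for c in counts] if total else [0.0] * len(counts)

    # Optionally use only part of the histogram as the feature vector.
    return hist if keep is None else hist[:keep]
```

A call such as `bit_sequence_histogram(program_bytes, bits=8, offset=512, keep=128)` would yield a 128-element feature vector that could be passed to any classifier. The offset and bin count shown here are arbitrary illustration values, not settings reported in the thesis.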

Subject Categories:

  • Computer Systems

Distribution Statement:

[A, Approved For Public Release]