Machine Learning Based Malware Detection

Markel, Zane A.

Active / Technical Report | Accession Number: ADA619747

Abstract:

Current antivirus software is effective at detecting well-known threats but cannot keep up with the rate at which new malware is authored or with modern antivirus-avoidance techniques such as polymorphic code. Some studies have investigated augmenting current antivirus techniques with machine learning, which could potentially detect some previously unknown malware. However, previously proposed methods either do not detect malware with satisfactory performance or have only been tested on laboratory software databases whose results cannot reliably be projected onto realistic performance. This work explores several aspects of machine learning based malware detection. First, we propose an approach that learns primarily from program metadata, particularly header data in the 32-bit Windows Portable Executable (PE32) file format. We identify learning methods that learn effectively from this metadata, explore which metadata features can be trivially modified and are therefore not appropriate for malware detection, test the approach on approximately realistic datasets, and find that it performs favorably compared to learning from Windows API imports, another category of file characteristic that shows promise for machine learning based malware detection. Additionally, we find and explore the drastic performance drop that occurs when test datasets contain a realistically low proportion of malware rather than being split evenly between malware and benign software.
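
The performance drop described at the end of the abstract can be illustrated with base-rate arithmetic. The following is a minimal Python sketch, not taken from the report: the detection rate, false-alarm rate, and the 1% malware proportion are hypothetical values chosen only for illustration, and the helper name `precision` is an assumption.

```python
def precision(tpr: float, fpr: float, malware_fraction: float) -> float:
    """Fraction of files flagged as malware that really are malware.

    tpr: probability that a malicious file is flagged (true positive rate)
    fpr: probability that a benign file is flagged (false positive rate)
    malware_fraction: proportion of malware in the test dataset
    """
    true_pos = tpr * malware_fraction
    false_pos = fpr * (1.0 - malware_fraction)
    flagged = true_pos + false_pos
    # If nothing is flagged, precision is undefined; report 0.0 by convention.
    return true_pos / flagged if flagged > 0 else 0.0


# Hypothetical detector (illustrative numbers, not results from the report):
# it flags 95% of malware and wrongly flags 5% of benign files.
TPR, FPR = 0.95, 0.05

balanced = precision(TPR, FPR, 0.50)   # dataset split evenly
realistic = precision(TPR, FPR, 0.01)  # assumed 1% malware

print(f"precision at 50% malware: {balanced:.3f}")
print(f"precision at  1% malware: {realistic:.3f}")
```

Under these assumed rates, the same classifier goes from 95% of its alerts being correct on an evenly split dataset to roughly 16% when malware makes up 1% of the files, because false alarms on the much larger benign population outnumber the true detections.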

Author(s):

Markel, Zane A.

Author Organization(s):

NAVAL ACADEMY ANNAPOLIS MD

Descriptive Note:

Trident Scholar Project rept. no. 440

Supplementary Note:

The original document contains color images.

Pagination:

26

Security Markings

DOCUMENT & CONTEXTUAL SUMMARY

Distribution:

Approved For Public Release

Distribution Statement:

Approved For Public Release; Distribution Is Unlimited.

RECORD

Collection: TR

Identifying Numbers

Report Number(s):

USNA-TSPR-440

Monitor Series:

USNA

Subject Terms

Joint Capability Areas:

JCA_6_Net Centric; JCA_3.2_Engagement; JCA_3_Force Application; JCA_6.4.1_Secure Information Exchange; JCA_3.2.2_Non-Kinetic Means; JCA_6.4_Information Assurance; JCA_1_Force Support; JCA_1.2_Force Preparation; JCA_1.2.1_Training; JCA_5_Command and Control; JCA_1.2.7_Experimentation; JCA_4.6.1_General Engineering; JCA_4.6_Engineering; JCA_5.3_Planning; JCA_5.4_Decide; JCA_5.6.1_Assess Compliance with Guidance; JCA_5.6_Monitor; JCA_4_Logistics; JCA_1.2.5_Lessons Learned; JCA_5.5_Direct; JCA_5.5.3_Establish Metrics

Modernization Areas:

Cyber; Autonomy

Communities of Interest:

Cyber

Descriptor(s):

*COMPUTER NETWORK SECURITY, *INTRUSION DETECTION(COMPUTERS), *LEARNING MACHINES, ALGORITHMS, ANTIVIRUS SOFTWARE, BAYES THEOREM, COMMUNICATIONS TRAFFIC, COMPUTER VIRUSES, DATA BASES, ENTROPY, EXPERIMENTAL DATA, FEATURE EXTRACTION, GAUSSIAN NOISE, INFORMATION ASSURANCE, MATHEMATICAL LOGIC, METADATA, NETWORK ARCHITECTURE, OPERATING SYSTEMS(COMPUTERS), PERFORMANCE(ENGINEERING), PROBABILITY DISTRIBUTION FUNCTIONS, REGRESSION ANALYSIS, TARGET CLASSIFICATION, VULNERABILITY

Field(s)/Group(s):

Statistics and Probability, Computer Programming and Software, Computer Systems Management and Standards, Cybernetics

Keyword(s):

MALWARE, CLASS IMBALANCE, ENSEMBLE LEARNING, PORTABLE EXECUTABLE, WINDOWS API, N GRAM ANALYSIS, LOGISTIC REGRESSION

Report Date:

2015 May 18

Creation Date:

2015 Aug 24