Large Language Models for Software Reverse Engineering

Garcia, Miguel

Active / Technical Report | Accession Number: AD1225984

Abstract:

The importance of Software Reverse Engineering (SRE) has grown sharply over the past decade as software written by bad actors in the cyberspace domain has become more complex. One key element of SRE is code explanation. However, despite the wide range of tools available for SRE, the task remains time-intensive and complex. Software binaries are often stripped and obfuscated, which removes key information necessary for binary analysis. Additionally, these binaries target a wide variety of instruction set architectures, requiring reverse engineers to understand low-level assembly code for multiple architectures. Adding to the complexity, SRE is multidisciplinary and requires knowledge not only of low-level programming but also of networking, full-stack development, mathematics, and more. Because reverse engineering software requires such a specialized combination of skill sets, we propose using fine-tuned Large Language Models (LLMs) in conjunction with the software analysis package CFG2VEC to generate step-by-step explanations of stripped binary code.

Author(s):

Garcia, Miguel

Author Organization(s):

DAF-MIT AI Accelerator

Funding Organization(s):

Air Force Research Lab Edison Grant, Dayton, OH; DAF-MIT AI Accelerator, Cambridge, MA

Document Type:

Technical Report/Technical Paper

Publication Date:

2024 Mar 25

Pagination:

5

Security Markings

DOCUMENT & CONTEXTUAL SUMMARY

Distribution Code:

A - Approved For Public Release

Distribution Statement: Public Release.

Copyright: Not Copyrighted

RECORD

Collection: TRECMS

Subject Terms

Keyword(s):

SRE (Software Reverse Engineering)

Subject Categories:

Mathematical and Computer Sciences

Creation Date:

2024 Apr 19

Update Date:

2024 Apr 22