Large Language Models for Software Reverse Engineering
Abstract:
The importance of Software Reverse Engineering (SRE) has grown sharply over the past decade as software written by bad actors in the cyberspace domain has become more complex. One key element of SRE is code explanation. However, despite the wide range of tools available for SRE, the task remains time-intensive and complex. Software binaries are often stripped and obfuscated, removing key information necessary for binary analysis. Additionally, these binaries come in a wide variety of instruction set architectures, requiring reverse engineers to understand low-level assembly code for multiple architectures. Adding to the complexity, SRE is multidisciplinary and requires knowledge not only of low-level programming but also of networking, full-stack development, mathematics, and more. Because reverse engineering software demands such a specialized combination of skill sets, we propose the use of fine-tuned Large Language Models in conjunction with the software analysis package CFG2VEC to generate step-by-step explanations of stripped binary code.