Use of Ensemble Learning to Detect Buffer Overflow Exploitation

Bibliographic Details
Main Authors: Ayman Youssef, Mohamed Abdelrazek, Chandan Karmakar
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10131927/
Description
Summary: Software exploitation detection remains an unresolved problem. Software exploits that target known and unknown vulnerabilities are constantly used in attacks. Signature-based detection techniques are limited to known exploits and are susceptible to circumvention. Current research on the use of Machine Learning (ML) for software exploitation detection is limited in both quantity and range of use cases. Existing research lacks the use of public datasets, discussions of feature importance, and elaboration of the parameters that affect data preparation and, subsequently, model performance. This paper presents ML models based on different ensemble algorithms to detect software exploitation using runtime traces. We focus on buffer overflow vulnerabilities in user-space applications within Windows Operating Systems (OS), given the prevalence of both this type of vulnerability and the OS. We utilized a publicly available raw dataset of 11 Windows applications under exploitation. Multiple distinct models (based on Random Forest and XGBoost) were created and tested. Testing was performed several times using various aggregation parameters and different testing applications. Our results demonstrate that we can achieve up to 100% recall with a 0% false positive rate. We report on the different parameters that must be addressed to curate runtime traces and demonstrate their impact on the performance of the ML models. We demonstrate that properly training models on a subset of exploitation techniques enables them to detect techniques never seen before, such as return-oriented programming. Finally, we conclude with a discussion of the features that had the highest impact on each of the models, along with the key takeaways.
ISSN:2169-3536
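
The summary mentions two ideas that a short example may make more concrete: aggregating runtime traces with an aggregation parameter, and reporting recall together with the false positive rate. The Python sketch below illustrates both under stated assumptions. It is not the authors' pipeline: the fixed-size window scheme, the function names `aggregate_windows` and `recall_and_fpr`, and the placeholder event names are all hypothetical, and the record does not say how the dataset's traces are actually structured.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def aggregate_windows(events: Sequence[str], window_size: int) -> List[Counter]:
    """Split a trace into consecutive fixed-size windows and count event types.

    Assumption: a trace is an ordered sequence of event-type names, and
    window_size stands in for an "aggregation parameter".
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [
        Counter(events[i:i + window_size])
        for i in range(0, len(events), window_size)
    ]


def recall_and_fpr(y_true: Iterable[int], y_pred: Iterable[int]) -> Tuple[float, float]:
    """Return (recall, false positive rate) for binary labels.

    Label 1 means "exploitation", label 0 means "benign".
    recall = TP / (TP + FN), false positive rate = FP / (FP + TN).
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, fpr


if __name__ == "__main__":
    # Placeholder event names, not taken from the dataset.
    trace = ["evt_a", "evt_b", "evt_a", "evt_c", "evt_b"]
    for window in aggregate_windows(trace, window_size=2):
        print(dict(window))

    # Hypothetical labels and predictions for four aggregated windows.
    print(recall_and_fpr([1, 1, 0, 0], [1, 0, 1, 0]))
```

Changing `window_size` changes how many feature rows a trace yields and what each row summarizes, which is one simple way an aggregation parameter could influence downstream model performance.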