Automatic model learning and its applications in malware detection

Bibliographic Details
Main Author: Hao, Xiao
Other Authors: Sun Jun
Format: Thesis
Language: English
Published: 2017
Subjects:
Online Access: http://hdl.handle.net/10356/72506
Description
Summary: A behavior model of a program captures the correct ways of invoking its Application Programming Interfaces (APIs). For instance, one way for a Java programmer to read a text file is to open the file, read its contents, and close it afterwards. Automatic learning of behavior models of programs can benefit many applications: software verification, by generating models of the target program for verification; software testing, by generating models of standard libraries for test generation; security analysis, by generating attack models of malicious software (malware) for malware detection; and software maintenance, by generating models of legacy programs for code comprehension. The first part of this thesis is dedicated to automatic and efficient techniques for learning accurate behavior models of program libraries. In our first work, we developed a fully automatic approach to learning more expressive behavior models more efficiently. The learned model captures the behaviors of a single class in an object-oriented program. Existing approaches for learning such models are often inefficient because they rely on model checking or symbolic execution. Our approach instead uses testing and active learning to learn behavior models, and a machine learning technique to synthesize the Boolean conditions for invoking an API. To address the low coverage of testing in the first approach without compromising too much on efficiency, the second approach uses symbolic execution to verify and refine the behavior models that are actively learned through testing. Another contribution of the second approach is that the learned models are not only precise and accurate but also let users specify the appropriate level of abstraction.
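The file-reading example at the start of the summary can be pictured as a small finite automaton over API calls. The following is only an illustrative sketch, not a model taken from the thesis: the state names, the call names `open`, `read`, and `close`, and the choice to allow any number of reads are all assumptions made for the example.

```python
# Hypothetical behavior model for a file-reading API (assumed for illustration).
# States and call names are made up; the thesis learns such models automatically.
TRANSITIONS = {
    ("closed", "open"): "opened",   # a file must be opened first
    ("opened", "read"): "opened",   # reading keeps the file open (assumed: any number of reads)
    ("opened", "close"): "closed",  # closing returns to the initial state
}


def accepts(calls, start="closed", accepting=("closed",)):
    """Return True if the call sequence is a correct usage under the model."""
    state = start
    for call in calls:
        key = (state, call)
        if key not in TRANSITIONS:
            return False  # the call is not allowed in the current state
        state = TRANSITIONS[key]
    return state in accepting  # every opened file must end up closed


print(accepts(["open", "read", "close"]))
print(accepts(["read", "close"]))
```

Under these assumptions, a sequence that reads before opening, or that never closes the file, is rejected as incorrect usage.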
The second approach inherits the limitations of symbolic execution, which handles commonly used program features such as heap data structures and program loops poorly. To circumvent the problems of the first and second approaches, the third approach draws on existing example programs that use the program library. We adopt a statistical machine learning technique to learn human-interpretable behavior models from the API usages in these example programs. The second part of the thesis applies the proposed approaches to malware detection in order to demonstrate the usefulness of automatic model learning. The attack behavior of malware often consists of several sub-tasks performed in sequence to achieve a malicious intent, so it can be modelled with deterministic finite automata. We therefore apply the proposed techniques to automatically learn the attack behavior models of malware. The first application is the detection of malicious JavaScript programs embedded in Web pages. We extend the active learning algorithm with dynamic analyses to learn behavior models that capture the attack pattern of the malware, and then use the learned models to detect malware variants. The detection part of the first application is implemented in software, which malicious JavaScript programs can easily bypass or disable. The second application is the hardware-assisted detection of malicious desktop applications. Here we use static analyses with active learning to learn attack behavior models of malware, and then encode the learned models of known malicious programs in hardware to detect malicious applications. Compared with the first application, the hardware-assisted detection in the second application is difficult, if not impossible, to bypass.
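The idea of an attack as a sequence of sub-tasks recognised by a deterministic finite automaton can be sketched in a few lines. This is a hand-written simplification under stated assumptions, not the learning or detection method of the thesis: the stage names are hypothetical, and the automaton simply ignores events that are not the next expected stage.

```python
def make_attack_matcher(stages):
    """Build a matcher for a linear attack model.

    State i means "the first i stages have been observed". Events other
    than the next expected stage leave the state unchanged, so the
    automaton stays deterministic (assumed simplification).
    """
    def matches(trace):
        state = 0
        for event in trace:
            if state < len(stages) and event == stages[state]:
                state += 1  # advance to the next sub-task of the attack
        return state == len(stages)  # accepted only if every stage occurred in order
    return matches


# Hypothetical stage names, chosen only for illustration.
is_attack = make_attack_matcher(["decode", "drop", "execute"])

print(is_attack(["decode", "log", "drop", "sleep", "execute"]))
print(is_attack(["execute", "drop", "decode"]))
```

Because unrelated events are skipped, a variant that interleaves extra activity between the same sub-tasks is still flagged, while a trace that performs the stages out of order is not.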
The central idea behind these two applications is that the attack behavior of malicious software can be captured as a behavior model, which can be learned automatically and then used to detect other malicious software. In a nutshell, we developed three automatic approaches for efficient learning of behavior models of program libraries and demonstrated their strengths by applying them to learn attack behavior models of malware for the detection of malware variants.