Assessing malware detection using hardware performance counters
Gupta, Anmol Brijesh
Despite the use of modern anti-virus (AV) software, malware remains a prevalent threat to today's computing systems. AV software cannot keep pace with the growing volume of evasive malware, which calls for more robust detection techniques. Among the many proposed methods, researchers have suggested microarchitecture-based mechanisms for detecting malicious software on a system. For example, Intel embeds a shadow stack in its modern architectures that preserves the integrity of function calls and their returns by tracking each function's return address. A malicious program that exploits an application to overwrite return addresses can be stopped by the shadow stack. Researchers have also proposed the use of Hardware Performance Counters (HPCs). HPCs are counters embedded in modern computing architectures that count occurrences of hardware events, such as cache hits, clock cycles, and integer instructions. Malware detectors that leverage HPCs build a profile of an application by reading the counter values periodically. Researchers then apply supervised machine learning (ML) classification techniques to separate malicious profiles from benign ones. Note, however, that HPCs count microarchitectural events during program execution, whereas whether a program is malicious or benign is a property of its high-level behavior. Since HPCs do not observe the high-level behavior of an application, we hypothesize that the counters may fail to capture the difference in behavioral semantics between malicious and benign software. To investigate whether HPCs capture the behavioral semantics of a program, we recreate the experimental setup of previously proposed systems. To this end, we use HPCs to profile applications such as MS-Office and Chrome as benign applications and known malware binaries as malicious applications.
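As an illustration of periodic profiling, the following minimal Python sketch builds a profile from per-interval counter deltas. It is not the setup used in this work: the `read_counters` callable, the event names, and the fake reader are hypothetical stand-ins for whatever platform interface actually exposes the counters.

```python
import time
from typing import Callable, Dict, List

# Hypothetical interface: a callable returning cumulative counts per event.
CounterReader = Callable[[], Dict[str, int]]


def build_profile(read_counters: CounterReader,
                  num_samples: int,
                  interval_s: float = 0.0) -> List[Dict[str, int]]:
    """Sample cumulative counters periodically and record per-interval deltas."""
    profile: List[Dict[str, int]] = []
    previous = read_counters()
    for _ in range(num_samples):
        time.sleep(interval_s)
        current = read_counters()
        # Each profile row holds how much every event grew during this interval.
        profile.append({event: current[event] - previous[event]
                        for event in current})
        previous = current
    return profile


def make_fake_reader() -> CounterReader:
    """Stand-in for real hardware: counters grow by a fixed amount per read."""
    state = {"cycles": 0, "cache_hits": 0}

    def read() -> Dict[str, int]:
        state["cycles"] += 1000
        state["cache_hits"] += 40
        return dict(state)

    return read


profile = build_profile(make_fake_reader(), num_samples=3)
```

Each row of `profile` is one feature vector. In a detector of the kind described above, such rows would be labeled benign or malicious and passed to a classifier.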
Many standard ML classifiers work best on approximately normally distributed data, in which the variance is independent of the mean of the data points. To bring the profiles closer to a normal distribution and to avoid over-fitting the machine learning models, we apply a power transform to the application profiles. Moreover, HPCs can monitor a broad range of hardware-based events. We use Principal Component Analysis (PCA) to select the performance events that capture the most variation across all profiled applications in the fewest features. Finally, we train twelve supervised machine learning classifiers, including Support Vector Machines (SVMs) and Multilayer Perceptrons (MLPs), on the application profiles. We model each classifier as a binary classifier whose two classes are 'Benignware' and 'Malware.' Our results show that for the 'Malware' class, the average recall and F2-score across the twelve classifiers are 0.22 and 0.70, respectively. The low recall shows that the ML classifiers label most malware as benignware. Even though we take a statistical approach to feature selection, the classifiers cannot distinguish malware from benignware based on the hardware-based events monitored by the HPCs. The inability of HPC profiles to capture the behavioral characteristics of an application forces us to question the use of HPCs as malware detectors.
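For reference, recall and the F-beta score for the 'Malware' class can be computed from confusion-matrix counts as in the short Python sketch below. The counts are invented purely for illustration and are not results from this work. The function names are likewise illustrative.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of actual malware samples that the classifier flags as malware."""
    return tp / (tp + fn)


def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Hypothetical counts, for illustration only (not data from this work):
# 100 malware samples, of which 30 are detected and 70 are missed,
# plus 10 benign samples wrongly flagged as malware.
tp, fp, fn = 30, 10, 70
malware_recall = recall(tp, fn)
malware_f2 = f_beta(tp, fp, fn, beta=2.0)
```

A low recall means a high count of false negatives, that is, malware samples classified as benignware.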