Malicious webpages are a prevalent and severe threat in the Internet security landscape. This fact has motivated numerous static and dynamic techniques for their accurate and efficient detection. Building on this existing literature, this work introduces ADAM, a system that applies machine learning to network metadata derived from the sandboxed execution of webpage content. Machine-trained models are not novel in this problem space. Instead, the greatest contribution of this work is the set of dynamic network artifacts collected during rendering, together with their subsequent feature representations.
Two primary motivations led us to explore this line of research. First, iDetermine, VeriSign’s status quo system for detecting malicious webpages, is computationally expensive. While that system is the basis for our ground truth and network metadata herein, it also performs a great deal of other analysis to arrive at accurate labels (e.g., packet inspection, system calls). We envision that our efforts could integrate well as a tiered classifier that enables greater scalability with minimal performance impact. Second, the existing literature on webpage classification has reported promising accuracy. Because these approaches rely primarily on static features, we hypothesized that metadata from network dynamics might assist in the task.
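To make the tiered-classifier idea concrete, the following is a minimal sketch under stated assumptions, not a description of how ADAM or iDetermine is implemented. The class name `TieredClassifier`, the confidence thresholds, the `redirects` feature, and both toy scoring functions are hypothetical and chosen only for illustration: a cheap model handles samples it is confident about, and only uncertain samples are escalated to the expensive analysis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Features = Dict[str, float]


@dataclass
class TieredClassifier:
    """Hypothetical two-tier pipeline: cheap model first, expensive analysis on demand."""
    cheap_score: Callable[[Features], float]     # estimated P(malicious) in [0, 1]
    expensive_label: Callable[[Features], bool]  # True if malicious; assumed costly
    low: float = 0.05    # at or below this score, accept "benign" from the cheap tier
    high: float = 0.95   # at or above this score, accept "malicious" from the cheap tier

    def classify(self, features: Features) -> Tuple[bool, str]:
        """Return (is_malicious, tier_that_decided)."""
        p = self.cheap_score(features)
        if p <= self.low:
            return False, "cheap"
        if p >= self.high:
            return True, "cheap"
        # The cheap tier is uncertain: escalate to the expensive analysis.
        return self.expensive_label(features), "expensive"


# Toy stand-ins for the two tiers (assumptions for illustration only).
def toy_cheap_score(f: Features) -> float:
    return min(1.0, f.get("redirects", 0.0) / 10.0)


def toy_expensive_label(f: Features) -> bool:
    return f.get("redirects", 0.0) >= 5


clf = TieredClassifier(toy_cheap_score, toy_expensive_label)
for sample in ({"redirects": 0.0}, {"redirects": 5.0}, {"redirects": 10.0}):
    print(sample, clf.classify(sample))
```

In such a design, the thresholds trade scalability against reliance on the cheap tier: widening the gap between `low` and `high` sends more samples to the expensive analysis, while narrowing it reduces cost at the risk of more cheap-tier errors.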
This exploration is not without challenges. First, webpages face a host of vulnerabilities: exploit kits, defacement, malicious redirections, code injections, and server-side backdoors, all with different signatures. This malice may not even be the fault of the webpage owner (e.g., advertisement networks). Moreover, the distribution of behavior is highly imbalanced, with our dataset having 40x more benign objects than malicious ones. Despite these challenges, our approach currently achieves 96% accuracy overall, with injection attacks and server-side backdoors identified as areas for performance improvement and future attention.
In the following sections, we present the system description followed by an outline of our approach. Then, we provide our preliminary results and discuss future research directions.