"An Analysis of C/C++ Datasets for Machine Learning-Assisted Software Vulnerability Detection" by Daniel Grahn

Selected Works of Junjie Zhang

An Analysis of C/C++ Datasets for Machine Learning-Assisted Software Vulnerability Detection

Proceedings of the Conference on Applied Machine Learning for Information Security, 2021

Daniel Grahn, Wright State University - Main Campus
Junjie Zhang, Wright State University - Main Campus

Document Type

Article

Publication Date

1-1-2021

Disciplines

Computer Sciences and Engineering

Abstract

As machine learning-assisted vulnerability detection research matures, it is critical to understand the datasets being used by existing papers. In this paper, we explore seven C/C++ datasets and evaluate their suitability for machine learning-assisted vulnerability detection. We also present a new dataset, named Wild C, containing over 10.3 million individual open-source C/C++ files – a sufficiently large sample to be reasonably considered representative of typical C/C++ code. To facilitate comparison, we tokenize all of the datasets and perform the analysis at the token level. We make three primary contributions. First, while all the datasets differ from our Wild C dataset, some do so to a greater degree than others. The differences include divergence in file lengths and token usage frequency. Additionally, none of the datasets contains the entirety of the C/C++ vocabulary. The missing tokens account for up to 11% of all token usage. Second, we find that all the datasets contain duplication, with some containing a significant amount. In the Juliet dataset, we describe augmentations of test cases that make the dataset susceptible to data leakage. This augmentation occurs with such frequency that a random 80/20 split has roughly 58% overlap of the test data with the training data. Finally, we collect and process a large dataset of C/C++ code, named Wild C. This dataset is designed to serve as a representative sample of all C/C++ code and is the basis for our analyses.
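
The overlap figure quoted for a random 80/20 split can be made concrete with a small sketch. The code below is not the authors' procedure. It is a minimal illustration that assumes duplication means an exact match of token sequences, and it uses a deliberately naive regex tokenizer as a stand-in for a real C/C++ lexer. The names `tokenize` and `split_overlap` are hypothetical and are introduced only for this example.

```python
import random
import re


def tokenize(source):
    """Naive stand-in tokenizer (an assumption, not a real C/C++ lexer).

    Splits source text into identifiers/keywords, integer runs, and
    single non-whitespace characters, and returns a hashable tuple.
    """
    return tuple(re.findall(r"[A-Za-z_]\w*|\d+|\S", source))


def split_overlap(samples, test_fraction=0.2, seed=0):
    """Fraction of test samples that also occur verbatim in the training split.

    `samples` is a list of hashable items, for example token tuples.
    The split is a seeded random shuffle followed by a cut, so the
    result is reproducible for a given seed.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    if not test:
        return 0.0

    train_set = set(train)  # exact-match lookup of training samples
    leaked = sum(1 for sample in test if sample in train_set)
    return leaked / len(test)


if __name__ == "__main__":
    # Ten distinct functions: no test sample can appear in training.
    unique = [tokenize(f"int f{i}(void) {{ return {i}; }}") for i in range(10)]
    print(split_overlap(unique))

    # Ten copies of one file: every test sample appears in training.
    copies = [tokenize("int main(void) { return 0; }")] * 10
    print(split_overlap(copies))
```

Exact matching is the strictest possible definition of overlap. Augmented test cases that differ in only a few tokens would not be counted by this sketch, so a near-duplicate measure would be needed to capture them.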

Comments

This work is licensed under a Creative Commons Attribution 4.0 International License.

Citation Information

Daniel Grahn and Junjie Zhang. "An Analysis of C/C++ Datasets for Machine Learning-Assisted Software Vulnerability Detection." Proceedings of the Conference on Applied Machine Learning for Information Security (2021)
Available at: http://works.bepress.com/junjie_zhang/37/