In this paper, we surveyed and reproduced 9 state-of-the-art (SOTA) deep learning models on 2 widely used vulnerability detection datasets: Devign and MSR.
- We investigated 6 research questions in three areas, namely model capabilities, training data, and model interpretation. We experimentally demonstrated the variability between different runs of a model and the low agreement among different models’ outputs.
- We compared models trained on specific types of vulnerabilities with a model trained on all vulnerability types at once.
- We explored which types of programs deep learning models find “hard” to handle.
- We investigated how training data size and training data composition relate to model performance.
- Finally, we studied model interpretations and analyzed the important features that the models used to make predictions.
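
To make the notions of run-to-run variability and cross-model agreement concrete, the following is a minimal sketch of how such agreement could be measured over binary predictions. The function names, the toy prediction lists, and the two agreement definitions (all predictors agree, and pairwise match rate) are illustrative assumptions, not necessarily the metrics used in our experiments.

```python
from itertools import combinations


def _check_lengths(predictions):
    # All prediction lists must cover the same, non-empty set of examples.
    n = len(predictions[0])
    if n == 0 or any(len(p) != n for p in predictions):
        raise ValueError("prediction lists must be non-empty and equally long")
    return n


def all_agree_rate(predictions):
    """Fraction of examples on which every predictor outputs the same label."""
    n = _check_lengths(predictions)
    agreed = sum(1 for labels in zip(*predictions) if len(set(labels)) == 1)
    return agreed / n


def pairwise_agreement(predictions):
    """Match rate for each pair of predictors, keyed by their indices."""
    n = _check_lengths(predictions)
    return {
        (i, j): sum(a == b for a, b in zip(predictions[i], predictions[j])) / n
        for i, j in combinations(range(len(predictions)), 2)
    }


# Hypothetical binary predictions (1 = vulnerable, 0 = not vulnerable) from
# three runs of one model, or from three different models, on five examples.
runs = [
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 1],
]

print(all_agree_rate(runs))
print(pairwise_agreement(runs))
```

The same functions apply whether the lists come from different random seeds of one model (variability) or from different models evaluated on the same test set (agreement).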