To Err is Machine: Vulnerability Detection Challenges LLM Reasoning

Published in arXiv, 2024

Recommended citation: Benjamin Steenhoek, Md Mahbubur Rahman, Monoshi Kumar Roy, Mirza Sanjida Alam, Hengbo Tong, Swarna Das, Earl T. Barr, and Wei Le. 2024. To Err is Machine: Vulnerability Detection Challenges LLM Reasoning. arXiv preprint arXiv:2403.17218. https://arxiv.org/pdf/2403.17218

In this paper, we present vulnerability detection as a challenging code reasoning task. Large Language Models (LLMs) have shown promising results in natural-language and math reasoning, yet in our evaluation, state-of-the-art (SOTA) models achieved only 54.5% Balanced Accuracy at vulnerability detection, barely above the 50% chance level for this binary task, even when pre-trained on large amounts of source code.
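
For context, Balanced Accuracy averages the per-class recalls, so 50% is what random guessing (or always predicting one class) achieves on a binary task. This is the standard definition of the metric, not a formula taken from the paper:

```latex
\text{Balanced Accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)
```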

  • We analyzed LLM responses and found that the models struggle to reason about the code semantics relevant to identifying vulnerabilities.
  • We explored prominent models and training settings (better prompts, larger models, more pre-training data, and fine-tuning) to understand their effects on vulnerability detection performance, but none led to significant improvements; a sketch of the basic prompt-and-score setup follows this list.
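
As a minimal sketch of how such an evaluation can be wired up (not the paper's actual harness), the snippet below asks an LLM a binary vulnerable/not-vulnerable question about each function and scores the answers with Balanced Accuracy. The prompt wording and the `query_model` helper are illustrative assumptions; only `balanced_accuracy_score` is a real scikit-learn API.

```python
from sklearn.metrics import balanced_accuracy_score

# Illustrative prompt template; the paper's actual prompts differ.
PROMPT = (
    "You are a security expert. Does the following C function contain "
    "a security vulnerability? Answer with exactly YES or NO.\n\n{code}"
)

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; plug in a real client here."""
    raise NotImplementedError

def predict(code: str) -> int:
    """Map the model's free-text answer onto a binary label (1 = vulnerable)."""
    answer = query_model(PROMPT.format(code=code)).strip().upper()
    return 1 if answer.startswith("YES") else 0

def evaluate(functions: list[str], labels: list[int]) -> float:
    """Balanced Accuracy averages recall over both classes,
    so 0.5 is the chance level on this binary task."""
    preds = [predict(code) for code in functions]
    return balanced_accuracy_score(labels, preds)
```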

Our results suggest that new models, new training methods, or more execution-specific pre-training data may be needed to conquer vulnerability detection. We speculate that auto-regressive pre-training on source code may not effectively extract code semantics, especially under current pre-training mixtures, in which execution data is scarce.