CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning

Published in ICLR, 2026

Recommended citation: Monoshi Kumar Roy, Simin Chen, Benjamin Steenhoek, Jinjun Peng, Gail Kaiser, Baishakhi Ray, and Wei Le. 2026. CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning. In The Fourteenth International Conference on Learning Representations (ICLR 2026), April 24–28, 2026, Singapore. https://benjijang.com/files/2026-04-23-codesense.pdf

Understanding and reasoning about code semantics is essential for enhancing code LLMs’ abilities to solve real-world software engineering (SE) tasks. Most existing benchmarks rely on synthetic datasets or focus on coarse-grained reasoning tasks, limiting their effectiveness in evaluating LLMs in practical SE contexts.

We propose CodeSense, the first benchmark that makes available a spectrum of fine-grained code reasoning tasks from real-world Python, C, and Java software projects.
We executed tests from real-world repositories, collected execution traces, and constructed a ground truth dataset for fine-grained semantic reasoning tasks.
Our comprehensive evaluation of state-of-the-art LLMs shows a clear performance gap for fine-grained reasoning tasks.
Prompting techniques such as chain-of-thought and in-context learning helped, but the lack of code semantics in LLMs fundamentally limits models’ reasoning capabilities.
We also produced an execution tracing framework and toolset for easy collection of ground truth for fine-grained SE reasoning tasks, offering a strong basis for future benchmark construction and model post-training.
Our code and data are available at codesense-bench.github.io.
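
As a rough illustration of the idea of running code, collecting an execution trace, and turning it into ground truth for fine-grained questions (for example, "what is the value of this variable at this line?"), the sketch below uses Python's standard `sys.settrace` hook. It is not the CodeSense tracing framework: the helper `trace_locals` and the toy function `sum_of_squares` are hypothetical names introduced only for this example.

```python
import sys


def trace_locals(func, *args, **kwargs):
    """Run func and record its local variables each time a line is about to execute.

    Hypothetical helper for illustration only; not part of CodeSense.
    """
    records = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only trace the frame of the target function, not its callees.
        if frame.f_code is not code:
            return None
        if event == "line":
            # Store the line offset relative to the `def` line, plus a snapshot of locals.
            offset = frame.f_lineno - code.co_firstlineno
            records.append((offset, dict(frame.f_locals)))
        elif event == "return":
            records.append(("return", arg))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        # Restore whatever tracer (if any) was active before.
        sys.settrace(previous)
    return result, records


def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


result, records = trace_locals(sum_of_squares, 3)
for entry in records:
    print(entry)
```

Each recorded entry pairs a program point with the observed state, so a question about a variable's value at a given line can be answered directly from the trace rather than by manual labeling.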

Also available on arXiv.
