Yi Li

Associate Professor

College of Computing and Data Science (CCDS)
Nanyang Technological University (NTU)

Address: Block N4-02b-63
50 Nanyang Avenue, Singapore 639798
Phone: +65 6790 4287

Advancing Binary Code Similarity Detection via Context-Content Fusion and LLM Verification

Chaopeng Dong, Jingdong Guo, Shouguo Yang, Yi Li, Dongliang Fang, Yang Xiao, Yongle Chen, and Limin Sun

In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2025

Abstract: Binary Code Similarity Detection (BCSD), essential for binary-code-related tasks like vulnerability detection, has attracted increasing attention in recent years. However, existing methods frequently fall short of achieving both high precision and recall at scale, and their results often lack interpretability due to the neglect of function context and reliance on purely similarity-driven outputs. Our key insights are twofold: 1) Binary functions are not self-contained; they depend on other code and data beyond their content to fulfill their functionalities. 2) Large language models (LLMs) excel not only at analyzing code but also at generating reasonable explanations. Motivated by these insights, we propose a general BCSD framework, Co^2FuLL. We first systematically select stable and representative code and data features, along with their corresponding dependencies on the functions, to construct the function context. Then, by fusing function context with content similarities computed by an existing BCSD approach, we substantially narrow down the search space. Ultimately, we employ LLMs with a carefully designed prompt to verify the remaining candidates and produce clear, human-readable explanations. We conduct comprehensive experiments on a large function pool under varying compilation settings and after binary stripping. The results show that Co^2FuLL based on HermesSim and DeepSeek-V3 achieves 80.5% precision and 94.4% recall, improving the baseline HermesSim by 142.5% and 42.2%, respectively, providing an accurate and interpretable solution for BCSD.

Cite:

@inproceedings{Dong2025ABC,
  author = {Dong, Chaopeng and Guo, Jingdong and Yang, Shouguo and Li, Yi and Fang, Dongliang and Xiao, Yang and Chen, Yongle and Sun, Limin},
  booktitle = {Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE)},
  month = nov,
  title = {Advancing Binary Code Similarity Detection via Context-Content Fusion and {LLM} Verification},
  year = {2025}
}