Function Similarity Analysis in Stripped Binaries Using Transformer-Based Embeddings

© 2025 by IJACT

Volume 3 Issue 4

Year of Publication : 2025

Author : Purv Rakeshkumar Chauhan

DOI : 10.56472/25838628/IJACT-V3I4P103

Citation :

Purv Rakeshkumar Chauhan, 2025. "Function Similarity Analysis in Stripped Binaries Using Transformer-Based Embeddings", ESP International Journal of Advancements in Computational Technology (ESP-IJACT), Volume 3, Issue 4: 12-21.

Abstract :

Function similarity analysis in stripped binaries, where symbol tables and debug metadata are absent, is a key challenge in binary code analysis, with applications in vulnerability analysis, malware variant detection, and cross-version patching. Transformer-based embeddings have brought promising advances: they allow models to generalize across compiler and architecture variations and to capture long-range instruction context. This review critically examines the development of transformer-centric architectures and training methods for binary similarity, compares reported experimental results, highlights architectural and evaluation gaps, and outlines promising future directions. Important open challenges include scalability to large function corpora, cross-architecture generalization, benchmark standardization, and robustness to compiler transformations and obfuscation. Future studies should investigate provable invariance, open large-scale benchmarks, hybrid structural modeling, and cost-effective pretraining.

Keywords :

Binary Code Similarity, Transformer Embeddings, Stripped Binaries, Cross-Architecture Matching, Representation Learning, Benchmark Standardization.