Software Vulnerability Detection with Machine Learning
Software vulnerabilities are a critical security challenge in modern systems, and malicious actors exploit them to pose persistent and evolving threats. In 2023, zero-day vulnerabilities accounted for a significant portion of successful attacks, as reported by ENISA [1]. Furthermore, even patched vulnerabilities continue to be exploited successfully up to two years after their disclosure, according to CISA [2]. These exploits lead to severe consequences, such as data breaches, credential theft, and the compromise of government and strategic systems [1].
Traditional vulnerability detection approaches, while valuable, face limitations in scalability and detection accuracy for modern software systems. In recent years, advances in machine learning methods for software engineering [3] have provided new perspectives on addressing this challenge. Machine learning has demonstrated potential both as a standalone detection method and as an enhancement to traditional approaches [4].
As a standalone approach, machine learning methods leverage deep learning architectures to identify patterns within code representations [5], enabling the detection of diverse vulnerabilities [6]. When combined with static and dynamic analysis, machine learning can enhance the scalability and precision of the analysis [4].
While machine learning-aided vulnerability detection shows promise, several challenges remain unresolved. Machine learning depends on high-quality data, which is scarce in security-critical sectors and can be difficult to align with production environments [4]. Another barrier is the lack of interpretability in many machine learning models. The opacity of these systems often leads to adoption resistance from stakeholders [7], particularly in environments requiring regulatory compliance [3].
In this research, we aim to advance the state of the art in vulnerability detection, with a particular focus on the interplay between traditional program analysis and machine learning approaches. We investigate how program analysis can inform and guide machine learning techniques, how machine learning can enhance traditional analysis processes, and how validation and interpretation can improve the trustworthiness of machine learning-based detection. This research aims to contribute to both theoretical understanding and practical applications in security-critical environments, working toward more effective and reliable vulnerability detection methods.
References
[1] European Union Agency for Cybersecurity (ENISA), “ENISA Threat Landscape 2024,” ENISA, 2024.
[2] Cybersecurity and Infrastructure Security Agency (CISA) et al., “2023 Top Routinely Exploited Vulnerabilities,” Cybersecurity Advisory AA24-317A, 2024.
[3] R. Bommasani et al., “On the Opportunities and Risks of Foundation Models,” Jul. 12, 2022, arXiv:2108.07258. doi: 10.48550/arXiv.2108.07258.
[4] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton, “A Survey of Machine Learning for Big Code and Naturalness,” ACM Comput. Surv., vol. 51, 2018.
[5] B. Casey, J. C. S. Santos, and G. Perry, “A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks,” arXiv.org. [Online]. Available: https://arxiv.org/abs/2403.10646v1
[6] P. Chakraborty, K. K. Arumugam, M. Alfadel, M. Nagappan, and S. McIntosh, “Revisiting the Performance of Deep Learning-Based Vulnerability Detection on Realistic Datasets,” IEEE Trans. Software Eng., vol. 50, no. 8, pp. 2163–2177, Aug. 2024, doi: 10.1109/TSE.2024.3423712.
[7] D. Bhusal et al., “SoK: Modeling Explainability in Security Analytics for Interpretability, Trustworthiness, and Usability,” in Proceedings of the 18th International Conference on Availability, Reliability and Security, Benevento, Italy: ACM, Aug. 2023, pp. 1–12. doi: 10.1145/3600160.3600193.