EVALUATING THE IMPACT OF AI-GENERATED CODE ON SOFTWARE QUALITY: AN EMPIRICAL STUDY OF MAINTAINABILITY, SECURITY, AND PERFORMANCE

Asfandyar; Muhammad Waseem Akhtar; Hafiz Shoaib Khalil

Keywords:

Artificial general intelligence, software quality, empirical study, code maintainability, software security, code performance, large language models (LLMs).

Abstract

The rapid adoption of AI-driven code generation tools in the software industry promises substantial productivity gains, but the technology's effect on fundamental non-functional software quality attributes has not yet been adequately measured. Whereas existing research mainly analyzes functional correctness, this study presents the first systematic empirical evaluation of the impact of AI-generated code on three pillars of software quality: maintainability, security, and performance. We performed a controlled comparative analysis, generating 642 code solutions with state-of-the-art models (GPT-4, Claude 3) and collecting 107 solutions written by human programmers with at least 10 years of experience, all for a curated set of 107 programming tasks of varying complexity. Both sets of solutions were assessed with industry-standard tooling: maintainability with SonarQube and Code Climate, security with vulnerability scanners (Bandit, Semgrep) and a manual audit, and performance with profiling (profile, memory-profiler). We find systematic quality deficiencies in AI-generated code. It exhibits 34% higher cyclomatic complexity and 2.1 times more code duplication, which points to poorer maintainability. Security vulnerabilities in OWASP Top 10 categories appear in 22.1% of AI-generated samples, versus 8.4% of human-written samples. Performance benchmarks indicate that AI-generated code is 15-40% slower and consumes at least 25% more memory. Most importantly, quality degradation is strongly associated with task complexity (Spearman correlation of up to 0.78). We conclude that current AI models produce functionally sound code but incur a quantifiable quality debt by duplicating patterns without understanding architecture. This calls for a paradigm shift: AI output should be treated as an untrusted draft, and stronger quality gates, specialized tooling, and sustained human oversight are needed to keep software healthy in the long term.