Our paper compares the correctness, efficiency, and maintainability of human-generated and AI-generated program code. To this end, we analyzed the computational resources of AI- and human-generated program code using metrics such as time and space complexity as well as runtime and memory usage.
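To illustrate the latter two metrics, the following is a minimal Python sketch of one way to measure the runtime and peak memory usage of a candidate solution. It is not the measurement harness used in the paper; the helper measure() and the two_sum example are purely illustrative, and tracemalloc only tracks Python-level allocations.

    import time
    import tracemalloc

    def measure(func, *args):
        """Return (result, runtime in seconds, peak memory in bytes) for one call."""
        tracemalloc.start()                        # track Python-level allocations
        start = time.perf_counter()
        result = func(*args)
        runtime = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) since start()
        tracemalloc.stop()
        return result, runtime, peak

    def two_sum(nums, target):
        """Illustrative O(n) time / O(n) space solution to a classic LeetCode problem."""
        seen = {}
        for i, x in enumerate(nums):
            if target - x in seen:
                return [seen[target - x], i]
            seen[x] = i

    _, secs, peak = measure(two_sum, list(range(10_000)), 19_997)
    print(f"runtime: {secs * 1000:.2f} ms, peak memory: {peak / 1024:.1f} KiB")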
Additionally, we evaluated maintainability using metrics such as lines of code, cyclomatic complexity, Halstead complexity, and the maintainability index.
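For reference, assuming the standard textbook definitions (the analysis tools used in the paper may implement slightly different variants):

    cyclomatic complexity    CC = E − N + 2P                    (E edges, N nodes, P connected components of the control-flow graph)
    Halstead volume          HV = (N1 + N2) · log2(n1 + n2)     (n1/n2 distinct, N1/N2 total operators/operands)
    maintainability index    MI = 171 − 5.2 · ln(HV) − 0.23 · CC − 16.2 · ln(LOC)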
For our experiments, we had generative AIs produce program code in Java, Python, and C++ that solves problems defined on the competitive programming website leetcode.com. We selected six LeetCode problems of varying difficulty, resulting in 18 programs (six problems in three languages) generated by each generative AI. GitHub Copilot, powered by Codex (GPT-3.0), performed best, solving 9 of the 18 tasks (50.0%), whereas CodeWhisperer did not solve a single one. BingAI Chat (GPT-4.0) generated correct program code for seven tasks (38.9%), ChatGPT (GPT-3.5) and Code Llama (Llama 2) for four tasks each (22.2%), and StarCoder and InstructCodeT5+ for only one task (5.6%). Surprisingly, although ChatGPT generated only four correct programs, it was the only generative AI capable of providing a correct solution to a coding problem of difficulty level hard. In summary, 26 of the 126 AI-generated programs (20.6%) solve their respective problem. For 11 of the incorrect AI-generated programs (8.7%), only minimal modifications are necessary to solve the problem, which results in time savings of between 8.9% and 71.3% compared to writing the program from scratch.