Effectiveness of ChatGPT in Coding: A Comparative Analysis of Popular Large Language Models

Abstract: This study explores the effectiveness and efficiency of the popular OpenAI model ChatGPT, powered by GPT-3.5 and GPT-4, in programming tasks to understand its impact on programming and potentially software development. To measure the performance of these models, a quantitative approach was employed using the Mostly Basic Python Problems (MBPP) dataset. In addition to the direct assessment of GPT-3.5 and GPT-4, a comparative analysis involving other popular large language models in the AI landscape, notably Google’s Bard and Anthropic’s Claude, was conducted to measure and compare their proficiency in the same tasks. The results highlight the strengths of ChatGPT models in programming tasks, offering valuable insights for the AI community, specifically for developers and researchers. As the popularity of artificial intelligence increases, this study serves as an early look into the field of AI-assisted programming.


Introduction
In recent years, artificial intelligence (AI) has experienced exponential growth, marked by significant advancements in natural language processing and machine learning. This surge has brought about a transformation in various industries and applications. One specific area that has garnered considerable attention is AI-assisted programming. Advanced language models have the potential to revolutionize the way developers create, maintain, optimize, and test code. The primary objective of this study is to evaluate the effectiveness of the most popular large language model publicly available, ChatGPT, encompassing both the GPT-4 and GPT-3.5 models. These two models are tested across a variety of code generation tasks in this study, with the aim of understanding their potential and overall performance.
OpenAI's release of the GPT models and the widely available ChatGPT represents a substantial breakthrough in the advancement of AI capabilities [1,2]. With each iteration, the models have demonstrated improved performance and versatility, generating increased interest in their potential uses and applications across multiple fields. In programming alone, these models have shown significant promise, particularly in automating tasks, improving code, and providing insights to developers.
The impact of AI in programming should not be underestimated. AI has been shown to have the potential to enhance productivity, reduce human error, and automate tasks. Examples of such tasks include code generation, documentation, and bug detection, which effectively streamline the programming process. This, in turn, allows programmers to focus on more complex and creative aspects of their work.
To further analyze the subject, this study employs quantitative testing. The primary objectives of this study are as follows:
• To compare the performance of GPT-4 and GPT-3.5 to that of other popular LLMs in code generation tasks;
• To identify challenges and limitations of using large language models in programming.
Automated code generation has been significantly propelled [3] by recent advancements in large language models such as GPT-3 [4], which surpass the capabilities of earlier state-of-the-art deep learning methods [5].
As an illustration, OpenAI Codex [6], a refined iteration of GPT-3, can produce entirely accurate code for 29% of unfamiliar programming tasks using just one sample of generated programs. When 100 samples were tested, 72% of them were found to be correct. In [7], the authors evaluate GPT's Python code-writing capabilities and the correctness of the generated code. The results in that paper are based on only a small number of samples and show that the model can solve only 28% of the problems. Hammond et al. [8] investigated the possibility of using OpenAI Codex and other large language models (LLMs) to fix software security bugs. The results show that 67% of vulnerabilities in a selection of historical bugs in real-world open source projects can be discovered and fixed by LLMs. Meanwhile, Refs. [9,10] tested the usability of the code generated by LLMs rather than its accuracy.
Xu and colleagues [11] compared the performance of code generated by GPT-Neo, GPT-J, and GPT-NeoX, all of which are large language models (LLMs), when trained with a substantial number of parameters derived from existing code in 12 different languages. Zan et al. [12] investigated the existing large language models for NL2Code and summarized them from diverse perspectives. However, neither of these studies investigated the accuracy and the quality of the code generated by LLMs.
This study serves as an early look into the field of AI-assisted programming, at a time when artificial intelligence is experiencing exponential growth. The goal is to provide useful insights for developers, researchers, and the broader technology community to help shape the future of AI-assisted programming.
In Section 2 of this paper, a brief history of generative AI is introduced, followed by an explanation of how generative AI works. Section 4 of this paper presents the experimental results, and the conclusions are summarized in Section 5.

Generative AI History
Modern generative AI development began in the 1940s after the conception of the first artificial neural networks (ANNs). However, due to constraints such as limited computational capabilities and insufficient knowledge of the brain's biological workings, ANNs failed to draw significant interest until the 1980s. During this period, parallel advances in hardware and neuroscience, along with the emergence of the backpropagation algorithm, eased the training process of ANNs. Previously, training ANNs was a demanding task, as there was no effective method to compute the gradient of the error with respect to each neuron's parameters or weights. Backpropagation automated the training procedure, unlocking the potential usage of ANNs [13].
In 2013, Kingma and Welling presented a novel model structure named variational autoencoders (VAEs) in their paper entitled "Auto-Encoding Variational Bayes". VAEs are generative models grounded in the principle of variational inference. They offer a mechanism for learning via a condensed representation of data, where the data are transformed into a lower-dimensional space called the latent space through an encoding process. Then, the decoder component reconstructs the data back into their original data space [14].
In 2017, Google researchers introduced a pivotal development in their research titled "Attention Is All You Need". This new architecture, called the Transformer, was a revolution in language generation [15]. Unlike previous language models based on long short-term memory (LSTM) [16] or recurrent neural network (RNN) frameworks [17], the Transformer allowed for parallel processing while retaining context memory, leading to superior performance [17].
In 2021, OpenAI released a fine-tuned version of GPT, Codex, which was trained on code publicly available on GitHub. Early results showed that the fine-tuned model was able to solve around 30% of the Python problems used, compared to the 0% that the then-current GPT version (GPT-3) was able to achieve. This served as an early look into how large language models (LLMs) [10] can learn and generate code. Codex then served as the basis for GitHub Copilot [10].
GitHub Copilot is an AI programming tool that can be installed in the most popular code editors and is powered by GPT-4. It reads the code and can generate suggestions and even write code instantly. In a controlled test environment, researchers found that programmers who used Copilot finished tasks approximately 55.8% quicker than those who did not, speaking to the potential of AI tools in programming [18,19].
In another research work, GPT's Python code generation is deemed remarkable, showing that it can help novice programmers to solve complex coding problems using only a few prompts. However, both studies have shown that human input is almost always required to steer ChatGPT in the correct direction [20].

What Is Generative AI
Generative AI models harness the capabilities of neural networks to discern patterns and structures within existing datasets and create original content [21]. These AI models draw inspiration from human neuronal processes, learning from data inputs to create new output that matches learned patterns. This involves advanced techniques, including generative adversarial networks (GANs) [21], large language models (LLMs), variational autoencoders (VAEs), and transformers, to create content across a dynamic range of domains [22].
Numerous methodologies, such as unsupervised or semi-supervised learning, have empowered organizations to utilize abundant unlabeled data for training, laying foundations for more complex AI systems. Referred to as foundation models, these systems, which comprise models like GPT-3 and Stable Diffusion, serve as a base that can be proficient in multiple tasks. They enable users to maximize the potency of language, such as constructing essays from brief text prompts using applications like ChatGPT or creating remarkably realistic images from text inputs with Stable Diffusion [23].
Generative AI models can refine their outputs through repeated training processes by studying the relationships within the data. They can adapt parameters and diminish the gap between the intended and created outputs, continually enhancing their capacity to produce high-quality and contextually appropriate content. The use of this technology is often initiated with a prompt, followed by iterative exploration and refining of variations to guide content generation [24].

Testing Methodology
In this section, the methodological approach used to evaluate ChatGPT is presented, along with the subsequent comparison with other reputable large language models (LLMs). Next, the dataset used for this study is described. Finally, the evaluation strategy is provided, detailing how the tests were performed and how the language model scores were calculated.

Selection of LLMs
After ChatGPT became massively popular, it was only natural that other companies would release their own large language models (LLMs) to compete with OpenAI. In order to paint a more complete picture, it was decided to choose three other popular and easily accessible LLMs and compare their performance. These models were selected based on the company's popularity as well as potential impact.
The initial LLM chosen for testing was Google's Bard. Bard was introduced in February 2023 as Google's next significant step into the AI space [25]. Powered by Google's PaLM2 model [26], Bard is said to excel at advanced reasoning tasks, including math and coding. Bard is currently available for free on the Bard website [27].
The second LLM to be tested was Microsoft's Bing (technically known as Bing Chat, but in this paper, it will be referred to as Bing only). Released in February 2023, Bing is the "copilot for the web" and is an AI chatbot powered by GPT-4, as Microsoft is one of the top investors in OpenAI. Bing Chat is also generally available on the Bing website and should have similar performance to ChatGPT [27,28].
The third LLM to be tested was Anthropic's Claude. While not as popular as Google and Microsoft, Anthropic is a research company established in 2021 by former OpenAI employees. Claude was initially released as the next-generation AI assistant in March 2023. The new version, Claude v2, was quickly released in July 2023. According to Anthropic, Claude has similar use cases to ChatGPT and Bard, including coding capabilities [29,30].

Dataset
This research used a well-known and tested dataset called Mostly Basic Python Problems (MBPP) [7]. This dataset was employed to measure the code generation capabilities of the AI models. It was created by Google researchers and consists of approximately 1000 crowd-sourced Python programming problems. These problems cover various programming fundamentals, functionalities, etc., and are designed to be solved by entry-level programmers [7].
From the main dataset, a subset of around 460 problems has been hand-verified by Google, and this was the dataset used in this research. Each problem consists of:
• Task_id: a number from 1-1000.

Evaluation Strategy
Even though OpenAI offers paid API access to its GPT-3.5 and GPT-4 models, billed per token used, it was decided to use the ChatGPT web interface for all of the tests. The reasoning behind this is that it is the most accessible way to use these models and is probably what most people are going to use due to its ease of use. All models are free to use via their web interfaces, except GPT-4, which is locked behind a paid subscription, "ChatGPT Plus". Bing and Bard do not have API access, while Claude does. However, there is currently a waitlist for Claude's API access. This was another reason for not using the APIs in the tests, so that all models were tested through their web interfaces.
This study used a purely quantitative approach. Each model was provided with the programming prompts from the MBPP dataset. For these tests, only the "prompt" was provided, as well as the name of the function, in order to match the function name that was used in the test cases. Figure 2 shows the prompt that served as the input to the LLM systems. The code generated by the LLM systems was pasted and tested using the code shown in Figure 2, which includes the assertion tests and the test lists to check whether the AI-generated code passes or fails the assertion tests. The assertion tests were designed to produce a total score of between 3 and 6 points per task, where one point was given for each test that succeeded and 0 for each test that failed. The final score for all 100 of the tested programs was 305 points if a 100% pass rate for the tests was recorded.
This process was repeated for all five LLMs with the same prompt. One point was scored for every test passed in the test list for the 100 prompts, and then the final score was calculated and compared.
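The scoring procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the test harness shown in Figure 2: the function name `score_solution`, the example task `max_of_two`, and its three assertions are hypothetical and were chosen only to show how one point per passed assertion could be counted.

```python
def score_solution(generated_code: str, test_list: list[str]) -> int:
    """Return the number of assertion tests passed by the generated code."""
    namespace: dict = {}
    try:
        # Define the generated function in an isolated namespace.
        exec(generated_code, namespace)
    except Exception:
        # Code that does not even load earns no points.
        return 0

    points = 0
    for test in test_list:
        try:
            exec(test, namespace)  # each test is a single assert statement
            points += 1            # one point for a passed test
        except Exception:
            pass                   # zero points for a failed test
    return points


# Hypothetical task and tests (not taken from MBPP).
generated = """
def max_of_two(a, b):
    return a if a > b else b
"""
tests = [
    "assert max_of_two(1, 2) == 2",
    "assert max_of_two(5, 3) == 5",
    "assert max_of_two(-1, -1) == 0",  # deliberately wrong expectation
]

points = score_solution(generated, tests)
print(f"{points}/{len(tests)} points")
```

Note that a sketch like this runs untrusted code directly and has no timeout, so generated code that enters an infinite loop would have to be aborted manually.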
At the second stage of the evaluation process, the lowest scoring LLMs were retested on an equal number of tasks that they had not completed successfully or for which they had not passed 100% of the tests. The goal here was to confirm whether the system was capable of generating the correct code when provided with enough feedback from the human user.
At the third stage, the same prompt was provided to the models in a new conversation to remove any history. Some of the tasks were completed on the first retry, as LLMs tend to be creative and generate different code, while others needed extra feedback. After every response, the code was tested, and feedback was provided to the model, including the results of the test. The number of messages input into the models was also counted until the correct code was provided by the AI system. The maximum number of attempts was set to 10; after that, testing would stop, and the task would be marked as failed.
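The retry procedure with a limit of 10 attempts can be sketched as a simple loop. This is only an illustration of the protocol, since the tests in this study were carried out manually through the web interfaces: `ask_model` stands in for sending a message to a chatbot, and `make_stub_model`, `run_tests`, and the `add` task are hypothetical helpers introduced here for demonstration.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 10  # limit stated in the text


def run_tests(code: str, test_list: list[str]) -> list[str]:
    """Return a list of failure messages (empty if every test passes)."""
    namespace: dict = {}
    try:
        exec(code, namespace)
    except Exception as exc:
        return [f"code failed to load: {type(exc).__name__}"]
    failures = []
    for test in test_list:
        try:
            exec(test, namespace)
        except Exception as exc:
            failures.append(f"{test} -> {type(exc).__name__}")
    return failures


def solve_with_feedback(
    ask_model: Callable[[str], str],
    prompt: str,
    test_list: list[str],
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[int]:
    """Return the number of messages needed, or None if the task failed."""
    message = prompt
    for attempt in range(1, max_attempts + 1):
        code = ask_model(message)
        failures = run_tests(code, test_list)
        if not failures:
            return attempt
        # Feed the test results back to the model as the next message.
        message = "The code failed these tests:\n" + "\n".join(failures)
    return None


def make_stub_model(responses: list[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a chatbot: replays canned responses."""
    remaining = iter(responses)
    last = ""

    def ask(message: str) -> str:
        nonlocal last
        last = next(remaining, last)  # repeat the last answer when exhausted
        return last

    return ask


wrong = "def add(a, b):\n    return a - b\n"
right = "def add(a, b):\n    return a + b\n"
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]

attempts_needed = solve_with_feedback(make_stub_model([wrong, right]), "Write add.", tests)
never_solved = solve_with_feedback(make_stub_model([wrong]), "Write add.", tests)
print(attempts_needed, never_solved)
```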

Experimental Results
The test was performed using 460 Python problems that were certified by Google [7], employing all five of the above-mentioned LLMs, which added up to a total possible score of 1225. It can be considered that a small number of tests would be enough to provide a clear trend in performance.
The obtained results are shown in Figure 3; the Claude model displayed the worst performance, scoring 875 points, equivalent to 71.43%. Google's Bard model performed similarly to Claude, scoring 933 points, or 76.16%. While these models are generally good at understanding the tasks, they have a lot of room for improvement.
As for the GPT-based models, the GPT-3.5 model scored an impressive 1019 points, equivalent to 83.18%. Similarly, Bing only scored 15 points lower, with 1004, which is equivalent to 81.96%. Bing scoring similarly to GPT-3.5 was definitely not expected, as Bing is based on the GPT-4 model. GPT-4, however, scored an impressive 1072, or 87.51%.
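The reported percentages follow directly from dividing each model's points by the total possible score of 1225. The short sketch below reproduces that arithmetic from the figures given in the text; the variable names are illustrative only.

```python
MAX_SCORE = 1225  # total possible score reported in the text

# Points reported for each model.
scores = {
    "Claude": 875,
    "Bard": 933,
    "Bing": 1004,
    "GPT-3.5": 1019,
    "GPT-4": 1072,
}

# Pass rate in percent, rounded to two decimals as in the text.
pass_rates = {model: round(100 * pts / MAX_SCORE, 2) for model, pts in scores.items()}

for model, rate in sorted(pass_rates.items(), key=lambda item: item[1]):
    print(f"{model:8s} {scores[model]:5d} points  {rate:.2f}%")
```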
Digital 2024, 4, 120
These results show that while each model has impressive code generation capabilities, GPT-4 particularly stands out with its highest success rate, closely followed by Bing and GPT-3.5.

Code Quality
One thing that stood out during testing was that the non-GPT-based models typically generate longer code, while GPT-based models tend to generate much shorter, more concise code. There were instances of Bard outputting code that, when run, entered an infinite loop so that the process had to be aborted, while GPT-based models were much more efficient.
One example is task_id 45. The prompt was "Write a python function to find the maximum difference between any two elements in a given array". Bard's generated output was more complicated and less efficient, as it relied on basic programming constructs of the kind a novice programmer would use; the results are shown in Figure 4a. Meanwhile, ChatGPT-4 generated correct compact code using readily available methods in Python, as shown in Figure 4b.
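To make this contrast concrete, the following sketch shows two hypothetical solutions to the same prompt. These are not the model outputs reproduced in Figure 4; they are written here only to illustrate the difference between a longer loop-based style and a compact style that relies on Python built-ins. Both assume a non-empty list and interpret "maximum difference" as the largest value minus the smallest.

```python
def max_difference_verbose(arr):
    """Loop-based style: compare every pair of elements explicitly."""
    best = 0
    for i in range(len(arr)):
        for j in range(len(arr)):
            diff = arr[i] - arr[j]
            if diff > best:
                best = diff
    return best


def max_difference_compact(arr):
    """Built-in style: the largest gap is max minus min."""
    return max(arr) - min(arr)


sample = [2, 1, 5, 3]
print(max_difference_verbose(sample), max_difference_compact(sample))
```

The first version performs a quadratic number of comparisons, while the second makes two linear passes over the list, which is one way shorter code can also be more efficient.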
Another example is task_id 101. The prompt was "Write a function to find the kth element in the given array using based indexing". Bard's output was relatively long and resulted in a complex loop. On the other hand, GPT-4's responses were much more concise and made better use of built-in functions, as shown in Figure 5.
In order to compare the quality of the generated code for the five LLMs, the average number of lines of code (LOC) was calculated for the whole of the tested dataset, and the results are shown in Figure 6. As can be seen from the results, across all of the LLMs, Bard returned the highest average number of LOC, which is taken here as an indication of lower code quality, while Bing produced a higher quality of code compared to ChatGPT-4 and Claude.
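An average LOC figure of this kind can be computed with a few lines of Python. The text does not state how lines were counted, so the sketch below assumes that blank lines and comment-only lines are excluded; the helper names and the two sample solutions are hypothetical.

```python
def count_loc(code: str) -> int:
    """Count non-blank lines that are not pure comments (an assumed convention)."""
    return sum(
        1
        for line in code.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def average_loc(solutions: list[str]) -> float:
    """Average LOC over all generated solutions of one model."""
    return sum(count_loc(code) for code in solutions) / len(solutions)


# Two hypothetical generated solutions.
solutions = [
    "def f(x):\n    # double the input\n    return 2 * x\n",
    "def g(a, b):\n\n    total = a + b\n    return total\n",
]

print(average_loc(solutions))
```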

Providing Feedback to the Model
In the second phase of the tests, the highest-scoring model, GPT-4, and one of the lowest-scoring models, Bard, were tested again on previously failed tasks. This time, feedback was provided to the models to assess how well they could understand the errors and try to correct them. It was found that out of the 16 tasks that GPT-4 did not complete on the first try in the previous phase, GPT-4 was able to complete 14, while Bard was only able to complete 5 tasks after feedback was provided. Figure 7 compares the results of both GPT-4 and Bard.
These tests showed that there is a significant difference in the way both models processed and applied feedback to the generated code. GPT-4 demonstrated an impressive understanding of the provided feedback and effectively incorporated fixes into the code, meaning that with each output, the code improved until it performed the desired task with no errors.
On the other hand, Google's Bard, while sometimes able to generate code effectively, did not exhibit the same level of comprehension of the provided feedback as GPT-4. Bard's responses typically acknowledged the error; however, when generating a "fixed" version of the code, it generally just rewrote the exact same code again.
Overall, the results indicate that the AI language models, especially GPT-4, demonstrate proficient abilities for code generation. GPT-4 also showed adaptability and the capability to improve its own output when faced with real-world coding scenarios, including bug fixing. When properly used, GPT-4 could solve almost 100% of all tasks, showcasing its outstanding ability and potential to be a coding assistant. On the other hand, the second phase of the testing shows that LLMs are best used as coding assistants and not as replacements for programmers. These models benefit greatly from human feedback and improve significantly when proper feedback is provided.
These results also suggest that LLMs have transformative potential in programming and software development, offering researchers and developers insights into the ever-growing field of AI-assisted programming.

Limitations
While LLMs have demonstrated overall effectiveness at writing code, they also have their limitations. Firstly, LLMs sometimes tend to output inconsistent answers [31]. For example, for task_id 11, "Write a Python function to remove the first and last occurrence of a given character from the string", GPT-4 generated two different outputs when the test was repeated with small changes in the prompt. The resulting generated codes are shown in Figure 8.
In this example, both functions work as expected and pass the tests. However, this will not always be the case. This introduces an element of randomness, as the user cannot know in advance whether the code will be correct. The user may therefore have to spend valuable time testing and correcting the code. This brings us to another limitation: in order to use ChatGPT and LLMs efficiently, it is very likely that human feedback will play a key role. Users have to be able to effectively communicate feedback to the model in a way that the model will understand and learn from, which might also be time- and resource-consuming.
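The kind of variation described for task_id 11 can be illustrated with two hypothetical implementations that differ in approach but behave identically. These are not the GPT-4 outputs shown in Figure 8, and the function names and test strings below are chosen only for illustration; both assume the given character is a single character.

```python
def remove_occ_slicing(s: str, ch: str) -> str:
    """Locate the first and last occurrence by index and slice them out."""
    first = s.find(ch)
    if first == -1:
        return s  # character not present
    s = s[:first] + s[first + 1:]
    last = s.rfind(ch)
    if last == -1:
        return s  # there was only one occurrence
    return s[:last] + s[last + 1:]


def remove_occ_replace(s: str, ch: str) -> str:
    """Use str.replace with a count of 1, working on the reversed string for the last occurrence."""
    s = s.replace(ch, "", 1)
    return s[::-1].replace(ch, "", 1)[::-1]


for text, ch in [("hello", "l"), ("abcda", "a"), ("PHP", "P"), ("abc", "z")]:
    print(text, ch, remove_occ_slicing(text, ch), remove_occ_replace(text, ch))
```

Two such variants can pass the same assertions while differing in readability and style, which is why each new output still has to be tested.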
Third, the programming style of the models can differ from that of the users. Every programmer has their own preferences when writing code, and the model may not always follow the same format. This can reduce the readability of the code, and it may also introduce bugs into the software and affect future maintainability.
Fourth, the code provided by the models may have compatibility issues with the rest of the code base in which it will be used. The models generate code based on what they were trained on and do not know the entire context in which it will later be integrated; therefore, the code may lead to bugs and compatibility issues, which in turn will consume resources in order to refactor the provided code.
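To make the inconsistency described for task_id 11 concrete, the sketch below shows two structurally different functions that solve the same task. They are written for illustration only and are not the GPT-4 outputs from Figure 8; the function names and the checks at the end are our own assumptions.

```python
def remove_occ_slicing(s: str, ch: str) -> str:
    """Variant A: locate both positions first, then cut them out with slices."""
    first = s.find(ch)
    if first == -1:
        return s  # character not present, nothing to remove
    last = s.rfind(ch)
    if first == last:
        return s[:first] + s[first + 1:]  # only one occurrence
    return s[:first] + s[first + 1:last] + s[last + 1:]


def remove_occ_replace(s: str, ch: str) -> str:
    """Variant B: drop the first match, then the last one via the reversed string.

    Assumes ch is a single character, as the task statement implies.
    """
    s = s.replace(ch, "", 1)  # remove the first occurrence
    return s[::-1].replace(ch, "", 1)[::-1]  # remove the last occurrence


# Illustrative checks: both variants agree on these inputs.
for func in (remove_occ_slicing, remove_occ_replace):
    assert func("hello", "l") == "heo"
    assert func("abcda", "a") == "bcd"
    assert func("PHP", "P") == "H"
```

Both variants pass the same checks here, yet they differ in style and in how they treat edge cases such as multi-character arguments, which is why repeated generations still need to be reviewed and tested.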

Conclusions and Future Work
This research work, evaluating the GPT models (GPT-3.5 and GPT-4), as well as other large language models (LLMs) from competitors (Bard, Bing, and Claude), using the Mostly Basic Python Problems dataset for code generation tasks, has provided interesting results. These results offer insights into the strengths of these models, as well as areas in which they could improve. Hopefully, this research serves as an empirical demonstration of the current state-of-the-art (SOTA) models in software development. Of all the models tested, GPT-4 exhibited the highest proficiency in code generation tasks, achieving a success rate of 86.23% on the small subset that was tested. The GPT-based models outperformed the other two models, Bard and Claude. These two models performed the worst, indicating a need for improvement in code generation, considering they are the two most recently released models. It was also found that LLMs had some difficulties with code generation, producing bugs and errors, as well as differing solutions when the query was slightly changed or incorrectly worded. This can lead to the inefficient use of time by programmers. Overall, it was found that ChatGPT and other LLMs are, most of the time, able to generate code effectively and can be used as programming assistant tools, but they are not replacements for human software developers, since they require constant feedback and monitoring by a human.
While the results were promising, there is room for future research. More studies can focus on evaluating the generated code itself. New metrics can be considered, such as the cleanliness of the code, execution time, resource usage, and more. Besides the MBPP, more complex tasks can be used to test how the LLMs handle more complicated code. In addition, future research can focus on real-world applications. Such studies can examine the use of these LLMs in professional software development processes besides coding, to identify any other practical uses that can enhance productivity in the industry and potentially revolutionize it.
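As a rough illustration of how some of the metrics proposed above could be collected, the sketch below measures average execution time and peak memory for one call of a generated function, and counts non-blank lines of code in the way described for Figure 6. It relies only on the Python standard library; the helper names `profile_solution` and `count_nonblank_lines` are hypothetical, and a real evaluation would need more careful benchmarking.

```python
import timeit
import tracemalloc
from typing import Any, Callable, Dict


def profile_solution(
    func: Callable[..., Any], *args: Any, repeats: int = 1000
) -> Dict[str, float]:
    """Measure average run time and peak Python memory for one call."""
    # Average wall-clock seconds per call over `repeats` runs.
    seconds = timeit.timeit(lambda: func(*args), number=repeats) / repeats

    # Peak bytes allocated by Python objects during a single call.
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()

    return {"seconds_per_call": seconds, "peak_bytes": float(peak)}


def count_nonblank_lines(source: str) -> int:
    """Count lines of code, excluding blank and whitespace-only lines."""
    return sum(1 for line in source.splitlines() if line.strip())
```

For example, `profile_solution(sorted, [3, 1, 2], repeats=100)` returns a dictionary with the two measurements for the built-in `sorted`, which could then be compared across solutions from different models.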

• Task_id: a number from 1-1000.
• Prompt: the instructions for the LLM, explaining what the code should do.
• Code: a proposed solution for the code.
• Test_imports: libraries that should be imported.
• Test_list: usually 3 test cases to verify if the code works as expected.

Figure 1 shows an example of coding problem number 162 from the dataset [7], along with the test code used.
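The five fields listed above map naturally onto a small pass/fail check for a candidate solution. The sketch below is an assumption about how such a check could look, not the exact test code from Figure 1: the record is invented for illustration (it is not problem 162), the lowercase key names are our own choice, and `passes` is a hypothetical helper.

```python
from typing import Any, Dict

# Hypothetical record in the five-field layout; all values are invented.
record: Dict[str, Any] = {
    "task_id": 0,  # placeholder id, deliberately outside the dataset range
    "prompt": "Write a function to compute the area of a circle.",
    "code": "def circle_area(r):\n    return math.pi * r * r",
    "test_imports": ["import math"],
    "test_list": [
        "assert math.isclose(circle_area(1), math.pi)",
        "assert circle_area(0) == 0",
        "assert math.isclose(circle_area(2), 4 * math.pi)",
    ],
}


def passes(candidate_code: str, rec: Dict[str, Any]) -> bool:
    """Return True only if the candidate satisfies every test in the record."""
    namespace: Dict[str, Any] = {}
    try:
        # Caution: exec runs arbitrary code; use a sandbox for untrusted output.
        for line in rec["test_imports"]:
            exec(line, namespace)  # make required libraries available
        exec(candidate_code, namespace)  # define the candidate function
        for test in rec["test_list"]:
            exec(test, namespace)  # raises AssertionError on failure
    except Exception:
        return False
    return True
```

Calling `passes(record["code"], record)` checks the reference solution against its own tests; the same call with model-generated code in place of `record["code"]` gives the per-task outcome from which a success rate can be computed.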

Figure 1. Example of the used Python problems [7]. (a) The code problem number 162. (b) The testing code.

Figure 2. Example of the prompt used as an input to the LLM systems.

Figure 3. Performance comparison between the different LLMs.

Figure 6. Average number of lines of code for each model, excluding blank lines, tested for the whole dataset.

Figure 7. Comparison of GPT-4 and Bard's score when providing feedback.