1. Introduction
Artificial intelligence (AI) has rapidly become a transformative force across diverse sectors, significantly impacting areas such as code generation, image creation, and content development through its unparalleled ability to extract insights from vast datasets. Integrating AI knowledge and skills into educational curricula not only enhances students’ job market prospects but also fosters a culture of innovation and creativity, equipping them to tackle complex problems with advanced tools and knowledge. This integration is crucial in preparing a future workforce adept in AI and machine learning, ensuring global competitiveness.
Generative large language models (LLMs), specifically advanced chatbots (referred to as LLMs or GPTs in this paper), represent a groundbreaking innovation in AI. These models are profoundly transforming the educational landscape by providing intelligent, adaptive learning experiences and optimizing various administrative and educational tasks.
In a recent test, ChatGPT achieved an IQ score of 155, placing it above 99.98 percent of the human population [
1]. While human-like properties such as consciousness remain limited, notable progress has also been reported in this area [
2]. Furthermore, extensive “classical” Turing tests involving non-specialists demonstrated that humans were often unable to distinguish between the outputs generated by specialized LLMs and those produced by humans [
3].
In information and communication technology (ICT) education, programming presents significant challenges, as it demands a combination of technical skills, analytical thinking, and problem-solving abilities. LLMs have emerged as valuable tools in this space, offering personalized instruction and real-time support. By leveraging their natural language processing and programming capabilities, LLMs assist students in overcoming obstacles, reinforce key concepts, and deepen their understanding of programming fundamentals. One of the key advantages of LLMs in programming education is their ability to provide immediate assistance. They can swiftly identify errors, explain underlying issues, and suggest alternative solutions, thereby enhancing students’ problem-solving skills through timely feedback and encouraging the exploration of diverse approaches.
Not all the applications of LLMs have been successful. First, it is well known that LLMs fail at some simple tests, e.g., [
4]. Second, it is essential to provide proper prompts to obtain good results. For example, writing a good prompt amounts to providing program specifications in natural language; therefore, it may be unfair to compare humans either with LLMs given perfect prompts or with LLMs given insufficient prompts.
This study aimed to fairly and consistently compare four distinct LLMs—ChatGPT (now GPT-4o), Bard (now Gemini), Copilot, and AutoGPT—in terms of their effectiveness in developing machine learning (ML) curricula tailored for high school students. The primary research aim was to investigate the efficacy of generative AI (GAI) in designing and preparing curricula suitable for high schools. The two main questions were how well LLMs perform such curriculum-development tasks compared to human experts and which of the tested LLMs is best suited to them.
We prepared a series of tests for the LLMs, which included generating code from descriptions, generating descriptions from code, creating course text from descriptions, identifying errors and warnings in code, and optimizing code. Although the systems were evaluated on a wide range of tasks, this paper focuses on the most informative results. The conclusions drawn are broadly applicable to both the presented tests and the additional evaluations conducted within this framework.
The first hypothesis in this study was that GAI LLMs would perform comparably to human experts on several tasks designed as part of the international Valence project [
5]. This study also aimed to identify which tasks each LLM would perform successfully and which would present challenges. Although comparing response times is inherently problematic (the project spanned three years, while LLMs generate responses within seconds), significant time was invested in crafting appropriate prompts, often taking weeks, and the study of each task spanned several months. The second hypothesis anticipated notable differences in the performance of the tested LLMs across various tasks. The third hypothesis was that the timing of the tests would reveal temporal differences in the rankings and performance of the LLMs, indicating progression over time.
This paper is organized as follows:
Section 2 provides a review of the relevant scientific literature.
Section 3 outlines the benchmark teaching materials used for evaluating the performance of the LLMs and describes the experimental setup in detail.
Section 4 presents the key findings from the experiments. Finally,
Section 5 discusses the implications of these findings, draws conclusions, and suggests directions for future research.
2. Related Work
Recent advancements in LLMs have demonstrated their potential for addressing diverse educational needs, ranging from personalized tutoring to content creation, as shown by their successful application in numerous educational experiments and studies (e.g., [
6]). These studies underscore the growing acceptance and effectiveness of LLMs in enriching both teaching and learning experiences. In this paper, we explore the evolving roles of LLMs in the educational domain, drawing on insights from recent scholarly research.
A study [
6] highlights the challenges faced by the growing population of international students, particularly language barriers. The authors acknowledged the potential of LLMs in assisting these students, though they emphasized that LLMs are not replacements for faculty expertise in areas such as answering questions or creating material. Nonetheless, LLMs serve as valuable tools for students seeking assistance in subjects like machine learning.
Field experiments using LLMs to design cultural content in multiple languages are discussed in [
7]. In these experiments, LLMs functioned either as ICT-based tutors or as assistants to human tutors. Their user-driven interaction and welcoming language fostered an inviting, self-paced learning environment. The adaptability of LLMs in content creation not only enhanced tutor creativity and productivity but also significantly reduced their cognitive load in designing educational materials. This allowed educators to focus on more critical tasks. An evaluation through questionnaires revealed that students using LLMs to learn history outperformed those in a control group who did not have access to GPT assistance.
The widespread use of mobile phones in educational environments is discussed in [
8]. Rather than limiting phone usage, the study investigated the creation of a conversational assistant designed to support students who are new to computer programming. The results showed a positive reception among students and a strong willingness to utilize LLMs for learning. Moreover, devices such as smartphones, smartwatches, and smart wristbands were shown to further enhance the learning experience with LLMs [
9,
10]. However, our study did not focus on any specific hardware.
In [
11], the researchers developed a GPT to assist in teaching the Python programming language. This tool aids in program comprehension for novice programmers, lowering entry barriers to computer programming. The GPT is equipped with a ‘knowledge unit’ for processing user requests and a ‘message bank’ containing a database of predefined responses. The bot can discuss programming concepts, offer scheduling support for tutor meetings, or answer predefined questions. Surveys indicated that many students find computer programming courses challenging, underscoring the need for clear and accessible teaching methods. LLMs like this represent a step toward creating a more conducive learning environment.
The application of LLMs extends to assessing teamwork skills, as demonstrated in [
12]. In this study, a GPT was used to simulate human interactions in an online chatroom, unbeknownst to the users, to evaluate teamwork skills. The GPT interactions were analyzed using an algorithm to score users’ teamwork abilities. These scores were found to correlate strongly with the assessments conducted by human experts, suggesting that LLMs can effectively mimic human interactions and deliver reliable assessments, even in areas typically influenced by subjective factors.
A paper [
13] discusses the use of LLMs like Code Tutor, ProgBot, and Python-Bot to assist students in learning programming languages such as Java, C++, and Python. The LLMs provided coding suggestions, corrected errors, and adapted quizzes based on student progress. A GPT built on the IBM Watson service was used for non-technical students, emphasizing the need to help students learn programming logic. The study suggests redesigning GPT scripts to better support student learning based on the four-component instructional design (4C/ID) framework.
The adoption of GPT applications was studied in [
14] by integrating the technology acceptance model (TAM) and emotional intelligence (EI) theory. The study focused on international students using GPT applications in their university, employing machine learning methods to analyze the data. The results showed that classifiers like simple logistic, iterative classifier optimizer, and LMT had high accuracy in predicting users’ intention to use LLMs, suggesting the effectiveness of LLMs in enhancing educational activities.
ML algorithms are increasingly being used to automatically assess student learning outcomes, as illustrated by a recent work [
15], where image classification served as a key application. Significant progress has also been made in the field of ophthalmology, with researchers conducting various experimental evaluations to explore the role of AI in medical diagnostics [
16]. Moreover, an insightful study [
17] delved into content- and language-integrated learning, highlighting the potential of AI in educational contexts. Additionally, [
14] examined the integration of emotional intelligence within machine learning frameworks, further enhancing the adaptability and effectiveness of AI in diverse learning environments.
This study stands apart from the existing research by offering a comprehensive evaluation of the performance of four distinct LLMs over one year, with two separate assessments, each lasting several months. The primary focus of this investigation was the integration of ML education into high school curricula. In contrast to previous studies that broadly explored the application of generative pretrained transformers (GPTs) across diverse educational contexts, this research narrows its scope to specifically compare these LLMs to identify the most effective tools for ML instruction in high school courses.
4. Results
The Results Section consists of subsections, each corresponding to one specific experiment or test. Each test was conducted using the LLM version most appropriate for the particular experimental conditions at the time of the experiments.
4.1. Text-to-Text Experiment
To evaluate the text generation capabilities of the selected LLMs, the Valence ML definitions within the Jupyter notebooks were utilized. These materials had been thoroughly discussed and refined during the project, ensuring they were relevant and well suited for the experiment. The LLMs, therefore, competed both against each other and against human experts on tasks derived from material carefully tailored for high school education purposes. A total of 35 definitions, covering topics such as programming languages, basic statistics, mathematical concepts, and ML techniques, were extracted from the Valence material to serve as a benchmark.
In all experiments, the Valence material, developed by experts in AI and high school education, served as the ‘ground truth’. The LLMs were tasked with reconstructing the definitions from provided code and examples, with up to five attempts allowed per definition to produce the closest match. In each iteration, the LLMs were given consistent hints to guide their improvement, simulating the instructional feedback typically provided in high school settings. Given the inherent difficulty of achieving an exact replication, we employed a subjective evaluation method similar to a teacher’s assessment. This approach assessed whether the LLM-generated definitions effectively captured most or all critical elements outlined in the original Valence material.
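As an illustration of this procedure, the sketch below shows the iterative prompt-and-review loop in simplified form. The ask_llm and teacher_accepts callables are placeholders for the manual chat interaction and the subjective, teacher-like judgment used in the study, and the hint text is likewise only illustrative.

```python
def evaluate_definition(ask_llm, teacher_accepts, material, max_attempts=5):
    """Iteratively prompt an LLM to reconstruct a Valence-style definition.

    ask_llm(prompt) and teacher_accepts(answer) are caller-supplied callables
    standing in for the chat interface and the subjective expert judgment.
    Returns the number of prompts needed, or None if all attempts fail.
    """
    prompt = f"Write the definition illustrated by the following material:\n{material}"
    for attempt in range(1, max_attempts + 1):
        answer = ask_llm(prompt)
        if teacher_accepts(answer):  # checked against the Valence 'ground truth'
            return attempt
        # consistent hint after each unsatisfactory attempt, as in the study
        prompt = ("The definition misses key elements of the course material; "
                  "please revise it.")
    return None  # counted as a failure
```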
In addition to assessing the readability of the generated content, the code-to-text experiment was expanded to include evaluations of the accuracy of the explanations and the appropriateness of the language style for high school education. Experts with a background in programming evaluated the correctness of the content, ensuring that the explanations accurately reflected the underlying code. Furthermore, the language style was analyzed to ensure clarity, conciseness, and suitability for high school students, with a focus on pedagogical clarity and avoiding overly technical language that could hinder student comprehension.
An illustrative comparison of one such definition is shown in
Figure 1. The evaluation results are subsequently presented in
Figure 2,
Figure 3 and
Figure 4, which compare each tested definition alongside the number of prompt attempts required to achieve a satisfactory outcome.
Figure 3 for GPT-3.5 (an average of 2.03 prompts in 2023 and 1.86 in 2024) and
Figure 2 for Gemini (1.97) and Copilot (2.23; both 2024) show that these LLMs generally performed similarly. ChatGPT-4 in 2023 (1.69) and ChatGPT-4o in 2024 (1.69) appeared to require fewer prompts on average than the other LLMs, as presented in
Figure 4, but the differences were not significant.
On average, the LLMs therefore performed nearly as well as the human experts, who had worked on these items for a substantial period, whereas the LLMs needed only seconds to produce results once the prompts were prepared. This suggests a high level of efficiency by the LLMs in generating technical definitions. The LLMs also properly prioritized key details.
At the same time, it was observed that the LLMs’ responses often lacked examples and tended to be verbose. To enhance the precision of the definitions generated, it is advisable to use detailed prompts, specifying aspects such as desired sentence count and response format. Such specificity in prompts can expedite the process of obtaining the required information.
Figure 4 shows that ChatGPT-4 typically required only one or two prompts to generate an accurate definition, although there were two instances where it failed. Excluding these failures, the average number of prompts needed by ChatGPT-4 to reach a satisfactory definition was 1.68, indicating that, in most cases, two prompts were sufficient to achieve the desired result. The other LLMs performed slightly worse in comparison.
ChatGPT-4o (1.69) performed similarly to ChatGPT-4 (1.69) in 2024, which was expected since the older version already performed so well that there was not much room for improvement. Also, the replies seemed a bit more human-like.
In summary, ChatGPT-4o performed slightly better in the text-to-text experiments, and all LLMs performed reasonably well. Please note that only the most interesting comparisons are presented in this paper, while summaries include overall observations.
4.2. Text-to-Code Experiment
In this subsection, the LLMs’ capabilities in generating programming code from textual prompts are tested. Each LLM was given a maximum of five attempts to produce a functional code segment. This experiment involved all four LLMs: Bard (Gemini), AutoGPT, Copilot, and ChatGPT. If an LLM failed to generate a valid code segment after five attempts, the test was concluded as a failure for that segment. A total of 12 diverse code segments, representing various algorithms and methods, were selected for this purpose. This diversity was introduced to challenge the AI’s adaptability to different programming problems.
Table 1 presents an overview of the 12 code segments selected for this study, including the algorithm/method, category, and rationale for each. These code segments represent a diverse set of machine learning tasks, spanning supervised and unsupervised learning, as well as optimization and classification problems. The algorithms were chosen to challenge the adaptability of LLMs to various programming tasks commonly encountered in educational settings.
To ensure the consistency and quality of the results, each LLM was initially provided with a uniformly structured prompt describing the code segment. Follow-up prompts were tailored to address specific errors identified in the generated code. This approach simulated the experience of a novice programmer, who might rely on copying and pasting error messages without fully understanding the underlying issues. The goal was to replicate the educational process, where a teacher and student engage in an interactive, iterative dialogue aimed at improving understanding and performance.
While formalizing this process can be challenging, our method mirrored real-world scenarios of human feedback, in which helpful hints progressively guide the learner toward a more accurate solution. Unlike human interactions, which can be influenced by external factors such as a student’s appearance or demeanor, the interaction with LLMs remains purely objective, ensuring a more consistent and impartial feedback mechanism throughout the experiment.
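The sketch below illustrates this novice-style feedback loop under simplifying assumptions: the ask_llm callable stands in for the manual chat interaction, and the generated code is simply executed as a script with subprocess, which is not necessarily the tooling used in the study.

```python
import subprocess
import sys
import tempfile

def generate_with_feedback(ask_llm, task_description, max_attempts=5):
    """Ask an LLM for code, run it, and feed any error message back verbatim,
    mimicking a novice who pastes errors without interpreting them."""
    prompt = f"Write a Python program that {task_description}"
    for attempt in range(1, max_attempts + 1):
        code = ask_llm(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=60)
        if result.returncode == 0:
            return code, attempt  # runnable code obtained on this attempt
        # the follow-up prompt contains only the raw error message
        prompt = f"The code produced the following error; please fix it:\n{result.stderr}"
    return None, max_attempts  # concluded as a failure after five attempts
```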
The generated code from each prompt was assigned a score on a scale from 1 to 5, with criteria as follows:
1/5: Code causes a compilation error, rendering it unusable.
2/5: Code runs but produces an output significantly different from the requested one.
3/5: Code runs but produces an output that only partially aligns with the expected result.
4/5: Code nearly achieves the desired output, with minor discrepancies.
5/5: Code meets or exceeds the expected output requirements.
The performance scores (ranging from 5 for the best to 1 for the worst) are plotted for the four LLMs in
Figure 5 and
Figure 6 to compare the highest score achieved by each model across the different code segments. During the second evaluation, updated versions of the systems were introduced, bringing additional changes. AutoGPT was replaced with Copilot in the second comparison, and Bard was updated to its newer version, Gemini.
These comparisons enabled testing the statistical significance of the similarities between the LLMs, helping to assess how closely their performance aligned across different tasks. The tests presented in
Table 4 indicate that, overall, the LLMs have improved over time.
While these scores indicate overall performance, they do not account for the efficiency in reaching these scores, measured by the number of prompt attempts. Therefore, a new metric was introduced: the efficiency score (
1). This score was calculated by dividing the LLM's average score by the total number of prompt attempts, so that a model reaching high-quality output with fewer prompts receives a higher value.
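Under this definition, the efficiency score can be computed as in the sketch below; the per-segment scores and prompt counts shown are illustrative placeholders rather than the study's data.

```python
def efficiency_score(scores, prompt_attempts):
    """Efficiency score as defined above: the average per-segment score (1-5)
    divided by the total number of prompt attempts over all segments."""
    return (sum(scores) / len(scores)) / sum(prompt_attempts)

# illustrative values only
scores = [5, 4, 5, 3, 5]      # best score reached for each code segment
attempts = [1, 2, 1, 3, 2]    # prompts needed for each code segment
print(round(efficiency_score(scores, attempts), 3))  # e.g., 0.489
```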
The efficiency scores for each LLM are presented in
Table 5:
4.3. CodeBERT Score
To quantitatively evaluate the quality of the code segments generated by the LLMs, the CodeBERT score [
24] was employed as a key metric. The CodeBERT score measures semantic similarity by comparing the generated code to the original code from the Valence project. It utilizes contextual embeddings from large pretrained models, which have been shown to correlate strongly with human evaluative preferences. The calculation of the CodeBERT score incorporates four key metrics: precision, recall, F1 score, and F3 score. Each of these metrics contributes to a comprehensive assessment of the generated code’s quality:
Precision: measures the proportion of relevant instances among the retrieved instances.
Recall: assesses the proportion of relevant instances that were retrieved out of the total number of relevant instances.
F1 score: provides a balance between precision and recall, calculated as the harmonic mean of these two metrics.
F3 score: places more emphasis on recall than on precision, suitable for scenarios where missing relevant instances (lower recall) is more critical than retrieving irrelevant ones (lower precision); a short computational sketch follows this list.
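The F1 and F3 scores are both instances of the general F-beta measure derived from precision and recall; the sketch below illustrates the computation with purely illustrative precision and recall values.

```python
def f_beta(precision, recall, beta=1.0):
    """General F-beta measure; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

precision, recall = 0.91, 0.85  # illustrative embedding-similarity values
f1 = f_beta(precision, recall, beta=1.0)
f3 = f_beta(precision, recall, beta=3.0)  # emphasizes recall, as described above
print(f"F1 = {f1:.3f}, F3 = {f3:.3f}")
```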
The integration of these metrics within the CodeBERT score framework allows for a nuanced analysis of the LLMs' code generation capabilities, aligning with both technical accuracy and human judgment standards. The CodeBERT scores for each LLM are presented in
Table 6.
In summary, in the text-to-program experiments, ChatGPT, Copilot, and AutoGPT consistently performed well across all metrics, while Bard and Gemini did not exhibit similar performance on certain tests.
4.4. Code-to-Text Experiment
This subsection presents an evaluation of the LLMs’ ability to generate textual descriptions from provided code segments. To ensure consistency, an identical prompt was employed across all four LLMs (ChatGPT, Bard, Copilot, and Auto-GPT). A total of 11 distinct code segments were selected for this assessment. The uniform prompt issued to each LLM was “Generate a brief description of this code without using lists or bullet points”.
After the LLMs generated their descriptions, a binary rating system—‘good’ or ‘bad’—was applied. This system was chosen with the assumption that users without a computer science background, who are likely to use this feature, may lack the expertise to identify technical inaccuracies in the descriptions. Such users might also find it difficult to reformulate their queries for more precise explanations. To stay aligned with this user perspective, no additional prompts were issued after the initial description was generated.
This approach aimed to assess the LLMs’ effectiveness in providing clear and accurate descriptions that are easily understandable by nonexperts, reflecting real-world scenarios where laypersons seek to comprehend code without prior programming knowledge. Only Bard and Gemini produced 2 ‘bad’ descriptions according to our standards, while ChatGPT, AutoGPT, and Copilot achieved a perfect score of 11/11.
In conclusion, all LLMs performed well in the code-to-text experiment.
4.5. Error/Warning Detection and Optimization
This study further investigated the potential of the LLMs in assisting with code debugging and providing suggestions for improvement. Specifically, we evaluated the ability of the LLMs to identify errors and warnings within a set of selected code segments. Among these segments, two contained errors, one included a warning, while the remainder functioned correctly. The objective was to assess whether the LLMs could accurately detect and resolve these issues or if they would mistakenly flag nonexistent problems.
The methodology followed the approach used in the from-code-to-text experiment, utilizing a single prompt: ‘Tell me if this code has errors or warnings in it and try to improve it’. Auto-GPT was excluded from this test, as it was unable to interpret code segments and treated them as regular text, reflecting its primary design as a task automation assistant rather than a conversational agent. The results for Bard, ChatGPT, and Copilot are presented in
Table 7.
The results in
Table 7 indicate that while Bard did not identify any of the existing errors or warnings, it also did not create any false positives. ChatGPT-4, on the other hand, successfully detected two errors but failed to identify the warning and incorrectly reported an additional error. Copilot detected the warning but also incorrectly reported an additional warning/error. Both ChatGPT-4 and Copilot invented new errors/warnings, while Bard did not.
In addition to error detection, the LLMs’ capability to optimize code segments, particularly in terms of execution speed, was also examined. The following results were obtained after prompting the LLMs to enhance the code:
On average, the LLMs produced code that was slower than the original, indicating a lack of reliability in code optimization. It is noteworthy that there was significant variability in the optimization results, with some instances showing substantial improvements, while others exhibited considerable slowdowns.
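A minimal sketch of how such an execution-speed comparison can be carried out is shown below; the two functions merely stand in for an original code segment and its LLM-modified counterpart and do not reproduce the study's code.

```python
import timeit

def original_segment():
    # stand-in for an original Valence code segment
    return sum(i * i for i in range(10_000))

def llm_optimized_segment():
    # stand-in for the version returned by an LLM asked to speed the code up
    return sum([i * i for i in range(10_000)])

t_orig = timeit.timeit(original_segment, number=1_000)
t_llm = timeit.timeit(llm_optimized_segment, number=1_000)
print(f"original: {t_orig:.3f} s, LLM version: {t_llm:.3f} s, "
      f"speedup: {t_orig / t_llm:.2f}x")
```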
In conclusion, compared to human experts, the LLMs optimized code quite poorly. ChatGPT and Copilot provided reasonable error and warning detection, while Bard detected none.
4.6. Extending the Text-to-Code Experiment
To achieve a more comprehensive evaluation of LLM performance, we expanded the initial set of 12 code segments by generating variations and additional examples based on the original set. This expansion resulted in a total of 70 unique examples, after which new examples began to exhibit redundancy. Additionally, GPT was tasked with providing further examples, leading to a total of 103 examples (the dataset can be obtained by contacting the authors). These additional samples were designed to capture greater task diversity and complexity, enabling a more robust assessment of the LLMs across a wider range of scenarios. However, it is important to note that the additional samples were not part of the original Valence data.
The evaluation framework for the additional samples incorporates three distinct metrics to assess the comparative performance of the language models. These metrics are explained in
Table 8:
Strict comparison: This metric assigns a score of 0 when the output qualities are closely comparable and distinct scores when GPT’s performance is superior or when the alternative LLM outperforms GPT. An overall performance review found that GPT performed better in 13% of the cases.
Soft similarity: This score rates the outputs as 0 when they are generally similar and marks them only when there are significant differences. It is intended to capture broader performance trends rather than focusing on minor discrepancies. GPT was found to have a 6% advantage using this approach.
CodeBERT metric: This metric is a more objective, data-driven measure of similarity between outputs, calculated using the CodeBERT model. CodeBERT focuses on semantic similarity, meaning it assesses how closely the generated outputs match the reference outputs in meaning. The results showed an 88% similarity, reflecting approximately a 10% difference between the outputs, demonstrating that both models performed highly similarly.
These results highlight the importance of using a range of metrics to compare different LLMs. Overall, the findings suggest that GPT-4 and Gemini performed quite similarly until precision-based metrics were applied. The results aligned with those of the tests conducted on the 12 original samples, supporting the conclusion that metrics such as CodeBERT can reveal subtle performance differences.
4.7. Summary of Results
Table 9 presents a comparative analysis of the four LLMs (ChatGPT, Bard/Gemini, Copilot, and Auto-GPT) across various tasks such as text generation, code generation, error detection, and optimization. The table highlights key performance metrics including accuracy, average iterations required, success rates, and speed optimization performance.
Overall, the observed results suggested that while all tested LLMs exhibited strong capabilities in generating coherent and contextually appropriate text, there were notable differences in their accuracy and consistency. These differences highlight the importance of the continuous development and refinement of LLMs to meet specific educational needs effectively. A subset of the questions and some of the code are gathered in
Appendix A.
5. Discussion and Conclusions
This paper presents a comparative analysis of four LLMs—ChatGPT, Bard (now known as Gemini), Copilot, and Auto-GPT—evaluated during the summer of 2023 and the spring of 2024. The primary objective was to assess their effectiveness in generating high school curricula focused on ML and AI topics. The performance of these LLMs was evaluated using various metrics in an educational context to provide insights into their suitability for specific tasks and educational levels.
An initial question was whether to integrate LLMs into educational tools. The findings revealed the exceptional performance of LLMs across multiple tasks. Their ability to produce expert-level results within seconds, compared to the weeks typically required by human contributors, is remarkable. Before the advent of LLMs and generative AI, no tool exhibited such a high degree of efficiency across diverse tasks. Consequently, the overall assessment is highly favorable, consistently highlighting the utility of LLMs in educational settings.
This observation aligns with the application of LLMs across various domains:
In the field of medicine, the highest-performing LLMs have demonstrated the ability to correctly answer 90 percent of questions on the United States Medical Licensing Examination (USMLE) and surpassed web-based consultations in both quality and empathy [
25].
During the initial phases of innovation, including ideation, exploration, and prototyping, LLMs provided significant support. Real-world examples of AI-assisted innovation highlight their role in accelerating progress and reducing costs across diverse projects [
26].
In genomics, researchers have developed an extensive QA database to evaluate GPT models, revealing their potential to revolutionize biomedical research. These models have been shown to substantially reduce AI-induced hallucinations and enhance the accuracy of genomics-related inquiries [
27].
While LLMs’ effectiveness is evident across various fields, their applicability varies significantly across educational tasks. Unlike standardized assessments such as the medical USMLE, which provide uniform evaluations, educational materials span a wide spectrum, making objective and consistent evaluation more challenging. This study suggests that some educational tasks align more closely with specific LLM capabilities than others. For tasks where LLMs have demonstrated effectiveness, delaying their integration into education seems unnecessary. However, it is crucial to recognize the risks of misuse when deploying these tools without careful consideration. Rigorous research and the thoughtful application of LLMs in educational contexts are essential to maximize their benefits while minimizing potential pitfalls. Future research should focus on identifying the most effective tools and contexts for specific educational tasks.
Consider, for example, code analysis, where the LLMs exhibited a reasonable capacity to detect errors and warnings in the provided code segments. However, due to the limited scope of the test samples, these findings should be interpreted as applying primarily to short, similar programs, and further extensive testing is necessary to establish more definitive conclusions for other types of programs. Moreover, the effectiveness of this functionality varied, proving beneficial in some instances while being less effective in others. Other publications, such as [
28], have similarly noted the variability in LLM performance when applied to different types of code analysis tasks, highlighting the need for more robust benchmarking across diverse programming environments and task complexities. This underscores the importance of future studies focusing not only on broader testing scenarios but also on refining LLMs to improve consistency in code analysis, especially in more complex and longer programming contexts.
Regarding code optimization, particularly in improving execution speed, LLMs showed suboptimal performance. On average, the code modifications proposed by the LLMs resulted in slower execution times compared to the original code. Although LLMs generated results within seconds, human experts required significantly more time to produce optimized code. This implies that while LLMs offer rapid solutions, their optimization capabilities are more suited to novices than experts. These findings highlight a potential area for improvement in future LLM designs. While some studies, such as [
29], found LLMs to be competitive in code optimization, our comparison against human experts showed otherwise, likely due to the different nature of the tests used.
Regarding the different versions from 2023 and 2024, the newer versions generally showed improved performance. While it is understandable that high-performing systems may not experience significant upgrades, it is puzzling why some underperforming systems did not show substantial improvement over time, suggesting that certain problems may not be well suited to specific LLMs. There were, however, other notable advancements, such as improved multimedia capabilities and faster response times compared to the earlier versions.
The results suggested that ChatGPT, Copilot, and Auto-GPT generally performed slightly better than Bard (later Gemini) on most of the tests, though the differences were not substantial. These three LLMs appeared to demonstrate greater proficiency on tasks related to curriculum creation and problem solving within the context of AI and ML. Bard’s somewhat lower performance on these tests may reflect differences in design and functionality, which are unique to each LLM. It is important to note that Bard (Gemini) has a different architecture than the other three models. Additionally, as Auto-GPT is a more specialized tool, ChatGPT-4 (later ChatGPT-4o) and Copilot emerged as particularly well suited for the tasks at hand.
Regarding the potential extension of the results to other educational levels, LLMs may encounter difficulties with tasks aimed at younger students, especially those requiring simpler, more creative approaches, analogies, or explanations tailored to a child-friendly level. The limited familiarity of these students with technical terminology could result in the need for more iterations to achieve satisfactory outcomes. Conversely, college students, who are likely to engage with more advanced topics like deep learning and reinforcement learning, may benefit from LLMs in addressing complex concepts. However, the models might face limitations when tackling tasks that demand deeper reasoning, abstract thinking, or specialized understanding of intricate algorithms. On the other hand, it is also possible that the LLMs could handle these variations in complexity without significant difficulty. Investigating how LLMs perform across different educational levels will be a key focus of future research.
Regarding the three outlined hypotheses:
The first hypothesis, which posited that LLMs could perform comparably to human experts in specific tasks, was confirmed. LLMs demonstrated performance on par with humans, particularly in text generation and error detection, and even matched expert performance in some cases.
The second hypothesis, suggesting significant performance differences between the LLMs, was partially validated. While most LLMs performed well on various tasks, they struggled with some, e.g., code optimization. Notably, ChatGPT and Copilot slightly outperformed Bard and Auto-GPT on several tasks, highlighting the variability in LLM capabilities.
The third hypothesis, which suggested that the timing of tests would reveal temporal differences in LLM performance (potentially fast progress), was not broadly supported. While we observed some improvements in LLM performance on specific experiments over a year and during the several-month testing period, their overall capabilities remained relatively stable, with no significant advancements on some tasks. Notably, there was no exponential growth in terms of performance. This underscores the need for continued development to further enhance LLM effectiveness, particularly for more complex educational tasks.
In addressing the primary questions posed in the Introduction, this study demonstrates that LLMs such as ChatGPT, Bard, Copilot, and Auto-GPT offer varying levels of utility across tasks related to the creation of high school ML curricula. LLMs were particularly effective in generating text-based content, providing accurate definitions, and identifying errors in code. However, their performance was less consistent on more complex tasks like code optimization and problem solving, where human expertise continued to hold an edge. The comparative analysis revealed notable differences among the LLMs, with ChatGPT and Copilot consistently, though only slightly, outperforming Bard and Auto-GPT on most tasks, especially in curriculum development and technical problem solving within the realm of AI and machine learning education.
The LLMs demonstrated certain limitations, particularly on tasks that required a deeper understanding of context, reasoning, and creative problem solving. Their responses often lacked illustrative examples and tended to be overly verbose. Numerous studies have already conducted extensive evaluations of these weaknesses across various domains, including cognitive tasks [
2]. Our study adopted a more focused approach, specifically assessing LLM performance within the context of high school machine learning curricula. In this domain, while the LLMs excelled at quickly generating relevant content, they often fell short when faced with advanced problem solving or tasks requiring creative solutions. For instance, for code generation and error detection tasks, the LLMs sometimes failed to fully grasp the context, leading to errors that human experts would typically avoid.
Although these models are not intended to replace human educators, especially in tasks that demand creativity and deep contextual understanding, they offer an unprecedented level of support in automating routine educational tasks and providing real-time assistance to students. Their ability to quickly generate relevant content and assist with immediate feedback makes them a valuable tool in the classroom. As LLMs continue to evolve, future advancements may address their current limitations, potentially allowing them to play an even more integral role in educational settings, complementing human educators in increasingly sophisticated ways.
These unique tests not only contribute to the understanding of LLM efficacy in educational settings but also shed light on pedagogical strategies that may optimize ML education for high school students.