Article

Large Language Models for Machine Learning Design Assistance: Prompt-Driven Algorithm Selection and Optimization in Diverse Supervised Learning Tasks

by
Fidan Kaya Gülağız
Department of Computer Engineering, Faculty of Engineering, Kocaeli University, İzmit 41001, Kocaeli, Turkey
Appl. Sci. 2025, 15(20), 10968; https://doi.org/10.3390/app152010968
Submission received: 7 August 2025 / Revised: 27 September 2025 / Accepted: 3 October 2025 / Published: 13 October 2025

Abstract

Large language models (LLMs) are playing an increasingly important role in data science applications. In this study, the performance of LLMs in generating code and designing solutions for data science tasks is systematically evaluated on real-world tasks from the Kaggle platform. Models from different LLM families were tested under both default settings and configurations with hyperparameter tuning (HPT) applied. In addition, the effects of few-shot prompting (FSP) and Tree of Thought (ToT) strategies on code generation were compared. Alongside technical metrics such as accuracy, F1 score, Root Mean Squared Error (RMSE), execution time, and peak memory consumption, LLM outputs were also evaluated against Kaggle user-submitted solutions, leaderboard scores, and two established AutoML frameworks (auto-sklearn and AutoGluon). The findings suggest that, with effective prompting strategies and HPT, models can deliver competitive results on certain tasks. The ability of some LLMs to suggest appropriate algorithms indicates that LLMs can be seen not only as code generators, but also as systems capable of designing machine learning (ML) solutions. This study presents a comprehensive analysis of how strategic decisions such as prompting methods, tuning approaches, and algorithm selection affect the design of LLM-based data science systems, offering insights for future hybrid human–LLM systems.

1. Introduction

Today, the rapid advancement of the internet and digital technologies has led to a significant increase in both the diversity and volume of data [1]. This growth has created major challenges not only in terms of storage, but also in the interpretation and analysis of data. In this context, the concept of Big Data has emerged, characterized by high volume, velocity, and variety [2,3], pushing the limits of traditional data processing methods. In particular, the rise of unstructured data has rendered classical machine learning (ML) and data mining approaches insufficient. Deep learning (DL) models, developed to overcome this bottleneck, have initiated a new era in data analytics through their ability to learn meaningful representations from large-scale data. These advancements, supported by powerful hardware (Graphics Processing Unit (GPU), Tensor Processing Unit (TPU)) and DL libraries (TensorFlow, PyTorch), have enabled scalable solutions.
One of the fastest-growing areas of DL has been natural language processing (NLP). Large language models (LLMs) based on Transformer architecture have revolutionized text comprehension and generation. These models are also being used successfully in technical tasks such as code generation and debugging, increasing the potential for intelligent assistance in software development.
In this context, comparing the performance of different LLMs has gained academic and industrial importance. However, there are few studies in the literature that provide an objective and systematic analysis of LLMs, especially in the context of code generation. In line with this need, the following section presents a summary of research addressing the role of LLMs in software development processes.

1.1. Related Work

LLMs are increasingly applied not only in natural language generation but also in technical domains such as software development and data science. As a result, topics like code generation capabilities, decision-making processes, and the effectiveness of prompting strategies have gained prominence in the literature. This section reviews relevant studies under four key themes to provide the theoretical foundation for the present work: the code generation capabilities of LLMs; the impact of prompting strategies; the role of LLMs in tasks such as algorithm selection and hyperparameter tuning (HPT) within data science workflows; and the benchmarking and evaluation of LLM performance using datasets.
The ability of LLMs to produce structured outputs such as programming code has been extensively discussed in the literature. Du et al. [4] noted that code generation benchmarks for LLMs mostly examine small-scale code; they therefore evaluated LLMs’ competence in generating class-level code suited to real-world problems and found that LLM performance at the class level is lower than at the method level. Li et al. [5] argued that evaluating the code generation performance of LLMs remains an open problem, proposed EvoCodeBench as a new benchmark, and tested popular LLMs on it. Fakhoury et al. [6] developed a method that guides LLM-assisted code generation step by step with test-based feedback from the user and showed that this approach helps users assess the correctness of the generated code. Coignion et al. [7] evaluated the efficiency of code generated by different LLMs against human-authored solutions on LeetCode problems. Tambon et al. [8] investigated the errors observed in code generated by LLMs and identified the points that should be considered for the security of LLM-generated code. Together, these studies analyze how effective LLMs are in code development.
The output of LLM-based systems depends heavily not only on the capability of the LLM but also on the structure of the prompts. Prompt engineering is therefore a critical process for achieving accurate and efficient outputs, and the impact of different prompting strategies, especially for tasks such as code generation, has been studied extensively. Li et al. [9] proposed a prompting technique for code generation inspired by the structured programming practices of human programmers and showed that contextual sampling improves accuracy (Acc) in code generation. Khojah et al. [10] showed how different prompting techniques used in code generation affect the performance of LLMs. Yang et al. [11] proposed a method demonstrating that chain-of-thought (CoT) strategies in lightweight LLMs significantly improve code generation performance even in resource-constrained environments. Chen et al. [12] reviewed both basic and advanced prompt design techniques and emphasized the key role of prompt design in this field. This body of work clearly shows that prompt engineering affects accuracy, reliability, and interpretability.
LLMs are increasingly regarded as systems that not only generate code but also make decisions such as algorithm selection and HPT. In this respect, the role of LLMs in data science tasks, where they are positioned as an alternative to Automated Machine Learning (AutoML) approaches, has received growing attention in the literature. Yao et al. [13] examined how LLM-based AutoML approaches can improve the accessibility of ML solutions; their study shows the potential of LLMs to make ML systems accessible to non-technical experts. Fathollahzadeh et al. [14] proposed a system that enables LLMs to create more effective and efficient ML workflows by generating dataset-specific instructions, demonstrating that LLMs can be integrated into decision-making processes rather than only writing code. Zhao et al. [15] showed that LLMs can perform algorithm design, implementation, and evaluation independently by breaking down complex AutoML tasks into discrete sub-prompts. Mulakala et al. [16] noted that LLMs have difficulty searching over a large hyperparameter space during fine-tuning and proposed a new technique to address this problem. In another study, Zhang et al. [17] investigated the usability of LLMs in HPT processes. All these studies show that LLMs are positioned not only as passive tools but also as active decision-making systems in data science processes.
Benchmark datasets and test environments used to reliably evaluate the performance of artificial intelligence (AI) systems are of great importance. Accordingly, comparative analyses of LLMs in the literature are frequently conducted on algorithm-oriented benchmarks such as HumanEval and LeetCode, as well as on platforms such as Kaggle, which provide both real-world problem data and an evaluation environment. Wang et al. [18] evaluated the performance of LLMs such as Generative Pre-trained Transformer 4 (GPT-4) and GPT-3.5-turbo in solving various programming problems compiled from the LeetCode platform. Coignion et al. [7] evaluated the code generation efficiency and performance of LLMs on LeetCode problems of different difficulty levels by comparing them with human-written solutions; they found that, for the selected problems, LLMs generate code more efficiently than humans in most cases. Another study [19] evaluated the performance of the GPT-3.5, GPT-4, and GPT-4o models on 15 LeetCode problems in Python, Java, and C++ in terms of runtime and memory usage. Döderlein et al. [20] investigated how the performance of LLM-based code assistants such as Copilot, Codex, and StarCoder2 is affected by changes in inputs (prompt format, context, temperature, etc.) on HumanEval and LeetCode problems. Another study [21] assessed the quality of 984 code samples generated by the GPT-3.5-Turbo and GPT-4 models on the HumanEval dataset by comparing them with human-written code. Mathews et al. [22] examined the impact of the Test-Driven Development (TDD) approach on the code generation of LLMs such as GPT-4 and Llama 3. Although many studies evaluate the code generation performance of LLMs on shorter, algorithm-oriented problems such as HumanEval and LeetCode, very few compare the end-to-end code generation capabilities of LLMs on ML tasks hosted on platforms such as Kaggle. Ko and Kang [23] evaluated the code generation capabilities of GPT and Gemini for ML tasks on three different Kaggle datasets; their results show that GPT performs strongly in HPT, but both models lag behind human developers, especially in tasks such as data preprocessing and feature engineering.
While existing research has mostly focused on short, algorithm-oriented tasks, Ko and Kang [23], one of the few studies to perform a similar benchmark on ML tasks in Kaggle, provides an important starting point. However, that study is limited in terms of the variety of LLMs used, task scope, recency, and evaluation metrics. Moreover, whereas that study divided ML tasks into separate phases and generated solutions with hybrid prompt techniques, in the present study end-to-end code generation is performed with a single prompt structure and the holistic performance of LLMs is evaluated directly. This paper aims to overcome these limitations and examine the capability of LLMs to provide end-to-end solutions to ML tasks with a more comprehensive and up-to-date approach. In addition, while AutoML–LLM comparisons also exist in the literature, they have typically been limited in model scope, task diversity, and recency. In contrast, the present study not only evaluates multiple LLM families but also directly benchmarks them against AutoML baselines and Kaggle user-submitted notebooks, thereby offering a more comprehensive and up-to-date comparative study.

1.2. The Contributions of This Study

The main goal of this paper is to contribute to the comparative evaluation of LLMs in solving machine learning tasks, an area that is lacking in the literature. It therefore examines the performance of different LLMs on five different Kaggle tasks in terms of metrics such as Acc, execution time, and peak memory usage. In addition, code generation is performed for each task using two different prompting techniques. Thus, the differences between LLMs in prompt-based response generation are evaluated, and the results are compared with both direct metric analyses and participant results in the related Kaggle competitions, providing a comprehensive picture of LLM performance in a real-world context. This study makes the following original contributions:
  • It presents a systematic and comparative performance analysis of different LLM families (OpenAI o-series, GPT, Gemini, Claude, DeepSeek) on real-world tasks, which is limited in the literature.
  • By integrating HPT into the code generation process, it is one of the first studies to question the effectiveness of LLMs in terms of not only solution generation but also solution calibration.
  • The impact of two different prompt design strategies (few-shot prompting (FSP) and Tree of Thoughts (ToT)) on end-to-end ML workflows, including data preprocessing, algorithm selection, and code generation, is studied extensively, and the end-to-end solution generation capacity of LLMs guided by these strategies is evaluated.
  • The practical applicability of the generated codes was analyzed in a multidimensional way by comparing their performance with both the public scores of Kaggle participants and the solutions written by experts.
  • Performance evaluation is not limited to classical metrics such as accuracy, F1 score and Root Mean Squared Error (RMSE), but also takes into account resource cost aspects such as execution time and peak memory usage, which are critical in real-world applications.
  • Through the analysis of four different task types (tabular, text, and image data, covering both classification and regression), the stronger and weaker aspects of the models are revealed in detail in the context of the task type.
  • It extends the scope of LLM evaluation by including comparisons with established AutoML frameworks, thereby situating LLM-driven code generation within the broader landscape of machine learning.
With the aspects listed above, the study not only contributes to the technical evaluation of the end-to-end code generation capabilities of LLMs but also provides a practical perspective on how LLMs can be integrated more effectively into ML processes.
The remainder of this paper is organized as follows: Section 2 describes the materials and methods, including the selected LLMs, ML tasks, prompting strategies, the AutoML frameworks included in the study and evaluation metrics. Section 3 presents the experimental results, followed by a detailed discussion in Section 4. Section 5 concludes the paper and highlights future research directions.

2. Materials and Methods

In this section of the paper, the setting chosen for the experimental study, the ML tasks, the chosen LLMs, the prompting strategies, the AutoML frameworks used in the comparisons, and the evaluation metrics are explained systematically and justified.
This study examines LLMs as agents that produce end-to-end, optimized solutions in supervised learning (classification/regression) tasks. For the evaluation to be objective, the Kaggle platform, which provides up-to-date performance data, was preferred. The Kaggle platform offers a fair comparison with its standardized datasets and evaluation process, which minimizes human intervention. With task-specific Leaderboards (LB), it provides transparent access to the performances of teams that offer solutions to the same task. Thanks to the notebook infrastructure it offers, it eliminates hardware differences and enables a fair evaluation in terms of metrics such as memory usage and execution time. Five public Kaggle tasks were selected for this purpose. In the experiments, a single, fixed task description was adapted to two different prompt technique frameworks (FSP and ToT). The task description, data/output format and evaluation criteria were identical in both frameworks; the difference was limited to structural elements such as the presentation of examples and step-by-step reasoning prompts.
In all conditions, only the first response (pass@1) was evaluated. All aspects of the solution, including preprocessing, feature engineering, algorithm selection, training, evaluation, and submission formatting, were handled by the models and are therefore part of the decision space. The code outputs were generated in a single pass, based solely on the prompt content, without any external guidance, manual revision, or post-processing. For code generations that could not be executed due to syntax or basic library/module errors, the incorrect script and its error message were resubmitted to the same LLM, and only an executability-level correction was obtained. No modifications were made to the logical structure of the code, the solution strategy, the algorithm architecture, or the hyperparameter settings, ensuring that the evaluation reflects the models’ independent planning and decision-making capabilities. All conditions were run in the same Kaggle working environment, and the cross-validation (CV)/HP configurations chosen by the agent, the execution time, and the peak memory (MB) values were fully recorded.
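For illustration, the executability-level correction step described above can be sketched as follows; this is a minimal sketch in which run_generated_script and query_llm are hypothetical helpers (the latter wrapping the respective model’s chat interface) and not artifacts of the study.

```python
import subprocess
import sys


def run_generated_script(script_path: str):
    """Run an LLM-generated script and capture its traceback, if any."""
    result = subprocess.run([sys.executable, script_path],
                            capture_output=True, text=True)
    return result.returncode, result.stderr


def executability_fix(script_path: str, query_llm) -> None:
    """Resubmit a non-executable script and its error message to the same LLM.

    Only syntax/import-level corrections are requested; the solution strategy,
    algorithms, and hyperparameters chosen by the model are left untouched.
    """
    returncode, stderr = run_generated_script(script_path)
    if returncode == 0:
        return  # the first response (pass@1) ran as-is
    with open(script_path) as f:
        broken_code = f.read()
    prompt = ("The following script fails with the error shown. Fix only syntax or "
              "library/module errors; do not change the solution strategy, algorithms, "
              f"or hyperparameters.\n\n{broken_code}\n\nError:\n{stderr}")
    with open(script_path, "w") as f:
        f.write(query_llm(prompt))
```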
All experiments were executed in the standard Kaggle notebook environment without any manual modifications, ensuring reproducibility and fairness. Within this standardized setup, four of the five tasks (Titanic, Digit Recognizer, House Prices, and Disaster Tweets) were executed with Python 3.11.11, while the Beats per Minute of Songs task ran under Python 3.11.13, as this experiment was conducted later when the Kaggle environment had been updated. For clarity, only the key library versions are reported here, which remained consistent across tasks: scikit-learn 1.2.2, pandas 2.2.3, numpy 1.26.4, and PyTorch 2.6.0+cu124. Additional libraries available in the Kaggle environment could also be utilized if required, since the prompts did not impose any restrictions on the choice of libraries or the Python runtime environment.

2.1. Machine Learning Tasks for Evaluation

Within the scope of the study, five different ML tasks were selected for the evaluation of the LLM models. The tasks were chosen to be widely known problems based on data that has been studied extensively, and each task was required to belong to a different ML domain. To make a more accurate selection and obtain up-to-date results, the selection was made from the indefinite (open-ended) competitions on Kaggle. First, the indefinite competitions on Kaggle were ranked by total number of teams and submissions; the results are detailed in Table 1. Then, the four competitions with the highest number of teams that correspond to different task types were selected. As can be seen from Table 1, one classification, one regression, one image classification, and one NLP task were selected.
“Titanic—ML from Disaster” (Kaggle. Titanic—Machine Learning from Disaster, https://www.kaggle.com/competitions/titanic, accessed on 7 June 2025) is a binary classification task to predict the survivors of the shipwreck of the Titanic, which sank in 1912, using passenger data. The competition was launched by Kaggle in 2010 and is one of the oldest competitions on the platform. The dataset is divided into train and test sets by Kaggle and contains 12 columns: one represents the class label, one the passenger id, and the remaining 10 are raw feature data. There are 1309 passenger records in total, with 891 passengers in the train set and 418 in the test set, so it can be considered a small dataset. For this reason, it has been observed that combining simple models with different data processing techniques gives more accurate results on this dataset than complex models.
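For reference, the input/output format of this competition can be illustrated with a minimal, self-contained sketch; the file paths follow the standard Kaggle competition layout, and the constant prediction is only a placeholder, not one of the LLM-generated solutions.

```python
import pandas as pd

# Standard Kaggle layout for this competition: train.csv includes the
# Survived label, test.csv does not; submissions list PassengerId/Survived.
train = pd.read_csv("/kaggle/input/titanic/train.csv")   # 891 rows, 12 columns
test = pd.read_csv("/kaggle/input/titanic/test.csv")     # 418 rows, 11 columns

# Placeholder baseline: predict "did not survive" for every passenger.
submission = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": 0})
submission.to_csv("submission.csv", index=False)
```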
The second selected task is “House Prices—Advanced Regression Techniques” (Kaggle. House Prices—Advanced Regression Techniques, https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques, accessed on 13 June 2025). This challenge is a regression problem designed by Kaggle in 2016 to predict house prices. It is also an indefinite competition and is still ongoing today. The dataset is divided into train and test by Kaggle and has a total of 79 independent variables. There are 1460 records in train and 1459 records in test, totaling 2919 records in the dataset. In terms of the number of records it contains, the dataset is more suitable for ML techniques rather than DL techniques.
Another selected task is “Digit Recognizer” (Kaggle. Digit Recognizer, https://www.kaggle.com/competitions/digit-recognizer, accessed on 26 June 2025). The competition was launched in 2012 and is also an indefinite competition. Within the scope of the competition, the MNIST handwritten digit dataset is used to predict handwritten digits. It aims to classify the 28 × 28 pixel grayscale handwritten digit images it receives as input into the correct digit label in the range (0–9). In other words, the competition can be defined as a multi-class image classification task. The dataset contains a total of 70,000 records, 42,000 in train and 28,000 in test. It can be said that it is a medium-sized data set in terms of the number of records.
The last of the initially selected tasks is “NLP with Disaster Tweets” (Kaggle. Natural Language Processing with Disaster Tweets, https://www.kaggle.com/competitions/nlp-getting-started, accessed on 15 July 2025). The competition was launched by Kaggle in 2019 as an indefinite competition. Its aim is to classify texts collected on Twitter into binary categories (0: non-disaster, 1: disaster). Since texts are being classified, the task can be defined as an NLP-based classification task. There are a total of 10,876 tweets in the dataset, 7613 in the train and 3263 in the test set. When evaluated in terms of the number of records, it can again be considered a medium-sized dataset.
The additional task included in the comparison is “Predicting the Beats-per-Minute of Songs” (Kaggle. Predicting the Beats-per-Minute of Songs, https://www.kaggle.com/competitions/playground-series-s5e9, accessed on 20 September 2025), which is part of the 2025 Kaggle Playground Series. The goal of the contest is to predict the tempo of a given track in terms of beats per minute (BPM) based on audio features. The task can be defined as a regression problem, since the objective is to estimate a continuous target variable (BPM). The dataset consists of 524,164 training samples and 174,722 test samples, making it larger than the previously examined toy datasets. While the previous competitions were selected partly based on their large number of participating teams, this task was chosen for different reasons: it is recent (1–30 September 2025), less explored in the literature, and not one of the classical benchmark datasets frequently used in tutorials and training corpora. For this reason, it provides a more challenging and up-to-date benchmark for evaluating the performance of LLM-based approaches. By including this recent competition, our study extends beyond the classical benchmark datasets and demonstrates the applicability of LLM-based approaches to more contemporary and less explored problems.
Table 2 summarizes the Kaggle datasets used to evaluate the performance of FSP and ToT techniques across different task types. For each dataset, the size of the training and test sets, the number and type of independent and dependent variables, data types, and the corresponding ML task (e.g., classification, regression, or NLP) are presented.
The selected datasets cover a diverse range of tasks, including binary and multiclass classification, regression, and NLP, allowing for a comprehensive assessment of prompting strategies across varied domains. In addition to widely known benchmark datasets, a recent competition dataset (Predicting the Beats-per-Minute of Songs, September 2025) was also included to reduce the risk of relying solely on classical and frequently studied problems.
Table 3 summarizes the data access records for the five Kaggle tasks used in the study. Date of Access indicates the date range in which the data files were imported into the Kaggle Notebooks and the codes were executed. Dates are reported as ranges because multiple notebooks were created for each task, one for each LLM model × prompt technique × tuning combination. The data were not downloaded locally; they were used only in the Kaggle working environment. The Data Version column is given as “N/A (competition data)” because there is no official version number on the competition pages.

2.2. Selected LLMs for Experimental Evaluation

In this study, a total of eight different versions of four popular language model families (GPT, Gemini, DeepSeek and Claude), both open source and commercial, were selected. This allows for a cross-family comparison in terms of ecosystem, license and design philosophies, as well as a balanced comparison in terms of performance and accessibility thanks to the selected sub-models. To be able to compare both code development-oriented versions and up-to-date, high-performance versions of different language model families, care was taken to select more than one sub-model from each family. The selected LLM models are detailed in Table 4. All models were queried using the default settings provided by their respective platforms (e.g., temperature, top-p, max tokens, seed). No custom hyperparameters were applied, and prompts were formatted according to each model’s expected interface (chat-style, code block, etc.). This decision was made to reflect real-world usage scenarios where users typically interact with LLMs using default configurations.
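As an illustration of this default-settings interaction, a query might be issued as in the following sketch (shown with the OpenAI Python client; the model identifier is illustrative, the task prompt is abbreviated, and the other families are accessed through their own clients):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# No temperature, top-p, max-token, or seed overrides are passed, so the
# platform defaults apply, mirroring typical real-world usage.
response = client.chat.completions.create(
    model="gpt-4.1",  # illustrative identifier
    messages=[{
        "role": "user",
        "content": "Generate a complete, end-to-end Python solution for the "
                   "Kaggle Titanic competition ... (full task prompt omitted)",
    }],
)
generated_code = response.choices[0].message.content
```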
GPT is a family of language models based on the Transformer architecture, with the first version released in 2018 [24]. Since 2018, it has grown significantly in terms of both the number of parameters and the context window, and different versions have been developed. GPT-4 and its successors are particularly notable for their enhanced reasoning capabilities, which enable more complex inference tasks. In this study, OpenAI o3, which emphasizes deep reasoning capabilities, and GPT-4.1, which draws attention with its code development performance, are included. Although both were developed by OpenAI, o3 is officially presented under a separate “o-series reasoning model” family, independent of the GPT series. Both OpenAI o3 and GPT-4.1 were released in 2025 [25,26,27]. The comparison of the two sub-models aims to reveal whether deep reasoning or broad context support is more critical in code-centric projects. In addition, the GPT series has continued to evolve, and the most recent version, GPT-5 Thinking (OpenAI. (2025). Introducing GPT-5 for developers, https://openai.com/tr-TR/index/introducing-gpt-5-for-developers, accessed on 26 September 2025. OpenAI. (2025). GPT-5, https://openai.com/tr-TR/gpt-5, accessed on 26 September 2025.), was also included in our comparative experiments on a contemporary problem (the Beats-per-Minute task), as GPT-5 became publicly accessible during the study. The inclusion of GPT-5 Thinking enables us to assess whether the latest iteration in the GPT series further improves performance under realistic scenarios.
Developed by Google DeepMind, Gemini is a family of LLMs first released in 2023 with a Transformer-based architecture. The Gemini 2.5 version [28], introduced in 2025, has more advanced reasoning capabilities than previous versions and can provide more accurate answers to questions. According to the technical report published in 2025 [28], the Gemini 2.5 family consists of three versions: Pro, Flash, and Flash Lite. The Pro version is optimized for coding and complex tasks, while the Flash version aims to provide high performance for everyday tasks. Flash Lite is presented as the most cost-effective option and was released as a preview in June 2025. Each of these versions provides different advantages in metrics such as quality, cost, and response time [28]. The Gemini 2.5 Pro version was included in the study because it is the most recent version and has been developed with a focus on coding.
DeepSeek is a China-based company founded in 2023 to develop open-source LLMs [29]. Since 2023, many different sub-versions have been released under the DeepSeek name; the two most recent are V3 and R1, both included in the study with versions released or updated in 2025. Both models incorporate reasoning, but they differ in the focus of their development: the R1 model stands out in technical areas due to its deeper reasoning capability, whereas the V3 model offers reasoning across a wider range of areas but cannot go as deep as R1 in a specific domain [30,31]. Both versions were included in order to test the effect of deep reasoning versus broad reasoning on coding and to compare the better of them with non-open-source LLMs.
Claude is a family of LLMs developed by Anthropic [32]. Its development has continued since its introduction, and the most recent versions, Claude Opus 4 and Claude Sonnet 4, were released in 2025 [33,34]. Opus 4, the most advanced version, stands out with its “extended thinking” capability and promises high performance in both coding and complex tasks [33]. Sonnet 4, on the other hand, is optimized for efficiency and cost-effectiveness, offering a balanced solution for everyday use compared to Opus [33,34]. Both versions are included in this study in order to compare the coding performance of the lighter model and the “extended thinking” version in terms of accuracy, F1 score, RMSE, execution time, memory consumption, and related metrics.

2.3. Prompt-Driven Code Generation with LLMs

With the widespread adoption of LLMs, the concepts of prompt and prompt engineering have become increasingly popular [12,35,36,37]. Obtaining the desired output from an LLM now depends directly on the structure and content of the instructions given by the user. While these instructions are called prompts, prompt engineering can be defined as the systematic, optimized, and purposeful preparation of the instructions given to achieve the desired result [38,39]. A properly constructed prompt significantly increases the accuracy and efficiency of the responses received from the LLM and provides higher-quality and more reliable results [12,40].
Nowadays, a large number of prompt engineering techniques have been developed for different types of tasks, each serving one or more different purposes [37,41,42]. In addition, many academic studies have been conducted to evaluate in which scenarios these techniques work more efficiently, including systematic comparisons and effectiveness analyses [41,42]. Thus, the quality and reliability of the output from LLM is continuously being improved.
In this paper, we evaluate the code generation performance of LLMs for different types of ML tasks. In the literature, it has been observed that the prompts given to LLMs for code generation play a decisive role in the accuracy/F1 score/RMSE and efficiency of the generated code [37,41,42]. Therefore, choosing the appropriate prompt engineering technique for code generation is critical to maximizing LLM performance [43]. In this study, LLM performances are evaluated by using two prompt methods that have been proven to be suitable for code generation. These are FSP and ToT methods.
The FSP technique was covered extensively in a paper published by OpenAI in 2020 [44], which tested the performance of the GPT-3 model with a few-shot learning approach. That study showed that LLMs can successfully perform various tasks with prompts containing only a few examples [44] and emphasized that the FSP method significantly improves model accuracy and performance, especially for tasks such as tagging, translation, and text completion. In addition, several studies have shown that the FSP technique also improves the performance of LLMs on tasks such as code synthesis and code generation [41,45,46]. Overall, the FSP technique, which is based on providing several examples in the prompt, can give good results in fast and simple tasks but may be insufficient in complex tasks [41,47,48].
The other prompting method preferred in the study is the ToT technique [49]. Proposed in 2023, this technique allows LLMs to create multi-step, tree-like reasoning chains, especially for complex tasks [49]. The ToT method builds on the chain-of-thought (CoT) [50] approach and takes it a step further: whereas classical CoT aims to solve a task by building a single chain of reasoning, ToT simultaneously explores different solution paths by constructing a branching tree of ideas over multiple alternatives and selecting the optimal path based on the outputs of the intermediate steps [41,49]. The main purpose of choosing this method is to examine how effective LLM models are in method selection and decision making, especially when solving multi-step problems, depending on the prompting technique applied. GPT-4.1 [26,51], Gemini 2.5 Pro [28], Claude Opus 4 and Claude Sonnet 4 [33], DeepSeek-R1 [31], and DeepSeek-V3 [30] were chosen as LLMs with more advanced reasoning capabilities than their older versions [26,28,30,31,32,51]. In this context, we also analyzed the effectiveness of prompting in these reasoning-oriented LLMs. Thus, based on two different prompt approaches, one standard and one reasoning-based, the code generation performance of LLMs on AI-based tasks is compared.
In both the FSP and ToT setups, the LLMs were provided with detailed task prompts and instructed to independently generate complete code solutions. To examine the role of HPT, we evaluated two variants within each prompting paradigm: one prompt explicitly required the LLM to perform HPT, while the other did not. Combined with the two prompting paradigms, this design resulted in four distinct prompt templates per task. For HPT, the LLMs were not provided with predefined search grids or optimization ranges. Instead, they received open-ended prompts that allowed them to determine both the search strategy (e.g., grid search, random search, Bayesian optimization) and the parameter ranges. This setup ensured that the HPT process reflected the models’ own planning and decision-making capabilities, rather than being constrained by externally imposed configurations.
This design choice reflects the objective of evaluating each LLM as a problem-solving assistant, capable of approximating certain aspects of reasoning and decision-making. Especially in the ToT scenario, models were encouraged to plan multiple alternative strategies and choose among them internally. However, in both prompting approaches, the model’s output was accepted as-is, providing a fair and realistic assessment of out-of-the-box performance under independent execution. A structural summary of the four different prompt variants designed is given in Table 5.
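Table 5 gives a structural summary of these variants; purely for illustration, and under assumptions about the exact wording (which is not reproduced here), the four variants could be assembled from a shared task description as in the following sketch, with the FSP frame prepending worked examples, the ToT frame requesting branching exploration of alternatives, and an optional clause enforcing HPT.

```python
FSP_FRAME = (
    "Here are example solutions to similar supervised learning tasks:\n"
    "{worked_examples}\n\n"
    "Now solve the following task end to end (preprocessing, feature engineering, "
    "algorithm selection, training, evaluation, and submission file):\n"
    "{task_description}\n"
)

TOT_FRAME = (
    "Consider several alternative solution strategies for the task below, reason "
    "about their trade-offs step by step, select the most promising one, and "
    "implement it end to end (preprocessing, feature engineering, algorithm "
    "selection, training, evaluation, and submission file):\n"
    "{task_description}\n"
)

HPT_CLAUSE = (
    "Additionally, perform hyperparameter tuning; choose the search strategy "
    "(e.g., grid search, random search, Bayesian optimization) and the parameter "
    "ranges yourself.\n"
)


def build_prompt(frame: str, task_description: str,
                 worked_examples: str = "", hp_tuned: bool = False) -> str:
    """Assemble one of the four prompt variants (FSP/ToT x HPT on/off)."""
    prompt = frame.format(task_description=task_description,
                          worked_examples=worked_examples)
    return prompt + (HPT_CLAUSE if hp_tuned else "")
```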

2.4. AutoML Frameworks: Setup and Parameters

To strengthen the experimental design and provide fair baselines, two widely used and well-established open-source AutoML frameworks were selected: auto-sklearn [52,53] and AutoGluon [54]. Both have been extensively validated in academic studies and real-world ML applications [55,56,57,58,59,60,61], and they represent different design philosophies in automated model selection and hyperparameter optimization. Including these frameworks allows for a more comprehensive comparison between traditional AutoML systems and LLM-driven approaches.
In this study, auto-sklearn (version 0.15.0, Python 3.10.18) [52,53] was employed as one of the AutoML baselines. Built on top of scikit-learn, auto-sklearn combines Bayesian optimization, meta-learning, and ensemble construction to automatically select algorithms and tune hyperparameters [53]. It has been widely adopted in academic and industrial applications for structured ML problems [55,56,57,58]. Within the scope of this study, the framework was configured with a fixed random seed of 42 and a memory limit of 8192 MB. The evaluation metrics were aligned with the task types (accuracy for Titanic and Digit Recognizer, RMSE for House Prices and Beats-per-Minute of Songs, and F1 for Disaster Tweets). A summary of the settings used in this study is provided in Table 6.
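A minimal sketch of this configuration for a classification task is shown below; stand-in data from scikit-learn is used so that the example is self-contained, whereas the study itself used the competition train/test files and task-appropriate metrics.

```python
import autosklearn.classification
from autosklearn.metrics import accuracy
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in tabular data; the study uses the Kaggle competition datasets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Configuration reported above: fixed seed and 8192 MB memory limit,
# with the metric aligned to the task (accuracy here; F1/RMSE elsewhere).
automl = autosklearn.classification.AutoSklearnClassifier(
    seed=42,
    memory_limit=8192,  # MB
    metric=accuracy,
)
automl.fit(X_train, y_train)
print("Validation accuracy:", accuracy_score(y_val, automl.predict(X_val)))
```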
AutoGluon-Tabular (version 1.4.0, Python 3.12.x) [54] was employed as another AutoML baseline. AutoGluon is an open-source AutoML framework that supports a wide range of model families and offers ensembling and stacking strategies [54]. Within the scope of this study, the framework was used with the high_quality preset, a fixed random seed of 42, and a memory usage ratio limit of 0.8. The time limit was set to 7200 s for Titanic, House Prices, and Beats-per-Minute of Songs, and 14,400 s for Disaster Tweets and Digit Recognizer. The evaluation metrics were aligned with the task types (accuracy for Titanic and Digit Recognizer, RMSE for House Prices and Beats-per-Minute of Songs, and F1 score for Disaster Tweets). Key configurations included enabling auto-stacking, setting three bagging folds, and applying hyperparameter tuning with up to 80 trials where applicable. A summary of the settings used in this study is provided in Table 7.
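A comparable sketch of the AutoGluon-Tabular configuration is given below, again with stand-in data; the fixed seed is approximated by seeding NumPy, and the memory usage ratio limit of 0.8 is omitted because its exact configuration key is not reproduced from the study.

```python
import numpy as np
import pandas as pd
from autogluon.tabular import TabularPredictor
from sklearn.datasets import load_breast_cancer

np.random.seed(42)  # approximates the fixed-seed setting reported above

# Stand-in tabular data; the study uses the Kaggle competition datasets.
data = load_breast_cancer(as_frame=True)
train_df = pd.concat([data.data, data.target.rename("label")], axis=1)

predictor = TabularPredictor(label="label", eval_metric="accuracy").fit(
    train_df,
    presets="high_quality",       # preset used in the study
    time_limit=7200,              # seconds; 14,400 s for the text/image tasks
    auto_stack=True,              # enable auto-stacking
    num_bag_folds=3,              # three bagging folds
    hyperparameter_tune_kwargs={  # up to 80 trials where applicable
        "num_trials": 80, "scheduler": "local", "searcher": "random",
    },
)
print(predictor.leaderboard())
```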

2.5. Evaluation Metrics

To ensure transparent evaluation and fair comparison across different LLMs, all results were calculated based on the official submission files generated for the Kaggle competitions. The evaluation considered multiple metrics, including Acc, RMSE, execution time, and peak memory consumption, all measured directly within the Kaggle Notebook environment. For the Titanic, House Prices, and Beats-per-Minute of Songs tasks, the accelerator setting was left at its default value (“None”), whereas for the Disaster Tweets and Digit Recognizer tasks, the GPU was set to “P100”. Apart from these adjustments, no other changes were made to the notebook configurations, ensuring that all experiments were conducted under standardized conditions.
The metrics used to evaluate LLMs in the study are given in Equations (1)–(6) and Figure 1. Three of the five tasks included in the study are classification tasks. The evaluation of these tasks is performed on Kaggle by comparing the submission files against the hidden test set labels, using accuracy as defined in Equation (1). This automatic evaluation ensures objective and consistent scoring across different models and participants. In the formula, the term number of predictions refers to the total number of records in the submission file, while number of correct predictions denotes the subset of those predictions that exactly match the corresponding ground truth labels.
In the evaluation of the Disaster Tweets task, the F1 metric was employed as the primary performance measure. F1 is defined as the harmonic mean of precision (Equation (2)) and recall (Equation (3)) and is presented in Equation (4). Precision denotes the proportion of tweets predicted as “disaster” that are indeed correct, whereas recall indicates the proportion of all actual disaster tweets that were successfully identified by the model. The foundations of these measures are the concepts of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), which are provided in Equations (2) and (3). Specifically, TP represents the correct classification of a disaster tweet, TN denotes the correct rejection of a non-disaster tweet, FP refers to a non-disaster tweet that is incorrectly classified as disaster, and FN corresponds to a disaster tweet that the model fails to detect (Kaggle. Natural Language Processing with Disaster Tweets, https://www.kaggle.com/competitions/nlp-getting-started, accessed on 15 July 2025).
The other tasks included in the study, House Prices and Predicting the Beats-per-Minute of Songs, are regression tasks. The evaluation for these tasks is carried out using the RMSE metric, as defined in Equation (5). In this formula, the term predicted refers to the outputs of the LLM-generated code provided in the submission file, whereas actual denotes the true target values, which are hidden during evaluation but used by the Kaggle platform for scoring. The variable n represents the total number of predictions made, corresponding to the number of records in the submission file.
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Number of Predictions}} \tag{1}$$
$$\text{precision} = \frac{TP}{TP + FP} \tag{2}$$
$$\text{recall} = \frac{TP}{TP + FN} \tag{3}$$
$$F1\ \text{Score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{4}$$
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\text{Predicted}_i - \text{Actual}_i\right)^2} \tag{5}$$
$$\text{Execution Time} = t_{\text{end}} - t_{\text{start}} \tag{6}$$
Execution time is also considered as one of the evaluation metrics used in this study. Regardless of the task type, it is calculated as shown in Equation (6). In the context of this study, execution time refers to the total wall-clock time required to run all code cells in a notebook, as measured by the Kaggle platform. In the equation, $t_{\text{start}}$ denotes the timestamp at which execution of the first code cell begins, and $t_{\text{end}}$ represents the timestamp at which the final cell completes execution. This metric reflects the total elapsed time of the model’s end-to-end processing, including data loading, preprocessing, training, and evaluation steps. On Kaggle, this measurement is reported automatically and provides a standardized way to compare computational efficiency across different submissions.
In addition to accuracy/F1 score/RMSE and execution time, peak memory usage was also monitored during the execution of the notebook on the Kaggle platform. Memory tracking was implemented using Python’s psutil and resource libraries. As shown in the pseudocode (Figure 1), the memory usage was obtained using psutil. Peak memory usage, representing the maximum resident memory utilized during the entire execution, reflects the highest memory load experienced by the process and provides insight into the model’s memory efficiency under real execution conditions.
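The pseudocode of Figure 1 is not reproduced verbatim here; the following minimal sketch shows one way such measurements can be obtained with psutil and the resource module, consistent with the description above.

```python
import resource
import time

import psutil

process = psutil.Process()
t_start = time.time()

# ... the end-to-end pipeline of the generated notebook runs here ...

t_end = time.time()
current_rss_mb = process.memory_info().rss / (1024 ** 2)
# ru_maxrss is reported in kilobytes on Linux (the Kaggle environment).
peak_rss_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print(f"Execution time: {t_end - t_start:.1f} s")
print(f"Current RSS: {current_rss_mb:.2f} MB, peak RSS: {peak_rss_mb:.2f} MB")
```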

3. Experiments and Results

This section presents the results of experiments using the prompt-based methods and LLMs described in the previous sections. Each LLM was evaluated on different types of ML tasks under standardized conditions, and the code generation capabilities of the LLMs are extensively compared. The results were organized according to the applied prompting technique and HP tuning variants, and the performance differences of the models are clearly and concisely highlighted. (The datasets, codes, log files, and both functional and non-functional code examples generated by the LLMs that support the results presented in this section are provided in the Supplementary Materials).
Figure 2 shows the accuracy, F1 score, and RMSE values obtained by running the codes generated by the OpenAI o3, GPT-4.1, Gemini 2.5 Pro, DeepSeek-V3, DeepSeek-R1, Claude Opus 4, and Claude Sonnet 4 LLMs in the Kaggle environment using four different Kaggle tasks and two different prompting techniques. The figure shows (a) Accuracy for the Titanic task, (b) RMSE for the House Prices task, (c) Accuracy for the Digit Recognizer task, and (d) F1 score for the Disaster Tweets task. Blue bars represent FSP, and orange bars represent the ToT approach. The bars labeled ‘HP tuned’ show the results for codes generated with prompts where HPT was enforced; the other bars show the results for codes generated with prompts where this setting was not enforced. All experiments were conducted under equal hardware/time constraints, with the same data splits and evaluation criteria.
Figure 2 summarizes the general trend of the LLM models in terms of the accuracy, F1 score, and RMSE metrics: in the classification tasks (Titanic, Digit Recognizer, Disaster Tweets), accuracy and F1 score values were concentrated in a narrow band across LLMs (the variance/standard deviation between LLMs was low), while in the regression task (House Prices) RMSE differences remained limited. There is no uniform superiority between the two prompt strategies; ToT produced small gains in some combinations, whereas FSP produced equal or better results in most tasks. HPT, in turn, was not consistently positive; in some models it produced significant gains, while in others test performance remained unchanged or decreased slightly. Below, the effects of LLM × prompt strategy × HPT are analyzed in detail on a task-by-task basis.
Table 8 shows the detailed results of LLM × prompt strategy × HPT combinations for the Titanic task. Also detailed hyper-parameter grids are provided in Table A1 and Table A2 of Appendix A.
In Table 8, columns show the corresponding LLM series, and rows show CV accuracy (k-fold cross-validation accuracy on the training data, where k and the folding method, if any, were left to the code generated by the LLM, so each condition is reported with its own CV setting), Test accuracy (accuracy computed on the Kaggle evaluation server from the submission file on the public LB subset of the test set, whose labels are hidden from participants; only public LB results are reported since the selected competitions are indefinite), execution time, memory (peak memory usage), and Algorithm (the final learning algorithm used in the code generated by the LLM). When comparing the FSP and ToT strategies, in the conditions labeled HP-tuned the prompt enforced HPT in the generated code, while in the other conditions this step was not enforced. All experiments were run on the Kaggle platform under equal hardware and time/memory constraints, with the same data splits and evaluation metrics. Since code generation was performed with deterministic settings, the same code was obtained when the same prompt was repeated, and the results are reported as pass@1.
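Because k and the folding scheme were left to each model, the reported CV accuracies are not directly comparable across conditions; as an illustration of a typical LLM-chosen setup, a stratified 5-fold evaluation might look like the following sketch (the classifier and stand-in data are illustrative, not taken from the generated solutions).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data so the sketch is self-contained; the generated codes use the
# Titanic training set after their own preprocessing.
X, y = load_breast_cancer(return_X_y=True)

model = RandomForestClassifier(n_estimators=200, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.5f} ± {scores.std():.5f}")
```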
Figure 3 visualizes the test accuracy obtained for the Titanic task as heatmaps. On the left (Figure 3a) is the accuracy map for the non-tuned models, and on the right (Figure 3b) is the accuracy map for the HP-tuned models. The rows represent the LLMs, and the columns represent the prompting techniques used. Without HPT, the FSP technique was clearly superior; in this case, the highest value of 0.77990 was obtained for the GPT-4.1 and Gemini 2.5 Pro models. After HPT, the FSP technique still produced codes with higher accuracy than the ToT technique in most cases. Taken as a whole, the code with the highest accuracy (0.78229) was obtained with the DeepSeek-V3 model, the FSP technique, and HPT. In the Titanic task, the FSP technique was generally better, and the ToT technique became competitive only with tuning in some models such as GPT-4.1 and o3. For this task, HPT acted as a lever that could change the ranking, making the ToT technique competitive but not universally beneficial.
Figure 4a shows the peak memory utilization for the Titanic task, measured during the execution of codes generated by FSP and ToT techniques on different LLMs. The Y-axis shows peak memory and the X-axis shows the LLM models and the non-tuned/HP-tuned variants of each. Peak memory consumption was within a narrow range for all model-strategy-tuning combinations and there was no uniform dominance. The differences varied depending on the LLM model × prompting technique × tuning pairing. We observed that HPT did not result in a consistent increase in task-specific memory utilization (it increased for some models while remaining the same or decreasing for others). The lowest peak memory value for this task was 272.04 MB for the Gemini 2.5 Pro model in the non-tuned condition and using the ToT technique. Under the HP-tuned condition, the lowest value was 273.26 MB on the OpenAI o3 model, again with ToT technique.
Figure 4b shows the execution time of the codes generated by the FSP and ToT techniques for the Titanic task when run across the different LLMs. The Y-axis shows the time (seconds) on a logarithmic scale, and the X-axis shows the models (non-tuned/HP-tuned). The figure shows that HP-tuning increased execution time in most models; the magnitude of the increase varied from LLM to LLM. For example, the increase was more pronounced in the Claude 4 family than in the other models. In terms of the FSP and ToT techniques, there was no uniform speed advantage for this task: in some models, ToT was faster, while in others it produced results close to or slower than FSP. When all LLM and tuning variants were considered together, however, the ToT technique was faster on average (ToT: mean 61.9 s, median 36 s; FSP: mean 83.6 s, median 59 s), although model- and tuning-dependent variability remained; therefore, task- and model-based results are reported separately. The shortest execution time for the Titanic task was 23 s, obtained on the Gemini 2.5 Pro model in the non-tuned condition with codes generated by the ToT technique. The fastest result among the HP-tuned conditions was 41 s, measured on the DeepSeek-V3 model with ToT.
Table 9 shows the detailed results of the LLM × prompt strategy × HPT combinations for the House Prices task. Also detailed hyper-parameter grids are provided in Table A3 and Table A4 of Appendix A. Columns show the corresponding LLM series, and rows show the RMSE, execution time, memory (peak memory usage), and algorithm (the final learning algorithm used in the code generated by the LLM). RMSE was computed on the Kaggle evaluation server from the predictions in the submission file on the public LB subset of the test set, whose labels are kept secret from the participants. Only public LB results are reported since the selected competitions are indefinite. When comparing the FSP and ToT strategies, in the conditions labeled HP-tuned the prompt enforced HPT in the generated code, while in the other conditions this step was not enforced. All experiments were run on the Kaggle platform under equal hardware and time/memory constraints, with the same data splits and evaluation metrics. Since code generation was performed with deterministic settings, the same code was obtained when the same prompt was repeated, and the results are reported as pass@1.
Figure 5 shows the heatmaps of the test RMSE values obtained for the House Prices task. On the left (Figure 5a) the results in the non-tuned case and on the right (Figure 5b) the results in the HP-Tuned case are shown. Rows represent LLMs and columns represent the prompt techniques used (FSP, ToT). It could be seen that in the non-tuned case, the FSP technique produced a significantly lower RMSE in most models. In this case, the lowest error value of 0.12007 was obtained with the FSP technique and the Gemini 2.5 Pro model. After the HPT process, the results obtained with the ToT technique became more competitive, as in the Titanic task. The most remarkable improvement was seen in the GPT-4.1 model with an error value of 0.12141. In the House Prices task, the FSP technique provided a reliable and strong start even without tuning. The ToT technique, on the other hand, was weak without tuning. With the right HPT process, it could become competitive. Considering the figures, the best result was 0.12007 with the Gemini 2.5 Pro model, FSP technique and no-tuning.
Figure 6a shows the peak memory utilization measured during the execution of the codes generated by the FSP and ToT techniques on different LLMs for the House Prices task. The Y-axis shows Peak Memory (MB) and the X-axis shows the LLM models and the non-tuned/HP-tuned variants of each. Peak memory consumption was in a narrow middle band for most model-technique-tuning combinations, with only one clear outlier in the graph: the GPT-4.1 model had by far the highest consumption, 842.48 MB, when the FSP (HP-tuned) technique was used. Apart from this, the ToT variants used less memory than FSP in many models. When the effect of HP-tuning on memory consumption was analyzed, no consistent upward or downward trend was observed. The lowest values were in the ~250–350 MB band with the o3 model on the ToT side.
Figure 6b shows the execution time of the codes generated by FSP and ToT techniques for the same task. The Y-axis shows time (seconds) on a logarithmic scale, and the X-axis shows the non-tuned/HP-tuned variants of the models. The figure shows that HP-tuning increased the time in most models. The rate of increase varied from LLM to LLM. The most significant increase was seen for the DeepSeek-R1 model in combination with the ToT technique (HP-tuned). There was no uniform superiority in terms of speed, but the ToT (non-tuned) line was often the fastest (e.g., execution time on GPT-4.1 and Gemini 2.5 Pro was on the order of seconds). The FSP (HP-tuned) line was the slowest in most models.
Figure 7 shows the heatmaps of the Test accuracy values obtained for the Digit Recognizer task. The left panel (Figure 7a) shows the results in the non-tuned case, while the right panel (Figure 7b) shows the HP-tuned case. The rows represent the LLMs, and the columns represent the prompt techniques used. According to the figure, in the non-tuned case the ToT technique outperformed the FSP technique in most models. The best result in this case was 0.99621 with the Gemini 2.5 Pro model and the ToT technique. In the HP-tuned case, the ToT technique maintained its superiority over FSP in most models; the best result here was 0.99639 with the Gemini 2.5 Pro model and the ToT (HP-tuned) technique.
When the figure was evaluated across all conditions, the highest overall accuracy was obtained with the Gemini 2.5 Pro model, the ToT technique, and HP-tuning. According to the figure, ToT gave the highest accuracy for most models in the Digit Recognizer task, whereas the FSP technique appeared to be the more reliable choice within the DeepSeek family. In this task, HP-tuning combined with ToT was generally competitive or superior, but, depending on the LLM, the ToT technique could also degrade performance.
Table 10 shows the detailed results of the LLM × prompt strategy × HPT combinations for the Digit Recognizer task. Also detailed hyper-parameter grids are provided in Table A5 and Table A6 of Appendix A. The descriptions of the table rows and columns, execution environment, code generation conditions, hyperparameter tuning rules, and evaluation settings in Table 10 are identical to those described for Table 8.
Figure 8a shows the measured peak memory usage for the Digit Recognizer task when running codes generated by FSP and ToT techniques on different LLMs. The Y-axis shows Peak Memory (MB) and the X-axis shows the LLM models and the non-tuned/HP-tuned variants of each. The overall pattern for this task showed that ToT variants used less memory compared to FSP, especially in the non-HPT case. The FSP technique had the highest consumption in many models even in the no-tuning case. The lowest memory values were seen with the ToT technique in the range of about 2.0–2.4 GB in models like o3/GPT-4.1. The effect of HP-tuning on memory was not uniform. In the FSP technique, tuning resulted in a decrease in most models, while in the ToT technique, small increases or decreases could be seen from LLM to LLM.
Figure 8b shows the execution time of the code generated by FSP and ToT techniques for the same task. The Y-axis shows time (seconds) on a logarithmic scale, while the X-axis shows the non-tuned/HP-tuned variants of the models. According to the graph, it was seen that the effect of HP-tuning on time was model-dependent. The fastest results were in the ~1–3 min band and were seen in the GPT-4.1, Sonnet 4 (FSP) and DeepSeek-V3 (ToT) combinations. In general, there was no uniform speed advantage: ToT was the fastest in some models, while FSP was more advantageous in others.
Table 11 shows the detailed results of the LLM × prompt strategy × HPT combinations for the NLP with Disaster Tweets task; the detailed hyperparameter grids are provided in Table A7 and Table A8 of Appendix A. The descriptions of table rows and columns, execution environment, code generation conditions, hyperparameter tuning rules, and evaluation settings in Table 11 are identical to those described for Table 8. Table 11 differs from Table 8 only in the evaluation metric: F1 score is reported instead of accuracy.
Figure 9 shows the heatmaps of the F1 score values obtained for the Disaster Tweets task. The left (Figure 9a) shows the results in the non-tuned case and the right (Figure 9b) in the HP-tuned case. The rows represent the LLMs, and the columns represent the prompt techniques used.
In the non-tuned case, the FSP technique produced noticeably higher F1 scores than the ToT technique in almost all models. Under HP-tuning, the ToT technique became competitive with, or better than, FSP in some models. Among the non-tuned models, the highest F1 score of 0.83695 was obtained with the Gemini 2.5 Pro model and the FSP technique. With HPT, the best result was an F1 score of 0.81428, obtained from the OpenAI o3 model with the ToT technique. Evaluated overall, the best result was obtained with the Gemini 2.5 Pro model and the FSP technique. Another observation was that applying HPT substantially improved the F1 scores of the ToT-generated solutions.
Figure 10a shows the measured peak memory utilization for the Disaster Tweets task when running the code generated by the FSP and ToT techniques on different LLMs. The Y-axis shows peak memory (MB) and the X-axis shows the LLM models and the non-tuned/HP-tuned variants of each. The figure shows that, in general, the FSP technique consumed less memory than the ToT technique in most models. Four outliers stood out in the graph: Gemini 2.5 Pro with FSP (non-tuned), Claude Opus 4 with ToT (non-tuned), and the HP-tuned versions of Claude Sonnet 4 for both FSP and ToT. Memory consumption in these configurations varied from about 1.5 GB to 5 GB. The lowest memory consumption, 269.09 MB, was observed for DeepSeek-V3 with the FSP (non-tuned) technique. Again, HPT did not have a uniform effect on memory consumption.
Figure 10b shows the execution time of the code generated by the FSP and ToT techniques for the same task. The Y-axis is time (seconds) on a logarithmic scale, and the X-axis shows the non-tuned/HP-tuned variants of the models. According to the figure, HPT tended to increase the time for most models, but the magnitude of the increase varied by model. The fastest combinations were DeepSeek-R1 with FSP (non-tuned) and ToT (non-tuned), GPT-4.1 with FSP (non-tuned), and Gemini 2.5 Pro with FSP (HP-tuned).
In addition to the previously reported tasks, we extended the experimental study with the newly introduced “Predicting the Beats-per-Minute of Songs” competition. For this task, we focused on the most competitive and relevant models: Gemini 2.5 and DeepSeek-V3, whose effectiveness had already been demonstrated in earlier experiments, and GPT-5, which represents the most recent version available during the study. This selection allows for a fair and up-to-date evaluation while avoiding unnecessary redundancy.
Table 12 reports the detailed results of the Predicting the Beats-per-Minute of Songs task, considering the combinations of LLM × prompting strategy × hyperparameter tuning; the detailed hyperparameter grids are provided in Table A9 and Table A10 of Appendix A. Columns correspond to the LLM series, while rows provide RMSE, execution time, memory (peak usage), and the final learning algorithm employed by the generated code. RMSE values were obtained from the Kaggle evaluation server based on the submission files, ensuring evaluation consistency through hidden test labels. Figure 11 presents the heatmaps of the RMSE results for this task. Figure 11a shows the non-tuned case, while Figure 11b shows the HP-tuned case. Rows correspond to LLMs and columns to prompting strategies (FSP, ToT).
In the non-tuned setting, the lowest RMSE was obtained by GPT-5-Thinking with the ToT strategy (26.38734). Gemini 2.5 Pro with ToT was a close second (26.38879), while the best FSP result among non-tuned models was Gemini 2.5 Pro (26.39081). DeepSeek-V3’s ToT variant underperformed (26.42273), whereas its FSP result was 26.39554. With hyperparameter tuning, DeepSeek-V3 (FSP) achieved the overall best RMSE (26.38663). GPT-5-Thinking (ToT) remained competitive (26.38760), and DeepSeek-V3 (ToT) improved substantially to 26.38801.
Figure 12a shows the measured peak memory utilization for the Predicting the Beats-per-Minute of Songs task when executing the codes generated by FSP and ToT techniques on different LLMs. The Y-axis indicates peak memory (MB), and the X-axis shows the LLM models with their non-tuned and HP-tuned variants. Memory consumption patterns varied considerably across models and prompting strategies. GPT-5 Thinking with ToT (HP-tuned) consumed the most memory (~2 GB), while its ToT (non-tuned) variant was among the most memory-efficient (~461 MB). Gemini 2.5 Pro with FSP (HP-tuned) showed relatively low memory usage (~607 MB). For DeepSeek-V3, however, the FSP (HP-tuned) variant consumed ~697 MB, which was higher than its ToT (non-tuned) variant (~480 MB). Overall, neither FSP nor ToT was consistently more memory-efficient; the effect depended strongly on the LLM and whether hyperparameter tuning was applied.
Figure 12b presents execution times for the Predicting the Beats-per-Minute of Songs task. GPT-5-Thinking was generally the slowest, exceeding 3 h in the non-tuned runs, though its FSP (HP-tuned) variant reduced this to ~1.3 h. Gemini 2.5 Pro with ToT (non-tuned) achieved the fastest runtime at only 47 s. DeepSeek-V3 was also efficient, ranging from ~9 to 46 min depending on tuning. Overall, runtime efficiency was strongly model- and strategy-dependent, with Gemini and DeepSeek outperforming GPT-5 in this task.
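The peak memory and execution time values reported throughout this section can, in principle, be captured with standard Python instrumentation. The minimal sketch below illustrates one such approach; the profile_run and dummy_pipeline names are illustrative rather than taken from the released scripts, and tracemalloc only tracks Python-level allocations, so the exact instrumentation used in the study may differ.
import time
import tracemalloc

def profile_run(entry_point):
    """Measure wall-clock time and peak Python heap usage of a single pipeline run."""
    tracemalloc.start()
    start = time.perf_counter()
    entry_point()  # e.g., the generated train/predict routine
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak_bytes / (1024 ** 2)  # seconds, peak memory in MB

if __name__ == "__main__":
    def dummy_pipeline():
        data = [i ** 2 for i in range(1_000_000)]  # stand-in for model training
        return sum(data)

    seconds, peak_mb = profile_run(dummy_pipeline)
    print(f"Execution time: {seconds:.2f} s, peak memory: {peak_mb:.2f} MB")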

4. Discussion

This section interprets the findings in terms of model, prompting strategy, and tuning interactions. Task-specific performance patterns are discussed, along with comparisons to baseline or reference solutions, consistency with prior work, and practical takeaways.

4.1. Prompting Effectiveness by Task Type

Table 13 summarizes the relationship between the types of tasks examined in the study and the prompt techniques. In the table, Search/Design Space refers to the set of all options that the LLM can decide on when solving a task. This includes choices such as data cleaning and transformation steps, feature engineering, algorithm selection (e.g., Logistic Regression (LR), Random Forest (RF), or a Convolutional Neural Network (CNN)), hyperparameters, and training/validation strategies (cross-validation (CV), early stopping, ensembling). The Search/Design Space column is rated by task as Narrow, Medium, Large, or Very Large. This rating is based on the solution practices of the respective tasks in Kaggle competitions; in real-world applications, the width of the search space may vary due to data access, business requirements, or constraints.
According to our findings, FSP stands out as the “reliable default” in most scenarios. Especially in the tabular tasks, it gave consistent results even without HPT, and the best absolute score in the House Prices task was obtained with FSP (non-tuned). In Titanic, the best result was obtained with FSP (HP-tuned). For the Beats-per-Minute of Songs regression task, which represents a more recent and large-scale dataset, the best performance was again achieved with FSP (HP-tuned). As the search space expands, the ToT technique can provide higher ceiling values with the right HPT; for example, in the Digit Recognizer task the best performance was obtained with the ToT (HP-tuned) combination. For a short and noisy text classification task such as Disaster Tweets, the best result was obtained with FSP (non-tuned); however, ToT (HP-tuned) configurations were also competitive in some LLMs.
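To make the distinction between the two strategies concrete, the simplified prompt skeletons below illustrate how an FSP request supplies worked examples, while a ToT request asks the model to branch, evaluate, and select before emitting code. The wording is an illustrative sketch, not the exact prompts used in the experiments.
# Simplified prompt skeletons for the two strategies (illustrative wording only).

FSP_PROMPT = """You are an ML engineer. Write a complete Python script for the Kaggle
'{task}' competition that loads train.csv/test.csv, preprocesses the data, trains a
suitable model, and writes submission.csv.

Example 1 (similar task -> expected solution outline): {example_1}
Example 2 (similar task -> expected solution outline): {example_2}"""

TOT_PROMPT = """You are an ML engineer solving the Kaggle '{task}' competition.
Step 1: Propose three candidate solution paths (preprocessing, algorithm, validation).
Step 2: Evaluate the strengths and weaknesses of each path.
Step 3: Select the most promising path and justify the choice.
Step 4: Output the complete Python script for the selected path, ending with a
submission.csv file."""

print(FSP_PROMPT.format(task="Titanic", example_1="...", example_2="..."))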

4.2. A Practical Comparison of LLM-Generated and Human-Developed Solutions

Table 14 compares the solutions produced by LLMs in five different Kaggle competitions with the notebooks created and shared on the platform by the users participating in those competitions. The scores in the “Kaggle Notebooks” column were determined as follows: if the competition had a pinned notebook that explicitly reported its result, the score from this notebook was taken directly. Pinned notebooks are typically highlighted by the contest team or Kaggle editors, indicating editorial endorsement. If no such notebook was available, notebooks were sorted by the number of votes, and the result from the highest-voted notebook reporting a score was used. While this approach is not a perfect benchmark, it offers a reasonable reference point for comparison. For the Beats-per-Minute of Songs task, which is an ongoing competition, the Kaggle public leaderboard (LB) was taken as the reference, and the best result as of 20 September 2025 was used. Because this submission was not accompanied by an open notebook, only its RMSE value could be accessed, and the specific method used to obtain it was not available.
As four of the selected competitions are indefinite and one is still ongoing, only the public LB is available. Public LB results run the risk of adaptive overfitting because of the single public test slice and unlimited submission attempts. Moreover, heavy ensembles that cannot be reproduced in practice and excessive seed or hyperparameter searches can make the scores unrealistic and difficult to reproduce. In addition, unnoticed data leaks or incorrect validation practices may occur, and measurement noise and small, statistically insignificant differences may also affect the ranking. For these reasons, the public LB alone is not a reliable baseline in indefinite competitions. The notebooks selected in this study aim to provide a more reproducible and representative comparison. Nevertheless, since there is no private LB, one should be cautious about the generalizability of the comparisons made.
According to the results in Table 14, if the LLM-generated solutions are compared with the Kaggle user-submitted notebooks on a task-specific basis, the following observations can be made: In the Titanic task, the Kaggle notebook achieved a higher score with 80.1% accuracy than the LLM (78.2%) and ranked higher on the leaderboard; the LLM solution, however, produced results in a much shorter time. In the House Prices competition, both the LLM (0.12007) and the Kaggle notebook (0.12096) had very similar error rates; the LLM ranked slightly higher in percentage terms, while the Kaggle notebook was faster. In the Digit Recognizer task, the Kaggle notebook performed slightly behind the LLM, with the LLM also producing results faster; in this case, the leaderboard percentile rank of the LLM (10.4%) was better than that of the Kaggle notebook (32.7%). In the Disaster Tweets task, the LLM and Kaggle notebook produced almost identical F1 scores and leaderboard rankings, with the Kaggle notebook being faster. Finally, for the Beats-per-Minute of Songs regression task, which represents a more recent and large-scale dataset, the Kaggle notebook achieved a slightly lower RMSE (26.38020) compared to the best LLM configuration (26.38663 with DeepSeek-V3 FSP, HP-tuned). However, the LLM solution delivered results in a short time (under 10 min), highlighting its efficiency.
The performance differences across models also provide further insight. Gemini 2.5 Pro often excelled in structured classification tasks such as House Price, Digit Recognizer, and Disaster Tweets, which may be attributed to its extended ~1M-token context window (see Table 4) and strong handling of tabular patterns. In contrast, DeepSeek-V3 achieved the best results in Titanic and Beats-per-Minute of Songs, suggesting that its efficiency with regression-style optimization and coding-oriented reasoning (e.g., feature engineering, model selection) allowed it to succeed even with a shorter 128 K context window (see Table 4).
The results of the five Kaggle tasks examined in this study indicate that LLM-based solutions, generated with a single prompt and minimal manual intervention (without iterative refinement), can perform at levels comparable to, and in some cases slightly better than, Kaggle user-submitted notebooks that were manually optimized. Performance in terms of speed varies by task: LLMs were faster in some tasks, while human-generated solutions were faster in others.

4.3. Comparison with Prior Literature

Table 15 presents the comparative results obtained from different LLMs on two classification tasks: Titanic survival prediction and MNIST digit recognition. The comparison includes results from three distinct sources: previously reported results from Ref. [23], standard Kaggle notebook implementations, and our own experimental analyses. The models evaluated include different versions of GPT, Gemini, and DeepSeek, tested both with and without fine-tuning. Several prompting strategies were applied, including FSP, CoT, ToT, and Specifying Desired Response Format. The selection criteria for the Kaggle notebooks included in this table are explained in detail above.
The results were selected to enable a direct comparison with the best-performing LLM model and configuration reported in Ref. [23]. A notable methodological difference between the two studies is that, in our experiments, the entire process was executed under a single prompting strategy per experiment, allowing end-to-end evaluation. In contrast, Ref. [23] varied prompting strategies across sub-tasks and restricted ML algorithm selection to a fixed pool of three models, without delegating the algorithm choice to the LLM.
In our experiments, the highest accuracy for the Titanic task was achieved by DeepSeek, reaching 0.78229. This result was closely aligned with the best LLM-based result reported in Ref. [23], where GPT-3.5, combined with CoT and FSP techniques, achieved an accuracy of 0.7918. Although more recent and advanced models such as GPT-4.1 and Gemini 2.5 Pro were used in our study, they did not outperform the GPT-3.5 configuration from Ref. [23], suggesting that effective prompting strategies may have a greater impact than the LLM model version alone, particularly in low-data structured tasks.
It was also noteworthy that the Kaggle notebook achieved the highest overall accuracy at 0.80143. This indicates that in structured datasets such as Titanic, classical models remain highly competitive and can outperform even the most advanced LLMs when appropriate feature engineering and HPT are applied. Overall, the results highlight the importance of not only selecting a capable LLM but also applying the right combination of preprocessing, prompting, and tuning strategies to extract optimal performance, especially in structured prediction tasks.
The results on the Digit Recognizer task demonstrate that LLMs can achieve highly competitive performance even on vision-based classification tasks. In our experiments, the highest accuracy was achieved by the Gemini 2.5 Pro (HP-tuned) model, reaching 99.639%, slightly surpassing the best result reported in Ref. [23] (GPT-3.5: 99.60%). The human-level accuracy reported in Ref. [23] was 98.34%. In our experiments, most LLM configurations achieved comparable or slightly higher results. For instance, GPT-4.1 models, particularly when fine-tuned and combined with classification-oriented prompting, achieved accuracy levels between 99.24% and 99.31%, while Gemini 2.5 Pro models ranged between 98.67% and 99.64%, depending on tuning and prompt design. The baseline Kaggle notebook achieved 99.028% accuracy, which is strong, though several LLM setups produced higher scores. These findings suggest that, with appropriate prompting strategies, LLMs can deliver performance that is competitive with, and in some cases exceeds, established vision models on structured image classification tasks. However, these results are specific to the evaluated benchmarks and should not be generalized across all tasks.
Overall, the results suggest that the success of LLMs in this task is not solely due to model scale, but also to the design of the prompt and the inclusion of targeted tuning strategies. These findings extend the applicability of LLMs beyond traditional NLP settings, highlighting their potential in domains that were previously dominated by task-specific architectures like CNNs. Table 16 illustrates key methodological differences between our study and Ref. [23] over best models.
According to Table 16, in the Titanic task, while both studies used the Random Forest Classifier (RFC), the way it was chosen differs significantly. Ref. [23] selected the algorithm manually from a predefined pool of three models, whereas in our study, the ML algorithm choice was left to the LLM itself. Based on the given prompt, the LLM independently selected RFC, demonstrating its capacity not only to perform classification but also to make informed methodological decisions.
In the Digit Recognizer task, both studies employed a CNN architecture; however, our design included important enhancements such as dropout regularization and a larger dense layer (256 units). Additionally, dropout was treated as a tunable hyperparameter in our setup, unlike in Ref. [23], where it was not used. These architectural and training choices likely improved the generalization ability of our models.
Taken together, these differences reflect a more independent and regularized approach in our experiments, where algorithm selection, tuning, and architecture were optimized in a prompt-driven workflow. This design not only yielded competitive results but also highlighted the potential of LLMs to go beyond prediction and contribute meaningfully to algorithm configuration in applied ML settings.
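As a concrete illustration of the configuration described above, the following Keras sketch builds a small CNN with a 256-unit dense layer and exposes dropout as a tunable hyperparameter. The convolutional layer sizes and optimizer are assumptions for illustration and do not reproduce the exact generated architectures.
import tensorflow as tf

def build_cnn(dropout_rate: float = 0.3) -> tf.keras.Model:
    """Small MNIST-style CNN; dropout_rate is exposed as a tunable hyperparameter."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),  # larger dense layer (256 units)
        tf.keras.layers.Dropout(dropout_rate),          # dropout treated as tunable
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example: sweep the dropout rate as in the HP-tuned configurations of Table A5/A6.
for rate in (0.2, 0.3, 0.4):
    model = build_cnn(dropout_rate=rate)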

4.4. LLM Pipelines vs. Classical AutoML

To strengthen the evaluation and provide fair baselines, we additionally considered two established AutoML frameworks: auto-sklearn and AutoGluon. Their inclusion allows us to assess whether LLM-based code generation can provide advantages beyond conventional AutoML solutions.
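Both frameworks expose high-level tabular interfaces, so a baseline run requires only a few lines of code. The sketch below is a minimal example of how such baselines can be obtained; the label column, time budget, and preset are illustrative and do not correspond to the exact settings behind Table 17.
import pandas as pd
from autogluon.tabular import TabularPredictor

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# AutoGluon: fit a stacked/ensembled model within a fixed time budget.
predictor = TabularPredictor(label="Survived", eval_metric="accuracy").fit(
    train_df, time_limit=3600, presets="best_quality"
)
predictions = predictor.predict(test_df)

# auto-sklearn follows the scikit-learn estimator interface (classification shown).
# import autosklearn.classification
# automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=3600)
# automl.fit(X_train, y_train)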
Across five benchmark Kaggle tasks, the comparison highlights how traditional AutoML frameworks and LLMs differ in both predictive performance and computational efficiency. The results indicate that while auto-sklearn and AutoGluon generally provide competitive baselines, LLM-driven approaches such as Gemini 2.5 Pro and DeepSeek-V3 demonstrate the ability to generate pipelines that achieve comparable or superior accuracy, F1 score and RMSE values in several tasks. Notably, LLMs often reduced execution time substantially, especially when leveraging GPU resources, whereas AutoML frameworks incurred longer runtimes due to their iterative search and ensembling strategies (see Table 17). Nevertheless, AutoML systems remain strong baselines, particularly in structured regression tasks such as House Price and Beats per Minute, where their ensembles yielded stable performance.
These findings suggest that LLM-based code generation can complement rather than fully replace AutoML frameworks, offering rapid prototyping advantages, while AutoML continues to provide systematically optimized solutions. While AutoML frameworks systematically search for the best-performing combination within a predefined algorithmic space, LLMs can extend beyond this space by proposing novel or unconventional solution strategies. Thus, the two paradigms should be viewed as mutually reinforcing: LLMs provide creativity and flexibility in generating pipelines, whereas AutoML ensures stability and rigor through structured optimization.

4.5. General Insights and Practical Recommendations

In light of the findings, some general suggestions on which prompting strategy is more appropriate for different task types and search spaces are presented below. The common picture emerging from the five tasks examined in the study is as follows (the findings relate to end-to-end pipeline generation by LLMs on Kaggle classification and regression tasks with different search spaces in a statistical learning context; they should not be directly generalized to other settings):
  • There is no universally superior prompting technique in terms of accuracy, F1 score, or error metrics. FSP offers a strong and stable start without tuning, while ToT is often significantly strengthened by HP-tuning and can achieve the best results in some tasks.
  • HPT usually increases execution time but creates a leverage effect that can change the ranking.
  • There is no uniform trend in memory consumption; differences depend on the LLM model × prompting × tuning interaction and are small in most tasks.
  • Among the models, Gemini 2.5 Pro and DeepSeek-V3 often stand out with high accuracy/F1 score and low RMSE.
  • Practical advice: for a fast and stable start, FSP (non-tuned) is particularly suitable for tasks with narrow to medium search spaces. For tasks with a wide search space, ToT (HP-tuned) can provide better peak performance if a sufficient tuning budget is available.
  • Comparisons with AutoML frameworks indicate that LLM-based solutions can match or even surpass systematically optimized baselines in several tasks, while AutoML remains valuable for stability, particularly in structured regression problems.
  • Results from the most recent and relatively underexplored task (Beats-per-Minute of Songs) demonstrate that LLMs can deliver competitive results even on less-studied datasets, with notable efficiency gains, though their performance remains sensitive to prompting and tuning choices.
An important practical dimension of this study concerns the time-to-solution tradeoff between LLM processes and traditional human-driven development. In human workflows, producing competitive ML solutions typically involves hours or even days of iterative coding, debugging, and hyperparameter tuning. In contrast, the LLMs examined here generated complete pipelines in a single pass based solely on the prompt. However, it should be noted that not all outputs were immediately executable: in cases of syntax or basic library errors, the erroneous script and its error message were re-submitted to the same LLM for correction, which added minimal overhead compared to human debugging. Taken together, these results indicate that LLMs can substantially accelerate prototyping and experimentation, while AutoML frameworks and human expertise remain valuable for systematic optimization and domain-specific adaptation.
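The correction step described above can be expressed as a small repair loop. The sketch below assumes a generic llm_generate() helper standing in for whichever chat interface was used; the helper and file names are hypothetical.
import subprocess

def llm_generate(prompt: str) -> str:
    """Hypothetical helper that returns a Python script from the chosen LLM."""
    raise NotImplementedError  # stands in for the actual chat/API call

def run_with_repair(prompt: str, max_repairs: int = 2) -> str:
    """Run the generated script; on failure, feed the traceback back to the same LLM."""
    script = llm_generate(prompt)
    for attempt in range(max_repairs + 1):
        with open("pipeline.py", "w") as f:
            f.write(script)
        result = subprocess.run(["python", "pipeline.py"], capture_output=True, text=True)
        if result.returncode == 0:
            return script  # executable solution obtained
        if attempt == max_repairs:
            break
        # Re-submit the erroneous script together with its error message for correction.
        script = llm_generate(
            "The following script failed.\n\nScript:\n" + script
            + "\n\nError:\n" + result.stderr
            + "\n\nReturn a corrected, complete script."
        )
    raise RuntimeError("Script could not be repaired within the allowed attempts.")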
Finally, when interpreting these results, it is important to consider their scope and generalizability beyond structured competition settings. While the results demonstrate the potential of LLM-based code generation on structured benchmark tasks, their generalizability to complex real-world business scenarios should be considered with caution. Kaggle competitions typically involve well-curated datasets, clearly defined objectives, and standardized evaluation metrics. In contrast, practical applications in industry often require handling noisy or incomplete data, domain-specific feature engineering, integration with existing workflows, and adherence to business constraints. Therefore, the present findings should primarily be interpreted as evidence of feasibility and relative performance under controlled benchmark conditions. Further empirical studies are needed to validate the robustness and applicability of LLM-based approaches across diverse, real-world contexts.

5. Conclusions and Future Work

This study evaluated the performance of various LLMs on five different Kaggle-based structured ML tasks and systematically examined the effects of different prompting strategies (FSP and ToT) and HPT. The findings show that, when guided by well-designed prompts, LLMs can generate high-performing ML solutions that are comparable to, and in some cases slightly surpass, human-crafted solutions, often requiring less development time.
Analyses across tasks with varying complexity and search spaces revealed that no prompting strategy consistently dominated. FSP proved to be a strong baseline, often yielding competitive results in lower-complexity and tabular tasks even without tuning. In contrast, ToT showed clear advantages in broader search spaces, particularly when combined with HPT, though it required greater computational resources and design effort.
From the perspective of efficiency, metrics such as memory usage and execution time varied inconsistently across models depending on the prompt strategies and tuning options; therefore, no single strategy stood out in terms of overall efficiency. However, Gemini 2.5 Pro and DeepSeek-V3 frequently emerged as the top-performing models with the highest accuracy, F1 score and lowest RMSE across various tasks.
The fact that LLMs can suggest algorithms suitable for the given task indicates that these models can be considered not only as prediction engines but also as tools assisting in the design of effective ML pipelines. The findings demonstrate that, when prompt design is handled carefully and aligned with the task, LLMs have the potential to deliver strong solutions with limited human intervention as supportive ML tools.
In addition, the comparison with established AutoML frameworks highlights their complementary role to LLM-based solutions. While AutoML systems provide stable and systematically optimized baselines, LLMs offer clear advantages in rapid prototyping and flexible pipeline generation, sometimes extending beyond the predefined search spaces of AutoML. The results from the most recent and large-scale benchmark task further illustrate this complementarity, showing that LLMs can deliver competitive performance with substantial efficiency gains.
Future studies may test the scalability of these approaches on larger and real-world datasets, as well as their robustness in less structured or noisier data settings. In addition, integrating different prompting methods could contribute to enhancing the generalization ability of LLMs across various tasks.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app152010968/s1, Folder S1: Contains the code generated by the LLMs, along with the corresponding log and submission files and datasets. File S1: PDF document including functional and non-functional code examples.

Funding

This research received no external funding.

Data Availability Statement

Acknowledgments

I gratefully acknowledge the support of my Ph.D. student, Tuğrul Hakan Gençtürk, for his help with prompt execution in this study.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A

Table A1. FSP-Tuned hyperparameters for the Titanic task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: Gradient Boosting Classifier | Random Forest Classifier | Random Forest Classifier
Tuned Hyperparameters:
clf__n_estimators: randint(200, 600),
clf__learning_rate: uniform(0.01, 0.2),
clf__max_depth: randint(2, 5),
clf__min_samples_split: randint(2, 20),
clf__min_samples_leaf: randint(1, 20),
clf__subsample: uniform(0.6, 0.4)
n_estimators: [100, 200],
max_depth: [4, 6, 8, None],
min_samples_split: [2, 5, 10],
min_samples_leaf: [1, 2, 4],
class_weight: [‘balanced’]
n_estimators: [100, 200, 300],
max_depth: [None, 10, 20],
min_samples_split: [2, 5, 10],
min_samples_leaf: [1, 2, 4],
class_weight: [None, ‘balanced’]
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: Random Forest Classifier | Random Forest Classifier | Random Forest Classifier | Random Forest Classifier
Tuned Hyperparameters:
classifier__n_estimators: [100, 200, 300],
classifier__max_depth: [None, 5, 10],
classifier__min_samples_split: [2, 5, 10],
classifier__min_samples_leaf: [1, 2, 4]
classifier__n_estimators: [100, 200, 300],
classifier__max_depth: [None, 5, 10],
classifier__min_samples_split: [2, 5, 10],
classifier__min_samples_leaf: [1, 2, 4]
n_estimators: [100, 200, 300],
max_depth: [5, 10, 15, None],
min_samples_split: [2, 5, 10],
min_samples_leaf: [1, 2, 4],
max_features: [‘auto’, ‘sqrt’, ‘log2’]
n_estimators: [100, 200, 300],
max_depth: [3, 5, 7, None],
min_samples_split: [2, 5, 10],
min_samples_leaf: [1, 2, 4],
max_features: [‘sqrt’, ‘log2’]
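For reference, parameter names prefixed with classifier__ (or clf__/model__) indicate steps of a scikit-learn Pipeline. The sketch below shows how the DeepSeek grid from Table A1(b) can be wired into GridSearchCV; the preprocessing step is an assumption for illustration. The randint/uniform entries in Table A1(a) correspond instead to scipy.stats distributions passed to RandomizedSearchCV.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Preprocessing is illustrative; the LLM-generated pipelines differ in detail.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked", "Pclass"]),
])

pipe = Pipeline([("preprocess", preprocess),
                 ("classifier", RandomForestClassifier(random_state=42))])

# Grid taken from Table A1(b) (DeepSeek-V3 / DeepSeek-R1 column).
param_grid = {
    "classifier__n_estimators": [100, 200, 300],
    "classifier__max_depth": [None, 5, 10],
    "classifier__min_samples_split": [2, 5, 10],
    "classifier__min_samples_leaf": [1, 2, 4],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
# search.fit(X_train, y_train)  # X_train/y_train are placeholders for the prepared data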
Table A2. ToT-Tuned hyperparameters for the Titanic task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: Random Forest Classifier | Random Forest Classifier | Random Forest Classifier
Tuned Hyperparameters:
clf__n_estimators: [100, 200, 300]
clf__max_depth: [None, 10, 20]
clf__min_samples_split: [2, 5]
n_estimators: [100, 200, 300, 400]
max_depth: [3, 5, 7, 9, None]
min_samples_split: [2, 5, 10]
min_samples_leaf: [1, 2, 4]
max_features: [‘sqrt’, ‘log2’]
n_estimators: [100, 200]
max_depth: [5, 10, None]
min_samples_split: [2, 5]
min_samples_leaf: [1, 2]
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: Random Forest Classifier | Random Forest Classifier | Random Forest Classifier | Random Forest Classifier
Tuned Hyperparameters:
classifier__n_estimators: [100, 200]
classifier__max_depth: [None, 10, 20]
classifier__min_samples_split: [2, 5]
classifier__n_estimators: [100, 200]
classifier__max_depth: [5, 10, None]
classifier__min_samples_split: [2, 5]
classifier__min_samples_leaf: [1, 2]
n_estimators: [100, 200, 300]
max_depth: [5, 10, 15, None]
min_samples_split: [2, 5, 10]
min_samples_leaf: [1, 2, 4]
max_features: [‘sqrt’, ‘log2’]
n_estimators: [100, 200, 300]
max_depth: [3, 5, 7, 10, None]
min_samples_split: [2, 5, 10]
min_samples_leaf: [1, 2, 4]
max_features: [‘sqrt’, ‘log2’, None]
Table A3. FSP-Tuned hyperparameters for the House Prices task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: XGBRegressor | Blending (LightGBM + XGBoost + CatBoost) | Blending (LGBMRegressor + XGBRegressor + CatBoostRegressor)
Tuned Hyperparameters:
model__n_estimators: randint(800, 1600)
model__max_depth: randint(3, 7)
model__learning_rate: uniform(0.01, 0.09)
model__subsample: uniform(0.6, 0.4)
model__colsample_bytree: uniform(0.6, 0.4)
model__reg_alpha: uniform(0.0, 0.6)
model__reg_lambda: uniform(0.3, 1.0)
# LightGBM
objective: “regression”
metric: “rmse”
random_state: SEED
learning_rate: trial.suggest_float(“learning_rate”, 0.01, 0.2)
num_leaves: trial.suggest_int(“num_leaves”, 20, 80)
feature_fraction: trial.suggest_float(“feature_fraction”, 0.6, 1.0)
bagging_fraction: trial.suggest_float(“bagging_fraction”, 0.6, 1.0)
bagging_freq: trial.suggest_int(“bagging_freq”, 1, 7)
min_child_samples: trial.suggest_int(“min_child_samples”, 5, 30)

# XGBoost
objective: “reg:squarederror”
tree_method: “hist”
random_state: SEED
learning_rate: trial.suggest_float(“learning_rate”, 0.01, 0.2)
max_depth: trial.suggest_int(“max_depth”, 3, 10)
subsample: trial.suggest_float(“subsample”, 0.6, 1.0)
colsample_bytree: trial.suggest_float(“colsample_bytree”, 0.6, 1.0)
min_child_weight: trial.suggest_int(“min_child_weight”, 1, 10)

# CatBoost
loss_function: “RMSE”
random_seed: SEED
learning_rate: trial.suggest_float(“learning_rate”, 0.01, 0.2)
depth: trial.suggest_int(“depth”, 3, 10)
l2_leaf_reg: trial.suggest_float(“l2_leaf_reg”, 1, 10)
bagging_temperature: trial.suggest_float(“bagging_temperature”, 0, 1)
border_count: trial.suggest_int(“border_count”, 32, 255)
verbose: 0
# LightGBM
num_leaves: 31
learning_rate: 0.05
n_estimators: 720
max_bin: 55
bagging_fraction: 0.8
bagging_freq: 5
feature_fraction: 0.2319
feature_fraction_seed: 9
bagging_seed: 9
min_data_in_leaf: 6
min_sum_hessian_in_leaf: 11
random_state: SEED
n_jobs: −1

# XGBoost
learning_rate: 0.05
n_estimators: 600
max_depth: 3
min_child_weight: 0
gamma: 0
subsample: 0.7
colsample_bytree: 0.7
reg_alpha: 0.005
random_state: SEED
n_jobs: −1

# CatBoost
iterations: 1000
learning_rate: 0.05
depth: 3
l2_leaf_reg: 4
loss_function: ‘RMSE’
eval_metric: ‘RMSE’
random_seed: SEED
verbose: 0
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: XGBRegressor + LGBMRegressor + CatBoostRegressor | XGBRegressor + LGBMRegressor + CatBoostRegressor + Ridge | XGBRegressor + LGBMRegressor + CatBoostRegressor (meta-model: Ridge) | LGBMRegressor + XGBRegressor + CatBoostRegressor
Tuned Hyperparameters:
# XGBRegressor:
n_estimators: 100, 2000
max_depth: 3, 12
learning_rate: 0.001, 0.1, log = True
subsample: 0.6, 1.0
colsample_bytree: 0.6, 1.0
reg_alpha: 0, 10
reg_lambda: 0, 10

The LGBMRegressor and CatBoostRegressor models are trained directly with their default parameters.
# xgb:
objective: reg:squarederror
n_estimators: 1000
learning_rate: 0.01
max_depth: 3
subsample: 0.8
colsample_bytree: 0.4
random_state: 42

# lgbm:
objective: regression
n_estimators: 1000
learning_rate: 0.01
max_depth: 3
subsample: 0.8
colsample_bytree: 0.4
random_state: 42

# catboost:
iterations: 1000
learning_rate: 0.01
depth: 3
subsample: 0.8
colsample_bylevel: 0.4
random_seed: 42
verbose: 0

# ridge:
alpha: 10
random_state: 42
# XGB
n_estimators: trial.suggest_int(‘n_estimators’, 100, 1000)
max_depth: trial.suggest_int(‘max_depth’, 3, 10)
learning_rate: trial.suggest_float(‘learning_rate’, 0.01, 0.3)
subsample: trial.suggest_float(‘subsample’, 0.6, 1.0)
colsample_bytree: trial.suggest_float(‘colsample_bytree’, 0.6, 1.0)
reg_alpha: trial.suggest_float(‘reg_alpha’, 0, 10)
reg_lambda: trial.suggest_float(‘reg_lambda’, 0, 10)
random_state: 42

# LGBM
n_estimators: trial.suggest_int(‘n_estimators’, 100, 1000)
max_depth: trial.suggest_int(‘max_depth’, 3, 10)
learning_rate: trial.suggest_float(‘learning_rate’, 0.01, 0.3)
num_leaves: trial.suggest_int(‘num_leaves’, 20, 300)
feature_fraction: trial.suggest_float(‘feature_fraction’, 0.5, 1.0)
bagging_fraction: trial.suggest_float(‘bagging_fraction’, 0.5, 1.0)
bagging_freq: trial.suggest_int(‘bagging_freq’, 1, 7)
reg_alpha: trial.suggest_float(‘reg_alpha’, 0, 10)
reg_lambda: trial.suggest_float(‘reg_lambda’, 0, 10)
random_state: 42
verbosity: −1

# CatBoost
iterations: trial.suggest_int(‘iterations’, 100, 1000)
depth: trial.suggest_int(‘depth’, 4, 10)
learning_rate: trial.suggest_float(‘learning_rate’, 0.01, 0.3)
l2_leaf_reg: trial.suggest_float(‘l2_leaf_reg’, 1, 10)
random_seed: 42
verbose: False
# LGBM
objective: “regression”
metric: “rmse”
boosting_type: “gbdt”
num_leaves: trial.suggest_int(“num_leaves”, 10, 300)
learning_rate: trial.suggest_float(“learning_rate”, 0.01, 0.3)
feature_fraction: trial.suggest_float(“feature_fraction”, 0.4, 1.0)
bagging_fraction: trial.suggest_float(“bagging_fraction”, 0.4, 1.0)
bagging_freq: trial.suggest_int(“bagging_freq”, 1, 7)
min_child_samples: trial.suggest_int(“min_child_samples”, 5, 100)
verbosity: −1
random_state: 42

# XGB
n_estimators: 1000
learning_rate: 0.05
max_depth: 6
random_state: 42

# CatBoost
iterations: 1000
learning_rate: 0.05
depth: 6
verbose: False
random_state: 42
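The trial.suggest_* entries appearing in Tables A3, A4, A9, and A10 follow the Optuna search-space API. The sketch below shows how the GPT-4.1 LightGBM ranges from Table A3(a) translate into an Optuna objective; the train/validation split and the fixed n_estimators value are assumptions for illustration, and X and y are placeholders for the prepared feature matrix and target.
import lightgbm as lgb
import numpy as np
import optuna
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def objective(trial, X, y):
    """Optuna objective mirroring the trial.suggest_* ranges listed in Table A3(a)."""
    params = {
        "objective": "regression",
        "metric": "rmse",
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2),
        "num_leaves": trial.suggest_int("num_leaves", 20, 80),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.6, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.6, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 30),
    }
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    model = lgb.LGBMRegressor(**params, n_estimators=1000, random_state=42)
    model.fit(X_tr, y_tr)
    preds = model.predict(X_val)
    return np.sqrt(mean_squared_error(y_val, preds))

# study = optuna.create_study(direction="minimize")
# study.optimize(lambda t: objective(t, X, y), n_trials=50)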
Table A4. ToT-Tuned hyperparameters for the House Prices task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: LGBMRegressor | Lasso + XGBRegressor | Ridge + LGBMRegressor
Tuned Hyperparameters:
num_leaves: [31, 64]
learning_rate: [0.05, 0.1]
n_estimators: [500, 1000]
max_depth: [−1, 10]
# Lasso
alpha: np.logspace(−4, 0, 40)

# XGB
learning_rate: [0.01, 0.03]
max_depth: [3, 4]
n_estimators: [300, 500]
reg_alpha: [0, 0.1]
reg_lambda: [0.7, 1.0]
subsample: [0.7, 1.0]
# Model 1—Ridge Regression (non-tuned)

# Model 2—LightGBM
objective: “regression”
num_leaves: 31
learning_rate: 0.05
n_estimators: 720
max_bin: 55
bagging_fraction: 0.8
bagging_freq: 5
feature_fraction: 0.2319
feature_fraction_seed: 9
bagging_seed: 9
min_data_in_leaf: 6
min_sum_hessian_in_leaf: 11
random_state: RANDOM_SEED
n_jobs: −1
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: LGBMRegressor + XGBRegressor + CatBoostRegressor | LGBMRegressor | Ridge + Lasso + ElasticNet + XGBoost + LightGBM | XGBoost + LightGBM + Ridge + ElasticNet + Random Forest
Tuned Hyperparameters:
# LGBMRegressor
n_estimators: trial.suggest_int(‘n_estimators’, 100, 2000)
max_depth: trial.suggest_int(‘max_depth’, 3, 12)
learning_rate: trial.suggest_float(‘learning_rate’, 0.001, 0.1)
subsample: trial.suggest_float(‘subsample’, 0.6, 1.0)
colsample_bytree: trial.suggest_float(‘colsample_bytree’, 0.6, 1.0)
n_estimators: (100, 1000)
max_depth: (3, 10)
learning_rate: (0.001, 0.1, ‘log-uniform’)
num_leaves: (10, 100)
min_child_samples: (5, 50)
subsample: (0.6, 1.0)
colsample_bytree: (0.6, 1.0)
# XGBoost
max_depth: [3, 4, 5]
learning_rate: [0.01, 0.05, 0.1]
n_estimators: [300, 500]
subsample: [0.8]
colsample_bytree: [0.8]

# LightGBM
num_leaves: [20, 31, 40]
learning_rate: [0.01, 0.05, 0.1]
n_estimators: [300, 500]
subsample: [0.8]
colsample_bytree: [0.8]

# Ridge Regression
alpha: [0.1, 0.5, 1, 5, 10, 20, 50]

# Lasso
alpha: [0.0001, 0.0005, 0.001, 0.005, 0.01]

# ElasticNet
alpha: [0.0001, 0.0005, 0.001, 0.005]
l1_ratio: [0.3, 0.5, 0.7, 0.9]
# XGBoost
n_estimators = 1000,
max_depth = 3,
learning_rate = 0.05,
subsample = 0.8,
colsample_bytree = 0.8,
reg_alpha = 0.05,
reg_lambda = 0.05

# LightGBM
n_estimators = 1000,
max_depth = 3,
learning_rate = 0.05,
subsample = 0.8,
colsample_bytree = 0.8,
reg_alpha = 0.05,
reg_lambda = 0.05,
verbosity = −1

# Ridge Regression
alpha = 10.0

# ElasticNet
alpha = 0.005,
l1_ratio = 0.9,
max_iter = 1000

# Random Forest
n_estimators = 300,
max_depth = 15,
n_jobs = −1
Table A5. FSP-Tuned hyperparameters for the Digit Recognizer task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: CNN | CNN | CNN
Tuned Hyperparameters:
conv1: [32, 48]
conv2: [64, 96]
dense: [128, 256]
dropout: [0.2, 0.3, 0.4]
epochs: [12, 15, 20]
batch_size: [64, 128, 256]
conv1_filters: [32, 64]
conv2_filters: [64, 128]
dense_units: [128, 256]
dropout_rate: [0.2, 0.3]
batch_size: [64, 128]
epochs: [12, 15]
conv_filters: [32, 64]
dropout_rates: [0.25, 0.5]
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: CNN | CNN | CNN | CNN + Random Forest Classifier
Tuned Hyperparameters:
learning_rate: [0.001, 0.0005, 0.0001]
batch_size: [64, 128, 256]
epochs: [30, 40, 50]
dropout_rate: [0.3, 0.4, 0.5]
dropout_rate: [0.2, 0.3, 0.4]
optimizer: [‘adam’, ‘rmsprop’]
batch_size: [64, 128]
epochs: [20]
epochs: [20, 30]
batch_size: [64, 128]
model__learning_rate: [0.001, 0.0005]
# CNN
epochs: [5, 10, 15]
batch_size: [64, 128, 256]
model__optimizer: [‘adam’, ‘rmsprop’]
# Random Forest
n_estimators: [100, 200, 300]
max_depth: [10, 20, None]
min_samples_split: [2, 5, 10]
Table A6. ToT-Tuned hyperparameters for the Digit Recognizer task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: Medium CNN | Simple CNN | CNN
Tuned Hyperparameters:
lr: [1 × 10−4, 5 × 10−3]
bs: [64, 128, 256]
lr: [1 × 10−4, 5 × 10−2]
batch_size: [64, 128, 256]
lr: [1 × 10−4, 1 × 10−2]
dropout_rate: [0.1, 0.6]
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: CNN | CNN | Simple CNN | CNN
Tuned Hyperparameters:
learning_rate: 0.0003
epochs: 10
batch_size: 64
units: [128, 256]
dropout: [0.3, 0.4, 0.5]
lr: [0.001, 0.0005, 0.0001]
learning_rate: 5 × 10−4
batch_size_train: 128
batch_size_valtest: 256
epochs: 30
batch_size: 128
learning_rate: 0.0005
epochs: 30
Table A7. FSP-Tuned hyperparameters for the Disaster Tweets task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: TF-IDF + Logistic Regression | TF-IDF + Stacking Classifier | TF-IDF + Logistic Regression
Tuned Hyperparameters:
clf__C: [0.5, 1, 2, 5]
lr__C: [0.5, 1, 2]
lgbm__num_leaves: [15, 31]
lgbm__learning_rate: [0.05, 0.1]
final_estimator__C: [0.5, 1, 2]
# LogisticRegression
C: 1.0
solver: ‘liblinear’
random_state: 42

# TfidfVectorizer
ngram_range: (1, 2)
max_features: 15,000
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: TF-IDF + Logistic Regression | TF-IDF + Logistic Regression | TF-IDF + Logistic Regression/SVM/LGBMClassifier | TF-IDF + Logistic Regression + SVM
Tuned Hyperparameters:
clf__C: [0.1, 1, 10]
clf__penalty: [‘l1’, ‘l2’]
clf__solver: [‘liblinear’]
# TfidfVectorizer
ngram_range: [(1, 1), (1, 2)]
max_features: [5000, 10,000]

# LogisticRegression
C: [0.1, 1, 10]
solver: [‘liblinear’, ‘saga’]
# Logistic Regression (pipe_lr)
clf__C: [0.1, 0.5, 1.0, 2.0, 5.0]
clf__penalty: [‘l2’]
clf__max_iter: [500]

# SVM (pipe_svm)
clf__C: [0.1, 1.0, 10.0]
clf__kernel: [‘rbf’, ‘linear’]
clf__gamma: [‘scale’, ‘auto’]

# LightGBM (pipe_lgb)
clf__n_estimators: [100, 200, 300]
clf__num_leaves: [31, 50, 100]
clf__learning_rate: [0.05, 0.1, 0.2]
clf__min_child_samples: [20, 30]
# LogisticRegression
C: [0.1, 1, 10]

# SVC
C: [0.1, 1, 10]
kernel: [‘linear’, ‘rbf’]
Table A8. ToT-Tuned hyperparameters for the Disaster Tweets task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: TF-IDF + Logistic Regression | TF-IDF + Logistic Regression | TF-IDF + Logistic Regression
Tuned Hyperparameters:
C: [1, 2, 4, 8]
C: [0.3, 1, 3]
penalty: [‘l2’]
C: trial.suggest_loguniform(‘C’, 1 × 10−1, 1 × 10)
solver: ‘liblinear’
penalty: ‘l2’
random_state: 42
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: TF-IDF + Logistic Regression | TF-IDF + Logistic Regression | TF-IDF + XGBClassifier | XGBoost + SVM + Multinomial Naive Bayes
Tuned Hyperparameters:
C: [0.1, 1, 10]
penalty: [‘l1’, ‘l2’]
solver: [‘liblinear’]
C: [0.1, 1, 10]
class_weight: [None, ‘balanced’]
n_estimators: [100, 500]
max_depth: [3, 9]
learning_rate: [0.01, 0.3]
subsample: [0.6, 1.0]
colsample_bytree: [0.6, 1.0]
min_child_weight: [1, 4]
gamma: [0, 0.5]
# XGB
max_depth: [4, 6, 8]
learning_rate: [0.05, 0.1, 0.15]
n_estimators: [100, 200]
# SVM
C: [0.1, 1, 10]
gamma: [‘scale’, ‘auto’]

# Naive Bayes
alpha: [0.01, 0.1, 1.0]
Table A9. FSP-Tuned hyperparameters for the Predicting the Beats-per-Minute of Songs task: An LLM-based overview.
LLM: GPT-5 Thinking | Gemini 2.5 Pro | DeepSeek-V3
Selected Algorithm: CatBoost/XGBoost/Ridge | LightGBM | LightGBM
Tuned Hyperparameters:
# CatBoost
depth: [4, 5, 6, 7, 8, 9, 10]
learning_rate: [0.03, 0.05, 0.07, 0.10]
l2_leaf_reg: [1, 3, 5, 7, 10, 15, 20, 30]
bagging_temperature: [0.0, 0.25, 0.5, 1.0]
random_strength: [0.5, 1.0, 1.5, 2.0]
grow_policy: [‘SymmetricTree’, ‘Depthwise’, ‘Lossguide’]
border_count: [64, 128, 254]
min_data_in_leaf: [1, 5, 10, 20, 50]
n_estimators: [2000, 4000, 8000]

# XGBoost
xgb__n_estimators: [400, 600, 800, 1200, 1600]
xgb__max_depth: [3, 4, 5, 6, 7, 8, 9, 10]
xgb__learning_rate: [0.03, 0.05, 0.07, 0.10, 0.15, 0.20]
xgb__subsample: [0.6, 0.7, 0.8, 0.9, 1.0]
xgb__colsample_bytree: [0.6, 0.7, 0.8, 0.9, 1.0]
xgb__min_child_weight: [1, 2, 3, 5, 7, 10]
xgb__reg_alpha: [0.0, 1 × 10−8, 1 × 10−6, 1 × 10−4, 1 × 10−3, 1 × 10−2,1 × 10−1]
xgb__reg_lambda: [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]

#Ridge
rg__alpha: list(np.logspace(−2, 3, 30)) # 0.01 … 1000
# LightGBM
objective: ‘regression_l1’
metric: ‘rmse’
boosting_type: ‘gbdt’
random_state: 42
n_jobs: −1
verbose: −1
n_estimators: int [200, 2000]
learning_rate: float [0.01, 0.3]
num_leaves: int [20, 300]
max_depth: int [3, 12]
min_child_samples: int [5, 100]
subsample: float [0.6, 1.0]
colsample_bytree: float [0.6, 1.0]
reg_alpha: logfloat [1 × 10−8, 10.0]
reg_lambda: logfloat [1 × 10−8, 10.0]
# LightGBM
objective: ‘regression’
metric: ‘mae’
verbosity: −1
boosting_type: ‘gbdt’
random_state: 42
n_estimators: int [100, 1000]
learning_rate: logfloat [0.01, 0.3]
num_leaves: int [20, 300]
max_depth: int [3, 12]
min_child_samples: int [5, 100]
subsample: float [0.5, 1.0]
colsample_bytree: float [0.5, 1.0]
reg_alpha: logfloat [1 × 10−8, 10.0]
reg_lambda: logfloat [1 × 10−8, 10.0]
Table A10. ToT-Tuned hyperparameters for the Predicting the Beats-per-Minute of Songs task: An LLM-based overview.
LLM: GPT-5 Thinking | Gemini 2.5 Pro | DeepSeek-V3
Selected Algorithm: Ridge (blender, α = 1.0, OOF-based ensemble) | LightGBM | XGBoost
Tuned Hyperparameters:
# Ridge
alpha: float [1 × 10−3,1 × 103]
random_state: 42

# ElasticNet
alpha: float [1 × 10−3,1 × 103]
l1_ratio: float [0.01, 0.99]
random_state: 42

# ExtraTrees
n_estimators: int [200, 800]
max_depth: {None | 8 | 12 | 16 | 24}
min_samples_split: int [2, 20]
min_samples_leaf: int [1, 12]
max_features: {“sqrt” | “auto” | 0.5 | 0.8}
random_state: 42

# HistGradientBoostingRegressor
max_depth: int [3, 10]
learning_rate: float [0.01, 0.3]
l2_regularization: float [1 × 10−8, 10.0]
max_leaf_nodes: int [15, 63]
random_state: 42


# LightGBM
num_leaves: int [15, 255]
feature_fraction: float [0.5, 1.0]
bagging_fraction: float [0.5, 1.0]
bagging_freq: int [0, 7]
min_data_in_leaf: int [10, 200]
lambda_l1: float [1 × 10−3, 10.0]
lambda_l2: float [1 × 10−3, 10.0]
learning_rate: float [0.01, 0.2]
random_state: 42

# XGBoost
max_depth: int [3, 10]
min_child_weight: int [1, 20]
subsample: float [0.5, 1.0]
colsample_bytree: float [0.5, 1.0]
reg_alpha: float [1 × 10−3, 10.0]
reg_lambda: float [1 × 10−3, 10.0]
eta: float [0.01, 0.2]
random_state: 42

# CatBoost
depth: int [4, 10]
learning_rate: float [0.01, 0.2]
l2_leaf_reg: float [1.0, 15.0]
bagging_temperature: float [0.0, 5.0]
random_state: 42
# LightGBM
objective: ‘regression_l1’
metric: ‘rmse’
n_estimators: 2000
learning_rate: 0.01
feature_fraction: 0.8
bagging_fraction: 0.8
bagging_freq: 1
lambda_l1: 0.1
lambda_l2: 0.1
num_leaves: 31
verbose: −1
n_jobs: −1
seed: RANDOM_SEED
boosting_type: ‘gbdt’
# XGBoost
n_estimators: int [100, 1000]
max_depth: int [3, 10]
learning_rate: [0.01, 0.3]
subsample: float [0.6, 1.0]
colsample_bytree: float [0.6, 1.0]
reg_alpha: float [0, 1.0]
reg_lambda: float [0, 1.0]
random_state: 42

References

  1. Gençtürk, T.H.; Gülağiz, F.K.; Kaya, İ. Detection and segmentation of subdural hemorrhage on head CT images. IEEE Access 2024, 12, 82235–82246. [Google Scholar] [CrossRef]
  2. Laney, D. 3D Data Management: Controlling Data Volume, Velocity and Variety; META Group Research Note; META Group: Stamford, CT, USA, 2001. [Google Scholar]
  3. Gandomi, A.; Haider, M. Beyond the hype: Big data concepts, methods, and analytics. Int. J. Inf. Manag. 2015, 35, 137–144. [Google Scholar] [CrossRef]
  4. Du, X.; Liu, M.; Wang, K.; Wang, H.; Liu, J.; Chen, Y.; Feng, J.; Sha, C.; Peng, X.; Lou, Y. Evaluating large language models in class-level code generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, New York, NY, USA, 14–20 April 2024; pp. 1–13. [Google Scholar] [CrossRef]
  5. Li, J.; Li, G.; Zhang, X.; Zhao, Y.; Dong, Y.; Jin, Z.; Li, B.; Huang, F.; Li, Y. EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations. Adv. Neural Inf. Process. Syst. 2024, 37, 57619–57641. [Google Scholar] [CrossRef]
  6. Fakhoury, S.; Naik, A.; Sakkas, G.; Chakraborty, S.; Lahiri, S.K. Llm-based test-driven interactive code generation: User study and empirical evaluation. IEEE Trans. Softw. Eng. 2024, 50, 2254–2268. [Google Scholar] [CrossRef]
  7. Coignion, T.; Quinton, C.; Rouvoy, R. A performance study of llm-generated code on leetcode. In Proceedings of the 28th international conference on evaluation and assessment in software engineering, Salerno, Italy, 18–21 June 2024; pp. 79–89. [Google Scholar] [CrossRef]
  8. Tambon, F.; Dakhel, A.M.; Nikanjam, A.; Khomh, F.; Desmarais, M.C.; Antoniol, G. Bugs in Large Language Models Generated Code: An Empirical Study. Empir. Software Eng. 2025, 30, 65. [Google Scholar] [CrossRef]
  9. Li, J.; Li, G.; Li, Y.; Jin, Z. Structured Chain-of-Thought prompting for code generation. ACM Trans. Softw. Eng. Methodol. 2025, 34, 37. [Google Scholar] [CrossRef]
  10. Khojah, R.; de Oliveira Neto, F.G.; Mohamad, M.; Leitner, P. The impact of prompt programming on function-level code generation. IEEE Trans. Softw. Eng. 2025, 51, 2381–2395. [Google Scholar] [CrossRef]
  11. Yang, G.; Zhou, Y.; Chen, X.; Zhang, X.; Zhuo, T.Y.; Chen, T. Chain-of-thought in neural code generation: From and for lightweight language models. IEEE Trans. Softw. Eng. 2024, 50, 2437–2457. [Google Scholar] [CrossRef]
  12. Chen, B.; Zhang, Z.; Langrené, N.; Zhu, S. Unleashing the potential of prompt engineering for large language models. Patterns 2025, 6, 101260. [Google Scholar] [CrossRef]
  13. Yao, J.; Zhang, L.; Huang, J. Evaluation of Large Language Model-Driven AutoML in Data and Model Management from Human-Centered Perspective. Front. Artif. Intell. Sec. Nat. Lang. Process. 2025, 8, 1590105. [Google Scholar] [CrossRef]
  14. Fathollahzadeh, S.; Mansour, E.; Boehm, M. Demonstrating CatDB: LLM-based Generation of Data-centric ML Pipelines. In Proceedings of the Companion of the 2025 International Conference on Management of Data, New York, NY, USA, 22–27 June 2025; pp. 87–90. [Google Scholar] [CrossRef]
  15. Zhao, Y.; Pang, J.; Zhu, X.; Shao, W. LLM-Prompting Driven AutoML: From Sleep Disorder—Classification to Beyond. Trans. Artif. Intell. 2025, 1, 59–82. [Google Scholar] [CrossRef]
  16. Mulakala, B.; Saini, M.L.; Singh, A.; Bhukya, V.; Mukhopadhyay, A. Adaptive multi-fidelity hyperparameter optimization in large language models. In Proceedings of the 2024 8th International Conference on Computational System and Information Technology for Sustainable Solutions (CSITSS), Bengaluru, India, 7–9 November 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–5. [Google Scholar] [CrossRef]
  17. Zhang, M.R.; Desai, N.; Bae, J.; Lorraine, J.; Ba, J. Using large language models for hyperparameter optimization. In Proceedings of the NeurIPS 2023 Workshop, New Orleans, LA, USA, 16 December 2023. [Google Scholar] [CrossRef]
  18. Wang, L.; Shi, C.; Du, S.; Tao, Y.; Shen, Y.; Zheng, H.; Qiu, X. Performance Review on LLM for solving leetcode problems. In Proceedings of the 2024 4th International Symposium on Artificial Intelligence and Intelligent Manufacturing (AIIM), Chengdu, China, 20–22 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1050–1054. [Google Scholar] [CrossRef]
  19. Jain, R.; Thanvi, J.; Subasinghe, A. The evolution of ChatGPT for programming: A comparative study. Eng. Res. Express 2025, 7, 015242. [Google Scholar] [CrossRef]
  20. Döderlein, J.B.; Kouadio, N.H.; Acher, M.; Khelladi, D.E.; Combemale, B. Piloting Copilot, Codex, and StarCoder2: Hot temperature, cold prompts, or black magic? J. Syst. Softw. 2025, 230, 112562. [Google Scholar] [CrossRef]
  21. Jamil, M.T.; Abid, S.; Shamail, S. Can LLMs Generate Higher Quality Code Than Humans? An Empirical Study. In Proceedings of the 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR), Ottawa, ON, Canada, 28–29 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 478–489. [Google Scholar] [CrossRef]
  22. Mathews, N.S.; Nagappan, M. Test-driven development and llm-based code generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1583–1594. [Google Scholar] [CrossRef]
  23. Ko, E.; Kang, P. Evaluating Coding Proficiency of Large Language Models: An Investigation Through Machine Learning Problems. IEEE Access 2025, 13, 52925–52938. [Google Scholar] [CrossRef]
  24. Radford, A.; Narasimhan, K. Improving Language Understanding by Generative Pre-Training. OpenAI. 2018. Available online: https://api.semanticscholar.org/CorpusID:49313245 (accessed on 3 August 2025).
  25. OpenAI. Introducing OpenAI o3 and o4-Mini. OpenAI. 2025. Available online: https://openai.com/index/introducing-o3-and-o4-mini (accessed on 3 August 2025).
  26. OpenAI. Introducing GPT-4.1 in the API. OpenAI. 2025. Available online: https://openai.com/index/gpt-4-1/ (accessed on 3 August 2025).
  27. OpenAI. ChatGPT Release Notes. OpenAI Help Center. 2025. Available online: https://help.openai.com/en/articles/6825453-chatgpt-release-notes (accessed on 3 August 2025).
  28. Comanici, G.; Bieber, E.; Schaekermann, M.; Pasupat, I.; Sachdeva, N.; Dhillon, I.; Blistein, M.; Ram, O.; Zhang, D.; Rosen, E.; et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv 2025, arXiv:2507.06261. [Google Scholar] [CrossRef]
  29. DeepSeek-AI. Deepseek-AI/Organization Card. Hugging Face. Available online: https://huggingface.co/deepseek-ai (accessed on 3 August 2025).
  30. DeepSeek-AI; Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; et al. Deepseek-v3 technical report. arXiv 2024, arXiv:2412.19437. [Google Scholar] [CrossRef]
  31. DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv 2025, arXiv:2501.12948. [Google Scholar] [CrossRef]
  32. Anthropic. Introducing Claude. 2023. Available online: https://www.anthropic.com/index/introducing-claude (accessed on 3 August 2025).
  33. Anthropic. System Card: Claude Opus 4 & Claude Sonnet 4. Anthropic. 2025. Available online: https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf (accessed on 3 August 2025).
  34. Anthropic. Introducing Claude 4. Anthropic. 2025. Available online: https://www.anthropic.com/news/claude-4 (accessed on 3 August 2025).
  35. Tonmoy, S.M.; Zaman, S.M.; Jain, V.; Rani, A.; Rawte, V.; Chadha, A.; Das, A. A comprehensive survey of hallucination mitigation techniques in large language models. arXiv 2024, arXiv:2401.01313. [Google Scholar] [CrossRef]
  36. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 195. [Google Scholar] [CrossRef]
  37. Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv 2024, arXiv:2402.07927. [Google Scholar] [CrossRef]
  38. Tafesse, W.; Wood, B. Hey ChatGPT: An examination of ChatGPT prompts in marketing. J. Mark. Anal. 2024, 12, 790–805. [Google Scholar] [CrossRef]
  39. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 39. [Google Scholar] [CrossRef]
  40. Lee, Y.; Oh, J.H.; Lee, D.; Kang, M.; Lee, S. Prompt engineering in ChatGPT for literature review: Practical guide exemplified with studies on white phosphors. Sci. Rep. 2025, 15, 15310. [Google Scholar] [CrossRef] [PubMed]
  41. Debnath, T.; Siddiky, M.N.A.; Rahman, M.E.; Das, P.; Guha, A.K. A comprehensive survey of prompt engineering techniques in large language models. TechRxiv 2025. [Google Scholar] [CrossRef]
  42. Phoenix, J.; Taylor, M. Prompt Engineering for Generative AI; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2024. [Google Scholar]
  43. Saint-Jean, D.; Al Smadi, B.; Raza, S.; Linton, S.; Igweagu, U. A Study of Prompt Engineering Techniques for Code Generation: Focusing on Data Science Applications. In International Conference on Information Technology-New Generations; Springer Nature: Cham, Switzerland, 2025; pp. 445–453. [Google Scholar] [CrossRef]
  44. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
  45. Bareiß, P.; Souza, B.; d’Amorim, M.; Pradel, M. Code generation tools (almost) for free? A study of few-shot, pre-trained language models on code. arXiv 2022, arXiv:2206.01335. [Google Scholar] [CrossRef]
  46. Xu, D.; Xie, T.; Xia, B.; Li, H.; Bai, Y.; Sun, Y.; Wang, W. Does few-shot learning help LLM performance in code synthesis? arXiv 2024, arXiv:2412.02906. [Google Scholar] [CrossRef]
  47. Khot, T.; Trivedi, H.; Finlayson, M.; Fu, Y.; Richardson, K.; Clark, P.; Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. arXiv 2022, arXiv:2210.02406. [Google Scholar] [CrossRef]
  48. Suzgun, M.; Scales, N.; Scharli, N.; Gehrmann, S.; Tay, Y.; Chung, H.W.; Chowdhery, A.; Le, Q.V.; Chi, E.H.; Zhou, D.; et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022. [Google Scholar]
  49. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. arXiv 2023, arXiv:2305.10601. [Google Scholar]
  50. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  51. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  52. Feurer, M.; Klein, A.; Eggensperger, K.; Springenberg, J.; Blum, M.; Hutter, F. Efficient and robust automated machine learning. Adv. Neural Inf. Process. Syst. 2015, 28, 2962–2970. [Google Scholar]
  53. Machine Learning Professorship Freiburg. auto-sklearn—AutoSklearn 0.15.0 documentation. Machine Learning Professorship Freiburg. Available online: https://automl.github.io/auto-sklearn/master/ (accessed on 26 September 2025).
  54. Erickson, N.; Mueller, J.; Shirkov, A.; Zhang, H.; Larroy, P.; Li, M.; Smola, A. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv 2020, arXiv:2003.06505. [Google Scholar] [CrossRef]
  55. Truong, A.; Walters, A.; Goodsitt, J.; Hines, K.; Bruss, C.B.; Farivar, R. Towards automated machine learning: Evaluation and comparison of AutoML approaches and tools. In Proceedings of the 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), Portland, OR, USA, 4–6 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1471–1479. [Google Scholar]
  56. Tian, J.; Che, C. Automated machine learning: A survey of tools and techniques. J. Ind. Eng. Appl. Sci. 2024, 2, 71–76. [Google Scholar] [CrossRef]
  57. Baratchi, M.; Wang, C.; Limmer, S.; Van Rijn, J.N.; Hoos, H.; Bäck, T.; Olhofer, M. Automated machine learning: Past, present and future. Artif. Intell. Rev. 2024, 57, 122. [Google Scholar] [CrossRef]
  58. Quaranta, L.; Azevedo, K.; Calefato, F.; Kalinowski, M. A multivocal literature review on the benefits and limitations of industry-leading AutoML tools. Inf. Softw. Technol. 2025, 178, 107608. [Google Scholar] [CrossRef]
  59. An, J.; Kim, I.S.; Kim, K.J.; Park, J.H.; Kang, H.; Kim, H.J.; Kim, Y.S.; Ahn, J.H. Efficacy of automated machine learning models and feature engineering for diagnosis of equivocal appendicitis using clinical and computed tomography findings. Sci. Rep. 2024, 14, 22658. [Google Scholar] [CrossRef] [PubMed]
  60. Wang, J.; Xue, Q.; Zhang, C.W.; Wong, K.K.L.; Liu, Z. Explainable coronary artery disease prediction model based on AutoGluon from AutoML framework. Front. Cardiovasc. Med. 2024, 11, 1360548. [Google Scholar] [CrossRef]
  61. Shoaib, H.A.; Rahman, M.A.; Maua, J.; Rahman, A.; Mridha, M.F.; Kim, P.; Shin, J. An enhanced deep learning approach to potential purchaser prediction: AutoGluon ensembles for cross-industry profit maximization. IEEE Open J. Comput. Soc. 2025, 6, 468–479. [Google Scholar] [CrossRef]
Figure 1. Pseudocode for measuring runtime memory footprint.
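The pseudocode in Figure 1 is not reproduced here, but the idea can be sketched as follows: wrap each generated solution in a timer and a memory tracer, then report elapsed seconds and peak megabytes, the two efficiency metrics used throughout the tables. The sketch below is a minimal, illustrative version based on Python's tracemalloc; the function name and the tracing backend are assumptions rather than the exact procedure shown in the figure.

import time
import tracemalloc

def run_with_profiling(pipeline_fn, *args, **kwargs):
    """Run a training/inference callable while recording execution time and
    peak memory. `pipeline_fn` is any callable that runs an LLM-generated
    solution and returns its result (illustrative, not the exact Figure 1 procedure)."""
    tracemalloc.start()                      # begin tracking Python allocations
    start = time.perf_counter()
    result = pipeline_fn(*args, **kwargs)    # e.g., the generated solution script
    elapsed_s = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    peak_mb = peak_bytes / (1024 ** 2)       # convert to MB, as reported in the tables
    return result, elapsed_s, peak_mb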
Figure 2. Summary of LLM performance across four Kaggle benchmarks under two prompting strategies. (a) Titanic—Machine Learning from Disaster; (b) House Prices—Advanced Regression Techniques; (c) Digit Recognizer; (d) Natural Language Processing with Disaster Tweets. FSP: Few Shot Prompting; ToT: Tree of Thoughts; RMSE: Root Mean Squared Error.
Figure 3. Heatmaps showing the effect of hyperparameter tuning on test accuracy for the Titanic task across LLMs and prompting strategies. (a) Non-tuned models; (b) HP-tuned models. FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 4. Efficiency metrics of Titanic task across LLMs and prompting strategies. (a) Peak memory usage (MB); (b) Execution time (s, log scale). FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 5. Heatmaps showing the effect of hyperparameter tuning on test RMSE for the House Prices task across LLMs and prompting strategies. (a) Non-tuned models; (b) HP-tuned models. FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 6. Efficiency metrics of House Prices task across LLMs and prompting strategies. (a) Peak memory usage (MB); (b) Execution time (s, log scale). FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 7. Heatmaps showing the effect of hyperparameter tuning on test accuracy for the Digit Recognizer task across LLMs and prompting strategies. (a) Non-tuned models; (b) HP-tuned models. FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 8. Efficiency metrics of Digit Recognizer task across LLMs and prompting strategies. (a) Peak memory usage (MB); (b) Execution time (s, log scale). FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 9. Heatmaps showing the effect of hyperparameter tuning on test F1 score for the Disaster Tweets task across LLMs and prompting strategies. (a) Non-tuned models; (b) HP-tuned models. FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 10. Efficiency metrics of Disaster Tweets task across LLMs and prompting strategies. (a) Peak memory usage (MB); (b) Execution time (s, log scale). FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 11. Heatmaps showing the effect of hyperparameter tuning on RMSE for the Predicting the Beats-per-Minute of Songs task across LLMs and prompting strategies. (a) Non-tuned models; (b) HP-tuned models. FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 12. Efficiency metrics of Predicting the Beats-per-Minute of Songs task across LLMs and prompting strategies. (a) Peak memory usage (MB); (b) Execution time (s, log scale). FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Table 1. Ranking of the indefinite competitions on Kaggle based on the number of total teams.
Rank | Name | Total Teams | Submissions
1 | Titanic—ML from Disaster | 15,866 | 60,236
2 | Housing Prices Competition for Kaggle Learn Users | 5966 | 18,933
3 | House Prices—Advanced Regression Techniques | 4653 | 22,907
4 | Spaceship Titanic | 2116 | 13,123
5 | Digit Recognizer | 1630 | 4702
6 | NLP with Disaster Tweets | 901 | 3520
7 | Store Sales—Time Series Forecasting | 746 | 2643
8 | LLM Classification Finetuning | 255 | 1091
9 | Connect X | 193 | 508
10 | I'm Something of a Painter Myself | 129 | 315
All counts (Table 1) were recorded from the Kaggle public competition page on 18 May 2025; values may change over time due to account removals, team merges, or Kaggle's periodic data updates.
Table 2. Overview of Kaggle datasets used for comparative evaluation.
Dataset | Size (Train/Test) | Independent Attributes | Dependent Attribute | Data Type | Task
Titanic | Train 891 / Test 418 | 10 (+Passenger ID column) | Survived (Yes: 1/No: 0) | Numerical & Categorical | Binary Classification
House Price | Train 1460 / Test 1459 | 79 (+ID column) | Sale Price (Continuous) | Numerical & Categorical | Regression
MNIST | Train 42,000 / Test 28,000 | 784 pixels | Label (Digit 0–9) | Numerical (flattened 28 × 28 pixels) | Multiclass Classification
Disaster Tweets | Train 7613 / Test 3263 | 3 (+ID column) | Disaster (Relevant: 1/Irrelevant: 0) | Numerical & String | Binary Text Classification (NLP)
Predicting the Beats-per-Minute of Songs | Train 524,164 / Test 174,722 | 9 (+ID column) | Beats Per Minute (Continuous) | Numerical | Regression
Table 3. Kaggle competition datasets and access dates used in this study.
Competition | Data Version | Date of Access
Titanic—ML from Disaster | N/A (competition data) | 3–7 June 2025
House Prices—Advanced Regression Techniques | N/A (competition data) | 7–13 June 2025
Digit Recognizer | N/A (competition data) | 13–26 June 2025
NLP with Disaster Tweets | N/A (competition data) | 15 June–15 July 2025
Predicting the Beats-per-Minute of Songs | N/A (competition data) | 18–25 September 2025
Table 4. Configuration details of the LLMs used in the experiments.
Model | Version | Platform | Context Length | Inference Setting
OpenAI o3 | o3 | OpenAI ChatGPT/Web | Up to 200 K | Default
GPT-4.1 | gpt-4.1 | OpenAI ChatGPT/Web | Up to 1 M | Default
GPT-5 Thinking | standard | OpenAI ChatGPT/Web | Up to 400 K | Default
Gemini 2.5 Pro | gemini-pro | Google AI Studio | ~1 M | Default
DeepSeek-V3 | deepseek-coder-v3-0324 | Hugging Face/Web interface | 128 K | Default
DeepSeek-R1 | deepseek-coder-R1 | Hugging Face/Web interface | 128 K | Default
Claude Opus 4 | claude-4-opus | Claude AI Web (Anthropic) | Up to 200 K | Default
Claude Sonnet 4 | claude-4-sonnet | Claude AI Web (Anthropic) | Up to 200 K | Default
Table 5. Summary of prompting templates designed for the FSP and ToT approaches. FSP: Few Shot Prompting; ToT: Tree of Thoughts; MLOps: Machine Learning Operations.
Role Definition
  FSP: You are a Kaggle Grandmaster and senior MLOps engineer.
  ToT: You are a Kaggle Grandmaster and senior MLOps engineer.
Task Description
  FSP: Build the best possible, fully-reproducible solution for the "Kaggle House Prices—Advanced Regression Techniques" competition.
  ToT: Build the best possible, fully-reproducible solution for the "Kaggle House Prices—Advanced Regression Techniques" competition.
Requirements
  FSP: Load train.csv & test.csv; perform cleaning, feature engineering, and exploratory data analysis; select, train & evaluate one or more algorithms; apply CV; optimize hyperparameters; save submission.csv; set seed.
  ToT: Same technical requirements, plus: at each step, brainstorm alternatives, evaluate them, and select the best strategy before continuing.
Output Constraints
  FSP: A single line starting with PLAN:; Python code blocks only; no markdown, comments, or explanations outside code.
  ToT: A single PLAN: line (step-by-step strategy); one or more Python code blocks; no extra prose or commentary outside those.
Example PLAN Output
  FSP: 1. Load data. 2. Preprocess (impute, encode). 3. Train with CV. 4. Predict. 5. Save submission.
  ToT: 1. Load data. 2. Compare feature engineering opt. 3. Choose best algorithm strategy. 4. Train and tune. 5. Predict and save.
In-Context Examples
  FSP: Presents representative input–output pairs (Example 1, Example 2) with full code implementations to clarify the intended output structure.
  ToT: Relies on internal reasoning at inference time, without supplying prior example demonstrations.
Example Code Block
  FSP: import pandas as pd, numpy as np; from sklearn.linear_model import Ridge
  ToT: import pandas as pd, numpy as np; from catboost import CatBoostRegressor; from xgboost import XGBRegressor
Reasoning Type
  FSP: Learns from examples; extracts structure from the examples.
  ToT: Thinks iteratively; evaluates alternatives; chooses the best solution through reasoning.
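To make the FSP structure in Table 5 concrete, the sketch below assembles a prompt of this shape as a Python string. The wording is paraphrased from the table and the variable names are illustrative; this is not the verbatim prompt used in the experiments.

# Illustrative assembly of an FSP-style prompt following the Table 5 template.
ROLE = "You are a Kaggle Grandmaster and senior MLOps engineer."
TASK = ("Build the best possible, fully-reproducible solution for the "
        "Kaggle House Prices - Advanced Regression Techniques competition.")
REQUIREMENTS = [
    "Load train.csv & test.csv",
    "Perform cleaning, feature engineering, and exploratory data analysis",
    "Select, train & evaluate one or more algorithms",
    "Apply cross-validation",
    "Optimize hyperparameters",
    "Save submission.csv",
    "Set a random seed",
]
CONSTRAINTS = ("Start with a single line beginning with PLAN:, then give Python "
               "code blocks only, with no markdown or prose outside the code.")
EXAMPLE_PLAN = ("PLAN: 1. Load data. 2. Preprocess (impute, encode). "
                "3. Train with CV. 4. Predict. 5. Save submission.")

fsp_prompt = "\n".join([
    ROLE,
    TASK,
    "Requirements:",
    *[f"- {req}" for req in REQUIREMENTS],
    CONSTRAINTS,
    "Example PLAN output:",
    EXAMPLE_PLAN,
])
print(fsp_prompt)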
Table 6. Auto-sklearn configuration summary for the evaluated tasks. RMSE: Root mean squared error.
Framework & version: auto-sklearn 0.15.0, Python 3.10.18 (all tasks)
Metric: Accuracy (Titanic), RMSE (House Price), RMSE (Beats per Minute of Songs), F1 Score (NLP with Disaster Tweets), Accuracy (Digit Recognizer)
seed: 42 (all tasks)
memory_limit: 8192 MB (all tasks)
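A minimal sketch of how the Table 6 settings map onto the auto-sklearn API is given below for the Titanic (accuracy) case. The one-hour time budget and the get_dummies preprocessing are assumptions; Table 6 only fixes the metric, seed, and memory limit.

# Sketch of an auto-sklearn 0.15 run matching the Table 6 settings (Titanic case).
import pandas as pd
from autosklearn.classification import AutoSklearnClassifier
from autosklearn.metrics import accuracy

train = pd.read_csv("train.csv")
X = pd.get_dummies(train.drop(columns=["Survived", "PassengerId"]))  # assumed preprocessing
y = train["Survived"]

automl = AutoSklearnClassifier(
    metric=accuracy,               # accuracy for Titanic; RMSE/F1 for the other tasks
    seed=42,                       # as in Table 6
    memory_limit=8192,             # MB, as in Table 6
    time_left_for_this_task=3600,  # assumed budget, not specified in the table
)
automl.fit(X, y)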
Table 7. AutoGluon configuration summary for the evaluated tasks. RMSE: Root mean squared error.
Framework & version: AutoGluon-Tabular 1.4.0, Python 3.12.x (all tasks)
Preset: high_quality
Metric: Accuracy (Titanic), RMSE (House Price, Beats per Minute of Songs), F1 Score (NLP with Disaster Tweets), Accuracy (Digit Recognizer)
time_limit: 7200 s or 14,400 s (task-dependent)
auto_stack: True; excluded_model_types: None; seed: 42
ag_args_fit: max_memory_usage_ratio: 0.8, with num_gpus: 0 or num_gpus: 1 (task-dependent)
hyperparameter_tune_kwargs: auto or num_trials: 80 (task-dependent)
num_stack_levels: 0 (where set); num_bag_folds: 3 (where set)
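Similarly, the Table 7 settings correspond roughly to the AutoGluon call sketched below, shown for the Titanic task on CPU. The preset, time limit, stacking flag, memory ratio, and tuning option come from the table; the column handling and everything else are illustrative assumptions.

# Sketch of an AutoGluon-Tabular 1.4 run using the Table 7 settings (Titanic case).
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")

predictor = TabularPredictor(
    label="Survived",
    eval_metric="accuracy",          # RMSE / F1 for the other tasks
).fit(
    train_data.drop(columns=["PassengerId"]),
    presets="high_quality",
    time_limit=7200,                 # 7200 s here; 14,400 s for the larger tasks
    auto_stack=True,
    ag_args_fit={"max_memory_usage_ratio": 0.8, "num_gpus": 0},
    hyperparameter_tune_kwargs="auto",
)
print(predictor.leaderboard())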
Table 8. Results for the Titanic task: LLM × prompting strategy × hyperparameter tuning. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4. CV Acc.: Cross Validation Accuracy; Exec. Time: Execution Time; FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned; GBC: Gradient Boosting Classifier; RFC: Random Forest Classifier.
(a) Columns: o3 | o3 (HP-tuned) | GPT-4.1 | 4.1 (HP-tuned) | Gemini 2.5 Pro | 2.5 Pro (HP-tuned)
FSP
  CV Acc.: 0.8316 | 0.844 | 0.8260 | 0.8339 | 0.8395 | 0.8384
  Test Acc.: 0.75358 | 0.75358 | 0.77990 | 0.76555 | 0.77990 | 0.77033
  Exec. Time: 27 s | 1 min 43 s | 25 s | 1 min 22 s | 24 s | 2 min 33 s
  Memory (MB): 282.12 | 278.36 | 288.18 | 288.58 | 279.84 | 298.91
  Algorithm: GBC | GBC | RFC | RFC | RFC | RFC
ToT
  CV Acc.: 0.8137 | 0.8294 | 0.8137 | 0.8406 | 0.8227 | 0.8316
  Test Acc.: 0.74641 | 0.76555 | 0.74641 | 0.77272 | 0.74880 | 0.74401
  Exec. Time: 29 s | 44 s | 28 s | 1 min 3 s | 23 s | 48 s
  Memory (MB): 296.57 | 273.26 | 296.32 | 297.64 | 272.04 | 293.06
  Algorithm: RFC | RFC | RFC | RFC | RFC | RFC
(b) Columns: DeepSeek-V3 | V3 (HP-tuned) | DeepSeek-R1 | R1 (HP-tuned) | Claude Opus 4 | Opus 4 (HP-tuned) | Claude Sonnet 4 | Sonnet 4 (HP-tuned)
FSP
  CV Acc.: 0.79232 | 0.826 | 0.8227 | 0.8328 | 0.8271 | 0.8351 | 0.8339 | 0.8395
  Test Acc.: 0.75119 | 0.78229 | 0.76555 | 0.77990 | 0.74880 | 0.76794 | 0.76794 | 0.76555
  Exec. Time: 24 s | 1 min 37 s | 26 s | 1 min 36 s | 36 s | 4 min 25 s | 29 s | 3 min 6 s
  Memory (MB): 282.03 | 283.47 | 277.07 | 291.67 | 295.98 | 295.02 | 281.41 | 306.54
  Algorithm: RFC | RFC | RFC | RFC | GBC | RFC | RFC | RFC
ToT
  CV Acc.: 0.813 | 0.8294 | 0.8103 | 0.8350 | 0.8182 | 0.8339 | 0.8159 | 0.8395
  Test Acc.: 0.74641 | 0.76555 | 0.73684 | 0.77751 | 0.74401 | 0.76555 | 0.74641 | 0.76794
  Exec. Time: 26 s | 41 s | 31 s | 42 s | 29 s | 3 min 27 s | 30 s | 3 min 46 s
  Memory (MB): 282.22 | 273.88 | 287.54 | 286.63 | 301.98 | 300.11 | 296.32 | 309.67
  Algorithm: RFC | RFC | RFC | RFC | RFC | RFC | RFC | RFC
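The HP-tuned Titanic runs in Table 8 mostly converge on a tuned Random Forest. A minimal sketch of that kind of search is given below, using the DeepSeek-V3 grid later reported in Table 16; the 5-fold cross-validation, the dropped text-like columns, and the median imputation are assumptions rather than details taken from any generated script.

# Sketch of Random Forest hyperparameter tuning for Titanic (grid from Table 16).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Load the Kaggle Titanic training file and apply simple, assumed preprocessing.
train = pd.read_csv("train.csv")
X = pd.get_dummies(train.drop(columns=["Survived", "PassengerId", "Name", "Ticket", "Cabin"]))
X = X.fillna(X.median())
y = train["Survived"]

# Grid taken from the DeepSeek-V3 column of Table 16.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # assumed fold count
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))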
Table 9. Results for the House Prices task: LLM × prompting strategy × hyperparameter tuning. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4. RMSE: Root Mean Squared Error; Exec. Time: Execution Time; FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned; LGBMR: Light Gradient Boosting Machine Regressor; XGBR: eXtreme Gradient Boosting Regressor; CBR: CatBoost Regressor; GBR: Gradient Boosting Regressor; RFR: Random Forest Regressor.
(a)
LLMOpenAIGPTGemini
Serieso3o3
(HP-Tuned)
4.14.1
(HP-Tuned)
2.5 Pro2.5 Pro
(HP-Tuned)
FSPRMSE0.124040.128020.123010.125300.120070.12256
Exec. Time1 min 42 s7 min 58 s32 s28 min 13 s1 min 43 s41 s
Memory (MB)438.14299.00273.68842.48454.28420.48
AlgorithmLGBMR + XGBR + RidgeXGBRRidge + LassoLGBMR
+ XGBR+ CBR
LGBMR + XGBR+ CBR + RidgeLGBMR + XGBR + CBR
ToTRMSE0.121310.130980.130520.121410.121610.12211
Exec. Time5 min 8 s5 min 8 s22 s1 min 48 s34 s35 s
Memory (MB)274.68365.88368.68298.44375.98380.50
AlgorithmLasso + ElasticNet + GBRLGBMRLGBMRLasso + XGBRRidge + LGBMRRidge + LGBMR
(b)
LLMDeepSeekClaude
SeriesV3V3
(HP-tuned)
R1R1
(HP-tuned)
Opus 4Opus 4
(HP-tuned)
Sonnet 4Sonnet 4
(HP-tuned)
FSPRMSE0.124520.125080.126020.123170.126910.123610.124460.12487
Exec. Time4 min 48 s8 min 28 s1 min 26 s51 s55 s22 min 26 s44 s3 min 18 s
Memory (MB)442.81483.73458.61438.46439.98590.97413.82455.34
AlgorithmXGBR+ LGBMR+ CBR,
Final Model: Ridge
XGBR + LGBMR + CBRCBR+ XGBR+ LGBMRXGBR+ LGBMR+ CBR+ RidgeCBR+ XGBR+ LGBMRMeta Model: Ridge
XGBR+ LGBMR+ CBR
LGBMR+ XGBR+
CBR
LGBM+ XGBR+
CBR
ToTRMSE0.130270.130950.146490.128430.136600.123400.133480.12438
Exec. Time35 s4 min 50 s43 s10 h 41 min 50 s40 s3 min 16 s41 s1 min 45 s
Memory (MB)363.61437.12280.83383.1309.52393.02417.71425.72
AlgorithmLGBMRLGBMR+ XGBR+
CBR
RFRLGBMRRidge +
Lasso+ RFR+ XGBR
Ridge + Lasso + ElasticNet + XGBR + LGBMRRFR+ XGBR+ LGBMRXGBR+ LGBMR+ Ridge+ ElasticNet+ RFR
Table 10. Results for the Digit Recognizer task: LLM × prompting strategy × hyperparameter tuning. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4. CV Acc.: Cross Validation Accuracy; Exec. Time: Execution Time; FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned; CNN: Convolutional Neural Network; VGG: Visual Geometry Group.
(a)
LLMOpenAIGPTGemini
Serieso3o3
(HP-Tuned)
4.14.1
(HP-Tuned)
2.5 pro2.5 pro
(HP-tuned)
FSPCV Acc.99.28570.98760.98530.990.98720.9956
Test Acc.0.991250.989390.987500.991280.986750.99450
Exec. Time23 min 3 s14 min 28 s 2 min 8 s 3 min 13 s4 min 33 s37 min 17 s
Memory (MB)3194.353067.412772.102721.932817.534200.00
AlgorithmCNNCNNCNNCNNCNNCNN
ToTCV Acc.99.278699.457140.99290.992240.995690.9937
Test Acc.0.994670.995170.993100.992530.996210.99639
Exec. Time8 min 7 s19 min 33 s13 min 26 s44 min 59 s26 min 8 s45 min 15 s
Memory (MB)2020.122050.821760.642027.672632.752451.20
AlgorithmSmall CNNMedium CNNSimple CNNSimple CNNVGG like CNNCNN
(b)
LLMDeepSeekClaude
SeriesV3V3
(HP-tuned)
R1R1
(HP-tuned)
Opus 4Opus 4
(HP-tuned)
Sonnet 4Sonnet 4
(HP-tuned)
FSPCV Acc.0.98600.99570.99110.98950.98790.9960.990.99
Test Acc.0.989960.993210.992250.992710.983850.994820.987030.99246
Exec. Time6 min 39 s13 min 7 s5 min 27 s9 min 53 s8 min 34 s12 min 43 s2 min 7 s27 min 4 s
Memory (MB)4143.052697.582752.932714.183079.885080.943073.102927.02
AlgorithmCNNCNNCNNCNNCNNCNNCNNCNN+ RFC
ToTCV Acc.97.830.99260.99160.983799.4899.490.993290.99314
Test Acc.0.981170.988530.992600.985140.995460.995390.993350.99307
Exec. Time1 min 48 s1 min 42 s6 min 26 s41 min 1 s19 min 55 s20 min 25 s17 min 29 s15 min 42 s
Memory (MB)2630.612154.872226.444344.482198.39 2172.882912.292912.65
AlgorithmCNNCNNCNNCNNSimple CNNSimple CNNCNNCNN
Table 11. Results for the NLP with Disaster Tweets task: LLM × prompting strategy × hyperparameter tuning. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4. CV F1 score: Cross Validation F1 score; Exec. Time: Execution Time; FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned; TF-idf: Term Frequency–Inverse Document Frequency; LRC: Logistic Regression Classifier; LinearSVC: Linear Support Vector Classifier; RFC: Random Forest Classifier; XGBC: eXtreme Gradient Boosting Classifier; SVM: Support Vector Machine; LGBMC: Light Gradient Boosting Machine Classifier; NB: Naive Bayes.
(a)
LLMOpenAIGPTGemini
Serieso3o3
(HP-Tuned)
4.14.1
(HP-Tuned)
2.5 Pro2.5 Pro
(HP-Tuned)
FSPCV F1 Score0.79270.79720.80110.78080.838300.80494
Test F1 Score0.797420.794050.796500.783020.836950.79865
Exec. Time28 s25 s22 s8 min 52 s25 min 25 s22 s
Memory (MB)284.97288.98314.32391.314864.11279.46
AlgorithmTF-idf +LRCTF-idf +LRCTF-idf +LRCTF-idf +
Stacking Classifier
Hugging Face
Bert-Base-Uncased
TF-idf +LRC
ToTCV F1 Score0.79180.81410.79230.80740.800080.80192
Test F1 Score0.790070.814280.789150.806920.787920.79926
Exec. Time39s41 s35 s28 s27 s47 s
Memory (MB)363.88429.84352.58348.71273.73285.75
AlgorithmTF-idf + LinearSVCTF-idf +LRCTF-idf + LinearSVCTF-idf +LRCTF-idf +LRCTF-idf +LRC
(b)
LLMDeepSeekClaude
SeriesV3V3
(HP-tuned)
R1R1
(HP-tuned)
Opus 4Opus 4
(HP-tuned)
Sonnet 4Sonnet 4(HP-tuned)
FSPCV F1 Score0.73440.73050.80240.8028LRC: 0.8007
RFC: 0.7787
XGBC: 0.7906
0.81010.78920.7950
Test F1 Score0.798950.797110.798340.796810.796190.807530.793130.79129
Exec. Time36 s31 s21 s1 min 12 s1 min 1 s49 min 5 s31 s7 h 8 min 35 s
Memory (MB)269.09316.75280.21272.31333.52586.85292.432221.02
AlgorithmTF-idf + LRCTF-idf + LRCTF-idf + LRCTF-idf + LRCTF-idf+
(LRC/
RFC/
XGBC)
TF-idf +
(LRC/SVM/LGBMC)
TF-idf + LRC+ NBTF-idf+ LRC+
SVM
ToTCV F1 Score0.77740.73160.73230.80090.78870.79020.77930.7760
Test F1 Score0.789760.795280.798950.797730.783320.785470.789450.77811
Exec. Time1 min 36 s25 s23 s33 s32 s3 min 68 s26 s3 h 51 min 51 s
Memory (MB)319.79284.76283.12285.931804.99329.46389.113875.11
AlgorithmTF-idf + LRCTF-idf + LRCTF-idf + LRCTF-idf + LRCTF-idf+ LGBMC TF-idf + XGBCTF-idf + LGBMCTF-idf +
XGBC+ SVM + NB
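Across Table 11, nearly every model settles on a TF-IDF representation with a linear classifier, most often Logistic Regression. The sketch below shows that pipeline in both an untuned form and a lightly HP-tuned form; the vectorizer settings, the regularization grid, and the fold count are illustrative assumptions.

# Sketch of the TF-IDF + Logistic Regression pipeline dominant in Table 11.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

train = pd.read_csv("train.csv")   # Disaster Tweets training file ("text", "target" columns)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),   # assumed settings
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Untuned baseline: cross-validate the default pipeline (FSP-style run).
f1_cv = cross_val_score(pipeline, train["text"], train["target"], scoring="f1", cv=5).mean()

# HP-tuned variant: a small grid over the regularization strength.
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, scoring="f1", cv=5)
search.fit(train["text"], train["target"])
print(round(f1_cv, 4), search.best_params_)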
Table 12. Results for the Predicting the Beats-per-Minute of Songs task: LLM × prompting strategy × hyperparameter tuning. RMSE: Root Mean Squared Error; Exec. Time: Execution Time; FSP: Few-Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned; LGBMR: Light Gradient Boosting Machine Regressor; XGBR: eXtreme Gradient Boosting Regressor; RFR: Random Forest Regressor; CBR: CatBoost Regressor; GBR: Gradient Boosting Regressor; HGBR: Histogram-based Gradient Boosting Regressor; ETR: Extra Trees Regressor.
LLMGPTGeminiDeepSeek
Series5 Thinking5 Thinking
(HP-Tuned)
2.5 Pro2.5 Pro
(HP-Tuned)
V3V3
(HP-Tuned)
FSPRMSE26.3926626.4019226.3908126.4047926.3955426.38663
Exec. Time3 h 31 min 18 s1 h 22 min 16 s1 h 24 min 46 s15 min 15 s35 min 44 s9 min 41 s
Memory (MB)1123.58957.88791.73606.72905.31697.42
AlgorithmLGBMRCBR/
XGBR/Ridge
Ridge/
RFR/
LGBMR/
XGBR/
CBR
LGBMRRidge/
RFR/
LGBMR/
XGBR/
CBR
LGBMR
ToTRMSE26.3873426.3876026.3887926.3929626.4227326.38801
Exec. Time3 h 35 min 29 s2 h 55 min 28 s47 s1 min 21 s46 min 28 s23 min 21 s
Memory (MB)461.141987.59684.24631.26480.211278.91
AlgorithmRidge/
Lasso/
ElasticNet/
RFR/
GBR/
HGBR
Ridge/
ElasticNet/ETR/HGBR/LGBMR/XGBR/CBR

Ridge blend TOP_K OOF/TEST preds
LGBMRLGBMRRFRXGBR
Table 13. Appropriate prompting techniques and practical trends across different Kaggle tasks. FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Task | Search/Design Space | Appropriate Prompt Technique | In Practice (General Trend) | Best
Titanic | Narrow | FSP | FSP | FSP (HP-tuned)
House Prices | Medium | FSP (usually) | FSP (non-tuned)/ToT (HP-tuned) | FSP (non-tuned)
Digit Recognizer | Very Large | FSP | ToT | ToT (HP-tuned)
Disaster Tweets | Large | ToT | FSP (non-tuned)/ToT (HP-tuned) | FSP (non-tuned)
Beats-per-Minute of Songs | Large | FSP (usually) | FSP (non-tuned)/ToT (HP-tuned) | FSP (HP-tuned)
Table 14. Comparison of LLM-generated and human-submitted solutions across benchmark tasks in terms of accuracy, F1 score, RMSE, execution time, algorithm choice, and Kaggle leaderboard/notebook results. Exec. Time: Execution Time; FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned; RFC: Random Forest Classifier; GBT: Gradient Boosted Trees; LGBMR: Light Gradient Boosting Machine Regressor; XGBR: eXtreme Gradient Boosting Regressor; CBR: CatBoost Regressor; CNN: Convolutional Neural Network; Distil BERT: Bidirectional Encoder Representations from Transformers.
Titanic, best LLM: DeepSeek-V3 (FSP, HP-tuned) vs. Kaggle notebook
  Accuracy: 0.78229 vs. 0.80143
  Exec. Time: 1 min 37 s vs. 5 min 37 s
  Algorithm: RFC vs. GBT
  Kaggle Leaderboard %: 17.3 vs. 2.82
House Price, best LLM: Gemini 2.5 Pro (FSP) vs. Kaggle notebook
  RMSE: 0.12007 vs. 0.12096
  Exec. Time: 1 min 43 s vs. 47 s
  Algorithm: LGBMR + XGBR + CBR + Ridge vs. Regularized Linear Regression Model
  Kaggle Leaderboard %: 5.25 vs. 6.84
Digit Recognizer, best LLM: Gemini 2.5 Pro (ToT, HP-tuned) vs. Kaggle notebook
  Accuracy: 0.99639 vs. 0.99028
  Exec. Time (GPU 100): 45 min 15 s vs. 1 h 47 min 7 s
  Algorithm: CNN vs. CNN
  Kaggle Leaderboard %: 10.4 vs. 32.7
NLP with Disaster Tweets, best LLM: Gemini 2.5 Pro (FSP) vs. Kaggle notebook
  F1 Score: 0.83695 vs. 0.83726
  Exec. Time (GPU 100): 25 min 25 s vs. 5 min 53 s
  Algorithm: Hugging Face Bert-Base-Uncased vs. Distil BERT
  Kaggle Leaderboard %: 10.1 vs. 9.3
Beats-per-Minute of Songs, best LLM: DeepSeek-V3 (FSP, HP-tuned) vs. Kaggle notebook
  RMSE: 26.38663 vs. 26.38020 (20 September 2025)
  Exec. Time: 9 min 41 s vs. -
  Algorithm: LGBMR vs. -
Table 15. Accuracy, Algorithm, and Prompt Techniques: A literature-based evaluation of LLM-guided ML solutions. N/A: not mentioned in the study; HPT: Hyperparameter Tuning; FSP: Few Shot Prompting; ToT: Tree of Thoughts; Alg.: Algorithm; RFC: Random Forest Classifier; XGBoost: eXtreme Gradient Boosting; GBTC: Gradient Boosted Trees Classifier; CNN: Convolutional Neural Network.
TaskModelVersionTuningPrompting TechniquesAccuracyAlgorithm
TitanicGPT [23]3.5NoPreprocess:
FSP
Chain of Thought
Specifying Desired Response Format
HPT:
FSP
Specifying Desired Response Format
0.7511RFC
GPT [23]3.5Yes0.7918RFC
Gemini [23]N/ANo0.7655XGBoost
Gemini [23]N/AYes0.7583XGBoost
Human [23]-N/A-0.7966XGBoost
Kaggle Notebook-Yes-0.80143GBTC
GPT 4.1NoPreprocess+ Alg. Selection+ HPT:
FSP
Specifying Desired Response Format
0.7799RFC
GPT 4.1Yes0.76555RFC
Gemini 2.5 ProNo0.7799RFC
Gemini 2.5 ProYes0.77033RFC
GPT 4.1NoPreprocess+ Alg. Selection+ HPT:
ToT
Specifying Desired Response Format
0.74641RFC
GPT4.1Yes0.77272RFC
Gemini 2.5 ProNo0.7488RFC
Gemini 2.5 ProYes0.74401RFC
Best LLM
DeepSeek
V3YesPreprocess+ Alg. Selection+ HPT:
FSP
Specifying Desired Response Format
0.78229RFC
Digit RecognizerGPT [23]3.5NoClassification + HPT:
FSP
Specifying Desired Response Format
0.9863CNN
GPT [23]3.5Yes0.9960
Gemini [23]N/ANo0.9801
Gemini [23]N/AYes0.9820
Human [23]-N/A-0.9834
Kaggle Notebook-N/A-0.99028
GPT 4.1NoClassification + HPT:
FSP
Specifying Desired Response Format
0.98750
GPT 4.1Yes0.99128
Gemini 2.5 ProNo0.98675
Gemini 2.5 ProYes0.99450
GPT 4.1NoClassification + HPT:
ToT
Specifying Desired Response Format
0.99310
GPT4.1Yes0.99224
Gemini 2.5 ProNo0.99569
Best LLM
Gemini
2.5 ProYes0.99639
Table 16. Performance-oriented literature comparison of hyperparameter and architecture choices by leading LLMs. RFC: Random Forest Classifier; CNN: Convolutional Neural Network.
Titanic (Model: RFC)
  Hyperparameter | GPT [23] | DeepSeek-V3
  n_estimators | [100, 200, 300] | [100, 200, 300]
  max_depth | [None, 10, 20] | [None, 5, 10]
  min_samples_split | [2, 5, 10] | [2, 5, 10]
  min_samples_leaf | [1, 2, 4] | [1, 2, 4]
  max_features | ['auto', 'sqrt'] | -
Digit Recognizer (Model: CNN)
  Hyperparameter | GPT [23] | Gemini 2.5 Pro
  filters | [32, 64] | [32, 64]
  units | [64, 128] | [256]
  learning_rate | [1 × 10^-5, 1 × 10^-2] | [1 × 10^-4, 1 × 10^-2]
  dropout | - | [0.1, 0.6]
  Architecture (GPT [23]): Conv2D(32, (3,3), activation = 'relu') → MaxPooling2D((2,2)) → Conv2D(64, (3,3), activation = 'relu') → MaxPooling2D((2,2)) → Conv2D(64, (3,3), activation = 'relu') → Dense(64, activation = 'relu') → Dense(10, activation = 'softmax')
  Architecture (Gemini 2.5 Pro): Conv2D(32, (3,3), activation = 'relu') → Conv2D(32, (3,3), activation = 'relu') → MaxPooling2D((2,2)) → Dropout(dropout/2) → Conv2D(64, (3,3), activation = 'relu') → Conv2D(64, (3,3), activation = 'relu') → MaxPooling2D((2,2)) → Dropout(dropout/2) → Dense(256, activation = 'relu') → Dropout(dropout) → Dense(10, activation = 'softmax')
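For reference, the Gemini 2.5 Pro architecture listed above translates into the following Keras sketch. The dropout value is one point from the tuned [0.1, 0.6] range, the Flatten layer is added so the dense head can follow the convolutional stack, and the optimizer and learning rate are assumptions within the reported range.

# Keras sketch of the Gemini 2.5 Pro CNN from Table 16 (illustrative, not the generated code).
from tensorflow import keras
from tensorflow.keras import layers

dropout = 0.3  # one value from the tuned [0.1, 0.6] range

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(dropout / 2),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(dropout / 2),
    layers.Flatten(),                          # added so Dense can follow the conv stack
    layers.Dense(256, activation="relu"),
    layers.Dropout(dropout),
    layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.Adam(1e-3),     # learning rate within the tuned range
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()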
Table 17. Comparison of best-performing LLM configurations and AutoML frameworks across five Kaggle tasks. LinearSVC: Support Vector Classification with LIBLINEAR solver; LinearSVR: Support Vector Regression with LIBLINEAR solver; PAC: Passive Aggressive Classifier; GBR: Gradient Boosting Regressor; ETC: Extra Trees Classifier; LGBMR: Light Gradient Boosting Machine Regressor; RFC: Random Forest Classifier; ETR: Extra Trees Regressor; GBTC: Gradient Boosted Trees Classifier; CNN: Convolutional Neural Network; Distil BERT: Bidirectional Encoder Representations from Transformers; XGBR: eXtreme Gradient Boosting Regressor; CBR: CatBoost Regressor.
Titanic: DeepSeek-V3 (FSP & HP-tuned) | auto-sklearn | AutoGluon | Kaggle Notebooks
  Accuracy: 0.78229 | 0.77272 | 0.76315 | 0.80143
  Exec. Time: 1 min 37 s | 59 min 56.5 s | 1 h 37 min 42.7 s | 5 min 37 s
  Algorithm: RFC | LinearSVC | ETC (BAG L1 FULL) | GBTC
House Price: Gemini 2.5 Pro (FSP) | auto-sklearn | AutoGluon | Kaggle Notebooks
  RMSE: 0.12007 | 0.12775 | 0.1181 | 0.12096
  Exec. Time: 1 min 43 s | 1 h 48.5 s | 1 h 36 min 2.1 s | 47 s
  Algorithm: LGBMR + XGBR + CBR + Ridge | LinearSVR | LGBMR (BAG L1/T2 FULL) | Regularized Linear Regression Model
Digit Recognizer: Gemini 2.5 Pro (ToT & HP-tuned) | auto-sklearn | AutoGluon | Kaggle Notebooks
  Accuracy: 0.99639 | 0.98167 | 0.97996 | 0.99028
  Exec. Time (GPU): 45 min 15 s | 1 h 5 min 23.9 s | 2 h 51 min 3.6 s | 1 h 47 min 7 s
  Algorithm: CNN | LinearSVC | RFC (BAG L1 FULL) | CNN
NLP with Disaster Tweets: Gemini 2.5 Pro (FSP) | auto-sklearn | AutoGluon | Kaggle Notebooks
  F1 Score: 0.83695 | 0.78424 | 0.7919 | 0.83726
  Exec. Time (GPU): 25 min 25 s | 1 h 9.9 s | 3 h 30 min 58.2 s | 5 min 53 s
  Algorithm: Hugging Face Bert-Base-Uncased | PAC | ETC (BAG L1 FULL) | Distil BERT
The Beats per Minute of Songs: DeepSeek-V3 (FSP & HP-tuned) | auto-sklearn | AutoGluon | Kaggle Notebooks
  RMSE: 26.38663 | 26.39400 | 26.39029 | 26.38020 (20 September 2025)
  Exec. Time: 9 min 41 s | 59 min 56.8 s | 2 h 52 min 16.6 s | -
  Algorithm: LGBMR | GBR | ETR (BAG L1 FULL) | -
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
