Article

Large Language Models for Machine Learning Design Assistance: Prompt-Driven Algorithm Selection and Optimization in Diverse Supervised Learning Tasks

by
Fidan Kaya Gülağız
Department of Computer Engineering, Faculty of Engineering, Kocaeli University, İzmit 41001, Kocaeli, Turkey
Appl. Sci. 2025, 15(20), 10968; https://doi.org/10.3390/app152010968
Submission received: 7 August 2025 / Revised: 27 September 2025 / Accepted: 3 October 2025 / Published: 13 October 2025

Abstract

Large language models (LLMs) are playing an increasingly important role in data science applications. In this study, the performance of LLMs in generating code and designing solutions for data science tasks is systematically evaluated on real-world tasks from the Kaggle platform. Models from different LLM families were tested under both default settings and configurations with hyperparameter tuning (HPT) applied. In addition, the effects of few-shot prompting (FSP) and Tree of Thought (ToT) strategies on code generation were compared. Alongside technical metrics such as accuracy, F1 score, Root Mean Squared Error (RMSE), execution time, and peak memory consumption, LLM outputs were also evaluated against Kaggle user-submitted solutions, leaderboard scores, and two established AutoML frameworks (auto-sklearn and AutoGluon). The findings suggest that, with effective prompting strategies and HPT, models can deliver competitive results on certain tasks. The ability of some LLMs to suggest appropriate algorithms indicates that LLMs can be seen not only as code generators, but also as systems capable of designing machine learning (ML) solutions. This study presents a comprehensive analysis of how strategic decisions such as prompting methods, tuning approaches, and algorithm selection affect the design of LLM-based data science systems, offering insights for future hybrid human–LLM systems.

1. Introduction

Today, the rapid advancement of the internet and digital technologies has led to a significant increase in both the diversity and volume of data [1]. This growth has created major challenges not only in terms of storage, but also in the interpretation and analysis of data. In this context, the concept of Big Data has emerged, characterized by high volume, velocity, and variety [2,3], pushing the limits of traditional data processing methods. In particular, the rise of unstructured data has rendered classical machine learning (ML) and data mining approaches insufficient. Deep learning (DL) models, developed to overcome this bottleneck, have initiated a new era in data analytics through their ability to learn meaningful representations from large-scale data. These advancements, supported by powerful hardware (Graphics Processing Unit (GPU), Tensor Processing Unit (TPU)) and DL libraries (TensorFlow, PyTorch), have enabled scalable solutions.
One of the fastest-growing areas of DL has been natural language processing (NLP). Large language models (LLMs) based on Transformer architecture have revolutionized text comprehension and generation. These models are also being used successfully in technical tasks such as code generation and debugging, increasing the potential for intelligent assistance in software development.
In this context, comparing the performance of different LLMs has gained academic and industrial importance. However, there are few studies in the literature that provide an objective and systematic analysis of LLMs, especially in the context of code generation. In line with this need, the following section presents a summary of research addressing the role of LLMs in software development processes.

1.1. Related Work

LLMs are increasingly applied not only in natural language generation but also in technical domains such as software development and data science. As a result, topics like code generation capabilities, decision-making processes, and the effectiveness of prompting strategies have gained prominence in the literature. This section reviews relevant studies under four key themes to provide the theoretical foundation for the present work: the code generation capabilities of LLMs; the impact of prompting strategies; the role of LLMs in tasks such as algorithm selection and hyperparameter tuning (HPT) within data science workflows; and the benchmarking and evaluation of LLM performance using datasets.
The ability of LLMs to produce structured outputs such as programming code has been extensively discussed in the literature. Du et al. [4] noted that code generation benchmarks for LLMs mostly examine small-scale code; they therefore evaluated LLMs’ competence in generating class-level code suited to real-world problems and found that LLM performance at the class level is lower than at the method level. Li et al. [5] argued that evaluating the code generation performance of LLMs remains an open problem, proposed EvoCodeBench as a new benchmark, and tested popular LLMs on it. Fakhoury et al. [6] developed a method that guides LLM-assisted code generation step by step with test-based feedback from the user and showed that this approach helps users assess the correctness of the generated code. Coignion et al. [7] evaluated the efficiency of code generated by different LLMs against human-authored solutions on LeetCode problems. Tambon et al. [8] investigated the errors observed in code generated by LLMs and identified the points that should be considered for the security of LLM-generated code. Together, these studies analyze how effective LLMs are in code development.
The output of LLM-based systems depends heavily not only on the capability of the LLM but also on the structure of the prompts. Prompt engineering is therefore a critical process for achieving accurate and efficient outputs, and the impact of different prompting strategies, especially for tasks such as code generation, has been studied extensively. Li et al. [9] proposed a prompting technique for code generation inspired by the structured programming practices of human programmers and showed that contextual sampling improves accuracy (Acc) in code generation. Khojah et al. [10] showed how different prompting techniques used in code generation affect the performance of LLMs. Yang et al. [11] proposed a method demonstrating that chain-of-thought (CoT) strategies in lightweight LLMs significantly improve code generation performance even in resource-constrained environments. Chen et al. [12] reviewed both basic and advanced prompt design techniques and emphasized the key role of prompt design in this field. This body of work clearly shows that prompt engineering affects accuracy, reliability, and interpretability.
LLMs are increasingly regarded as systems that not only generate code but also make decisions such as algorithm selection and HPT. In this respect, the role of LLMs in data science tasks, where they are positioned as an alternative to Automated Machine Learning (AutoML) approaches, has received growing attention in the literature. Yao et al. [13] examined how LLM-based AutoML approaches can improve the accessibility of ML solutions; their study shows the potential of LLMs to make ML systems accessible to non-technical experts. Fathollahzadeh et al. [14] proposed a system that enables LLMs to create more effective and efficient ML workflows by generating dataset-specific instructions, demonstrating that LLMs can be integrated into decision-making processes rather than only writing code. Zhao et al. [15] showed that LLMs can perform algorithm design, implementation, and evaluation independently by breaking down complex AutoML tasks into discrete sub-prompts. Mulakala et al. [16] noted that LLMs have difficulty searching over a large hyperparameter space during fine-tuning and proposed a new technique to address this problem. In another study, Zhang et al. [17] investigated the usability of LLMs in HPT processes. All these studies show that LLMs are positioned not only as passive tools but also as active decision-making systems in data science processes.
Benchmark datasets and test environments used to reliably evaluate the performance of artificial intelligence (AI) systems are of great importance. Accordingly, comparative analyses of LLMs in the literature are frequently conducted on algorithm-oriented benchmarks such as HumanEval and LeetCode, as well as on platforms such as Kaggle, which provide both real-world problem data and an evaluation environment. Wang et al. [18] evaluated the performance of LLMs such as Generative Pre-trained Transformer 4 (GPT-4) and GPT-3.5-turbo in solving various programming problems compiled from the LeetCode platform. Coignion et al. [7] evaluated the code generation efficiency and performance of LLMs on LeetCode problems of different difficulty levels by comparing them with human-written solutions; they found that, for the selected problems, LLMs generate code more efficiently than humans in most cases. Another study [19] evaluated the performance of the GPT-3.5, GPT-4, and GPT-4o models on 15 LeetCode problems in Python, Java, and C++ in terms of runtime and memory usage. Döderlein et al. [20] investigated how the performance of LLM-based code assistants such as Copilot, Codex, and StarCoder2 is affected by changes in inputs (prompt format, context, temperature, etc.) on HumanEval and LeetCode problems. Another study [21] assessed the quality of 984 code samples generated by the GPT-3.5-Turbo and GPT-4 models on the HumanEval dataset by comparing them with human-written code. Mathews et al. [22] examined the impact of the Test-Driven Development (TDD) approach on the code generation of LLMs such as GPT-4 and Llama 3. Although many studies evaluate the code generation performance of LLMs on shorter, algorithm-oriented problems such as HumanEval and LeetCode, very few compare the end-to-end code generation capabilities of LLMs on ML tasks hosted on platforms such as Kaggle. Ko and Kang [23] evaluated the code generation capabilities of GPT and Gemini for ML tasks on three different Kaggle datasets; their results show that GPT performs strongly in HPT, but both models lag behind human developers, especially in tasks such as data preprocessing and feature engineering.
While existing research has mostly focused on short, algorithm-oriented tasks, Ko and Kang [23], one of the few studies to perform a similar benchmark on ML tasks in Kaggle, provides an important starting point. However, that study is limited in terms of the variety of LLMs used, task scope, recency, and evaluation metrics. Moreover, whereas that study divided ML tasks into separate phases and generated solutions with hybrid prompt techniques, in the present study end-to-end code generation is performed with a single prompt structure and the holistic performance of LLMs is evaluated directly. This paper aims to overcome these limitations and examine the capability of LLMs to provide end-to-end solutions to ML tasks with a more comprehensive and up-to-date approach. In addition, while AutoML–LLM comparisons also exist in the literature, they have typically been limited in model scope, task diversity, and recency. In contrast, the present study not only evaluates multiple LLM families but also directly benchmarks them against AutoML baselines and Kaggle user-submitted notebooks, thereby offering a more comprehensive and up-to-date comparative study.

1.2. The Contributions of This Study

The main goal of this paper is to contribute to the comparative evaluation of LLMs in solving machine learning tasks, an area that is lacking in the literature. It therefore examines the performance of different LLMs on five different Kaggle tasks in terms of metrics such as Acc, execution time, and peak memory usage. In addition, code generation is performed for each task using two different prompting techniques. Thus, the differences between LLMs in prompt-based response generation are evaluated, and the results are compared with both direct metric analyses and participant results in the related Kaggle competitions, providing a comprehensive picture of LLM performance in a real-world context. This study makes the following original contributions:
  • It presents a systematic and comparative performance analysis of different LLM families (OpenAI o-series, GPT, Gemini, Claude, DeepSeek) on real-world tasks, which is limited in the literature.
  • By integrating HPT into the code generation process, it is one of the first studies to question the effectiveness of LLMs in terms of not only solution generation but also solution calibration.
  • The impact of two different prompt design strategies (few-shot prompting (FSP) and Tree of Thoughts (ToT)) on end-to-end ML workflows, including data preprocessing, algorithm selection, and code generation, is studied extensively, and the end-to-end solution generation capacity of LLMs guided by these strategies is evaluated.
  • The practical applicability of the generated codes was analyzed in a multidimensional way by comparing their performance with both the public scores of Kaggle participants and the solutions written by experts.
  • Performance evaluation is not limited to classical metrics such as accuracy, F1 score and Root Mean Squared Error (RMSE), but also takes into account resource cost aspects such as execution time and peak memory usage, which are critical in real-world applications.
  • Through the analysis of four different task types (tabular, text, and image data, covering both classification and regression), the stronger and weaker aspects of the models are revealed in detail in the context of the task type.
  • It extends the scope of LLM evaluation by including comparisons with established AutoML frameworks, thereby situating LLM-driven code generation within the broader landscape of machine learning.
With the aspects listed above, the study not only contributes to the technical evaluation of the end-to-end code generation capabilities of LLMs but also provides a practical perspective on how LLMs can be integrated more effectively into ML processes.
The remainder of this paper is organized as follows: Section 2 describes the materials and methods, including the selected LLMs, ML tasks, prompting strategies, the AutoML frameworks included in the study and evaluation metrics. Section 3 presents the experimental results, followed by a detailed discussion in Section 4. Section 5 concludes the paper and highlights future research directions.

2. Materials and Methods

In this section of the paper, the setting chosen for the experimental study, the ML tasks, the chosen LLMs, the prompting strategies, the AutoML frameworks used in the comparisons, and the evaluation metrics are explained systematically and justified.
This study examines LLMs as agents that produce end-to-end, optimized solutions in supervised learning (classification/regression) tasks. For the evaluation to be objective, the Kaggle platform, which provides up-to-date performance data, was preferred. The Kaggle platform offers a fair comparison with its standardized datasets and evaluation process, which minimizes human intervention. With task-specific Leaderboards (LB), it provides transparent access to the performances of teams that offer solutions to the same task. Thanks to the notebook infrastructure it offers, it eliminates hardware differences and enables a fair evaluation in terms of metrics such as memory usage and execution time. Five public Kaggle tasks were selected for this purpose. In the experiments, a single, fixed task description was adapted to two different prompt technique frameworks (FSP and ToT). The task description, data/output format and evaluation criteria were identical in both frameworks; the difference was limited to structural elements such as the presentation of examples and step-by-step reasoning prompts.
In all conditions, only the first response (pass@1) was evaluated. All aspects of the solution, including preprocessing, feature engineering, algorithm selection, training, evaluation, and submission formatting, were handled by the models and are therefore part of the decision space. The code outputs were generated in a single pass, based solely on the prompt content, without any external guidance, manual revision, or post-processing. For code generations that could not be executed due to syntax or basic library/module errors, the incorrect script and its error message were resubmitted to the same LLM, and only an executability-level correction was obtained. No modifications were made to the logical structure of the code, the solution strategy, the algorithm architecture, or the hyperparameter settings, ensuring that the evaluation reflects the models’ independent planning and decision-making capabilities. All conditions were run in the same Kaggle working environment, and the cross-validation (CV)/HP configurations chosen by the agent, the execution time, and the peak memory (MB) values were fully recorded.
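For illustration, the executability-level correction step described above can be sketched as follows; this is a minimal sketch in which run_generated_script and query_llm are hypothetical helpers (the latter wrapping the respective model’s chat interface) and not artifacts of the study.

```python
import subprocess
import sys


def run_generated_script(script_path: str):
    """Run an LLM-generated script and capture its traceback, if any."""
    result = subprocess.run([sys.executable, script_path],
                            capture_output=True, text=True)
    return result.returncode, result.stderr


def executability_fix(script_path: str, query_llm) -> None:
    """Resubmit a non-executable script and its error message to the same LLM.

    Only syntax/import-level corrections are requested; the solution strategy,
    algorithms, and hyperparameters chosen by the model are left untouched.
    """
    returncode, stderr = run_generated_script(script_path)
    if returncode == 0:
        return  # the first response (pass@1) ran as-is
    with open(script_path) as f:
        broken_code = f.read()
    prompt = ("The following script fails with the error shown. Fix only syntax or "
              "library/module errors; do not change the solution strategy, algorithms, "
              f"or hyperparameters.\n\n{broken_code}\n\nError:\n{stderr}")
    with open(script_path, "w") as f:
        f.write(query_llm(prompt))
```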
All experiments were executed in the standard Kaggle notebook environment without any manual modifications, ensuring reproducibility and fairness. Within this standardized setup, four of the five tasks (Titanic, Digit Recognizer, House Prices, and Disaster Tweets) were executed with Python 3.11.11, while the Beats per Minute of Songs task ran under Python 3.11.13, as this experiment was conducted later when the Kaggle environment had been updated. For clarity, only the key library versions are reported here, which remained consistent across tasks: scikit-learn 1.2.2, pandas 2.2.3, numpy 1.26.4, and PyTorch 2.6.0+cu124. Additional libraries available in the Kaggle environment could also be utilized if required, since the prompts did not impose any restrictions on the choice of libraries or the Python runtime environment.

2.1. Machine Learning Tasks for Evaluation

Within the scope of the study, five different ML tasks were selected for the evaluation of the LLM models. The tasks were chosen to be widely known problems based on data that has been studied extensively, and each task was required to belong to a different ML domain. To make a more accurate selection and obtain up-to-date results, the selection was made from the indefinite (open-ended) competitions on Kaggle. First, the indefinite competitions on Kaggle were ranked by total number of teams and submissions; the results are detailed in Table 1. Then, the four competitions with the highest number of teams that correspond to different task types were selected. As can be seen from Table 1, one classification, one regression, one image classification, and one NLP task were selected.
“Titanic—ML from Disaster” (Kaggle. Titanic—Machine Learning from Disaster, https://www.kaggle.com/competitions/titanic, accessed on 7 June 2025) is a binary classification task to predict the survivors of the shipwreck of the Titanic, which sank in 1912, using passenger data. The competition was launched by Kaggle in 2010 and is one of the oldest competitions on the platform. The dataset is divided into train and test sets by Kaggle and contains 12 columns: one represents the class label, one the passenger id, and the remaining 10 are raw feature data. There are 1309 passenger records in total, with 891 passengers in the train set and 418 in the test set, so it can be considered a small dataset. For this reason, it has been observed that combining simple models with different data processing techniques gives more accurate results on this dataset than complex models.
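For reference, the input/output format of this competition can be illustrated with a minimal, self-contained sketch; the file paths follow the standard Kaggle competition layout, and the constant prediction is only a placeholder, not one of the LLM-generated solutions.

```python
import pandas as pd

# Standard Kaggle layout for this competition: train.csv includes the
# Survived label, test.csv does not; submissions list PassengerId/Survived.
train = pd.read_csv("/kaggle/input/titanic/train.csv")   # 891 rows, 12 columns
test = pd.read_csv("/kaggle/input/titanic/test.csv")     # 418 rows, 11 columns

# Placeholder baseline: predict "did not survive" for every passenger.
submission = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": 0})
submission.to_csv("submission.csv", index=False)
```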
The second selected task is “House Prices—Advanced Regression Techniques” (Kaggle. House Prices—Advanced Regression Techniques, https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques, accessed on 13 June 2025). This challenge is a regression problem designed by Kaggle in 2016 to predict house prices. It is also an indefinite competition and is still ongoing today. The dataset is divided into train and test by Kaggle and has a total of 79 independent variables. There are 1460 records in train and 1459 records in test, totaling 2919 records in the dataset. In terms of the number of records it contains, the dataset is more suitable for ML techniques rather than DL techniques.
Another selected task is “Digit Recognizer” (Kaggle. Digit Recognizer, https://www.kaggle.com/competitions/digit-recognizer, accessed on 26 June 2025). The competition was launched in 2012 and is also an indefinite competition. Within the scope of the competition, the MNIST handwritten digit dataset is used to predict handwritten digits. It aims to classify the 28 × 28 pixel grayscale handwritten digit images it receives as input into the correct digit label in the range (0–9). In other words, the competition can be defined as a multi-class image classification task. The dataset contains a total of 70,000 records, 42,000 in train and 28,000 in test. It can be said that it is a medium-sized data set in terms of the number of records.
The last of the initially selected tasks is “NLP with Disaster Tweets” (Kaggle. Natural Language Processing with Disaster Tweets, https://www.kaggle.com/competitions/nlp-getting-started, accessed on 15 July 2025). The competition was launched by Kaggle in 2019 as an indefinite competition. Its aim is to classify texts collected on Twitter into binary categories (0: non-disaster, 1: disaster). Since texts are being classified, the task can be defined as an NLP-based classification task. There are a total of 10,876 tweets in the dataset, 7613 in the train and 3263 in the test set. When evaluated in terms of the number of records, it can again be considered a medium-sized dataset.
The additional task included in the comparison is “Predicting the Beats-per-Minute of Songs” (Kaggle. Predicting the Beats-per-Minute of Songs, https://www.kaggle.com/competitions/playground-series-s5e9, accessed on 20 September 2025), which is part of the 2025 Kaggle Playground Series. The goal of the contest is to predict the tempo of a given track in terms of beats per minute (BPM) based on audio features. The task can be defined as a regression problem, since the objective is to estimate a continuous target variable (BPM). The dataset consists of 524,164 training samples and 174,722 test samples, making it larger than the previously examined toy datasets. While the previous competitions were selected partly based on their large number of participating teams, this task was chosen for different reasons: it is recent (1–30 September 2025), less explored in the literature, and not one of the classical benchmark datasets frequently used in tutorials and training corpora. For this reason, it provides a more challenging and up-to-date benchmark for evaluating the performance of LLM-based approaches. By including this recent competition, our study extends beyond the classical benchmark datasets and demonstrates the applicability of LLM-based approaches to more contemporary and less explored problems.
Table 2 summarizes the Kaggle datasets used to evaluate the performance of FSP and ToT techniques across different task types. For each dataset, the size of the training and test sets, the number and type of independent and dependent variables, data types, and the corresponding ML task (e.g., classification, regression, or NLP) are presented.
The selected datasets cover a diverse range of tasks, including binary and multiclass classification, regression, and NLP, allowing for a comprehensive assessment of prompting strategies across varied domains. In addition to widely known benchmark datasets, a recent competition dataset (Predicting the Beats-per-Minute of Songs, September 2025) was also included to reduce the risk of relying solely on classical and frequently studied problems.
Table 3 summarizes the data access records for the five Kaggle tasks used in the study. Date of Access indicates the date range in which the data files were imported into the Kaggle Notebooks and the codes were executed. Dates are reported as ranges because multiple notebooks were created for each task, one for each LLM model × prompt technique × tuning combination. The data were not downloaded locally; they were used only in the Kaggle working environment. The Data Version column is given as “N/A (competition data)” because there is no official version number on the competition pages.

2.2. Selected LLMs for Experimental Evaluation

In this study, a total of eight different versions of four popular language model families (GPT, Gemini, DeepSeek and Claude), both open source and commercial, were selected. This allows for a cross-family comparison in terms of ecosystem, license and design philosophies, as well as a balanced comparison in terms of performance and accessibility thanks to the selected sub-models. To be able to compare both code development-oriented versions and up-to-date, high-performance versions of different language model families, care was taken to select more than one sub-model from each family. The selected LLM models are detailed in Table 4. All models were queried using the default settings provided by their respective platforms (e.g., temperature, top-p, max tokens, seed). No custom hyperparameters were applied, and prompts were formatted according to each model’s expected interface (chat-style, code block, etc.). This decision was made to reflect real-world usage scenarios where users typically interact with LLMs using default configurations.
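As an illustration of this default-settings interaction, a query might be issued as in the following sketch (shown with the OpenAI Python client; the model identifier is illustrative, the task prompt is abbreviated, and the other families are accessed through their own clients):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# No temperature, top-p, max-token, or seed overrides are passed, so the
# platform defaults apply, mirroring typical real-world usage.
response = client.chat.completions.create(
    model="gpt-4.1",  # illustrative identifier
    messages=[{
        "role": "user",
        "content": "Generate a complete, end-to-end Python solution for the "
                   "Kaggle Titanic competition ... (full task prompt omitted)",
    }],
)
generated_code = response.choices[0].message.content
```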
GPT is a family of language models based on the Transformer architecture, with the first version released in 2018 [24]. Since 2018, it has grown significantly in terms of both the number of parameters and the context window, and different versions have been developed. GPT-4 and its successors are particularly notable for their enhanced reasoning capabilities, which enable more complex inference tasks. In this study, OpenAI o3, which emphasizes deep reasoning capabilities, and GPT-4.1, which draws attention with its code development performance, are included. Although both were developed by OpenAI, o3 is officially presented under a separate “o-series reasoning model” family, independent of the GPT series. Both OpenAI o3 and GPT-4.1 were released in 2025 [25,26,27]. The comparison of the two sub-models aims to reveal whether deep reasoning or broad context support is more critical in code-centric projects. In addition, the GPT series has continued to evolve, and the most recent version, GPT-5 Thinking (OpenAI. (2025). Introducing GPT-5 for developers, https://openai.com/tr-TR/index/introducing-gpt-5-for-developers, accessed on 26 September 2025. OpenAI. (2025). GPT-5, https://openai.com/tr-TR/gpt-5, accessed on 26 September 2025.), was also included in our comparative experiments on a contemporary problem (the Beats-per-Minute task), as GPT-5 became publicly accessible during the study. The inclusion of GPT-5 Thinking enables us to assess whether the latest iteration in the GPT series further improves performance under realistic scenarios.
Developed by Google DeepMind, Gemini is a family of LLMs first released in 2023 with a Transformer-based architecture. The Gemini 2.5 version [28], introduced in 2025, has more advanced reasoning capabilities than previous versions and can provide more accurate answers to questions. According to the technical report published in 2025 [28], the Gemini 2.5 family consists of three versions: Pro, Flash, and Flash Lite. The Pro version is optimized for coding and complex tasks, while the Flash version aims to provide high performance for everyday tasks. Flash Lite is presented as the most cost-effective option and was released as a preview in June 2025. Each of these versions provides different advantages in metrics such as quality, cost, and response time [28]. The Gemini 2.5 Pro version was included in the study because it is the most recent version and has been developed with a focus on coding.
DeepSeek is a China-based company founded in 2023 to develop open-source LLMs [29]. Since 2023, many different sub-versions have been released under the DeepSeek name; the two most recent are V3 and R1, both included in the study with versions released or updated in 2025. Both models incorporate reasoning, but they differ in the focus of their development: the R1 model stands out in technical areas due to its deeper reasoning capability, whereas the V3 model offers reasoning across a wider range of areas but cannot go as deep as R1 in a specific domain [30,31]. Both versions were included in order to test the effect of deep reasoning versus broad reasoning on coding and to compare the better of them with non-open-source LLMs.
Claude is a family of LLMs developed by Anthropic [32]. Its development has continued since its introduction, and the most recent versions, Claude Opus 4 and Claude Sonnet 4, were released in 2025 [33,34]. Opus 4, the most advanced version, stands out with its “extended thinking” capability and promises high performance in both coding and complex tasks [33]. Sonnet 4, on the other hand, is optimized for efficiency and cost-effectiveness, offering a balanced solution for everyday use compared to Opus [33,34]. Both versions are included in this study in order to compare the coding performance of the lighter model and the “extended thinking” version in terms of accuracy, F1 score, RMSE, execution time, memory consumption, and related metrics.

2.3. Prompt-Driven Code Generation with LLMs

With the widespread adoption of LLMs, the concepts of prompt and prompt engineering have become increasingly popular [12,35,36,37]. Obtaining the desired output from an LLM now depends directly on the structure and content of the instructions given by the user. While these instructions are called prompts, prompt engineering can be defined as the systematic, optimized, and purposeful preparation of the instructions given to achieve the desired result [38,39]. A properly constructed prompt significantly increases the accuracy and efficiency of the responses received from the LLM and provides higher-quality and more reliable results [12,40].
Nowadays, a large number of prompt engineering techniques have been developed for different types of tasks, each serving one or more different purposes [37,41,42]. In addition, many academic studies have been conducted to evaluate in which scenarios these techniques work more efficiently, including systematic comparisons and effectiveness analyses [41,42]. Thus, the quality and reliability of the output from LLM is continuously being improved.
In this paper, we evaluate the code generation performance of LLMs for different types of ML tasks. In the literature, it has been observed that the prompts given to LLMs for code generation play a decisive role in the accuracy/F1 score/RMSE and efficiency of the generated code [37,41,42]. Therefore, choosing the appropriate prompt engineering technique for code generation is critical to maximizing LLM performance [43]. In this study, LLM performances are evaluated by using two prompt methods that have been proven to be suitable for code generation. These are FSP and ToT methods.
The FSP technique was covered extensively in a paper published by OpenAI in 2020 [44], which tested the performance of the GPT-3 model with a few-shot learning approach. That study showed that LLMs can successfully perform various tasks with prompts containing only a few examples [44] and emphasized that the FSP method significantly improves model accuracy and performance, especially for tasks such as tagging, translation, and text completion. In addition, several studies have shown that the FSP technique also improves the performance of LLMs on tasks such as code synthesis and code generation [41,45,46]. Overall, the FSP technique, which is based on providing several examples in the prompt, can give good results in fast and simple tasks but may be insufficient in complex tasks [41,47,48].
The other prompting method preferred in the study is the ToT technique [49]. Proposed in 2023, this technique allows LLMs to create multi-step, tree-like reasoning chains, especially for complex tasks [49]. The ToT method builds on the chain-of-thought (CoT) [50] approach and takes it a step further: whereas classical CoT aims to solve a task by building a single chain of reasoning, ToT simultaneously explores different solution paths by constructing a branching tree of ideas over multiple alternatives and selecting the optimal path based on the outputs of the intermediate steps [41,49]. The main purpose of choosing this method is to examine how effective LLM models are in method selection and decision making, especially when solving multi-step problems, depending on the prompting technique applied. GPT-4.1 [26,51], Gemini 2.5 Pro [28], Claude Opus 4 and Claude Sonnet 4 [33], DeepSeek-R1 [31], and DeepSeek-V3 [30] were chosen as LLMs with more advanced reasoning capabilities than their older versions [26,28,30,31,32,51]. In this context, we also analyzed the effectiveness of prompting in these reasoning-oriented LLMs. Thus, based on two different prompt approaches, one standard and one reasoning-based, the code generation performance of LLMs on AI-based tasks is compared.
In both the FSP and ToT setups, the LLMs were provided with detailed task prompts and instructed to independently generate complete code solutions. To examine the role of HPT, we evaluated two variants within each prompting paradigm: one prompt explicitly required the LLM to perform HPT, while the other did not. Combined with the two prompting paradigms, this design resulted in four distinct prompt templates per task. For HPT, the LLMs were not provided with predefined search grids or optimization ranges. Instead, they received open-ended prompts that allowed them to determine both the search strategy (e.g., grid search, random search, Bayesian optimization) and the parameter ranges. This setup ensured that the HPT process reflected the models’ own planning and decision-making capabilities, rather than being constrained by externally imposed configurations.
This design choice reflects the objective of evaluating each LLM as a problem-solving assistant, capable of approximating certain aspects of reasoning and decision-making. Especially in the ToT scenario, models were encouraged to plan multiple alternative strategies and choose among them internally. However, in both prompting approaches, the model’s output was accepted as-is, providing a fair and realistic assessment of out-of-the-box performance under independent execution. A structural summary of the four different prompt variants designed is given in Table 5.
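Table 5 gives a structural summary of these variants; purely for illustration, and under assumptions about the exact wording (which is not reproduced here), the four variants could be assembled from a shared task description as in the following sketch, with the FSP frame prepending worked examples, the ToT frame requesting branching exploration of alternatives, and an optional clause enforcing HPT.

```python
FSP_FRAME = (
    "Here are example solutions to similar supervised learning tasks:\n"
    "{worked_examples}\n\n"
    "Now solve the following task end to end (preprocessing, feature engineering, "
    "algorithm selection, training, evaluation, and submission file):\n"
    "{task_description}\n"
)

TOT_FRAME = (
    "Consider several alternative solution strategies for the task below, reason "
    "about their trade-offs step by step, select the most promising one, and "
    "implement it end to end (preprocessing, feature engineering, algorithm "
    "selection, training, evaluation, and submission file):\n"
    "{task_description}\n"
)

HPT_CLAUSE = (
    "Additionally, perform hyperparameter tuning; choose the search strategy "
    "(e.g., grid search, random search, Bayesian optimization) and the parameter "
    "ranges yourself.\n"
)


def build_prompt(frame: str, task_description: str,
                 worked_examples: str = "", hp_tuned: bool = False) -> str:
    """Assemble one of the four prompt variants (FSP/ToT x HPT on/off)."""
    prompt = frame.format(task_description=task_description,
                          worked_examples=worked_examples)
    return prompt + (HPT_CLAUSE if hp_tuned else "")
```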

2.4. AutoML Frameworks: Setup and Parameters

To strengthen the experimental design and provide fair baselines, two widely used and well-established open-source AutoML frameworks were selected: auto-sklearn [52,53] and AutoGluon [54]. Both have been extensively validated in academic studies and real-world ML applications [55,56,57,58,59,60,61], and they represent different design philosophies in automated model selection and hyperparameter optimization. Including these frameworks allows for a more comprehensive comparison between traditional AutoML systems and LLM-driven approaches.
In this study, auto-sklearn (version 0.15.0, Python 3.10.18) [52,53] was employed as one of the AutoML baselines. Built on top of scikit-learn, auto-sklearn combines Bayesian optimization, meta-learning, and ensemble construction to automatically select algorithms and tune hyperparameters [53]. It has been widely adopted in academic and industrial applications for structured ML problems [55,56,57,58]. Within the scope of this study, the framework was configured with a fixed random seed of 42 and a memory limit of 8192 MB. The evaluation metrics were aligned with the task types (accuracy for Titanic and Digit Recognizer, RMSE for House Prices and Beats-per-Minute of Songs, and F1 for Disaster Tweets). A summary of the settings used in this study is provided in Table 6.
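A minimal sketch of this configuration for a classification task is shown below; stand-in data from scikit-learn is used so that the example is self-contained, whereas the study itself used the competition train/test files and task-appropriate metrics.

```python
import autosklearn.classification
from autosklearn.metrics import accuracy
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in tabular data; the study uses the Kaggle competition datasets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Configuration reported above: fixed seed and 8192 MB memory limit,
# with the metric aligned to the task (accuracy here; F1/RMSE elsewhere).
automl = autosklearn.classification.AutoSklearnClassifier(
    seed=42,
    memory_limit=8192,  # MB
    metric=accuracy,
)
automl.fit(X_train, y_train)
print("Validation accuracy:", accuracy_score(y_val, automl.predict(X_val)))
```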
AutoGluon-Tabular (version 1.4.0, Python 3.12.x) [54] was employed as another AutoML baseline. AutoGluon is an open-source AutoML framework that supports a wide range of model families and offers ensembling and stacking strategies [54]. Within the scope of this study, the framework was used with the high_quality preset, a fixed random seed of 42, and a memory usage ratio limit of 0.8. The time limit was set to 7200 s for Titanic, House Prices, and Beats-per-Minute of Songs, and 14,400 s for Disaster Tweets and Digit Recognizer. The evaluation metrics were aligned with the task types (accuracy for Titanic and Digit Recognizer, RMSE for House Prices and Beats-per-Minute of Songs, and F1 score for Disaster Tweets). Key configurations included enabling auto-stacking, setting three bagging folds, and applying hyperparameter tuning with up to 80 trials where applicable. A summary of the settings used in this study is provided in Table 7.
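A comparable sketch of the AutoGluon-Tabular configuration is given below, again with stand-in data; the fixed seed is approximated by seeding NumPy, and the memory usage ratio limit of 0.8 is omitted because its exact configuration key is not reproduced from the study.

```python
import numpy as np
import pandas as pd
from autogluon.tabular import TabularPredictor
from sklearn.datasets import load_breast_cancer

np.random.seed(42)  # approximates the fixed-seed setting reported above

# Stand-in tabular data; the study uses the Kaggle competition datasets.
data = load_breast_cancer(as_frame=True)
train_df = pd.concat([data.data, data.target.rename("label")], axis=1)

predictor = TabularPredictor(label="label", eval_metric="accuracy").fit(
    train_df,
    presets="high_quality",       # preset used in the study
    time_limit=7200,              # seconds; 14,400 s for the text/image tasks
    auto_stack=True,              # enable auto-stacking
    num_bag_folds=3,              # three bagging folds
    hyperparameter_tune_kwargs={  # up to 80 trials where applicable
        "num_trials": 80, "scheduler": "local", "searcher": "random",
    },
)
print(predictor.leaderboard())
```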

2.5. Evaluation Metrics

To ensure transparent evaluation and fair comparison across different LLMs, all results were calculated based on the official submission files generated for the Kaggle competitions. The evaluation considered multiple metrics, including Acc, RMSE, execution time, and peak memory consumption, all measured directly within the Kaggle Notebook environment. For the Titanic, House Prices, and Beats-per-Minute of Songs tasks, the accelerator setting was left at its default value (“None”), whereas for the Disaster Tweets and Digit Recognizer tasks, the GPU was set to “P100”. Apart from these adjustments, no other changes were made to the notebook configurations, ensuring that all experiments were conducted under standardized conditions.
The metrics used to evaluate LLMs in the study are given in Equations (1)–(6) and Figure 1. Three of the five tasks included in the study are classification tasks. The evaluation of these tasks is performed on Kaggle by comparing the submission files against the hidden test set labels, using accuracy as defined in Equation (1). This automatic evaluation ensures objective and consistent scoring across different models and participants. In the formula, the term number of predictions refers to the total number of records in the submission file, while number of correct predictions denotes the subset of those predictions that exactly match the corresponding ground truth labels.
In the evaluation of the Disaster Tweets task, the F1 metric was employed as the primary performance measure. F1 is defined as the harmonic mean of precision (Equation (2)) and recall (Equation (3)) and is presented in Equation (4). Precision denotes the proportion of tweets predicted as “disaster” that are indeed correct, whereas recall indicates the proportion of all actual disaster tweets that were successfully identified by the model. The foundations of these measures are the concepts of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), which are provided in Equations (2) and (3). Specifically, TP represents the correct classification of a disaster tweet, TN denotes the correct rejection of a non-disaster tweet, FP refers to a non-disaster tweet that is incorrectly classified as disaster, and FN corresponds to a disaster tweet that the model fails to detect (Kaggle. Natural Language Processing with Disaster Tweets, https://www.kaggle.com/competitions/nlp-getting-started, accessed on 15 July 2025).
The other tasks included in the study, House Prices and Predicting the Beats-per-Minute of Songs, are regression tasks. The evaluation for these tasks is carried out using the RMSE metric, as defined in Equation (5). In this formula, the term predicted refers to the outputs of the LLM-generated code provided in the submission file, whereas actual denotes the true target values, which are hidden during evaluation but used by the Kaggle platform for scoring. The variable n represents the total number of predictions made, corresponding to the number of records in the submission file.
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Number of Predictions}} \tag{1}$$
$$\text{precision} = \frac{TP}{TP + FP} \tag{2}$$
$$\text{recall} = \frac{TP}{TP + FN} \tag{3}$$
$$F1\ \text{Score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{4}$$
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\text{Predicted}_i - \text{Actual}_i\right)^2} \tag{5}$$
$$\text{Execution Time} = t_{\text{end}} - t_{\text{start}} \tag{6}$$
Execution time is also considered as one of the evaluation metrics used in this study. Regardless of the task type, it is calculated as shown in Equation (6). In the context of this study, execution time refers to the total wall-clock time required to run all code cells in a notebook, as measured by the Kaggle platform. In the equation, $t_{\text{start}}$ denotes the timestamp at which execution of the first code cell begins, and $t_{\text{end}}$ represents the timestamp at which the final cell completes execution. This metric reflects the total elapsed time of the model’s end-to-end processing, including data loading, preprocessing, training, and evaluation steps. On Kaggle, this measurement is reported automatically and provides a standardized way to compare computational efficiency across different submissions.
In addition to accuracy/F1 score/RMSE and execution time, peak memory usage was also monitored during the execution of the notebook on the Kaggle platform. Memory tracking was implemented using Python’s psutil and resource libraries. As shown in the pseudocode (Figure 1), the memory usage was obtained using psutil. Peak memory usage, representing the maximum resident memory utilized during the entire execution, reflects the highest memory load experienced by the process and provides insight into the model’s memory efficiency under real execution conditions.
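The pseudocode of Figure 1 is not reproduced verbatim here; the following minimal sketch shows one way such measurements can be obtained with psutil and the resource module, consistent with the description above.

```python
import resource
import time

import psutil

process = psutil.Process()
t_start = time.time()

# ... the end-to-end pipeline of the generated notebook runs here ...

t_end = time.time()
current_rss_mb = process.memory_info().rss / (1024 ** 2)
# ru_maxrss is reported in kilobytes on Linux (the Kaggle environment).
peak_rss_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print(f"Execution time: {t_end - t_start:.1f} s")
print(f"Current RSS: {current_rss_mb:.2f} MB, peak RSS: {peak_rss_mb:.2f} MB")
```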

3. Experiments and Results

This section presents the results of experiments using the prompt-based methods and LLMs described in the previous sections. Each LLM was evaluated on different types of ML tasks under standardized conditions, and the code generation capabilities of the LLMs are extensively compared. The results were organized according to the applied prompting technique and HP tuning variants, and the performance differences of the models are clearly and concisely highlighted. (The datasets, codes, log files, and both functional and non-functional code examples generated by the LLMs that support the results presented in this section are provided in the Supplementary Materials).
Figure 2 shows the accuracy, F1 score, and RMSE values obtained by running the codes generated by the OpenAI o3, GPT-4.1, Gemini 2.5 Pro, DeepSeek-V3, DeepSeek-R1, Claude Opus 4, and Claude Sonnet 4 LLMs in the Kaggle environment using four different Kaggle tasks and two different prompting techniques. The figure shows (a) Accuracy for the Titanic task, (b) RMSE for the House Prices task, (c) Accuracy for the Digit Recognizer task, and (d) F1 score for the Disaster Tweets task. Blue bars represent FSP, and orange bars represent the ToT approach. The bars labeled ‘HP tuned’ show the results for codes generated with prompts where HPT was enforced; the other bars show the results for codes generated with prompts where this setting was not enforced. All experiments were conducted under equal hardware/time constraints, with the same data splits and evaluation criteria.
Figure 2 summarizes the general trend of the LLM models in terms of the accuracy, F1 score, and RMSE metrics: in the classification tasks (Titanic, Digit Recognizer, Disaster Tweets), accuracy and F1 score values were concentrated in a narrow band across LLMs (the variance/standard deviation between LLMs was low), while in the regression task (House Prices) RMSE differences remained limited. There is no uniform superiority between the two prompt strategies; ToT produced small gains in some combinations, whereas FSP produced equal or better results in most tasks. HPT, in turn, was not consistently positive; in some models it produced significant gains, while in others test performance remained unchanged or decreased slightly. Below, the effects of LLM × prompt strategy × HPT are analyzed in detail on a task-by-task basis.
Table 8 shows the detailed results of LLM × prompt strategy × HPT combinations for the Titanic task. Also detailed hyper-parameter grids are provided in Table A1 and Table A2 of Appendix A.
In Table 8, columns show the corresponding LLM series, and rows show CV accuracy (k-fold cross-validation accuracy on the training data, where k and the folding method, if any, were left to the code generated by the LLM, so each condition is reported with its own CV setting), Test accuracy (accuracy computed on the Kaggle evaluation server from the submission file on the public LB subset of the test set, whose labels are hidden from participants; only public LB results are reported since the selected competitions are indefinite), execution time, memory (peak memory usage), and Algorithm (the final learning algorithm used in the code generated by the LLM). When comparing the FSP and ToT strategies, in the conditions labeled HP-tuned the prompt enforced HPT in the generated code, while in the other conditions this step was not enforced. All experiments were run on the Kaggle platform under equal hardware and time/memory constraints, with the same data splits and evaluation metrics. Since code generation was performed with deterministic settings, the same code was obtained when the same prompt was repeated, and the results are reported as pass@1.
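Because k and the folding scheme were left to each model, the reported CV accuracies are not directly comparable across conditions; as an illustration of a typical LLM-chosen setup, a stratified 5-fold evaluation might look like the following sketch (the classifier and stand-in data are illustrative, not taken from the generated solutions).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data so the sketch is self-contained; the generated codes use the
# Titanic training set after their own preprocessing.
X, y = load_breast_cancer(return_X_y=True)

model = RandomForestClassifier(n_estimators=200, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.5f} ± {scores.std():.5f}")
```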
Figure 3 visualizes the test accuracy obtained for the Titanic task as heatmaps. On the left (Figure 3a) is the accuracy map for the non-tuned models, and on the right (Figure 3b) is the accuracy map for the HP-tuned models. The rows represent the LLMs, and the columns represent the prompting techniques used. Without HPT, the FSP technique was clearly superior; in this case, the highest value of 0.77990 was obtained for the GPT-4.1 and Gemini 2.5 Pro models. After HPT, the FSP technique still produced codes with higher accuracy than the ToT technique in most cases. Taken as a whole, the code with the highest accuracy (0.78229) was obtained with the DeepSeek-V3 model, the FSP technique, and HPT. In the Titanic task, the FSP technique was generally better, and the ToT technique became competitive only with tuning in some models such as GPT-4.1 and o3. For this task, HPT acted as a lever that could change the ranking, making the ToT technique competitive but not universally beneficial.
Figure 4a shows the peak memory utilization for the Titanic task, measured during the execution of codes generated by FSP and ToT techniques on different LLMs. The Y-axis shows peak memory and the X-axis shows the LLM models and the non-tuned/HP-tuned variants of each. Peak memory consumption was within a narrow range for all model-strategy-tuning combinations and there was no uniform dominance. The differences varied depending on the LLM model × prompting technique × tuning pairing. We observed that HPT did not result in a consistent increase in task-specific memory utilization (it increased for some models while remaining the same or decreasing for others). The lowest peak memory value for this task was 272.04 MB for the Gemini 2.5 Pro model in the non-tuned condition and using the ToT technique. Under the HP-tuned condition, the lowest value was 273.26 MB on the OpenAI o3 model, again with ToT technique.
Figure 4b shows the execution time of the codes generated by the FSP and ToT techniques for the Titanic task when run across the different LLMs. The Y-axis shows the time (seconds) on a logarithmic scale, and the X-axis shows the models (non-tuned/HP-tuned). The figure shows that HP-tuning increased execution time in most models; the magnitude of the increase varied from LLM to LLM. For example, the increase was more pronounced in the Claude 4 family than in the other models. In terms of the FSP and ToT techniques, there was no uniform speed advantage for this task: in some models, ToT was faster, while in others it produced results close to or slower than FSP. When all LLM and tuning variants were considered together, however, the ToT technique was faster on average (ToT: mean 61.9 s, median 36 s; FSP: mean 83.6 s, median 59 s), although model- and tuning-dependent variability remained; therefore, task- and model-based results are reported separately. The shortest execution time for the Titanic task was 23 s, obtained on the Gemini 2.5 Pro model in the non-tuned condition with codes generated by the ToT technique. The fastest result among the HP-tuned conditions was 41 s, measured on the DeepSeek-V3 model with ToT.
Table 9 shows the detailed results of the LLM × prompt strategy × HPT combinations for the House Prices task. Also detailed hyper-parameter grids are provided in Table A3 and Table A4 of Appendix A. Columns show the corresponding LLM series, and rows show the RMSE, execution time, memory (peak memory usage), and algorithm (the final learning algorithm used in the code generated by the LLM). RMSE was computed on the Kaggle evaluation server from the predictions in the submission file on the public LB subset of the test set, whose labels are kept secret from the participants. Only public LB results are reported since the selected competitions are indefinite. When comparing the FSP and ToT strategies, in the conditions labeled HP-tuned the prompt enforced HPT in the generated code, while in the other conditions this step was not enforced. All experiments were run on the Kaggle platform under equal hardware and time/memory constraints, with the same data splits and evaluation metrics. Since code generation was performed with deterministic settings, the same code was obtained when the same prompt was repeated, and the results are reported as pass@1.
Figure 5 shows the heatmaps of the test RMSE values obtained for the House Prices task. On the left (Figure 5a) the results in the non-tuned case and on the right (Figure 5b) the results in the HP-Tuned case are shown. Rows represent LLMs and columns represent the prompt techniques used (FSP, ToT). It could be seen that in the non-tuned case, the FSP technique produced a significantly lower RMSE in most models. In this case, the lowest error value of 0.12007 was obtained with the FSP technique and the Gemini 2.5 Pro model. After the HPT process, the results obtained with the ToT technique became more competitive, as in the Titanic task. The most remarkable improvement was seen in the GPT-4.1 model with an error value of 0.12141. In the House Prices task, the FSP technique provided a reliable and strong start even without tuning. The ToT technique, on the other hand, was weak without tuning. With the right HPT process, it could become competitive. Considering the figures, the best result was 0.12007 with the Gemini 2.5 Pro model, FSP technique and no-tuning.
Figure 6a shows the peak memory utilization measured during the execution of the codes generated by the FSP and ToT techniques on different LLMs for the House Prices task. The Y-axis shows Peak Memory (MB) and the X-axis shows the LLM models and the non-tuned/HP-tuned variants of each. Peak memory consumption was in a narrow middle band for most model-technique-tuning combinations, with only one clear outlier in the graph: the GPT-4.1 model had by far the highest consumption, 842.48 MB, when the FSP (HP-tuned) technique was used. Apart from this, the ToT variants used less memory than FSP in many models. When the effect of HP-tuning on memory consumption was analyzed, no consistent upward or downward trend was observed. The lowest values were in the ~250–350 MB band with the o3 model on the ToT side.
Figure 6b shows the execution time of the codes generated by FSP and ToT techniques for the same task. The Y-axis shows time (seconds) on a logarithmic scale, and the X-axis shows the non-tuned/HP-tuned variants of the models. The figure shows that HP-tuning increased the time in most models. The rate of increase varied from LLM to LLM. The most significant increase was seen for the DeepSeek-R1 model in combination with the ToT technique (HP-tuned). There was no uniform superiority in terms of speed, but the ToT (non-tuned) line was often the fastest (e.g., execution time on GPT-4.1 and Gemini 2.5 Pro was on the order of seconds). The FSP (HP-tuned) line was the slowest in most models.
Figure 7 shows the heatmaps of the Test accuracy values obtained for the Digit Recognizer task. The left panel (Figure 7a) shows the results in the non-tuned case, while the right panel (Figure 7b) shows the HP-tuned case. The rows represent the LLMs, and the columns represent the prompt techniques used. According to the figure, in the non-tuned case the ToT technique outperformed the FSP technique in most models. The best result in this case was 0.99621 with the Gemini 2.5 Pro model and the ToT technique. In the HP-tuned case, the ToT technique maintained its superiority over FSP in most models; the best result here was 0.99639 with the Gemini 2.5 Pro model and the ToT (HP-tuned) technique.
When the figure was evaluated across all conditions, the highest overall accuracy was obtained with the Gemini 2.5 Pro model, the ToT technique, and HP-tuning. According to the figure, ToT gave the highest accuracy for most models in the Digit Recognizer task, whereas the FSP technique appeared to be the more reliable choice within the DeepSeek family. In this task, HP-tuning combined with ToT was generally competitive or superior, but, depending on the LLM, the ToT technique could also degrade performance.
Table 10 shows the detailed results of the LLM × prompt strategy × HPT combinations for the Digit Recognizer task. Also detailed hyper-parameter grids are provided in Table A5 and Table A6 of Appendix A. The descriptions of the table rows and columns, execution environment, code generation conditions, hyperparameter tuning rules, and evaluation settings in Table 10 are identical to those described for Table 8.
Figure 8a shows the measured peak memory usage for the Digit Recognizer task when running codes generated by FSP and ToT techniques on different LLMs. The Y-axis shows Peak Memory (MB) and the X-axis shows the LLM models and the non-tuned/HP-tuned variants of each. The overall pattern for this task showed that ToT variants used less memory compared to FSP, especially in the non-HPT case. The FSP technique had the highest consumption in many models even in the no-tuning case. The lowest memory values were seen with the ToT technique in the range of about 2.0–2.4 GB in models like o3/GPT-4.1. The effect of HP-tuning on memory was not uniform. In the FSP technique, tuning resulted in a decrease in most models, while in the ToT technique, small increases or decreases could be seen from LLM to LLM.
Figure 8b shows the execution time of the code generated by FSP and ToT techniques for the same task. The Y-axis shows time (seconds) on a logarithmic scale, while the X-axis shows the non-tuned/HP-tuned variants of the models. According to the graph, it was seen that the effect of HP-tuning on time was model-dependent. The fastest results were in the ~1–3 min band and were seen in the GPT-4.1, Sonnet 4 (FSP) and DeepSeek-V3 (ToT) combinations. In general, there was no uniform speed advantage: ToT was the fastest in some models, while FSP was more advantageous in others.
Table 11 shows the detailed results of the LLM × prompt strategy × HPT combinations for the NLP with Disaster Tweets task; the detailed hyperparameter grids are provided in Table A7 and Table A8 of Appendix A. The descriptions of table rows and columns, execution environment, code generation conditions, hyperparameter tuning rules, and evaluation settings in Table 11 are identical to those described for Table 8. Table 11 differs from Table 8 only in the evaluation metric: F1 score is reported instead of accuracy.
Figure 9 shows the heatmaps of the F1 score values obtained for the Disaster Tweets task. The left (Figure 9a) shows the results in the non-tuned case and the right (Figure 9b) in the HP-tuned case. The rows represent the LLMs, and the columns represent the prompt techniques used.
In the non-tuned case, the FSP technique produced noticeably higher F1 scores than the ToT technique in almost all models. Under HP-tuning, the ToT technique became competitive with, or better than, FSP in some models. Among the non-tuned models, the highest F1 score of 0.83695 was obtained with the Gemini 2.5 Pro model and the FSP technique. With HPT, the best result was an F1 score of 0.81428, obtained from the OpenAI o3 model with the ToT technique. Evaluated overall, the best result was obtained with the Gemini 2.5 Pro model and the FSP technique. Another observation was that applying HPT substantially improved the F1 scores of the ToT-generated solutions.
Figure 10a shows the measured peak memory utilization for the Disaster Tweets task when running the code generated by the FSP and ToT techniques on different LLMs. The Y-axis shows peak memory (MB) and the X-axis shows the LLM models and the non-tuned/HP-tuned variants of each. The figure shows that, in general, the FSP technique consumed less memory than the ToT technique in most models. Four outliers stood out in the graph: Gemini 2.5 Pro with FSP (non-tuned), Claude Opus 4 with ToT (non-tuned), and the HP-tuned versions of Claude Sonnet 4 for both FSP and ToT. Memory consumption in these configurations varied from about 1.5 GB to 5 GB. The lowest memory consumption, 269.09 MB, was observed for DeepSeek-V3 with the FSP (non-tuned) technique. Again, HPT did not have a uniform effect on memory consumption.
Figure 10b shows the execution time of the code generated by the FSP and ToT techniques for the same task. The Y-axis is time (seconds) on a logarithmic scale, and the X-axis shows the non-tuned/HP-tuned variants of the models. According to the figure, HPT tended to increase the time for most models, but the magnitude of the increase varied by model. The fastest combinations were DeepSeek-R1 with FSP (non-tuned) and ToT (non-tuned), GPT-4.1 with FSP (non-tuned), and Gemini 2.5 Pro with FSP (HP-tuned).
In addition to the previously reported tasks, we extended the experimental study with the newly introduced “Predicting the Beats-per-Minute of Songs” competition. For this task, we focused on the most competitive and relevant models: Gemini 2.5 and DeepSeek-V3, whose effectiveness had already been demonstrated in earlier experiments, and GPT-5, which represents the most recent version available during the study. This selection allows for a fair and up-to-date evaluation while avoiding unnecessary redundancy.
Table 12 reports the detailed results of the Predicting the Beats-per-Minute of Songs task, considering the combinations of LLM × prompting strategy × hyperparameter tuning; the detailed hyperparameter grids are provided in Table A9 and Table A10 of Appendix A. Columns correspond to the LLM series, while rows provide RMSE, execution time, memory (peak usage), and the final learning algorithm employed by the generated code. RMSE values were obtained from the Kaggle evaluation server based on the submission files, ensuring evaluation consistency through hidden test labels. Figure 11 presents the heatmaps of the RMSE results for this task. Figure 11a shows the non-tuned case, while Figure 11b shows the HP-tuned case. Rows correspond to LLMs and columns to prompting strategies (FSP, ToT).
In the non-tuned setting, the lowest RMSE was obtained by GPT-5-Thinking with the ToT strategy (26.38734). Gemini 2.5 Pro with ToT was a close second (26.38879), while the best FSP result among non-tuned models was Gemini 2.5 Pro (26.39081). DeepSeek-V3’s ToT variant underperformed (26.42273), whereas its FSP result was 26.39554. With hyperparameter tuning, DeepSeek-V3 (FSP) achieved the overall best RMSE (26.38663). GPT-5-Thinking (ToT) remained competitive (26.38760), and DeepSeek-V3 (ToT) improved substantially to 26.38801.
Figure 12a shows the measured peak memory utilization for the Predicting the Beats-per-Minute of Songs task when executing the codes generated by FSP and ToT techniques on different LLMs. The Y-axis indicates peak memory (MB), and the X-axis shows the LLM models with their non-tuned and HP-tuned variants. Memory consumption patterns varied considerably across models and prompting strategies. GPT-5 Thinking with ToT (HP-tuned) consumed the most memory (~2 GB), while its ToT (non-tuned) variant was among the most memory-efficient (~461 MB). Gemini 2.5 Pro with FSP (HP-tuned) showed relatively low memory usage (~607 MB). For DeepSeek-V3, however, the FSP (HP-tuned) variant consumed ~697 MB, which was higher than its ToT (non-tuned) variant (~480 MB). Overall, neither FSP nor ToT was consistently more memory-efficient; the effect depended strongly on the LLM and whether hyperparameter tuning was applied.
Figure 12b presents execution times for the Predicting the Beats-per-Minute of Songs task. GPT-5-Thinking was generally the slowest, exceeding 3 h in the non-tuned runs, though its FSP (HP-tuned) variant reduced this to ~1.3 h. Gemini 2.5 Pro with ToT (non-tuned) achieved the fastest runtime at only 47 s. DeepSeek-V3 was also efficient, ranging from ~9 to 46 min depending on tuning. Overall, runtime efficiency was strongly model- and strategy-dependent, with Gemini and DeepSeek outperforming GPT-5 in this task.
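The peak memory and execution time values reported throughout this section can, in principle, be captured with standard Python instrumentation. The minimal sketch below illustrates one such approach; the profile_run and dummy_pipeline names are illustrative rather than taken from the released scripts, and tracemalloc only tracks Python-level allocations, so the exact instrumentation used in the study may differ.
import time
import tracemalloc

def profile_run(entry_point):
    """Measure wall-clock time and peak Python heap usage of a single pipeline run."""
    tracemalloc.start()
    start = time.perf_counter()
    entry_point()  # e.g., the generated train/predict routine
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak_bytes / (1024 ** 2)  # seconds, peak memory in MB

if __name__ == "__main__":
    def dummy_pipeline():
        data = [i ** 2 for i in range(1_000_000)]  # stand-in for model training
        return sum(data)

    seconds, peak_mb = profile_run(dummy_pipeline)
    print(f"Execution time: {seconds:.2f} s, peak memory: {peak_mb:.2f} MB")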

4. Discussion

This section interprets the findings in terms of model, prompting strategy, and tuning interactions. Task-specific performance patterns are discussed, along with comparisons to baseline or reference solutions, consistency with prior work, and practical takeaways.

4.1. Prompting Effectiveness by Task Type

Table 13 summarizes the relationship between the types of tasks examined in the study and the prompt techniques. In the table, Search/Design Space refers to the set of all options that the LLM can decide on when solving a task. This includes choices such as data cleaning and transformation steps, feature engineering, algorithm selection (e.g., Logistic Regression (LR), Random Forest (RF), or a Convolutional Neural Network (CNN)), hyperparameters, and training/validation strategies (cross-validation (CV), early stopping, ensembling). The Search/Design Space column is rated by task as Narrow, Medium, Large, or Very Large. This rating is based on the solution practices of the respective tasks in Kaggle competitions; in real-world applications, the width of the search space may vary due to data access, business requirements, or constraints.
According to our findings, FSP stands out as the “reliable default” in most scenarios. Especially in the tabular tasks, it gave consistent results even without HPT, and the best absolute score in the House Prices task was obtained with FSP (non-tuned). In Titanic, the best result was obtained with FSP (HP-tuned). For the Beats-per-Minute of Songs regression task, which represents a more recent and large-scale dataset, the best performance was again achieved with FSP (HP-tuned). As the search space expands, the ToT technique can provide higher ceiling values with the right HPT; for example, in the Digit Recognizer task the best performance was obtained with the ToT (HP-tuned) combination. For a short and noisy text classification task such as Disaster Tweets, the best result was obtained with FSP (non-tuned); however, ToT (HP-tuned) configurations were also competitive in some LLMs.
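To make the distinction between the two strategies concrete, the simplified prompt skeletons below illustrate how an FSP request supplies worked examples, while a ToT request asks the model to branch, evaluate, and select before emitting code. The wording is an illustrative sketch, not the exact prompts used in the experiments.
# Simplified prompt skeletons for the two strategies (illustrative wording only).

FSP_PROMPT = """You are an ML engineer. Write a complete Python script for the Kaggle
'{task}' competition that loads train.csv/test.csv, preprocesses the data, trains a
suitable model, and writes submission.csv.

Example 1 (similar task -> expected solution outline): {example_1}
Example 2 (similar task -> expected solution outline): {example_2}"""

TOT_PROMPT = """You are an ML engineer solving the Kaggle '{task}' competition.
Step 1: Propose three candidate solution paths (preprocessing, algorithm, validation).
Step 2: Evaluate the strengths and weaknesses of each path.
Step 3: Select the most promising path and justify the choice.
Step 4: Output the complete Python script for the selected path, ending with a
submission.csv file."""

print(FSP_PROMPT.format(task="Titanic", example_1="...", example_2="..."))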

4.2. A Practical Comparison of LLM-Generated and Human-Developed Solutions

Table 14 compares the solutions produced by LLMs in five different Kaggle competitions with the notebooks created and shared on the platform by the users participating in those competitions. The scores in the “Kaggle Notebooks” column were determined as follows: if the competition had a pinned notebook that explicitly reported its result, the score from this notebook was taken directly. Pinned notebooks are typically highlighted by the contest team or Kaggle editors, indicating editorial endorsement. If no such notebook was available, notebooks were sorted by the number of votes, and the result from the highest-voted notebook reporting a score was used. While this approach is not a perfect benchmark, it offers a reasonable reference point for comparison. For the Beats-per-Minute of Songs task, which is an ongoing competition, the Kaggle public leaderboard (LB) was taken as the reference, and the best result as of 20 September 2025 was used. Because this submission was not accompanied by an open notebook, only its RMSE value could be accessed, and the specific method used to obtain it was not available.
As four of the selected competitions are indefinite and one is still ongoing, only the public LB is available. Public LB results run the risk of adaptive overfitting because of the single public test slice and unlimited submission attempts. Moreover, heavy ensembles that cannot be reproduced in practice and excessive seed or hyperparameter searches can make the scores unrealistic and difficult to reproduce. In addition, unnoticed data leaks or incorrect validation practices may occur, and measurement noise and small, statistically insignificant differences may also affect the ranking. For these reasons, the public LB alone is not a reliable baseline in indefinite competitions. The notebooks selected in this study aim to provide a more reproducible and representative comparison. Nevertheless, since there is no private LB, one should be cautious about the generalizability of the comparisons made.
According to the results in Table 14, if the LLM-generated solutions are compared with the Kaggle user-submitted notebooks on a task-specific basis, the following observations can be made: In the Titanic task, the Kaggle notebook achieved a higher score with 80.1% accuracy than the LLM (78.2%) and ranked higher on the leaderboard; the LLM solution, however, produced results in a much shorter time. In the House Prices competition, both the LLM (0.12007) and the Kaggle notebook (0.12096) had very similar error rates; the LLM ranked slightly higher in percentage terms, while the Kaggle notebook was faster. In the Digit Recognizer task, the Kaggle notebook performed slightly behind the LLM, with the LLM also producing results faster; in this case, the leaderboard percentile rank of the LLM (10.4%) was better than that of the Kaggle notebook (32.7%). In the Disaster Tweets task, the LLM and Kaggle notebook produced almost identical F1 scores and leaderboard rankings, with the Kaggle notebook being faster. Finally, for the Beats-per-Minute of Songs regression task, which represents a more recent and large-scale dataset, the Kaggle notebook achieved a slightly lower RMSE (26.38020) compared to the best LLM configuration (26.38663 with DeepSeek-V3 FSP, HP-tuned). However, the LLM solution delivered results in a short time (under 10 min), highlighting its efficiency.
The performance differences across models also provide further insight. Gemini 2.5 Pro often excelled in structured classification tasks such as House Price, Digit Recognizer, and Disaster Tweets, which may be attributed to its extended ~1M-token context window (see Table 4) and strong handling of tabular patterns. In contrast, DeepSeek-V3 achieved the best results in Titanic and Beats-per-Minute of Songs, suggesting that its efficiency with regression-style optimization and coding-oriented reasoning (e.g., feature engineering, model selection) allowed it to succeed even with a shorter 128 K context window (see Table 4).
The results of the five Kaggle tasks examined in this study indicate that LLM-based solutions, generated with a single prompt and minimal manual intervention (without iterative refinement), can perform at levels comparable to, and in some cases slightly better than, Kaggle user-submitted notebooks that were manually optimized. Performance in terms of speed varies by task: LLMs were faster in some tasks, while human-generated solutions were faster in others.

4.3. Comparison with Prior Literature

Table 15 presents the comparative results obtained from different LLMs on two classification tasks: Titanic survival prediction and MNIST digit recognition. The comparison includes results from three distinct sources: previously reported results from Ref. [23], standard Kaggle notebook implementations, and our own experimental analyses. The models evaluated include different versions of GPT, Gemini, and DeepSeek, tested both with and without fine-tuning. Several prompting strategies were applied, including FSP, CoT, ToT, and Specifying Desired Response Format. The selection criteria for the Kaggle notebooks included in this table are explained in detail above.
The results were selected to enable a direct comparison with the best-performing LLM model and configuration reported in Ref. [23]. A notable methodological difference between the two studies is that, in our experiments, the entire process was executed under a single prompting strategy per experiment, allowing end-to-end evaluation. In contrast, Ref. [23] varied prompting strategies across sub-tasks and restricted ML algorithm selection to a fixed pool of three models, without delegating the algorithm choice to the LLM.
In our experiments, the highest accuracy for the Titanic task was achieved by DeepSeek, reaching 0.78229. This result was closely aligned with the best LLM-based result reported in Ref. [23], where GPT-3.5, combined with CoT and FSP techniques, achieved an accuracy of 0.7918. Although more recent and advanced models such as GPT-4.1 and Gemini 2.5 Pro were used in our study, they did not outperform the GPT-3.5 configuration from Ref. [23], suggesting that effective prompting strategies may have a greater impact than the LLM model version alone, particularly in low-data structured tasks.
It was also noteworthy that the Kaggle notebook achieved the highest overall accuracy at 0.80143. This indicates that in structured datasets such as Titanic, classical models remain highly competitive and can outperform even the most advanced LLMs when appropriate feature engineering and HPT are applied. Overall, the results highlight the importance of not only selecting a capable LLM but also applying the right combination of preprocessing, prompting, and tuning strategies to extract optimal performance, especially in structured prediction tasks.
The results on the Digit Recognizer task demonstrate that LLMs can achieve highly competitive performance even on vision-based classification tasks. In our experiments, the highest accuracy was achieved by the Gemini 2.5 Pro (HP-tuned) model, reaching 99.639%, slightly surpassing the best result reported in Ref. [23] (GPT-3.5: 99.60%). The human-level accuracy reported in Ref. [23] was 98.34%. In our experiments, most LLM configurations achieved comparable or slightly higher results. For instance, GPT-4.1 models, particularly when fine-tuned and combined with classification-oriented prompting, achieved accuracy levels between 99.24% and 99.31%, while Gemini 2.5 Pro models ranged between 98.67% and 99.64%, depending on tuning and prompt design. The baseline Kaggle notebook achieved 99.028% accuracy, which is strong, though several LLM setups produced higher scores. These findings suggest that, with appropriate prompting strategies, LLMs can deliver performance that is competitive with, and in some cases exceeds, established vision models on structured image classification tasks. However, these results are specific to the evaluated benchmarks and should not be generalized across all tasks.
Overall, the results suggest that the success of LLMs in this task is not solely due to model scale, but also to the design of the prompt and the inclusion of targeted tuning strategies. These findings extend the applicability of LLMs beyond traditional NLP settings, highlighting their potential in domains that were previously dominated by task-specific architectures like CNNs. Table 16 illustrates key methodological differences between our study and Ref. [23] over best models.
According to Table 16, in the Titanic task, while both studies used the Random Forest Classifier (RFC), the way it was chosen differs significantly. Ref. [23] selected the algorithm manually from a predefined pool of three models, whereas in our study, the ML algorithm choice was left to the LLM itself. Based on the given prompt, the LLM independently selected RFC, demonstrating its capacity not only to perform classification but also to make informed methodological decisions.
In the Digit Recognizer task, both studies employed a CNN architecture; however, our design included important enhancements such as dropout regularization and a larger dense layer (256 units). Additionally, dropout was treated as a tunable hyperparameter in our setup, unlike in Ref. [23], where it was not used. These architectural and training choices likely improved the generalization ability of our models.
Taken together, these differences reflect a more independent and regularized approach in our experiments, where algorithm selection, tuning, and architecture were optimized in a prompt-driven workflow. This design not only yielded competitive results but also highlighted the potential of LLMs to go beyond prediction and contribute meaningfully to algorithm configuration in applied ML settings.
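As a concrete illustration of the configuration described above, the following Keras sketch builds a small CNN with a 256-unit dense layer and exposes dropout as a tunable hyperparameter. The convolutional layer sizes and optimizer are assumptions for illustration and do not reproduce the exact generated architectures.
import tensorflow as tf

def build_cnn(dropout_rate: float = 0.3) -> tf.keras.Model:
    """Small MNIST-style CNN; dropout_rate is exposed as a tunable hyperparameter."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),  # larger dense layer (256 units)
        tf.keras.layers.Dropout(dropout_rate),          # dropout treated as tunable
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example: sweep the dropout rate as in the HP-tuned configurations of Table A5/A6.
for rate in (0.2, 0.3, 0.4):
    model = build_cnn(dropout_rate=rate)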

4.4. LLM Pipelines vs. Classical AutoML

To strengthen the evaluation and provide fair baselines, we additionally considered two established AutoML frameworks: auto-sklearn and AutoGluon. Their inclusion allows us to assess whether LLM-based code generation can provide advantages beyond conventional AutoML solutions.
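Both frameworks expose high-level tabular interfaces, so a baseline run requires only a few lines of code. The sketch below is a minimal example of how such baselines can be obtained; the label column, time budget, and preset are illustrative and do not correspond to the exact settings behind Table 17.
import pandas as pd
from autogluon.tabular import TabularPredictor

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# AutoGluon: fit a stacked/ensembled model within a fixed time budget.
predictor = TabularPredictor(label="Survived", eval_metric="accuracy").fit(
    train_df, time_limit=3600, presets="best_quality"
)
predictions = predictor.predict(test_df)

# auto-sklearn follows the scikit-learn estimator interface (classification shown).
# import autosklearn.classification
# automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=3600)
# automl.fit(X_train, y_train)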
Across five benchmark Kaggle tasks, the comparison highlights how traditional AutoML frameworks and LLMs differ in both predictive performance and computational efficiency. The results indicate that while auto-sklearn and AutoGluon generally provide competitive baselines, LLM-driven approaches such as Gemini 2.5 Pro and DeepSeek-V3 demonstrate the ability to generate pipelines that achieve comparable or superior accuracy, F1 score and RMSE values in several tasks. Notably, LLMs often reduced execution time substantially, especially when leveraging GPU resources, whereas AutoML frameworks incurred longer runtimes due to their iterative search and ensembling strategies (see Table 17). Nevertheless, AutoML systems remain strong baselines, particularly in structured regression tasks such as House Price and Beats per Minute, where their ensembles yielded stable performance.
These findings suggest that LLM-based code generation can complement rather than fully replace AutoML frameworks, offering rapid prototyping advantages, while AutoML continues to provide systematically optimized solutions. While AutoML frameworks systematically search for the best-performing combination within a predefined algorithmic space, LLMs can extend beyond this space by proposing novel or unconventional solution strategies. Thus, the two paradigms should be viewed as mutually reinforcing: LLMs provide creativity and flexibility in generating pipelines, whereas AutoML ensures stability and rigor through structured optimization.

4.5. General Insights and Practical Recommendations

In light of the findings, some general suggestions on which prompting strategy is more appropriate for different task types and search spaces are presented below. The common picture emerging from the five tasks examined in the study is as follows (the findings relate to end-to-end pipeline generation by LLMs on Kaggle classification and regression tasks with different search spaces in a statistical learning context; they should not be directly generalized to other settings):
  • There is no universally superior prompting technique in terms of accuracy, F1 score, or error metrics. FSP offers a strong and stable start without tuning, while ToT is often significantly strengthened by HP-tuning and can achieve the best results in some tasks.
  • HPT usually increases execution time but creates a leverage effect that can change the ranking.
  • There is no uniform trend in memory consumption; differences depend on the LLM model × prompting × tuning interaction and are small in most tasks.
  • Among the models, Gemini 2.5 Pro and DeepSeek-V3 often stand out with high accuracy/F1 score and low RMSE.
  • Practical advice: for a fast and stable start, FSP (non-tuned) is particularly suitable for tasks with narrow to medium search spaces. For tasks with a wide search space, ToT (HP-tuned) can provide better peak performance if a sufficient tuning budget is available.
  • Comparisons with AutoML frameworks indicate that LLM-based solutions can match or even surpass systematically optimized baselines in several tasks, while AutoML remains valuable for stability, particularly in structured regression problems.
  • Results from the most recent and relatively underexplored task (Beats-per-Minute of Songs) demonstrate that LLMs can deliver competitive results even on less-studied datasets, with notable efficiency gains, though their performance remains sensitive to prompting and tuning choices.
An important practical dimension of this study concerns the time-to-solution tradeoff between LLM processes and traditional human-driven development. In human workflows, producing competitive ML solutions typically involves hours or even days of iterative coding, debugging, and hyperparameter tuning. In contrast, the LLMs examined here generated complete pipelines in a single pass based solely on the prompt. However, it should be noted that not all outputs were immediately executable: in cases of syntax or basic library errors, the erroneous script and its error message were re-submitted to the same LLM for correction, which added minimal overhead compared to human debugging. Taken together, these results indicate that LLMs can substantially accelerate prototyping and experimentation, while AutoML frameworks and human expertise remain valuable for systematic optimization and domain-specific adaptation.
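The correction step described above can be expressed as a small repair loop. The sketch below assumes a generic llm_generate() helper standing in for whichever chat interface was used; the helper and file names are hypothetical.
import subprocess

def llm_generate(prompt: str) -> str:
    """Hypothetical helper that returns a Python script from the chosen LLM."""
    raise NotImplementedError  # stands in for the actual chat/API call

def run_with_repair(prompt: str, max_repairs: int = 2) -> str:
    """Run the generated script; on failure, feed the traceback back to the same LLM."""
    script = llm_generate(prompt)
    for attempt in range(max_repairs + 1):
        with open("pipeline.py", "w") as f:
            f.write(script)
        result = subprocess.run(["python", "pipeline.py"], capture_output=True, text=True)
        if result.returncode == 0:
            return script  # executable solution obtained
        if attempt == max_repairs:
            break
        # Re-submit the erroneous script together with its error message for correction.
        script = llm_generate(
            "The following script failed.\n\nScript:\n" + script
            + "\n\nError:\n" + result.stderr
            + "\n\nReturn a corrected, complete script."
        )
    raise RuntimeError("Script could not be repaired within the allowed attempts.")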
Finally, when interpreting these results, it is important to consider their scope and generalizability beyond structured competition settings. While the results demonstrate the potential of LLM-based code generation on structured benchmark tasks, their generalizability to complex real-world business scenarios should be considered with caution. Kaggle competitions typically involve well-curated datasets, clearly defined objectives, and standardized evaluation metrics. In contrast, practical applications in industry often require handling noisy or incomplete data, domain-specific feature engineering, integration with existing workflows, and adherence to business constraints. Therefore, the present findings should primarily be interpreted as evidence of feasibility and relative performance under controlled benchmark conditions. Further empirical studies are needed to validate the robustness and applicability of LLM-based approaches across diverse, real-world contexts.

5. Conclusions and Future Work

This study evaluated the performance of various LLMs on five different Kaggle-based structured ML tasks and systematically examined the effects of different prompting strategies (FSP and ToT) and HPT. The findings show that, when guided by well-designed prompts, LLMs can generate high-performing ML solutions that are comparable to, and in some cases slightly surpass, human-crafted solutions, often requiring less development time.
Analyses across tasks with varying complexity and search spaces revealed that no prompting strategy consistently dominated. FSP proved to be a strong baseline, often yielding competitive results in lower-complexity and tabular tasks even without tuning. In contrast, ToT showed clear advantages in broader search spaces, particularly when combined with HPT, though it required greater computational resources and design effort.
From the perspective of efficiency, metrics such as memory usage and execution time varied inconsistently across models depending on the prompt strategies and tuning options; therefore, no single strategy stood out in terms of overall efficiency. However, Gemini 2.5 Pro and DeepSeek-V3 frequently emerged as the top-performing models with the highest accuracy, F1 score and lowest RMSE across various tasks.
The fact that LLMs can suggest algorithms suitable for the given task indicates that these models can be considered not only as prediction engines but also as tools assisting in the design of effective ML pipelines. The findings demonstrate that, when prompt design is handled carefully and aligned with the task, LLMs have the potential to deliver strong solutions with limited human intervention as supportive ML tools.
In addition, the comparison with established AutoML frameworks highlights their complementary role to LLM-based solutions. While AutoML systems provide stable and systematically optimized baselines, LLMs offer clear advantages in rapid prototyping and flexible pipeline generation, sometimes extending beyond the predefined search spaces of AutoML. The results from the most recent and large-scale benchmark task further illustrate this complementarity, showing that LLMs can deliver competitive performance with substantial efficiency gains.
Future studies may test the scalability of these approaches on larger and real-world datasets, as well as their robustness in less structured or noisier data settings. In addition, integrating different prompting methods could contribute to enhancing the generalization ability of LLMs across various tasks.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app152010968/s1, Folder S1: Contains the code generated by the LLMs, along with the corresponding log and submission files and datasets. File S1: PDF document including functional and non-functional code examples.

Funding

This research received no external funding.

Data Availability Statement

Acknowledgments

I gratefully acknowledge the support of my Ph.D. student, Tuğrul Hakan Gençtürk, for his help with prompt execution in this study.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A

Table A1. FSP-Tuned hyperparameters for the Titanic task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: Gradient Boosting Classifier | Random Forest Classifier | Random Forest Classifier
Tuned Hyperparameters:
clf__n_estimators: randint(200, 600),
clf__learning_rate: uniform(0.01, 0.2),
clf__max_depth: randint(2, 5),
clf__min_samples_split: randint(2, 20),
clf__min_samples_leaf: randint(1, 20),
clf__subsample: uniform(0.6, 0.4)
n_estimators: [100, 200],
max_depth: [4, 6, 8, None],
min_samples_split: [2, 5, 10],
min_samples_leaf: [1, 2, 4],
class_weight: [‘balanced’]
n_estimators: [100, 200, 300],
max_depth: [None, 10, 20],
min_samples_split: [2, 5, 10],
min_samples_leaf: [1, 2, 4],
class_weight: [None, ‘balanced’]
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: Random Forest Classifier | Random Forest Classifier | Random Forest Classifier | Random Forest Classifier
Tuned Hyperparameters:
classifier__n_estimators: [100, 200, 300],
classifier__max_depth: [None, 5, 10],
classifier__min_samples_split: [2, 5, 10],
classifier__min_samples_leaf: [1, 2, 4]
classifier__n_estimators: [100, 200, 300],
classifier__max_depth: [None, 5, 10],
classifier__min_samples_split: [2, 5, 10],
classifier__min_samples_leaf: [1, 2, 4]
n_estimators: [100, 200, 300],
max_depth: [5, 10, 15, None],
min_samples_split: [2, 5, 10],
min_samples_leaf: [1, 2, 4],
max_features: [‘auto’, ‘sqrt’, ‘log2’]
n_estimators: [100, 200, 300],
max_depth: [3, 5, 7, None],
min_samples_split: [2, 5, 10],
min_samples_leaf: [1, 2, 4],
max_features: [‘sqrt’, ‘log2’]
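For reference, parameter names prefixed with classifier__ (or clf__/model__) indicate steps of a scikit-learn Pipeline. The sketch below shows how the DeepSeek grid from Table A1(b) can be wired into GridSearchCV; the preprocessing step is an assumption for illustration. The randint/uniform entries in Table A1(a) correspond instead to scipy.stats distributions passed to RandomizedSearchCV.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Preprocessing is illustrative; the LLM-generated pipelines differ in detail.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked", "Pclass"]),
])

pipe = Pipeline([("preprocess", preprocess),
                 ("classifier", RandomForestClassifier(random_state=42))])

# Grid taken from Table A1(b) (DeepSeek-V3 / DeepSeek-R1 column).
param_grid = {
    "classifier__n_estimators": [100, 200, 300],
    "classifier__max_depth": [None, 5, 10],
    "classifier__min_samples_split": [2, 5, 10],
    "classifier__min_samples_leaf": [1, 2, 4],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
# search.fit(X_train, y_train)  # X_train/y_train are placeholders for the prepared data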
Table A2. ToT-Tuned hyperparameters for the Titanic task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: Random Forest Classifier | Random Forest Classifier | Random Forest Classifier
Tuned Hyperparameters:
clf__n_estimators: [100, 200, 300]
clf__max_depth: [None, 10, 20]
clf__min_samples_split: [2, 5]
n_estimators: [100, 200, 300, 400]
max_depth: [3, 5, 7, 9, None]
min_samples_split: [2, 5, 10]
min_samples_leaf: [1, 2, 4]
max_features: [‘sqrt’, ‘log2’]
n_estimators: [100, 200]
max_depth: [5, 10, None]
min_samples_split: [2, 5]
min_samples_leaf: [1, 2]
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: Random Forest Classifier | Random Forest Classifier | Random Forest Classifier | Random Forest Classifier
Tuned Hyperparameters:
classifier__n_estimators: [100, 200]
classifier__max_depth: [None, 10, 20]
classifier__min_samples_split: [2, 5]
classifier__n_estimators: [100, 200]
classifier__max_depth: [5, 10, None]
classifier__min_samples_split: [2, 5]
classifier__min_samples_leaf: [1, 2]
n_estimators: [100, 200, 300]
max_depth: [5, 10, 15, None]
min_samples_split: [2, 5, 10]
min_samples_leaf: [1, 2, 4]
max_features: [‘sqrt’, ‘log2’]
n_estimators: [100, 200, 300]
max_depth: [3, 5, 7, 10, None]
min_samples_split: [2, 5, 10]
min_samples_leaf: [1, 2, 4]
max_features: [‘sqrt’, ‘log2’, None]
Table A3. FSP-Tuned hyperparameters for the House Prices task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: XGBRegressor | Blending (LightGBM + XGBoost + CatBoost) | Blending (LGBMRegressor + XGBRegressor + CatBoostRegressor)
Tuned Hyperparameters:
model__n_estimators: randint(800, 1600)
model__max_depth: randint(3, 7)
model__learning_rate: uniform(0.01, 0.09)
model__subsample: uniform(0.6, 0.4)
model__colsample_bytree: uniform(0.6, 0.4)
model__reg_alpha: uniform(0.0, 0.6)
model__reg_lambda: uniform(0.3, 1.0)
# LightGBM
objective: “regression”
metric: “rmse”
random_state: SEED
learning_rate: trial.suggest_float(“learning_rate”, 0.01, 0.2)
num_leaves: trial.suggest_int(“num_leaves”, 20, 80)
feature_fraction: trial.suggest_float(“feature_fraction”, 0.6, 1.0)
bagging_fraction: trial.suggest_float(“bagging_fraction”, 0.6, 1.0)
bagging_freq: trial.suggest_int(“bagging_freq”, 1, 7)
min_child_samples: trial.suggest_int(“min_child_samples”, 5, 30)

# XGBoost
objective: “reg:squarederror”
tree_method: “hist”
random_state: SEED
learning_rate: trial.suggest_float(“learning_rate”, 0.01, 0.2)
max_depth: trial.suggest_int(“max_depth”, 3, 10)
subsample: trial.suggest_float(“subsample”, 0.6, 1.0)
colsample_bytree: trial.suggest_float(“colsample_bytree”, 0.6, 1.0)
min_child_weight: trial.suggest_int(“min_child_weight”, 1, 10)

# CatBoost
loss_function: “RMSE”
random_seed: SEED
learning_rate: trial.suggest_float(“learning_rate”, 0.01, 0.2)
depth: trial.suggest_int(“depth”, 3, 10)
l2_leaf_reg: trial.suggest_float(“l2_leaf_reg”, 1, 10)
bagging_temperature: trial.suggest_float(“bagging_temperature”, 0, 1)
border_count: trial.suggest_int(“border_count”, 32, 255)
verbose: 0
# LightGBM
num_leaves: 31
learning_rate: 0.05
n_estimators: 720
max_bin: 55
bagging_fraction: 0.8
bagging_freq: 5
feature_fraction: 0.2319
feature_fraction_seed: 9
bagging_seed: 9
min_data_in_leaf: 6
min_sum_hessian_in_leaf: 11
random_state: SEED
n_jobs: −1

# XGBoost
learning_rate: 0.05
n_estimators: 600
max_depth: 3
min_child_weight: 0
gamma: 0
subsample: 0.7
colsample_bytree: 0.7
reg_alpha: 0.005
random_state: SEED
n_jobs: −1

# CatBoost
iterations: 1000
learning_rate: 0.05
depth: 3
l2_leaf_reg: 4
loss_function: ‘RMSE’
eval_metric: ‘RMSE’
random_seed: SEED
verbose: 0
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: XGBRegressor + LGBMRegressor + CatBoostRegressor | XGBRegressor + LGBMRegressor + CatBoostRegressor + Ridge | XGBRegressor + LGBMRegressor + CatBoostRegressor (meta-model: Ridge) | LGBMRegressor + XGBRegressor + CatBoostRegressor
Tuned Hyperparameters:
# XGBRegressor:
n_estimators: 100, 2000
max_depth: 3, 12
learning_rate: 0.001, 0.1, log = True
subsample: 0.6, 1.0
colsample_bytree: 0.6, 1.0
reg_alpha: 0, 10
reg_lambda: 0, 10

The LGBMRegressor and CatBoostRegressor models are trained directly with their default parameters.
# xgb:
objective: reg:squarederror
n_estimators: 1000
learning_rate: 0.01
max_depth: 3
subsample: 0.8
colsample_bytree: 0.4
random_state: 42

# lgbm:
objective: regression
n_estimators: 1000
learning_rate: 0.01
max_depth: 3
subsample: 0.8
colsample_bytree: 0.4
random_state: 42

# catboost:
iterations: 1000
learning_rate: 0.01
depth: 3
subsample: 0.8
colsample_bylevel: 0.4
random_seed: 42
verbose: 0

# ridge:
alpha: 10
random_state: 42
# XGB
n_estimators: trial.suggest_int(‘n_estimators’, 100, 1000)
max_depth: trial.suggest_int(‘max_depth’, 3, 10)
learning_rate: trial.suggest_float(‘learning_rate’, 0.01, 0.3)
subsample: trial.suggest_float(‘subsample’, 0.6, 1.0)
colsample_bytree: trial.suggest_float(‘colsample_bytree’, 0.6, 1.0)
reg_alpha: trial.suggest_float(‘reg_alpha’, 0, 10)
reg_lambda: trial.suggest_float(‘reg_lambda’, 0, 10)
random_state: 42

# LGBM
n_estimators: trial.suggest_int(‘n_estimators’, 100, 1000)
max_depth: trial.suggest_int(‘max_depth’, 3, 10)
learning_rate: trial.suggest_float(‘learning_rate’, 0.01, 0.3)
num_leaves: trial.suggest_int(‘num_leaves’, 20, 300)
feature_fraction: trial.suggest_float(‘feature_fraction’, 0.5, 1.0)
bagging_fraction: trial.suggest_float(‘bagging_fraction’, 0.5, 1.0)
bagging_freq: trial.suggest_int(‘bagging_freq’, 1, 7)
reg_alpha: trial.suggest_float(‘reg_alpha’, 0, 10)
reg_lambda: trial.suggest_float(‘reg_lambda’, 0, 10)
random_state: 42
verbosity: −1

# CatBoost
iterations: trial.suggest_int(‘iterations’, 100, 1000)
depth: trial.suggest_int(‘depth’, 4, 10)
learning_rate: trial.suggest_float(‘learning_rate’, 0.01, 0.3)
l2_leaf_reg: trial.suggest_float(‘l2_leaf_reg’, 1, 10)
random_seed: 42
verbose: False
# LGBM
objective: “regression”
metric: “rmse”
boosting_type: “gbdt”
num_leaves: trial.suggest_int(“num_leaves”, 10, 300)
learning_rate: trial.suggest_float(“learning_rate”, 0.01, 0.3)
feature_fraction: trial.suggest_float(“feature_fraction”, 0.4, 1.0)
bagging_fraction: trial.suggest_float(“bagging_fraction”, 0.4, 1.0)
bagging_freq: trial.suggest_int(“bagging_freq”, 1, 7)
min_child_samples: trial.suggest_int(“min_child_samples”, 5, 100)
verbosity: −1
random_state: 42

# XGB
n_estimators: 1000
learning_rate: 0.05
max_depth: 6
random_state: 42

# CatBoost
iterations: 1000
learning_rate: 0.05
depth: 6
verbose: False
random_state: 42
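The trial.suggest_* entries appearing in Tables A3, A4, A9, and A10 follow the Optuna search-space API. The sketch below shows how the GPT-4.1 LightGBM ranges from Table A3(a) translate into an Optuna objective; the train/validation split and the fixed n_estimators value are assumptions for illustration, and X and y are placeholders for the prepared feature matrix and target.
import lightgbm as lgb
import numpy as np
import optuna
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def objective(trial, X, y):
    """Optuna objective mirroring the trial.suggest_* ranges listed in Table A3(a)."""
    params = {
        "objective": "regression",
        "metric": "rmse",
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2),
        "num_leaves": trial.suggest_int("num_leaves", 20, 80),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.6, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.6, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 30),
    }
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    model = lgb.LGBMRegressor(**params, n_estimators=1000, random_state=42)
    model.fit(X_tr, y_tr)
    preds = model.predict(X_val)
    return np.sqrt(mean_squared_error(y_val, preds))

# study = optuna.create_study(direction="minimize")
# study.optimize(lambda t: objective(t, X, y), n_trials=50)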
Table A4. ToT-Tuned hyperparameters for the House Prices task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: LGBMRegressor | Lasso + XGBRegressor | Ridge + LGBMRegressor
Tuned Hyperparameters:
num_leaves: [31, 64]
learning_rate: [0.05, 0.1]
n_estimators: [500, 1000]
max_depth: [−1, 10]
# Lasso
alpha: np.logspace(−4, 0, 40)

# XGB
learning_rate: [0.01, 0.03]
max_depth: [3, 4]
n_estimators: [300, 500]
reg_alpha: [0, 0.1]
reg_lambda: [0.7, 1.0]
subsample: [0.7, 1.0]
# Model 1—Ridge Regression (non-tuned)

# Model 2—LightGBM
objective: “regression”
num_leaves: 31
learning_rate: 0.05
n_estimators: 720
max_bin: 55
bagging_fraction: 0.8
bagging_freq: 5
feature_fraction: 0.2319
feature_fraction_seed: 9
bagging_seed: 9
min_data_in_leaf: 6
min_sum_hessian_in_leaf: 11
random_state: RANDOM_SEED
n_jobs: −1
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: LGBMRegressor + XGBRegressor + CatBoostRegressor | LGBMRegressor | Ridge + Lasso + ElasticNet + XGBoost + LightGBM | XGBoost + LightGBM + Ridge + ElasticNet + Random Forest
Tuned Hyperparameters:
# LGBMRegressor
n_estimators: trial.suggest_int(‘n_estimators’, 100, 2000)
max_depth: trial.suggest_int(‘max_depth’, 3, 12)
learning_rate: trial.suggest_float(‘learning_rate’, 0.001, 0.1)
subsample: trial.suggest_float(‘subsample’, 0.6, 1.0)
colsample_bytree: trial.suggest_float(‘colsample_bytree’, 0.6, 1.0)
n_estimators: (100, 1000)
max_depth: (3, 10)
learning_rate: (0.001, 0.1, ‘log-uniform’)
num_leaves: (10, 100)
min_child_samples: (5, 50)
subsample: (0.6, 1.0)
colsample_bytree: (0.6, 1.0)
# XGBoost
max_depth: [3, 4, 5]
learning_rate: [0.01, 0.05, 0.1]
n_estimators: [300, 500]
subsample: [0.8]
colsample_bytree: [0.8]

# LightGBM
num_leaves: [20, 31, 40]
learning_rate: [0.01, 0.05, 0.1]
n_estimators: [300, 500]
subsample: [0.8]
colsample_bytree: [0.8]

# Ridge Regression
alpha: [0.1, 0.5, 1, 5, 10, 20, 50]

# Lasso
alpha: [0.0001, 0.0005, 0.001, 0.005, 0.01]

# ElasticNet
alpha: [0.0001, 0.0005, 0.001, 0.005]
l1_ratio: [0.3, 0.5, 0.7, 0.9]
# XGBoost
n_estimators = 1000,
max_depth = 3,
learning_rate = 0.05,
subsample = 0.8,
colsample_bytree = 0.8,
reg_alpha = 0.05,
reg_lambda = 0.05

# LightGBM
n_estimators = 1000,
max_depth = 3,
learning_rate = 0.05,
subsample = 0.8,
colsample_bytree = 0.8,
reg_alpha = 0.05,
reg_lambda = 0.05,
verbosity = −1

# Ridge Regression
alpha = 10.0

# ElasticNet
alpha = 0.005,
l1_ratio = 0.9,
max_iter = 1000

# Random Forest
n_estimators = 300,
max_depth = 15,
n_jobs = −1
Table A5. FSP-Tuned hyperparameters for the Digit Recognizer task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: CNN | CNN | CNN
Tuned Hyperparameters:
conv1: [32, 48]
conv2: [64, 96]
dense: [128, 256]
dropout: [0.2, 0.3, 0.4]
epochs: [12, 15, 20]
batch_size: [64, 128, 256]
conv1_filters: [32, 64]
conv2_filters: [64, 128]
dense_units: [128, 256]
dropout_rate: [0.2, 0.3]
batch_size: [64, 128]
epochs: [12, 15]
conv_filters: [32, 64]
dropout_rates: [0.25, 0.5]
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: CNN | CNN | CNN | CNN + Random Forest Classifier
Tuned Hyperparameters:
learning_rate: [0.001, 0.0005, 0.0001]
batch_size: [64, 128, 256]
epochs: [30, 40, 50]
dropout_rate: [0.3, 0.4, 0.5]
dropout_rate: [0.2, 0.3, 0.4]
optimizer: [‘adam’, ‘rmsprop’]
batch_size: [64, 128]
epochs: [20]
epochs: [20, 30]
batch_size: [64, 128]
model__learning_rate: [0.001, 0.0005]
# CNN
epochs: [5, 10, 15]
batch_size: [64, 128, 256]
model__optimizer: [‘adam’, ‘rmsprop’]
# Random Forest
n_estimators: [100, 200, 300]
max_depth: [10, 20, None]
min_samples_split: [2, 5, 10]
Table A6. ToT-Tuned hyperparameters for the Digit Recognizer task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: Medium CNN | Simple CNN | CNN
Tuned Hyperparameters:
lr: [1 × 10−4, 5 × 10−3]
bs: [64, 128, 256]
lr: [1 × 10−4, 5 × 10−2]
batch_size: [64, 128, 256]
lr: [1 × 10−4, 1 × 10−2]
dropout_rate: [0.1, 0.6]
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: CNN | CNN | Simple CNN | CNN
Tuned Hyperparameters:
learning_rate: 0.0003
epochs: 10
batch_size: 64
units: [128, 256]
dropout: [0.3, 0.4, 0.5]
lr: [0.001, 0.0005, 0.0001]
learning_rate: 5 × 10−4
batch_size_train: 128
batch_size_valtest: 256
epochs: 30
batch_size: 128
learning_rate: 0.0005
epochs: 30
Table A7. FSP-Tuned hyperparameters for the Disaster Tweets task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: TF-IDF + Logistic Regression | TF-IDF + Stacking Classifier | TF-IDF + Logistic Regression
Tuned Hyperparameters:
clf__C: [0.5, 1, 2, 5]
lr__C: [0.5, 1, 2]
lgbm__num_leaves: [15, 31]
lgbm__learning_rate: [0.05, 0.1]
final_estimator__C: [0.5, 1, 2]
# LogisticRegression
C: 1.0
solver: ‘liblinear’
random_state: 42

# TfidfVectorizer
ngram_range: (1, 2)
max_features: 15,000
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: TF-IDF + Logistic Regression | TF-IDF + Logistic Regression | TF-IDF + Logistic Regression/SVM/LGBMClassifier | TF-IDF + Logistic Regression + SVM
Tuned Hyperparameters:
clf__C: [0.1, 1, 10]
clf__penalty: [‘l1’, ‘l2’]
clf__solver: [‘liblinear’]
# TfidfVectorizer
ngram_range: [(1, 1), (1, 2)]
max_features: [5000, 10,000]

# LogisticRegression
C: [0.1, 1, 10]
solver: [‘liblinear’, ‘saga’]
# Logistic Regression (pipe_lr)
clf__C: [0.1, 0.5, 1.0, 2.0, 5.0]
clf__penalty: [‘l2’]
clf__max_iter: [500]

# SVM (pipe_svm)
clf__C: [0.1, 1.0, 10.0]
clf__kernel: [‘rbf’, ‘linear’]
clf__gamma: [‘scale’, ‘auto’]

# LightGBM (pipe_lgb)
clf__n_estimators: [100, 200, 300]
clf__num_leaves: [31, 50, 100]
clf__learning_rate: [0.05, 0.1, 0.2]
clf__min_child_samples: [20, 30]
# LogisticRegression
C: [0.1, 1, 10]

# SVC
C: [0.1, 1, 10]
kernel: [‘linear’, ‘rbf’]
Table A8. ToT-Tuned hyperparameters for the Disaster Tweets task: An LLM-based overview. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4.
(a)
LLM: OpenAI o3 | GPT-4.1 | Gemini 2.5 Pro
Selected Algorithm: TF-IDF + Logistic Regression | TF-IDF + Logistic Regression | TF-IDF + Logistic Regression
Tuned Hyperparameters:
C: [1, 2, 4, 8]
C: [0.3, 1, 3]
penalty: [‘l2’]
C: trial.suggest_loguniform(‘C’, 1 × 10−1, 1 × 10)
solver: ‘liblinear’
penalty: ‘l2’
random_state: 42
(b)
LLM: DeepSeek-V3 | DeepSeek-R1 | Claude Opus 4 | Claude Sonnet 4
Selected Algorithm: TF-IDF + Logistic Regression | TF-IDF + Logistic Regression | TF-IDF + XGBClassifier | XGBoost + SVM + Multinomial Naive Bayes
Tuned Hyperparameters:
C: [0.1, 1, 10]
penalty: [‘l1’, ‘l2’]
solver: [‘liblinear’]
C: [0.1, 1, 10]
class_weight: [None, ‘balanced’]
n_estimators: [100, 500]
max_depth: [3, 9]
learning_rate: [0.01, 0.3]
subsample: [0.6, 1.0]
colsample_bytree: [0.6, 1.0]
min_child_weight: [1, 4]
gamma: [0, 0.5]
# XGB
max_depth: [4, 6, 8]
learning_rate: [0.05, 0.1, 0.15]
n_estimators: [100, 200]
# SVM
C: [0.1, 1, 10]
gamma: [‘scale’, ‘auto’]

# Naive Bayes
alpha: [0.01, 0.1, 1.0]
Table A9. FSP-Tuned hyperparameters for the Predicting the Beats-per-Minute of Songs task: An LLM-based overview.
LLM: GPT-5 Thinking | Gemini 2.5 Pro | DeepSeek-V3
Selected Algorithm: CatBoost/XGBoost/Ridge | LightGBM | LightGBM
Tuned Hyperparameters:
# CatBoost
depth: [4, 5, 6, 7, 8, 9, 10]
learning_rate: [0.03, 0.05, 0.07, 0.10]
l2_leaf_reg: [1, 3, 5, 7, 10, 15, 20, 30]
bagging_temperature: [0.0, 0.25, 0.5, 1.0]
random_strength: [0.5, 1.0, 1.5, 2.0]
grow_policy: [‘SymmetricTree’, ‘Depthwise’, ‘Lossguide’]
border_count: [64, 128, 254]
min_data_in_leaf: [1, 5, 10, 20, 50]
n_estimators: [2000, 4000, 8000]

# XGBoost
xgb__n_estimators: [400, 600, 800, 1200, 1600]
xgb__max_depth: [3, 4, 5, 6, 7, 8, 9, 10]
xgb__learning_rate: [0.03, 0.05, 0.07, 0.10, 0.15, 0.20]
xgb__subsample: [0.6, 0.7, 0.8, 0.9, 1.0]
xgb__colsample_bytree: [0.6, 0.7, 0.8, 0.9, 1.0]
xgb__min_child_weight: [1, 2, 3, 5, 7, 10]
xgb__reg_alpha: [0.0, 1 × 10−8, 1 × 10−6, 1 × 10−4, 1 × 10−3, 1 × 10−2,1 × 10−1]
xgb__reg_lambda: [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]

#Ridge
rg__alpha: list(np.logspace(−2, 3, 30)) # 0.01 … 1000
# LightGBM
objective: ‘regression_l1’
metric: ‘rmse’
boosting_type: ‘gbdt’
random_state: 42
n_jobs: −1
verbose: −1
n_estimators: int [200, 2000]
learning_rate: float [0.01, 0.3]
num_leaves: int [20, 300]
max_depth: int [3, 12]
min_child_samples: int [5, 100]
subsample: float [0.6, 1.0]
colsample_bytree: float [0.6, 1.0]
reg_alpha: logfloat [1 × 10−8, 10.0]
reg_lambda: logfloat [1 × 10−8, 10.0]
# LightGBM
objective: ‘regression’
metric: ‘mae’
verbosity: −1
boosting_type: ‘gbdt’
random_state: 42
n_estimators: int [100, 1000]
learning_rate: logfloat [0.01, 0.3]
num_leaves: int [20, 300]
max_depth: int [3, 12]
min_child_samples: int [5, 100]
subsample: float [0.5, 1.0]
colsample_bytree: float [0.5, 1.0]
reg_alpha: logfloat [1 × 10−8, 10.0]
reg_lambda: logfloat [1 × 10−8, 10.0]
Table A10. ToT-Tuned hyperparameters for the Predicting the Beats-per-Minute of Songs task: An LLM-based overview.
LLM: GPT-5 Thinking | Gemini 2.5 Pro | DeepSeek-V3
Selected Algorithm: Ridge (blender, α = 1.0, OOF-based ensemble) | LightGBM | XGBoost
Tuned Hyperparameters:
# Ridge
alpha: float [1 × 10−3,1 × 103]
random_state: 42

# ElasticNet
alpha: float [1 × 10−3,1 × 103]
l1_ratio: float [0.01, 0.99]
random_state: 42

# ExtraTrees
n_estimators: int [200, 800]
max_depth: {None | 8 | 12 | 16 | 24}
min_samples_split: int [2, 20]
min_samples_leaf: int [1, 12]
max_features: {“sqrt” | “auto” | 0.5 | 0.8}
random_state: 42

# HistGradientBoostingRegressor
max_depth: int [3, 10]
learning_rate: float [0.01, 0.3]
l2_regularization: float [1 × 10−8, 10.0]
max_leaf_nodes: int [15, 63]
random_state: 42


# LightGBM
num_leaves: int [15, 255]
feature_fraction: float [0.5, 1.0]
bagging_fraction: float [0.5, 1.0]
bagging_freq: int [0, 7]
min_data_in_leaf: int [10, 200]
lambda_l1: float [1 × 10−3, 10.0]
lambda_l2: float [1 × 10−3, 10.0]
learning_rate: float [0.01, 0.2]
random_state: 42

# XGBoost
max_depth: int [3, 10]
min_child_weight: int [1, 20]
subsample: float [0.5, 1.0]
colsample_bytree: float [0.5, 1.0]
reg_alpha: float [1 × 10−3, 10.0]
reg_lambda: float [1 × 10−3, 10.0]
eta: float [0.01, 0.2]
random_state: 42

# CatBoost
depth: int [4, 10]
learning_rate: float [0.01, 0.2]
l2_leaf_reg: float [1.0, 15.0]
bagging_temperature: float [0.0, 5.0]
random_state: 42
# LightGBM
objective: ‘regression_l1’
metric: ‘rmse’
n_estimators: 2000
learning_rate: 0.01
feature_fraction: 0.8
bagging_fraction: 0.8
bagging_freq: 1
lambda_l1: 0.1
lambda_l2: 0.1
num_leaves: 31
verbose: −1
n_jobs: −1
seed: RANDOM_SEED
boosting_type: ‘gbdt’
# XGBoost
n_estimators: int [100, 1000]
max_depth: int [3, 10]
learning_rate: [0.01, 0.3]
subsample: float [0.6, 1.0]
colsample_bytree: float [0.6, 1.0]
reg_alpha: float [0, 1.0]
reg_lambda: float [0, 1.0]
random_state: 42

References

  1. Gençtürk, T.H.; Gülağiz, F.K.; Kaya, İ. Detection and segmentation of subdural hemorrhage on head CT images. IEEE Access 2024, 12, 82235–82246. [Google Scholar] [CrossRef]
  2. Laney, D. 3D Data Management: Controlling Data Volume, Velocity and Variety; META Group Research Note; META Group: Stamford, CT, USA, 2001. [Google Scholar]
  3. Gandomi, A.; Haider, M. Beyond the hype: Big data concepts, methods, and analytics. Int. J. Inf. Manag. 2015, 35, 137–144. [Google Scholar] [CrossRef]
  4. Du, X.; Liu, M.; Wang, K.; Wang, H.; Liu, J.; Chen, Y.; Feng, J.; Sha, C.; Peng, X.; Lou, Y. Evaluating large language models in class-level code generation. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, New York, NY, USA, 14–20 April 2024; pp. 1–13. [Google Scholar] [CrossRef]
  5. Li, J.; Li, G.; Zhang, X.; Zhao, Y.; Dong, Y.; Jin, Z.; Li, B.; Huang, F.; Li, Y. EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations. Adv. Neural Inf. Process. Syst. 2024, 37, 57619–57641. [Google Scholar] [CrossRef]
  6. Fakhoury, S.; Naik, A.; Sakkas, G.; Chakraborty, S.; Lahiri, S.K. Llm-based test-driven interactive code generation: User study and empirical evaluation. IEEE Trans. Softw. Eng. 2024, 50, 2254–2268. [Google Scholar] [CrossRef]
  7. Coignion, T.; Quinton, C.; Rouvoy, R. A performance study of llm-generated code on leetcode. In Proceedings of the 28th international conference on evaluation and assessment in software engineering, Salerno, Italy, 18–21 June 2024; pp. 79–89. [Google Scholar] [CrossRef]
  8. Tambon, F.; Dakhel, A.M.; Nikanjam, A.; Khomh, F.; Desmarais, M.C.; Antoniol, G. Bugs in Large Language Models Generated Code: An Empirical Study. Empir. Software Eng. 2025, 30, 65. [Google Scholar] [CrossRef]
  9. Li, J.; Li, G.; Li, Y.; Jin, Z. Structured Chain-of-Thought prompting for code generation. ACM Trans. Softw. Eng. Methodol. 2025, 34, 37. [Google Scholar] [CrossRef]
  10. Khojah, R.; de Oliveira Neto, F.G.; Mohamad, M.; Leitner, P. The impact of prompt programming on function-level code generation. IEEE Trans. Softw. Eng. 2025, 51, 2381–2395. [Google Scholar] [CrossRef]
  11. Yang, G.; Zhou, Y.; Chen, X.; Zhang, X.; Zhuo, T.Y.; Chen, T. Chain-of-thought in neural code generation: From and for lightweight language models. IEEE Trans. Softw. Eng. 2024, 50, 2437–2457. [Google Scholar] [CrossRef]
  12. Chen, B.; Zhang, Z.; Langrené, N.; Zhu, S. Unleashing the potential of prompt engineering for large language models. Patterns 2025, 6, 101260. [Google Scholar] [CrossRef]
  13. Yao, J.; Zhang, L.; Huang, J. Evaluation of Large Language Model-Driven AutoML in Data and Model Management from Human-Centered Perspective. Front. Artif. Intell. Sec. Nat. Lang. Process. 2025, 8, 1590105. [Google Scholar] [CrossRef]
  14. Fathollahzadeh, S.; Mansour, E.; Boehm, M. Demonstrating CatDB: LLM-based Generation of Data-centric ML Pipelines. In Proceedings of the Companion of the 2025 International Conference on Management of Data, New York, NY, USA, 22–27 June 2025; pp. 87–90. [Google Scholar] [CrossRef]
  15. Zhao, Y.; Pang, J.; Zhu, X.; Shao, W. LLM-Prompting Driven AutoML: From Sleep Disorder—Classification to Beyond. Trans. Artif. Intell. 2025, 1, 59–82. [Google Scholar] [CrossRef]
  16. Mulakala, B.; Saini, M.L.; Singh, A.; Bhukya, V.; Mukhopadhyay, A. Adaptive multi-fidelity hyperparameter optimization in large language models. In Proceedings of the 2024 8th International Conference on Computational System and Information Technology for Sustainable Solutions (CSITSS), Bengaluru, India, 7–9 November 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–5. [Google Scholar] [CrossRef]
  17. Zhang, M.R.; Desai, N.; Bae, J.; Lorraine, J.; Ba, J. Using large language models for hyperparameter optimization. In Proceedings of the NeurIPS 2023 Workshop, New Orleans, LA, USA, 16 December 2023. [Google Scholar] [CrossRef]
  18. Wang, L.; Shi, C.; Du, S.; Tao, Y.; Shen, Y.; Zheng, H.; Qiu, X. Performance Review on LLM for solving leetcode problems. In Proceedings of the 2024 4th International Symposium on Artificial Intelligence and Intelligent Manufacturing (AIIM), Chengdu, China, 20–22 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1050–1054. [Google Scholar] [CrossRef]
  19. Jain, R.; Thanvi, J.; Subasinghe, A. The evolution of ChatGPT for programming: A comparative study. Eng. Res. Express 2025, 7, 015242. [Google Scholar] [CrossRef]
  20. Döderlein, J.B.; Kouadio, N.H.; Acher, M.; Khelladi, D.E.; Combemale, B. Piloting Copilot, Codex, and StarCoder2: Hot temperature, cold prompts, or black magic? J. Syst. Softw. 2025, 230, 112562. [Google Scholar] [CrossRef]
  21. Jamil, M.T.; Abid, S.; Shamail, S. Can LLMs Generate Higher Quality Code Than Humans? An Empirical Study. In Proceedings of the 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR), Ottawa, ON, Canada, 28–29 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 478–489. [Google Scholar] [CrossRef]
  22. Mathews, N.S.; Nagappan, M. Test-driven development and llm-based code generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1583–1594. [Google Scholar] [CrossRef]
  23. Ko, E.; Kang, P. Evaluating Coding Proficiency of Large Language Models: An Investigation Through Machine Learning Problems. IEEE Access 2025, 13, 52925–52938. [Google Scholar] [CrossRef]
  24. Radford, A.; Narasimhan, K. Improving Language Understanding by Generative Pre-Training. OpenAI. 2018. Available online: https://api.semanticscholar.org/CorpusID:49313245 (accessed on 3 August 2025).
  25. OpenAI. Introducing OpenAI o3 and o4-Mini. OpenAI. 2025. Available online: https://openai.com/index/introducing-o3-and-o4-mini (accessed on 3 August 2025).
  26. OpenAI. Introducing GPT-4.1 in the API. OpenAI. 2025. Available online: https://openai.com/index/gpt-4-1/ (accessed on 3 August 2025).
  27. OpenAI. ChatGPT Release Notes. OpenAI Help Center. 2025. Available online: https://help.openai.com/en/articles/6825453-chatgpt-release-notes (accessed on 3 August 2025).
  28. Comanici, G.; Bieber, E.; Schaekermann, M.; Pasupat, I.; Sachdeva, N.; Dhillon, I.; Blistein, M.; Ram, O.; Zhang, D.; Rosen, E.; et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv 2025, arXiv:2507.06261. [Google Scholar] [CrossRef]
  29. DeepSeek-AI. Deepseek-AI/Organization Card. Hugging Face. Available online: https://huggingface.co/deepseek-ai (accessed on 3 August 2025).
  30. DeepSeek-AI; Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; et al. Deepseek-v3 technical report. arXiv 2024, arXiv:2412.19437. [Google Scholar] [CrossRef]
  31. DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv 2025, arXiv:2501.12948. [Google Scholar] [CrossRef]
  32. Anthropic. Introducing Claude. 2023. Available online: https://www.anthropic.com/index/introducing-claude (accessed on 3 August 2025).
  33. Anthropic. System Card: Claude Opus 4 & Claude Sonnet 4. Anthropic. 2025. Available online: https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf (accessed on 3 August 2025).
  34. Anthropic. Introducing Claude 4. Anthropic. 2025. Available online: https://www.anthropic.com/news/claude-4 (accessed on 3 August 2025).
  35. Tonmoy, S.M.; Zaman, S.M.; Jain, V.; Rani, A.; Rawte, V.; Chadha, A.; Das, A. A comprehensive survey of hallucination mitigation techniques in large language models. arXiv 2024, arXiv:2401.01313. [Google Scholar] [CrossRef]
  36. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 195. [Google Scholar] [CrossRef]
  37. Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv 2024, arXiv:2402.07927. [Google Scholar] [CrossRef]
  38. Tafesse, W.; Wood, B. Hey ChatGPT: An examination of ChatGPT prompts in marketing. J. Mark. Anal. 2024, 12, 790–805. [Google Scholar] [CrossRef]
  39. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 39. [Google Scholar] [CrossRef]
  40. Lee, Y.; Oh, J.H.; Lee, D.; Kang, M.; Lee, S. Prompt engineering in ChatGPT for literature review: Practical guide exemplified with studies on white phosphors. Sci. Rep. 2025, 15, 15310. [Google Scholar] [CrossRef] [PubMed]
  41. Debnath, T.; Siddiky, M.N.A.; Rahman, M.E.; Das, P.; Guha, A.K. A comprehensive survey of prompt engineering techniques in large language models. TechRxiv 2025. [Google Scholar] [CrossRef]
  42. Phoenix, J.; Taylor, M. Prompt Engineering for Generative AI; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2024. [Google Scholar]
  43. Saint-Jean, D.; Al Smadi, B.; Raza, S.; Linton, S.; Igweagu, U. A Study of Prompt Engineering Techniques for Code Generation: Focusing on Data Science Applications. In International Conference on Information Technology-New Generations; Springer Nature: Cham, Switzerland, 2025; pp. 445–453. [Google Scholar] [CrossRef]
  44. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
  45. Bareiß, P.; Souza, B.; d’Amorim, M.; Pradel, M. Code generation tools (almost) for free? A study of few-shot, pre-trained language models on code. arXiv 2022, arXiv:2206.01335. [Google Scholar] [CrossRef]
  46. Xu, D.; Xie, T.; Xia, B.; Li, H.; Bai, Y.; Sun, Y.; Wang, W. Does few-shot learning help LLM performance in code synthesis? arXiv 2024, arXiv:2412.02906. [Google Scholar] [CrossRef]
  47. Khot, T.; Trivedi, H.; Finlayson, M.; Fu, Y.; Richardson, K.; Clark, P.; Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. arXiv 2022, arXiv:2210.02406. [Google Scholar] [CrossRef]
  48. Suzgun, M.; Scales, N.; Scharli, N.; Gehrmann, S.; Tay, Y.; Chung, H.W.; Chowdhery, A.; Le, Q.V.; Chi, E.H.; Zhou, D.; et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022. [Google Scholar]
  49. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. arXiv 2023, arXiv:2305.10601. [Google Scholar]
  50. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  51. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  52. Feurer, M.; Klein, A.; Eggensperger, K.; Springenberg, J.; Blum, M.; Hutter, F. Efficient and robust automated machine learning. Adv. Neural Inf. Process. Syst. 2015, 28, 2962–2970. [Google Scholar]
  53. Machine Learning Professorship Freiburg. auto-sklearn—AutoSklearn 0.15.0 documentation. Machine Learning Professorship Freiburg. Available online: https://automl.github.io/auto-sklearn/master/ (accessed on 26 September 2025).
  54. Erickson, N.; Mueller, J.; Shirkov, A.; Zhang, H.; Larroy, P.; Li, M.; Smola, A. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv 2020, arXiv:2003.06505. [Google Scholar] [CrossRef]
  55. Truong, A.; Walters, A.; Goodsitt, J.; Hines, K.; Bruss, C.B.; Farivar, R. Towards automated machine learning: Evaluation and comparison of AutoML approaches and tools. In Proceedings of the 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), Portland, OR, USA, 4–6 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1471–1479. [Google Scholar]
  56. Tian, J.; Che, C. Automated machine learning: A survey of tools and techniques. J. Ind. Eng. Appl. Sci. 2024, 2, 71–76. [Google Scholar] [CrossRef]
  57. Baratchi, M.; Wang, C.; Limmer, S.; Van Rijn, J.N.; Hoos, H.; Bäck, T.; Olhofer, M. Automated machine learning: Past, present and future. Artif. Intell. Rev. 2024, 57, 122. [Google Scholar] [CrossRef]
  58. Quaranta, L.; Azevedo, K.; Calefato, F.; Kalinowski, M. A multivocal literature review on the benefits and limitations of industry-leading AutoML tools. Inf. Softw. Technol. 2025, 178, 107608. [Google Scholar] [CrossRef]
  59. An, J.; Kim, I.S.; Kim, K.J.; Park, J.H.; Kang, H.; Kim, H.J.; Kim, Y.S.; Ahn, J.H. Efficacy of automated machine learning models and feature engineering for diagnosis of equivocal appendicitis using clinical and computed tomography findings. Sci. Rep. 2024, 14, 22658. [Google Scholar] [CrossRef] [PubMed]
  60. Wang, J.; Xue, Q.; Zhang, C.W.; Wong, K.K.L.; Liu, Z. Explainable coronary artery disease prediction model based on AutoGluon from AutoML framework. Front. Cardiovasc. Med. 2024, 11, 1360548. [Google Scholar] [CrossRef]
  61. Shoaib, H.A.; Rahman, M.A.; Maua, J.; Rahman, A.; Mridha, M.F.; Kim, P.; Shin, J. An enhanced deep learning approach to potential purchaser prediction: AutoGluon ensembles for cross-industry profit maximization. IEEE Open J. Comput. Soc. 2025, 6, 468–479. [Google Scholar] [CrossRef]
Figure 1. Pseudocode for measuring runtime memory footprint.
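The pseudocode in Figure 1 is not reproduced here, but the idea can be sketched as follows: wrap each generated solution in a timer and a memory tracer, then report elapsed seconds and peak megabytes, the two efficiency metrics used throughout the tables. The sketch below is a minimal, illustrative version based on Python's tracemalloc; the function name and the tracing backend are assumptions rather than the exact procedure shown in the figure.

import time
import tracemalloc

def run_with_profiling(pipeline_fn, *args, **kwargs):
    """Run a training/inference callable while recording execution time and
    peak memory. `pipeline_fn` is any callable that runs an LLM-generated
    solution and returns its result (illustrative, not the exact Figure 1 procedure)."""
    tracemalloc.start()                      # begin tracking Python allocations
    start = time.perf_counter()
    result = pipeline_fn(*args, **kwargs)    # e.g., the generated solution script
    elapsed_s = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    peak_mb = peak_bytes / (1024 ** 2)       # convert to MB, as reported in the tables
    return result, elapsed_s, peak_mb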
Figure 2. Summary of LLM performance across four Kaggle benchmarks under two prompting strategies. (a) Titanic—Machine Learning from Disaster; (b) House Prices—Advanced Regression Techniques; (c) Digit Recognizer; (d) Natural Language Processing with Disaster Tweets. FSP: Few Shot Prompting; ToT: Tree of Thoughts; RMSE: Root Mean Squared Error.
Figure 3. Heatmaps showing the effect of hyperparameter tuning on test accuracy for the Titanic task across LLMs and prompting strategies. (a) Non-tuned models; (b) HP-tuned models. FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 4. Efficiency metrics of Titanic task across LLMs and prompting strategies. (a) Peak memory usage (MB); (b) Execution time (s, log scale). FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 5. Heatmaps showing the effect of hyperparameter tuning on test RMSE for the House Prices task across LLMs and prompting strategies. (a) Non-tuned models; (b) HP-tuned models. FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 6. Efficiency metrics of House Prices task across LLMs and prompting strategies. (a) Peak memory usage (MB); (b) Execution time (s, log scale). FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 7. Heatmaps showing the effect of hyperparameter tuning on test accuracy for the Digit Recognizer task across LLMs and prompting strategies. (a) Non-tuned models; (b) HP-tuned models. FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 8. Efficiency metrics of Digit Recognizer task across LLMs and prompting strategies. (a) Peak memory usage (MB); (b) Execution time (s, log scale). FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 9. Heatmaps showing the effect of hyperparameter tuning on test F1 score for the Disaster Tweets task across LLMs and prompting strategies. (a) Non-tuned models; (b) HP-tuned models. FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 10. Efficiency metrics of Disaster Tweets task across LLMs and prompting strategies. (a) Peak memory usage (MB); (b) Execution time (s, log scale). FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 11. Heatmaps showing the effect of hyperparameter tuning on RMSE for the Predicting the Beats-per-Minute of Songs task across LLMs and prompting strategies. (a) Non-tuned models; (b) HP-tuned models. FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Figure 12. Efficiency metrics of Predicting the Beats-per-Minute of Songs task across LLMs and prompting strategies. (a) Peak memory usage (MB); (b) Execution time (s, log scale). FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Table 1. Ranking of the indefinite competitions on Kaggle based on the number of total teams.
Rank | Name | Total Teams | Submissions
1 | Titanic—ML from Disaster | 15,866 | 60,236
2 | Housing Prices Competition for Kaggle Learn Users | 5966 | 18,933
3 | House Prices—Advanced Regression Techniques | 4653 | 22,907
4 | Spaceship Titanic | 2116 | 13,123
5 | Digit Recognizer | 1630 | 4702
6 | NLP with Disaster Tweets | 901 | 3520
7 | Store Sales—Time Series Forecasting | 746 | 2643
8 | LLM Classification Finetuning | 255 | 1091
9 | Connect X | 193 | 508
10 | I'm Something of a Painter Myself | 129 | 315
All counts (Table 1) were recorded from the Kaggle public competition page on 18 May 2025; values may change over time due to account removals, team merges, or Kaggle's periodic data updates.
Table 2. Overview of Kaggle datasets used for comparative evaluation.
Dataset | Size (Train/Test) | Independent Attributes | Dependent Attribute | Data Type | Task
Titanic | Train 891 / Test 418 | 10 (+Passenger ID column) | Survived (Yes: 1/No: 0) | Numerical & Categorical | Binary Classification
House Price | Train 1460 / Test 1459 | 79 (+ID column) | Sale Price (Continuous) | Numerical & Categorical | Regression
MNIST | Train 42,000 / Test 28,000 | 784 pixels | Label (Digit 0–9) | Numerical (flattened 28 × 28 pixels) | Multiclass Classification
Disaster Tweets | Train 7613 / Test 3263 | 3 (+ID column) | Disaster (Relevant: 1/Irrelevant: 0) | Numerical & String | Binary Text Classification (NLP)
Predicting the Beats-per-Minute of Songs | Train 524,164 / Test 174,722 | 9 (+ID column) | Beats Per Minute (Continuous) | Numerical | Regression
Table 3. Kaggle competition datasets and access dates used in this study.
Competition | Data Version | Date of Access
Titanic—ML from Disaster | N/A (competition data) | 3–7 June 2025
House Prices—Advanced Regression Techniques | N/A (competition data) | 7–13 June 2025
Digit Recognizer | N/A (competition data) | 13–26 June 2025
NLP with Disaster Tweets | N/A (competition data) | 15 June–15 July 2025
Predicting the Beats-per-Minute of Songs | N/A (competition data) | 18–25 September 2025
Table 4. Configuration details of the LLMs used in the experiments.
Model | Version | Platform | Context Length | Inference Setting
OpenAI o3 | o3 | OpenAI ChatGPT/Web | Up to 200 K | Default
GPT-4.1 | gpt-4.1 | OpenAI ChatGPT/Web | Up to 1 M | Default
GPT-5 Thinking | standard | OpenAI ChatGPT/Web | Up to 400 K | Default
Gemini 2.5 Pro | gemini-pro | Google AI Studio | ~1 M | Default
DeepSeek-V3 | deepseek-coder-v3-0324 | Hugging Face/Web interface | 128 K | Default
DeepSeek-R1 | deepseek-coder-R1 | Hugging Face/Web interface | 128 K | Default
Claude Opus 4 | claude-4-opus | Claude AI Web (Anthropic) | Up to 200 K | Default
Claude Sonnet 4 | claude-4-sonnet | Claude AI Web (Anthropic) | Up to 200 K | Default
Table 5. Summary of prompting templates designed for the FSP and ToT approaches. FSP: Few Shot Prompting; ToT: Tree of Thoughts; MLOps: Machine Learning Operations.
Role Definition
  FSP: You are a Kaggle Grandmaster and senior MLOps engineer.
  ToT: You are a Kaggle Grandmaster and senior MLOps engineer.
Task Description
  FSP: Build the best possible, fully-reproducible solution for the "Kaggle House Prices—Advanced Regression Techniques" competition.
  ToT: Build the best possible, fully-reproducible solution for the "Kaggle House Prices—Advanced Regression Techniques" competition.
Requirements
  FSP: Load train.csv & test.csv; perform cleaning, feature engineering, and exploratory data analysis; select, train & evaluate one or more algorithms; apply CV; optimize hyperparameters; save submission.csv; set seed.
  ToT: Same technical requirements, plus: at each step, brainstorm alternatives, evaluate them, and select the best strategy before continuing.
Output Constraints
  FSP: A single line starting with PLAN:; Python code blocks only; no markdown, comments, or explanations outside code.
  ToT: A single PLAN: line (step-by-step strategy); one or more Python code blocks; no extra prose or commentary outside those.
Example PLAN Output
  FSP: 1. Load data. 2. Preprocess (impute, encode). 3. Train with CV. 4. Predict. 5. Save submission.
  ToT: 1. Load data. 2. Compare feature engineering opt. 3. Choose best algorithm strategy. 4. Train and tune. 5. Predict and save.
In-Context Examples
  FSP: Presents representative input–output pairs (Example 1, Example 2) with full code implementations to clarify the intended output structure.
  ToT: Relies on internal reasoning at inference time, without supplying prior example demonstrations.
Example Code Block
  FSP: import pandas as pd, numpy as np; from sklearn.linear_model import Ridge
  ToT: import pandas as pd, numpy as np; from catboost import CatBoostRegressor; from xgboost import XGBRegressor
Reasoning Type
  FSP: Learns from examples; extracts structure from the examples.
  ToT: Thinks iteratively; evaluates alternatives; chooses the best solution through reasoning.
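To make the FSP structure in Table 5 concrete, the sketch below assembles a prompt of this shape as a Python string. The wording is paraphrased from the table and the variable names are illustrative; this is not the verbatim prompt used in the experiments.

# Illustrative assembly of an FSP-style prompt following the Table 5 template.
ROLE = "You are a Kaggle Grandmaster and senior MLOps engineer."
TASK = ("Build the best possible, fully-reproducible solution for the "
        "Kaggle House Prices - Advanced Regression Techniques competition.")
REQUIREMENTS = [
    "Load train.csv & test.csv",
    "Perform cleaning, feature engineering, and exploratory data analysis",
    "Select, train & evaluate one or more algorithms",
    "Apply cross-validation",
    "Optimize hyperparameters",
    "Save submission.csv",
    "Set a random seed",
]
CONSTRAINTS = ("Start with a single line beginning with PLAN:, then give Python "
               "code blocks only, with no markdown or prose outside the code.")
EXAMPLE_PLAN = ("PLAN: 1. Load data. 2. Preprocess (impute, encode). "
                "3. Train with CV. 4. Predict. 5. Save submission.")

fsp_prompt = "\n".join([
    ROLE,
    TASK,
    "Requirements:",
    *[f"- {req}" for req in REQUIREMENTS],
    CONSTRAINTS,
    "Example PLAN output:",
    EXAMPLE_PLAN,
])
print(fsp_prompt)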
Table 6. Auto-sklearn configuration summary for the evaluated tasks. RMSE: Root mean squared error.
Framework & version: auto-sklearn 0.15.0, Python 3.10.18 (all tasks)
Metric: Accuracy (Titanic), RMSE (House Price), RMSE (Beats per Minute of Songs), F1 Score (NLP with Disaster Tweets), Accuracy (Digit Recognizer)
seed: 42 (all tasks)
memory_limit: 8192 MB (all tasks)
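A minimal sketch of how the Table 6 settings map onto the auto-sklearn API is given below for the Titanic (accuracy) case. The one-hour time budget and the get_dummies preprocessing are assumptions; Table 6 only fixes the metric, seed, and memory limit.

# Sketch of an auto-sklearn 0.15 run matching the Table 6 settings (Titanic case).
import pandas as pd
from autosklearn.classification import AutoSklearnClassifier
from autosklearn.metrics import accuracy

train = pd.read_csv("train.csv")
X = pd.get_dummies(train.drop(columns=["Survived", "PassengerId"]))  # assumed preprocessing
y = train["Survived"]

automl = AutoSklearnClassifier(
    metric=accuracy,               # accuracy for Titanic; RMSE/F1 for the other tasks
    seed=42,                       # as in Table 6
    memory_limit=8192,             # MB, as in Table 6
    time_left_for_this_task=3600,  # assumed budget, not specified in the table
)
automl.fit(X, y)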
Table 7. AutoGluon configuration summary for the evaluated tasks. RMSE: Root mean squared error.
Framework & version: AutoGluon-Tabular 1.4.0, Python 3.12.x (all tasks)
Preset: high_quality
Metric: Accuracy (Titanic), RMSE (House Price, Beats per Minute of Songs), F1 Score (NLP with Disaster Tweets), Accuracy (Digit Recognizer)
time_limit: 7200 s or 14,400 s (task-dependent)
auto_stack: True; excluded_model_types: None; seed: 42
ag_args_fit: max_memory_usage_ratio: 0.8, with num_gpus: 0 or num_gpus: 1 (task-dependent)
hyperparameter_tune_kwargs: auto or num_trials: 80 (task-dependent)
num_stack_levels: 0 (where set); num_bag_folds: 3 (where set)
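Similarly, the Table 7 settings correspond roughly to the AutoGluon call sketched below, shown for the Titanic task on CPU. The preset, time limit, stacking flag, memory ratio, and tuning option come from the table; the column handling and everything else are illustrative assumptions.

# Sketch of an AutoGluon-Tabular 1.4 run using the Table 7 settings (Titanic case).
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")

predictor = TabularPredictor(
    label="Survived",
    eval_metric="accuracy",          # RMSE / F1 for the other tasks
).fit(
    train_data.drop(columns=["PassengerId"]),
    presets="high_quality",
    time_limit=7200,                 # 7200 s here; 14,400 s for the larger tasks
    auto_stack=True,
    ag_args_fit={"max_memory_usage_ratio": 0.8, "num_gpus": 0},
    hyperparameter_tune_kwargs="auto",
)
print(predictor.leaderboard())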
Table 8. Results for the Titanic task: LLM × prompting strategy × hyperparameter tuning. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4. CV Acc.: Cross Validation Accuracy; Exec. Time: Execution Time; FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned; GBC: Gradient Boosting Classifier; RFC: Random Forest Classifier.
(a) Columns: o3 | o3 (HP-tuned) | GPT-4.1 | 4.1 (HP-tuned) | Gemini 2.5 Pro | 2.5 Pro (HP-tuned)
FSP
  CV Acc.: 0.8316 | 0.844 | 0.8260 | 0.8339 | 0.8395 | 0.8384
  Test Acc.: 0.75358 | 0.75358 | 0.77990 | 0.76555 | 0.77990 | 0.77033
  Exec. Time: 27 s | 1 min 43 s | 25 s | 1 min 22 s | 24 s | 2 min 33 s
  Memory (MB): 282.12 | 278.36 | 288.18 | 288.58 | 279.84 | 298.91
  Algorithm: GBC | GBC | RFC | RFC | RFC | RFC
ToT
  CV Acc.: 0.8137 | 0.8294 | 0.8137 | 0.8406 | 0.8227 | 0.8316
  Test Acc.: 0.74641 | 0.76555 | 0.74641 | 0.77272 | 0.74880 | 0.74401
  Exec. Time: 29 s | 44 s | 28 s | 1 min 3 s | 23 s | 48 s
  Memory (MB): 296.57 | 273.26 | 296.32 | 297.64 | 272.04 | 293.06
  Algorithm: RFC | RFC | RFC | RFC | RFC | RFC
(b) Columns: DeepSeek-V3 | V3 (HP-tuned) | DeepSeek-R1 | R1 (HP-tuned) | Claude Opus 4 | Opus 4 (HP-tuned) | Claude Sonnet 4 | Sonnet 4 (HP-tuned)
FSP
  CV Acc.: 0.79232 | 0.826 | 0.8227 | 0.8328 | 0.8271 | 0.8351 | 0.8339 | 0.8395
  Test Acc.: 0.75119 | 0.78229 | 0.76555 | 0.77990 | 0.74880 | 0.76794 | 0.76794 | 0.76555
  Exec. Time: 24 s | 1 min 37 s | 26 s | 1 min 36 s | 36 s | 4 min 25 s | 29 s | 3 min 6 s
  Memory (MB): 282.03 | 283.47 | 277.07 | 291.67 | 295.98 | 295.02 | 281.41 | 306.54
  Algorithm: RFC | RFC | RFC | RFC | GBC | RFC | RFC | RFC
ToT
  CV Acc.: 0.813 | 0.8294 | 0.8103 | 0.8350 | 0.8182 | 0.8339 | 0.8159 | 0.8395
  Test Acc.: 0.74641 | 0.76555 | 0.73684 | 0.77751 | 0.74401 | 0.76555 | 0.74641 | 0.76794
  Exec. Time: 26 s | 41 s | 31 s | 42 s | 29 s | 3 min 27 s | 30 s | 3 min 46 s
  Memory (MB): 282.22 | 273.88 | 287.54 | 286.63 | 301.98 | 300.11 | 296.32 | 309.67
  Algorithm: RFC | RFC | RFC | RFC | RFC | RFC | RFC | RFC
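The HP-tuned Titanic runs in Table 8 mostly converge on a tuned Random Forest. A minimal sketch of that kind of search is given below, using the DeepSeek-V3 grid later reported in Table 16; the 5-fold cross-validation, the dropped text-like columns, and the median imputation are assumptions rather than details taken from any generated script.

# Sketch of Random Forest hyperparameter tuning for Titanic (grid from Table 16).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Load the Kaggle Titanic training file and apply simple, assumed preprocessing.
train = pd.read_csv("train.csv")
X = pd.get_dummies(train.drop(columns=["Survived", "PassengerId", "Name", "Ticket", "Cabin"]))
X = X.fillna(X.median())
y = train["Survived"]

# Grid taken from the DeepSeek-V3 column of Table 16.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # assumed fold count
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))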
Table 9. Results for the House Prices task: LLM × prompting strategy × hyperparameter tuning. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4. RMSE: Root Mean Squared Error; Exec. Time: Execution Time; FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned; LGBMR: Light Gradient Boosting Machine Regressor; XGBR: eXtreme Gradient Boosting Regressor; CBR: CatBoost Regressor; GBR: Gradient Boosting Regressor; RFR: Random Forest Regressor.
(a)
LLMOpenAIGPTGemini
Serieso3o3
(HP-Tuned)
4.14.1
(HP-Tuned)
2.5 Pro2.5 Pro
(HP-Tuned)
FSPRMSE0.124040.128020.123010.125300.120070.12256
Exec. Time1 min 42 s7 min 58 s32 s28 min 13 s1 min 43 s41 s
Memory (MB)438.14299.00273.68842.48454.28420.48
AlgorithmLGBMR + XGBR + RidgeXGBRRidge + LassoLGBMR
+ XGBR+ CBR
LGBMR + XGBR+ CBR + RidgeLGBMR + XGBR + CBR
ToTRMSE0.121310.130980.130520.121410.121610.12211
Exec. Time5 min 8 s5 min 8 s22 s1 min 48 s34 s35 s
Memory (MB)274.68365.88368.68298.44375.98380.50
AlgorithmLasso + ElasticNet + GBRLGBMRLGBMRLasso + XGBRRidge + LGBMRRidge + LGBMR
(b)
LLMDeepSeekClaude
SeriesV3V3
(HP-tuned)
R1R1
(HP-tuned)
Opus 4Opus 4
(HP-tuned)
Sonnet 4Sonnet 4
(HP-tuned)
FSPRMSE0.124520.125080.126020.123170.126910.123610.124460.12487
Exec. Time4 min 48 s8 min 28 s1 min 26 s51 s55 s22 min 26 s44 s3 min 18 s
Memory (MB)442.81483.73458.61438.46439.98590.97413.82455.34
AlgorithmXGBR+ LGBMR+ CBR,
Final Model: Ridge
XGBR + LGBMR + CBRCBR+ XGBR+ LGBMRXGBR+ LGBMR+ CBR+ RidgeCBR+ XGBR+ LGBMRMeta Model: Ridge
XGBR+ LGBMR+ CBR
LGBMR+ XGBR+
CBR
LGBM+ XGBR+
CBR
ToTRMSE0.130270.130950.146490.128430.136600.123400.133480.12438
Exec. Time35 s4 min 50 s43 s10 h 41 min 50 s40 s3 min 16 s41 s1 min 45 s
Memory (MB)363.61437.12280.83383.1309.52393.02417.71425.72
AlgorithmLGBMRLGBMR+ XGBR+
CBR
RFRLGBMRRidge +
Lasso+ RFR+ XGBR
Ridge + Lasso + ElasticNet + XGBR + LGBMRRFR+ XGBR+ LGBMRXGBR+ LGBMR+ Ridge+ ElasticNet+ RFR
Table 10. Results for the Digit Recognizer task: LLM × prompting strategy × hyperparameter tuning. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4. CV Acc.: Cross Validation Accuracy; Exec. Time: Execution Time; FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned; CNN: Convolutional Neural Network; VGG: Visual Geometry Group.
(a)
LLMOpenAIGPTGemini
Serieso3o3
(HP-Tuned)
4.14.1
(HP-Tuned)
2.5 pro2.5 pro
(HP-tuned)
FSPCV Acc.99.28570.98760.98530.990.98720.9956
Test Acc.0.991250.989390.987500.991280.986750.99450
Exec. Time23 min 3 s14 min 28 s 2 min 8 s 3 min 13 s4 min 33 s37 min 17 s
Memory (MB)3194.353067.412772.102721.932817.534200.00
AlgorithmCNNCNNCNNCNNCNNCNN
ToTCV Acc.99.278699.457140.99290.992240.995690.9937
Test Acc.0.994670.995170.993100.992530.996210.99639
Exec. Time8 min 7 s19 min 33 s13 min 26 s44 min 59 s26 min 8 s45 min 15 s
Memory (MB)2020.122050.821760.642027.672632.752451.20
AlgorithmSmall CNNMedium CNNSimple CNNSimple CNNVGG like CNNCNN
(b)
LLMDeepSeekClaude
SeriesV3V3
(HP-tuned)
R1R1
(HP-tuned)
Opus 4Opus 4
(HP-tuned)
Sonnet 4Sonnet 4
(HP-tuned)
FSPCV Acc.0.98600.99570.99110.98950.98790.9960.990.99
Test Acc.0.989960.993210.992250.992710.983850.994820.987030.99246
Exec. Time6 min 39 s13 min 7 s5 min 27 s9 min 53 s8 min 34 s12 min 43 s2 min 7 s27 min 4 s
Memory (MB)4143.052697.582752.932714.183079.885080.943073.102927.02
AlgorithmCNNCNNCNNCNNCNNCNNCNNCNN+ RFC
ToTCV Acc.97.830.99260.99160.983799.4899.490.993290.99314
Test Acc.0.981170.988530.992600.985140.995460.995390.993350.99307
Exec. Time1 min 48 s1 min 42 s6 min 26 s41 min 1 s19 min 55 s20 min 25 s17 min 29 s15 min 42 s
Memory (MB)2630.612154.872226.444344.482198.39 2172.882912.292912.65
AlgorithmCNNCNNCNNCNNSimple CNNSimple CNNCNNCNN
Table 11. Results for the NLP with Disaster Tweets task: LLM × prompting strategy × hyperparameter tuning. (a) OpenAI o3, GPT-4.1, Gemini 2.5 Pro; (b) DeepSeek-V3 and R1 and Claude Opus 4 and Claude Sonnet 4. CV F1 score: Cross Validation F1 score; Exec. Time: Execution Time; FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned; TF-idf: Term Frequency–Inverse Document Frequency; LRC: Logistic Regression Classifier; LinearSVC: Linear Support Vector Classifier; RFC: Random Forest Classifier; XGBC: eXtreme Gradient Boosting Classifier; SVM: Support Vector Machine; LGBMC: Light Gradient Boosting Machine Classifier; NB: Naive Bayes.
(a)
LLMOpenAIGPTGemini
Serieso3o3
(HP-Tuned)
4.14.1
(HP-Tuned)
2.5 Pro2.5 Pro
(HP-Tuned)
FSPCV F1 Score0.79270.79720.80110.78080.838300.80494
Test F1 Score0.797420.794050.796500.783020.836950.79865
Exec. Time28 s25 s22 s8 min 52 s25 min 25 s22 s
Memory (MB)284.97288.98314.32391.314864.11279.46
AlgorithmTF-idf +LRCTF-idf +LRCTF-idf +LRCTF-idf +
Stacking Classifier
Hugging Face
Bert-Base-Uncased
TF-idf +LRC
ToTCV F1 Score0.79180.81410.79230.80740.800080.80192
Test F1 Score0.790070.814280.789150.806920.787920.79926
Exec. Time39s41 s35 s28 s27 s47 s
Memory (MB)363.88429.84352.58348.71273.73285.75
AlgorithmTF-idf + LinearSVCTF-idf +LRCTF-idf + LinearSVCTF-idf +LRCTF-idf +LRCTF-idf +LRC
(b)
LLMDeepSeekClaude
SeriesV3V3
(HP-tuned)
R1R1
(HP-tuned)
Opus 4Opus 4
(HP-tuned)
Sonnet 4Sonnet 4(HP-tuned)
FSPCV F1 Score0.73440.73050.80240.8028LRC: 0.8007
RFC: 0.7787
XGBC: 0.7906
0.81010.78920.7950
Test F1 Score0.798950.797110.798340.796810.796190.807530.793130.79129
Exec. Time36 s31 s21 s1 min 12 s1 min 1 s49 min 5 s31 s7 h 8 min 35 s
Memory (MB)269.09316.75280.21272.31333.52586.85292.432221.02
AlgorithmTF-idf + LRCTF-idf + LRCTF-idf + LRCTF-idf + LRCTF-idf+
(LRC/
RFC/
XGBC)
TF-idf +
(LRC/SVM/LGBMC)
TF-idf + LRC+ NBTF-idf+ LRC+
SVM
ToTCV F1 Score0.77740.73160.73230.80090.78870.79020.77930.7760
Test F1 Score0.789760.795280.798950.797730.783320.785470.789450.77811
Exec. Time1 min 36 s25 s23 s33 s32 s3 min 68 s26 s3 h 51 min 51 s
Memory (MB)319.79284.76283.12285.931804.99329.46389.113875.11
AlgorithmTF-idf + LRCTF-idf + LRCTF-idf + LRCTF-idf + LRCTF-idf+ LGBMC TF-idf + XGBCTF-idf + LGBMCTF-idf +
XGBC+ SVM + NB
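Across Table 11, nearly every model settles on a TF-IDF representation with a linear classifier, most often Logistic Regression. The sketch below shows that pipeline in both an untuned form and a lightly HP-tuned form; the vectorizer settings, the regularization grid, and the fold count are illustrative assumptions.

# Sketch of the TF-IDF + Logistic Regression pipeline dominant in Table 11.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

train = pd.read_csv("train.csv")   # Disaster Tweets training file ("text", "target" columns)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),   # assumed settings
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Untuned baseline: cross-validate the default pipeline (FSP-style run).
f1_cv = cross_val_score(pipeline, train["text"], train["target"], scoring="f1", cv=5).mean()

# HP-tuned variant: a small grid over the regularization strength.
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, scoring="f1", cv=5)
search.fit(train["text"], train["target"])
print(round(f1_cv, 4), search.best_params_)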
Table 12. Results for the Predicting the Beats-per-Minute of Songs task: LLM × prompting strategy × hyperparameter tuning. RMSE: Root Mean Squared Error; Exec. Time: Execution Time; FSP: Few-Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned; LGBMR: Light Gradient Boosting Machine Regressor; XGBR: eXtreme Gradient Boosting Regressor; RFR: Random Forest Regressor; CBR: CatBoost Regressor; GBR: Gradient Boosting Regressor; HGBR: Histogram-based Gradient Boosting Regressor; ETR: Extra Trees Regressor.
LLMGPTGeminiDeepSeek
Series5 Thinking5 Thinking
(HP-Tuned)
2.5 Pro2.5 Pro
(HP-Tuned)
V3V3
(HP-Tuned)
FSPRMSE26.3926626.4019226.3908126.4047926.3955426.38663
Exec. Time3 h 31 min 18 s1 h 22 min 16 s1 h 24 min 46 s15 min 15 s35 min 44 s9 min 41 s
Memory (MB)1123.58957.88791.73606.72905.31697.42
AlgorithmLGBMRCBR/
XGBR/Ridge
Ridge/
RFR/
LGBMR/
XGBR/
CBR
LGBMRRidge/
RFR/
LGBMR/
XGBR/
CBR
LGBMR
ToTRMSE26.3873426.3876026.3887926.3929626.4227326.38801
Exec. Time3 h 35 min 29 s2 h 55 min 28 s47 s1 min 21 s46 min 28 s23 min 21 s
Memory (MB)461.141987.59684.24631.26480.211278.91
AlgorithmRidge/
Lasso/
ElasticNet/
RFR/
GBR/
HGBR
Ridge/
ElasticNet/ETR/HGBR/LGBMR/XGBR/CBR

Ridge blend TOP_K OOF/TEST preds
LGBMRLGBMRRFRXGBR
Table 13. Appropriate prompting techniques and practical trends across different Kaggle tasks. FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned.
Task | Search/Design Space | Appropriate Prompt Technique | In Practice (General Trend) | Best
Titanic | Narrow | FSP | FSP | FSP (HP-tuned)
House Prices | Medium | FSP (usually) | FSP (non-tuned)/ToT (HP-tuned) | FSP (non-tuned)
Digit Recognizer | Very Large | FSP | ToT | ToT (HP-tuned)
Disaster Tweets | Large | ToT | FSP (non-tuned)/ToT (HP-tuned) | FSP (non-tuned)
Beats-per-Minute of Songs | Large | FSP (usually) | FSP (non-tuned)/ToT (HP-tuned) | FSP (HP-tuned)
Table 14. Comparison of LLM-generated and human-submitted solutions across benchmark tasks in terms of accuracy, F1 score, RMSE, execution time, algorithm choice, and Kaggle leaderboard/notebook results. Exec. Time: Execution Time; FSP: Few Shot Prompting; ToT: Tree of Thoughts; HP-tuned: Hyperparameter Tuned; RFC: Random Forest Classifier; GBT: Gradient Boosted Trees; LGBMR: Light Gradient Boosting Machine Regressor; XGBR: eXtreme Gradient Boosting Regressor; CBR: CatBoost Regressor; CNN: Convolutional Neural Network; Distil BERT: Bidirectional Encoder Representations from Transformers.
Titanic, best LLM: DeepSeek-V3 (FSP, HP-tuned) vs. Kaggle notebook
  Accuracy: 0.78229 vs. 0.80143
  Exec. Time: 1 min 37 s vs. 5 min 37 s
  Algorithm: RFC vs. GBT
  Kaggle Leaderboard %: 17.3 vs. 2.82
House Price, best LLM: Gemini 2.5 Pro (FSP) vs. Kaggle notebook
  RMSE: 0.12007 vs. 0.12096
  Exec. Time: 1 min 43 s vs. 47 s
  Algorithm: LGBMR + XGBR + CBR + Ridge vs. Regularized Linear Regression Model
  Kaggle Leaderboard %: 5.25 vs. 6.84
Digit Recognizer, best LLM: Gemini 2.5 Pro (ToT, HP-tuned) vs. Kaggle notebook
  Accuracy: 0.99639 vs. 0.99028
  Exec. Time (GPU 100): 45 min 15 s vs. 1 h 47 min 7 s
  Algorithm: CNN vs. CNN
  Kaggle Leaderboard %: 10.4 vs. 32.7
NLP with Disaster Tweets, best LLM: Gemini 2.5 Pro (FSP) vs. Kaggle notebook
  F1 Score: 0.83695 vs. 0.83726
  Exec. Time (GPU 100): 25 min 25 s vs. 5 min 53 s
  Algorithm: Hugging Face Bert-Base-Uncased vs. Distil BERT
  Kaggle Leaderboard %: 10.1 vs. 9.3
Beats-per-Minute of Songs, best LLM: DeepSeek-V3 (FSP, HP-tuned) vs. Kaggle notebook
  RMSE: 26.38663 vs. 26.38020 (20 September 2025)
  Exec. Time: 9 min 41 s vs. -
  Algorithm: LGBMR vs. -
Table 15. Accuracy, Algorithm, and Prompt Techniques: A literature-based evaluation of LLM-guided ML solutions. N/A: not mentioned in the study; HPT: Hyperparameter Tuning; FSP: Few Shot Prompting; ToT: Tree of Thoughts; Alg.: Algorithm; RFC: Random Forest Classifier; XGBoost: eXtreme Gradient Boosting; GBTC: Gradient Boosted Trees Classifier; CNN: Convolutional Neural Network.
TaskModelVersionTuningPrompting TechniquesAccuracyAlgorithm
TitanicGPT [23]3.5NoPreprocess:
FSP
Chain of Thought
Specifying Desired Response Format
HPT:
FSP
Specifying Desired Response Format
0.7511RFC
GPT [23]3.5Yes0.7918RFC
Gemini [23]N/ANo0.7655XGBoost
Gemini [23]N/AYes0.7583XGBoost
Human [23]-N/A-0.7966XGBoost
Kaggle Notebook-Yes-0.80143GBTC
GPT 4.1NoPreprocess+ Alg. Selection+ HPT:
FSP
Specifying Desired Response Format
0.7799RFC
GPT 4.1Yes0.76555RFC
Gemini 2.5 ProNo0.7799RFC
Gemini 2.5 ProYes0.77033RFC
GPT 4.1NoPreprocess+ Alg. Selection+ HPT:
ToT
Specifying Desired Response Format
0.74641RFC
GPT4.1Yes0.77272RFC
Gemini 2.5 ProNo0.7488RFC
Gemini 2.5 ProYes0.74401RFC
Best LLM
DeepSeek
V3YesPreprocess+ Alg. Selection+ HPT:
FSP
Specifying Desired Response Format
0.78229RFC
Digit RecognizerGPT [23]3.5NoClassification + HPT:
FSP
Specifying Desired Response Format
0.9863CNN
GPT [23]3.5Yes0.9960
Gemini [23]N/ANo0.9801
Gemini [23]N/AYes0.9820
Human [23]-N/A-0.9834
Kaggle Notebook-N/A-0.99028
GPT 4.1NoClassification + HPT:
FSP
Specifying Desired Response Format
0.98750
GPT 4.1Yes0.99128
Gemini 2.5 ProNo0.98675
Gemini 2.5 ProYes0.99450
GPT 4.1NoClassification + HPT:
ToT
Specifying Desired Response Format
0.99310
GPT4.1Yes0.99224
Gemini 2.5 ProNo0.99569
Best LLM
Gemini
2.5 ProYes0.99639
Table 16. Performance-oriented literature comparison of hyperparameter and architecture choices by leading LLMs. RFC: Random Forest Classifier; CNN: Convolutional Neural Network.
Titanic (Model: RFC)
  Hyperparameter | GPT [23] | DeepSeek-V3
  n_estimators | [100, 200, 300] | [100, 200, 300]
  max_depth | [None, 10, 20] | [None, 5, 10]
  min_samples_split | [2, 5, 10] | [2, 5, 10]
  min_samples_leaf | [1, 2, 4] | [1, 2, 4]
  max_features | ['auto', 'sqrt'] | -
Digit Recognizer (Model: CNN)
  Hyperparameter | GPT [23] | Gemini 2.5 Pro
  filters | [32, 64] | [32, 64]
  units | [64, 128] | [256]
  learning_rate | [1 × 10^-5, 1 × 10^-2] | [1 × 10^-4, 1 × 10^-2]
  dropout | - | [0.1, 0.6]
  Architecture (GPT [23]): Conv2D(32, (3,3), activation = 'relu') → MaxPooling2D((2,2)) → Conv2D(64, (3,3), activation = 'relu') → MaxPooling2D((2,2)) → Conv2D(64, (3,3), activation = 'relu') → Dense(64, activation = 'relu') → Dense(10, activation = 'softmax')
  Architecture (Gemini 2.5 Pro): Conv2D(32, (3,3), activation = 'relu') → Conv2D(32, (3,3), activation = 'relu') → MaxPooling2D((2,2)) → Dropout(dropout/2) → Conv2D(64, (3,3), activation = 'relu') → Conv2D(64, (3,3), activation = 'relu') → MaxPooling2D((2,2)) → Dropout(dropout/2) → Dense(256, activation = 'relu') → Dropout(dropout) → Dense(10, activation = 'softmax')
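For reference, the Gemini 2.5 Pro architecture listed above translates into the following Keras sketch. The dropout value is one point from the tuned [0.1, 0.6] range, the Flatten layer is added so the dense head can follow the convolutional stack, and the optimizer and learning rate are assumptions within the reported range.

# Keras sketch of the Gemini 2.5 Pro CNN from Table 16 (illustrative, not the generated code).
from tensorflow import keras
from tensorflow.keras import layers

dropout = 0.3  # one value from the tuned [0.1, 0.6] range

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(dropout / 2),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(dropout / 2),
    layers.Flatten(),                          # added so Dense can follow the conv stack
    layers.Dense(256, activation="relu"),
    layers.Dropout(dropout),
    layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.Adam(1e-3),     # learning rate within the tuned range
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()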
Table 17. Comparison of best-performing LLM configurations and AutoML frameworks across five Kaggle tasks. LinearSVC: Support Vector Classification with LIBLINEAR solver; LinearSVR: Support Vector Regression with LIBLINEAR solver; PAC: Passive Aggressive Classifier; GBR: Gradient Boosting Regressor; ETC: Extra Trees Classifier; LGBMR: Light Gradient Boosting Machine Regressor; RFC: Random Forest Classifier; ETR: Extra Trees Regressor; GBTC: Gradient Boosted Trees Classifier; CNN: Convolutional Neural Network; Distil BERT: Bidirectional Encoder Representations from Transformers; XGBR: eXtreme Gradient Boosting Regressor; CBR: CatBoost Regressor.
Titanic: DeepSeek-V3 (FSP & HP-tuned) | auto-sklearn | AutoGluon | Kaggle Notebooks
  Accuracy: 0.78229 | 0.77272 | 0.76315 | 0.80143
  Exec. Time: 1 min 37 s | 59 min 56.5 s | 1 h 37 min 42.7 s | 5 min 37 s
  Algorithm: RFC | LinearSVC | ETC (BAG L1 FULL) | GBTC
House Price: Gemini 2.5 Pro (FSP) | auto-sklearn | AutoGluon | Kaggle Notebooks
  RMSE: 0.12007 | 0.12775 | 0.1181 | 0.12096
  Exec. Time: 1 min 43 s | 1 h 48.5 s | 1 h 36 min 2.1 s | 47 s
  Algorithm: LGBMR + XGBR + CBR + Ridge | LinearSVR | LGBMR (BAG L1/T2 FULL) | Regularized Linear Regression Model
Digit Recognizer: Gemini 2.5 Pro (ToT & HP-tuned) | auto-sklearn | AutoGluon | Kaggle Notebooks
  Accuracy: 0.99639 | 0.98167 | 0.97996 | 0.99028
  Exec. Time (GPU): 45 min 15 s | 1 h 5 min 23.9 s | 2 h 51 min 3.6 s | 1 h 47 min 7 s
  Algorithm: CNN | LinearSVC | RFC (BAG L1 FULL) | CNN
NLP with Disaster Tweets: Gemini 2.5 Pro (FSP) | auto-sklearn | AutoGluon | Kaggle Notebooks
  F1 Score: 0.83695 | 0.78424 | 0.7919 | 0.83726
  Exec. Time (GPU): 25 min 25 s | 1 h 9.9 s | 3 h 30 min 58.2 s | 5 min 53 s
  Algorithm: Hugging Face Bert-Base-Uncased | PAC | ETC (BAG L1 FULL) | Distil BERT
The Beats per Minute of Songs: DeepSeek-V3 (FSP & HP-tuned) | auto-sklearn | AutoGluon | Kaggle Notebooks
  RMSE: 26.38663 | 26.39400 | 26.39029 | 26.38020 (20 September 2025)
  Exec. Time: 9 min 41 s | 59 min 56.8 s | 2 h 52 min 16.6 s | -
  Algorithm: LGBMR | GBR | ETR (BAG L1 FULL) | -
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
