LLMs in Wind Turbine Gearbox Failure Prediction

Tan, Yoke Wang; Carroll, James

doi:10.3390/en18174659

Open AccessArticle

LLMs in Wind Turbine Gearbox Failure Prediction

by

Yoke Wang Tan

^*

and

James Carroll

Electronic and Electrical Engineering, University of Strathclyde, Glasgow G1 1XQ, UK

^*

Author to whom correspondence should be addressed.

Energies 2025, 18(17), 4659; https://doi.org/10.3390/en18174659

Submission received: 25 July 2025 / Revised: 19 August 2025 / Accepted: 28 August 2025 / Published: 2 September 2025

(This article belongs to the Special Issue Renewable Energy System Forecasting and Maintenance Management)

Download

Browse Figures

Versions Notes

Abstract

Predictive maintenance strategies in wind turbine operations have risen in popularity with the growth of renewable electricity demand. The capacity of the strategy to predict system health, especially for the wind turbine gearboxes, is critical in reducing wind turbine operation and maintenance cost. Driven by the emergence of the application of large language models (LLMs) in diverse domains, this work explores the potential of LLMs in the development of wind turbine gearbox prognosis. A comparative analysis is designed to investigate the capability of two state-of-the-art LLMs—GPT-4o and DeepSeek-V3—in proposing machine learning (ML) pipelines to classify gearbox conditions based on a labelled SCADA dataset. The LLMs were prompted with the context of the task and detailed information about the SCADA dataset investigated. The outputs generated by the LLMs were evaluated in terms of pipeline quality and prediction performance using the confusion metric. Baseline ML models were developed and fine-tuned as benchmarks using Python 3.12 libraries. Among the baseline models, the random forest and XGBoost models achieved the highest cross-validated average F1-scores. The results have shown that the ML pipeline proposed by DeepSeek-V3 was significantly better than both GPT-4o and baseline models in terms of data analytical scope and prediction accuracy.

Keywords:

predictive maintenance; fault detection and diagnosis; condition monitoring; machine learning; large language model

1. Introduction

Wind energy remains a central pillar in the global transition towards sustainable energy systems, generating over 2330 TWh in 2023, nearly equivalent to all the other renewable sources combined [1]. With the rise in renewable electricity demand, the expansion of offshore wind farm developments is both expected and necessary. Although offshore wind offers superior capacity factors and energy yields than onshore installations, the associated operation and maintenance (O&M) cost, which accounts for up to 30% of the levelised cost of energy (LCOE), continues to pose a major economic challenge for the industry [2]. Consequently, the development of effective maintenance strategies is critical in reducing both O&M expenditure and the overall LCOE of offshore wind energy.

In this context, recent studies show growing emphasis on predictive maintenance (PdM) strategies for wind farm operations. This is driven by their capacity to estimate system performance and enable timely prefailure intervention [3]. A significant portion of PdM research has focused on the wind turbine gearbox, one of the components with the highest failure rates [4]. Operational data of various turbine components are continuously collected at a high sampling rate via condition monitoring systems (CMSs) and supervisory control and data acquisition (SCADA) systems [3,5], both of which are widely deployed across modern wind farms. The availability of such large repositories of data contributes significantly to the development of data-driven approaches such as failure diagnosis, failure prognosis and effective maintenance plans of WTs in the literature.

Machine learning (ML) models such as linear regression (LINREG), random forest (RF), support vector machine (SVM) and extreme gradient boosting (XGBoost) are widely utilised in the early stages of data-driven gearbox research. These models often serve as baselines for developing more complex architectures and for validating novel methodological approaches. In [6], ML algorithms were implemented to develop an interpretable predictive model for bearing failure by using SCADA data from wind turbines. The models’ performances were evaluated by comparing the predicted to the observed generator bearing temperature. The study also examined the model’s compatibility with Shapley additive explanations (SHAP)—a technique for interpreting ML models’ outputs. Similar ML models were adopted in [7] to classify equipment failure types in WTs based on vibration data. The algorithms were trained on labelled datasets and tasked to classify simulated data for three distinct failure modes—gear scuffing, bearing overheating and bearing fretting corrosion. In contrast to prior studies, ref. [8] integrated SCADA data with modelled data derived from physics-based models of the gearbox design parameters to enhance existing gearbox prognostics methods, which typically rely solely on SCADA data. Bearing-specific features obtained from the modelled data were utilised in conjunction with SCADA data to train an ML classifier for binary classification tasks.

In addition to conventional ML models, deep learning (DL) models with complex non-linear relationships have also been explored in wind turbine failure prognosis research. While DL models tend to achieve higher prediction accuracy, they require longer development time and are more demanding in terms of computational resources and data volume [9]. In [5], the study was granted full access to the extensive dataset, detailing operational and failure logs, SCADA data and vibration measurements sourced from multiple wind farms via an industrial partnership. This enabled the study to perform comparative binary classification analysis involving artificial neural networks (ANNs), logistic regression and support vector machine (SVM). When trained on the complete dataset, the ANN model achieved the highest prediction accuracy of 72.5%. In a different study [10], the long short-term memory (LSTM) neural network was employed to perform early prediction of main bearing failure utilising SCADA data of an operational wind farm. A normality model was developed using only healthy data to predict the temperature associated with main bearing failure. Although the LSTM model developed was able to detect main bearing failure with high accuracy, the lack of failure data for validation raises concerns about model robustness when extended to more diverse datasets with more failure examples.

The advancement of data-driven approaches has accelerated significantly in recent years, driven in part by the emergence of large language models (LLMs) such as ChatGPT. LLMs have demonstrated great performance across a range of diverse tasks in natural language processing (NLP) without the need for specific fine-tuning. Recent studies have shown growing interest in applying LLMs beyond traditional NLP domains, including dynamic system problems such as energy load forecasting, anomaly detection and failure prediction due to their reasoning and generalisation capabilities [11,12,13].

The zero-shot capabilities of LLMs have attracted growing attention in the time-series forecasting literature, particularly for addressing the challenge of data scarcity that often limits traditional machine learning models. In [11], an LLM framework known as SigLLM was developed for time-series anomaly detection using LLMs. The approach incorporated a time-to-text conversion and end-to-end prompting timeline, enabling the LLM to perform time-series anomaly detection. Two LLMs—GPT-3.5-turbo and MISTRAL—were investigated, with a focus on MISTRAL due to cost efficiency. The study proposed two detection pipelines: PROMPTER, which directly queried the LLM to detect anomalies within a time-series window, and DETECTOR, which involved tasking the LLM to predict future values of a time-series window. These pipelines were benchmarked against state-of-the-art (SOTA) unsupervised time-series prediction models such as ARIMA, LSTM and MS Azure’s forecasting tools. Under zero-shot conditions, the DETECTOR pipeline was able to identify all anomalies correctly with only a single error, while the PROMPTER pipeline was more effective at detecting local outliers than identifying anomalies within the signal sequence. Limitations were also identified, particularly the LLM’s inability to consistently capture temporal trends in time-series signals and the PROMPTER pipeline’s relatively low precision of 0.219. In a related study [12], the TIME-LLM framework was introduced to perform forecasting tasks. Unlike SigLLM, TIME-LLM employed an additional layer of input augmentation through declarative prompting that incorporated domain expert knowledge. This layer, known as Prompt-as-Prefix (PaP), provided the LLM with contextual information such as task description and input statistics to further improve LLM reasoning. Using a similar evaluation setup under both few-shot and zero-shot conditions, TIME-LLM achieved 5% reduction in mean squared error (MSE) compared to existing time-series LLMs under few-shot settings and outperformed competitive baselines under zero-shot conditions.

LLMs have also demonstrated promising potential in cross-domain applications, where minor or major fine-tuning is performed on pretrained LLMs with specific data to perform tasks in a defined domain. In [13], an LLM-based framework was proposed for bearing fault diagnosis in rotating machinery. The study introduced a fine-tuning design to enhance LLM performance under challenging conditions such as cross-dataset generalisation, limited data availability and unseen operational scenarios. The framework converts numerical time-series features extracted from vibration data into textual format, which is then used to fine-tune the LLM using low-rank adaptation (LoRA) and quantised LoRA (QLoRA) techniques. This approach exhibited strong adaptability across multiple bearing fault datasets, demonstrating generalisation capabilities in diverse experimental setups. Within the context of wind turbine failure prognosis, LLM implementation remains an emerging research direction. In the domain of power forecasting, ref. [14] proposed a model that integrates cross-modal data preprocessing, prompt engineering and a pretrained LLM-LLaMA. The approach extracts relevant information from SCADA data and reformats it into a modality suitable for LLM input. The forecasting process is further enhanced by incorporating prompt prefixes containing prior domain knowledge, improving the model’s contextual understanding of the power generation data. This enabled more accurate power forecasting while minimising information loss commonly associated with cross-modal data conversion.

Previous studies have demonstrated the strong potential of LLM to adapt to time-series data analysis tasks under zero-shot conditions. Some studies which were less inclined to architectural design also focused on improving interpretability of LLM-based anomaly detection [15]. Motivated by these findings, this work investigates the implementations of LLMs for gearbox failure prediction. To the best of the authors’ knowledge, the application of LLMs in wind turbine gearbox prognosis remains largely unexplored in the current literature. The following are the summaries of the contributions of this research:

-: The study has performed a comparative analysis on both GPT-4o and DeepSeek-V3 LLMs’ performance in assisting the development of an ML pipeline for gearbox failure prediction.
-: The LLMs are tasked through a designed prompt to develop ML algorithms in the Python programming language to perform binary classification of gear tooth failure for wind turbines based on labelled SCADA data under a zero-shot prompt. The quality of the LLM-proposed ML methods is evaluated.
-: An ML pipeline, adopting logistic regression, RF, SVM, XGBoost and multi-layer perceptron (MLP) classifier, is formulated to serve as the baseline of the comparative study. XGBoost and MLP classifier, which are more sensitive to hyperparameter tuning, are fine-tuned based on the SCADA dataset utilised for this study.
-: The outputs of each LLM are analysed and compared to the baseline models. Additionally, the future possibilities of LLMs in the work of gearbox failure prognosis are discussed.

The remainder of this paper is organised as follows:

-: Section 2 provides an overview of the methodology of this work in the following sequence—SCADA dataset investigated, baseline models, data preprocessing steps, baseline models optimisation strategy, the selected LLMs, the prompt design process and the evaluation metric adopted in this study.
-: Section 3 presents and discusses the performance results of the baseline models and ChatGPT-generated and DeepSeek-generated ML pipelines. A comparative analysis of the pipeline designs is also presented and discussed.
-: Section 4 offers concluding remarks and future research directions revolving around LLM applications in wind turbine gearbox failure predictions.

2. Methodology

The selection of failure prediction methods is closely linked to the nature and distribution of the available data. Factors such as whether the data consists of continuous signals or moving averages or the availability of class labels determine the suitability of different ML models. In this study, a comprehensive labelled SCADA dataset from a wind farm developer that was also investigated in [5] was employed to develop baseline models which have been commonly used in the literature for supervised classification tasks [5,6,7,8,10]. A binary classification task was formulated to evaluate model performance across different failure modes.

Based on this SCADA dataset, a structured prompt was subsequently designed to supply LLMs with relevant metadata, including dataset dimensions, feature types, data distribution and unique variables. The prompt provided explicit instructions regarding the expected model outputs. Details of the prompt design are presented in Section 2.5. The performance of the baseline models and LLM-generated ML models is then evaluated and presented in Section 3.1 and Section 3.2.

2.1. SCADA Dataset

The database utilised in this study was provided by a wind turbine OEM industrial partner. The dataset is a collection of 99,904 data entries of 10 min average SCADA data points located at different wind farms. The SCADA dataset was labelled by the developer based on historical failure logs and corresponding prefailure time windows. The database records the operational data of 27 different wind turbines of types A and B collected from 1 August 2008 to 30 July 2015. The data variables selected to train our baseline models are listed in Table 1. The same variables will be used for the LLM-proposed pipelines to avoid discrepancy due to differences in training data. The gear oil temperature and gear bearing temperature were measured by using resistance temperature detectors (RTDs), where the sensors were located in the sump of the gearbox lubrication system and on the intermediate speed bearing, respectively, in each wind turbine. The SCADA dataset is labelled based on time before failure, where the classifications are defined as “Healthy” for greater than 1 year, “1 year”, “6 months” and “3 months” before failure. The “Healthy” and “3 months” before failure data are selected for the binary classification task. An example data sample of the SCADA dataset is presented in Appendix A.1, Figure A1.

2.2. Baseline Machine Learning Models

This section presents an overview of the machine learning algorithms adopted for baseline models. A brief description of each algorithm is provided, with emphasis on their underlying classification mechanisms and suitability for the task of gearbox failure prediction.

2.2.1. Logistic Regression [16]

Logistic regression is a statistical approach used to illustrate the correlation between a categorical outcome and a set of input variables. In contrast to linear regression, which can only predict continuous outcomes, the algorithm utilises a logistic sigmoid function to transform linear combinations of input variables to predict the probability of the output corresponding to two defined classes of categorical variables (i.e., “Healthy”, “Unhealthy”). The algorithm is commonly implemented in binary and multi-class classification problems.

2.2.2. Support Vector Machine [17]

SVM is a supervised max-margin model, often used for classification and regression analysis tasks. The algorithm separates input vectors by defining the optimal hyperplane that maximises the margin between different classes in the feature space. Additionally, SVM supports non-linear classification by utilising the kernel trick, transforming the data into higher dimensions. The robustness of the model against noisy data and overfitting has contributed to its popularity in predictive tasks.

2.2.3. Random Forest [18]

RF, also known as random decision forests, is an ensemble learning method formed from a collection of decision trees that are trained on bootstrap samples of a given dataset. RF addresses the common issue of singular decision tree being overfitted by training each tree using random subsets of features at each split to improve generalisation. The final prediction made by RF is based on the majority voting of all decision trees for classification tasks and value averaging of all trees for regression tasks.

2.2.4. XGBoost [19]

Gradient boosting is an ensemble machine learning technique that involves sequential error correction of weak learners such as decision trees. A new model is trained using corrected residual errors of previous models at each iteration until a predetermined halt criterion is reached. Utilising this framework, XGBoost is one of the best ML libraries in terms of scalability and memory efficiency for model development.

2.2.5. Multi-Layer Perceptron Classifier [20]

MLP is a feedforward neural network made of multiple layers of interconnected artificial neurons. Non-linear activation functions such as sigmoid or ReLU are applied at each layer of neurons, allowing the neural network to learn complex relationships in the data. MLP forms the foundation of deep learning models and can be applied to different applications.

2.3. Baseline Model Data Preprocessing

2.3.1. Missing Data

Data quality plays a critical role in establishing the performance of ML models. Raw SCADA data are often affected by incompleteness or inaccuracies arising from faulty sensors, which typically manifest as outliers or missing values. While such anomalies may be apparent to human perceptions, ML algorithms lack inherent mechanisms to identify them. Consequently, training models on unfiltered or erroneous data will introduce bias in the model predictions. Notably, the SCADA dataset in this work only has one entry with missing values. A simple elimination strategy is adopted to remove the entry from the training dataset.

2.3.2. Data Type Handling

Presented in Table 1, the SCADA dataset comprises both categorical and numerical features. These features require different preprocessing techniques to ensure compatibility with the selected baseline ML models. Although categorical features can be as informative as numerical features in the model training, they must first be converted into numerical representations to be processed by most ML algorithms. The transformation process is known as encoding, a standard step in data preprocessing. Common encoding methods employed in ML development include one-hot encoding, ordinal encoding and target encoding [21].

This study applies one-hot encoding and ordinal encoding in the data preprocessing pipeline. One-hot encoding transforms categorical features into binary vectors based on the number of unique values, assigning 1 to the active category and 0 to the rest. Ordinal encoding is used to transform the class labels into ordinal numbers ranging from 0 to 3. Furthermore, numerical features are scaled using StandardScaler to ensure consistent feature ranges for model input.

2.3.3. Class Imbalance

Supervised learning algorithms perform classification tasks by identifying linear or non-linear boundaries that separate different classes within the training data. In this study, the distribution of classification is relatively balanced, where the data classification has the ratio of “Healthy”—0.24, “1 year”—0.24, “6 months”—0.26 and “3 months”—0.26. Although the data imbalance is not severe, the ML models may still display a minor bias towards “6 months” and “3 months” class predictions. In the case of severe class imbalance and multi-class classification, the synthetic minority over-sampling technique (SMOTE) [22]—an oversampling approach that synthetically generates new data from the minority classes—is often adopted to address the issue.

2.3.4. Model Training

In this study, the SCADA dataset is split into training and validation data using a 7:3 ratio split after undergoing preprocessing to improve data quality. The same split ratio was adopted in [5].

2.3.5. Baseline ML Model Optimisation

Hyperparameter optimisation is an important process in the development of ML algorithms as it has a heavy influence on their performance on a given dataset. The hyperparameters govern the training behaviour, model complexity and generalisation capability of the ML algorithms. There are two conventional strategies in hyperparameter tuning—grid search and random search. The grid search method involves an exhaustive process of examining all combinations of hyperparameter values within a predefined range, while the random search method involves random sampling of hyperparameter values within a predefined range. Comparatively, the random search approach is more efficient due to its larger dimensional coverage of tuning parameters with less computational time [23]. Hence, the random sampling method is adopted in this study to fine-tune XGBoost and MLP classifier to maximise their performances. The ranges of hyperparameters tested are as shown in Table 2 and Table 3.

2.4. Large Language Models

Large language models are complex, large-scale AI systems pretrained on extensive datasets. Built upon advanced deep learning architectures, particularly transformer networks, LLMs are designed to perform next token prediction with high efficiency and accuracy. Although LLMs were originally developed for natural language processing (NLP) and natural language understanding (NLU) tasks, recent advancements have significantly broadened their application scope. The deployment of various LLMs has enabled their use in domains such as image processing, code generation and content summarisation. Studies demonstrated that modern LLMs are capable of performing zero-shot and few-shot learning tasks when provided with appropriately designed prompts [24,25]. The instruction-based zero-shot capabilities of LLMs motivate this study to explore their potential for wind turbine gearbox failure prognosis. In this work, 2 popular multimodal LLMs—GPT-4o and DeepSeek-V3—are examined.

2.4.1. ChatGPT (GPT-4o) [26]

ChatGPT is a large language model developed by OpenAI, which was initially designed to have separate multi-modal systems (i.e., GPT-4 with vision). The latest version, GPT-4o, is now a multi-modal model capable of receiving image and text inputs and providing text outputs in return. The model retains its transformer backbone, which is optimised for parallel processing to improve the next token prediction with long-term dependencies. Unique to GPT-4o, text, image and audio are integrated into one model architecture, eliminating the need for separate models for speech and visuals.

2.4.2. DeepSeek-V3 [27]

DeepSeek made a revolutionary entrance into the LLM industry with their first model, DeepSeek-R1, in 2023 with mixture-of-experts (MoE) architecture, achieving competitive performance compared to the leading LLM at the time, GPT-4, with significantly lower training cost. Their latest model, DeepSeek-V3, adopts multi-head latent attention (MLA) and DeepSeekMoE architectures, which are different from the transformer architecture. The model is trained on 14.8 trillion tokens, which is accompanied by supervised fine-tuning and reinforcement learning, enabling the model to maximise its capability.

2.5. Prompt Design [28]

Prompt engineering is an active research area in LLM development due to its capability in enhancing model effectiveness without advanced architectural tinkering such as hyperparameter fine-tuning. In-context learning (ICL) and chain of thought (CoT) are two of the most common strategies adopted in the literature. As our work involves the generation of ML pipelines by the LLMs through prompting, the selection of prompt strategy will greatly affect the performance of these LLM-generated pipelines.

ICL involves the process of adding input–output example(s) to the LLM in a single prompt, generating responses based on the examples. While task description is crucial in the prompt design, the number of input–output examples provided in ICL differs based on the approach taken, which can be further divided into few-shot learning (multiple examples), one-shot learning (one example) and zero-shot learning (no example). In contrast to ICL, CoT provides precise guidance to the LLM based on the output generated for a given task description. Explanation and correction are given at every step to guide the LLM to generate the desired output rather than providing an input–output example directly. This technique may require multiple prompts to complete the task appropriately.

This work focuses on the zero-shot capability of the LLMs in the development of ML pipelines for carrying out binary classification tasks for wind turbine gearbox failure classes. Accordingly, a zero-shot ICL approach was employed in our prompt design. For the ML pipeline generation by the LLMs, the details and characteristics of the SCADA dataset were prompted to the LLMs as the context of a problem, without including any example of the output code or ML algorithms required to perform the described task. The full template of prompt design implemented is shown in Appendix A.2. The task description and dataset information provided to the LLMs can be summarised as follows:

-: Source context of the SCADA dataset (i.e., number of wind turbines);
-: Number of data entries and features included in the SCADA dataset;
-: List of numerical features and categorical features provided by the dataset;
-: Included unique categorical feature details such as the number of unit types and class labelling;
-: The measurement units for power generation and component temperatures;
-: Expected output format: Python code.

2.6. Model Performance Evaluation

The baseline model performances are evaluated after being optimised with hyperparameter tuning in this study. This section presents an overview of the evaluation metric adopted to evaluate the performance of the baseline models.

The most widely adopted evaluation metric for classification tasks (binary/multi-class) in the literature is the confusion matrix. An illustration of the technique for multiple classes is presented in Figure 1. The confusion matrix records the occurrences of the actual classification and the predicted classification by the models. The occurrences are termed as true positive (TP)—elements labelled as the selected class (k) by the model and are actually correctly classified, false positive (FP)—elements labelled as the selected class (k) by the model but are actually wrongly classified, true negative (TN)—elements labelled as other classes by the model and are actually correctly classified and false negative (FN)—elements labelled as other classes by the model but are actually the selected class (k). The confusion matrix in Figure 1 highlights the results of the selected class b, where the TP was highlighted in the pink cell, while the FN and FP were highlighted in green and yellow cells respectively. The green texts in the figure represent the predictions when the predicted and actual classification aligns with one another.

The record helps to serve as the building blocks for calculating other evaluation metrics, namely precision, recall, F1-score and area under receiver operating characteristic curve (AUC). The formulas of these metrics for the generic class, k, are shown in Table 4. Precision determines the reliability of the model to predict a positive value of a generic class, while recall determines the accuracy of the model in the prediction of the generic class. These measurements help to define the reliability and the accuracy of the model in making predictions for each class in the task. The F1-score then gathers both measures in the form of a weighted average between precision and recall through the concept of the harmonic mean, resulting in a score which indicates the overall performance of the model, where the best value is 1 and the worst score is 0.

For this study, the models’ performances in binary classification were evaluated using confusion metrics. To avoid overfitting and possible data leakage issues, cross-validation at 5 folds is applied to investigate the average F1-score of the models.

3. Results and Discussion

This section presents and discusses the results of each pipeline in the binary classification of wind turbine gearbox failure based on the labelled SCADA dataset. The baseline models were assessed using the confusion metric and cross-validated evaluation at five folds to minimise overfitting. The precision, recall, F1-score and cross-validated average F1-scores of each model from baseline, GPT-4o and DeepSeek-V3 pipelines are presented in Table 5, Table 6 and Table 7, respectively. For simplicity, the cross-validated (CV) average F1-score will be represented as CV-score moving forward. The ML model with the best CV-score in each pipeline is highlighted in the tables below. Section 3.1 will cover the results produced by our baseline models, while Section 3.2 will cover the results of LLM pipelines. Discussion on the comparison between the three pipelines will be presented in Section 3.3.

3.1. Baseline Models

In the baseline pipeline, it is observed that the linear models (logistic regression and SVM) performed weakly compared to the ensemble models (random forest and XGBoost) and the neural network model (MLP classifier). The logistic regression model had the lowest CV-score at 0.6878, while XGBoost achieved the highest CV-score at 0.933. This shows that simple linear models are weaker in prediction tasks involving non-linear behaviour observed in wind turbine operation data. Although logistic regression and SVM are both inherently linear models, it is observed that the results of the SVM model were significantly better than those of the logistic regression model in our setup. The performance gain of SVM can be attributed to the use of the kernel function, which allowed SVM to project the input data into a higher-dimensional space where a linear hyperplane that separates non-linear patterns can be determined. On the other hand, logistic regression, lacking such capability, limits the model’s formation of a linear decision boundary in the input space, resulting in lower performance.

Notably, the MLP classifier scored lower than RF, while XGBoost achieved similar performance to RF despite having higher architectural complexity and undergoing fine-tuning. For the non-linear models, the leading performance of random forest despite the absence of fine-tuning can be attributed to the model’s strength in recognising average patterns in datasets. As XGBoost and MLP classifier are complex models that generally perform better when dealing with datasets with high complexity by capturing subtle feature interactions (i.e., continuous raw temperature signals at high frequency), their performances were relatively low considering the models’ complexity. In terms of scalability, it is important to note that XGBoost and MLP classifier will outperform random forest with an increase in data samples and reduced failure examples. As this is not the focus of this work, a comparative study on datasets with higher sampling rates is not conducted.

Comparing individual model performance in each class prediction, although minor, it is observed that the overall performance of the models was lower in the “Healthy” class prediction when compared to the “3 months” predictions. This is due to the slight data imbalance between the two classes, as mentioned in Section 2.3.3, resulting in bias towards the class with more data samples.

3.2. LLM-Suggested ML Pipelines

The results of ML models generated by the LLMs based on the prompt outlined in Section 2.5 are presented and discussed in this section. Due to the differences in LLM architecture, the resulting code differs in quality, length and suggested performance metrics. To enable consistent evaluation, the evaluation metrics introduced in the previous section are applied for comparison. This section compares the results of the LLM-suggested models in their default settings without additional fine-tuning as the component was not defined by GPT-4o’s design. The key differences in terms of pipeline functionality and design of the LLM-generated pipelines are further discussed in Section 3.3.

All ML pipelines defined in this work included SVM and RF as part of their classification tools. By comparing the results of SVM and RF models from all three pipelines, it is observed that DeepSeek-V3’s model exceeded both GPT-4o and baseline models in CV-score. Notably, the XGBoost model from the DeepSeek-V3 pipeline achieved similar results to the fine-tuned baseline model despite operating at default settings. The leading performance observed across multiple models of the DeepSeek-V3 pipeline is likely enabled by an additional preprocessing step that was absent in the other two pipelines.

In the preprocessing steps of ML models, data timestamps were commonly excluded from the training data due to the risk of data leakage that is associated with the clustering of failure samples in a specific period window within the timeline of the dataset. While each pipeline addressed this risk by excluding timestamps from the training data, DeepSeek-V3 leveraged the temporal information that can be found within timestamp data before removing it from the training. This was performed by generating datetime features extracted from the timestamps. The extracted features were “year”, “month”, “day”, “hour”, “day of week” and “is weekend”. The additional features provided temporal information on the data seasonality to the models, consequently enabling better prediction accuracy. Although this technique is effective in uncovering seasonal patterns commonly observed in wind turbine operational data, the generated features in DeepSeek-V3’s ML pipeline might have resulted in data leakage from the timestamps. This is because the pipeline had extracted timestamp information as detailed as the “hour” of the data samples, which may have revealed more temporal information than just seasonality.

3.3. Comparative Analysis

In this work, both LLMs were given the same prompt to propose an ML pipeline aimed at performing binary classification on wind turbine gearbox failure based on SCADA datasets. As the ML pipelines proposed by the LLMs varied in code structures, the key components of all three pipelines will be compared and discussed in this section. The comparison will be made in terms of pipeline structure and design with reference to the baseline models. Notably, the same algorithms (SVM, RF, XGBoost) in the pipelines were sourced from the same ML libraries, and the processing time of each pipeline will be discussed in terms of overall time taken from code initiation to result generation instead of individual training times of the algorithms. The scope of data analysis and overall feasibility of the proposed pipelines will be discussed at the end of this section.

The pipeline structure and design can be divided into data preprocessing strategies, defined ML algorithms, defined evaluation matrices, fine-tuning techniques and overall processing time. Among the LLMs, GPT-4o provided the most streamlined ML pipeline comprising three basic ML models—random forest, gradient boosting classifier and SVM. The model presented results using a confusion matrix, along with performing cross-validation. This approach aligns with the evaluation metrics adopted for baseline models. GPT-4o proposed a test–train split of 20:80, where an additional 10% of the data was allocated for training compared to the test–train split ratio that this work applied. The GPT-4o pipeline enabled baseline model functionality without additional input by including essential data preprocessing steps—applying a StandardScaler to scale numerical features and OneHotEncoder to encode categorical features. However, the simplicity of the GPT-4o pipeline came at the expense of comprehensiveness. Notably, GPT-4o did not incorporate data cleaning processes, thereby overlooking critical concerns such as missing values. The issue can introduce biases during model training and reduce prediction accuracies. In terms of the overall processing time from code initiation to generating results, GPT-4o’s ML pipeline is much quicker than the other two pipelines due to the lesser number of ML algorithms investigated and the absence of fine-tuning steps in its code structure.

In contrast, DeepSeek-V3 proposed a more elaborate ML pipeline that encompassed a wide range of data analysis techniques prior to classification. The pipeline initiated with an exploratory data analysis (EDA) phase, providing overviews of the dataset including feature types, missing value distribution and descriptive statistics with categorical counts. Furthermore, it introduced visualisations of numerical data distributions and feature correlations, which are often underemphasised in the early stages of ML pipelines. Although these steps do not affect the prediction accuracy of the ML models, the EDA techniques are important in improving developers’ understanding of the dataset. The pipeline also implemented SimpleImputer for handling missing data, which was absent in GPT-4o’s approach. DeepSeek-V3 defined up to seven ML algorithms, including logistic regression, random forest, gradient boosting classifier, XGBoost classifier, SVM, K-nearest neighbours and naïve Bayes. The LLM utilised both confusion matrices and receiver operating characteristic (ROC) curves for performance evaluation. The same test–train split ratio was adopted by both LLM-proposed pipelines. Additionally, the selection of the best model under default settings was incorporated in the pipeline, where the best model was optimised via the grid search method. However, the selection process was hard-coded to consider only three classifiers—RF, XGBoost and gradient boosting classifier—thereby limiting the flexibility and robustness of the fine-tuning approach. Notably, the grid search method is more time and computationally demanding than the random search method that was employed for the fine-tuning of the baseline model. This results in a longer processing time for the code to fine-tune the selected model than the baseline model. While the random search method is adopted in baseline models for fine-tuning XGBoost and MLP classifier which were more sensitive to hyperparameter tuning, it is worth considering also fine-tuning RF to achieve better prediction accuracy for the baseline model. Between the ML pipelines in this work, DeepSeek-V3’s pipeline had significantly longer total processing time for the pipeline to generate the result due to the implementation of EDA for the dataset and adopting the grid search method for model fine-tuning.

In terms of functionality, both LLMs generated functionally accurate solutions to the classification task. However, DeepSeek-V3’s ML pipeline addressed a wider scope of data analysis, covering data exploration steps. While GPT-4o delivered a concise and efficient pipeline, DeepSeek-V3 offered a more holistic design, accounting for crucial stages of data exploration, cleaning and model optimisation without being explicitly prompted. Of particular concern is GPT-4o’s omission of missing data handling, which reduces the reliability of its generated code. From an ML development standpoint, these findings suggest that DeepSeek-V3 may be more suitable for scenarios requiring comprehensive pipeline generation, especially when developer input is limited or exploratory rigor is essential under zero-shot prompting. However, it is important to point out that, although GPT-4o performs more weakly in zero-shot prompting, this does not imply that GPT-4o cannot achieve similar performance under CoT prompting where feedback is provided by users based on the LLMs’ responses. Further investigation into the implementation of CoT prompting could be considered for future studies.

4. Conclusions

This study presents a comparative analysis of two state-of-the-art LLMs—GPT-4o and DeepSeek-V3—when generating ML methods targeting gearbox failure binary classification. Using zero-shot learning prompts, both LLMs were provided with detailed descriptions of SCADA data characteristics and problem context. Traditional ML methods—logistic regression, SVM, RF, XGBoost and MLP classifier (neural network)—were developed as a baseline comparison for this study. After fine-tuning, the baseline models were assessed using the confusion metric, with the XGBoost and RF models achieving the highest classification performance on the SCADA dataset. In assessing LLM-generated ML pipelines, the untuned RF model proposed by DeepSeek-V3 achieved the highest CV-score. Additionally, the untuned XGBoost model from the same pipeline delivered performance comparable to the best fine-tuned baseline models.

In the comparison analysis of the LLMs, the study discovered that DeepSeek-V3 outperformed GPT-4o in addressing the problem described, not only by generating task-appropriate solutions but also by including an unprompted EDA phase in its pipeline and addressing key preprocessing steps. Furthermore, DeepSeek-V3 proposed a feature extraction function in its pipeline that enhanced ML model prediction accuracy beyond both GPT-4o and baseline models. Although the adopted feature extraction method might have contributed to data leakage, the strategy to extract temporal information from timestamps is effective in increasing the prediction accuracy of ML algorithms through data processing. In contrast, although GPT-4o provided a technically correct solution, it omitted critical steps such as data cleaning and cross-validation steps, which compromised the reliability and the accuracy of its suggested ML pipeline. Conclusively, this work recommends the use of DeepSeek-V3 in assisting the development of ML pipelines for failure mode classification for datasets of a similar nature to wind turbine SCADA data under zero-shot prompting.

One of the biggest limitations of off-the-shelf LLMs lies in the replicability of responses despite providing a specific prompt. LLMs, like humans, do not “reason” identically in response to the same prompt. Developer-specific optimisation can result in different response quality. For instance, GPT-4o may have a limitation in the token window, which may limit response detail and exclude optimisations that were not prompted. As the evaluation metric adopted in this study was based on the ML model cross-validated F1-scores rather than direct LLM performance metrics, the results cannot serve as direct indicators of model quality. However, it was a good indicator in presenting the quality of the pipeline designed by LLMs in solving the given problem. The variation in response nuances necessitates a more robust and problem-specific metric, as well as more LLMs such as Gemini and Llama to be considered in future work. While there are LLM benchmarking metrics in the literature that specifically evaluate the quality of code generated by LLMs, these evaluation metrics are only able to reflect the quality of code syntax and functionality. They are unable to reflect the reasonability behind the design of the generated codes. It must be highlighted that the results presented in this work do not definitively reflect the quality of the LLMs, given the differences in LLM architecture and varying design intentions.

Additionally, another limitation of this work is the prompt strategy adopted for the ML pipeline generation. It is important to acknowledge that the description of SCADA dataset details does not provide the complete representation of the dataset. Notably, statistical information regarding our dataset was also not included in the prompt due to confidentiality. As a result, discrepancy in data description will affect the response of the LLMs when defined by different users. Recent development has introduced the input format of documents and folders in some state-of-the-art LLMs, and it is believed that providing a data file directly in the prompt would produce more accurate pipeline design. However, due to confidentiality, this method was not adopted in this work.

As the LLM is an accelerating emerging technology, it is important to acknowledge that this study does not provide a complete representation of the technology’s capability in wind turbine gearbox failure prediction. While this study focused on utilising the generative capacity of the LLMs in the development of ML techniques, an important consideration for future work is to leverage the predictive capabilities of the LLM architecture.

Prior studies have established the capacity of pretrained LLMs to perform next-token predictions beyond NLP applications. For future work, the adaptation of LLMs in the context of wind turbine gearbox failure prediction will be explored. As discussed in Section 1, refs. [11,12] proposed prompt engineering approaches to enable the LLMs to process and predict numerical time-series data effectively. The success of these approaches in facilitating LLM-based analysis of purely numerical inputs inspires the exploration of similar strategies to be applied for classification and continuous data prediction for SCADA datasets of mixed data types. Furthermore, another feasible adaptation approach adopted by [13] is to use lightweight fine-tuning on small pretrained LLMs by using domain-specific data, enabling response generation which is more accurate to work with SCADA datasets. However, this direction of future work might be unfavorable due to the requirement of large amounts of domain-specific data to fine-tune LLMs for the application.

Author Contributions

Conceptualization, Y.W.T. and J.C.; Methodology, Y.W.T.; Formal analysis, Y.W.T.; Resources, J.C.; Writing—original draft, Y.W.T.; Writing—review & editing, Y.W.T. and J.C.; Supervision, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by grant EP/S023801/1 from the UK Engineering and Physical Sciences Research Council (EPSRC).

Data Availability Statement

The datasets utilised in this article are not readily available due to confidentiality.

Acknowledgments

This work was lead and supported by Centre for Doctoral Training in Wind and Marine Energy Systems and Structures—WAMSS CDT: https://www.wamss-cdt.co.uk/ (accessed on 15 July 2025), a comprehensive doctoral training programme at Strathclyde, Oxford and Edinburgh Universities.

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclature

The following nomenclatures are used in this manuscript:

$Δ T$	Difference in temperature
$° C$	Degree Celsius
$m / s$	Meters per second
$k w h$	Kilowatt hour
$T P_{k}$	True positive of selected class
$T N_{k}$	True negative of selected class
$F P_{k}$	False positive of selected class
$F N_{k}$	False negative of selected class
Subscript
k	Selected class

Appendix A

Appendix A.1

This section presents a data sample example of the dataset used in this study. Please note that some of the details such as feature names and values of unit type have been replaced to protect the confidentiality of the data.

Figure A1. SCADA dataset examples.

Appendix A.2

This section discusses the structure of the prompt design proposed in this work to instruct the generation of functional Python codes for the development of ML pipelines by the LLMs. It is important to note that the prompt utilised in this work is designed to be customisable based on the characteristics of the dataset available to the user, allowing replicability across different datasets than the one investigated in this study. The prompt template is as shown in figure below:

Figure A2. Prompt design template.

References

International Energy Agency. Renewables 2024 Global Overview; International Energy Agency: Paris, France, 2024. [Google Scholar]
Crabtree, C.J.; Zappalá, D.; Hogg, S.I. Wind Energy: UK Experiences and Offshore Operational Challenges. Proc. Inst. Mech. Eng. Part A J. Power Energy 2015, 229, 727–746. [Google Scholar] [CrossRef]
Zhu, T.; Ran, Y.; Zhou, X.; Wen, Y. A Survey of Predictive Maintenance: Systems, Purposes and Approaches. arXiv 2024, arXiv:1912.07383. [Google Scholar] [CrossRef]
Carroll, J.; McDonald, A.; Dinwoodie, I.; McMillan, D.; Revie, M.; Lazakis, I. Availability, Operation and Maintenance Costs of Offshore Wind Turbines with Different Drive Train Configurations. Wind Energy 2017, 20, 361–378. [Google Scholar] [CrossRef]
Carroll, J.; Koukoura, S.; Mcdonald, A.; Charalambous, A.; Weiss, S.; Mcarthur, S. Wind Turbine Gearbox Failure and Remaining Useful Life Prediction Using Machine Learning Techniques. Wind Energy 2018, 22, 360–375. [Google Scholar] [CrossRef]
Bindingsbø, O.T.; Singh, M.; Øvsthus, K.; Keprate, A. Fault Detection of a Wind Turbine Generator Bearing Using Interpretable Machine Learning. Front. Energy Res. 2023, 11, 1284676. [Google Scholar] [CrossRef]
Asy’ari, M.Z.; Nugroho, H.A.; Kusumoputro, B. Failure Mode Classification of Wind Turbine Gearbox Utilizing Machine Learning Algorithms. In Proceedings of the IEEE Region 10 Annual International Conference, Proceedings/TENCON, Singapore, 1–4 December 2024; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2024; pp. 527–530. [Google Scholar]
Desai, A.; Guo, Y.; Sheng, S.; Phillips, C.; Williams, L. Prognosis of Wind Turbine Gearbox Bearing Failures Using SCADA and Modeled Data. In Proceedings of the Annual Conference of The Prognostics and Health Management Society, Virtual Conference, 9–13 November 2020. [Google Scholar]
Alaskar, H.; Saba, T. Machine Learning and Deep Learning: A Comparative Review. In Proceedings of the Integrated Intelligence Enable Networks and Computing, Gopeshwar, India, 5–7 September 2021; pp. 143–150. [Google Scholar]
Vidal, Y.; Puruncajas, B.; Castellani, F.; Tutivén, C. Predictive Maintenance of Wind Turbine’s Main Bearing Using Wind Farm SCADA Data and LSTM Neural Networks. Proc. J. Phys. Conf. Ser. 2023, 2507, 012024. [Google Scholar] [CrossRef]
Alnegheimish, S.; Nguyen, L.; Berti-Equille, L.; Veeramachaneni, K. Large Language Models Can Be Zero-Shot Anomaly Detectors for Time Series? arXiv 2024, arXiv:2405.14755. [Google Scholar] [CrossRef]
Jin, M.; Wang, S.; Ma, L.; Chu, Z.; Zhang, J.Y.; Shi, X.; Chen, P.-Y.; Liang, Y.; Li, Y.-F.; Pan, S.; et al. Time-LLM: Time Series Forecasting by Reprogramming Large Language Models. arXiv 2023, arXiv:2310.01728. [Google Scholar]
Tao, L.; Liu, H.; Ning, G.; Cao, W.; Huang, B.; Lu, C. LLM-Based Framework for Bearing Fault Diagnosis. Mech. Syst. Signal Process 2025, 224, 112127. [Google Scholar] [CrossRef]
Zhu, G.; Jia, W.; Xing, Z.; Xiang, L.; Hu, A.; Hao, R. CMLLM: A Novel Cross-Modal Large Language Model for Wind Power Forecasting. Energy Convers. Manag. 2025, 330, 119673. [Google Scholar] [CrossRef]
Zhang, Z.; Li, S.; Zhang, L.; Ye, J.; Hu, C.; Yan, L. LLM-LADE: Large Language Model-Based Log Anomaly Detection with Explanation. Knowl. Based Syst. 2025, 326, 114064. [Google Scholar] [CrossRef]
Das, A. Logistic Regression. In Encyclopedia of Quality of Life and Well-Being Research; Michalos, A.C., Ed.; Springer: Dordrecht, The Netherlands, 2014; pp. 3680–3682. ISBN 978-94-007-0753-5. [Google Scholar]
Cortes, C.; Vapnik, V.; Saitta, L. Support-Vector Networks Editor; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1995; Volume 20. [Google Scholar]
Breiman, L. Random Forests; Springer: Berlin/Heidelberg, Germany, 2001; Volume 45. [Google Scholar]
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the KDD’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016. [Google Scholar]
Murtagh, F. Neurocomputing 2 (1990/1991) 183/197. 1991. Available online: https://www.sciencedirect.com/journal/neurocomputing/vol/2/issue/1 (accessed on 3 July 2025).
Categorical Features. Available online: https://skforecast.org/0.12.1/user_guides/categorical-features#ordinal-encoding (accessed on 10 June 2025).
Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
Bergstra, J.; Ca, J.B.; Ca, Y.B. Random Search for Hyper-Parameter Optimization Yoshua Bengio. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
Fu, Y.; Li, B.; Li, L.; Zhang, W.; Xie, T. The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-Based Code Generation. In Proceedings of the ACM on Software Engineering, New York, NY, USA, 1 August 2024. [Google Scholar]
Jiang, G.; Ma, Z.; Zhang, L.; Chen, J. Prompt Engineering to Inform Large Language Model in Automated Building Energy Modeling. Energy 2025, 316, 134548. [Google Scholar] [CrossRef]
OpenAI. GPT-4o System Card. arXiv 2024, arXiv:2410.21276. [Google Scholar] [CrossRef]
DeepSeek-AI; Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; et al. DeepSeek-V3 Technical Report. arXiv 2024, arXiv:2412.19437. [Google Scholar]
Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv 2025, arXiv:2402.07927. [Google Scholar]
Grandini, M.; Bagli, E.; Visani, G. Metrics for Multi-Class Classification: An Overview. arXiv 2020, arXiv:2008.05756. [Google Scholar] [CrossRef]
Caelen, O. A Bayesian Interpretation of the Confusion Matrix. Ann. Math. Artif. Intell. 2017, 81, 429–450. [Google Scholar] [CrossRef]

Figure 1. Multi-class Confusion Matrix [29].

Table 1. Selected SCADA Variables.

	Data Features	Units
1.	Generator Speed	RPM
2.	Gear Oil Temperature	°C
3.	Gear Bearing Temperature	°C
4.	Nacelle Temperature	°C
5.	Ambient Temperature	°C
6.	ΔT Gear Oil and Nacelle	°C
7.	ΔT Gear Oil and Ambient	°C
8.	ΔT Gear Bearing and Nacelle	°C
9.	ΔT Gear Bearing and Ambient	°C
10.	Wind Speed	m/s
11.	Total Power Production	kWh

Table 2. XGBoost Hyperparameter Tuning Range.

Hyperparameters	Value Range
Number of trees (estimators)	100–1000
Learning rate	0.01–0.19
Maximum depth of each tree	3–10
Percentage of samples used per tree	0.6–1.0
Percentage of features used per tree	0.6–1.0
Minimum loss reduction to make split	0.0–0.5
Minimum sum of instance weight needed	1–10
L1 regularisation	0–1
L2 regularisation	0–1

Table 3. MLP Classifier Hyperparameter Tuning Range.

Hyperparameters	Value Range
Hidden layer size	Single-layer network: (28), (56), (100) Two-layer network: (28, 14) Three-layer network: (28, 14, 7)
Activation function of hidden layers	Rectified linear unit
Optimiser algorithm	“Adam”
L2 regularisation term (penalty)	0.0001–0.001
Learning rate	“constant”, “adaptive”
Maximum number of iterations	200–600

Table 4. Metric Formulas [29,30].

Evaluation Metric	Formula
$P r e c i s i o n_{k}$	$\frac{T P_{k}}{T P_{k} + F P_{k}}$
$R e c a l l_{k}$	$\frac{T P_{k}}{T P_{k} + F N_{k}}$
F1-score (binary classification)	$\frac{2 * P r e c i s i o n * R e c a l l}{P r e c i s i o n + R e c a l l}$

Table 5. Baseline Model Binary Classification Performance.

Models	Class	Precision	Recall	F1-Score	CV Average F1-Score
Logistic Regression	3 Months	0.69	0.71	0.71	0.6878
Logistic Regression	Healthy	0.68	0.66	0.67	0.6878
SVM	3 Months	0.83	0.85	0.84	0.8201
SVM	Healthy	0.83	0.81	0.82	0.8201
Random Forest	3 Months	0.94	0.94	0.94	0.9328
Random Forest	Healthy	0.94	0.93	0.93	0.9328
XGBoost	3 Months	0.94	0.94	0.94	0.9333
XGBoost	Healthy	0.93	0.94	0.93	0.9333
MLP Classifier	3 Months	0.89	0.90	0.89	0.8823
MLP Classifier	Healthy	0.89	0.88	0.88	0.8823

Table 6. Performance of GPT-4o-suggested models.

Models	Class	Precision	Recall	F1-Score	CV Average F1-Score
SVM	3 Months	0.83	0.85	0.84	0.8243
SVM	Healthy	0.83	0.81	0.82	0.8243
Random Forest	3 Months	0.94	0.95	0.94	0.9338
Random Forest	Healthy	0.94	0.93	0.94	0.9338
Gradient Boosting Classifier	3 Months	0.81	0.85	0.83	0.8183
Gradient Boosting Classifier	Healthy	0.83	0.78	0.80	0.8183

Table 7. Performance of DeepSeek-V3-suggested models.

Models	Class	Precision	Recall	F1-Score	CV Average F1-Score
Logistic Regression	3-Months	0.69	0.73	0.71	0.6859
Logistic Regression	Healthy	0.68	0.65	0.67	0.6859
Random Forest	3-Months	0.97	0.97	0.97	0.9605
Random Forest	Healthy	0.96	0.96	0.96	0.9605
GB Classifier	3-Months	0.81	0.86	0.84	0.8158
GB Classifier	Healthy	0.84	0.78	0.81	0.8158
XGBoost	3-Months	0.94	0.94	0.93	0.9330
XGBoost	Healthy	0.94	0.93	0.94	0.9330
SVM	3-Months	0.84	0.87	0.86	0.8431
SVM	Healthy	0.85	0.82	0.84	0.8431
K-Nearest Neighbours	3-Months	0.96	0.96	0.96	0.9494
K-Nearest Neighbours	Healthy	0.95	0.96	0.96	0.9494
Naïve Bayes	3-Months	0.66	0.69	0.68	0.6512
Naïve Bayes	Healthy	0.65	0.62	0.63	0.6512

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Tan, Y.W.; Carroll, J. LLMs in Wind Turbine Gearbox Failure Prediction. Energies 2025, 18, 4659. https://doi.org/10.3390/en18174659

AMA Style

Tan YW, Carroll J. LLMs in Wind Turbine Gearbox Failure Prediction. Energies. 2025; 18(17):4659. https://doi.org/10.3390/en18174659

Chicago/Turabian Style

Tan, Yoke Wang, and James Carroll. 2025. "LLMs in Wind Turbine Gearbox Failure Prediction" Energies 18, no. 17: 4659. https://doi.org/10.3390/en18174659

APA Style

Tan, Y. W., & Carroll, J. (2025). LLMs in Wind Turbine Gearbox Failure Prediction. Energies, 18(17), 4659. https://doi.org/10.3390/en18174659

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

LLMs in Wind Turbine Gearbox Failure Prediction

Abstract

1. Introduction

2. Methodology

2.1. SCADA Dataset

2.2. Baseline Machine Learning Models

2.2.1. Logistic Regression [16]

2.2.2. Support Vector Machine [17]

2.2.3. Random Forest [18]

2.2.4. XGBoost [19]

2.2.5. Multi-Layer Perceptron Classifier [20]

2.3. Baseline Model Data Preprocessing

2.3.1. Missing Data

2.3.2. Data Type Handling

2.3.3. Class Imbalance

2.3.4. Model Training

2.3.5. Baseline ML Model Optimisation

2.4. Large Language Models

2.4.1. ChatGPT (GPT-4o) [26]

2.4.2. DeepSeek-V3 [27]

2.5. Prompt Design [28]

2.6. Model Performance Evaluation

3. Results and Discussion

3.1. Baseline Models

3.2. LLM-Suggested ML Pipelines

3.3. Comparative Analysis

4. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Nomenclature

Appendix A

Appendix A.1

Appendix A.2

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI