Article

Large Language Model-Based Autonomous Agent for Prognostics and Health Management

Industrial Intelligence Research Group, AI/DX Center, Institute for Advanced Engineering (IAE), Yongin 17180, Republic of Korea
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Machines 2025, 13(9), 831; https://doi.org/10.3390/machines13090831
Submission received: 8 August 2025 / Revised: 4 September 2025 / Accepted: 5 September 2025 / Published: 9 September 2025
(This article belongs to the Section Automation and Control Systems)

Abstract

Prognostics and Health Management (PHM), including fault diagnosis and Remaining Useful Life (RUL) prediction, is critical for ensuring the reliability and efficiency of industrial equipment. However, traditional AI-based methods require extensive expert intervention in data preprocessing, model selection, and hyperparameter tuning, making them less scalable and accessible in real-world applications. To address these limitations, this study proposes an autonomous agent powered by Large Language Models (LLMs) to automate predictive modeling for fault diagnosis and RUL prediction. The proposed agent processes natural language queries, extracts key parameters, and autonomously configures AI models while integrating an iterative optimization mechanism for dynamic hyperparameter tuning. Under identical settings, we compared GPT-3.5 Turbo, GPT-4, GPT-4o, GPT-4o-mini, Gemini-2.0-Flash, and LLaMA-3.2 on accuracy, latency, and cost, using GPT-4 as the baseline. The most accurate model is GPT-4o with an accuracy of 0.96, a gain of six percentage points over GPT-4. It also reduces end-to-end time to 1.900 s and cost to $0.00455 per 1 k tokens, which correspond to reductions of 32% and 59%. For speed and cost efficiency, Gemini-2.0-Flash reaches 0.964 s and $0.00021 per 1 k tokens with an accuracy of 0.94, an improvement of four percentage points over GPT-4. The agent operates through interconnected modules, seamlessly transitioning from query analysis to AI model deployment while optimizing model selection and performance. Experimental results confirmed that the developed agent achieved stable performance under ideal configurations, attaining an accuracy of 0.97 on FordA for binary fault classification, an accuracy of 0.95 on CWRU for multi-fault classification, and an asymmetric score of 380.74 on C-MAPSS FD001 for RUL prediction, while significantly reducing manual intervention. By bridging the gap between domain expertise and AI-driven predictive maintenance, this study advances industrial automation, improving efficiency, scalability, and accessibility. The proposed approach paves the way for the broader adoption of autonomous AI systems in industrial maintenance.

1. Introduction

1.1. Research Background

Prognostics and Health Management (PHM) is essential for ensuring the safety, reliability, and operational efficiency of industrial equipment [1]. These technologies minimize unexpected failures, optimize maintenance schedules, and reduce overall operational costs, making them indispensable in industrial operation and maintenance [2,3]. Among various PHM tasks, fault diagnosis and Remaining Useful Life (RUL) prediction are essential for predictive maintenance, as they enable early fault detection and provide accurate estimates of component degradation trends.
In practice, however, developing effective PHM systems is a highly challenging task for maintenance engineers and domain practitioners, especially those without advanced AI expertise. Current workflows demand extensive data preprocessing, careful model design, performance evaluation, and iterative refinement, all of which require deep knowledge in machine learning and data analysis [4,5]. This makes the barrier to entry prohibitively high for non-experts. As a result, despite the promise of AI-driven PHM, its deployment is often restricted to organizations with sufficient resources and dedicated AI specialists. This reliance on expert-driven workflows presents several challenges. Firstly, the steep learning curve and technical expertise required create barriers for non-specialists, limiting accessibility and adoption [6,7]. Secondly, manual processes demand considerable time and resources, making it difficult to scale these technologies across various industrial applications [8]. This inefficiency hinders the rapid and cost-effective deployment of PHM systems. Lastly, traditional approaches often struggle to adapt to new industrial environments and datasets, reducing their effectiveness in dynamic operational settings [9].
Therefore, there is a strong motivation to develop a new type of system that allows non-experts to interactively build PHM models through natural language dialog. Such a system should be able to capture the user’s objectives and required level of performance, translate these into concrete modeling tasks, and autonomously generate a fault diagnosis or RUL prediction system that aligns with those expectations. By transforming the process from manual, expert-intensive workflows into intuitive conversations, PHM can become more democratized, scalable, and adaptive to real-world industrial needs.
Thus, we develop a large language model-based autonomous agent that bridges the expertise gap in maintenance workflows by eliciting user objectives and performance targets, translating them into concrete steps for data preparation, modeling, and validation, and automatically producing fault diagnosis and RUL predictors. In doing so, the system broadens access to advanced PHM tools, reduces dependence on specialists, and supports efficient, scalable, and adaptive deployment in industrial settings.

1.2. Advances in LLMs

Recent advancements in LLMs have demonstrated their versatility across various tasks, including natural language processing, mathematical reasoning, and information retrieval [10,11]. These models leverage vast datasets and sophisticated architectures, enabling users with minimal knowledge to perform complex tasks [12]. Their potential has driven growing interest in applying LLMs to domain-specific applications. For instance, autonomous agents integrating LLMs with domain-specific APIs and algorithms have been proposed to solve targeted problems in certain industrial settings [13,14]. These agents leverage the LLM’s capabilities to perform domain-specific tasks while utilizing additional tools or algorithms as required. Furthermore, recent studies have extended this approach by developing systematic systems involving multiple agents [15,16,17,18].
From the perspective of PHM, these advances in LLMs present a unique opportunity. Unlike conventional AutoML methods, LLM-based agents are capable of interpreting user intent expressed in natural language, handling incomplete or ambiguous queries, and dynamically structuring workflows for data preprocessing, model training, and evaluation. This makes them particularly well suited to addressing the long-standing challenge of enabling AI non-specialists to design and deploy effective PHM systems.
In addition, PHM applications often involve multivariate time-series datasets derived from sensors such as vibration, temperature, and pressure. The preprocessing required for such data, which includes noise filtering, normalization, imputation, and feature extraction, has often been overlooked, even though it is critical for model performance [19]. When combined with external preprocessing modules, LLMs can automate and guide these steps through natural language instructions [20].
For example, a user query such as “Prepare the bearing dataset for fault diagnosis” can be translated into a structured sequence of preprocessing operations involving data cleaning, partitioning, and feature engineering. By bridging user intent with PHM-specific data preparation, LLMs improve both the usability and scalability of complex preprocessing pipelines, thereby reinforcing their potential for real-world PHM tasks.
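To make this translation concrete, the structured plan for such a query might resemble the following minimal sketch; the field names and operations are illustrative assumptions rather than a schema used by any particular system.

# Hypothetical structured preprocessing plan derived from the example query;
# field names and operations are illustrative only.
plan = {
    "task": "fault_diagnosis",
    "dataset": "./data/bearing",
    "steps": [
        {"op": "clean", "params": {"impute": "mean", "drop_constant": True}},
        {"op": "partition", "params": {"train": 0.6, "val": 0.2, "test": 0.2}},
        {"op": "features", "params": {"window": 256, "normalize": "zscore"}},
    ],
}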

1.3. Research Gap and Objective

Despite recent advancements demonstrating the versatility and potential of LLMs, several types of gaps remain in their application to PHM domains, where high accuracy, domain-specific knowledge, and reliability are essential. First, current LLMs are not explicitly trained on PHM data, which limits their ability to incorporate domain-specific knowledge required for accurate fault diagnosis and RUL prediction. Second, while LLMs excel in processing natural language, they have limitations in handling heterogeneous PHM datasets such as multivariate time-series sensor data, which require specialized preprocessing, feature extraction, and temporal modeling. Third, the reliability and consistency of LLM outputs are still a concern in accuracy- and reliability-critical applications, since incorrect parameter extraction or task misinterpretation can compromise the dependability of PHM workflows. Finally, existing approaches to applying LLMs in PHM face scalability challenges across diverse industrial contexts, as they often rely on expert-driven customization. This limitation highlights the need for autonomous agents that can flexibly adapt to different equipment, environments, and operating conditions, which is the focus of our proposed framework. To date, no comprehensive framework has been developed to leverage LLMs for automating the full lifecycle of model development while addressing these gaps. Key stages, including data preprocessing, model design, optimization, and performance evaluation, remain heavily reliant on expert intervention, limiting scalability and accessibility.
This study introduces a novel approach by integrating an LLM-based autonomous agent to address these issues and outlines its key advantages as follows:
  • Simplified user interaction and robust query handling: The agent lowers the barrier to entry by allowing non-specialists to interact with the system using simple natural language queries. When queries are ambiguous or incomplete, the agent provides interactive guidance, prompting for missing inputs or suggesting valid alternatives. This functionality prevents workflow interruptions and enhances overall usability, making the system more accessible in practical industrial contexts.
  • Automated model generation and execution: The proposed agent eliminates manual intervention by structuring and executing the complete workflow for fault diagnosis and RUL prediction, thereby reducing reliance on domain expertise and accelerating the development of predictive models.
  • Comprehensive evaluation of multiple LLM engines: This study systematically compares LLMs, including GPT-3.5 Turbo, GPT-4, GPT-4o, GPT-4o-mini, LLaMA-3.2, and Gemini-2.0-Flash, assessing their effectiveness in parameter extraction, computational cost, and query processing. The comparative analysis provides insights into the optimal LLM choice for an LLM-based autonomous agent.
  • Domain-adapted automation with PHM-specific tools: The autonomous agent integrates PHM-specific tools to ensure that model generation, evaluation, and performance assessment are aligned with industrial requirements. As illustrated in Figure 1, the agent dynamically activates specialized modules for data handling, model training, and evaluation, thereby tailoring the process to PHM tasks.
The rest of the paper is organized as follows. Section 2 reviews existing research on fault diagnosis and RUL prediction using machine learning and deep learning, as well as studies on LLMs and their applications in domain-specific tasks. Section 3 outlines the methodology, functionalities, and implementation of the proposed autonomous agent for fault diagnosis and RUL prediction, including the overall agent design and its components. Section 4 presents the evaluation results of the autonomous agent applied to various use cases, along with performance comparisons. In Section 5, we discuss the performance results, limitations, and potential strategies to address these limitations. Finally, Section 6 concludes the paper.

2. Related Work

2.1. Fault Diagnosis and Remaining Useful Life Prediction

Machine learning (ML) and deep learning (DL) have shown great promise in fault diagnosis and RUL prediction, enabling meaningful insights from large datasets. These methods have advanced predictive systems through architectures like Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, Attention models, and Hybrid models.
CNNs are widely used for analyzing sensor data, excelling in spatial feature extraction for applications like fault detection and RUL estimation [21]. Their effectiveness in processing high-dimensional data has made them crucial for pattern recognition in industrial settings. LSTM networks improve temporal modeling by capturing long-term dependencies in sequential data, making them well-suited for prognostic tasks in fault diagnosis and RUL prediction [22]. In commercial aviation, for instance, LSTM-based models trained on multivariate sensor trajectories like NASA’s C-MAPSS learn degradation trends to predict turbofan RUL. These models enhance prediction accuracy and prevent overfitting, enabling reliable maintenance scheduling and increased operational safety. Feeding historical sensor data into such deep learning architectures allows learning of degradation trends that are critical for precise RUL prediction [23,24]. Attention-based models, such as time-enhanced multi-head self-attention, enhance these capabilities by selectively focusing on relevant features, particularly for aero-engine RUL prediction [25]. Hybrid models further integrate these strengths, combining CNNs for spatial analysis and LSTMs for temporal dependency modeling. For example, CNN-LSTM architectures and multi-scale CNNs with bidirectional Gated Recurrent Units (GRU) have demonstrated improved diagnostic accuracy in applications like bearing fault detection and rotating machinery analysis [26,27,28]. In fault diagnosis, a representative bearing example uses raw accelerometer signals that are windowed and normalized, then passed through a one-dimensional CNN to distill local spectral temporal motifs and through an LSTM to capture sequence dynamics, after which a classifier outputs health-state labels such as normal, inner-race defect, outer-race defect, and ball defect [29,30].
Comprehensive reviews of PHM highlight the transformative impact of DL models in enabling predictive maintenance and scalable health monitoring, addressing critical efficiency and accuracy challenges in industrial applications [31,32,33]. Despite their success, prior research has also identified critical limitations in the adoption and scalability of these methods in real-world industrial settings. Developing and deploying ML and DL models often require extensive manual intervention by data scientists and AI specialists [34]. Data preprocessing, feature engineering, model selection, and hyperparameter tuning remain labor-intensive and time-consuming, making these methods inaccessible to non-specialists [35].
Although automated machine learning (AutoML) systems have been introduced to streamline these processes, they still fall short in fully addressing domain-specific complexities, leaving significant room for improvement [36,37]. Additionally, industrial maintenance professionals often lack the advanced technical skills required to interpret ML and DL model outputs and integrate them into decision-making workflows [38]. This expertise barrier restricts the practical adoption of these technologies, particularly in industries where non-specialists play a central role in operations [39].
To address these challenges, several studies have proposed solutions aimed at reducing reliance on expert knowledge and improving accessibility. Integrating ML and DL models with big data analytics frameworks has been explored to create more robust and user-friendly systems [40]. Furthermore, strategic management perspectives have emphasized the importance of designing AI systems that are both adaptable and scalable across diverse applications [41,42]. However, these methods still require significant computational resources and domain-specific adjustments. As such, the development of more automated and user-friendly frameworks remains a critical focus in advancing AI for fault diagnosis and RUL prediction.

2.2. Large Language Models

LLMs have significantly enhanced language understanding and generation capabilities in the field of Natural Language Processing (NLP) [43]. These capabilities, known as Language Modeling (LM), represent a fundamental approach to advancing machine language intelligence. LM predicts appropriate words based on the context of preceding and succeeding words in a sequence. Early research applied information theory to human language, demonstrating the potential of n-gram models to predict or compress natural language text [44]. This led to the development of Statistical Language Models (SLMs), which estimate the probability of word sequences. SLMs rely on the Markov assumption and n-gram models to predict the next word, making them simple and efficient tools for various NLP tasks [45]. However, SLMs have notable limitations, including data sparsity and an inability to understand long-range contexts. Techniques such as smoothing and back-off methods have been employed to mitigate these issues [46,47], but they remain inadequate for addressing the complexities of high-dimensional data [48,49].
To overcome these challenges, Neural Language Models (NLMs) were introduced, marking a significant advancement in language modeling. NLMs use neural networks to learn word sequences and calculate text probabilities [50]. They transform words into embedding vectors, capturing contextual meaning more effectively [51]. Architectures like Recurrent Neural Networks (RNNs), LSTM, and Transformer efficiently model relationships between words, enabling superior performance in tasks such as web search, machine translation, text generation, and sentiment analysis [10,11,52]. However, NLMs require large-scale data and high computational costs, and their lack of interpretability presents challenges in applications where trust and transparency are critical [53]. More recently, Pretrained Language Models (PLMs) and LLMs have revolutionized the NLP paradigm. PLMs are pretrained on massive amounts of unlabeled text data and fine-tuned for specific downstream tasks, achieving a balance between generality and task-specific performance [54]. LLMs, built on Transformer architectures with billions of parameters, further enhance language understanding and generation capabilities [39,40].
They demonstrate emergent abilities, allowing them to learn new tasks and solve complex problems. Examples such as OpenAI’s GPT-3.5 Turbo, GPT-4, GPT-4o, and GPT-4o-mini, Meta’s LLaMA-3.2, and Google’s Gemini-2.0-Flash showcase reasonable accuracy and efficiency not only in everyday tasks but also in understanding domain-specific terminology and linguistic nuances. These models have substantially advanced text analysis and query processing in specialized fields.

2.3. LLMs for Domain-Specific Applications

Advancements in LLMs have demonstrated their potential to address domain-specific challenges across various industries. For instance, in materials science, LLMs have been utilized to predict and generate metal–organic frameworks (MOFs), highlighting the potential to transform material design processes [55]. In structured data analysis, multi-agent LLM systems have been employed to handle tabular data queries, enabling more efficient and accurate data-driven insights [56]. In chemistry and life sciences, LLMs have shown promise in tackling molecular design problems, suggesting potential applications in drug discovery and chemical analysis [17]. In engineering and automation, LLMs have been integrated into workflows for electronic design automation (EDA), enhancing circuit design efficiency [57]. In cloud computing, LLM-based autonomous agents have been developed to analyze and resolve root causes of system issues, demonstrating utility in select cases [16].
Evaluations of LLM-based agents in real-world scenarios have further highlighted their adaptability and reliability [58]. In the energy sector, LLMs have been applied to detect gas leaks in natural gas valve chambers, offering the potential to enhance safety and efficiency [59]. Autonomous systems and manufacturing have also benefited from LLM integration. Multi-agent frameworks leveraging LLMs have been developed to improve decision-making and control in autonomous driving [60]. In the manufacturing industry, LLM-powered decision-making systems have been utilized for managing carbon emissions, presenting opportunities to promote sustainable practices [61].
Similarly, LLMs have been applied in decentralized collaboration systems using smart contracts, enabling more efficient decision-making in distributed systems [62]. LLMs have also been explored for visual information seeking tasks, improving autonomous data collection and exploration [63]. Flexible modular production systems incorporating LLM-powered agents have shown potential in enhancing operational flexibility and scalability in manufacturing [64]. In scientific research, LLMs have been applied to predict organic synthesis pathways, simplifying chemical processes [65]. Autonomous chemical research has also employed LLMs to support experimental design and data analysis, demonstrating potential efficiency improvements [18].
Furthermore, the integration of LLMs with external tools has enhanced their problem-solving capabilities across diverse applications [15]. In project management, LLM-based agents have improved efficiency and adaptability in agile workflows [66]. In particle physics, LLM-inspired computational approaches have been used to analyze high-energy physics data, contributing to the exploration of phenomena beyond the Standard Model [67]. While LLMs have shown promise in a variety of domains, their application to PHM tasks, which are central to industrial maintenance, remains largely unexplored. This study aims to address this gap by developing an LLM-powered autonomous agent tailored for these critical tasks, paving the way for more efficient and automated solutions in the maintenance industry.

3. Method

3.1. Overall Strategy for Autonomous Agent

The autonomous agent is designed to process user queries in natural language and generate models for fault diagnosis or RUL prediction in an efficient and structured manner. Unlike traditional PHM workflows that require manual dataset preprocessing, algorithm selection, and parameter tuning, the proposed agent allows users to initiate the entire process through simple natural language instructions. This design significantly lowers the barrier to entry, enabling non-specialists to perform complex tasks such as fault diagnosis or RUL prediction without programming expertise or in-depth AI knowledge. For example, when a user inputs a query like “Conduct fault diagnosis on my bearing dataset”, the system automatically interprets the request, prepares the data, selects and trains appropriate models, and evaluates their performance.
Beyond simplifying access for non-specialists, the agent is also designed to remain robust when user inputs are ambiguous or incomplete. If a required parameter such as the dataset path is missing, the Parameter evaluator interactively prompts the user to provide the missing information or suggests a default option. Likewise, if unsupported models or evaluation metrics are requested, the system suggests valid alternatives from its supported list. By guiding users in this way, the agent ensures that tasks can proceed smoothly even with imperfect queries, thereby preventing workflow interruptions and further enhancing overall usability.
As illustrated in Figure 2, the agent consists of four interconnected modules: Parameter extractor, Parameter evaluator, Task executor and Answer generator. These modules work together in a coordinated manner, ensuring a smooth transition from query processing to model generation and result delivery.
The Parameter extractor analyzes the user query to identify necessary parameters (e.g., target category, dataset path) and optional parameters (e.g., model name, evaluation flag). To ensure reliable extraction, prompt engineering techniques are applied, allowing the module to accurately extract parameters. Based on this structured approach, the LLM engine processes the query and extracts key parameters.
To account for diverse query structures and potential ambiguities, we define eight use case categories, each representing a different query characteristic. These predefined cases help ensure that the extraction process remains consistent and adaptable across various input scenarios. Detailed description and analysis of these use cases can be found in Section 4.1.
The extracted parameters are then passed to the Parameter evaluator, which verifies their validity and completeness. If any necessary parameters are missing, the evaluator prompts the user for the missing information and updates the parameters accordingly. This process is repeated until all required information is provided. If optional parameters are missing, the agent suggests a list of available options and uses the values selected by the user; otherwise, it applies default settings for smooth execution.
Once validation is complete, the structured parameters are passed to the Task executor, which configures and initiates the model generation process. The Task executor is composed of three main components: the Manager, Tool, and Task evaluator, each playing a crucial role in executing and optimizing the workflow. The Manager structures the job flow based on the extracted parameters and activates the appropriate submodules within Tool. The Tool includes six key submodules: Model optimizer, Model tuner, Model trainer & evaluator, Data loader, Data preprocessor, and Model generator, responsible for data preparation, model selection, hyperparameter tuning, training, and evaluation.
After a model is generated, the Task evaluator within the Task executor assesses its performance based on predefined metrics. If the model does not meet the required criteria, the system triggers an optimization cycle. This process continues until an optimal model is obtained or the maximum optimization limit is reached. After optimization, the Task executor selects the top 5 models based on the evaluation flag and saves their details into a Comma-Separated Values (CSV) file. The Answer generator then extracts information about the best-performing model and formats it into a structured response. The following sections provide a detailed explanation of each module in the autonomous agent.

3.2. Query Analysis of Autonomous Agent

3.2.1. Parameter Extractor

The Parameter extractor of the autonomous agent is based on an LLM engine and is responsible for analyzing user queries and extracting parameters. As previously described, these parameters consist of necessary parameters and optional parameters.
The necessary parameters are essential for the autonomous agent to execute tasks based on the user query. In this study, we define the target category and dataset path as the necessary parameters required for executing model generation tasks. The target category specifies the type of model to be generated, such as fault diagnosis (binary fault classification or multi-fault classification) or RUL prediction, while the dataset path provides the location of the data required for model training.
Additionally, we designate model name, evaluation flag, and evaluation value as optional parameters to accommodate diverse user requirements and allow greater flexibility in model configuration. The model name allows users to specify architectures such as CNNs, LSTM networks, attention models, and hybrid models, while the evaluation flag enables users to select performance criteria such as Validation Loss, Root Mean Squared Error (RMSE), and F1 Score.
Since users may express the same intent using different wording and structures, the extracted parameters can vary significantly, potentially reducing the accuracy of the Parameter Extractor. This variation propagates through subsequent stages of the autonomous agent, directly impacting the effectiveness of the Task Executor. Therefore, selecting an appropriate LLM engine is crucial to ensuring robust parameter extraction and maintaining overall system reliability.
We evaluated the performance of parameter extraction with six widely used LLM engines [68,69]. To determine the most suitable LLM engine for this study, we conducted experiments on 100 use cases covering various scenarios across eight predefined use-case types. To ensure a fair performance comparison, each use case was repeated 10 times per engine. The engines were assessed based on accuracy (exact match at the use-case level), end-to-end response time (seconds), and API cost in USD per 1 k tokens (input plus output where applicable). Table 1 provides a comparative analysis of each LLM engine’s performance, summarizing their effectiveness in parameter extraction across these metrics. Additionally, overall results of parameter extraction, response time, and cost, which compare the performance of each engine across the eight use case categories comprising 100 queries, can be found in Tables S1–S3 of Supplementary Materials.
The experimental results showed that GPT-4o achieved the highest accuracy of 0.96 while maintaining a lower response time and cost than GPT-4. Although GPT-4 had a slightly lower accuracy of 0.90, it exhibited a higher latency of 2.809 s and a higher cost of $0.01114 per 1 k tokens. Additionally, Gemini-2.0-Flash and GPT-4o-mini achieved competitive accuracies of 0.94 and 0.91, with faster response times of 0.964 s and 1.489 s and lower costs, suggesting their suitability for scenarios that prioritize efficiency over absolute accuracy. These results emphasize the need to select an LLM engine that aligns with the specific requirements of each task. While GPT-4o is well suited for applications demanding high accuracy, Gemini-2.0-Flash and GPT-4o-mini offer cost-effective options when processing speed is a priority.
Since precise operation and outcome generation are crucial in the autonomous agent, GPT-4o was chosen as the LLM engine for this study. Furthermore, the agent is designed to accommodate multiple LLM engines, providing flexibility in model selection based on constraints such as computational cost, response time, and accuracy. This adaptability ensures that the framework can be tailored to various operational scenarios, optimizing performance according to application-specific requirements.
Additionally, as illustrated in Figure 3, the prompt engineering techniques of the Parameter extractor define the extraction task the LLM must perform, while the user specifies the desired extraction conditions. These conditions are predefined based on parameter categories, including expected formats and constraints. The extraction task request is formulated through prompt engineering and sent to the LLM engine, which processes the request and returns a response containing a parameter dictionary extracted from the user query. This parameter dictionary includes essential values such as target category, dataset path, evaluation flag, model list, and target evaluation value, which are necessary for subsequent steps in the autonomous agent workflow. Algorithm 1 presents a simplified pseudocode outlining the query processing and parameter extraction procedure. A more detailed version can be found in Table S4 of the Supplementary Materials.
Algorithm 1 Parameter extractor
Procedure extract_parameter (Q, T, O)
Inputs:
Q: a user query
T: query type
O: extraction options
Output:
A dictionary of extracted parameters
1:  Define system_role and message based on T
2:  If T == 0:
3:    system_role ← “Extract all parameters”
4:    message ← {“validations”: “Ensure missing values are assigned None”}
5:  Else if T == 1:
6:    system_role ← “Extract necessary parameters”
7:    message ← {“validations”: “Validate missing local paths”}
8:  Else if T == 2:
9:    system_role ← “Extract ” + (“models” if O[1] == 1 else “evaluation method” if O[1] == 2 else “evaluation method and models”)
10:   message ← {“validations”: [O[2][0], O[2][1]]}
11: Else if T == 3:
12:   system_role ← “Extract dataset path”
13: Return LLM_engine (system_role, message)
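As a complement to Algorithm 1, the following minimal Python sketch shows how such an extraction call could be issued, assuming the OpenAI chat-completions API; the prompt wording, output keys, and the example response are illustrative assumptions rather than the agent’s exact implementation.

# Minimal sketch of an extraction call (assumed OpenAI API; prompts illustrative).
import json
from openai import OpenAI

client = OpenAI()

def extract_parameters(user_query: str) -> dict:
    system_role = (
        "Extract parameters from the user query as JSON with keys: "
        "target_category, dataset_path, evaluation_flag, model_list, "
        "target_evaluation_value. Assign null to missing values."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_role},
            {"role": "user", "content": user_query},
        ],
        response_format={"type": "json_object"},  # force a parseable JSON reply
    )
    return json.loads(response.choices[0].message.content)

# A well-defined query (Case 2) might then yield a dictionary such as
# {"target_category": "rul_prediction", "dataset_path": "./data/cmapss",
#  "evaluation_flag": "rmse", "model_list": ["lstm", "bilstm"],
#  "target_evaluation_value": 30}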

3.2.2. Parameter Evaluator

The Parameter evaluator verifies the parameters extracted by the Parameter extractor to ensure that both necessary and optional parameters required by the Task executor are correctly identified. First, it checks for the presence of essential parameters, such as the target category and dataset path. If any of these parameters are missing due to LLM inference errors or user input omissions, the Parameter evaluator prompts the user for the missing information. Upon receiving a response, the Parameter extractor reprocesses the query, and the Parameter evaluator revalidates it. This iterative process continues until all necessary parameters are accurately extracted.
Then, the Parameter evaluator conducts a two-step validation of the dataset path. It first checks for the existence of the dataset folder; if the folder does not exist, the system requests a new path from the user. When the user provides a revised path, the Parameter extractor re-extracts and the Parameter evaluator revalidates it. This process repeats until a valid dataset path is obtained. Subsequently, the system checks whether the required data files are present within the folder. If any files are missing, the system prompts the user for additional input and repeats the verification cycle until all required files are correctly specified.
After validating the necessary parameters, the Parameter evaluator proceeds to assess the optional parameters. Although these are not required for execution, they still undergo validation to ensure correctness. If any optional parameters are missing, the Parameter evaluator provides suggested values or allows users to modify them. Additionally, if an extracted optional parameter is unsupported by the Task executor, the system suggests an alternative valid parameter.
Throughout this process, the Parameter evaluator dynamically interacts with the user using an LLM-based messaging system, applying structured prompt engineering techniques to ensure clarity. The overall workflow is illustrated in Figure 4, with key steps summarized in Algorithm 2. A detailed implementation is provided in Table S5 of the Supplementary Materials.
Algorithm 2 Parameter evaluator
Procedure evaluate_parameter (Q)
Method necessary_parameter_checker (Q)
Inputs:
Q: query (dictionary of extracted parameters)
Output:
request_query (LLM-generated message) or False
1:  If missing parameters exist:
2:    Return LLM_engine_request (Q)
3:  Return False
---
Method dataset_checker (P)
Inputs:
P: path (dataset directory path)
Output:
(Boolean, request_query)
1:  If path exists and contains files: Return (True, False)
2:  Else Return (False, LLM_engine_request (P))
---
Method unsupport_parameter_checker (Q)
Inputs:
Q: query (dictionary of extracted parameters)
Output:
suggestion_message, p_type, p
1:  U_m: models not in supported list
2:  U_e: evaluation flag not in supported list
3:  S_m: supported models list
4:  S_e: supported evaluation flags list
5:  If U_m or U_e:
6:    p_type, p ← (1, S_m) if only U_m
7:    p_type, p ← (2, S_e) if only U_e
8:    p_type, p ← (3, [S_m, S_e]) if both U_m and U_e
9:    Return LLM_engine_request (Q), p_type, p
10: Return False
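For illustration, the dataset-path check in Algorithm 2 could be rendered in Python as the following sketch; in the agent itself, the follow-up message would be phrased by the LLM engine rather than hard-coded.

# Sketch of dataset_checker; the user-facing message is normally LLM-generated.
import os

def dataset_checker(path: str):
    # Return (True, None) if the folder exists and contains files;
    # otherwise return (False, a prompt asking the user for a corrected path).
    if os.path.isdir(path) and any(os.scandir(path)):
        return True, None
    return False, f"The dataset path '{path}' is missing or empty. Please provide a valid path."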

3.3. Task Executor

The Task executor is activated once the Parameter Evaluator validates the extracted keywords. As mentioned in Section 3.1, the Task executor comprises three core components: Manager, Task Evaluator, and Tool. The Manager structures the job flow by interpreting keywords and dynamically configuring the optimization process. The Task Evaluator monitors model performance and determines whether to terminate the optimization cycle or continue refinement. The Tool module consists of various functional submodules, including the Model Optimizer, Model Tuner, Model Trainer & Evaluator, Data Loader, Data Preprocessor, and Model Generator, which collectively handle model training and tuning tasks. A detailed explanation of each module is provided in the following sections.
To simulate real-world scenarios where users input arbitrary datasets, we designed the autonomous agent using widely adopted benchmark datasets in the PHM domain, with a primary focus on fault diagnosis and RUL prediction. Specifically, the FordA dataset [70] was used for binary fault classification, the Case Western Reserve University (CWRU) Bearing Dataset [71] for multi-fault classification, and the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset [72] for RUL prediction.
By utilizing these datasets, we simulate scenarios where users input arbitrary datasets, ensuring that the agent can generalize across various real-world applications. The Task executor follows an iterative optimization process, progressively refining models through multiple cycles. Initially, the system starts with a narrow search space for hyperparameter tuning to ensure efficient exploration while minimizing computational overhead. As the optimization cycles progress, the range of tunable hyperparameters is gradually expanded, allowing the model to explore a broader search space and achieve incremental performance improvements. This stepwise approach balances computational efficiency with model accuracy, ensuring that resources are effectively utilized while optimizing performance. At the end of the process, the executor selects the top 5 models based on their evaluation flag and saves the results as a CSV file. The workflow is presented in Algorithm 3 as pseudocode.
Algorithm 3 Task executor
Procedure Task executor (K)
Inputs:
K: a dictionary to construct the job flow
Output:
A CSV file containing information on the top 5 models
1:  Load configuration file based on K
2:  Manager updates C (config) with keywords and structures a job flow activating the Tool
3:  num_cycle ← 1
4:  all_results ← ∅
5:  Perform optimization:
6:  While num_cycle ≤ max_cycle do
7:    Adjust hyperparameter range based on num_cycle
8:    Perform model tuning:
9:      Initialize the model components:
10:     model_list ← Model generator (C)
11:     data ← Data loader (C)
12:   Train and evaluate models:
13:     results ← Model trainer & evaluator (model_list, data)
14:     all_results ← all_results ∪ results
15:   Task evaluator analyzes results and determines whether to increase num_cycle
16: Select top 5 models from all_results
17: Save a CSV file with the top 5 models
18: Return the CSV file

3.3.1. Manager

The Manager is responsible for coordinating the overall optimization process, serving as the central controller of the Task executor. The Manager activates the specific submodules of the Tool that are suited to the extracted parameters, structuring a job flow that enables the system to execute the optimization process. As illustrated in Figure 2, the Manager collaborates closely with the Task evaluator, ensuring that model performance is continuously monitored and the optimization process is dynamically adjusted based on evaluation results.
When the Task evaluator returns feedback requesting a re-optimization cycle, the Manager initiates an additional optimization cycle, modifying specific hyperparameters such as batch size, learning rate, and the number of epochs. This optimization continues until an optimal model is obtained or the maximum optimization limit is reached. In this study, the maximum number of optimization cycles is set to five, considering both training time and computational efficiency. However, this limit can be adjusted based on resource availability and specific optimization requirements.
Through this series of processes, the Manager ultimately generates and stores a single CSV file, which is later provided to the Answer generator. This file contains information on the top 5 models based on performance metrics, including details such as the model name, corresponding performance score, and the storage path of the models.
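A minimal sketch of the expanding search schedule is shown below; the concrete candidate values are illustrative assumptions, with only the five-cycle limit taken from the text.

# Sketch of the cycle-dependent search space; candidate values are illustrative.
def search_space(num_cycle: int) -> dict:
    batch_sizes = [32, 64, 128, 256, 512]
    learning_rates = [1e-3, 5e-4, 1e-4, 5e-5, 1e-5]
    epoch_budgets = [30, 50, 80, 120, 200]
    k = min(num_cycle, 5)  # cycle 1 explores a narrow range; later cycles widen it
    return {
        "batch_size": batch_sizes[:k],
        "learning_rate": learning_rates[:k],
        "epochs": epoch_budgets[:k],
    }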

3.3.2. Tool

The Tool comprises six main components: Model optimizer, Model tuner, Model trainer & evaluator, Data loader, Data preprocessor, and Model generator. Each module, activated by the job flow structured by Manager, performs its specific role in completing the assigned task. Figure 5 illustrates the overview of the Tool’s submodules, detailing the specific functions of each component in the optimization and model generation process.
The Data loader constructs data pipelines based on a predefined batch size, ensuring efficient data handling during training and evaluation. To enable unbiased model selection and prevent temporal and label leakage, it establishes train/validation/test partitions using a file-structure-aware strategy. This design is adopted for general applicability. In our study, the official Train/Test splits of FordA for binary fault classification and C-MAPSS for RUL prediction were preserved, and a 20% validation subset was carved from the training portion. CWRU for multi-fault classification, provided as class-wise folders without a predefined split, was partitioned using a stratified 60/20/20 rule. The Data preprocessor applies normalization and missing value imputation for fault diagnosis, while windowing, clipping, and conditional Principal Component Analysis (PCA) are used for RUL prediction. In particular, conditional PCA is adopted to improve numerical conditioning and generalization. For prediction tasks in the PHM domain, strongly correlated sensor channels often induce multicollinearity, which can make the regression mapping ill-conditioned and hinder stable optimization [73,74]. To address this, the agent computes variance inflation factors (VIFs) on the training features and, when the maximum VIF exceeds 100, fits a PCA on the training split to decorrelate and compress the feature space. In this study, PCA dimensionality was tuned over {5, 7, 10}. The Model trainer & evaluator executes training and assesses performance using the predefined evaluation flag, applying early stopping to prevent unnecessary computation. The Model tuner dynamically adjusts model configurations based on evaluation feedback, refining the initial search range for parameters such as batch size and learning rate.
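The VIF-gated PCA step can be sketched as follows, assuming statsmodels and scikit-learn; only the threshold of 100 and the candidate dimensions {5, 7, 10} come from the text, while everything else is an illustrative assumption.

# Sketch of conditional PCA gated by the maximum VIF on the training split.
import numpy as np
from sklearn.decomposition import PCA
from statsmodels.stats.outliers_influence import variance_inflation_factor

def maybe_decorrelate(X_train: np.ndarray, X_test: np.ndarray,
                      vif_threshold: float = 100.0, n_components: int = 10):
    max_vif = max(variance_inflation_factor(X_train, i)
                  for i in range(X_train.shape[1]))
    if max_vif <= vif_threshold:  # features already well conditioned: skip PCA
        return X_train, X_test, None
    pca = PCA(n_components=n_components).fit(X_train)  # fit on training data only
    return pca.transform(X_train), pca.transform(X_test), pca

# In the agent, n_components would be tuned over {5, 7, 10}.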
Additionally, as mentioned in Section 3.3.1, the Model optimizer interacts with the Task evaluator to adjust the tuning process dynamically. It enables the Model tuner to progressively explore a broader set of hyperparameters based on evaluation feedback. The Model generator module leverages deep learning architectures to ensure robustness and reliability in PHM tasks. By utilizing widely recognized models, it effectively captures temporal dependencies, extracts meaningful features, and models complex relationships in sequential sensor data.
For fault diagnosis, the module supports architectures such as LSTM, CNN-LSTM, MLP, TCN, and Transformer, which effectively capture spatial and temporal dependencies in sensor data. LSTM is widely applied in fault classification due to its ability to model dynamic system behaviors and identify fault patterns by capturing long-term dependencies in sensor signals [75]. For clarity, the LSTM updates are written separately as:
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)    (1)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)    (2)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)    (3)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t    (4)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)    (5)
h_t = o_t \odot \tanh(c_t)    (6)
where x_t is the input at time t, h_t and c_t are the hidden and cell states, W, U, and b are learnable parameters, \sigma(\cdot) and \tanh(\cdot) are the sigmoid and hyperbolic tangent activations, and \odot denotes element-wise multiplication. For binary and multi-fault classification, the decision rules are given by:
\hat{p} = \sigma(w^\top h_T + b)    (7)
\hat{y} = \mathbb{1}[\hat{p} \ge \tau] (default \tau = 0.5, tuned under class imbalance)    (8)
\hat{y} = \mathrm{softmax}(W h_T + b), \quad \hat{c} = \arg\max_k \hat{y}_k    (9)
so that (7) applies a logistic link to the sequence summary h_T to obtain the posterior for the positive class, (8) implements the decision threshold \tau, and (9) yields K-way posteriors that sum to one and selects the most probable class.
CNN-LSTM enhances classification accuracy by combining spatial feature extraction with temporal modeling; a representative 1D convolution block is:
y_{c,t} = \sum_{m} \sum_{k=0}^{K-1} w_{c,m,k} \, x_{m,\, t + s \cdot k} + b_c    (10)
z_{c,t} = \mathrm{ReLU}(y_{c,t}), \quad p_{c,t} = \mathrm{MaxPool}(z_{c,\cdot})    (11)
whose stacked outputs form a temporal embedding subsequently consumed by an LSTM head. MLP provides a simple yet effective feed-forward baseline and has been successfully used in gear-fault tasks [76]. TCN utilizes dilated convolutions to efficiently capture long-range dependencies in time-series fault data, achieving competitive classification performance compared to recurrent networks [77]. Transformer models leverage self-attention mechanisms to effectively model complex fault patterns and have been shown to outperform recurrent architectures in failure mode classification [78].
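A compact PyTorch sketch of the CNN-LSTM variant implied by Equations (10) and (11), combined with the softmax rule of Equation (9), is given below; the framework choice and layer sizes are our assumptions, since the paper does not fix an implementation.

# Sketch of a CNN-LSTM classifier; sizes and framework are illustrative.
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, in_channels: int, hidden: int, n_classes: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2),  # Eq. (10)
            nn.ReLU(),                                             # Eq. (11)
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)  # logits for Eq. (9)

    def forward(self, x):             # x: (batch, channels, time)
        z = self.conv(x)              # (batch, 32, time // 2)
        z = z.transpose(1, 2)         # (batch, time // 2, 32)
        _, (h_T, _) = self.lstm(z)    # h_T: (1, batch, hidden), sequence summary
        return self.head(h_T[-1])     # softmax is applied inside the loss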
For RUL prediction, the module supports architectures such as LSTM, Bi-LSTM, CNN-LSTM, and Transformer, which are well-suited for capturing temporal dependencies in sequential sensor data. LSTM is widely used for RUL estimation due to its ability to model long-term dependencies in degradation patterns. LSTM effectively predicts the RUL of industrial components by capturing nonlinear degradation trends [75]. Bi-LSTM captures both forward and backward dependencies, providing a more comprehensive representation of degradation trends. This structure reduces information loss and enhances predictive accuracy, outperforming unidirectional LSTM models in RUL estimation [22]. CNN-LSTM enhances RUL prediction by integrating convolutional layers for spatial feature extraction with LSTM for temporal modeling. This hybrid model improves predictive accuracy in complex degradation scenarios by learning both local and sequential dependencies [28]. Transformer models leverage self-attention mechanisms to efficiently capture long-range dependencies in sequential data. They outperform recurrent models in RUL prediction by effectively modeling complex temporal relationships [25]. For the Transformer backbone used for RUL regression, the input token matrix X \in \mathbb{R}^{T \times d} is combined with positional encodings:
Z^{(0)} = X + P    (12)
Q = Z^{(l)} W_Q, \quad K = Z^{(l)} W_K, \quad V = Z^{(l)} W_V, \quad \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(Q K^\top / \sqrt{d_k}\right) V    (13)
\mathrm{MHA}(Z^{(l)}) = \mathrm{Concat}(\mathrm{Attn}_1, \ldots, \mathrm{Attn}_H) W_O    (14)
\tilde{Z}^{(l)} = \mathrm{LN}(Z^{(l)} + \mathrm{MHA}(Z^{(l)})), \quad Z^{(l+1)} = \mathrm{LN}(\tilde{Z}^{(l)} + \mathrm{FFN}(\tilde{Z}^{(l)}))    (15)
\mathrm{FFN}(Z) = \phi(Z W_1 + b_1) W_2 + b_2    (16)
which feeds a backbone-agnostic linear readout to produce scalar RUL:
\hat{r} = w^\top h_T + b    (17)
In our implementation, h T is instantiated as the last hidden state for LSTM (concatenated forward/backward states for Bi-LSTM), the final hidden state of the LSTM head for CNN–LSTM, and the global average over final token embeddings for Transformer.
Given the user-specified model list, the agent instantiates each backbone and trains it on the prepared train and validation splits, holding out the test split. Optimization uses stochastic gradient descent or Adam [79] as configured. Binary and multi-fault classification optimize cross-entropy with sigmoid or softmax heads, while RUL prediction optimizes mean squared error with a linear head. We monitor accuracy, precision, and recall for classification, and root mean squared error together with the asymmetric score for RUL prediction. Early stopping on validation loss, learning-rate reduction on plateaus, and checkpointing of the best validation model are enabled. After training, the model is evaluated on the held-out test set and the agent exports the metrics, predictions, final weights, and the configuration. Each submodule’s pseudocode is provided in Tables S6–S10 of the Supplementary Materials.
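The training protocol above can be condensed into the following PyTorch-flavoured sketch; the framework and hyperparameter values are assumptions, while the mechanisms themselves (Adam, early stopping on validation loss, plateau-based learning-rate reduction, best-model checkpointing) follow the text.

# Sketch of the training loop; hyperparameters are illustrative.
import torch

def train(model, train_loader, val_loader, loss_fn, epochs=200, patience=10):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=3)
    best_val, wait = float("inf"), 0
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x), y).item() for x, y in val_loader) / len(val_loader)
        sched.step(val)                    # reduce LR when validation loss plateaus
        if val < best_val:                 # checkpoint the best validation model
            best_val, wait = val, 0
            torch.save(model.state_dict(), "best.pt")
        else:
            wait += 1
            if wait >= patience:           # early stopping
                break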

3.3.3. Task Evaluator

As illustrated in Figure 2, the Task evaluator determines whether the task’s objectives have been achieved based on results obtained at the end of an optimization cycle. To make this decision, the Task evaluator also utilizes parameters extracted from the query, and its role varies depending on the presence and values of two parameters: the evaluation flag and the target evaluation value. In cases where multiple models are evaluated simultaneously, these parameters help prioritize the selection of the most optimal model.
Concretely, selection and stopping are driven by metrics computed on the validation split. The evaluator applies the user-specified evaluation flag as the governing criterion: loss-type measures such as validation loss, root mean square error, and the asymmetric score are minimized, whereas accuracy-type measures such as accuracy, precision, recall, and F1 score are maximized. When a target evaluation value is supplied, success is declared once the governing metric satisfies the target, that is, less than or equal to the target for loss-type measures and greater than or equal to the target for accuracy-type measures, and the search terminates. When no evaluation flag is provided, validation loss serves as the default criterion. The test split remains untouched during selection and is evaluated only once after the search. After the final cycle, all trials are ranked by the governing criterion, and the top five configurations are returned together with the checkpoint path of the best model for downstream use. Further details on the implementation and decision-making process of the Task Evaluator are provided in Algorithm 4 as pseudocode.
Algorithm 4 Task evaluator
Procedure task_evaluator (sorted_result, K_v, C_E, ascending)
Inputs:
sorted_result: a list containing sorted evaluation results
K_v: threshold value for evaluation
C_E: metric for evaluating performance
ascending: boolean flag indicating sorting order
Output:
satisfied: boolean indicating whether the optimization criterion is met
best_result: best evaluation result
1:  best_result ← sorted_result[0]
2:  best_score ← best_result[C_E]
3:  If K_v is None do
4:    return True, best_result
5:  Else if (ascending and best_score ≤ K_v) or (not ascending and best_score ≥ K_v) do
6:    return True, best_result
7:  Else do
8:    return False, best_result
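The direction-aware acceptance test at the heart of Algorithm 4 can be written compactly as the following sketch; the metric names follow the lists given above.

# Sketch of the governing-criterion check: loss-type metrics are minimized,
# accuracy-type metrics are maximized.
MINIMIZE = {"validation_loss", "rmse", "asymmetric_score"}
MAXIMIZE = {"accuracy", "precision", "recall", "f1_score"}

def target_met(metric: str, value: float, target) -> bool:
    if target is None:  # no target supplied: accept the best-ranked model
        return True
    return value <= target if metric in MINIMIZE else value >= target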

3.3.4. Answer Generator

The Answer generator is implemented using the GPT-4o-based LLM engine to deliver task execution results to the user. Like the Parameter extractor and Parameter evaluator, it is designed to accommodate multiple LLMs depending on the task requirements. It plays a key role in generating messages based on the CSV file. First, the Answer generator reads the CSV file produced by the Task executor, which contains the top-performing models ranked by evaluation flag. It then converts this data into a structured dataframe, extracting key details such as model names, evaluation scores, and storage paths. Using this information, the system identifies the best-performing model and generates a well-structured response for the user.
Since simply listing the raw CSV contents would reduce readability, the system must present results in a more intuitive form. To address this, the Answer generator employs prompt engineering techniques to ensure that the response is presented in an intuitive and interactive format. Instead of listing the CSV contents verbatim, it reformats the information into natural language, making it easier for users to interpret the results. The simplified workflow of the Answer generator is presented in Algorithm 5, while a more detailed version with implementation specifics is available in Table S11 of the Supplementary Materials.
Algorithm 5 Answer generator
Procedure generate_answer ()
Inputs:
None
Output:
M: response_message
1:  result_files ← list_files (get_latest_directory (“./result”))
2:  best_result ← first_row (read_csv (result_files[0]))
3:  column_names ← [“model”, col(3), “path”]
4:  result_values ← [best_result[1], round (best_result[2], 4), best_result[3]]
5:  result_summary ← create_dict (column_names, result_values)
6:  request_query ← “Summarize: ” + str (result_summary) + “ in a clear format.”
7:  Return LLM_generate (“Generate summary”, request_query)

4. Results

4.1. Evaluation Setup

As mentioned in Section 3.1, to account for diverse query structures and potential ambiguities, we define eight use cases. These use cases enable a systematic assessment of the autonomous agent’s ability to process various query types and ensure the robustness of the Parameter extractor and Parameter evaluator in handling diverse input scenarios. The predefined use case types are as follows:
  • Case 1. Vague query: Lacks optional details but can run with default values or require additional user input.
  • Case 2. Well-defined query: Contains both necessary and optional parameters, requiring no further input.
  • Case 3. Query with missing necessary parameters: Lacks essential parameters, requiring user input for validation.
  • Case 4. Query with unsupported parameters: Contains undefined parameters, making execution impossible.
  • Case 5. Lexical drift query: Includes errors (e.g., typos, grammatical mistakes) that hinder the Parameter extractor.
  • Case 6. Query optimizable for result: Can meet evaluation criteria by increasing optimization cycles.
  • Case 7. Query unoptimizable for result: Cannot meet evaluation criteria even with more optimization cycles.
  • Case 8. Query causing a dataset path issue: Specifies an incorrect dataset path, requiring user correction.

4.2. Performance Evaluation for Agent

Further illustrating these use cases, Table S13 of the Supplementary Materials presents a comprehensive list of 100 queries categorized by use case type, providing an overview of the various query formulations utilized in this study. The experimental results obtained from processing these 100 queries are detailed below, where we evaluate the autonomous agent’s performance across different query types. The agent’s ability to accurately extract and validate parameters, efficiently execute model generation tasks, and process queries across different LLM engines is assessed.
In addition, to validate the effectiveness of the proposed agent, we also conducted a baseline comparison between the agent-generated models and expert-designed models reported in prior studies using the same datasets. As summarized in Table 2, the models produced by the agent achieved comparable predictive performance to literature-reported baselines, with only marginal differences in accuracy and asymmetric score. These results confirm that the agent can autonomously generate models that are not only valid and reliable but also competitive with manually engineered approaches, thereby demonstrating its practical utility.
Figure 6 provides a detailed visualization of the agent’s internal workflow, specifically for use case 3, illustrating the interactions between its modules, including the flow of inputs and outputs at each stage. Concretely, consider a user query that omits the dataset path: “Create predictive systems with CNNLSTM and BILSTM for useful life estimation; RMSE must not exceed 30.” The parameter extractor parses the request into structured fields (target category, model list, evaluation flag, target value), and the evaluator detects the missing path and prompts the user. After the path is supplied, the system re-parses and validates completeness; the Manager then orchestrates execution, and the Tool instantiates the CNN-LSTM and Bi-LSTM backbones, trains them with early stopping, learning-rate scheduling, and checkpointing, and logs validation metrics.
The Task evaluator applies the governing metric (RMSE), compares it with the user-specified target, selects the best configuration, and reports the chosen model, test metrics, and artifact locations. Figure 6 is annotated to reflect these stages, from user query through parameter extraction and validation to orchestration, training, and model selection, so that the flow of inputs and outputs is explicit. Table 3 defines the eight use cases for the experiment. Table 4 illustrates Use case 1 as a representative example; results for the remaining user queries are provided in Table S14 of the Supplementary Materials, and complete results are provided in Tables S14–S21.
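To make this exchange concrete, the extract-and-validate loop can be sketched in a few lines of Python. The parameter keys mirror the structured fields reported in Table 4; everything else (the hard-coded extraction result and the input() prompt standing in for the clarification dialogue) is an illustrative assumption, not the agent's actual implementation.

# Hypothetical sketch of the Parameter extractor / Parameter evaluator loop
# for the Use case 3 query above. Keys follow Table 4; input() stands in for
# the agent's LLM-mediated clarification dialogue.
NECESSARY_KEYS = ("target_category", "dataset_base_dir")

def missing_necessary(params: dict) -> list:
    """Parameter evaluator: return the necessary fields that are still unset."""
    return [key for key in NECESSARY_KEYS if params.get(key) is None]

# Structured output of the Parameter extractor (dataset path omitted by the user):
params = {
    "target_category": "prediction",
    "dataset_base_dir": None,
    "evaluation_flag": "RMSE",
    "model_list": ["CNNLSTM", "BILSTM"],
    "target_evaluation_value": 30,
}

while missing_necessary(params):
    for key in missing_necessary(params):
        # The real agent re-parses the user's reply with the LLM; here we assign directly.
        params[key] = input(f"Please provide a value for '{key}': ")
print("Validated parameters:", params)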
For deployment cost considerations, the annual operating cost can be approximated as the per-query API cost × agent runs per year × LLM calls per run (which may exceed one). Using the per-query costs measured in this study (Supplementary Table S2), 100 runs per day, about 36,500 per year, with one LLM call per run corresponds to approximately $165.71 for GPT-4o, $7.74 for Gemini-2.0-Flash, and $406.50 for GPT-4, while self-hosted models such as LLaMA-3.2 incur no API fee but may require on-premises infrastructure costs. Actual deployments may vary with prompt length and the number of calls per run.
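As a worked check of this arithmetic, the snippet below recomputes the annual figures from the per-query costs in Table 1 under the stated usage profile; small deviations from the dollar amounts quoted above reflect rounding of the underlying per-query costs.

# Annual cost ≈ per-query API cost × runs per year × LLM calls per run.
# Per-query costs are taken from Table 1; the usage profile follows the text.
PER_QUERY_COST = {"GPT-4o": 0.00455, "Gemini-2.0-Flash": 0.00021, "GPT-4": 0.01114}
RUNS_PER_YEAR = 100 * 365  # 100 runs per day, about 36,500 per year
CALLS_PER_RUN = 1

for engine, cost in PER_QUERY_COST.items():
    annual = cost * RUNS_PER_YEAR * CALLS_PER_RUN
    print(f"{engine}: ${annual:,.2f} per year")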

5. Discussion

The proposed autonomous agent enables non-specialists to automate PHM model generation for fault diagnosis and RUL prediction through natural language interaction, while providing options and guidance for handling ambiguous or incomplete queries, thereby integrating LLM-based query processing, parameter validation, and task execution into a seamless workflow. Through systematic evaluation across 100 queries spanning the eight use cases, the system demonstrated its ability to extract relevant parameters, validate user inputs, and execute model development tasks with minimal manual intervention. A key result was the comparative analysis of multiple LLM engines, including GPT-3.5-turbo, GPT-4, GPT-4o, GPT-4o-mini, Gemini, and LLaMA, which revealed their respective trade-offs in accuracy, computational cost, and response time. GPT-4o, with the highest accuracy (0.96) and a balanced computational cost ($0.00455), was selected as the optimal model for this study.
Additionally, the Parameter extractor processes natural language queries, extracting necessary and optional parameters, while the Parameter evaluator verifies completeness and correctness. The Task executor structures the job flow and activates necessary tools for data loading, preprocessing, model selection, training, evaluation, and optimization, ensuring an efficient execution process. The system also employs an iterative optimization mechanism, where models are retrained and hyperparameters are adjusted until the best-performing configuration is achieved. Finally, the Answer generator compiles results into a structured output, enhancing user interaction through LLM-based natural language responses. The effectiveness of this automation was confirmed through experimental evaluations, demonstrating significant reductions in manual intervention, processing time, and complexity in AI model configuration.
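The control flow of this iterative mechanism reduces to a retrain-score-accept loop. In the minimal sketch below, propose_hyperparameters and train_and_score are mock placeholders for the Model tuner and the Model trainer & evaluator; only the loop structure reflects the agent's behavior.

# Illustrative sketch of the Task executor's optimization cycle. The two helper
# functions are mocks; the real tools train models and report the governing metric.
import random

def propose_hyperparameters() -> dict:
    return {"lr": random.choice([1e-2, 1e-3, 1e-4]),
            "batch_size": random.choice([32, 64, 128])}

def train_and_score(config: dict) -> float:
    return random.uniform(300, 500)  # mock validation loss (lower is better)

def optimize(target: float, max_cycles: int = 5):
    best_config, best_score = None, float("inf")
    for cycle in range(1, max_cycles + 1):
        config = propose_hyperparameters()
        score = train_and_score(config)
        if score < best_score:
            best_config, best_score = config, score
        if best_score <= target:  # Task evaluator: target met, accept and stop
            break
    return best_config, best_score, cycle

print(optimize(target=400.0))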
Despite these advantages, several challenges remain in LLM-based parameter extraction and dataset adaptability. The reliance on LLMs for query processing presents trade-offs between accuracy, cost, and response time, with incorrect extractions potentially requiring multiple validation cycles, increasing computational overhead. Selecting the most cost-efficient and task-optimized LLM engine for different query processing stages remains a key factor in improving system performance. Another challenge concerns dataset adaptability within the Task executor. This study utilized predefined open datasets for each task. While these datasets were effective for evaluating the agent’s performance, real-world applications require greater flexibility to accommodate datasets with varying formats, feature distributions, and preprocessing requirements. It is difficult to standardize data ingestion and transformation across different user-provided datasets.
To address these limitations, future research should focus on adaptive LLM selection strategies and enhanced dataset handling mechanisms. Instead of relying on a single LLM engine for all processes, a hybrid LLM approach can be employed, where lightweight models (e.g., GPT-4o-mini, Gemini-2.0-flash) handle early-stage parameter extraction, while higher-accuracy models (e.g., GPT-4o) are reserved for final validation and answer generation. This dynamic model allocation would optimize both processing time and cost, ensuring an efficient and scalable system. Furthermore, a modular and adaptive data ingestion pipeline should be developed to enhance dataset handling. This includes automated dataset structure detection, which allows the system to identify key features, missing values, and formatting inconsistencies. Additionally, a dynamic preprocessing framework should be implemented to automatically adjust feature engineering steps based on dataset characteristics, minimizing manual intervention. By integrating these mechanisms, the system can handle a broader range of datasets more effectively, improving usability, adaptability, and automation across diverse industrial applications.
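One way to realize this dynamic allocation is a stage-to-engine routing table consulted before each LLM call. The mapping below is a plausible configuration consistent with this discussion, not a measured optimum; either Gemini-2.0-Flash or GPT-4o-mini could serve as the lightweight engine.

# Hypothetical stage-to-engine routing for the hybrid LLM strategy.
STAGE_ENGINE = {
    "parameter_extraction": "gpt-4o-mini",  # cheap, fast first pass
    "parameter_validation": "gpt-4o",       # higher accuracy where errors are costly
    "answer_generation": "gpt-4o",
}

def route(stage: str) -> str:
    """Return the LLM engine assigned to a pipeline stage."""
    return STAGE_ENGINE.get(stage, "gpt-4o")  # conservative default

print(route("parameter_extraction"))  # -> gpt-4o-mini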
Overall, this study demonstrates that LLM-powered automation can significantly improve AI-driven predictive modeling by reducing dependency on manual query structuring and parameter selection; in doing so, the system bridges the gap between domain expertise and predictive modeling, enhancing accessibility for non-experts. Addressing the remaining challenges through hybrid LLM strategies and adaptive dataset processing will further enhance the autonomous agent’s flexibility and real-world applicability, making it a more robust and efficient tool for fault diagnosis, RUL prediction, and other predictive maintenance tasks.

6. Conclusions

In this study, we proposed an autonomous agent that leverages LLMs to automate the process of fault diagnosis and RUL prediction in industrial PHM applications. By integrating natural language processing capabilities with domain-specific tools, the agent efficiently extracts key parameters from user queries, validates inputs, and executes the model development pipeline. The experimental results demonstrated the agent’s ability to minimize manual intervention while improving accessibility for non-experts, thereby enhancing scalability and usability in industrial maintenance operations.
From a PHM perspective, the LLM functions as an orchestration layer that translates natural language maintenance intents into an end-to-end workflow covering data preparation, model configuration, training, evaluation, and model selection while upholding standard validation discipline. Instead of requiring expert scripting, the agent automatically chooses task-appropriate modeling patterns for fault diagnosis and RUL prediction, applies consistent objective and metric regimes, and returns artifacts that PHM stakeholders can act on. This abstraction reduces the expertise burden, improves reproducibility through logged configurations and checkpoints, and tightens the link between analytics outputs and maintenance planning.
Through systematic evaluations, we compared the performance of various LLM engines, including GPT-4o, GPT-4, Gemini, and LLaMA, identifying GPT-4o as the most suitable model due to its high accuracy and balanced computational efficiency. Furthermore, the autonomous agent successfully processed diverse query types, ensuring robustness in parameter extraction, validation, and task execution. The iterative optimization mechanism implemented in the Task executor enabled efficient hyperparameter tuning, ensuring optimal model performance for both fault diagnosis and RUL prediction.
Despite its promising performance, certain challenges remain, particularly in dataset adaptability and computational efficiency. The reliance on LLMs for query processing introduces trade-offs between accuracy, processing time, and cost, while the handling of diverse datasets requires further improvements in adaptive preprocessing mechanisms. Future research should explore hybrid LLM strategies that dynamically allocate different models for various processing stages to enhance cost-effectiveness and accuracy. Additionally, the development of a more flexible data ingestion and preprocessing framework will be crucial for extending the applicability of the autonomous agent across diverse industrial datasets.
Overall, this study highlights the potential of LLM-based autonomous agents for PHM. By bridging the gap between domain expertise and predictive modeling, our approach enables more efficient, scalable, and user-friendly solutions for fault diagnosis and RUL prediction. Continued advances in adaptive LLM selection and data handling strategies will enhance the robustness and practicality of the proposed system, paving the way for broader adoption in industrial PHM applications.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/machines13090831/s1, Table S1: Response time comparison of LLM engines; Table S2: Cost comparison of LLM engines; Table S3: Accuracy comparison of LLM engines; Table S4: Descriptions of eight use cases; Table S5: Queries for each of the eight use cases; Table S6: Task process and outcomes for use case 1 and query; Table S7: Task process and outcomes for use case 2 and query; Table S8: Task process and outcomes for use case 3 and query; Table S9: Task process and outcomes for use case 4 and query; Table S10: Task process and outcomes for use case 5 and query; Table S11: Task process and outcomes for use case 6 and query; Table S12: Task process and outcomes for use case 7 and query; Table S13: Task process and outcomes for use case 8 and query; Algorithm S1: Detailed pseudocode of Parameter extractor; Algorithm S2: Detailed pseudocode of Parameter evaluator; Algorithm S3: Detailed pseudocode of Model optimizer; Algorithm S4: Detailed pseudocode of Model tuner; Algorithm S5: Detailed pseudocode of Model trainer & evaluator; Algorithm S6: Detailed pseudocode of Data loader; Algorithm S7: Detailed pseudocode of Data preprocessor; Algorithm S8: Detailed pseudocode of Answer generator.

Author Contributions

Conceptualization, M.C., S.-i.Y. and J.-Y.K.; methodology, J.-Y.K.; software, M.C., S.-i.Y. and J.-Y.K.; validation, M.C., S.-i.Y. and J.-Y.K.; formal analysis, M.C., S.-i.Y. and J.-Y.K.; investigation, M.C., S.-i.Y., S.K., D.K., K.N., T.L. and J.-Y.K.; resources, S.K., D.K., K.N., T.L. and J.-Y.K.; data curation, M.C., S.-i.Y. and J.-Y.K.; writing—original draft preparation, M.C., S.-i.Y. and J.-Y.K.; writing—review and editing, M.C., S.-i.Y. and J.-Y.K.; supervision, J.-Y.K.; project administration, J.-Y.K.; funding acquisition, J.-Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Technology Innovation Program (RS-2024-00444913) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

Data Availability Statement

The FordA dataset used in this study is publicly available from the UCR Time Series Classification Archive, maintained by the University of California, Riverside and the University of East Anglia, and can be accessed at: https://www.timeseriesclassification.com/description.php?Dataset=FordA (accessed on 4 September 2025). The CWRU bearing dataset was obtained from the Case School of Engineering Bearing Data Center, which provides open-access vibration data collected under various bearing fault conditions, and is available at: https://engineering.case.edu/bearingdatacenter/download-data-file (accessed on 4 September 2025). The C-MAPSS dataset, used for remaining useful life (RUL) prediction of turbofan engines, is provided by the NASA Prognostics Center of Excellence and can be downloaded from: https://data.nasa.gov/dataset/cmapss-jet-engine-simulated-data (accessed on 4 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNN    Convolutional Neural Network
CSV    Comma-Separated Values
LLM    Large Language Model
LSTM   Long Short-Term Memory
MLP    Multi-Layer Perceptron
PHM    Prognostics and Health Management
RMSE   Root Mean Squared Error
RUL    Remaining Useful Life
TCN    Temporal Convolutional Network

References

  1. Zhang, Y.; Fang, L.; Qi, Z.; Deng, H. A Review of Remaining Useful Life Prediction Approaches for Mechanical Equipment. IEEE Sens. J. 2023, 23, 29991–30006. [Google Scholar] [CrossRef]
  2. Bhandare, R.V.; Mogal, S.P.; Phalle, V.M.; Kushare, P.B. Fault Diagnosis and Prediction of Remaining Useful Life (RUL) of Rolling Element Bearing: A Review State of Art. Tribol.-Finn. J. Tribol. 2024, 41, 28–42. [Google Scholar] [CrossRef]
  3. Gawde, S.; Patil, S.; Kumar, S.; Kamat, P.; Kotecha, K.; Abraham, A. Multi-Fault Diagnosis of Industrial Rotating Machines Using Data-Driven Approach: A Review of Two Decades of Research. Eng. Appl. Artif. Intell. 2023, 123, 106139. [Google Scholar] [CrossRef]
  4. Zhang, Y.; Wang, Z.; Yuan, Y. Filter-Based Fault Diagnosis and Remaining Useful Life Prediction, 1st ed.; CRC Press: Boca Raton, FL, USA, 2023; ISBN 9781000835946. [Google Scholar]
  5. Qiu, S.; Cui, X.; Ping, Z.; Shan, N.; Li, Z.; Bao, X.; Xu, X. Deep Learning Techniques in Intelligent Fault Diagnosis and Prognosis for Industrial Systems: A Review. Sensors 2023, 23, 1305. [Google Scholar] [CrossRef]
  6. Orf, S.; Ochs, S.; Doll, J.; Schotschneider, A.; Heinrich, M.; Zofka, M.R.; Zöllner, J.M. Modular Fault Diagnosis Framework for Complex Autonomous Driving Systems. In Proceedings of the 2024 IEEE 20th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 17–19 October 2024. [Google Scholar]
  7. Zhao, D.; Sharma, K.; Yin, H.; Qi, Y.; Zhang, S. SRTFD: Scalable Real-Time Fault Diagnosis through Online Continual Learning. arXiv 2024, arXiv:2408.05681. [Google Scholar] [CrossRef]
  8. Li, X.; Wang, L.; Wang, C.; Ma, X.; Miao, B.; Xu, D.; Cheng, R. A Method for Predicting Remaining Useful Life Using Enhanced Savitzky–Golay Filter and Improved Deep Learning Framework. Sci. Rep. 2024, 14, 23983. [Google Scholar] [CrossRef]
  9. Ji, D.; Wang, C.; Li, J.; Dong, H. A Review: Data Driven-Based Fault Diagnosis and RUL Prediction of Petroleum Machinery and Equipment. Syst. Sci. Control Eng. 2021, 9, 724–747. [Google Scholar] [CrossRef]
  10. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  11. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  12. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  13. Urlana, A.; Kumar, C.V.; Singh, A.K.; Garlapati, B.M.; Chalamala, S.R.; Mishra, R. LLMs with Industrial Lens: Deciphering the Challenges and Prospects—A Survey. arXiv 2024, arXiv:2402.14558. [Google Scholar]
  14. Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A Survey on Large Language Model Based Autonomous Agents. Front. Comput. Sci. 2024, 18, 186345. [Google Scholar] [CrossRef]
  15. Yuan, X.; Wang, J.; Zhao, H.; Yan, T.; Qi, F. Empowering LLMs with Toolkits: An Open-Source Intelligence Acquisition Method. Future Internet 2024, 16, 461. [Google Scholar] [CrossRef]
  16. Wang, Z.; Liu, Z.; Zhang, Y.; Zhong, A.; Fan, L.; Wu, L.; Wen, Q. RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024. [Google Scholar] [CrossRef]
  17. Bhattacharya, D.; Cassady, H.J.; Hickner, M.A.; Reinhart, W.F. Large Language Models as Molecular Design Engines. J. Chem. Inf. Model. 2024, 64, 7086–7096. [Google Scholar] [CrossRef]
  18. Boiko, D.A.; MacKnight, R.; Kline, B.; Gomes, G. Autonomous Chemical Research with Large Language Models. Nature 2023, 624, 570–578. [Google Scholar] [CrossRef]
  19. Cofre-Martel, S.; Droguett, E.L.; Modarres, M. Big Machinery Data Preprocessing Methodology for Data-Driven Models in Prognostics and Health Management. Sensors 2021, 21, 6841. [Google Scholar] [CrossRef]
  20. Chang, J.; Liu, C.; Huang, J.; Mao, R.; Qin, J. LLaPipe: LLM-Guided Reinforcement Learning for Automated Data Preparation Pipeline Construction. arXiv 2025, arXiv:2507.13712. [Google Scholar]
  21. Li, X.; Ding, Q.; Sun, J.Q. Remaining Useful Life Estimation in Prognostics Using Deep Convolution Neural Networks. Reliab. Eng. Syst. Saf. 2018, 172, 1–11. [Google Scholar] [CrossRef]
  22. Wang, J.; Wen, G.; Yang, S.; Liu, Y. Remaining Useful Life Estimation in Prognostics Using Deep Bidirectional LSTM Neural Network. In Proceedings of the 2018 Prognostics and System Health Management Conference, PHM-Chongqing, Chongqing, China, 26–28 October 2018; pp. 1037–1042. [Google Scholar] [CrossRef]
  23. Peng, C.; Chen, Y.; Chen, Q.; Tang, Z.; Li, L.; Gui, W. A Remaining Useful Life Prognosis of Turbofan Engine Using Temporal and Spatial Feature Fusion. Sensors 2021, 21, 418. [Google Scholar] [CrossRef]
  24. Peringal, A.; Mohiuddin, M.B.; Haddad, A.G.; Muthusamy, P.K. Reliable Prediction of Remaining Useful Life for Aircraft Engines: An Lstm-Based Approach with Conservative Loss Function. In Proceedings of the AIAA SCITECH 2025 Forum, Orlando, FL, USA, 6–10 January 2025; p. 1909. [Google Scholar]
  25. Wang, X.; Li, Y.; Xu, Y.; Liu, X.; Zheng, T.; Zheng, B. Remaining Useful Life Prediction for Aero-Engines Using a Time-Enhanced Multi-Head Self-Attention Model. Aerospace 2023, 10, 80. [Google Scholar] [CrossRef]
  26. Khorram, A.; Khalooei, M.; Rezghi, M. End-to-End CNN + LSTM Deep Learning Approach for Bearing Fault Diagnosis. Appl. Intell. 2021, 51, 736–751. [Google Scholar] [CrossRef]
  27. Albarbar, A.; Gurski, V.; Korendiy, V.; Saghi, T.; Bustan, D.; Aphale, S.S. Bearing Fault Diagnosis Based on Multi-Scale CNN and Bidirectional GRU. Vibration 2022, 6, 11–28. [Google Scholar] [CrossRef]
  28. Philip, J.; Muthukumar, G. CNN-LSTM Hybrid Deep Learning Model for Remaining Useful Life Estimation. arXiv 2024, arXiv:2412.15998. [Google Scholar] [CrossRef]
  29. Fu, G.; Wei, Q.; Yang, Y. Bearing Fault Diagnosis with Parallel CNN and LSTM. Math. Biosci. Eng. 2024, 21, 2385–2406. [Google Scholar] [CrossRef]
  30. Sun, H.; Zhao, S. Fault Diagnosis for Bearing Based on 1DCNN and LSTM. Shock. Vib. 2021, 2021, 1221462. [Google Scholar] [CrossRef]
  31. Zio, E. Prognostics and Health Management (PHM): Where Are We and Where Do We (Need to) Go in Theory and Practice. Reliab. Eng. Syst. Saf. 2022, 218, 108119. [Google Scholar] [CrossRef]
  32. Zhang, L.; Lin, J.; Liu, B.; Zhang, Z.; Yan, X.; Wei, M. A Review on Deep Learning Applications in Prognostics and Health Management. IEEE Access 2019, 7, 162415–162438. [Google Scholar] [CrossRef]
  33. Zhu, Z.; Lei, Y.; Qi, G.; Chai, Y.; Mazur, N.; An, Y.; Huang, X. A Review of the Application of Deep Learning in Intelligent Fault Diagnosis of Rotating Machinery. Measurement 2023, 206, 112346. [Google Scholar] [CrossRef]
  34. Fischer, L.; Ehrlinger, L.; Geist, V.; Ramler, R.; Sobiezky, F.; Zellinger, W.; Brunner, D.; Kumar, M.; Moser, B. AI System Engineering—Key Challenges and Lessons Learned. Mach. Learn. Knowl. Extr. 2021, 3, 56–83. [Google Scholar] [CrossRef]
  35. Schmitt, M. Automated Machine Learning: AI-Driven Decision Making in Business Analytics. Intell. Syst. Appl. 2023, 18, 200188. [Google Scholar] [CrossRef]
  36. Schmitt, M. Deep Learning in Business Analytics: A Clash of Expectations and Reality. Int. J. Inf. Manag. Data Insights 2023, 3, 100146. [Google Scholar] [CrossRef]
  37. Clayton, P.R.; Clopton, J. Business Curriculum Redesign: Integrating Data Analytics. J. Educ. Bus. 2019, 94, 57–63. [Google Scholar] [CrossRef]
  38. Kar, S.; Kar, A.K.; Gupta, M.P. Modeling Drivers and Barriers of Artificial Intelligence Adoption: Insights from a Strategic Management Perspective. Intell. Syst. Account. Financ. Manag. 2021, 28, 217–238. [Google Scholar] [CrossRef]
  39. Grover, V.; Chiang, R.H.L.; Liang, T.P.; Zhang, D. Creating Strategic Business Value from Big Data Analytics: A Research Framework. J. Manag. Inf. Syst. 2018, 35, 388–423. [Google Scholar] [CrossRef]
  40. Nguyen, G.; Dlugolinsky, S.; Bobák, M.; Tran, V.; López García, Á.; Heredia, I.; Malík, P.; Hluchý, L. Machine Learning and Deep Learning Frameworks and Libraries for Large-Scale Data Mining: A Survey. Artif. Intell. Rev. 2019, 52, 77–124. [Google Scholar] [CrossRef]
  41. Sarker, I.H. AI-Based Modeling: Techniques, Applications and Research Issues Towards Automation, Intelligent and Smart Systems. SN Comput. Sci. 2022, 3, 158. [Google Scholar] [CrossRef]
  42. Sayyadi, M.; Collina, L. How to Adapt to AI in Strategic Management. Calif. Manage Rev. 2023, 67. Available online: https://cmr.berkeley.edu/2023/06/how-to-adapt-to-ai-in-strategic-management/ (accessed on 4 September 2025).
  43. Qin, L.; Chen, Q.; Feng, X.; Wu, Y.; Zhang, Y.; Li, Y.; Li, M.; Che, W.; Yu, P.S. Large Language Models Meet Nlp: A Survey. arXiv 2024, arXiv:2405.12819. [Google Scholar] [CrossRef]
  44. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  45. Brown, P.F.; Della Pietra, V.J.; deSouza, P.V.; Lai, J.C.; Mercer, R.L. Class-Based n-Gram Models of Natural Language. Comput. Linguist. 1992, 18, 467–480. [Google Scholar]
  46. Chen, S.F.; Goodman, J. An Empirical Study of Smoothing Techniques for Language Modeling. Comput. Speech Lang. 1999, 13, 359–394. [Google Scholar] [CrossRef]
  47. Jelinek, F. Statistical Methods for Speech Recognition; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
  48. Katz, S.M. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Trans. Acoust. Speech Signal Process. 1987, 35, 400–401. [Google Scholar] [CrossRef]
  49. Good, I.J. The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika 1953, 40, 237–264. [Google Scholar] [CrossRef]
  50. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the 1st International Conference on Learning Representations, ICLR 2013—Workshop Track Proceedings, Scottsdale, AZ, USA, 2–4 May 2013. [Google Scholar]
  51. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Moschitti, A., Pang, B., Daelemans, W., Eds.; Association for Computational Linguistics: Doha, Qatar, 2014; pp. 1532–1543. [Google Scholar]
  52. Saleh, M.; Paquelet, S. Anatomy of Neural Language Models. arXiv 2024, arXiv:2401.03797. [Google Scholar] [CrossRef]
  53. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
  54. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  55. Kang, Y.; Kim, J. ChatMOF: An Artificial Intelligence System for Predicting and Generating Metal-Organic Frameworks Using Large Language Models. Nat. Commun. 2024, 15, 4705. [Google Scholar] [CrossRef] [PubMed]
  56. Zhu, J.-P.; Cai, P.; Xu, K.; Li, L.; Sun, Y.; Zhou, S.; Su, H.; Tang, L.; Liu, Q. AutoTQA: Towards Autonomous Tabular Question Answering through Multi-Agent Large Language Models. Proc. VLDB Endow. 2024, 17, 3920–3933. [Google Scholar] [CrossRef]
  57. Wu, H.; He, Z.; Zhang, X.; Yao, X.; Zheng, S.; Zheng, H.; Yu, B. Chateda: A Large Language Model Powered Autonomous Agent for Eda. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 3184–3197. [Google Scholar] [CrossRef]
  58. Kinniment, M.; Sato, L.J.K.; Du, H.; Goodrich, B.; Hasin, M.; Chan, L.; Miles, L.H.; Lin, T.R.; Wijk, H.; Burget, J.; et al. Evaluating Language-Model Agents on Realistic Autonomous Tasks. arXiv 2023, arXiv:2312.11671. [Google Scholar]
  59. Wei, Q.; Sun, H.; Xu, Y.; Pang, Z.; Gao, F. Exploring the Application of Large Language Models Based AI Agents in Leakage Detection of Natural Gas Valve Chambers. Energies 2024, 17, 5633. [Google Scholar] [CrossRef]
  60. Jiang, K.; Cai, X.; Cui, Z.; Li, A.; Ren, Y.; Yu, H.; Yang, H.; Fu, D.; Wen, L.; Cai, P. KoMA: Knowledge-Driven Multi-Agent Framework for Autonomous Driving with Large Language Models. IEEE Trans. Intell. Veh. 2024, 1–15. [Google Scholar] [CrossRef]
  61. Wu, T.; Li, J.; Bao, J.; Liu, Q. ProcessCarbonAgent: A Large Language Models-Empowered Autonomous Agent for Decision-Making in Manufacturing Carbon Emission Management. J. Manuf. Syst. 2024, 76, 429–442. [Google Scholar] [CrossRef]
  62. Jin, A.; Ye, Y.; Lee, B.; Qiao, Y. DeCoAgent: Large Language Model Empowered Decentralized Autonomous Collaboration Agents Based on Smart Contracts. IEEE Access 2024, 12, 155234–155245. [Google Scholar] [CrossRef]
  63. Hu, Z.; Iscen, A.; Sun, C.; Chang, K.-W.; Sun, Y.; Ross, D.; Schmid, C.; Fathi, A. AVIS: Autonomous Visual Information Seeking with Large Language Model Agent. In Advances in Neural Information Processing Systems; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 867–878. [Google Scholar]
  64. Xia, Y.; Shenoy, M.; Jazdi, N.; Weyrich, M. Towards Autonomous System: Flexible Modular Production System Enhanced with Large Language Model Agents. In Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation, ETFA 2023, Sinaia, Romania, 12–15 September 2023. [Google Scholar] [CrossRef]
  65. Vaškevičius, M.; Kapočiūtė-Dzikienė, J. Language Models for Predicting Organic Synthesis Procedures. Appl. Sci. 2024, 14, 11526. [Google Scholar] [CrossRef]
  66. Cinkusz, K.; Chudziak, J.A.; Niewiadomska-Szynkiewicz, E. Cognitive Agents Powered by Large Language Models for Agile Software Project Management. Electronics 2024, 14, 87. [Google Scholar] [CrossRef]
  67. Yue, C.-X.; Li, Y.-Y.; Wang, M.-S.-Y.; Zhang, X.-M. Searching for the Light Leptophilic Gauge Boson Z_x via Four-Lepton Final States at the CEPC. Chin. Phys. C 2024, 48, 43103. [Google Scholar] [CrossRef]
  68. Bianchi, T. Number of Monthly ChatGPT and Gemini AI Mobile App Downloads in the United States from May 2023 to September 2024. Available online: https://www.statista.com/statistics/1497377/global-chatgpt-vs-gemini-app-downloads/ (accessed on 4 September 2025).
  69. Gewirtz, D. The Most Popular AI Tools of 2024. Available online: https://www.zdnet.com/article/the-most-popular-ai-tools-of-2024-and-what-that-even-means/ (accessed on 27 February 2025).
  70. Dau, H.A.; Bagnall, A.; Kamgar, K.; Yeh, C.-C.M.; Zhu, Y.; Gharghabi, S.; Ratanamahatana, C.A.; Keogh, E. The UCR Time Series Archive. IEEE/CAA J. Autom. Sin. 2019, 6, 1293–1305. [Google Scholar] [CrossRef]
  71. Case Western Reserve University Bearing Data Center [Dataset]. Available online: https://engineering.case.edu/bearingdatacenter/download-data-file (accessed on 4 September 2025).
  72. Saxena, A.; Goebel, K.; Simon, D.; Eklund, N. Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation [Dataset]. In Proceedings of the 2008 International Conference on Prognostics and Health Management, Denver, CO, USA, 6–9 October 2008; pp. 1–9. [Google Scholar]
  73. Zhou, Z.; Qiu, C.; Zhang, Y. A Comparative Analysis of Linear Regression, Neural Networks and Random Forest Regression for Predicting Air Ozone Employing Soft Sensor Models. Sci. Rep. 2023, 13, 22420. [Google Scholar] [CrossRef]
  74. Kim, J.H. Multicollinearity and Misleading Statistical Results. Korean J. Anesthesiol. 2019, 72, 558–569. [Google Scholar] [CrossRef]
  75. Jean-Pierre, N.; Birmelé, E.; François, R.E.Y. LSTM and Transformers Based Methods for Remaining Useful Life Prediction Considering Censored Data. In Proceedings of the PHM Society European Conference, Prague, Czech Republic, 3–5 July 2024; Volume 8, p. 10. [Google Scholar]
  76. Adel, A.; Hand, O.; Fawzi, G.; Walid, T.; Chemseddine, R.; Djamel, B. Gear Fault Detection, Identification and Classification Using MLP Neural Network. In Recent Advances in Structural Health Monitoring and Engineering Structures: Select Proceedings of SHM and ES 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 221–234. [Google Scholar]
  77. Tunio, N.A.; Hashmani, A.A.; Khokhar, S.; Tunio, M.A.; Faheem, M. Fault Detection and Classification in Overhead Transmission Lines through Comprehensive Feature Extraction Using Temporal Convolution Neural Network. Eng. Rep. 2024, 6, e12950. [Google Scholar] [CrossRef]
  78. Vu, M.T.; Hiraga, M.; Miura, N.; Masuda, A. Failure Mode Classification for Rolling Element Bearings Using Time-Domain Transformer-Based Encoder. Sensors 2024, 24, 3953. [Google Scholar] [CrossRef]
  79. Kingma, D.P. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  80. Wichard, J.D. Classification of Ford Motor Data. Comput. Sci. 2008. Available online: http://www.j-wichard.de/publications/FordPaper.pdf (accessed on 4 September 2025).
  81. Adeodato, P.J.L.; Arnaud, A.L.; Vasconcelos, G.C.; Cunha, R.C.L.V.; Gurgel, T.B.; Monteiro, D.S.M.P. The Role of Temporal Feature Extraction and Bagging of MLP Neural Networks for Solving the WCCI 2008 Ford Classification Challenge. In Proceedings of the 2009 International Joint Conference on Neural Networks, Atlanta, GA, USA, 14–19 June 2009; pp. 57–62. [Google Scholar]
  82. Schlegel, U.; Keim, D.A. A Deep Dive into Perturbations as Evaluation Technique for Time Series XAI. In Proceedings of the World Conference on Explainable Artificial Intelligence, Lisbon, Portugal, 26–28 July 2023; pp. 165–180. [Google Scholar]
  83. Aziz, S.; Khan, M.U.; Faraz, M.; Montes, G.A. Intelligent Bearing Faults Diagnosis Featuring Automated Relative Energy Based Empirical Mode Decomposition and Novel Cepstral Autoregressive Features. Measurement 2023, 216, 112871. [Google Scholar] [CrossRef]
  84. Raj, K.K.; Kumar, S.; Kumar, R.R.; Andriollo, M. Enhanced Fault Detection in Bearings Using Machine Learning and Raw Accelerometer Data: A Case Study Using the Case Western Reserve University Dataset. Information 2024, 15, 259. [Google Scholar] [CrossRef]
  85. Muneer, A.; Taib, S.M.; Fati, S.M.; Alhussian, H. Deep-Learning Based Prognosis Approach for Remaining Useful Life Prediction of Turbofan Engine. Symmetry 2021, 13, 1861. [Google Scholar] [CrossRef]
  86. Ensarioğlu, K.; İnkaya, T.; Emel, E. Remaining Useful Life Estimation of Turbofan Engines with Deep Learning Using Change-Point Detection Based Labeling and Feature Engineering. Appl. Sci. 2023, 13, 11893. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram showing the flow of this research.
Figure 2. Detailed pipeline of the autonomous agent: (1) Parameter extractor: parses user queries into structured parameters. (2) Parameter evaluator: validates and completes parameters; if missing or misaligned, requests clarification and updates the set. (3) Task executor: the Manager routes the job through Data preprocessor, Data loader, Model generator, and Model trainer; Model tuner/optimizer iterate with feedback while the Task evaluator checks the governing metric to decide re-optimization or acceptance. (4) Answer generator: composes the final summary to the user and exports a result CSV.
Figure 3. Workflow of the Parameter extractor.
Figure 4. Workflow of the Parameter evaluator.
Figure 5. Overview of Tool: (1) Data loader: constructs data pipelines and establishes train/validation/test partitions. (2) Data preprocessor: performs preprocessing including normalization, windowing, PCA, and clipping. (3) Model generator: instantiates the backbone architecture. (4) Model trainer & evaluator: trains the model on the train/validation datasets, then evaluates on the held-out test split. (5) Model tuner/optimizer: tunes hyperparameters, ranks candidates by the governing metric, and exports a CSV with the top 5 configurations.
Figure 6. Detailed pipeline of the autonomous agent.
Table 1. Average performance comparison results of commonly used LLM engines.

Engine              Time (s)    Cost ($)    Accuracy
GPT-3.5 Turbo       1.206       0.00053     0.68
GPT-4               2.809       0.01114     0.90
GPT-4o              1.900       0.00455     0.96
GPT-4o-mini         1.489       0.00225     0.91
Gemini-2.0-Flash    0.964       0.00021     0.94
LLaMA-3.2           1.952       0           0.53
Table 2. Comparison of agent-generated model performance with literature-reported baselines.

Task                           Dataset            Metric              Methods                   Value
Binary fault classification    FordA              Accuracy            Ours                      0.97
                                                                      Wichard [80]              0.95
                                                                      Adeodato et al. [81]      0.95
                                                                      Schlegel et al. [82]      0.89
Multi-fault classification     CWRU               Accuracy            Ours                      0.95
                                                                      Aziz et al. [83]          0.97
                                                                      Raj et al. [84]           0.99
RUL prediction                 C-MAPSS (FD001)    Asymmetric Score    Ours                      380.74
                                                                      Muneer et al. [85]        223.00
                                                                      Ensarioğlu et al. [86]    437.20
Table 3. Definition of use cases for evaluation.

Use Case Number    Use Case Name                              Explanation
1                  Vague query                                Lacks optional details but can run with default values or require additional user input.
2                  Well-defined query                         Contains both necessary and optional parameters, requiring no further input.
3                  Query with missing necessary parameters    Lacks essential parameters, requiring user input for validation.
4                  Query with unsupported parameters          Contains undefined parameters, making execution impossible.
5                  Lexical drift query                        Includes errors (e.g., typos, grammatical mistakes) that hinder the Parameter extractor.
6                  Query optimizable for result               Can meet evaluation criteria by increasing optimization cycles.
7                  Query unoptimizable for result             Cannot meet evaluation criteria even with more optimization cycles.
8                  Query causing dataset path issue           Specifies an incorrect dataset path, requiring user correction.
Table 4. Summary of autonomous agent processing across different use case types (Use case 1).

Process                          Value
User query input                 "Develop models to predict remaining operational life; data is available at './dataset'."
[Trial 1] Parameter extractor    {'target_category': 'prediction', 'dataset_base_dir': './dataset', 'evaluation_flag': None, 'model_list': None, 'target_evaluation_value': None}
[Trial 1] Parameter evaluator    False
User's additional query input    "I will choose cnn combined lstm."
[Trial 2] Parameter extractor    {'target_category': 'prediction', 'dataset_base_dir': './dataset', 'evaluation_flag': 'VALIDATION_LOSS', 'model_list': ['CNNLSTM'], 'target_evaluation_value': None}
[Trial 2] Parameter evaluator    True
Task evaluator                   True (Cycle 1)
Answer generator                 Model Training Result Summary:
                                 - Model Used: CNNLSTM_1_1
                                 - Validation Loss: 400.0396
                                 - Model Saved To: ./result
                                 Summary: The experiment utilized the CNNLSTM_1_1 model, achieving a validation loss of 400.0396. The trained model has been successfully saved at the specified location for future use and reference.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
