1. Introduction
In recent years, deep learning (DL) has significantly impacted various industrial sectors, achieving remarkable success in fields such as image processing and natural language (or audio) processing. Moreover, in relatively less-explored industrial domains like prognostics and health management (PHM), DL has demonstrated promising results. PHM refers to the process of monitoring, analyzing, and predicting the health and performance of equipment and machinery. PHM systems aim to detect potential failures before they occur by assessing the condition of assets through various sensors and data analysis techniques. This helps optimize maintenance schedules, reduce downtime, and extend equipment lifespan. The ultimate goal of PHM is to enhance the reliability and efficiency of industrial operations by predicting failures, improving safety, and minimizing costs associated with unplanned maintenance and repairs. Additionally, synthesizing explanations for machinery experts [
1] and aiding decision-making processes for domain experts [
2] are equally critical.
The economic significance of PHM has been increasingly recognized across industrial sectors. Recent studies highlight substantial financial benefits, showing that organizations implementing PHM technologies experience significant reductions in maintenance costs and operational downtime [
3].
1.1. Prognostics and Health Management
PHM systems represent an advanced integration of sensing technologies, data analytics, and machine learning. The diagram in
Figure 1 illustrates the workflow of a PHM system, specifically for predictive maintenance in an industrial setting.
The process begins with industrial machinery as the primary source of operational data, equipped with sensors that continuously monitor performance and health metrics. Raw sensor data—such as temperature, pressure, and vibration levels—are logged and recorded. A process flowchart outlines the machinery’s role within industrial operations, providing insights into interactions and potential failure points. All collected data, including sensor readings and process flow information, are stored in a centralized repository to ensure efficient aggregation and availability for analysis.
The three upper-level components represent industrial support systems. The ERP (enterprise resource planning) system manages core business processes, optimizing inventory, orders, and maintenance using stored data. Quality engineering applies quality management principles, analyzing data to ensure machinery meets standards and implementing improvements. The manufacturing execution system (MES) monitors and controls production in real time, adjusting processes to enhance efficiency and minimize downtime.
The AI core consists of the ML models applied to the stored data to predict machinery failures before they occur. These models analyze patterns and anomalies in the data to identify potential issues that might lead to equipment failure. The results from the ML models are visualized in a user-friendly dashboard, providing real-time insights into the health and performance of the machinery, highlighting areas that need attention and predicting future maintenance needs. The dashboard includes various metrics and KPIs (key performance indicators) that support decision making.
Data-driven methods have gained popularity for their ability to leverage vast amounts of data from modern sensors and systems (e.g., the Internet of Things). These methods process historical data to identify patterns and detect degradation trends. Various approaches have been explored, including neural networks [
4], deep learning [
5], XAI [
1], and ensemble methods based on decision trees [
6,
Support vector machines (SVMs) have also been applied to remaining useful life (RUL) prediction in studies such as [
8,
9]. To address uncertainty, Bayesian networks [
10] and fuzzy logic-based systems [
11] have been proposed, enhancing the robustness of RUL predictions.
1.2. Neural Architecture Search
The design of neural architectures is a critical factor in extracting task-relevant features, significantly impacting the performance of the resulting model. Over the course of neural network research, various high-level architectures have been introduced, including feedforward networks (FFNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformers, among others. Nevertheless, identifying the optimal hyperparameter configuration—such as the number and types of layers, activation functions, number of neurons, learning rate, and other related parameters—remains a challenging task that requires substantial expertise and iterative fine-tuning.
Neural architecture search (NAS) is an emerging approach in the automated design of neural networks, aiming to systematically identify the optimal network architecture for a specific task. This paradigm explores various network structures to determine which one delivers the best performance, optimizing both accuracy and resource efficiency.
Three key components define NAS. Firstly, at its core is the architecture search space, which encompasses all possible network configurations that can be evaluated for a given problem. Secondly, the evaluation process plays a critical role in determining which architecture best meets the requirements, such as prediction accuracy or computational efficiency. Lastly, to find the optimal architecture, NAS employs optimization methods that guide the search, using techniques such as evolutionary algorithms or reinforcement learning. In this way, NAS enables the automation and substantial improvement of neural network design, reducing the need for manual intervention in selecting parameters and structures.
In machine learning (ML) practices, NAS is typically structured with an outer loop that searches for optimal network hyperparameters and an inner loop that optimizes the model parameters (i.e., the network weights) using those hyperparameters. This work focuses on implementing early-stopping mechanisms within the inner loop to improve efficiency in NAS.
NAS typically requires training and evaluating a large number of candidate models—often in the hundreds—throughout the search process. This makes NAS inherently resource-intensive and time-consuming. For instance, in [
12], Zoph et al. introduce NASNet, which requires 2000 GPU days to perform a search using reinforcement learning, while AmoebaNet-A, presented by Real et al. in [
13], demands 3150 GPU days using evolutionary algorithms. Consequently, developing strategies to reduce the computational resources and time required for NAS is of paramount importance, enabling broader accessibility and greater efficiency in deploying NAS-based methodologies.
NAS faces several computational and methodological challenges. One key challenge is combinatorial complexity, as the architecture search space grows exponentially with the depth and width of the network. This makes finding the optimal architecture increasingly difficult as the design space expands. Another significant hurdle is computational cost, as evaluating multiple architectures requires substantial resources in both time and hardware. Additionally, transferability poses a challenge, as optimal architectures may vary across different domains, meaning that a network architecture effective for one task may not perform as well in another. Finally, interpretability remains a major issue, as understanding why one architecture outperforms another is often not straightforward, making it difficult to gain insights into the factors driving performance differences.
NAS for PHM
While the general principles of NAS are universal, their application in specific domains like prognostics and health management (PHM) requires important adaptations.
Interpretability constraints, reliability requirements, handling of limited or noisy data, and the need for explainability all come into play. The transition from generic NAS to NAS specialized for PHM involves addressing these unique domain-specific constraints. In PHM, ensuring that the system can explain its predictions and operate reliably under uncertain conditions is particularly important. Additionally, the limited availability of clean data in many PHM scenarios adds complexity to the design process, necessitating methods that perform well even with noisy or incomplete information [
14].
While NAS provides a framework for automating model design, accurately estimating the potential performance of a model with minimal computational resources remains a significant challenge. This issue is especially critical in domains like PHM, where both computational efficiency and predictive accuracy are crucial. Our approach aims to develop a learning-curve estimation technique that provides early, reliable insights into the potential performance of a model, thereby minimizing the computational overhead typically associated with traditional NAS methods [
15].
1.3. On Estimating Learning Curves from Initial Data
In the study of neural networks (and ML in general), analyzing the mathematical properties of learning curves is of significant importance. These curves, which track the evolution of error or loss during training, provide valuable insights into model convergence and overall behavior.
Key properties of learning curves that offer critical insights into the behavior of DL models and guide more effective training strategies include [
16] the following:
Convergence rate: The speed at which error decreases during training iterations, reflecting the optimization efficiency of the process.
Smoothness: The degree of regularity or stability in learning curves. Smoother curves often indicate more stable and predictable training processes.
Convexity: Whether learning curves exhibit convex behavior, which simplifies analysis and optimization.
Notably, these factors facilitate the study of the initial curvature of learning curves, which may contain predictive information about a model’s final performance [
17]. A steeper initial curvature, representing a rapid decrease in error, could indicate better final performance. Investigating the relationship between initial learning curve curvature and final performance across various neural network architectures and datasets is therefore a promising area of research.
This subsection does not aim to comprehensively address the foundational problem [
16,
17]. Instead, it focuses on leveraging initial behavior (early stopping) to predict future performance and identify the best model. Typical methods for fitting learning curves involve cross-validation or direct parameter tuning to extrapolate performance to unseen dataset sizes. Parametric approaches, such as modeling loss functions with exponential or power-law behaviors, are particularly useful in ML settings where data complexity varies widely [
16]. These approaches form the basis of the practical estimation strategies explored here. We briefly discuss other approaches, along with their advantages and disadvantages, which justify further exploration of alternative solutions.
1.3.1. Extrapolation of Learning Curves
In the ML literature, the term “learning curve” is understood in two different ways. The first, more general, meaning refers to the learning curve as a function of the size of the training set. These types of learning curves have been studied to extrapolate performance from smaller to larger datasets and are not the focus of this paper. The second, more commonly used, meaning refers to the learning curve as a function of the number of training iterations (neural network epochs in this work).
In this paper, we consider the learning curve $f(t) = \mathcal{L}(x; \theta_t)$, where $\mathcal{L}$ represents the loss function of the model applied to the dataset $x$ with parameters $\theta_t$ at epoch $t$.
An intuitive approach to predicting learning curves is based on the method proposed by Domhan et al. [
18], which considers a family of parametric functions to extrapolate the learning curve $f$ from its initial observations.
In this framework, the curves are modeled as a weighted linear combination of $k$ basis functions $\phi_i(t \mid \theta_i)$, each dependent on time $t$ and parameter vectors $\theta_i$. The central assumption is that the curve being estimated can be modeled as
$$ f(t) = \sum_{i=1}^{k} w_i \, \phi_i(t \mid \theta_i), $$
where $\Theta = (\theta_1, \dots, \theta_k)$ represents the set of all parameters $\theta_i$, and $w = (w_1, \dots, w_k)$ denotes the weight vector associated with the parameters used in the basis functions. The prediction process also assumes observational noise around the unknown true value $f(t)$, modeled as
$$ y_t = f(t) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2), $$
where a prior is defined for the parameters. Using a gradient-free Markov chain Monte Carlo (MCMC) method, samples from the posterior distribution are obtained to predict future values of the learning curve.
The approach is flexible due to the inclusion of arbitrary parametric functions, allowing it to adapt to various neural network architectures and hyperparameter configurations. However, a key limitation is that the model does not leverage previously evaluated hyperparameter configurations, requiring observation of a significant portion of the learning curve before its predictions become reliable.
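As an illustration of this parametric extrapolation idea, the following sketch fits a small weighted combination of basis functions to the first observed epochs of a loss curve and extrapolates it forward; the particular basis functions, starting values, and synthetic data are illustrative assumptions, not the exact family used by Domhan et al.

```python
# Minimal sketch of parametric learning-curve extrapolation: fit a hypothetical
# two-term family (power law + decaying exponential) to the first epochs of a
# validation-loss curve and extrapolate to a later epoch.
import numpy as np
from scipy.optimize import curve_fit

def combined_curve(t, w1, a, b, w2, c):
    # f(t) = w1 * t^(-a) + w2 * exp(-c * t) + b  (illustrative basis functions)
    return w1 * np.power(t, -a) + w2 * np.exp(-c * t) + b

rng = np.random.default_rng(0)
t_obs = np.arange(1, 8).astype(float)                              # first 7 observed epochs
y_obs = 0.8 * t_obs ** -0.5 + 0.1 + 0.01 * rng.normal(size=t_obs.size)  # synthetic losses

# Fit the parametric family to the initial observations and extrapolate.
p0 = [0.5, 0.5, 0.1, 0.1, 0.5]
params, _ = curve_fit(combined_curve, t_obs, y_obs, p0=p0, maxfev=20000)
print("predicted loss at epoch 100:", combined_curve(100.0, *params))
```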
A crucial aspect of this methodology is estimating the curvature of the learning curve to predict its future behavior, such as detecting convexity. Using the central limit theorem, the distribution of sample means approaches normality for sufficiently large sample sizes. Therefore, to estimate the curvature, $S$ independent experiments are performed, generating predictions for large enough epochs $m$ to reveal reliable trends. This provides a solid foundation for evaluating the progression of the learning curve. For instance, based on Domhan's method, if
$$ \hat{f}(t) = \sum_{i=1}^{k} w_i \, \phi_i(t \mid \theta_i) $$
is the predicted learning curve, then the curvature of $\hat{f}$ is
$$ \hat{f}''(t) = \sum_{i=1}^{k} w_i \, \frac{\partial^2 \phi_i(t \mid \theta_i)}{\partial t^2}, $$
which can provide valuable insights into the evolution of the learning curve. Thus, the curvature depends on the values of $w_i$ and $\theta_i$. If the curvature is positive, the curve is convex, which typically indicates that the model is improving its performance at an increasing rate. Conversely, a negative curvature (concave) suggests that the improvement rate is slowing down, often signaling that the model performance is stabilizing.
1.3.2. The Power-Law Hypothesis
Before introducing a new method for NAS and its application in the context of PHM, it is crucial to evaluate whether a parametric approach is appropriate, in particular one based on power-law functions.
The answer is not clear-cut. In some cases, parameterizations of power-law functions can be effective, while in others, they may not be applicable. The question of whether learning curves in PHM can be reliably modeled with power laws or require alternative functional forms is fundamental given the complexity of PHM environments. Let us briefly examine this issue.
Taking the curve $f(t)$ as in our case (as a function of epochs), several models exist that use a parametric power law to model the learning curve. For instance, Kadra et al. [
19] model learning curves using a neural network ensemble, with the output constrained to follow a power law, while in [
20] Tissue et al. model the learning curve using a power law, but as a function of epochs and learning-rate annealing.
Starting from the hypothesis that the learning curve follows a power law (or a parametric sum of such functions),
$$ f(t) = a \, t^{-b} + c, $$
the problem can be narrowed to estimating the parameters $(a, b, c)$. These can be effectively inferred through methods like the Domhan approach, as previously described, or via specialized techniques for power-law detection (cf. [
21]). In [
21], Clauset et al. introduce a statistical method to estimate power-law parameters and conduct goodness-of-fit tests using robust techniques, including the Kolmogorov–Smirnov test, to compare power-law fits with alternative distributions.
The utility of power-law models in PHM has been recognized for their generalization capabilities (for instance, in [
22] Hestness et al. show the power-law relation between DL models’ generalization error and factors such as the amount of training data, model size, and compute resources), though alternative functional forms are often needed to capture intricate learning dynamics [
23]. PHM scenarios frequently involve diverse equipment and operational conditions, leading to learning curves that deviate from power-law behavior. For example, exponential functions effectively model rapid initial error reduction in systems with stable signal-to-noise ratios [
18]. Complex PHM systems may also exhibit phase transitions, requiring hybrid or piecewise models over purely parametric approaches. Selecting functional forms should depend on empirical validation, model evaluation, and domain expertise.
1.3.3. Modeling Learning Curves by Stochastic Processes
An alternative approach to modeling learning curves is to harness their stochastic nature. Specifically, consider a stochastic process $\{L_t\}_{t \geq 0}$, where $L_t = \mathcal{L}(x; \theta_t)$ represents the loss at time $t$ (see [24]). While stochastic processes are widely used in machine learning (e.g., [25]), the stochasticity here arises from parameter updates at each epoch. In the expression $L_t = \mathcal{L}(x; \theta_t)$, the parameter vector $\theta_t$ is defined recursively as $\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(x; \theta_t)$, leading to the relationship
$$ L_{t+1} = \mathcal{L}\left(x;\; \theta_t - \eta \, \nabla_\theta \mathcal{L}(x; \theta_t)\right). $$
Modeling learning curves as stochastic processes offers several advantages. First, it captures the inherent randomness in the learning process, which is influenced by factors such as weight initialization, the order of training data, and noise in the data. Additionally, it allows for the quantification of uncertainty in predicting the final performance of the model, a crucial aspect for informed decision making. Stochastic modeling also provides access to powerful analytical tools from statistics and probability theory, which can be used to study the properties of learning curves and predict future performance.
It is useful to hypothesize that $L_t$ follows a classic stochastic diffusion process, expressed as a stochastic differential equation of the form
$$ dL_t = \mu(L_t, t)\, dt + \sigma(L_t, t)\, dW_t, \qquad (1) $$
where $\mu(L_t, t)$ is the drift term, representing the deterministic trend of the learning curve; $\sigma(L_t, t)$ is the diffusion coefficient, capturing stochastic variability; and lastly $W_t$ is a standard Brownian motion. This kind of stochastic differential equation has been extensively studied. For example, Brogat-Motte et al. [26] present a method for estimating both the drift and diffusion coefficients of continuous, multidimensional, nonlinear stochastic differential equations (SDEs) that are influenced by control inputs.
Our learning-curve estimator can predict final performance based on the $N$ initial epochs in stochastic terms:
$$ \hat{L}_T = \mathbb{E}\left[ L_T \mid L_1, \dots, L_N \right]. $$
The interest in stochastic modeling lies in the available tools for estimating the learning curve. Using the classical result of Itô [27], if we model the curve with Equation (1) and $g(L_t, t)$ is twice differentiable in $L_t$ and differentiable in $t$, the change in $g$ is given by
$$ dg = \left( \frac{\partial g}{\partial t} + \mu \frac{\partial g}{\partial L} + \frac{1}{2}\sigma^2 \frac{\partial^2 g}{\partial L^2} \right) dt + \sigma \frac{\partial g}{\partial L}\, dW_t. $$
This result is useful for stochastic modeling of learning curves. For example, applying it to a suitable function $g$ of $L_t$, we would obtain a stochastic expression for the curvature of $L_t$.
Integrating both sides of the lemma of Itô, we obtain
$$ g(L_T, T) = g(L_0, 0) + \int_0^T \frac{\partial g}{\partial t}\, dt + \int_0^T \frac{\partial g}{\partial L}\, dL_t + \frac{1}{2} \int_0^T \sigma^2 \frac{\partial^2 g}{\partial L^2}\, dt. $$
Substituting $dL_t$ with Equation (1) in the second integral term, we obtain
$$ g(L_T, T) = g(L_0, 0) + \int_0^T \left( \frac{\partial g}{\partial t} + \mu \frac{\partial g}{\partial L} + \frac{1}{2}\sigma^2 \frac{\partial^2 g}{\partial L^2} \right) dt + \int_0^T \sigma \frac{\partial g}{\partial L}\, dW_t. $$
This integral form of the lemma is useful for evaluating $g$ at the final time $T$, based on its initial value and the integral contributions over time. For $g(L_t, t) = L_t$, as in our case, it provides a detailed estimation.
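As a minimal illustration of this stochastic view, the sketch below assumes constant drift and diffusion coefficients, estimates them from the increments of the first observed epochs, and simulates Equation (1) forward with an Euler-Maruyama scheme to obtain a distribution over the final loss; the constant-coefficient assumption and all names are ours, not part of the original formulation.

```python
# Minimal sketch, assuming constant drift and diffusion: estimate mu and sigma from the
# increments of the first observed epochs, then simulate Equation (1) forward with
# Euler-Maruyama to obtain a distribution over the final loss L_T.
import numpy as np

def estimate_drift_diffusion(losses, dt=1.0):
    increments = np.diff(losses)
    mu = increments.mean() / dt                   # drift estimate
    sigma = increments.std(ddof=1) / np.sqrt(dt)  # diffusion estimate
    return mu, sigma

def simulate_forward(l0, mu, sigma, n_steps, n_paths=1000, dt=1.0, seed=0):
    rng = np.random.default_rng(seed)
    paths = np.full(n_paths, l0, dtype=float)
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        paths += mu * dt + sigma * dW             # dL = mu dt + sigma dW
    return paths

observed = np.array([1.00, 0.82, 0.71, 0.64, 0.60, 0.57])  # first epochs of a loss curve
mu, sigma = estimate_drift_diffusion(observed)
final_losses = simulate_forward(observed[-1], mu, sigma, n_steps=94)
print("expected final loss:", final_losses.mean(), "+/-", final_losses.std())
```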
Despite its benefits, stochastic modeling of learning curves comes with some limitations. One key challenge is complexity: some stochastic models can be intricate and difficult to fit to the available data. Furthermore, these models rely on assumptions about the learning process that may not always be valid, which can affect their accuracy in certain contexts. Lastly, stochastic models require a sufficient amount of data to accurately estimate their parameters, posing a problem in situations where data are limited.
1.4. Bayesian Optimization for Neural Architecture Search
In the optimization process described above, the input to the objective function f consists of the model architecture and training hyperparameters, while the output is the performance of the model trained with the given settings. Bayesian optimization (BO) is particularly effective in scenarios where evaluating f is computationally expensive.
The BO framework provides a probabilistic approach to optimizing expensive-to-evaluate objective functions, especially when the function lacks a known analytical form or is computationally demanding. BO uses a surrogate model, typically a Gaussian process, to approximate the objective function based on prior evaluations. An acquisition function then guides the selection of the next evaluation point, balancing exploration of uncertain regions with exploitation of promising areas. This iterative approach makes BO highly efficient, requiring significantly fewer evaluations compared to brute-force or grid-search methods. BO is widely applied in hyperparameter optimization, NAS, and experimental design. For instance, in [
28], Kandasamy et al. present NASBOT, a BO framework for NAS that uses a novel distance metric in the space of neural network architectures.
A typical application of BO in ML involves model selection tasks [
29], where the generalization performance of a statistical model cannot be determined analytically and must instead be evaluated empirically. For example, BO can be used to select the regularization parameter $C$ and the kernel bandwidth $h$ for SVMs.
Methods based on BO provide a systematic and automated alternative to traditional practices commonly employed in PHM. Typically, human experts inspect learning curves during training to identify and terminate runs with poor hyperparameter settings, thereby accelerating manual hyperparameter optimization. In PHM, learning curves play a central role in evaluating the performance of algorithms with respect to resources such as the number of training examples or iterations [
17]. These curves have a variety of applications, including guiding data acquisition strategies, implementing early-stopping criteria to prevent overfitting, and facilitating model selection processes [
18]. BO enhances these tasks by automating the exploration and optimization of hyperparameters, reducing the need for manual intervention while maintaining efficiency.
1.5. Aim of the Paper
This work focuses on leveraging learning curves to eliminate unpromising models early in the NAS process using BO. Building on prior research [
30], the present work aims to optimize NAS specifically for PHM applications by introducing a novel performance estimation framework. This framework is designed to streamline the NAS process by significantly reducing computational costs while maintaining high-quality model selection.
The core of the proposed approach lies in an intelligent estimator capable of predicting the long-term performance of a model by analyzing only a few initial training and validation epochs. To achieve this, the estimator is built using an extensive dataset comprising 62 predictive maintenance problems, ensuring its robustness and generalization. This estimator is then seamlessly integrated into the BO loop, enabling the process to efficiently prune unpromising candidates and allocate resources to models with higher potential, thereby enhancing the overall efficiency of NAS in PHM tasks.
In our proposed model, prior assumptions about the functional form of the curves are not required. While such assumptions could simplify the problem, they are not necessary within our framework. As a result, our method generalizes beyond purely parametric approaches, making it better suited to environments characterized by heterogeneity, such as in the context of PHM.
1.6. Structure of the Paper
This work is structured as follows.
Section 2 presents some related works, where key developments and methodologies related to the proposed approach are discussed.
Section 3 provides the foundations of Bayesian optimization, offering an introduction to the technique, whereas the subsection "Bayesian Optimization Based on a Gaussian Process" explains its integration with Gaussian processes.
In
Section 4, the datasets used in this study are described in detail, focusing on their characteristics and applicability to predictive maintenance tasks.
Section 5 outlines the main approach and methodology developed in this work.
Section 6 details the architectures of the performance estimators used in the study and the training procedure. Finally, the performance of the estimators is assessed and analyzed under different conditions.
In
Section 7, the early-stopping methodology applied during BO-based NAS is discussed. This section defines the metrics used to evaluate the performance of the early-stopping process, presents the results obtained, and compares the approach with different baselines.
Section 8 is devoted to conducting a theoretical analysis of two fundamental aspects of the method. Firstly, the impact of additive and autoregressive noise on learning curves and early-stopping efficiency is investigated. Secondly, the convergence of robust early-stopping methods under noise is analyzed.
Finally,
Section 9 contains a discussion of the results and insights obtained from the experiments and theoretical analysis, and
Section 10 concludes the work, summarizing the findings and suggesting potential directions for future research.
2. Related Work
Building upon the ideas introduced in the Domhan et al. paper [
18] mentioned above, Baker et al. propose a method for enabling early stopping during the training of neural networks by leveraging learning curves [
31]. This method employs an estimator to predict the final performance of a model during its training. If the estimated performance is lower than the best performance observed so far, the training process for that model is terminated prematurely. This early-stopping approach is integrated into the Hyperband optimization algorithm [
32], which is designed to efficiently allocate computational resources during hyperparameter optimization. Furthermore, Klein et al. [
17] explore the use of Bayesian networks for modeling learning curves within the Hyperband framework, enhancing its predictive capabilities during hyperparameter search.
It is important to highlight that the Hyperband algorithm operates as a resource allocation strategy and does not utilize performance data from previous iterations when selecting new configurations to evaluate. Instead, Hyperband accelerates random search by applying an adaptive early-stopping mechanism that dynamically allocates resources—such as training iterations, data samples, or features—based on intermediate performance metrics.
Within the context of Bayesian optimization (BO), various methods have been developed to improve efficiency by incorporating early-stopping mechanisms. These methods aim to halt evaluations of unpromising hyperparameter configurations within the inner loop of the optimization process. A notable example is BOHB [
33], which combines the strengths of Bayesian optimization and Hyperband. BOHB leverages information from previously sampled configurations to guide the search process while retaining the early-stopping strategy of Hyperband to discard unpromising candidates, thereby improving both computational efficiency and solution quality.
Dai et al. [
34] proposed a Bayesian model to estimate the convergence of a model being trained during a Bayesian optimization (BO) iteration. This estimation allowed for the early termination of training, thereby reducing the number of unnecessary training epochs. However, a limitation of this method is that models with limited potential to outperform the current best configuration could still undergo a substantial number of training epochs before being halted.
The field of learning curve prediction and neural architecture search has seen significant developments in recent years, with researchers exploring various approaches to improve computational efficiency and model performance.
Adriaensen et al. [
35] proposed an approach for efficient Bayesian learning curve extrapolation using prior-data-fitted networks. While sharing our goal of improving learning curve prediction, their method focuses primarily on extrapolation techniques, whereas our work emphasizes early stopping in neural architecture search specifically applied in predictive maintenance domains. Their probabilistic modeling approach differs from our more comprehensive framework that studies network hyperparameter information and provides a detailed theoretical analysis of noise in learning processes.
The work by Egele et al. [
36] introduced a provocative method for early discarding of models after just one epoch of training. Their approach is notably simpler compared to our method, focusing on empirical observations rather than a comprehensive theoretical framework. While both works aim to reduce computational costs in hyperparameter optimization, our approach provides a more nuanced estimation process that leverages learning curve characteristics and incorporates domain-specific insights from predictive maintenance. Furthermore, we evaluate the performance of our method using only two observed epochs, making it directly comparable to their approach.
Rakotoarison et al. [
37] developed an in-context freeze–thaw Bayesian optimization technique. Their work shares our interest in Bayesian optimization and computational efficiency, but differs in its primary focus. While they explore a flexible model freezing/unfreezing strategy, our research concentrates on developing a robust performance estimator specifically tailored to learning curve analysis in predictive maintenance contexts.
The foundational work by Klein et al. [
17] on learning curve prediction using Bayesian neural networks laid crucial groundwork in the field. Their pioneering research introduced the idea of using probabilistic models to predict model performance, an approach we extend and refine. Unlike their work, our method incorporates a more sophisticated analysis of noise in learning processes and provides a more comprehensive framework for early stopping in neural architecture search.
Ruhkopf et al. [
38] introduced MASIF, a meta-learning approach for algorithm selection using implicit fidelity information. Although their work shares our interest in meta-learning and algorithm optimization, our research is more specifically focused on learning-curve estimation and early stopping in neural architecture search for predictive maintenance.
Our work distinguishes itself through several key contributions. First, we provide a comprehensive integration of hyperparameter information into the performance estimation process. Second, our approach is specifically tailored to predictive maintenance domains, utilizing a set of 62 different PHM-related datasets. Third, we offer a detailed theoretical analysis of (some models of) noise in learning processes, going beyond previous empirical approaches. Finally, our early-stopping method, based on a sophisticated estimator, provides a more nuanced approach to reducing computational costs in neural architecture search.
3. Background: Gaussian Processes
One approach to studying the learning curve is to model it as a Gaussian process (GP). A GP is a collection of random variables where any finite subset of these variables has a joint distribution that is a multivariate normal distribution. In simpler terms, a Gaussian process defines a distribution over functions, allowing for flexible and non-parametric modeling and predictions for complex data.
Formally, a Gaussian process can be defined as
$$ f(x) \sim \mathcal{GP}\left( m(x),\, k(x, x') \right), $$
where
$$ m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathbb{E}\left[ (f(x) - m(x))(f(x') - m(x')) \right], $$
such that for any finite set of points $x_1, \dots, x_n$, the vector of values $\mathbf{f} = (f(x_1), \dots, f(x_n))$ follows a multivariate Gaussian distribution:
$$ \mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, K), $$
where $\boldsymbol{\mu} = (m(x_1), \dots, m(x_n))$; $K$ is the covariance matrix, with entries $K_{ij} = k(x_i, x_j)$.
Therefore, GPs are characterized by their mean and covariance functions (kernels). As distributions over functions, GPs are highly useful in non-parametric regression, allowing inferences about the behavior of a function without requiring its form to be specified in advance. In machine learning, GPs play a key role in Bayesian optimization (BO), enabling the efficient selection of evaluation points in scenarios where the objective function is costly to compute. By effectively modeling uncertainty, GPs minimize the number of evaluations needed, optimizing both computational resources and time.
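The following short example illustrates these definitions with scikit-learn: a GP with an RBF kernel is fitted to a handful of evaluations and queried for its posterior mean and standard deviation. The toy function and kernel settings are illustrative only.

```python
# Illustration of the GP definitions above: fit a GP with an RBF kernel to a few
# evaluations and query the posterior mean and uncertainty at new points.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X_train = np.array([[0.1], [0.4], [0.7], [0.9]])   # evaluated points
y_train = np.sin(3 * X_train).ravel()               # observed function values

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), normalize_y=True)
gp.fit(X_train, y_train)

X_query = np.linspace(0, 1, 5).reshape(-1, 1)
mean, std = gp.predict(X_query, return_std=True)    # posterior mean and std
for x, m, s in zip(X_query.ravel(), mean, std):
    print(f"x={x:.2f}  mean={m:.3f}  std={s:.3f}")
```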
Bayesian Optimization Based on a Gaussian Process
Bayesian optimization (BO) is an effective technique for optimizing expensive black-box functions, particularly when evaluations are costly or time-consuming. It is widely used in fields such as machine learning, engineering, and scientific research, especially for tasks like hyperparameter tuning, experimental design, and parameter estimation. BO efficiently balances exploration and exploitation, which is crucial for noisy or multi-modal objective functions. By incorporating prior knowledge through kernel functions, BO is well suited to model complex and unknown landscapes, offering a structured optimization approach with minimal evaluations. While various approaches exist for BO, those based on Gaussian processes (GPs) are the most commonly used in the literature.
Generally, a Bayesian optimization (BO) algorithm operates sequentially: starting with a Gaussian process (GP) prior for $f$ at time 0, it incorporates results from previous evaluations at times $1, \dots, t-1$ to update the posterior for $f$. This posterior is then used to construct an acquisition function $\alpha_t(x)$, where $\alpha_t(x)$ measures the value of evaluating $f$ at $x$ at time $t$ (although the function can be independent of epochs). If the goal is to maximize $f$, the algorithm selects $x_t$ as the maximizer of the acquisition function:
$$ x_t = \arg\max_{x} \alpha_t(x). $$
Two key elements are necessary for implementing GP-based BO. First, we need a kernel function $k(x, x')$ to quantify the similarity between two points $x$ and $x'$ in the domain. This kernel is essential for defining the GP, enabling us to reason about the unobserved value $f(x)$ when $f(x')$ has already been evaluated. Second, we require a method to maximize the acquisition function $\alpha_t(x)$.
In the context of NAS, the objective function $f : \mathcal{A} \to \mathbb{R}$ represents the performance of a neural network architecture $a \in \mathcal{A}$, where $\mathcal{A}$ denotes the set of all possible architectures (i.e., the parameter/configuration space). The BO process iteratively searches for the optimal architecture by constructing a probabilistic model of $f$ and using it to guide the search. Formally, the BO process can be described as follows.
Initialization: Begin with an initial set of evaluated architectures $D_0 = \{(a_i, f(a_i))\}_{i=1}^{n_0}$, where $n_0$ is the number of initial evaluations.
Probabilistic modeling: Construct a probabilistic model $p(f \mid D_t)$ to capture the uncertainty about the objective function $f$ given the observed data $D_t$. A common choice is a GP.
Acquisition function: Define an acquisition function $\alpha(a \mid D_t)$ that quantifies the utility of evaluating a new architecture $a$, balancing exploration (searching regions with high uncertainty) and exploitation (searching regions with promising performance).
Optimization: Solve the optimization problem $a_{t+1} = \arg\max_{a \in \mathcal{A}} \alpha(a \mid D_t)$ to find the next architecture to evaluate. Several optimization techniques, such as gradient-based methods or evolutionary algorithms, can be used.
Evaluation and update: Evaluate the performance $f(a_{t+1})$ of the new architecture and update the dataset $D_{t+1} = D_t \cup \{(a_{t+1}, f(a_{t+1}))\}$.
Iteration: Repeat steps 2–5 until a stopping criterion is met, such as reaching a maximum number of iterations or a target performance level.
Also, local BO has been studied by Wu et al. [
39].
Therefore, the acquisition function guides the search. Common acquisition functions include expected improvement (EI) [
40], $\alpha_{\mathrm{EI}}(x) = \mathbb{E}\left[\max(0, f(x) - f^{*})\right]$;
upper confidence bound (UCB), $\alpha_{\mathrm{UCB}}(x) = \mu(x) + \kappa\, \sigma(x)$
(which will be used in this paper); and probability of improvement (PI),
$\alpha_{\mathrm{PI}}(x) = P\left(f(x) > f^{*}\right)$, where $f^{*}$ denotes the best value observed so far.
In the case of NAS, the BO algorithm can be formalized as in Algorithm 1.
Algorithm 1: Bayesian Optimization for Neural Architecture Search.
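A minimal sketch of the GP/UCB loop formalized above (Algorithm 1) is given below, under simplifying assumptions: architectures are encoded as numeric vectors drawn from a toy search space, the objective is a placeholder for training and validating a candidate network, and the value of the UCB weight $\kappa$ is arbitrary.

```python
# Sketch of the BO-for-NAS loop with a GP surrogate and UCB acquisition; the encoding
# of architectures, the toy objective, and kappa are assumptions for illustration.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def evaluate_architecture(a):
    # Placeholder for "train the network encoded by a and return validation performance".
    return -np.sum((a - 0.3) ** 2) + 0.01 * rng.normal()

candidates = rng.uniform(0, 1, size=(500, 3))    # discretized configuration space
idx = rng.choice(len(candidates), size=5, replace=False)
X = candidates[idx]                               # D_0: initial evaluated architectures
y = np.array([evaluate_architecture(a) for a in X])

kappa = 2.5                                       # UCB exploration weight (assumed value)
for _ in range(20):                               # BO iterations
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), normalize_y=True).fit(X, y)
    mean, std = gp.predict(candidates, return_std=True)
    ucb = mean + kappa * std                      # acquisition: upper confidence bound
    a_next = candidates[np.argmax(ucb)]
    X = np.vstack([X, a_next])
    y = np.append(y, evaluate_architecture(a_next))

print("best architecture encoding:", X[np.argmax(y)], "performance:", y.max())
```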
The BO process is designed with the expectation that, as the number of iterations increases, it will converge towards the best value of the objective. For example, guiding the method with the expected improvement criterion is expected to lead to convergence under the selected strategies (cf. [
41]).
4. Experiments on Predictive Maintenance: Datasets
To illustrate the proposed methodology, a dataset of 61,000 learning curves was generated. This dataset was created using 62 different datasets from multiple tasks in the context of PHM, including failure detection, fault diagnosis, and prognosis. The datasets were obtained using the Python tool
phmd [
42], which facilitates the downloading and loading of each dataset. These datasets were sourced by searching various research repositories, related bibliographies, and the internet. They are public datasets in the context of PHM, covering the years between 2010 and 2022: 10 are provided by NASA [
43,
44,
45,
46,
47,
48,
49,
50,
51,
52], while 8 are offered by the PHM Society through their American, European, and Asian challenges [
53,
54,
55,
56,
57,
58,
59,
60]. Universities worldwide contributed 19 datasets [
61,
62,
63,
64,
65,
66,
67,
68,
69,
70,
71,
72,
73,
74,
75,
76,
77,
78,
79], and 7 more were produced by other research institutions [
80,
81,
82,
83,
84,
85,
86]. One dataset belongs to the Society for Machinery Failure Prevention Technology (MFPT) [
59], and the remaining were published by various companies.
Figure 2 shows the distribution of dataset publishers. Universities, along with research centers, contributed over fifty percent of the datasets. NASA has been treated separately due to its significant contribution to dataset provision over the years. Similarly, the PHM Society, through its annual data challenges, has made a substantial contribution.
All datasets found were included in the experiment set. The only requirement that a dataset had to meet was that it had to be a time-series dataset. The datasets span multiple engineering domains, systems, and components, such as electrical components, drive technology, mechanical components, and materials, among others.
Regarding the experimental nature of the datasets, the majority of them are categorized under mechanical and electrical component fault diagnosis and prognosis. Within these domains, the analysis of bearings, gears, and batteries predominates.
Figure 3 presents the overall distribution of domains and applications to which each dataset belongs.
The type of feature present in each dataset has also been categorized, and the distribution is summarized in
Figure 4. As can be observed, vibration features are by far the most numerous. Accelerometers were used in most cases to measure vibrations when generating datasets on mechanical components; therefore, this kind of feature is very common. Features like temperature, current, and voltage are typical of electrical-component datasets, which are well represented among the datasets gathered.
Figure 5 displays the distribution of modeling tasks and PHM tasks. Regarding the modeling tasks (
Figure 5a), most of the tasks are regression tasks, followed by multiclass classification tasks. Binary classification is less represented and is typically focused on fault detection. Notably, any fault diagnosis task can be converted into a fault detection task by considering the normal or healthy condition as the negative label and the other fault categories as the positive label. In relation to the PHM tasks (
Figure 5b), diagnosis tasks are dominant, followed by prognosis tasks.
In real-world applications, fault categories are less common than the healthy condition. It is desirable for the set of datasets to be representative of this situation. Therefore, the imbalance ratio of the dataset is an important factor to consider.
Figure 6 shows the distribution of imbalance ratios across the datasets. A considerable number of datasets are imbalanced or very imbalanced, as expected.
For each dataset, neural architecture search (NAS) was performed on four different architectures: Feedforward neural networks (FFNs), deep convolutional networks (DCNs), recurrent neural networks (RNNs), and Transformers.
The NAS process was conducted using Bayesian optimization (BO), implemented with the software published by F. Nogueira [
87]. The BO algorithm used in our work relies on a Gaussian process (GP) with a radial basis function (RBF) kernel to model the objective function. For selecting the next set of hyperparameters, we employed the upper confidence bound (UCB) acquisition function with its default hyperparameters $\kappa$ and $\xi$. The choice of UCB provides a balance between exploration and exploitation, allowing the algorithm to focus on both areas of high uncertainty and regions where performance is expected to improve.
Each neural architecture search (NAS) run consisted of 100 iterations. The first 20 iterations used random sampling to explore the search space, providing an initial set of diverse observations to seed the GP. Subsequently, the BO process used the UCB acquisition function to guide the search for optimal hyperparameter configurations.
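A minimal usage sketch of the bayesian-optimization package by F. Nogueira [87] is shown below, mirroring the setup described (20 random initial points followed by BO iterations, up to 100 evaluations in total); the objective function and search bounds are placeholders, and the way the UCB parameters are configured varies between package versions.

```python
# Minimal usage sketch of the bayesian-optimization package; the objective and bounds
# are placeholders. 20 random points + 80 BO iterations mirror the 100 iterations
# described above; acquisition/kappa configuration depends on the package version.
from bayes_opt import BayesianOptimization

def train_and_score(learning_rate, num_layers, hidden_units):
    # Placeholder: build and train the candidate network, return a validation score
    # (bayes_opt maximizes, so return e.g. the negative validation loss).
    return -(learning_rate - 0.01) ** 2 - (num_layers - 3) ** 2 * 1e-3

pbounds = {
    "learning_rate": (1e-4, 1e-1),
    "num_layers": (1, 6),
    "hidden_units": (16, 256),
}

optimizer = BayesianOptimization(f=train_and_score, pbounds=pbounds, random_state=42)
optimizer.maximize(init_points=20, n_iter=80)
print(optimizer.max)
```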
Figure 7 presents a sample of the learning curves generated from the dataset.
5. Overall Description of the Proposal
Our proposal is based on two main components: a learning-curve estimator (
Section 6) and a BO process for NAS that uses the aforementioned estimator for early stopping (
Section 7). Let us briefly describe these elements.
- Learning-curve estimator:
The goal is to develop a model capable of predicting the final performance of a DL model based solely on the first epochs of its training and validation curves. Two versions of the estimator are proposed: an estimator based only on the learning curves and an estimator conditioned on the hyperparameters of the neural network architecture.
- Integration into BO for NAS:
The learning-curve estimator is used to accelerate the BO process in the search for optimal neural network architectures for PHM tasks. When the estimator predicts that the performance of a model will be significantly worse than the best model found so far, the training of that model is stopped, thus saving computational resources.
The model is trained and evaluated using a dataset of 61,000 learning curves generated from 62 different datasets in the context of PHM, as described in
Section 4.
6. Performance Estimators
The proposed methodology centers on a model designed to estimate the final validation performance of a neural network. This estimation allows for the application of an early-stopping strategy during NAS, improving efficiency and reducing computational costs by halting training early based on predicted performance.
To construct the estimator, the initial step involves selecting a specific machine learning model. The process uses the first $N$ observed points of training and validation performance extracted from the learning curves. These points provide a basis for analyzing and predicting trends by working on pairs of curves $(\mathbf{c}^{\mathrm{train}}, \mathbf{c}^{\mathrm{val}})$.
The training set to build such a model is composed of a set of former training and validation performances and their corresponding targets. Using the notation $\mathbf{c}_i = (\mathbf{c}^{\mathrm{train}}_i, \mathbf{c}^{\mathrm{val}}_i)$:
$$ \mathcal{D} = \left\{ (\mathbf{c}_i, \; y_i) \right\}_{i=1}^{M}, $$
where $M$ is the number of samples in the dataset and $y_i$ is the final validation performance of the $i$-th training run.
Note that the number of training epochs for each model could be different (see
Figure 7). The variation is due to the models being trained using an early-stopping criterion. This is quite different from other works, where the number of epochs is always the same.
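A sketch of how such training pairs could be assembled is given below: the input is the first N observed epochs of each training/validation curve and the target is the final validation loss of that run. The simple filtering of curves shorter than N + 1 epochs is an assumption about how variable-length curves are handled.

```python
# Sketch of assembling estimator training pairs: input = first N epochs of each
# (train, val) loss curve, target = final validation loss of that run.
import numpy as np

def build_training_pairs(curves, n_observed):
    """curves: list of dicts with 'train' and 'val' loss arrays of arbitrary length."""
    X, y = [], []
    for c in curves:
        train, val = np.asarray(c["train"]), np.asarray(c["val"])
        if len(val) <= n_observed:
            continue                              # need at least one unobserved epoch
        prefix = np.stack([train[:n_observed], val[:n_observed]], axis=-1)  # (N, 2)
        X.append(prefix)
        y.append(val[-1])                         # final validation loss as target
    return np.stack(X), np.array(y)

curves = [
    {"train": [0.9, 0.7, 0.6, 0.55, 0.52], "val": [1.0, 0.8, 0.72, 0.70, 0.69]},
    {"train": [1.1, 0.9, 0.8, 0.7],         "val": [1.2, 1.0, 0.95, 0.93]},
]
X, y = build_training_pairs(curves, n_observed=2)
print(X.shape, y)   # (2, 2, 2) and the two final validation losses
```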
In a second modeling stage, we study how the hyperparameter information impacts the learning-curve estimator. Thus, the hyperparameters are included as part of the estimator.
Since the dataset contains various types of architectures and only a general learning-curve estimator is trained, the hyperparameter vector is paired with a one-hot vector that indicates whether each hyperparameter is set for that network. Thus, the training set collecting such information for this kind of estimator is of the form
$$ \mathcal{D}_h = \left\{ \left( (\mathbf{c}_i, \mathbf{h}_i, \mathbf{m}_i),\; y_i \right) \right\}_{i=1}^{M}, $$
where $\mathbf{h}_i$ is the hyperparameter vector of the $i$-th model and $\mathbf{m}_i$ its associated one-hot mask.
A first approximation to approach the estimation could be to start from a parametric family such as
$$ \hat{f}(t) = \sum_{i=1}^{k} w_i \, \phi_i(t \mid \theta_i). $$
Thus, the approach proposed by Domhan could be applicable. However, as previously mentioned, parameter estimation using MCMC can be computationally expensive, and the most appropriate parametric family (especially in the context of PHM) is unclear. An alternative, which we aim to demonstrate in this paper, is to use a neural network-based model, trained on the available data, to compute the estimate $\hat{y}$. The architecture of the network is outlined below.
6.1. Performance Estimator Architectures
The selected architecture for the final performance estimator is the LSTM, chosen for its excellent performance in processing and modeling time-series data, effectively capturing both past and future dependencies. The architecture begins with LSTM cells consisting of 128 units. Following these, two fully connected layers with 128 and 64 neurons, respectively, are added, each utilizing ReLU as the activation function. To improve training stability and convergence, batch normalization is applied immediately after the LSTM cells.
The input to the model corresponds to training and validation performance curves, with a variable shape ranging from 2 to 9 observed epochs. The complete architecture is illustrated in
Figure 8.
To analyze the impact of network hyperparameters on the learning process, a conditioned network was designed. The hyperparameters define the architecture of the network responsible for generating the learning curve. Given that the dataset includes different types of networks (FFN, CNN, RNN, and Transformer), certain hyperparameters are not applicable across all architectures (see
Table 1).
To address this, a one-hot vector mask was introduced to indicate the relevance of each of the 28 hyperparameters for a specific network type. Both the hyperparameter vector and the mask, each of dimension 28, are processed through two fully connected layers containing 32 and 16 neurons, respectively. These layers utilize the Swish activation function, defined as $\mathrm{swish}(x) = x \cdot \sigma(x)$, where $\sigma$ denotes the sigmoid function (see
Figure 8B).
The output from the final fully connected layer is concatenated with the output of the Bi-LSTM cells. This combined representation is then passed to the two final dense layers of the network. Detailed summaries of the architectures selected are provided in
Table A1.
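A minimal Keras sketch of the two estimator variants is given below. The layer sizes follow the description above (LSTM cells with 128 units, dense layers with 128 and 64 ReLU neurons, and a 32/16 Swish branch for the hyperparameter vector and mask); details such as concatenating the hyperparameter vector and mask before the dense branch, the linear output head, and using bidirectionality only in the conditioned variant are assumptions.

```python
# Keras sketch of the two estimator variants described above; input handling of
# variable-length curves and the linear output head are assumptions.
from tensorflow.keras import layers, models

def build_curve_estimator():
    curves = layers.Input(shape=(None, 2), name="observed_curves")  # (epochs, [train, val])
    x = layers.LSTM(128)(curves)
    x = layers.BatchNormalization()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1, name="final_val_loss")(x)
    return models.Model(curves, out)

def build_conditioned_estimator(n_hparams=28):
    curves = layers.Input(shape=(None, 2), name="observed_curves")
    hparams = layers.Input(shape=(n_hparams,), name="hyperparameters")
    mask = layers.Input(shape=(n_hparams,), name="hyperparameter_mask")

    x = layers.Bidirectional(layers.LSTM(128))(curves)
    x = layers.BatchNormalization()(x)

    h = layers.Concatenate()([hparams, mask])      # assumed: joint processing of vector + mask
    h = layers.Dense(32, activation="swish")(h)
    h = layers.Dense(16, activation="swish")(h)

    z = layers.Concatenate()([x, h])
    z = layers.Dense(128, activation="relu")(z)
    z = layers.Dense(64, activation="relu")(z)
    out = layers.Dense(1, name="final_val_loss")(z)
    return models.Model([curves, hparams, mask], out)
```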
6.2. Training Procedure and Estimator Performance
The performance estimators were designed to predict the final validation loss of a network, with the choice of loss function depending on the nature of the task. For regression tasks, the mean squared error (MSE) was used, while classification tasks relied on cross-entropy loss.
The hyperparameters of the performance estimator network to be optimized include the number of LSTM layers, the size of the LSTM cells, the bidirectionality, and the learning rate. To ensure robust training and evaluation, the dataset of performance curves was split at the dataset level (known as group-based splitting) to mitigate overfitting. A total of 30% of the datasets were randomly assigned to the test set, while the remaining 70% were used within a three-fold cross-validation (that is, they were further divided, allocating 66% to the training folds and 33% to the validation fold), to enhance the results’ reliability. Additionally, all experiments were repeated six times with different random seeds to validate the results across diverse test conditions and network initializations.
Table 2 shows the range of hyperparameters studied.
The training process incorporated several optimization techniques to ensure efficiency and stability. Early stopping was applied to terminate training when validation performance failed to improve for eight consecutive epochs, reducing unnecessary computation and minimizing overfitting. A learning rate decay strategy, inspired by the work of You et al. [
88], was also employed. If validation performance showed no improvement for five consecutive epochs, the learning rate was reduced by a factor of 0.1. The Adam optimizer [
89] was used to train the estimators effectively.
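Continuing the sketch above, the training procedure described (Adam, early stopping after eight stagnant validation epochs, and learning-rate reduction by a factor of 0.1 after five) could be expressed with standard Keras callbacks; the monitored quantity, initial learning rate, and loss function are assumptions, since the learning rate itself is one of the searched hyperparameters.

```python
# Training-procedure sketch: Adam, early stopping (patience 8), and LR decay by 0.1
# after 5 stagnant epochs; initial LR, loss, and monitored metric are assumptions.
from tensorflow.keras import callbacks, optimizers

model = build_conditioned_estimator()              # from the architecture sketch above
model.compile(optimizer=optimizers.Adam(learning_rate=1e-3), loss="mae")

training_callbacks = [
    callbacks.EarlyStopping(monitor="val_loss", patience=8, restore_best_weights=True),
    callbacks.ReduceLROnPlateau(monitor="val_loss", patience=5, factor=0.1),
]

# model.fit([curves, hparams, mask], targets, validation_data=...,
#           epochs=100, callbacks=training_callbacks)
```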
The performance of the estimators, as illustrated in
Figure 9, is influenced by the number of observed training epochs. As anticipated, the estimation error decreases as the number of observed epochs increases, demonstrating the advantage of incorporating more training data. Interestingly, the inclusion of network hyperparameters enhances the accuracy of the estimators, primarily when the number of observed epochs is small. Additionally, the learning rate appears to have a lesser impact on the non-conditional approach, whereas lower learning rates are necessary for the conditional approach to achieve optimal performance.
Figure 10 illustrates the relationship between the validation MAE and test MAE as measured by the performance estimator. The figure highlights that a lower number of observed epochs is associated with a reduced likelihood of overfitting, where the model maintains similar performance on both validation and test sets. This trend is particularly evident in the kernel density estimation (KDE) plot, which shows a more consistent alignment between validation and test errors as the number of observed epochs decreases.
6.3. Robustness Analysis
In this section, we analyze the sensitivity of the estimator performance to various hyperparameters using the Sobol sensitivity index. The Sobol index quantifies the contribution of each input parameter to the output variance, providing insight into how robust the model is to changes in hyperparameters. The Sobol index for a given hyperparameter $X_i$ is computed as follows:
$$ S_i = \frac{\mathrm{Var}_{X_i}\left( \mathbb{E}[Y \mid X_i] \right)}{\mathrm{Var}(Y)}, $$
where $Y$ is the output of the model (e.g., test RMSE) and $X_i$ is the hyperparameter under study. The numerator measures the variance of the output due to variations in $X_i$, while the denominator represents the total variance in the output. The Sobol index values for each hyperparameter are summarized in
Table 3.
For the learning rate, the Sobol index was found to be . This relatively moderate value suggests that the learning rate has a meaningful impact on model performance. Thus, fine-tuning the learning rate is likely to have the most noticeable effect on improving model performance. In contrast, the network depth had a Sobol index of , indicating a small negative influence on the performance of the model. The negative Sobol index suggests that a shallower architecture could be more effective for this problem, and adding extra layers might not contribute positively to performance.
The bidirectional setup yielded a Sobol index of , which indicates a very small positive effect on performance. This suggests that processing information in both forward and backward directions has a minor benefit. Finally, for the number of recurrent units, the Sobol index was , showing a small positive influence on model performance. While this index indicates a modest effect, it still suggests that the width of the recurrent layers plays a role in capturing more complex patterns in the data. However, the contribution is small compared to the learning rate, and adjustments to the recurrent units may lead to marginal improvements rather than drastic changes in performance.
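For reference, a first-order Sobol index of the form above can be estimated by binning experiment results over a single hyperparameter, as in the following sketch; the binning scheme and the synthetic data are illustrative and not the analysis pipeline used here.

```python
# Sketch of estimating a first-order Sobol index S_i = Var(E[Y | X_i]) / Var(Y) by
# binning experiment results over one hyperparameter (quantile bins, roughly equal size).
import numpy as np

def first_order_sobol(x_i, y, n_bins=10):
    edges = np.quantile(x_i, np.linspace(0, 1, n_bins + 1))
    bin_idx = np.clip(np.digitize(x_i, edges[1:-1]), 0, n_bins - 1)
    conditional_means = np.array(
        [y[bin_idx == b].mean() for b in range(n_bins) if np.any(bin_idx == b)]
    )
    return conditional_means.var() / y.var()       # Var(E[Y | X_i]) / Var(Y)

rng = np.random.default_rng(1)
learning_rates = rng.uniform(1e-4, 1e-1, size=2000)
test_rmse = 0.2 + 0.5 * (np.log10(learning_rates) + 3) ** 2 + 0.05 * rng.normal(size=2000)
print("Sobol index (learning rate):", first_order_sobol(learning_rates, test_rmse))
```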
7. Early Stopping During NAS with BO
After training the final performance estimator, its effectiveness in reducing the number of epochs required during BO for NAS was evaluated. Since running the BO process across 62 datasets for various architectures is computationally expensive, and these experiments had already been conducted to generate the learning curves, we simulated the BO process using the results from those experiments. Specifically, we followed the same sequence of test set experiments for each dataset as their respective training runs.
The details of this simulation procedure are outlined in Algorithm 2. The condition $\hat{L} > 2\, L_{\mathrm{best}}$, where $\hat{L}$ is the predicted final loss of the candidate and $L_{\mathrm{best}}$ is the best loss observed so far, was introduced as a heuristic rather than a parameter derived from tuning. This threshold was set at the beginning of our study based on the hypothesis that if the predicted loss of a model is twice as high as the best loss observed so far, it is unlikely that the model will outperform the current best. The value of 2 was chosen to provide a balance between filtering out poorly performing models and allowing sufficient exploration. While this specific threshold was not extensively tuned, it reflects an intuitive assumption about the relationship between early loss estimates and final performance. Future work could explore alternative thresholds to better understand their impact on model selection.
7.1. Metrics
Two key metrics were computed to evaluate the impact of early stopping during the BO process. The first metric quantifies the total number of training epochs skipped during the BO process. Early stopping is applied to reduce the computational cost by avoiding unnecessary training iterations. Let $E_{\mathrm{total}}$ denote the total number of epochs a model would run without early stopping, and $E_{\mathrm{obs}}$ represent the number of epochs observed before early stopping is applied. The number of skipped epochs, $E_{\mathrm{skip}}$, is defined as
$$ E_{\mathrm{skip}} = E_{\mathrm{total}} - E_{\mathrm{obs}}. $$
This metric provides a measure of the computational time saved by applying early stopping.
The second metric measures the performance drop percentage, $\Delta P$. This metric measures the relative reduction in performance caused by early stopping, compared to the best possible performance achieved if all epochs were fully observed.
Let $P_{\mathrm{full}}$ denote the performance (e.g., test loss) of the best model when trained for all epochs, and $P_{\mathrm{early}}$ the performance of the model selected by early stopping. The performance drop percentage is computed as
$$ \Delta P = \frac{P_{\mathrm{early}} - P_{\mathrm{full}}}{P_{\mathrm{full}}} \times 100. $$
This metric quantifies the cost of applying early stopping in terms of performance degradation.
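Using the notation above, both metrics reduce to simple expressions; the sign convention below assumes loss-type performance values, consistent with the definitions.

```python
# The two evaluation metrics defined above: skipped epochs and performance drop percentage.
def skipped_epochs(e_total, e_observed):
    return e_total - e_observed                        # E_skip = E_total - E_obs

def performance_drop_pct(p_full, p_early):
    return (p_early - p_full) / p_full * 100.0         # relative degradation in %

print(skipped_epochs(100, 38))                         # 62 epochs saved
print(performance_drop_pct(p_full=0.210, p_early=0.215))  # ~2.4% drop
```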
It is noteworthy that there exists a trade-off between these two metrics. As the number of training epochs skipped increases, the probability of discarding the best models selected during the BO process without early stopping increases. Therefore, it is important to find an optimal balance, where early stopping effectively reduces computational cost without significantly compromising model performance.
Algorithm 2: Early stopping within BO simulation.
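A minimal sketch of the simulation logic of Algorithm 2 follows: recorded learning curves are replayed in their original BO order, the estimator is queried after the observed epochs, and a candidate is discarded if its predicted final loss exceeds twice the best loss found so far. The data structures and the estimator signature are assumptions.

```python
# Sketch of the early-stopping simulation within BO: replay recorded curves in their
# original BO order and discard candidates whose predicted final loss exceeds twice
# the best loss found so far. Names and data structures are assumptions.
def simulate_early_stopping(recorded_curves, estimator, n_observed, threshold=2.0):
    best_loss = float("inf")
    epochs_used, epochs_total = 0, 0
    for curve in recorded_curves:                  # curves in the original BO order
        val = curve["val"]                         # recorded validation-loss curve
        epochs_total += len(val)
        predicted_final = estimator(curve, n_observed)
        if predicted_final > threshold * best_loss:
            epochs_used += n_observed              # candidate discarded early
            continue
        epochs_used += len(val)                    # candidate trained to completion
        best_loss = min(best_loss, val[-1])
    return best_loss, epochs_total - epochs_used   # selected loss and skipped epochs
```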
7.2. Results
The results of the simulation are presented in
Figure 11. Both performance estimators save approximately 42% to 65% of computation time (skipped epochs). However, an inverse trend is observed: as fewer epochs are observed, more epochs are discarded when a non-promising architecture is detected early, which slightly reduces the overall performance.
Additionally, we measure the frequency with which the estimator identifies the best solution, or ground truth. This likelihood increases as the number of observed epochs grows, suggesting that having more data leads to more accurate estimations.
The effect of observing fewer epochs on the mean validation loss is also analyzed. On average, the increase in loss is minimal—approximately 2% in the worst-case scenario when only two epochs are observed (note that the learning curves are normalized between 0 and 1). This is evident from the star markers in
Figure 12. These results suggest that the negative impact of early stopping in BO, in terms of performance drop ($\Delta P$), is negligible, even when the number of observed epochs is limited. This finding aligns with the work of Egele et al. [
36], who demonstrated excellent performance even when observing only a single epoch.
An important question, particularly regarding stability and reliability, is whether the predictive accuracy of the performance estimator translates effectively to the BO process. The star markers in
Figure 12 and
Figure 13 illustrate the correlation between the validation loss of the performance estimator and the mean increase in validation loss achieved during the BO process. The results indicate that a lower validation loss of the performance estimator corresponds to a lower increase in the mean validation loss during the BO process, highlighting the role of the estimator in guiding efficient model selection.
The observed moderate positive correlations (both Pearson and Spearman) indicate a relationship between the validation loss of the performance estimator and the BO efficiency of the process in identifying well-performing models, even when a significant percentage of epochs are discarded. Notably, the higher Spearman correlation compared to the Pearson correlation suggests that while the relationship may not be strictly linear, it follows a consistent monotonic trend. This insight highlights the importance of accurate performance estimation in enhancing the effectiveness of the BO process.
The inclusion of hyperparameters in the performance estimator yields only marginal improvements. While it enhances the prediction accuracy of the curve estimator, the benefits of incorporating network features become less pronounced when the BO process results are considered. Conditioning aids in discarding poor-performing models, with only a minimal decrease in the mean loss percentage drop (see Table 4), particularly when using only two observed epochs. This finding aligns with the greater variability in performance observed with fewer epochs, as illustrated in Figure 9. Once both approaches converge, the amount of discarded training data becomes comparable, leading to similar average BO performance.
7.3. Impact of the Performance Estimator Architecture on the BO Process
Additionally, the potential impact of the hyperparameters of the estimator was analyzed. The hyperparameters studied included learning rate, network depth, bidirectionality, and recurrent units (width).
Figure 14 illustrates the effect of different ranges of values for each hyperparameter. Most hyperparameters appear to have no significant impact on the BO process, with the exception of the learning rate. For the non-conditioned estimator, higher learning rates are preferred, whereas for the conditioned estimator, this relationship is reversed.
7.4. Comparison with Baseline Models
The proposed approach was evaluated against three baseline methods to assess its performance.
The first baseline is a random approach, where a percentage of training epochs is discarded at random. In
Figure 11 and
Figure 15, the results of the random baseline are represented with black plus symbols.
The second baseline leverages the “last seen value” as an approximation for the final performance. This approach is often competitive with more sophisticated learning curve extrapolation methods [
90]. Despite its simplicity, it has been effectively utilized in frameworks such as Hyperband [
32]. To ensure consistency in comparisons, the same early-stopping rule applied to our approach is also applied to this baseline. The results of the last-seen baseline are depicted in
Figure 11 and
Figure 15 using square markers.
The third baseline employs an ARIMA model to predict the final performance. The optimal parameters of the ARIMA model (p, d, and q) are determined from the T observed epochs. Using these parameters, the model forecasts up to 100 data points, representing the maximum number of epochs executed per model. As with the previous baselines, the same early-stopping rule applied to our approach is also applied here. The ARIMA baseline results are displayed in Figure 11 and Figure 15 using cross markers.
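The two non-trivial baselines can be sketched as follows. The small AIC-based grid search used to pick (p, d, q) is an assumption, since the exact selection procedure is not detailed here, and statsmodels is used purely for illustration.

```python
import warnings
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def last_seen_baseline(observed_losses):
    """Use the most recently observed validation loss as the final estimate."""
    return float(observed_losses[-1])

def arima_baseline(observed_losses, horizon=100, max_p=3, max_d=2, max_q=3):
    """Fit ARIMA(p, d, q) on the T observed epochs (orders chosen by AIC over a
    small grid) and forecast up to `horizon` epochs; return the last forecast."""
    best_aic, best_fit = np.inf, None
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        for p in range(max_p + 1):
            for d in range(max_d + 1):
                for q in range(max_q + 1):
                    try:
                        fit = ARIMA(observed_losses, order=(p, d, q)).fit()
                    except Exception:
                        continue
                    if fit.aic < best_aic:
                        best_aic, best_fit = fit.aic, fit
    steps = horizon - len(observed_losses)
    forecast = np.asarray(best_fit.forecast(steps=steps))
    return float(forecast[-1])
```

Both functions return a scalar estimate of the final loss, so they can be plugged into the same early-stopping rule used by the proposed estimator.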
The results indicate that the random approach achieves competitive outcomes in terms of minimal performance drop, but only when a relatively small portion of training runs (20%) is discarded. In contrast, the proposed approach demonstrates superior performance, achieving better results while reducing the training time by 40%.
Similarly, the ARIMA baseline achieves a comparable 40% reduction in training time. However, this comes at the cost of a significant increase in the mean performance loss compared to the LSTM-based method.
Lastly, while the “last-seen value” baseline achieves the largest time savings, it exhibits a substantial increase in the performance drop ($\Delta_{\text{perf}}$). This underscores its limitations in maintaining performance consistency, particularly when compared to the proposed approach.
Test of Statistical Significance
To evaluate the statistical significance of the performance differences between the proposed approach and the baselines, a paired t-test was conducted. The analysis compared the mean loss performance drop for the different numbers of observed epochs. The t-test statistic was calculated as
$$
t = \frac{\bar{d}}{s / \sqrt{n}},
$$
where $\bar{d}$ is the mean difference in performance drop between the proposed approach and the baseline, $s$ is the standard deviation of the paired differences, and $n$ is the number of paired observations. The resulting $p$-value was computed as
$$
p = 2\left(1 - F_{n-1}\big(|t|\big)\right),
$$
where $F_{n-1}$ represents the cumulative distribution function of the t-distribution with $n-1$ degrees of freedom. The results are displayed in Table 5.
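The paired test can be reproduced directly with SciPy. The arrays below are synthetic placeholders for the mean performance drops per observed-epoch budget, and the manual computation is included only to show its equivalence to the formulas above.

```python
import numpy as np
from scipy import stats

# Synthetic placeholders: mean performance drop (%) per observed-epoch budget.
proposed = np.array([1.6, 1.8, 2.0, 2.3, 2.9])
baseline = np.array([2.4, 2.9, 3.3, 3.8, 4.6])   # e.g., an ARIMA-like baseline

# Paired t-test as implemented in SciPy.
t_stat, p_value = stats.ttest_rel(proposed, baseline)

# The same statistic and p-value computed explicitly from the formulas above.
d = proposed - baseline
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
p_manual = 2.0 * (1.0 - stats.t.cdf(abs(t_manual), df=len(d) - 1))
print(f"t = {t_stat:.3f}, p = {p_value:.4f} (manual: {t_manual:.3f}, {p_manual:.4f})")
```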
To ensure fairness, the comparison was limited to results within the range of epochs skipped by the ARIMA and “last-seen value” baselines. Against the ARIMA baseline, the test yielded a p-value of 0.007, demonstrating a statistically significant difference and confirming that the proposed method achieves a lower performance drop while skipping a similar percentage of training epochs. Similarly, against the “last-seen value” baseline, a p-value of 0.0001 was obtained, indicating strong statistical significance.
For the random baseline, the p-value was 0.0002, highlighting a statistically significant improvement over the random approach. These results collectively confirm that the observed advantages of the proposed method are not due to random chance and underline its effectiveness compared to the baseline methods.
8. Theoretical Analysis
We do not wish to conclude the paper without addressing some key considerations about the two main elements of our proposal. Firstly, regarding the learning-curve estimator, it is important to discuss the potential impact of noise during the search for the optimal estimator. Secondly, with respect to Bayesian optimization (BO) for NAS, attention must be given to the implications of noise within the process.
It is assumed that, as is common when working with observational data, the noise term $\varepsilon_t$ affecting the learning curves satisfies
$$
\mathbb{E}[\varepsilon_t] = 0, \qquad \operatorname{Var}(\varepsilon_t) = \sigma^2 < \infty.
$$
The likelihood of erroneous early stopping is influenced by several factors, including the noise variance ($\sigma^2$), the selected tolerance factor $\delta$, and the characteristics of the noise process. To mitigate this risk, several strategies can be implemented. Increasing the tolerance factor $\delta$ offers a safeguard against premature termination by allowing greater flexibility in the stopping criterion. Reducing the noise variance stabilizes the learning process, decreasing sensitivity to fluctuations. The influence of noise on the convergence and suitability of BO is an actively researched topic in the community (e.g., [91]).
Due to space constraints, we focus on the current modeling approach, leaving the incorporation of additional dependencies for future research, for example, noise models of the form
$$
\varepsilon_t = h\big(\theta, X, \eta, \mu\big) + \xi_t,
$$
where $\theta$ represents the architecture hyperparameters, $X$ denotes the dataset, $\eta$ is the learning rate, $\mu$ corresponds to the momentum, and $\xi_t$ represents the stochastic noise process.
We now examine the question under two noise models.
8.1. On the Impact of Noise in Learning Curves on Early-Stopping Efficiency (I): Additive Noise
It is clear that noise in learning curves can affect the effectiveness of early stopping by complicating precise predictions of the future performance of a model. Pruning methods that are robust against noise could therefore enhance the efficiency of BO for NAS.
In this section, we consider the decomposition of the learning curve as
$$
y_t = f(t) + \varepsilon_t, \qquad t = 1, \dots, T,
$$
where $f(t)$ represents the underlying (noise-free) learning curve and $\varepsilon_t$ is a stochastic noise process, assumed to be autocorrelated. Recall that $\mathbb{E}[\varepsilon_t] = 0$ and $\operatorname{Var}(\varepsilon_t) = \sigma^2$.
The event of erroneous early stopping for a promising model can be defined as
$$
E = \big\{ \hat{y}_T > f(T) + \delta \big\},
$$
where $T$ is the total number of training epochs, $\hat{y}_T$ is the estimated final performance, and $\delta$ represents a tolerance factor. This event indicates that the estimator $\hat{y}_T$ might prematurely discard a promising model.
Defining $Z = \hat{y}_T - f(T)$, and noting that $\mathbb{E}[Z] = 0$ when the estimator is unbiased, an upper bound for the probability of the event $E$ can be derived, namely,
$$
P(E) \le \frac{\operatorname{Var}(Z)}{\delta^2}.
$$
Indeed, by Chebyshev's inequality,
$$
P(E) = P(Z > \delta) \le P\big(|Z| \ge \delta\big) \le \frac{\operatorname{Var}(Z)}{\delta^2}.
$$
Using the law of total variance,
$$
\operatorname{Var}(Z) = \mathbb{E}\big[\operatorname{Var}(Z \mid f)\big] + \operatorname{Var}\big(\mathbb{E}[Z \mid f]\big).
$$
In the case of the additive decomposition of the noise described earlier (and assuming the components are independent), the variance governing the probability of erroneous early stopping can be approximated by the sum of the corresponding variances, with the noise decomposing as
$$
\operatorname{Var}(Z) \approx \sigma^2_{\text{est}} + \sigma^2,
$$
where $\sigma^2_{\text{est}}$ denotes the variance of the estimation error and $\sigma^2$ the variance of the additive noise $\varepsilon_t$.
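As a quick numerical illustration of the bound, the following snippet compares a Monte Carlo estimate of $P(E)$ with the Chebyshev bound $\operatorname{Var}(Z)/\delta^2$. The Gaussian noise and the chosen variances are illustrative assumptions only; Chebyshev's inequality itself requires no distributional assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_est, sigma_eps, delta = 0.03, 0.02, 0.10   # illustrative values

# Z aggregates the estimator error and the additive curve noise (independent).
z = rng.normal(0.0, sigma_est, 100_000) + rng.normal(0.0, sigma_eps, 100_000)

empirical = np.mean(z > delta)                        # Monte Carlo estimate of P(E)
chebyshev = (sigma_est**2 + sigma_eps**2) / delta**2  # Var(Z) / delta^2
print(f"empirical P(E) = {empirical:.4f} <= Chebyshev bound = {chebyshev:.4f}")
```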
8.2. On the Impact of Noise in Learning Curves on Early-Stopping Efficiency (II): Parametric (Autoregressive) Noise
To capture the complexity of learning curves with a more sophisticated model, the previous framework can be extended to account for more advanced characteristics of the noise:
$$
y_t = f(t) + \varepsilon_t,
$$
where $\varepsilon_t$ now models a non-stationary noise process. In practice, the extended noise model follows an autoregressive law of the form
$$
\varepsilon_t = \sum_{i=1}^{p} \phi_i(t)\, \varepsilon_{t-i} + \xi_t,
$$
where $\phi_i(t)$ represents time-dependent autoregressive coefficients, $p$ is the order, and $\xi_t$ represents noise conditional on the hyperparameters.
A reasonable hypothesis is that noise in the learning process can be effectively modeled using a first-order autoregressive process, capturing temporal dependencies that influence the variability throughout training:
$$
\varepsilon_t = \rho\, \varepsilon_{t-1} + \xi_t,
$$
where $\xi_t$ is white noise with variance $\sigma^2_{\xi}$ and the temporal correlation satisfies $|\rho| < 1$. Considering the autoregressive structure, the expectation and the effective (stationary) variance are
$$
\mathbb{E}[\varepsilon_t] = 0, \qquad \operatorname{Var}(\varepsilon_t) = \frac{\sigma^2_{\xi}}{1 - \rho^2}.
$$
Thus, by the central limit theorem, the accumulated noise $\varepsilon_t$, which is a weighted sum of past innovations, can be approximated by the normal distribution
$$
\varepsilon_t \approx \mathcal{N}\!\left(0, \frac{\sigma^2_{\xi}}{1 - \rho^2}\right).
$$
Under these conditions, the probability of erroneous early stopping is bounded by
$$
P(E) \le 1 - \Phi\!\left(\frac{\delta \sqrt{1 - \rho^2}}{\sigma_{\xi}}\right),
$$
where $\Phi$ denotes the standard normal cumulative distribution function.
The likelihood of erroneous early stopping therefore depends on several factors, including the autocorrelation coefficient $\rho$, the variance of the white noise $\sigma^2_{\xi}$, and the tolerance factor $\delta$. As $\rho \to 1$, the effective variance grows without bound, indicating a stronger temporal dependency that amplifies the variability in the learning process. As $\rho \to 0$, the behavior resembles white noise: the temporal correlations vanish and the process approaches the characteristics of independent noise.
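The effect of the autocorrelation coefficient on the effective variance can be verified with a short simulation; the innovation standard deviation and the values of $\rho$ below are illustrative assumptions.

```python
import numpy as np

def ar1_noise(n_steps, rho, sigma_xi, rng):
    """Simulate first-order autoregressive noise eps_t = rho * eps_{t-1} + xi_t."""
    eps = np.zeros(n_steps)
    for t in range(1, n_steps):
        eps[t] = rho * eps[t - 1] + rng.normal(0.0, sigma_xi)
    return eps

rng = np.random.default_rng(2)
for rho in (0.0, 0.5, 0.9, 0.99):
    eps = ar1_noise(20_000, rho, sigma_xi=0.02, rng=rng)
    theoretical = 0.02**2 / (1.0 - rho**2)   # effective variance sigma_xi^2 / (1 - rho^2)
    print(f"rho = {rho:4.2f}  empirical var = {eps.var():.6f}  theoretical = {theoretical:.6f}")
```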
8.3. On the Convergence of Robust Early-Stopping Methods Under Noise
We discuss whether the pruning method is robust to noise, i.e., whether it converges to an estimate of the underlying learning curve with bounded error. This is an interesting aspect to study, as the noise present in the learning curves may affect the effectiveness of pruning during the BO process. We assume that the pruning method uses a filtering condition of the form
$$
\big|\hat{f}(t) - y_t\big| \le g(t),
$$
where $g(t)$ is a filtering function that decreases with $t$.
Assume that the learning curve $y_t$ is subject to random noise $\varepsilon_t$:
$$
y_t = f(t) + \varepsilon_t,
$$
where $f(t)$ is the underlying learning curve without noise.
Let $\hat{f}(t)$ be an estimate of the learning curve $f(t)$ obtained using a noise-robust pruning method. We define the estimation error as
$$
e_t = \hat{f}(t) - f(t).
$$
Then, a bound $B > 0$ can be taken such that $|e_t| \le B$ holds with high probability for large $t$; that is, the noise-robust pruning method converges to an estimate of the underlying learning curve with bounded error.
We begin by decomposing the error into two components:
$$
e_t = \big(\hat{f}(t) - y_t\big) + \big(y_t - f(t)\big) = \big(\hat{f}(t) - y_t\big) + \varepsilon_t.
$$
Applying the triangle inequality,
$$
|e_t| \le \big|\hat{f}(t) - y_t\big| + |\varepsilon_t|,
$$
so that
$$
P\big(|e_t| > B\big) \le P\big(|\varepsilon_t| > B/2\big) + P\big(\big|\hat{f}(t) - y_t\big| > B/2\big).
$$
By Chebyshev's inequality,
$$
P\big(|\varepsilon_t| > B/2\big) \le \frac{4\sigma^2}{B^2}.
$$
Therefore, since the second probability tends to 0 by the filtering condition (as $g(t)$ decreases with $t$),
$$
\limsup_{t \to \infty} P\big(|e_t| > B\big) \le \frac{4\sigma^2}{B^2}.
$$
Taking $B$ to be sufficiently large, this bound becomes small, and we can assume that the error is bounded.
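As a toy illustration of this argument, the following snippet uses a trailing moving average as a stand-in for a noise-robust estimate and checks that the estimation error remains small. The underlying curve, the noise level, and the window size are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
T, window, sigma = 100, 5, 0.05
f = 1.0 / np.sqrt(np.arange(1, T + 1))            # underlying noise-free curve f(t)
y = f + rng.normal(0.0, sigma, size=T)            # observed noisy curve y_t

# Trailing moving average as a simple stand-in for a noise-robust estimate.
f_hat = np.array([y[max(0, t - window + 1): t + 1].mean() for t in range(T)])

error = np.abs(f_hat - f)                         # |e_t| = |f_hat(t) - f(t)|
print(f"mean |e_t| over the last 20 epochs: {error[-20:].mean():.3f}")
```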
9. Discussion
This work proposes a methodology to improve the efficiency of NAS for PHM applications. A performance estimator capable of predicting the final validation performance of a model based on the initial training and validation performance curves has been developed. Two versions of the estimator were created: one using only the performance curves, and another that also incorporates the network hyperparameters. The estimators were trained and evaluated using a dataset of 61,000 learning curves generated from 62 different PHM datasets.
The results demonstrate that integrating the performance estimator into the BO process for NAS can significantly reduce the computational cost, achieving time savings of over 50%. There is a trade-off between the number of skipped epochs and the final performance of the BO process. The experiments show that the proposed approach achieves a performance drop between 3.5% and 1.5%, depending on the number of epochs observed, with an average drop of 2%. This drop is considered minimal in comparison to the time savings achieved.
In a certain sense, the approach discussed here aligns with the method proposed by Dai et al. [
34]. In their work, the authors employed a Bayesian model to predict when the model being trained during a BO iteration had sufficiently converged. This approach enabled earlier termination of training, thus avoiding unnecessary epochs. However, a limitation of this method is that it could still allow a model with little potential to outperform the current best to be trained for a significant number of epochs. Such inefficiencies highlight the need for more refined stopping criteria, which may improve the trade-off between computational cost and model performance in BO-based optimization processes.
Our work goes a step further by training each model for a fixed, lower number N of epochs. During these N epochs, the learning curves are monitored and used by a previously trained network to estimate the final performance of the model. If the estimation indicates that the model performance is significantly worse than the best obtained so far, the training of the model is stopped. This approach aims to ensure that poorly performing models are trained for only N epochs.
This proposal could be valuable for the use of digital twins in studying PHM, a promising avenue (e.g., [
92]). Although the methodologies differ, both approaches aim to optimize decision making by anticipating the future behavior of a system. Digital twins, which rely on physical models to simulate real systems, could generate synthetic data to train and validate the neural models proposed in our approach, particularly in scenarios where real-world data are scarce. Conversely, machine learning techniques, such as those explored in this paper, could enhance the accuracy and efficiency of digital twins. This can be achieved by modeling complex system aspects that are difficult to capture with traditional physical models or by optimizing parameter selection within the digital twin framework. While this approach has been applied in the PHM domain to address a critical gap in the literature, the framework is inherently applicable across a wide range of fields.
Our study derives its insights from a simulated setting. Therefore, the immediate next step is to corroborate these findings with a real BO execution process.
Additionally, other points of improvement have been identified. In this work, the estimator is trained to make estimations with a fixed number N of observed epochs. In some cases, only a few additional training epochs make it evident that a model will not perform well. Thus, estimating future performance at every epoch could help detect such cases earlier and stop training sooner.
An important aspect of our work is the emphasis on the diversity of datasets used during training and evaluation. While this diversity ensures that the proposed method is robust across a wide range of tasks and architectures, we did not explicitly analyze the performance differences of the trained neural networks across these datasets. Investigating such differences could reveal valuable insights into how dataset characteristics influence model performance and the effectiveness of our method. For instance, certain types of datasets may benefit more from learning curve predictions, while others might pose greater challenges. This unexplored dimension presents a promising avenue for future work, where a detailed analysis could help identify patterns and further improve the generalizability of our approach.
Regarding the hyperparameters used to condition the estimator, it could be interesting to include additional hyperparameters, such as meta-attributes about the dataset and the involved task. Studying how these meta-attributes could impact the estimator would be an interesting avenue for academic research.
Our method, while robust, has inherent limitations that must be acknowledged. In scenarios where the objective function does not guarantee convexity, our estimator may encounter significant challenges. These include the risk of converging to local minima, potential instability in performance estimation, and increased variability in predictions.
To address these challenges, we propose several mitigation mechanisms. First, we suggest using multiple random initializations to reduce the likelihood of becoming stuck in suboptimal solutions. Additionally, probabilistic restart techniques can be employed to explore different regions of the solution space, improving convergence. Finally, we recommend uncertainty estimation through bootstrapping to quantify the variability in the performance predictions and enhance the reliability of the estimates made by the model.
Our method may encounter certain other limitations under specific conditions. Firstly, the presence of significant noise in the data can decrease the accuracy of our predictions and increase the variability in the estimates. Noise interferes with the ability of the model to identify underlying patterns in the data, which results in less reliable predictions.
Secondly, when working with nonlinear architectures, our method may struggle to capture complex dynamics. This can lead to problems of overfitting or underfitting, as the model may not be able to adequately represent the relationships between the variables.
Finally, in domains with limited transferability, the generalization ability of the model may be compromised. The performance of our method can depend heavily on the specific characteristics of the domain, limiting its effectiveness when applied to different tasks or environments. These limitations will be the subject of future research.
Future work will focus on developing a more sophisticated state-space model for learning curves that captures the intricate dynamics of neural network training. For instance, this could involve modeling with an extended stochastic differential equation of the form
$$
dL_t = h\big(L_t;\, \alpha(\theta),\, \beta(X),\, \gamma(\eta, \mu)\big)\, dt + \sigma_t\, dW_t,
$$
which incorporates architecture-dependent parameters through $\alpha(\theta)$, capturing layer complexity, connection types, and structural characteristics. Dataset dynamics are modeled by $\beta(X)$, integrating dimensionality, entropy, and separability measures. Optimization processes are represented by $\gamma(\eta, \mu)$, explicitly accounting for the learning rate, momentum, and adaptive optimization strategies. The noise process $\sigma_t\, dW_t$ is modeled as a correlated, time-dependent stochastic process, allowing for a more nuanced capture of temporal dependencies and noise characteristics in learning dynamics.
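As a sketch of what such a model could look like numerically, the following Euler-Maruyama simulation uses scalar placeholders for $\alpha(\theta)$, $\beta(X)$, and $\gamma(\eta, \mu)$ and a correlated noise increment; the functional form of the drift is an assumption for illustration, not a proposed final model.

```python
import numpy as np

def simulate_loss_sde(n_epochs=100, dt=1.0, loss0=1.0, alpha=0.04, beta=0.005,
                      gamma=0.02, sigma=0.01, rho=0.8, seed=0):
    """Euler-Maruyama simulation of a toy loss SDE with correlated noise:
    dL = [-(alpha + gamma) * L + beta] dt + sigma dW, where the Brownian
    increments are replaced by an AR(1)-correlated sequence. The scalars
    alpha, beta, gamma stand in for alpha(theta), beta(X), gamma(eta, mu)."""
    rng = np.random.default_rng(seed)
    loss = np.empty(n_epochs)
    loss[0] = loss0
    w = 0.0                                       # correlated noise increment
    for t in range(1, n_epochs):
        w = rho * w + np.sqrt(1.0 - rho**2) * rng.normal(0.0, 1.0)
        drift = -(alpha + gamma) * loss[t - 1] + beta
        loss[t] = loss[t - 1] + drift * dt + sigma * np.sqrt(dt) * w
    return loss

curve = simulate_loss_sde()
print(f"simulated final loss: {curve[-1]:.3f}")
```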
Finally, it should be noted that the current method does not fully exploit the potential knowledge and adaptation capabilities of the performance curves generated during application to a new dataset. Incorporating few-shot learning to leverage these performance curves in ongoing experiments could be another area for improvement.
10. Conclusions
This study presents a methodology to enhance the efficiency of neural architecture search (NAS) in the PHM domain through the integration of a performance estimator. By leveraging the initial training and validation performance curves, the proposed estimator predicts the final validation performance of a model, offering a substantial reduction in computational cost. The experiments demonstrate that the approach achieves over 50% time savings with an average performance drop of just 2%, highlighting its minimal impact on model performance.
The proposed framework effectively addresses the trade-off between computational efficiency and final performance. Its robust design and adaptability make it applicable to a wide range of domains beyond PHM. While the current work provides significant insights, future research should explore further optimization of the estimator, incorporation of additional meta-attributes, and application to real-world scenarios to fully realize its potential. Overall, this methodology contributes to advancing resource-efficient NAS while maintaining high-quality model selection.