Time-Series Prediction of Failures in an Industrial Assembly Line Using Artificial Learning

Sen, Mert Can; Alkan, Mahmut

doi:10.3390/app15115984



by Mert Can Sen ¹ and Mahmut Alkan ²,*

¹ Department of Mechanical Engineering, Graduate School of Natural and Applied Sciences, Nigde Omer Halisdemir University, Nigde 51240, Türkiye

² Department of Mechanical Engineering, Engineering Faculty, Nigde Omer Halisdemir University, Nigde 51240, Türkiye

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(11), 5984; https://doi.org/10.3390/app15115984

Submission received: 2 May 2025 / Revised: 21 May 2025 / Accepted: 24 May 2025 / Published: 26 May 2025



Featured Application

The methods developed in this study can be directly applied to predictive maintenance in aerospace assembly lines, where the early detection of potential component failures is critical. By utilising artificial learning models optimised for time-series data, manufacturers can improve production reliability, reduce downtime, and mitigate risks associated with structural or assembly errors. The framework may also be adapted to other high-precision, low-volume manufacturing sectors requiring advanced prediction solutions for operational safety and efficiency.

Abstract

This study evaluates the efficacy of six artificial learning (AL) models for time-series failure prediction in aerospace assembly lines: nonlinear autoregressive (NAR), long short-term memory (LSTM), adaptive neuro-fuzzy inference system (ANFIS), gated recurrent unit (GRU), multilayer perceptron (MLP), and CNN-RNN hybrid networks. The dataset consists of 45,654 records of failure configurations. The models are trained to predict failures and assessed via error metrics (RMSE, MAE, MAPE), residual analysis, variance analysis, and computational efficiency. The results indicate that NAR and MLP models, respectively, achieve the lowest residuals (clustered near zero) and minimal variance, demonstrating robust calibration and stability. MLP exhibits strong accuracy (MAE = 2.122, MAPE = 0.876%, RMSE = 1.418, and ME = 1.145) but higher residual variability, while LSTM and CNN-RNN show sensitivity to data noise and computational inefficiency. ANFIS balances interpretability and performance but requires extensive training iterations. The study underscores NAR as optimal for precision-critical aerospace applications, where error minimisation and generalisability are paramount. However, the reliance on a single failure-related variable, “configuration”, and the exclusion of exogenous factors may constrain holistic failure prediction. These findings advance predictive maintenance strategies in high-stakes manufacturing environments, with future work integrating multivariable datasets and domain-specific constraints.

Keywords:

predictive maintenance; aerospace assembly line; time-series forecasting; artificial neural networks; residual analysis; failure prediction


1. Introduction

Unforeseen failures during production and assembly processes remain one of the most critical challenges in industrial plants. Such failures can lead to increased production costs, reduced product quality, diminished customer satisfaction, and weakened competitiveness. Key sources of production failures include machine malfunctions, human error, design deficiencies, and limitations in control mechanisms [1]. Identifying which of these variables contributes to a specific failure is particularly complex in high-variation environments like the aerospace industry, where accurate and timely fault prediction is essential.

Automation has long been employed as a solution to improve process control and implement fault-prevention strategies [1,2]. While automation is effective in reducing machine-based failures, it often falls short in addressing human-related and complex systemic failures. In recent years, data-driven approaches have gained prominence due to their ability to model and predict system behaviours without relying on predefined physical laws. These methods derive models from statistical patterns in historical process data, offering a viable means of understanding complex, dynamic systems [3,4].

Unlike traditional model-based methods, which depend on simplified assumptions and are often limited to systems with primitive geometries, data-driven techniques can capture nonlinear relationships and hidden patterns within large datasets. Tools such as regression, simulation, and artificial intelligence (AI) are frequently used for this purpose [5,6]. Among these, machine-learning techniques, particularly those applied to time-series data, have emerged as highly effective for fault prediction. Notably, methods like multilayer perceptron (MLP), nonlinear autoregressive models with exogenous inputs (NARX), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and gated recurrent units (GRUs) have been widely studied [7,8,9,10]. Convolutional neural networks (CNNs) were first developed for image recognition; with a regression output layer, however, they can also be applied to time-series analysis. Hybrid CNN-RNN models have been shown to successfully capture both local and temporal patterns in sequential data [11,12].

Recent studies have explored self-similar neural networks, particularly hybrid architectures combining a fuzzy neural network with long short-term memory (FNN-LSTM), to enhance time-series forecasting by integrating fuzzy logic for uncertainty handling and LSTM for temporal dependencies. Liu et al. (2022) proposed an FNN-LSTM model for predicting the rate of penetration, combining fuzzy rule-based pre-processing with LSTM layers to address nonlinearity and noise, achieving a better RMSE in exchange for an accuracy reduction of only 5% compared to standalone LSTM or FNN [13]. Similarly, Chumakova et al. (2023) applied FNN-LSTM to an individual knowledge-testing trajectory, where the use of the FNN direct propagation network as part of a hybrid algorithmic module makes it possible to construct a trajectory with an individual testing length, regardless of the number of thematic blocks [14]. A thermal control method combining fuzzy logic and LSTM neural networks stabilises the electrolysis stack temperature within ±1 °C under a 10 kW power output [15]. For short-term load forecasting in smart grids, a hybrid model integrating fuzzy C-means (FCM) clustering and an improved LSTM neural network demonstrates enhanced accuracy under uncertain load conditions, advancing adaptive forecasting frameworks for grid stability [16]. The limitations of LSTM-based time-series forecasting (error accumulation, weak temporal correlations, low interpretability) are addressed by integrating a fuzzy inference system (FIS): a Wang–Mendel-based method constructs simplified and complete fuzzy rules, while fuzzy prediction fusion, a memory-strengthening layer, and parameter segmentation enhance reasoning, long-term dependency retention, and efficiency [17]. A fuzzy embedded long short-term memory architecture leverages the strengths of both systems to achieve high prediction accuracies with interpretable results [18]. In active power filters, LSTM-FNN is used to suppress harmonics, demonstrating enhanced accuracy and robustness for power quality management [19]. For classifying neuromuscular motor states (e.g., Forward, Reverse, Hand-Grip) from EMG signals, LSTM-FNN resolves the limitations of existing methods that ignore temporal signal correlations. The approach achieves 91.3% accuracy for four-way actions and 96.7% for two-way classifications. The FIS-LSTM framework enhances interpretability and reliability for applications in rehabilitation and medical devices, demonstrating robust performance in capturing dynamic neuromuscular patterns [20]. These hybrid frameworks highlight the potential of FNN-LSTM in complex, noisy environments, though challenges remain in optimising fuzzy rule scalability and computational efficiency. Future research should focus on dynamic fuzzy parameter adaptation and lightweight architectures for real-time applications.

Despite the advancements in predictive modelling, existing studies on fault detection in assembly lines primarily focus on rework station optimisation and process efficiency. However, stochastic variables, modelling uncertainties, and practical constraints limit the applicability of these models [21]. Moreover, heuristic and metaheuristic algorithms are increasingly needed, particularly for large-scale, complex systems [22]. There remains a significant research gap in developing holistic, real-time, and generalisable fault prediction models that account for diverse error types and their influence on production efficiency. While AI and machine learning have been successfully implemented across various domains, their application in aerospace assembly line fault prediction is still underexplored.

The studies reviewed above show that selecting an optimum model for modelling and prediction remains an open problem. So far, no single technique has yielded accurate simulation for every type of data or process. Moreover, whichever model is selected, determining its parameters is the next problem. Solution alternatives for modelling or prediction include interpolation, regression, optimisation, and artificial learning (AL). For instance, initial estimates of the parameters can be determined easily via regression of the experimental data, which has the potential to eliminate very expensive and time-consuming trial and error. The literature review raised the idea of comparing the performance of the AL models and led the authors to the research question, “Can any failure in an industrial assembly line, which is a part of manufacturing lines, be determined using AL models?” In light of the above explanations and the techniques’ potential, it is hypothesised that AL techniques can be used to predict failures, and this hypothesis is investigated here.

The literature on modelling and prediction is extensive. However, few studies have handled failure prediction using AL methods. This study aims to address that gap by evaluating the performance of six AL models (NAR, LSTM, GRU, adaptive neuro-fuzzy inference system (ANFIS), MLP, and CNN-RNN) for failure prediction in an assembly line. A single time series, referred to as “configuration”, is analysed. Each model inherently has distinct features. NAR is a variant of MLP specialised for time series. LSTM and GRU are deep recurrent networks, while ANFIS is a hybrid neuro-fuzzy system. CNN-RNN is another hybrid model, integrating both convolutional and recurrent neural networks. The trained networks are assessed through testing–validation iterations, with an additional closed-loop forecast.

Section 2 presents the significance and novelty of the study. Section 3 explains the materials and methods used and gives the background of the models. The results are presented in Section 4 with a comprehensive discussion. Finally, conclusions and future insights are explained in Section 5.

2. Significance and Novelty of the Study

The existing literature demonstrates the successful application of AL methods for modelling and prediction. However, predictive maintenance studies specifically targeting assembly lines in the aerospace industry remain scarce. The significance and main contributions of the study are outlined as follows:

The configuration data are collected in the aerospace industry. Unlike many other sectors, the safety factor in structural design typically does not exceed 1.2, with some designs even falling below 1.0. This is primarily due to the stringent requirement for lightweight structures. Nonetheless, structural integrity must not be compromised. The assembly phase, being the final stage of production, demands precise component compatibility. Any failure occurring during assembly can halt the entire production line and potentially result in catastrophic in-service damage to the aircraft or spacecraft. Such failures are associated with severe economic consequences.
The aerospace sector is distinguished by its low-volume, high-precision production and assembly requirements. These unique characteristics necessitate domain-specific attention and pose particular challenges not encountered in other industries.
Structural damage in aerospace systems may result from various factors, including fatigue, corrosion, manufacturing defects (e.g., voids, inclusions, dislocations), processing flaws (e.g., surface scratches, weld discontinuities, chevron notches, microstructural alterations), mechanical loading (tension/compression, bending, torsion, shear, pressure), impact, wear, creep, friction, hydrogen embrittlement, high temperature, and vibration. While damage tolerance design can mitigate many of these effects, human-induced errors remain difficult to eliminate. Regardless of the underlying cause, early fault prediction can help to prevent such failures.
Not all AL techniques are suitable for time-series data. Some methods are good at image processing, while others are more effective in modelling. Time-series data demand specific attention due to their unimodal nature and the necessity to incorporate time lags as both input and output for accurate prediction. Furthermore, convergence during the training phase of learning algorithms is not always guaranteed.
This study not only compares prediction performance but also evaluates different methods with distinct underlying architectures. At the end, the most suitable technique for modelling aerospace assembly-related time series is identified.
The novelty of this research lies in both its use of domain-specific aerospace assembly data and the implementation of diverse and specialised modelling techniques. Modelling and prediction continue to be an active area of inquiry for both researchers and industry practitioners.

In conclusion, this study aims to address a notable gap in the current body of knowledge and is expected to contribute significantly to the advancement of the field.

3. Materials and Methods

The method followed is shown in Figure 1 schematically.

Various AL techniques are appraised for modelling and prediction. Their hyper-parameters are determined by trial and error and suggestions collected from the literature. Creating the models is based on optimisation, so-called training.

Preliminary analyses of the dataset are conducted to characterise its properties, including temporal dependencies such as lags. Model performance is evaluated through error metrics, regression analysis, error histograms, and variance assessments. The dataset is partitioned into two segments: the initial 80% is allocated for network training, testing, and validation. Within this subset, validation and testing portions are employed to assess and refine the model’s generalisation capacity, rather than for predictive purposes. The remaining 20% of the dataset, corresponding to the most recent temporal observations, is reserved for prediction. This enables the evaluation of machine-learning (ML) performance on data extending beyond the temporal scope of the training period, referred to as the “horizon,” which signifies future time points outside the available dataset. In this context, the terms “network” and “model” are used interchangeably, as both denote systems governed by underlying mathematical functions. All computational experiments are executed on a system equipped with 8 GB RAM and a 2.8 GHz quad-core CPU. Processing is confined to CPU resources, with one time-series analysed across six distinct AL approaches, resulting in six cases.
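The partitioning described above can be sketched as follows. This is a minimal illustration, not the authors' code: the 80/20 split comes from the text, whereas the function name `split_series` and the validation and testing shares inside the 80% block (15% each) are assumptions chosen only for the example.

```python
from typing import List, Sequence, Tuple


def split_series(
    series: Sequence[float],
    model_fraction: float = 0.8,   # share stated in the text
    val_fraction: float = 0.15,    # hypothetical share inside the 80% block
    test_fraction: float = 0.15,   # hypothetical share inside the 80% block
) -> Tuple[List[float], List[float], List[float], List[float]]:
    """Split a series in time order, without shuffling."""
    n_model = int(len(series) * model_fraction)
    model_block = list(series[:n_model])
    horizon = list(series[n_model:])  # most recent observations, kept for prediction

    n_val = int(n_model * val_fraction)
    n_test = int(n_model * test_fraction)
    n_train = n_model - n_val - n_test

    train = model_block[:n_train]
    val = model_block[n_train:n_train + n_val]
    test = model_block[n_train + n_val:]
    return train, val, test, horizon


# Placeholder data with the same length as the configuration series.
series = [float(i % 101) for i in range(45654)]
train, val, test, horizon = split_series(series)
print(len(train), len(val), len(test), len(horizon))
```

Keeping the horizon as the final block, rather than sampling it at random, is what makes the prediction step a genuine out-of-sample test in time.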

3.1. Data Collecting

The time series of the configuration data is collected by means of field observations. The data provide information about the configuration of the product. Since this information specifies the configuration of failure, one can more quickly conclude which line has more failures. In order not to disturb the nature of the data and to see the performance of the models on the raw data, no pre-processing is applied to the raw data. The data and their portions are given in Figure 2. The sampling range is [0, 100]. The series contains 45,654 records, each of which is a dimensionless count.

The data are normalised using

o_{t}^{\mathrm{normalized}} = \frac{o_{t} - \bar{o_{t}}}{σ}

where t is the time step index, o_{t} is the original data at time step t, \bar{o_{t}} is its mean, and σ is the standard deviation;

σ = \sqrt{\frac{\sum_{t = 1}^{N} (p_{t} - \bar{o_{t}})^{2}}{N}}

where N is the data count in the dataset and p_{t} represents the data at time step t predicted by the model.
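A minimal sketch of this normalisation is given below. It assumes the conventional z-score, in which the mean and the population standard deviation are both computed from the observed series; the function names are hypothetical, and the inverse transform is included because predictions made in normalised units must be mapped back before error metrics are reported in the original units.

```python
import math
from typing import List, Sequence, Tuple


def zscore_fit(values: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and population standard deviation (divisor N)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std


def zscore_apply(values: Sequence[float], mean: float, std: float) -> List[float]:
    """Map raw values to normalised units."""
    return [(v - mean) / std for v in values]


def zscore_invert(values: Sequence[float], mean: float, std: float) -> List[float]:
    """Map normalised values back to the original units."""
    return [v * std + mean for v in values]


raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy data, not the configuration series
mean, std = zscore_fit(raw)
normalised = zscore_apply(raw, mean, std)
restored = zscore_invert(normalised, mean, std)
```

In a forecasting setting, the statistics would normally be fitted on the training portion only and reused for later data, so that no information from the horizon leaks into training.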

3.2. Preliminary Studies on the Data

As a specific case unique to time series, it is necessary to determine the time lags. There are powerful tools, namely Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF), which measure the relationship of a time series with its own past values at specified lags. The confidence level has been set at six standard deviations. In the graph, this confidence level is represented as a band that covers 99.99% of all peaks. Peaks that fall within these bounds are considered to be approximately zero and are referred to as white noise. These parts are considered insignificant in the determination of lags. While ACF gradually decreases, PACF shows that the first few peaks significantly exceed the confidence bounds, indicating that the time series exhibits autocorrelation and a structured pattern. If the autocorrelation continues to decline with increasing lag, it may suggest that the series has an autoregressive nature. Moreover, a slow decline in autocorrelation coefficients implies that the series is non-stationary and may contain a unit root.

As seen in Figure 3, in the ACF graph, the zero-lag peak is, as expected, equal to 1. Subsequently, a strong positive correlation is observed at the first lag, and the correlations at the following lags gradually diminish toward zero. This indicates that the autocorrelation of the time series decays slowly, suggesting a potentially non-stationary structure. Given the distinctiveness of this decline, the data can be classified as autoregressive (AR) in nature. In the PACF graph, a high correlation is observed for the first lag, and from the third lag onward, correlations rapidly approach zero. Consequently, it has been determined that the configuration series is an AR-type time series with three lags.
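The lag-selection procedure can be illustrated with a small, dependency-free sketch. The sample ACF is computed directly, and the PACF is obtained with the Durbin-Levinson recursion. The band `n_sigma / sqrt(N)` is one possible reading of the "six standard deviations" criterion and is an assumption, as are the function names and the synthetic AR(1) series used for the demonstration.

```python
import math
import random
from typing import List, Sequence


def acf(series: Sequence[float], max_lag: int) -> List[float]:
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]


def pacf(series: Sequence[float], max_lag: int) -> List[float]:
    """Partial autocorrelation via the Durbin-Levinson recursion."""
    rho = acf(series, max_lag)
    result = [1.0]
    phi_prev: List[float] = []  # coefficients of the order k-1 model
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi.append(phi_kk)
        result.append(phi_kk)
        phi_prev = phi
    return result


def significant_lags(values: Sequence[float], n_obs: int, n_sigma: float = 6.0) -> List[int]:
    """Lags whose coefficient lies outside +/- n_sigma / sqrt(N) (assumed band)."""
    bound = n_sigma / math.sqrt(n_obs)
    return [k for k, v in enumerate(values) if k > 0 and abs(v) > bound]


# Synthetic AR(1) series used only to exercise the functions.
random.seed(0)
x = [0.0]
for _ in range(4999):
    x.append(0.8 * x[-1] + random.gauss(0.0, 1.0))

r = acf(x, 5)
p = pacf(x, 5)
print(significant_lags(p, len(x)))
```

For an AR process of order d, the PACF is expected to cut off after lag d while the ACF decays gradually, which is the pattern the text uses to select three lags.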

3.3. Statistical Tools for Evaluating the Model Performances

Two commonly used methods for performance evaluation are regression analysis and error analysis. Error histograms provide insight into the distribution of errors, as well. Additionally, variance analysis is another method employed in performance evaluation.

Through regression analysis, the correlation coefficient, denoted by R, is calculated. For regression analysis, a first-order regression is applied to fit a curve. The correlation in this curve is computed using the formula

R = \frac{N \sum_{t = 1}^{N} p_{t} o_{t} - (\sum_{t = 1}^{N} p_{t}) (\sum_{t = 1}^{N} o_{t})}{\sqrt{N \sum_{t = 1}^{N} {p_{t}}^{2} - {(\sum_{t = 1}^{N} p_{t})}^{2}} \sqrt{N \sum_{t = 1}^{N} {o_{t}}^{2} - {(\sum_{t = 1}^{N} o_{t})}^{2}}}

In regression plots, the “fit line” represents the first-order regression applied to the entire set of (target, output) points. This line takes the form of output = a × target + c, where a is the slope of the linear regression line and c is the intercept on the “output” axis. The correlation between this line and the data is shown by R in the plot. While R is the correlation coefficient, R² is the coefficient of determination. Both are dimensionless. While R is good at evaluating the strength and direction of a linear relationship between two variables, R² is good at evaluating the proportion of variance. The R² value lies in [0, 1], meaning that the model explains 100 × R²% of the variance in the dependent variable. For the R value, magnitudes in [0.01, 0.29], [0.30, 0.70], or [0.71, 0.99] indicate a weak, moderate, or strong relationship, respectively. A value of 0 implies no relationship, while negative values indicate an inverse linear relationship. A larger absolute value of R signifies a stronger relationship. The “target” axis includes original data, while the “output” axis includes data predicted by the network. Ideally, these values should be equal, in which case R = 1 and all points lie on the 45° line. Although regression analysis shows the linear relationship, it is not solely sufficient to evaluate a model’s performance [23]. Therefore, these findings must be supported by other statistical control methods and error metrics to ensure more reliable results.
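As a worked illustration, the Pearson correlation coefficient and the first-order fit line output = a × target + c can be computed from raw sums. The function names and the toy data are hypothetical; the sketch only mirrors the standard definitions.

```python
import math
from typing import Sequence, Tuple


def correlation_coefficient(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Pearson R between predicted (p_t) and observed (o_t) values."""
    n = len(observed)
    sum_p, sum_o = sum(predicted), sum(observed)
    sum_po = sum(p * o for p, o in zip(predicted, observed))
    sum_pp = sum(p * p for p in predicted)
    sum_oo = sum(o * o for o in observed)
    numerator = n * sum_po - sum_p * sum_o
    denominator = math.sqrt(n * sum_pp - sum_p ** 2) * math.sqrt(n * sum_oo - sum_o ** 2)
    return numerator / denominator


def fit_line(targets: Sequence[float], outputs: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope a and intercept c of output = a * target + c."""
    n = len(targets)
    sum_t, sum_y = sum(targets), sum(outputs)
    sum_ty = sum(t * y for t, y in zip(targets, outputs))
    sum_tt = sum(t * t for t in targets)
    a = (n * sum_ty - sum_t * sum_y) / (n * sum_tt - sum_t ** 2)
    c = (sum_y - a * sum_t) / n
    return a, c


observed = [1.0, 2.0, 3.0, 4.0, 5.0]          # toy targets
predicted = [2.0 * o + 1.0 for o in observed]  # toy outputs on an exact line
r_value = correlation_coefficient(predicted, observed)
slope, intercept = fit_line(observed, predicted)
```

Note that a perfect R does not imply a perfect model: in this toy case R equals 1 although the outputs are biased (slope 2, intercept 1), which is why the slope, the intercept, and the error metrics are inspected alongside R.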

The error, denoted as E_{t}, is the difference between the estimated and original values, calculated as

E_{t} = p_{t} - o_{t}

where o_{t} and p_{t} are original and estimated values, respectively. The mean error (ME) is given by Equation (1):

\mathrm{ME} = \frac{1}{N} \sum_{t = 1}^{N} E_{t}

(1)

Mean squared error (MSE) is given by Equation (2):

\mathrm{MSE} = \frac{1}{N} \sum_{t = 1}^{N} {(E_{t})}^{2}

(2)

The square root of MSE is called RMSE. Another error metric is the mean absolute error (MAE) given in Equation (3):

\mathrm{MAE} = \frac{1}{N} \sum_{t = 1}^{N} |E_{t}|

(3)

The average of the absolute percentage error is called MAPE. It can be calculated via Equation (4):

\mathrm{MAPE} = \frac{100}{N} \sum_{t = 1}^{N} \left| \frac{E_{t}}{o_{t}} \right|

(4)

The mean bias error can be calculated using

\mathrm{MBE} = 100 \frac{\sum_{t = 1}^{N} |E_{t}|}{\sum_{t = 1}^{N} o_{t}}

Another metric is the t-statistic. It can be calculated using

t = \sqrt{\frac{(N - 1) \mathrm{MBE}^{2}}{\mathrm{RMSE}^{2} - \mathrm{MBE}^{2}}}

While MAE and RMSE have the same units as the data, the MSE has the square of this unit. MBE and MAPE are expressed as a percentage. If the error approaches zero, the model’s outputs closely approximate the original data. This is expected in all error metrics. Additionally, the mean square deviation (χ²) can be computed utilising

χ^{2} = \sum_{t = 1}^{N} \frac{{E_{t}}^{2}}{o_{t}}
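Equations (1) to (4) can be evaluated together as in the sketch below. The function name and the toy values are hypothetical. MAPE divides by the observed value, so the sketch assumes that no observation equals zero; a series that contains zeros would need those points handled separately.

```python
import math
from typing import Dict, Sequence


def error_metrics(predicted: Sequence[float], observed: Sequence[float]) -> Dict[str, float]:
    """Compute ME, MSE, RMSE, MAE, and MAPE with E_t = p_t - o_t."""
    n = len(observed)
    errors = [p - o for p, o in zip(predicted, observed)]
    me = sum(errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    # Assumes every observed value is non-zero.
    mape = 100.0 / n * sum(abs(e / o) for e, o in zip(errors, observed))
    return {"ME": me, "MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape}


observed = [10.0, 20.0, 40.0]    # toy original values
predicted = [12.0, 18.0, 44.0]   # toy model outputs
metrics = error_metrics(predicted, observed)
```

For any set of errors, RMSE is at least as large as MAE, and the two coincide only when all absolute errors are equal; this relation is a quick sanity check on reported values.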

The error histogram visualises the distribution of E_t using bars with specific intervals [24]. The widths of the bars on the horizontal axis show the error ranges. On the vertical axis, the bar lengths show the “frequency of occurrence.” A well-behaved histogram should meet these criteria: a bell-shaped curve, a symmetric distribution, a peak close to the “zero error” line, and a narrow curve [25]. A bell-shaped curve indicates a random error distribution [26]. Systematic bias may cause an asymmetric distribution. The narrowness of the curve indicates low deviation in the errors. A sharp histogram implies the presence of outliers or extreme errors, meaning the model struggles to explain some observations. If the peak of the curve is close to the zero-error line, it indicates that errors near zero occur frequently. Observing the histogram reveals that the tallest bars are clustered around the “mean error”, although the mean is slightly shifted to the right of the zero-error line.

For variance analysis, a “residuals vs. model prediction” plot should be generated first. If the values are randomly scattered (no pattern) and lie within two parallel bounds, constant variance (homoscedasticity) is present, which is the desired condition. However, if the points form a cone or bowtie pattern, this indicates variable variance (heteroscedasticity).
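The visual check can be complemented by a rough numerical indicator. The sketch below is a heuristic assumed for illustration only (it is not a formal statistical test and is not taken from the study): residuals are ordered by the model prediction, split into a lower and an upper half, and the two residual variances are compared. A ratio near 1 is consistent with constant variance, whereas a ratio far from 1 suggests a cone-shaped pattern.

```python
from typing import List, Sequence


def variance(values: Sequence[float]) -> float:
    """Population variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_ratio(predictions: Sequence[float], residuals: Sequence[float]) -> float:
    """Residual variance of the upper half of predictions over the lower half."""
    pairs = sorted(zip(predictions, residuals))  # order residuals by prediction
    half = len(pairs) // 2
    low: List[float] = [r for _, r in pairs[:half]]
    high: List[float] = [r for _, r in pairs[half:]]
    return variance(high) / variance(low)


predictions = [float(i) for i in range(100)]                     # toy predictions
flat = [(-1.0) ** i for i in range(100)]                         # constant spread
cone = [(-1.0) ** i * (i + 1) / 100.0 for i in range(100)]       # spread grows with prediction

ratio_flat = variance_ratio(predictions, flat)
ratio_cone = variance_ratio(predictions, cone)
```

A bowtie pattern would not be caught by a two-way split, so splitting into three or more bands is a natural extension when that shape is suspected.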

3.4. Modelling Using the NAR Network

NAR is a nonlinear model. It is good at predicting the future of any time-series data. Its input layer consists of past values (lags, d). Its hidden layers learn nonlinear relationships and produce outputs through activation functions. The neurons in each layer are connected to those in the subsequent layer through weighted connections. The learning capability of such networks is provided by weights and biases. In the output layer, the predicted value is computed and, via feedback, used as input in the subsequent iteration. However, the behaviour and performance of the model are significantly influenced by its hyper-parameters, which must be carefully optimised.

The hyper-parameters of NAR networks are the dataset partition sizes for training, testing, and validation; the activation functions in the layers; the training algorithm; training parameters such as the number of training iterations, momentum, learning rate, and learning threshold; the number of hidden layers in the network; and the neuron counts in the hidden layers. The validation performance and generalisation capability of the model are assessed using metrics such as MSE, MAE, and R² to determine the optimal values of these hyper-parameters. The success of NAR networks depends on their architecture and the careful selection of these hyper-parameters. Although specific methods exist for selecting certain hyper-parameters, for example, the hidden layer count, neurons per layer, and iteration counts, many are still determined through trial and error. The number of hidden layers is typically determined based on problem complexity. The more hidden layers a network has, the better it can represent complex features, but the higher its risk of overfitting. A network with zero hidden layers performs well for linearly separable functions or decisions. The ability to approximate functions that map continuously from one finite space to another can be achieved with just one hidden layer. However, the use of two hidden layers allows for the unique representation of arbitrary decision boundaries. Networks with three or more hidden layers can approximate any smooth mapping to an arbitrary level of precision. From these explanations, it is evident that 0, 1, or 2 layers are generally sufficient.

Having too few neurons per layer may lead to inadequate representation, while having too many may result in overfitting. Various formulas exist to determine the optimum neuron number in each layer. One such formula is

N_{h} \leq \frac{N_{s}}{α (N_{i} + N_{o})}

where N_{i} and N_{o} are the neuron numbers in the input and output layers, respectively; N_{s} is the data count in the training set; and α is an arbitrary scaling factor, typically ranging from 2 to 10. N_{h} stands for the neuron number in the hidden layer. In this study, the total number of data points is 45,654, with 80% used for training, so N_{s} = 36,523, N_{i} = 1, N_{o} = 1, and α = 10. Therefore, the neuron number must satisfy N_{h} \leq 3044. Nevertheless, a higher number of neurons can be used provided that overfitting is avoided. The term Configuration_t is the output, while {Configuration_t₋₁, Configuration_t₋₂, Configuration_t₋₃} are inputs, indicating that the output feeds back into the input. In the form of a function, it can be expressed as Configuration_t = f(Configuration_t₋₁, Configuration_t₋₂, Configuration_t₋₃, w, b) where w_{ij} represents the weights of each connection and b means bias. It can improve a model’s prediction efficiency. It adds a constant value leading to systematic behaviour [27].
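The rule of thumb for the hidden-layer size is simple enough to check numerically. The helper below is a hypothetical sketch that evaluates the right-hand side of the inequality. With the inputs listed in the text (N_s = 36,523, N_i = N_o = 1, α = 10) the expression evaluates to roughly 1826, whereas a value of roughly 3044 is what the same expression gives for α = 6; this is noted only as an arithmetic observation, and the appropriate α remains a modelling choice.

```python
def hidden_neuron_bound(n_samples: int, n_inputs: int, n_outputs: int, alpha: float) -> float:
    """Upper bound N_s / (alpha * (N_i + N_o)) on the hidden neuron count."""
    return n_samples / (alpha * (n_inputs + n_outputs))


n_total = 45654
n_s = int(n_total * 0.8)  # training-block size, 80% of the series

bound_alpha_10 = hidden_neuron_bound(n_s, n_inputs=1, n_outputs=1, alpha=10)
bound_alpha_6 = hidden_neuron_bound(n_s, n_inputs=1, n_outputs=1, alpha=6)
print(n_s, bound_alpha_10, bound_alpha_6)
```

Because α acts as a safety margin against overfitting, larger α values give a more conservative (smaller) bound.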

The value of the j^{th} neuron in any layer is the sum of all weighted values coming from the previous layer. Equation (5) governs these calculations.

\mathrm{net}_{j} = b + \sum_{i = 1}^{NN} y_{i} w_{ij}

(5)

where NN is the total number of neurons in the previous layer, \mathrm{net}_{j} is the value of the j^{th} neuron, and y_{i} are the outputs coming from the previous layer.

The j^{th} neuron’s output, out, is calculated via an activation function as seen in Equation (6). Its threshold can lie in [0, 1].

\mathrm{out} = f_{act} (\mathrm{net}_{j})

(6)

If out is less than the threshold, then out can be called weak information. It is forgotten, in the current iteration, by creating no output. Thus, the network can become stronger. The hyperbolic tangent,

\tanh (\mathrm{net}_{j}) = \frac{e^{\mathrm{net}_{j}} - e^{- \mathrm{net}_{j}}}{e^{\mathrm{net}_{j}} + e^{- \mathrm{net}_{j}}}

and the sigmoid,

σ (\mathrm{net}_{j}) = \frac{1}{1 + e^{- \mathrm{net}_{j}}}

are two popular activation functions. The sigmoid function compresses values to the (0, 1) interval, while the hyperbolic tangent maps them to (−1, 1). The output neuron has a linear activation function.
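The forward computation of a NAR network with one hidden layer, together with the closed-loop feedback of its own output, can be sketched as follows. The layer sizes and the weight values are hypothetical placeholders, not the trained network of this study; the hidden layer uses tanh and the output neuron is linear, as described above.

```python
import math
from typing import List, Sequence, Tuple

Params = Tuple[List[List[float]], List[float], List[float], float]


def neuron_net(inputs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Weighted sum of Equation (5): net_j = b + sum_i y_i * w_ij."""
    return bias + sum(y * w for y, w in zip(inputs, weights))


def nar_step(lags: Sequence[float], params: Params) -> float:
    """One forward pass: tanh hidden layer followed by a linear output neuron."""
    hidden_weights, hidden_biases, output_weights, output_bias = params
    hidden = [
        math.tanh(neuron_net(lags, w, b))
        for w, b in zip(hidden_weights, hidden_biases)
    ]
    return neuron_net(hidden, output_weights, output_bias)


def closed_loop_forecast(history: Sequence[float], steps: int, params: Params,
                         n_lags: int = 3) -> List[float]:
    """Feed each prediction back as an input for the next step."""
    window = list(history[-n_lags:])
    forecasts: List[float] = []
    for _ in range(steps):
        lags = window[::-1]  # order: t-1, t-2, t-3
        y = nar_step(lags, params)
        forecasts.append(y)
        window = window[1:] + [y]
    return forecasts


# Hypothetical parameters: two hidden neurons, three lagged inputs.
params: Params = (
    [[0.5, 0.0, 0.0], [0.0, 0.0, 0.0]],  # hidden weights
    [0.0, 0.0],                          # hidden biases
    [1.0, 1.0],                          # output weights
    0.0,                                 # output bias
)
forecasts = closed_loop_forecast([0.2, 0.4, 1.0], steps=2, params=params)
```

In closed-loop mode the prediction errors can accumulate, because each forecast becomes an input of the next step; this is why the closed-loop forecast is a stricter test than one-step-ahead prediction.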

Training employs backpropagation with weight updates using

w = w + \Delta w

where

\Delta w = - η \underbrace{\frac{\partial L}{\partial w}}_{\mathrm{gradient}} + α {(\Delta w)}^{\mathrm{previous}}

Here, α is the momentum and determines the amount of influence from the previous iteration on the present one, η is the step size and is called the learning rate, and L is the MSE loss. α and η are scalar hyper-parameters within the interval [0, 1]. η governs the magnitude of weight updates during gradient descent, directly influencing optimisation dynamics. A greater η expedites training convergence but risks overshooting minima and oscillating around them, potentially yielding suboptimal model performance. Conversely, lower η values reduce convergence speed and may leave the search stalled near local optima. To mitigate this instability, the optimisation process incorporates α, which proportionally adds the weight update from the prior iteration to stabilise the current one. α thereby attenuates oscillations and enhances the likelihood of converging to a global optimum. Figure 4 shows the NAR network’s topology used in this study.
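The update rule with momentum can be demonstrated on a one-dimensional toy problem. The quadratic loss, the starting point, and the values η = 0.1 and α = 0.5 are assumptions chosen only to show the mechanics; they are not the settings of Table 1.

```python
from typing import Tuple


def momentum_step(w: float, grad: float, prev_delta: float,
                  eta: float, alpha: float) -> Tuple[float, float]:
    """Apply delta_w = -eta * grad + alpha * prev_delta and return (w, delta_w)."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta


# Toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3) and whose minimum is w = 3.
w, delta = 0.0, 0.0
for _ in range(100):
    grad = 2.0 * (w - 3.0)
    w, delta = momentum_step(w, grad, delta, eta=0.1, alpha=0.5)

print(w)
```

Setting α to zero recovers plain gradient descent, which makes the effect of the momentum term easy to isolate in experiments.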

In light of the above explanations and findings, the remaining hyper-parameters are determined through trial and error. The optima are given in Table 1.

3.5. Modelling Using the LSTM Network

LSTM is a model with an RNN structure. This network consists of layers. The main layers are an input layer, LSTM layer(s), and an output layer. The input layer processes the data received from external sources. The LSTM layers hold hidden units, the core components responsible for the model’s learning capability. These units calculate data sequentially and learn temporal dependencies. The output layer transforms the information processed by LSTM into a final output format (e.g., classification, regression, or sequence prediction). The recurrent structure of LSTM specifically provides the ability to learn long-term dependencies in time-series data. The general architecture of an LSTM network is shown in Figure 5.

A hidden unit includes three gates: the input gate, forget gate, and output gate. The gates determine which information the model retains and which it discards. Each gate employs sigmoid activation functions that are regulated by weights and biases. This is shown in Figure 6, where i, f, g, and o, respectively, represent the input gate, forget gate, cell candidate, and output gate, and h denotes the hidden state. W \in \mathbb{R}^{d \times e}, R \in \mathbb{R}^{d \times e}, and b \in \mathbb{R}^{e} represent the input weights, recurrent weights, and biases, respectively. t denotes the time step, \{x\} is the set of input variables, \{h_{t}\} is the set of outputs (hidden states), and c denotes the cell state.

In LSTM networks, learning occurs through the update of weights and biases. The process of updating is illustrated by the signal flow within a unit cell in the diagram.

At time step $t$, the values of $c_{t}$ and $h_{t}$ are calculated using their previous values $c_{t-1}$ and $h_{t-1}$. The network state is updated at each step by adding or removing information, and this process is governed by the gates. At time step $t$, the cell state $c_{t}$ is computed using Equation (7):

$$c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot g_{t} \tag{7}$$

The symbol $\odot$ represents the Hadamard product. The hidden state $h_{t}$ is then calculated using Equation (8):

$$h_{t} = o_{t} \odot \tanh(c_{t}) \tag{8}$$

In this equation, tanh refers to the hyperbolic tangent activation function. The terms in the above equations are computed using Equation (9):

$$\begin{aligned}
i_{t} &= \sigma(W_{i} x_{t} + R_{i} h_{t-1} + b_{i}) \\
f_{t} &= \sigma(W_{f} x_{t} + R_{f} h_{t-1} + b_{f}) \\
o_{t} &= \sigma(W_{o} x_{t} + R_{o} h_{t-1} + b_{o}) \\
g_{t} &= \tanh(W_{g} x_{t} + R_{g} h_{t-1} + b_{g})
\end{aligned} \tag{9}$$

where $\sigma(\cdot)$ denotes the sigmoid activation function in the gates. Each gate has weights and biases optimised during training. Weights capture the relationships between the input data and the hidden state, while the biases ensure the correct functioning of each gate. The cell state of the LSTM enhances its ability to learn long-term dependencies, and the structure also helps to mitigate issues such as vanishing gradients.
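To make the signal flow of Equations (7)–(9) concrete, the following is a minimal NumPy sketch of a single LSTM time step. It is an illustration only, not the implementation used in this study: the function name `lstm_step`, the dictionary layout of the weights, the layer sizes, and the random initial values are all assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, R, b):
    """One LSTM time step following Equations (7)-(9).

    W, R, b are dicts keyed by "i", "f", "g", "o" (an illustrative layout).
    """
    i_t = sigmoid(W["i"] @ x_t + R["i"] @ h_prev + b["i"])  # input gate
    f_t = sigmoid(W["f"] @ x_t + R["f"] @ h_prev + b["f"])  # forget gate
    o_t = sigmoid(W["o"] @ x_t + R["o"] @ h_prev + b["o"])  # output gate
    g_t = np.tanh(W["g"] @ x_t + R["g"] @ h_prev + b["g"])  # cell candidate
    c_t = f_t * c_prev + i_t * g_t   # Equation (7)
    h_t = o_t * np.tanh(c_t)         # Equation (8)
    return h_t, c_t

# Hypothetical sizes: one input feature, four hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 1, 4
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in)) for k in "ifgo"}
R = {k: rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in "ifgo"}
b = {k: np.zeros(n_hid) for k in "ifgo"}

# Run a short made-up sequence through the cell, starting from zero states.
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in [0.2, 0.5, 0.1]:
    h, c = lstm_step(np.array([x]), h, c, W, R, b)
```

Because the hidden state is the product of a sigmoid output and a tanh output, every component of `h` stays strictly inside (−1, 1), whereas the cell state `c` is not bounded in this way.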

The hyper-parameters of the LSTM and GRU networks used in this study are presented in Table 2, and their optimal values are generally determined through experimental methods and cross-validation. The input layer is defined in accordance with the characteristics of the dataset. The number of LSTM layers and units is tuned through trial and error to optimise the model’s generalisation capability. The dropout rate is adjusted to prevent overfitting. The fully connected layer ensures that the features extracted from the LSTM layers are appropriately structured for classification or regression tasks. The activation function of the output layer is chosen based on the problem type (e.g., softmax for classification, linear for regression). Solvers such as ADAM, RMSProp, or SGD are used as the training algorithm, and the optimal solver is selected by monitoring the validation error. Training parameters such as the learning rate and training duration are optimised by tracking the loss function and accuracy. Initial weight values are set using Xavier or He initialisation methods to prevent vanishing or exploding gradients, thereby facilitating faster model convergence. The careful tuning of these hyper-parameters significantly enhances model performance and helps to prevent overfitting.

3.6. Modelling Using the GRU Network

GRU was proposed by Chung et al. in 2014 [29]. It is a model with an RNN architecture designed to capture temporal dependencies in sequential data processing. While GRUs share a similar structure with LSTM networks, they differ in terms of the number of gates within the hidden units and the signal flow mechanism (Figure 7).

A GRU hidden unit consists of two primary gating mechanisms, the update gate and the reset gate, each of which combines the weighted input, $x_{t}$, the weighted previous hidden state, $h_{t-1}$, and a bias, $b$. These mechanisms alleviate the vanishing gradient problem, enabling more effective training over deeper and longer sequences. Despite having a simpler architecture than LSTM, a GRU network can learn time dependencies effectively.

Initially, at $t = 0$, the output vector is $h_{0} = 0$. At time $t$, the activation of GRU, denoted as $h_{t}$, is a linear interpolation between the previous activation $h_{t-1}$ and the candidate activation as given in Equation (10).

$$h_{t} = (1 - z_{t}) \odot h_{t-1} + z_{t} \odot \tilde{h}_{t} \tag{10}$$

Here, $h_{t}$ is the output vector, and $\tilde{h}_{t}$ stands for candidate activation, which is calculated as in Equation (11).

$$\tilde{h}_{t} = \tanh(W_{h} x_{t} + U_{h} (r_{t} \odot h_{t-1}) + b_{h}) \tag{11}$$

Update gate ($z_{t}$): This mechanism regulates how much $h_{t}$ is influenced by $h_{t-1}$. It is defined using a sigmoid activation function, $z_{t} = \sigma(W_{z} x_{t} + U_{z} h_{t-1} + b_{z})$, where $W_{z}$ represents the weights of $x_{t}$, $U_{z}$ represents the weights of $h_{t-1}$, and $b_{z}$ is the bias term in the update gate. The sigmoid activation function, $\sigma$, is employed in this gate. It compresses values to the [0, 1] interval, enabling the gate to operate in an on/off manner, thereby allowing the GRU to selectively pass on information.

Reset gate ($r_{t}$): This mechanism determines how much $h_{t-1}$ should influence the candidate activation $\tilde{h}_{t}$ and, through it, $h_{t}$. It has a sigmoid activation function, $r_{t} = \sigma(W_{r} x_{t} + U_{r} h_{t-1} + b_{r})$, where $W_{r}$ and $U_{r}$ represent the weights of $x_{t}$ and $h_{t-1}$, respectively, and $b_{r}$ is the bias in the reset gate.
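The update and reset gates, together with Equations (10) and (11), can be sketched as a single GRU time step in NumPy. This is an illustrative sketch under assumed names and sizes (`gru_step`, the parameter dictionary `p`, one input feature, four hidden units), not the configuration listed in Table 2.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU time step following Equations (10) and (11)."""
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])  # update gate
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])  # reset gate
    # Candidate activation, Equation (11): the reset gate scales h_prev.
    h_cand = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r_t * h_prev) + p["bh"])
    # Interpolation between old state and candidate, Equation (10).
    return (1.0 - z_t) * h_prev + z_t * h_cand

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_in, n_hid = 1, 4
p = {}
for gate in "zrh":
    p["W" + gate] = rng.normal(scale=0.1, size=(n_hid, n_in))
    p["U" + gate] = rng.normal(scale=0.1, size=(n_hid, n_hid))
    p["b" + gate] = np.zeros(n_hid)

h = np.zeros(n_hid)  # h_0 = 0, as stated in the text
for x in [0.2, 0.5, 0.1]:
    h = gru_step(np.array([x]), h, p)
```

Compared with an LSTM cell, the step keeps a single state vector and three weight groups instead of four, which is the structural reason a GRU usually has fewer parameters for the same number of hidden units.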

Since the GRU network is a gated RNN closely related to the LSTM network, the LSTM parameters given in Table 2 are used for it as well. In addition, tanh is used as the state activation function, while sigmoid is used as the gate activation function.

3.7. Modelling Using the CNN-RNN Hybrid Network

CNN-RNN hybrid networks integrate convolutional neural networks with recurrent neural networks to learn both spatial and temporal dependencies in sequential data. In this hybrid architecture, CNN layers are responsible for learning spatial features from the input and extracting specific patterns. The filters and biases of the convolutional layers are optimised to detect relevant features within the data. The RNN layers, on the other hand, capture temporal dependencies by processing the sequential features derived from the CNN, thereby evaluating their temporal context. In such architectures, gated units such as LSTM or GRU are typically employed, and their weights and biases are optimised to enable the effective learning of long-term dependencies. Additionally, the network incorporates regularisation techniques such as dropout and batch normalisation, which contribute to an improved generalisation performance. The weights between layers are generally optimised using gradient descent algorithms and updated based on performance metrics evaluated on the validation set. This multilayer structure of the CNN-RNN hybrid network enables efficient modelling of both spatial and temporal relationships. Figure 8 presents its topology.

In CNN-RNN hybrid networks, hyper-parameters such as the dilation factor, pool size, and stride are tuned to optimise the learning performance and generalisation capability of the model. The dilation factor controls the spacing between filter elements during the convolution process and allows for the construction of large receptive fields. Its optimal value depends on the scale of the dataset. The pool size refers to the size of the region considered during pooling operations and is optimised to reduce spatial dimensions while minimising information loss. The selection of pool size is typically made through empirical methods or cross-validation techniques. The stride parameter defines the step size of the filter movement across the data, affecting both computational cost and output size. The optimal stride value is determined based on model performance criteria, available computational resources, and the requirements of the target application.
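The joint effect of the dilation factor, pool size, and stride on the output size can be checked with a few lines of arithmetic. The sketch below assumes the widely used floor convention for convolution and pooling output lengths; the function name `output_length` and the example sizes are hypothetical and are not taken from Table 3.

```python
def output_length(n, kernel, stride=1, dilation=1, padding=0):
    """Output length of a 1-D convolution or pooling window (floor convention)."""
    effective_kernel = dilation * (kernel - 1) + 1  # receptive span of one filter
    span = n + 2 * padding - effective_kernel
    if span < 0:
        raise ValueError("input shorter than the effective kernel")
    return span // stride + 1

# A filter of size 3 with dilation 2 covers 5 input positions.
conv_len = output_length(100, kernel=3, dilation=2)      # 100 -> 96
# Pooling with pool size 2 and stride 2 halves the length.
pool_len = output_length(conv_len, kernel=2, stride=2)   # 96 -> 48
```

The example shows why these hyper-parameters interact: a larger dilation widens the receptive field without adding weights, while a larger stride or pool size shrinks the output and therefore the cost of every layer that follows.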

Each layer in the CNN architecture fulfils a specific function within the deep learning model. The sequence input layer allows for the proper ingestion of sequential data into the network, preserving temporal or ordering information. The sequence folding layer transforms the sequential input into a two-dimensional format, enabling convolutional layers to operate on the data. The convolution2D layer extracts spatial features from the input, such as edges, textures, or other local patterns. The batch normalisation layer normalises intermediate outputs to accelerate training and mitigate overfitting. The ELU (Exponential Linear Unit) layer serves as a nonlinear activation function and improves performance in deeper networks by smoothing gradient flow in negative input regions. The average pooling2D layer reduces the spatial dimensions and creates a more condensed representation of the features. The sequence unfolding layer restores the sequential structure by converting the two-dimensional representation back to a temporal format, thus preserving the time-dependent nature of the data. The flatten layer reshapes multi-dimensional input into a single-dimensional vector, typically used as input for fully connected layers. Working in concert, these layers form an effective model for learning complex structures in sequential data. Architecture and hyper-parameters of each layer are detailed in Table 3.

In this study, two bidirectional long short-term memory (BILSTM) layers are implemented as the RNN framework. While BILSTM was initially conceptualised in 2005 [30], its hybrid configuration was formally introduced in 2017 [31]. The BILSTM structure comprises two distinct LSTM layers, configured with 64 and 16 hidden units here, respectively. A dropout layer, parameterised with a probability of 0.2, is inserted between the two layers to mitigate overfitting. The architecture terminates with a fully connected layer and a regression output layer, enabling sequential prediction tasks. RNN-based models, such as BILSTM, are adept at capturing both short- and long-term temporal dependencies by learning hierarchical relationships within sequential data. To train the hybrid CNN-RNN framework, an initial 80% of the dataset is allocated to model training, while the remaining 20% is reserved for testing, facilitating the evaluation of predictive accuracy on unseen temporal sequences.

3.8. Modelling Using ANFIS

ANFIS has a hybrid architecture that combines fuzzy logic with artificial neural networks, leveraging the strengths of both approaches. It comprises five fundamental layers that process input data to produce a crisp output, with each layer performing a distinct function. In the fuzzification layer, input variables are transformed into fuzzy sets, and each input is assigned appropriate fuzzy values through predefined membership functions (MFs). This transformation allows the system to effectively handle uncertainty. In the multiplication layer, the antecedent parts of the fuzzy rules are combined, and the firing strength of each rule is computed. When the “AND” operator is used, the outputs of the membership functions are multiplied to determine the weight of the corresponding rule. The normalisation layer normalises each rule’s weight by dividing it by the total weight, thereby calculating the relative importance of each rule. The defuzzification layer converts the fuzzy outputs into crisp values using specific mathematical techniques, most commonly the weighted average method. Finally, the aggregation layer combines the outputs of all rules to generate the final system output. This multilayer structure enables ANFIS to successfully model complex nonlinear systems by integrating ANN’s learning capability with fuzzy logic’s interpretability, making it a powerful tool for prediction and modelling. Figure 9 illustrates the topology of the ANFIS network used. The architecture and parameters of each layer are presented in Table 4.

In ANFIS, during training, MFs for the input variables are defined, and their clusters are updated to gain learning ability. An MF determines the degree to which an input value belongs to each fuzzy set. An MF can take various forms, such as triangular, trapezoidal, Gaussian, bell-shaped, and sigmoid. These functions enable the modelling of uncertain data in fuzzy logic systems. A generalised fuzzy rule using the “AND” operator is given in Equation (12).

$$\text{Rule } n\text{:} \quad \text{If } (\mathrm{in}_{1} = A_{i}) \land (\mathrm{in}_{2} = B_{j}) \land \dots \land (\mathrm{in}_{\dots} = C_{k}) \ \text{then} \ f_{n} = f(\mathrm{in}_{1}, \mathrm{in}_{2}, \dots, \mathrm{in}_{\dots}) \tag{12}$$

where $\land$ is the “AND” operator, $\mathrm{in}_{1}, \mathrm{in}_{2}, \dots$ are input sequences, and $n = 1, 2, \dots$ is the rule number. The cluster counts for each input variable are $i, j, k = 1, 2, \dots$ in the fuzzification layer. The input variables’ linguistic expressions are $A, B, \dots, C$, which correspond to each cluster. In the defuzzification layer, each node’s output is shown via $f_{n}$. If it is a constant, then ANFIS is called a zero-order Sugeno model, while it becomes a first-order Sugeno model if it is a first-degree polynomial like $f_{n} = p_{n} \, \mathrm{in}_{1} + q_{n} \, \mathrm{in}_{2} + \dots + r_{n} \, \mathrm{in}_{\dots} + s_{n}$, where $p_{n}, q_{n}, r_{n}$, and $s_{n}$ are the parameters determined in the output layer via regression. As a rule, parameters in the “If” clause are called antecedents, while those in the “then” clause are called consequents.

First, the input variables in the fuzzification layer are mapped to fuzzy sets using appropriate MFs. For instance, the Gaussian membership function is given in Equation (13).

$$\mu_{A_{i}}(x) = \exp\left(-\frac{(x - c_{i})^{2}}{2 \sigma_{i}^{2}}\right) \tag{13}$$

where $\mu_{A_{i}}(x)$ denotes the degree to which input $x$ belongs to the i-th fuzzy set, $c_{i}$ is the centre, and $\sigma_{i}$ is the standard deviation of the function. Gaussian, trapezoidal, bell-shaped, and triangular functions are well-known MFs.

In the multiplication layer, each rule’s firing strength is calculated by combining the membership degrees of the input variables with the “AND” operator. This operator can be realised as the product of the membership degrees or as the minimum operator; the latter is given in Equation (14).

$$w_{i} = \min(\mu_{A_{i}}(x_{1}), \mu_{B_{i}}(x_{2})) \tag{14}$$

The firing strength of each rule is normalised, in the normalisation layer, by the total firing strength as in Equation (15).

$$\bar{w}_{i} = \frac{w_{i}}{\sum_{j} w_{j}} \tag{15}$$

where $\bar{w}_{i}$ is the normalised firing strength and $w_{i}$ is the firing strength of the i-th rule.

In the defuzzification layer, in a Takagi–Sugeno model, the output of each rule is either a constant or a linear combination of the input variables. A typical first-order Takagi–Sugeno model is represented by $f_{n} = p_{n} \, \mathrm{in}_{1} + q_{n} \, \mathrm{in}_{2} + \dots + r_{n} \, \mathrm{in}_{\dots} + s_{n}$, where $\{p_{n}, q_{n}, \dots, r_{n}, s_{n}\}$ are the consequents. In the defuzzification process, techniques like the weighted average or centre of gravity are applied and optimised based on the error function.

Finally, in the aggregation layer, the contributions of all rules are combined to produce the final output using a weighted average as in Equation (16).

$$y = \sum_{i} \bar{w}_{i} f_{i} \tag{16}$$

This formula determines the final system output by weighting each rule’s output by its normalised firing strength.
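The five-layer computation of Equations (13)–(16) can be traced end to end with a small first-order Sugeno example. The sketch below is illustrative only: the two rules, their Gaussian centres and widths, and the consequent parameters are made-up values, and the names `gaussian_mf` and `sugeno_output` are assumptions rather than part of the ANFIS model reported in Table 4.

```python
import numpy as np

def gaussian_mf(x, c, sigma):
    """Gaussian membership degree, Equation (13)."""
    return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def sugeno_output(x1, x2, rules):
    """First-order Sugeno inference for two inputs."""
    # Firing strength with the minimum "AND" operator, Equation (14).
    w = np.array([min(gaussian_mf(x1, *r["A"]), gaussian_mf(x2, *r["B"]))
                  for r in rules])
    w_bar = w / w.sum()                      # normalisation, Equation (15)
    # Linear consequent f_n = p * in1 + q * in2 + s for each rule.
    f = np.array([r["p"] * x1 + r["q"] * x2 + r["s"] for r in rules])
    return float(np.sum(w_bar * f)), w_bar   # aggregation, Equation (16)

# Two hypothetical rules; "A" and "B" hold (centre, sigma) of each MF.
rules = [
    {"A": (0.0, 1.0), "B": (0.0, 1.0), "p": 1.0, "q": 0.0, "s": 0.0},
    {"A": (2.0, 1.0), "B": (2.0, 1.0), "p": 0.0, "q": 1.0, "s": 1.0},
]
y, w_bar = sugeno_output(1.0, 1.0, rules)
```

At the input (1, 1) both rules fire equally, so the normalised strengths are 0.5 each and the output is the plain average of the two consequents, 1 and 2, that is, 1.5.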

During the fuzzification stage, the clustering method, the number of clusters, and the type of MF are selected. These significantly affect model sensitivity. FISs provide a computational framework capable of performing logical inference under uncertainty and imprecise information. FIS design methods include rule-based approaches (e.g., Mamdani, Sugeno), data-clustering-based methods (e.g., fuzzy C-means, abbreviated to FCM), and learning techniques utilising genetic algorithms or artificial neural networks. The FCM method partitions data into fuzzy clusters by computing each data point’s degree of membership to each cluster. The FCM algorithm initially selects cluster centres randomly and computes the membership degree $u_{ij}$ of each data point by minimising the objective function in Equation (17).

$$J = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \lVert x_{i} - c_{j} \rVert^{2} \tag{17}$$

Here, $x_{i}$ is the data point, $c_{j}$ is the cluster centre, $N$ is the number of data points, $C$ is the number of clusters, $m$ is the fuzzification parameter, and $u_{ij}$ is the membership degree. FCM iteratively updates the membership degrees and optimises the cluster centres.
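The alternating update of membership degrees and cluster centres can be sketched as follows for one-dimensional data. This is a minimal sketch using the standard FCM update rules, which the text does not spell out; the function names, the synthetic two-cluster data, the fixed iteration count, and the choice m = 2 are all assumptions made for illustration.

```python
import numpy as np

def memberships(x, centres, m):
    """Membership degrees u_ij of 1-D points x for the given centres."""
    d = np.abs(x[:, None] - centres[None, :]) + 1e-12  # distances, shape (N, C)
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True), d     # each row sums to 1

def fcm(x, n_clusters, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = rng.choice(x, size=n_clusters, replace=False)  # random start
    for _ in range(n_iter):
        u = memberships(x, centres, m)[0]
        um = u ** m
        # Each centre is the membership-weighted mean of the data.
        centres = (um * x[:, None]).sum(axis=0) / um.sum(axis=0)
    u, d = memberships(x, centres, m)
    objective = float(((u ** m) * d ** 2).sum())  # Equation (17)
    return centres, u, objective

# Synthetic data: two well-separated groups around 0 and 10.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 0.5, 50), rng.normal(10.0, 0.5, 50)])
centres, u, J = fcm(x, n_clusters=2)
```

A practical implementation would normally stop when the centres or the objective change by less than a tolerance instead of running a fixed number of iterations.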

In the decision-making stage, gradient descent or hybrid learning algorithms are used to establish fuzzy rules and their weights. For the output function, either linear or nonlinear models are used, typically selected to minimise an error metric on the training dataset. The number and type of membership functions (MFs), as well as the cluster size for FCM clustering, are optimised to avoid an overly complex rule-base that could lead to overfitting. A smaller number of MFs is selected to maintain generalisability, balancing training accuracy and model simplicity. These approaches enhance generalisation ability and reduce the risk of overfitting.

3.9. Modelling Using MLP

MLP is one of the earliest ANN techniques developed for supervised learning, with initial applications dating back to 1954. It was not originally designed for time-series analysis; however, it has been included in this study for performance comparison. MLP consists of layers. The main layers are the input, hidden, and output layers. The neurons in each layer process weighted inputs from the previous layer using a nonlinear activation function and transmit the resulting output to the subsequent layer. The input layer receives raw data, providing the foundation for network operation, while the hidden layers learn nonlinear relationships and generate high-level representations of the data. Finally, the output layer performs either classification or regression tasks.

The MLP network topology shown in Figure 10 illustrates the interconnections among the layers and the flow of information within the network.

Since MLP networks have the same layered topology as NAR networks, their hyper-parameters and the methods used to determine them are also the same. To improve generalisation and reduce the risk of overfitting, the number of hidden layers and neurons per layer is limited through a trial-and-error process based on validation error trends. Early stopping is also applied to determine the optimum number of epochs: training is halted at the iteration just before the validation curve starts to diverge. In addition, simpler architectures tend to generalise better.

4. Results and Discussion

To assess generalisation ability and the risk of overfitting, and to evaluate the networks’ prediction performance on time-series data, iterative error convergence curves, error metrics, prediction graphs, residual analysis, regression analysis, and variance analysis are examined. These diagnostics show whether the prediction errors remain stable without variance inflation. They are indirect indicators that overfitting does not affect the models’ performance.

4.1. Error Convergences and Performance Metrics

It is expected that the convergence curves approach zero with each iteration and exhibit no fluctuations in stability throughout the process. Convergence toward zero indicates that the model’s error is consistently decreasing at each iteration. Sudden spikes in the curves, on the other hand, may be a signal of overfitting, often referred to as “memorisation”.

Figure 11 illustrates the training error convergence behaviour of six models. The error is measured in terms of RMSE and plotted against the iteration number. Among the models, GRU and LSTM exhibit superior convergence characteristics, with steadily declining error trends and minimal fluctuations throughout the training iterations. Both models achieve the lowest final RMSE, suggesting strong learning capacity and robustness in capturing the temporal dependencies of the data. MLP and CNN-RNN demonstrate acceptable convergence, with error gradually decreasing as iterations proceed. However, their convergence is less stable than that of GRU and LSTM, as indicated by small oscillations and a slower reduction in error. ANFIS and NAR converge earlier, and their curves remain relatively flat throughout the iterations.

Table 5 presents a comparison of the performance metrics.

NAR and MLP achieve the lowest MAE (2.592 and 2.122, respectively), MAPE (0.895% and 0.876%), and RMSE (1.107 and 1.418), demonstrating strong accuracy in pointwise prediction. Additionally, MLP has the best performance in terms of ME (1.145) and a high R²_test (0.9858), suggesting that it captures most of the variance in the target variable.

LSTM and GRU achieve high R²_test scores (0.9166 and 0.9675, respectively), indicating a good generalisation capability despite their relatively higher absolute errors. However, both models require significantly longer calculation durations (1826.525 s for LSTM and 1724.109 s for GRU), making them computationally more expensive.

ANFIS shows a balanced performance and the highest R²_test (0.9875), meaning that the model explains 98.75% of the variance in the dependent variable, though its computational demand (796.826 s) and high number of iterations (1000) may limit its practical deployment. CNN-RNN, in contrast, has the weakest performance with the highest MAE, MAPE, MSE, and RMSE, as well as a relatively high computation time (1004.711 s), despite reaching 100 iterations.

In summary, MLP and NAR offer a favourable trade-off between accuracy and computational cost, while ANFIS and LSTM provide a strong predictive capacity at the expense of a longer training time. ANFIS demonstrates potential with its high accuracy but requires further tuning. CNN-RNN, however, underperforms relative to its peers across most metrics.
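For reference, the error metrics compared above can be computed from paired actual and predicted values as in the sketch below. The function name `error_metrics` and the three sample values are hypothetical, and ME is taken here to be the mean signed error, which is an assumption because this section does not spell out its definition.

```python
import numpy as np

def error_metrics(y_true, y_pred):
    """Common regression error metrics for a predicted series."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "ME": float(err.mean()),                     # assumed: mean signed error
        "MAE": float(np.abs(err).mean()),
        # MAPE is undefined if any actual value is zero.
        "MAPE_percent": float(100.0 * np.mean(np.abs(err / y_true))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "R2": float(1.0 - ss_res / ss_tot),
    }

# Made-up values purely to exercise the function.
m = error_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
```

For this toy input the errors are 10, −10, and 0, giving MAE ≈ 6.67, MAPE = 5%, RMSE ≈ 8.16, and R² = 0.99. For any dataset, RMSE is at least as large as MAE, which makes a quick sanity check when the two are reported side by side.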

4.2. Predicted Data

Predicted data on the time series are also compared as an additional validation. Figure 12 illustrates the prediction curves generated by six models over the forecasting portion, allowing for a detailed comparative assessment of their predictive behaviours under varying data dynamics. The visual comparison reveals that most models demonstrate a stable prediction performance with limited deviations from the original data, the so-called experimental data.

Notably, the MLP, LSTM, and GRU models exhibit smoother prediction curves, reflecting their robustness in capturing the underlying temporal structure of the dataset. The NAR and ANFIS models also maintain consistent predictions but show occasional small fluctuations in regions with rapid changes in configuration values. On the other hand, the CNN-RNN model displays a relatively higher level of noise, which may suggest sensitivity to short-term volatility or overfitting in localised intervals.

The presence of transient spikes in the predictions, as observed around timestamps 37,600 s and 41,700 s, further highlights the importance of model generalisation in regions with abrupt changes. Such artefacts are less pronounced in models like MLP and LSTM, implying their superior adaptability to both stationary and non-stationary patterns.

In conclusion, the prediction curves support the quantitative findings reported earlier, confirming the higher temporal consistency and accuracy of the MLP, LSTM, and GRU models. These results emphasise their suitability for time-series prediction tasks characterised by complex temporal dependencies and moderate noise levels.

Although the LSTM and GRU models reproduce the general shape of the curve, their predictions show a distinct shift, as reported in previous studies [32].

4.3. Residual Analysis

The residuals (prediction errors) represent the differences between the predicted and actual values at each time step. Ideally, a well-performing model should yield residuals that are symmetrically distributed around zero and exhibit no obvious pattern over time. Figure 13 illustrates the residual distributions of six models, reflecting their prediction errors relative to actual values.

In each case, the residuals become smaller as the time step advances. Figure 13a shows LSTM’s performance. The residuals are mostly concentrated around zero after the time step of 38,000. The early section likely includes transition periods or unusual patterns. Figure 13b gives ANFIS’s residuals. The errors are symmetrically distributed with moderate scatter and fewer extreme outliers compared to LSTM. The residuals still show a visible spread early in the timeline. ANFIS is more robust to nonlinear but smooth transitions due to its fuzzy logic rule-base. However, it may struggle with highly dynamic or stochastic segments. Figure 13c gives the NAR model’s residuals. There is a similar pattern to ANFIS with symmetric error spread and relatively fewer extreme values. The residuals stabilise quickly. NAR networks are recursive and perform well in autoregressive tasks. Figure 13d shows GRU’s performance. It has a relatively small scatter and fewer extreme errors, especially after the 38,000 mark. The distribution is tighter than LSTM’s. GRU converges faster and generalises better than LSTM when the training data are limited. It may offer a good balance between complexity and performance. Figure 13e shows CNN-RNN’s performance. The distribution is similar, with minimal extreme values. Figure 13f shows MLP’s performance. It exhibits a slightly broader error distribution. Lacking memory, MLP models rely solely on current and recent input values without temporal feedback, making them less suited to time-series prediction unless lagged inputs are manually engineered. Overall, NAR and MLP show superior performance. GRU offers faster performance. LSTM shows potential but performs less well. ANFIS performs reasonably well but may not handle highly dynamic segments or abrupt transitions efficiently.

4.4. Variance Analysis

Figure 14 evaluates the homoscedasticity of the models. The scatter of all models stays between two parallel boundaries. This indicates homoscedastic variance, suggesting good generalisability. NAR and GRU exhibit the lowest variance, with GRU’s predictions tightly clustered near zero, demonstrating exceptional consistency. However, the asymmetry in boundary clarity—distinct lower bounds versus ambiguous upper bounds—arises from the nature of the time-series data. The series does not have any negative values. The upper boundary ambiguity stems from the absence of strict maxima in the data, allowing overpredictions to exhibit greater variability.

Domain-aware regularisation, augmenting training data in underrepresented high-value regions, applying transformations (e.g., log-scaling), using bounded activation functions, or post-processing to enforce balanced residual behaviour could be applied to symmetrise the error distributions and avoid overpredictions. However, in order not to change the nature of the data, no pre-processing is applied to the raw data.

4.5. Discussion

In this study, the prediction of the “configuration” parameter is performed using various artificial intelligence and statistical modelling approaches, and the performance of these models is evaluated based on statistical error metrics, namely RMSE, MAE, and MAPE.

Initially, in a comparative study from the literature that investigated standalone and wavelet-based modelling approaches (Table 6), performance metrics for different models (MLP, ANFIS, and NARX) were reported. Hussein and AlAlili (2017) collected time series of the climatological variable global horizontal radiation. They present a model with a lower error [33]. This is because their model includes both its delays and exogenous inputs such as humidity, temperature, wind speed, and sunshine duration. Thus, the model can easily catch the relations among all climatic variables.

It is also important to note that some studies involved pre-processed data and employed different evaluation metrics. Another significant point is the diversity in the modelled parameter and its unit across studies. These differences hinder a direct one-to-one comparison. Nevertheless, the overall behaviour of the models appears to align with the expected outcomes. For instance, Duan et al. (2025) reported that the GRU model performs faster than LSTM [35], which is consistent with the findings of the present study, where the training durations are recorded as 1826.525 s for LSTM and 1724.109 s for GRU.

Overall, the findings suggest that the modelling approaches proposed in this study outperform their counterparts in the literature in terms of both lower error and higher prediction accuracy. In particular, the ability of the NAR and MLP models to achieve MAPE values below 1% underscores their potential for high-precision prediction and confirms their superior performance relative to existing models. At the beginning of this study, it was hypothesised that AL techniques could be used to model and predict failures in an industrial assembly line. At the end of the study, it is seen that these techniques have a better modelling and prediction performance.

Configuration data inherently contain numerous peaks and dips. Machine-learning techniques generally exhibit a better modelling performance for data with periodic or non-periodic peak-and-dip patterns compared to traditional regression methods. A key distinction of machine-learning models is their dynamic structure, which allows them to achieve a higher predictive performance. For instance, in the NAR network, this adaptability is achieved by updating the weights during training. In the ANFIS method, it is accomplished by adjusting the cluster size of MFs throughout the training process. LSTM networks, on the other hand, are composed of self-repeating units. During training, such networks can become more adaptive to the input data due to their layered architecture and features such as gated structures, bias terms, dropout mechanisms, and activation thresholds.

The training algorithm essentially represents an optimisation process aimed at minimising the prediction error. In this study, the Levenberg–Marquardt algorithm was selected [38]. To construct an MLP network, its hyper-parameters must first be appropriately selected [39,40]. The NAR network is a type of MLP [41]. Even when the training performance of the network is high, overfitting and issues such as gradient exploding or vanishing must be prevented. Model accuracy is evaluated through validation sets, cross-validation, and error analysis, while its true performance is measured with the test set. Within the scope of this study, the terms “network” and “model” are used interchangeably, since a network essentially represents a mathematical function, as well.

The optimisation used here is based on gradient descent. The gradient is the vector of partial derivatives of the loss function with respect to the network parameters, that is, the rate at which the error changes as each weight changes. Vanishing gradients occur when the gradient shrinks during backpropagation, thereby reducing the model’s learning capacity. Gradient vanishing usually occurs when activation functions such as sigmoid or tanh are used. Gradient explosion, on the other hand, causes the optimisation to become unbalanced due to the excessive growth of the weights. Functions such as ReLU and its derivatives are preferred to prevent gradient vanishing, while weight normalisation and gradient-limiting techniques are applied to prevent gradient explosion. Also, different initialisation techniques (e.g., Xavier or He), gradient clipping, learning rate adjustment, and adaptive optimisation algorithms provide effective solutions.

Dataset partitioning is a critical step for assessing model performance and enhancing generalisation. The training set is used for learning, the validation set is employed for hyper-parameter tuning and early stopping, and the test set is employed to evaluate generalisation. The proportions of these partitions depend on dataset size. A commonly adopted split is 70–80% for training, 10–15% for validation, and 10–15% for testing.
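For time-series data, the partitions are usually taken in chronological order so that validation and test samples always lie after the training samples. The sketch below shows one such split; the function name `chronological_split` and the 70/15/15 proportions are illustrative assumptions, and the list of integers merely stands in for the 45,654 records.

```python
def chronological_split(series, train=0.70, val=0.15):
    """Split a sequence into train/validation/test parts without shuffling."""
    n = len(series)
    n_train = int(n * train)
    n_val = int(n * val)
    return (series[:n_train],
            series[n_train:n_train + n_val],
            series[n_train + n_val:])  # the remainder becomes the test set

# Placeholder data: indices standing in for the 45,654 records.
data = list(range(45654))
train_set, val_set, test_set = chronological_split(data)
```

With these proportions the three parts contain 31,957, 6848, and 6849 records, respectively. Keeping the order intact avoids leaking future values into the training set, which a random shuffle would do.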

Open-loop prediction is performed when test data (actual values) are available. For example, if the actual values of a time series are known from time steps 1 to t − 1, the value at time t can be predicted. To predict the value at time t + 1, the actual value at time t is first recorded and then used as input. Closed-loop prediction, by contrast, uses previous predictions as inputs to forecast future time steps, requiring no further actual values. For instance, if the goal is to predict values from time t to t + k using data only from time steps 1 to t − 1, the predicted value at time t becomes the input for predicting the value at t + 1, and so on. Closed-loop prediction is useful for long-term forecasting or when real-time inputs are unavailable, but it increases the risk of accumulated error. Open-loop prediction is faster and more stable and is preferred for short-term forecasts.
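The difference between the two prediction modes can be illustrated with a toy one-step model. In the sketch below, `predict_next` is a hypothetical stand-in for a trained network (it simply multiplies the previous value by 0.9), and the observed values are made up.

```python
def predict_next(prev_value):
    """Hypothetical one-step model standing in for a trained network."""
    return 0.9 * prev_value

def open_loop(model, observed):
    # Each prediction uses the recorded actual value of the previous step.
    return [model(v) for v in observed[:-1]]

def closed_loop(model, last_observed, n_steps):
    # Each prediction is fed back as the input for the next step.
    preds, current = [], last_observed
    for _ in range(n_steps):
        current = model(current)
        preds.append(current)
    return preds

observed = [10.0, 9.5, 8.0, 7.9]
open_preds = open_loop(predict_next, observed)            # uses 10.0, 9.5, 8.0
closed_preds = closed_loop(predict_next, observed[0], 3)  # uses only 10.0
```

Both modes agree on the first prediction (9.0). After that, the open-loop predictions are corrected by each new observation (about 8.55 and 7.2), while the closed-loop predictions drift according to the model alone (about 8.1 and 7.29), which is how errors accumulate over long horizons.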

Overfitting results in poor performance on new data. This typically arises from small datasets, overly complex models, or lack of regularisation. Solutions include gathering more data, applying L1 or L2 regularisation, adjusting learning rates and momentum, using dropout, simplifying the model, and implementing early stopping. These approaches enhance the model’s generalisation capability and yield a balanced performance. Overfitting can be identified by monitoring performance (error convergence) curves during training. These curves allow one to track model behaviour and observe whether the loss is decreasing. Similar training and validation losses indicate good generalisation, while flattening curves suggest that learning has completed. A continued decrease in training error concurrent with an increasing or destabilising validation error signifies overfitting. In such cases, training should be stopped. The error convergence curve of the NAR network is shown in Figure 15. As can be observed, all error curves approach zero. Up to 10 iterations, all curves (training, validation, and test) converge steadily toward zero error. Beyond 10 iterations, however, the situation changes. While the training error continues to decrease, the validation error begins to rise during the next 6 iterations, indicating that the optimal number of iterations is 10. Training beyond this point risks overfitting and a loss of stability.

Since the Levenberg–Marquardt algorithm combines gradient descent with the Gauss–Newton method, the decrease in gradient and increase in mu in Figure 16 are expected outcomes. The increase in validation error beyond the 10th iteration (green curve in Figure 15) and the subsequent validation failures are more clearly visible in Figure 16c. During optimisation, the gradient should decrease as the model approaches the target, and it reaches zero at an exact minimum or maximum. From this perspective, the gradient reduction shown in Figure 16a is a sign that the training algorithm is approaching an optimal value. Another indicator is the mu value; its increase suggests proximity to optimality. In Figure 16b, the mu curve consistently rises, further indicating successful training. The final indicator is the validation status shown in Figure 16c. At the end of each iteration, the model’s accuracy is assessed on the validation set. As shown, validation succeeds for the first 10 iterations but fails consecutively in the following 6, indicating increasing error. This confirms that the best training outcome is achieved within 10 iterations and that continuing beyond that results in overfitting. If validation fails six times in a row, training stops and the model reverts to the state at the last successful validation (here, the 10th iteration) to ensure the best model is retained.
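
For reference, the weight update of the Levenberg–Marquardt algorithm is commonly written as shown below. This is the standard textbook form rather than a formula taken from this article; $\mathbf{J}$ (Jacobian of the network errors with respect to the weights), $\mathbf{e}$ (error vector), and $\mathbf{w}$ (weight vector) are notation assumed here.

$$\Delta \mathbf{w} = -\left(\mathbf{J}^{\top}\mathbf{J} + \mu \mathbf{I}\right)^{-1}\mathbf{J}^{\top}\mathbf{e}$$

For small mu the step approaches a Gauss–Newton step, whereas for large mu it approaches a short gradient-descent step, since $\mathbf{J}^{\top}\mathbf{e}$ is the gradient of $\tfrac{1}{2}\lVert\mathbf{e}\rVert^{2}$. A rising mu therefore corresponds to increasingly small, cautious updates.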

GRU and NAR emerge as the most balanced models, combining low residuals and minimal variance, which shows their suitability for tasks requiring precision and stability. The results underscore the importance of aligning model choice with task requirements. In the trade-off between precision and speed, GRU favours accuracy, whereas NAR and MLP favour computational efficiency. However, LSTM and GRU require substantial computational resources and extended training times, a limitation that is particularly relevant for real-time applications. ANFIS and CNN-RNN show intermediate performance, with room for improvement in handling outliers and complex patterns.

Although the current study specifically utilises configuration data from the aerospace sector—chosen for their high-precision, low-volume characteristics and critical safety demands—the underlying machine-learning architectures (such as LSTM, GRU, MLP, CNN-RNN, ANFIS, and NAR) are inherently domain-agnostic and can be retrained or fine-tuned for other industrial settings. What distinguishes this work is not the limitation to a particular industry, but rather the demonstration that the time-series prediction of assembly failures using artificial learning is feasible and beneficial even in the most sensitive and stringent production contexts. This implies that in less safety-critical or more high-volume environments, the implementation might be more straightforward and computationally less demanding.

While this study provides critical insights into AL model performance for failure prediction, several limitations warrant consideration. First, the exclusive focus on a single variable (“configuration of failure”) restricts the model’s ability to account for multivariate interactions, such as environmental conditions or machine-specific parameters, which are often critical in industrial failure dynamics. Second, architectures not evaluated here, such as transformers or other attention-based networks, may offer superior temporal modelling. Third, it remains an open question whether results from a dataset originating from a single aerospace assembly line generalise to other manufacturing sectors or failure types. Additionally, computational resource constraints (8 GB RAM, CPU-only processing) precluded hyper-parameter optimisation for computationally intensive models such as LSTM and CNN-RNN, potentially underrepresenting their capabilities. Nevertheless, the study’s rigorous within-context analysis establishes a foundational framework for AL-driven predictive maintenance, with future research directions emphasising multivariable integration, cross-industry validation, and advanced computational resources.

5. Conclusions

This study focuses on the modelling and prediction of operational failures within aerospace assembly lines to enable proactive defect mitigation. Accurate prediction of failure patterns is critical for pre-emptive intervention, as it facilitates the avoidance of costly disruptions in manufacturing processes. The experimental dataset consists of 45,654 time-stamped records from a mounting process variable. To analyse these, supervised machine and deep learning methodologies are implemented, including a multilayer perceptron (MLP), nonlinear autoregressive network (NAR), adaptive neuro-fuzzy inference system (ANFIS), long short-term memory network (LSTM), gated recurrent unit (GRU), and a hybrid architecture integrating convolutional neural networks (CNNs) with recurrent neural networks (RNNs). These techniques are selected for their capacity to model nonlinear temporal dependencies and sequential behaviours inherent in industrial failure dynamics.

The comparative analysis reveals that NAR and MLP are the most effective techniques in terms of both predictive accuracy and computational efficiency. MLP achieved the lowest MAE (2.122) and ME (1.145), with an RMSE of 1.418. NAR followed closely, with the lowest RMSE (1.107), a high coefficient of determination (R² = 0.9867), and the shortest computation time (4.847 s), indicating strong real-time applicability.

ANFIS also demonstrates competitive accuracy (MAE = 2.805, RMSE = 1.409) and the highest R² value (0.9875), but at the cost of significantly higher computational overhead due to its iterative clustering mechanism (1000 iterations). GRU and CNN-RNN, although generally effective in modelling temporal dependencies, have relatively higher error rates and training durations (e.g., GRU: MAE = 4.124, time = 1724 s; CNN-RNN: MAE = 5.063, time = 1004 s), which may limit their deployment in latency-sensitive environments. LSTM, while widely recognised for sequential learning, shows suboptimal performance in this case, with the highest MAE (5.081), an MSE of 17.297, and the longest training time (1826 s).

The study demonstrates that simpler architectures such as NAR and MLP can outperform more complex models in specific industrial contexts, particularly when low latency and interpretability are prioritised. It thereby contributes to predictive maintenance strategies, validates the potential of artificial learning models to mitigate late-stage failures, and suggests a path toward more generalised applications.

In future work, the research can be extended by adding exogenous inputs, including multi-modal data and domain-specific factors (e.g., human error, environmental effects). Expanding along these dimensions could facilitate the broader adoption of AI-driven predictive maintenance in complex manufacturing ecosystems beyond aerospace.

Author Contributions

Conceptualisation, M.A. and M.C.S.; methodology, M.A. and M.C.S.; software, M.A. and M.C.S.; validation, M.A. and M.C.S.; formal analysis, M.A.; investigation, M.A.; resources, M.A.; data curation, M.A.; writing—original draft preparation, M.A. and M.C.S.; writing—review and editing, M.C.S.; visualisation, M.A. and M.C.S.; supervision, M.A. and M.C.S.; project administration, M.A. and M.C.S.; funding acquisition, M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Acknowledgments

The authors would like to thank Mehmet Seyhan, Karadeniz Technical University, for providing the opportunity to use Matlab© version 9.13.0 (R2022b) for educational purposes. This study has been produced from the Master’s Thesis of Mert Can Sen, accepted by Nigde Omer Halisdemir University, Graduate School of Natural and Applied Sciences. The authors appreciate the valuable comments from the reviewers.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

Abbreviation	Full Form
ACF	Autocorrelation Function
ADAM	Adaptive Moment Estimation
AL	Artificial learning
ANFIS	Adaptive neuro-fuzzy inference system
AR	Autoregressive
BILSTM	Bidirectional long short-term memory
CNN-RNN	Convolutional neural network–recurrent neural network hybrid
CPU	Central Processing Unit
ELU	Exponential Linear Unit
FCM	Fuzzy C-means
FIS	Fuzzy inference system
FNN	Fuzzy neural network
GRU	Gated recurrent unit
LSTM	Long short-term memory
MAE	Mean absolute error
MBE	Mean bias error
MF	Membership function
ML	Machine learning
MLP	Multilayer perceptron
MSE	Mean squared error
MAPE	Mean absolute percentage error
NAR	Nonlinear autoregressive
NARX	Nonlinear autoregressive with exogenous inputs
PACF	Partial Autocorrelation Function
RNN	Recurrent neural network
RMSE	Root mean square error
RMSProp	Root mean square propagation
R	Correlation coefficient
R²	Coefficient of determination
SGD	Stochastic gradient descent
SVR	Support Vector Regression

References

1. Montgomery, D.C. Introduction to Statistical Quality Control, 8th ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2019; p. 768.
2. Kusiak, A. Smart manufacturing. Int. J. Prod. Res. 2018, 56, 508–517.
3. Tobon-Mejia, D.A.; Medjaher, K.; Zerhouni, N.; Tripot, G. A Data-Driven Failure Prognostics Method Based on Mixture of Gaussians Hidden Markov Models. IEEE Trans. Reliab. 2012, 61, 491–503.
4. Amiroh, K.; Rahmawati, D.; Wicaksono, A.Y. Intelligent System for Fall Prediction Based on Accelerometer and Gyroscope of Fatal Injury in Geriatric. Jurnal Nasional Teknik Elektro 2021, 10, 154–159.
5. Çelen, S. Availability and modelling of microwave belt dryer in food drying. J. Tekirdag Agric. Fac. 2016, 13, 71–83.
6. Karacabey, E.; Aktaş, T.; Taşeri, L.; Uysal Seçkin, G. Examination of different drying methods in Sultana Seedless Grapes in terms of drying kinetics, energy consumption and product quality. J. Tekirdag Agric. Fac. 2020, 17, 53–65.
7. Hopfield, J.J. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 1982, 79, 2554–2558.
8. Werbos, P.J. Generalization of backpropagation with application to a recurrent gas market model. Neural Netw. 1988, 1, 339–356.
9. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
10. Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734.
11. Hu, Y.; Wong, Y.; Wei, W.; Du, Y.; Kankanhalli, M.; Geng, W. A novel attention-based hybrid CNN-RNN architecture for sEMG-based gesture recognition. PLoS ONE 2018, 13, e0206049.
12. Shekar, P.R.; Mathew, A.; Sharma, K.V. A hybrid CNN–RNN model for rainfall–runoff modeling in the Potteruvagu watershed of India. CLEAN—Soil Air Water 2024, 53, 2300341.
13. Liu, H.; Jin, Y.; Song, X.; Pei, Z. Rate of Penetration Prediction Method for Ultra-Deep Wells Based on LSTM–FNN. Appl. Sci. 2022, 12, 7731.
14. Chumakova, E.V.; Korneev, D.G.; Chernova, T.A.; Gasparian, M.S.; Ponomarev, A.A. Comparison of the application of FNN and LSTM based on the use of modules of artificial neural networks in generating an individual knowledge testing trajectory. J. Eur. Des Systèmes Automatisés 2023, 56, 213–220.
15. Yu, H.-C.; Wang, Q.-A.; Li, S.-J. Fuzzy Logic Control with Long Short-Term Memory Neural Network for Hydrogen Production Thermal Control System. Appl. Sci. 2024, 14, 8899.
16. Liu, F.; Dong, T.; Liu, Q.; Liu, Y.; Li, S. Combining fuzzy clustering and improved long short-term memory neural networks for short-term load forecasting. Electr. Pow. Syst. Res. 2024, 226, 109967.
17. Wang, W.; Shao, J.; Jumahong, H. Fuzzy inference-based LSTM for long-term time series prediction. Sci. Rep. 2023, 13, 20359.
18. Xin, T.L.L. Fuzzy Embedded Long Short-Term Memory (FE-LSTM) with Applications in Stock Trading; Final Year Project (FYP); Nanyang Technological University: Singapore, 2022.
19. Liu, L.; Fei, J.; An, C. Adaptive Sliding Mode Long Short-Term Memory Fuzzy Neural Control for Harmonic Suppression. IEEE Access 2021, 9, 69724–69734.
20. Suppiah, R.; Kim, N.; Sharma, A.; Abidi, K. Fuzzy inference system (FIS)—Long short-term memory (LSTM) network for electromyography (EMG) signal analysis. Biomed. Phys. Eng. Express 2022, 8, 065032.
21. Topaloğlu Yıldız, Ş.; Yıldız, G.; Cin, E. Mathematical programming and simulation modeling based solution approach to worker assignment and assembly line balancing problem in an electronics company. Afyon Kocatepe Univ. J. Econ. Adm. Sci. 2020, 22, 57–73.
22. Demirkol Akyol, Ş. Linear programming approach for type-2 assembly line balancing and workforce assignment problem: A case study. Dokuz Eylül Univ. Fac. Eng. J. Sci. Eng. 2023, 25, 121–129. (In Turkish)
23. Korkmaz, C.; Kacar, İ. Explaining Data Preprocessing Methods for Modeling and Forecasting with the Example of Product Drying. J. Tekirdag Agric. Fac. 2024, 21, 482–500. (In Turkish)
24. Kutner, M.H.; Nachtsheim, C.J.; Neter, J. Applied Linear Regression Models; McGraw-Hill Higher Education: New York, NY, USA, 2003.
25. Seber, G.A.F.; Lee, A.J. Linear Regression Analysis; Wiley: Hoboken, NJ, USA, 2012.
26. Gujarati, D.N.; Porter, D.C. Basic Econometrics, 5th ed.; McGraw-Hill/Irwin: New York, NY, USA, 2009; p. 921.
27. Montesinos López, O.A.; Montesinos López, A.; Crossa, J. Fundamentals of Artificial Neural Networks and Deep Learning. In Multivariate Statistical Machine Learning Methods for Genomic Prediction; Montesinos López, O.A., Montesinos López, A., Crossa, J., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 379–425.
28. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
29. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. In Proceedings of the NIPS 2014 Workshop on Deep Learning, Montreal, QC, Canada, 8–13 December 2014.
30. Graves, A.; Fernández, S.; Schmidhuber, J. Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition; Springer: Berlin/Heidelberg, Germany, 2005; pp. 799–804.
31. Yenter, A.; Verma, A. Deep CNN-LSTM with combined kernels from multiple branches for IMDb review sentiment analysis. In Proceedings of the 2017 IEEE 8th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference (UEMCON), New York, NY, USA, 19–21 October 2017; pp. 540–546.
32. Kacar, İ.; Korkmaz, C. Prediction of agricultural drying using multi-layer perceptron network, long short-term memory network and regression methods. Gümüşhane Univ. J. Sci. Technol. 2022, 12, 1188–1206. (In Turkish)
33. Hussain, S.; AlAlili, A. A hybrid solar radiation modeling approach using wavelet multiresolution analysis and artificial neural networks. Appl. Energ. 2017, 208, 540–550.
34. Aslanargun, A.; Mammadov, M.; Yazici, B.; Asma, S. Comparison of ARIMA, neural networks and hybrid models in time series: Tourist arrival forecasting. J. Stat. Comput. Simul. 2007, 77, 29–53.
35. Duan, W.; Zhang, K.; Wang, W.; Dong, S.; Pan, R.; Qin, C.; Chen, H. Parameter prediction of lead-bismuth fast reactor under various accidents with recurrent neural network. Appl. Energ. 2025, 378, 124790.
36. Zhu, R.; Sun, Q.; Han, X.; Wang, H.; Shi, M. A novel dual-channel deep neural network for tunnel boring machine slurry circulation system data prediction. Adv. Eng. Softw. 2025, 201, 103853.
37. Yu, Z.; He, X.; Montillet, J.-P.; Wang, S.; Hu, S.; Sun, X.; Huang, J.; Ma, X. An improved ICEEMDAN-MPA-GRU model for GNSS height time series prediction with weighted quality evaluation index. GPS Solut. 2025, 29, 113.
38. Khan, F.M.; Gupta, R. ARIMA and NAR based prediction model for time series analysis of COVID-19 cases in India. J. Saf. Sci. Resil. 2020, 1, 12–18.
39. Olmedo, M.T.C.; Mas, J.F.; Paegelow, M. The Simulation Stage in LUCC Modeling; Olmedo, M.T.C., Paegelow, M., Mas, J.F., Escobar, F., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 451–455.
40. Pinkus, A. Approximation theory of the MLP model in neural networks. Acta Numer. 1999, 8, 143–195.
41. Korkmaz, C.; Kacar, İ. Time-series prediction of organomineral fertilizer moisture using machine learning. Appl. Soft Comput. 2024, 165, 112086.

Figure 1. Schematic representation of prediction workflow.

Figure 2. Collected data and batch size.

Figure 3. Correlogram of the series: (a) ACF and (b) PACF.

Figure 4. NAR network’s topology used.

Figure 5. Layer topology of the LSTM network.

Figure 6. Signal flow in an LSTM hidden unit.

Figure 7. Signal flow in a GRU hidden unit.

Figure 8. The CNN-RNN structure used.

Figure 9. The ANFIS structure used.

Figure 10. MLP network topology.

Figure 11. Error convergence of models in prediction.

Figure 12. Prediction curves of AL models.

Figure 13. Residuals of the (a) LSTM, (b) ANFIS, (c) NAR, (d) GRU, (e) CNN-RNN, and (f) MLP models.

Figure 14. Variance of (a) LSTM, (b) ANFIS, (c) NAR, (d) GRU, (e) CNN-RNN, and (f) MLP.

Figure 15. MSE error convergence plot and determination of the optimal number of iterations.

Figure 16. Training performance of the NAR network: (a) gradient, (b) mu, and (c) validation status.

Table 1. Hyper-parameters of the NAR technique.

Parameter	Value
Stopping criteria of training	MSE = 0.0, or 6 consecutive validation fails, or max 1000 epochs, or min gradient 1 × 10⁻⁷, or max mu 1 × 10¹⁰
Number of hidden layers	2
Number of neurons in each hidden layer	15 and 10, respectively
Activation functions	Hyperbolic tangent sigmoid in hidden layers, linear in the output layer.
Training algorithm	Levenberg–Marquardt
Delay (d)	3, feedback of Configuration_t₋₁, Configuration_t₋₂, Configuration_t₋₃
Inputs	Configuration_t₋₁, Configuration_t₋₂, Configuration_t₋₃
Output	Configuration_t
Output threshold	0.99
Learning rate	0.1
Momentum	0.1
Learning threshold	0.0001
Batch size	80% (modelling *), 20% (forecasting); all data normalised

(*) Modelling includes 75% training, 15% testing, and 15% validation.
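
The three-lag input structure listed in Table 1 (inputs Configuration_t₋₁ to Configuration_t₋₃, output Configuration_t) and the 80%/20% chronological split can be sketched in Python as follows. This is an illustrative sketch, not the authors’ Matlab code; the function names are hypothetical, and the split is assumed to be chronological rather than shuffled.

```python
import numpy as np


def make_lagged(series, n_lags=3):
    """Build an input matrix of lagged values and the aligned target vector.

    Column k - 1 of X holds the value at t - k, so each row is
    [y(t-1), y(t-2), ..., y(t-n_lags)] and the target is y(t).
    """
    y = np.asarray(series, dtype=float)
    X = np.column_stack([y[n_lags - k: len(y) - k] for k in range(1, n_lags + 1)])
    target = y[n_lags:]
    return X, target


def chronological_split(X, y, model_frac=0.8):
    """Keep the first model_frac of samples for modelling, the rest for forecasting."""
    n_model = int(len(y) * model_frac)
    return (X[:n_model], y[:n_model]), (X[n_model:], y[n_model:])


X, target = make_lagged([10, 11, 12, 13, 14], n_lags=3)
print(X)       # rows: [12, 11, 10] and [13, 12, 11]
print(target)  # [13, 14]

X_all, y_all = make_lagged(np.arange(13.0), n_lags=3)   # 10 samples
(model_X, model_y), (fore_X, fore_y) = chronological_split(X_all, y_all)
```

Keeping the split chronological avoids leaking future values into the modelling portion, which matters for any autoregressive setup.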

Table 2. Hyper-parameters of LSTM and GRU networks.

Layer	Configuration	Parameters
Sequential input layer	Time-series data	Configuration
Input variable(s)	3 lag(s) of output	Configuration_t₋₁, Configuration_t₋₂, Configuration_t₋₃
LSTM layer(s)	1 layer	Stack size: 80% (modelling), 20% (forecasting); all data normalised
Hidden unit (count)	64	Gates: sigmoid, State: tanh
Dropout layer	After LSTM layer	Dropout probability: 0.2
Fully connected layer	1 unit	Number of responses = 1 (the configuration data)
Output	1 unit	Regression layer: Loss function = MSE
Training algorithm (Solver)	ADAM	Initial learning rate: 0.03; learning rate schedule: piecewise (drop factor: 0.5, drop period: 100 epochs); gradient explosion threshold: 0.8; L2 regularisation: 0.01; shuffle: every epoch
Weight initialisations	$W, R, b$	He initialiser [28]
Epochs	Iterations per epoch = 1	100
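
The piecewise learning-rate schedule in Table 2 can be read as a step function of the epoch index. The sketch below assumes the rate is multiplied by the drop factor each time a full drop period has elapsed; the function name is hypothetical and the exact convention of the training software may differ.

```python
def piecewise_lr(epoch: int, initial_lr: float = 0.03,
                 drop_factor: float = 0.5, drop_period: int = 100) -> float:
    """Learning rate for a 1-based epoch under a piecewise (step) schedule."""
    n_drops = (epoch - 1) // drop_period   # completed drop periods so far
    return initial_lr * drop_factor ** n_drops


for epoch in (1, 100, 101, 201):
    print(epoch, piecewise_lr(epoch))
```

Under this reading, a run of 100 epochs with a drop period of 100 epochs keeps the initial rate of 0.03 throughout, and the first halving would only take effect from epoch 101.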

Table 3. Proposed CNN-RNN model and layer configuration.

CNN Design
Layers Configuration	Parameters
Sequence input layer	Feedback of three delays of the output [3 1 1]
Sequence folding layer	Sequence input layer → 2D format
Convolution2d layer	Dilation factor = [1 1]
Batch normalisation layer	$\mathrm{data}^{\mathrm{normalised}} = \frac{\mathrm{data} - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation
ELU layer	$f(x) = \begin{cases} x, & x \geq 0 \\ \exp(x) - 1, & x < 0 \end{cases}$
Convolution2d layer	Dilation factor = [2 2]
ELU layer	$f (x)$
Convolution2d layer	Dilation factor = [4 4]
ELU layer	$f (x)$
Convolution2d layer	Dilation factor = [8 8]
ELU layer	$f (x)$
Convolution2d layer	Dilation factor = [16 16]
ELU layer	$f (x)$
Average pooling2d layer	Pool size = 1, stride = [5 5]
Sequence unfolding layer	--
Flatten layer	It converts H-by-W-by-C-by-N-by-S to (HWC)-by-N-by-S
RNN design
BILSTM layer	Hidden unit = 64
Dropout layer	Probability = 0.2
BILSTM layer	Hidden unit = 16; output the last time step of the sequence
Fully connected layer	Number of responses = 1 (configuration failure)
Regression layer	Loss function = MSE
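
Two of the element-wise operations in Table 3 can be written out in Python: the ELU activation in its standard form (identity for x ≥ 0, exp(x) − 1 for x < 0, i.e., scale α = 1) and the mean and standard deviation normalisation. This is a simplified sketch; the small `eps` term is an assumption added for numerical safety, and batch normalisation layers in deep learning frameworks normally also apply a learnable scale and offset, which are omitted here.

```python
import numpy as np


def elu(x, alpha=1.0):
    """Exponential linear unit: x for x >= 0, alpha * (exp(x) - 1) for x < 0."""
    x = np.asarray(x, dtype=float)
    # np.minimum keeps exp() from overflowing on large positive inputs,
    # whose results are discarded by np.where anyway.
    return np.where(x >= 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))


def standardise(data, eps=1e-12):
    """Subtract the mean and divide by the standard deviation."""
    data = np.asarray(data, dtype=float)
    return (data - data.mean()) / (data.std() + eps)


print(elu([-1.0, 0.0, 2.0]))
print(standardise([1.0, 2.0, 3.0]))
```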

Table 4. Hyper-parameters of the ANFIS network.

Layer	Configuration	Parameters
Output variable	Time-series data	Configuration
Input variable(s)	3 lag(s) of output	Configuration_t₋₁, Configuration_t₋₂, Configuration_t₋₃
Batch size	--	80% (modelling), 20% (forecasting); normalised
Fuzzification and data clustering	FCM	Input MF: Gauss; clusters: 10 for each input; partition matrix exponent: $m = 1.5$; initial step size: 0.01; step size decrement rate: 0.9; step size increase rate: 1.1
Defuzzification	Output MF	Zero-order Sugeno and first-order Sugeno
Decision-making	AND	--
Optimisation method	Hybrid	--
Epochs	Iterations per epoch = 1	10, 50, and 100 epochs
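
To make the roles of the Gaussian membership functions and the zero-order Sugeno output in Table 4 concrete, the following Python sketch evaluates a tiny fuzzy inference system. It is illustrative only: the rule parameters are made up, the function names are hypothetical, and the AND operator is assumed to be the product (the table does not state whether product or minimum is used).

```python
import numpy as np


def gauss_mf(x, centre, sigma):
    """Gaussian membership degree of x for a set with the given centre and width."""
    return np.exp(-((x - centre) ** 2) / (2.0 * sigma ** 2))


def sugeno_zero_order(inputs, centres, sigmas, consequents):
    """Zero-order Sugeno output: firing-strength-weighted mean of rule constants.

    inputs: (n_inputs,); centres, sigmas: (n_rules, n_inputs); consequents: (n_rules,)
    """
    inputs = np.asarray(inputs, dtype=float)
    memberships = gauss_mf(inputs, np.asarray(centres, dtype=float),
                           np.asarray(sigmas, dtype=float))
    firing = memberships.prod(axis=1)          # AND taken as product (assumption)
    return float(np.dot(firing, consequents) / firing.sum())


# Two hypothetical rules over the three lagged inputs.
centres = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
sigmas = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
consequents = np.array([0.0, 10.0])
y_mid = sugeno_zero_order([0.5, 0.5, 0.5], centres, sigmas, consequents)
print(y_mid)
```

An input halfway between the two rule centres fires both rules equally, so the output is the midpoint of the two rule constants. A first-order Sugeno system would replace each constant with a linear function of the inputs.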

Table 5. Models’ performance metrics on the “configuration” data.

Metrics	NAR	LSTM	ANFIS	GRU	CNN-RNN	MLP
Total data count	45,654	45,654	45,654	45,654	45,654	45,654
Training data count	9130	9130	9130	9130	9130	9130
MAE (unit) *	2.592	5.081	2.805	4.124	5.063	2.122
MAPE (%)	0.895	5.081	0.862	2.848	2.948	0.876
ME (unit)	1.279	3.061	1.296	2.488	3.632	1.145
MSE (unit²)	1.225	17.297	1.985	8.243	30.814	2.011
RMSE (unit)	1.107	4.159	1.409	2.871	5.551	1.418
Calculation duration (s)	4.847	1826.525	796.826	1724.109	1004.711	10.762
R²_test	0.9867	0.9166	0.9875	0.9656	0.9675	0.9858
Number of iterations	9	100	1000	100	100	9

(*) “unit” is the same as the data unit.
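
For readers who wish to reproduce metrics of the kind reported in Table 5, the sketch below uses the standard textbook definitions of MAE, MAPE, MSE, RMSE, and R². These may differ in detail from the formulas the authors applied, ME is omitted because its definition is not restated in this part of the article, and the sample values are invented for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Standard error metrics for a regression or forecasting model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = float(np.mean(err ** 2))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(100.0 * np.mean(np.abs(err / y_true))),  # requires y_true != 0
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": 1.0 - ss_res / ss_tot,
    }


metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
print(metrics)
```

The RMSE row of Table 5 is the square root of the MSE row for every model; for NAR, for example, √1.225 ≈ 1.107.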

Table 6. Comparison of the proposed model with similar ones in the literature.

Modelling Studies	Model	Modelled Parameter	RMSE (*)	$R^{2}$	Additional Metrics
Proposed models	NAR	Time-series data, configuration of failure (count)	1.107	0.987	MAE = 2.592, MAPE = 0.895%
	LSTM		4.159	0.917	MAE = 5.081, MAPE = 5.081%
	ANFIS		1.409	0.988	MAE = 2.805, MAPE = 0.862%
	GRU		2.871	0.966	MAE = 4.124, MAPE = 2.848%
	CNN-RNN		5.551	0.968	MAE = 5.063, MAPE = 2.948%
	MLP		1.418	0.986	MAE = 2.122, MAPE = 0.876%
Hussain and AlAlili (2017) [33]	MLP	Global horizontal irradiance (Whm⁻² day⁻¹)	0.0682	0.896	MAPE = 5.15%
	MLP (wavelet)		0.0389	0.965	MAPE = 2.98%
	ANFIS		0.0410	0.896	MAPE = 3.37%
	ANFIS (wavelet)		0.0662	0.961	MAPE = 4.88%
	NARX		0.0329	0.940	MAPE = 2.32%
	NARX (wavelet)		0.0504	0.975	MAPE = 3.45%
Aslanargun et al. (2007) [34]	MLP	Number of tourists (count)	147,909	NA	MAE = 127,838.8, MAPE = 14.96%
Duan et al. (2025) [35]	GRU	Temperature (°C)	1.750	NA	Train time(s) = 22.256, MAPE = 0.230%
	LSTM		1.701	NA	Train time(s) = 23.796, MAPE = 0.213%
	BIGRU		1.889	NA	Train time(s) = 27.095, MAPE = 0.273%
	BILSTM		2.016	NA	Train time(s) = 29.813, MAPE = 0.232%
Zhu et al. (2025) [36]	RNN	Main outlet slurry flow rate, MOSFR (m³h⁻¹)	9.8188	0.853	MAE = 7.2049, MAPE = 0.3025%
	LSTM		9.9564	0.838	MAE = 7.2458, MAPE = 0.3042%
	SVR (**)		19.578	0.695	MAE = 14.5567, MAPE = 0.6099%
Yu et al. (2025) [37]	ICEEMDAN-MPA-GRU (***)	GNSS (****) height coordinate (°)	2.49	0.98	MAE = 2.12, MAPE = 2.85%, Weighted quality evaluation index = 0.463

(*) RMSE and MAE are in the units of their variables; MAPE is in percent. (**) SVR: Support Vector Regression. (***) ICEEMDAN-MPA-GRU: improved complete ensemble empirical mode decomposition with adaptive noise, marine predators algorithm, gated recurrent unit. (****) GNSS: Global navigation satellite system.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
