1. Introduction
With the rapid technological evolution of manufacturing, we are now experiencing a new generation of industrial revolution: the Industrial Internet of Things (IIoT) [1,2,3,4,5]. In critical sectors covered by the Industrial Internet, such as manufacturing, energy, transportation, and healthcare, equipment failures can lead to production line disruptions, inefficient resource utilization, safety risks, and ecological and environmental hazards [6,7,8]. Typical scenarios include the following: downtime of key processing equipment in discrete manufacturing workshops, which can incur direct production losses ranging from hundreds to tens of thousands of dollars per hour for a single machine; failures in energy transmission and distribution systems, which may cause regional power outages affecting tens of thousands of users; and malfunctions of core components in rail transit, which could jeopardize operational safety. Therefore, establishing a precise equipment fault diagnosis system to achieve early defect identification and potential fault prediction has become a core technical requirement for safeguarding industrial systems' reliability, safety, and economic efficiency.
However, traditional supervised learning methods depend heavily on abundant labeled fault samples, which are rarely available in industrial environments due to the randomness of failures and the preventive replacement of key components [9,10]. This scarcity necessitates developing detection methods capable of identifying unknown faults under few- or zero-fault-sample conditions [11,12,13,14]. Manual inspections and fixed-interval maintenance are inefficient, lack real-time degradation monitoring, and cannot reliably predict sudden failures, highlighting the urgent need for intelligent predictive maintenance [15]. In the field of Industrial Internet equipment fault diagnosis, deep learning technologies, by constructing multi-layer nonlinear mapping networks, have provided a revolutionary solution for unknown fault detection. As a typical unsupervised learning paradigm, the autoencoder (AE) minimizes the mean squared error between the input and the reconstructed output, realizing feature compression and reconstruction of high-dimensional signals in the hidden layer [16]. Its core advantage is that it requires only normal samples for training and detects anomalies by identifying reconstruction errors that exceed an adaptive threshold [17]. Traditional AEs and their variants (such as sparse autoencoders and deep convolutional autoencoders) perform well on static data. However, when facing dynamic signals such as the time-series vibrations and currents collected by industrial sensors, their accuracy in anomaly identification is limited because they cannot model temporal correlations or local spatial features [18]. To address this, we integrate a CNN for spatial feature extraction and an LSTM for temporal modeling, enhancing fault representation under complex conditions [19]. In industrial predictive maintenance, deep learning-based fault detection offers distinct advantages: by continuously collecting operational data, the model can track reconstruction error changes in real time to identify early-stage performance degradation. Once an anomaly is detected and addressed, the newly acquired fault data can be used for model updates, forming a closed-loop process of "detection–maintenance–optimization" [20]. This capability enables the system not only to detect unknown faults but also to accurately predict fault types through continuous learning of historical fault patterns, guide the formulation of targeted maintenance strategies, substantially improve the operation and maintenance efficiency and reliability of industrial equipment, and promote the paradigm shift of smart factories from passive maintenance to active health management [21].
Amid substantial research on fault detection, deep learning has become a pivotal direction due to its ability to automatically extract representative features and handle complex nonlinear data with strong adaptability and generalization [22]. An intelligent diagnostic method utilizing Deep Neural Networks (DNNs) was proposed in [23] to address the shortcomings of traditional methods in handling complex nonlinear data. With the prominence of CNNs' capability in local feature extraction, the authors of [24] proposed a CNN-based prediction model, which achieved the classification of mechanical equipment faults and attained high accuracy even with limited data sources. To further improve the performance of algorithms in equipment fault diagnosis, researchers have begun to combine neural networks with traditional machine learning algorithms. In [25], the researchers integrated a CNN with an SVM to propose a CNN-SVM bearing fault diagnosis scheme, which uses the CNN to extract data features and then employs the SVM for classification; this approach reduces system runtime while improving classification accuracy. In [26], the authors developed a novel RF-based CNN model, incorporating a dropout layer into the CNN to prevent overfitting; experimental results confirmed that this method outperforms traditional algorithms such as SVM, CNN, and RF. As input time-series data lengthen, Long Short-Term Memory (LSTM) networks have been adopted in fault diagnosis to capture long-range temporal dependencies. Reference [27] presents a fault diagnosis method for wind turbine gearboxes leveraging an LSTM-based approach; it optimizes the network with a cosine loss to reduce the impact of signal intensity on diagnostic accuracy and enhance diagnostic precision. The integration of CNN and LSTM enables the simultaneous extraction of spatial and temporal features for fault diagnosis. For example, Reference [28] developed an end-to-end CNN-LSTM model using raw sensor data, and Reference [29] introduced a bidirectional convolutional LSTM to address complex planetary gearbox responses. In [30], the authors explored generative adversarial networks combined with deep extreme learning machines to enhance fault diagnosis performance under data-scarce conditions. Reference [31] enhanced detection accuracy by feeding raw, Fourier-transformed, and wavelet-transformed signals into a multichannel convolutional LSTM, maintaining high performance even with short input segments. Recent studies have increasingly applied Transformer architectures to fault detection and diagnosis tasks in industrial systems. Wu et al. [32] developed a Transformer-based approach for rotary machinery that can classify known faults and detect novel fault types. Zhang et al. [33] proposed TSViT, which leverages a time-series vision Transformer to improve diagnostic performance on rotating machinery datasets. More recently, Xiao et al. [34] introduced a Bayesian variational Transformer that enhances robustness under small-sample and noisy conditions. These works collectively indicate that Transformer-based models are becoming strong baselines in intelligent fault diagnosis. The papers [35,36,37,38,39,40,41] report contributions to equipment anomaly detection from different perspectives. The development of autoencoder (AE) networks in deep learning has provided new insights for anomaly detection in mechanical equipment. The AE is an unsupervised neural network model with a symmetric structure [42]. When trained only on normal samples, an autoencoder learns to reconstruct them accurately, yielding low reconstruction errors. In contrast, abnormal samples, which are unseen during training, are poorly reconstructed and produce higher errors. By setting a reasonable reconstruction error threshold, abnormal samples can be accurately detected [43]. While AEs trained on normal samples are effective for anomaly detection [44,45,46,47,48,49], they are unable to recognize unseen fault types, underscoring the need for methods dedicated to unknown fault detection.
Table 1 presents a comparative summary of the proposed DC-LSTM-AE model and related deep learning methods cited in this study.
In summary, autoencoders (AEs) trained exclusively on normal samples are effective at learning representations of normal conditions, but they inherently lack the ability to recognize or categorize abnormal samples. This limitation highlights a fundamental drawback of existing industrial fault detection methods. Although models trained on normal data or predefined fault classes can achieve good performance within their training scope, they are unable to identify previously unseen fault types. This shortcoming poses significant risks in practical applications, where unexpected faults may occur due to the stochastic nature of equipment degradation and operational variability, potentially leading to misjudgments by maintenance personnel, reduced diagnostic efficiency, and avoidable production losses. Accordingly, the objective of this study is to determine whether deep neural networks, trained exclusively on normal operating data, can effectively identify unknown faults in industrial equipment without requiring prior fault samples. To answer this question, we propose a novel deep learning-based method for unknown fault detection that relies exclusively on normal samples during training, enabling the identification of unseen fault types without prior labeling. This paper presents the following primary contributions to fault detection in IIoT:
This paper proposes a DC-LSTM-AE model based on deep CNN and LSTM. The model extracts spatial features via a five-layer CNN and captures the long-term dependencies of time-series data through LSTM, enabling spatiotemporal feature fusion for high-dimensional nonlinear time-series signals in industrial environments. This approach addresses traditional autoencoders’ limitations in extracting features from complex signals, enabling accurate reconstruction-based identification of unknown faults without requiring labeled fault samples.
To address the core problem of scarce fault samples in industrial scenarios, we design a training procedure that uses only normal samples. By leveraging the reconstruction error characteristics of autoencoders, a benchmark feature space is constructed through training on normal samples. When abnormal samples are input, their absence from the training process leads to significantly increased reconstruction errors, and anomaly detection is achieved by setting a threshold based on the Pauta criterion. This strategy removes traditional supervised learning's dependence on fault samples, providing a feasible solution for early equipment maintenance.
In this study, we employ sliding windows and fast Fourier transform (FFT) to convert time-series signals into spectral features. This reduces data dimensions while preserving key information, enhances model training stability, and lowers memory consumption. The L2 regularization term is introduced to optimize the loss function, suppress overfitting, and enhance the model’s generalization ability. By dynamically adjusting the regularization coefficient through cross-validation, a balance is achieved between model complexity and detection accuracy, making it suitable for real-time detection requirements in industrial sites.
We evaluate our method on two industrial datasets: the Southeast University gearbox dataset (four fault types) and a constant-speed water pump dataset from factory settings (one fault type). Compared with traditional autoencoders, Deep Convolutional Autoencoders (DCAEs), and several machine learning algorithms, DC-LSTM-AE achieves notably higher accuracy and precision. In particular, it better distinguishes reconstruction errors for unknown faults with high feature similarity, confirming its effectiveness and industrial applicability.
These contributions collectively advance the understanding and development of fault detection problems in IIoT, providing a robust foundation for future work and practical applications. The remainder of this article is organized as follows: Section 2 provides the system model and an overview of the fault detection model in the IIoT setting. The fault detection strategy based on the DC-LSTM-AE algorithm is developed in Section 3. The simulation results and performance analysis are discussed in Section 4. In Section 5, the conclusions are given.
2. System Model
The overall architecture of the proposed model is illustrated in Figure 1. It is an improved DC-LSTM-AE framework that integrates a deep CNN and an LSTM within an enhanced autoencoder design. By combining the CNN's capability for extracting spatial features from high-dimensional, complex-structured sensor signals with the LSTM's strength in modeling long-term temporal dependencies, the model jointly learns the spatial–temporal representations essential for accurate fault detection. This integration not only improves representation capacity and detection accuracy but also enhances noise robustness and stability in industrial environments, forming the basis for the data preprocessing, encoder–decoder structure, and detection workflow described below.
In this study, the raw sensor signals are first segmented into fixed-length time windows to capture localized temporal dynamics, and then they are transformed into the frequency domain using the FFT to highlight periodic and harmonic fault-related components. All features are subsequently normalized to a standard scale to eliminate amplitude disparities across sensors and operating conditions. The proposed DC-LSTM-AE model adopts a symmetric encoder–decoder architecture optimized for anomaly detection in high-dimensional industrial time-series data. The encoder consists of a five-layer deep convolutional neural network (CNN), where each layer includes convolution, normalization, non-linear activation, and pooling operations. This hierarchical design enables the progressive extraction of abstract spatial features from frequency-domain representations of raw sensor signals. To mitigate neuron inactivation and improve feature continuity, LeakyReLU is used in place of standard ReLU in the activation layers. The pooling layer is applied to retain certain prominent features maximally, thereby enhancing the model’s fault detection accuracy. Following convolution, an LSTM network is applied to extract temporal features. Subsequently, a fully connected layer (FCL) is used for feature integration and dimensionality reduction. In the decoder, the encoder’s output features are first expanded via an FCL network. Then, the temporal features are restored using an LSTM network. Furthermore, the spatial features of the samples are recovered via a five-layer transposed CNN to ultimately achieve sample reconstruction.
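To make this layout concrete, the following is a minimal PyTorch sketch of the described encoder–decoder, assuming a half-spectrum input of length 1024 (from the 2048-sample window used later); the channel counts, kernel sizes, and latent width are illustrative assumptions, as the paper fixes only the five-layer CNN, LSTM, and FCL structure:

```python
import torch
import torch.nn as nn

class DCLSTMAE(nn.Module):
    """Sketch of the DC-LSTM-AE: 5-layer conv encoder -> LSTM -> FC bottleneck,
    mirrored by FC -> LSTM -> 5-layer transposed-conv decoder.
    Channel counts, kernel sizes, and latent width are illustrative assumptions."""

    def __init__(self, in_len=1024, latent_dim=64):
        super().__init__()
        chs = [1, 16, 32, 64, 64, 64]                       # assumed channel progression
        enc = []
        for i in range(5):                                  # five conv blocks
            enc += [nn.Conv1d(chs[i], chs[i + 1], kernel_size=3, padding=1),
                    nn.BatchNorm1d(chs[i + 1]),
                    nn.LeakyReLU(0.01),                     # LeakyReLU per the paper
                    nn.MaxPool1d(2)]                        # halves the length each block
        self.enc_cnn = nn.Sequential(*enc)
        self.enc_len = in_len // 2 ** 5                     # length after 5 poolings
        self.enc_lstm = nn.LSTM(input_size=chs[-1], hidden_size=64, batch_first=True)
        self.enc_fc = nn.Linear(64 * self.enc_len, latent_dim)

        self.dec_fc = nn.Linear(latent_dim, 64 * self.enc_len)
        self.dec_lstm = nn.LSTM(input_size=64, hidden_size=chs[-1], batch_first=True)
        dec = []
        for i in range(5, 0, -1):                           # mirror the encoder
            dec += [nn.Upsample(scale_factor=2),            # stands in for unpooling
                    nn.ConvTranspose1d(chs[i], chs[i - 1], kernel_size=3, padding=1),
                    nn.BatchNorm1d(chs[i - 1]) if i > 1 else nn.Identity(),
                    nn.LeakyReLU(0.01) if i > 1 else nn.Identity()]
        self.dec_cnn = nn.Sequential(*dec)

    def forward(self, x):                                   # x: (batch, 1, in_len)
        z = self.enc_cnn(x)                                 # (batch, C, L)
        z, _ = self.enc_lstm(z.transpose(1, 2))             # (batch, L, 64)
        z = self.enc_fc(z.flatten(1))                       # (batch, latent_dim)
        y = self.dec_fc(z).view(x.size(0), -1, 64)          # (batch, L, 64)
        y, _ = self.dec_lstm(y)                             # (batch, L, C)
        return self.dec_cnn(y.transpose(1, 2))              # (batch, 1, in_len)

model = DCLSTMAE()
out = model(torch.randn(8, 1, 1024))                        # reconstruction, (8, 1, 1024)
```

Here nn.Upsample stands in for the unpooling stage of the transposed CNN; an implementation with max-unpooling indices would follow the same outline.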
Model training aims to minimize the mean squared reconstruction error between the input and output samples, with an additional L2 regularization term applied to constrain parameter magnitudes and enhance generalization. Once the training loss converges, the model processes the normal training data to compute reconstruction errors, and the detection threshold is determined using the Pauta criterion, defined as the mean reconstruction error plus three standard deviations. During inference, any sample with a reconstruction error exceeding this threshold is classified as faulty, while those below the threshold are considered normal.
3. The Proposed DC-LSTM-AE Algorithms
The working environment of actual industrial production is extremely complex, with complicated and changeable working conditions, and most of the signals collected by sensors are non-stationary and non-linear data [50]. Under these conditions, fault detection and diagnosis require more powerful feature extraction tools to obtain more valuable information from such signals. Deep learning addresses this through the powerful feature extraction and learning capabilities enabled by deep neural networks with increased parameters and depth [51].
3.1. The CNN Algorithm
CNNs represent a class of deep learning architectures that have been widely adopted for time-series signal analysis. They are particularly effective in processing grid-structured data and have been extensively applied in mechanical equipment fault detection, where they demonstrate superior performance in both signal processing and hierarchical feature extraction [52]. The fundamental principle of CNNs lies in their ability to extract salient local features through convolution operations, while computational complexity is reduced via pooling layers. Core mechanisms such as local connectivity, parameter sharing, and pooling operations collectively enable the network to identify discriminative patterns across diverse signal regions. Local connectivity ensures that each neuron interacts only with a restricted receptive field of the input, in contrast to the fully connected architecture, which links every neuron to all input nodes. Given that localized structures often carry critical diagnostic cues in mechanical fault signals, this property facilitates the targeted extraction of informative patterns from relevant regions, thereby enhancing the model's fault detection accuracy and robustness. Parameter sharing refers to the use of the same convolution kernel for the connections between all output layer nodes and their corresponding local regions in the input layer. This approach significantly reduces trainable parameter counts, particularly for high-dimensional data, enabling efficient CNN training and deployment in resource-constrained environments. The CNN is a hierarchical model whose basic structure consists of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer, with detailed layer specifications provided below [53]:
Input Layer: The input layer serves to receive raw data as input and transmit this information to subsequent layers for feature extraction. In the context of mechanical equipment fault detection, the input data can be time-series data collected by sensors or preprocessed spectrograms.
Convolutional Layers: The convolutional layer extracts local features from the input data, which are leveraged by the CNN in mechanical equipment fault diagnosis to identify potential anomalies and faults. The convolutional layer scans the input data using multiple convolution kernels to extract local features along different dimensions. The core formula for the convolution operation is as follows:

$$ y(i, j) = \sum_{m=1}^{M} \sum_{n=1}^{N} x(i+m,\, j+n)\, w(m, n) + b, $$

where $x$ is the input signal, $y$ is the output signal, $w$ is the convolution kernel, $b$ is the bias term, and $M$ and $N$ are parameters representing the size of the convolution kernel (a minimal numeric sketch follows the layer descriptions).
Pooling Layers: The pooling layer constitutes a critical component of CNN architectures, serving primarily to downsample feature maps and thereby reduce their dimensionality. This operation markedly decreases the number of model parameters while retaining the most informative features, thus improving computational efficiency and enhancing robustness. Pooling also imparts a degree of translation invariance, enabling the network to maintain stable performance under variations in input positioning. In the context of mechanical equipment fault detection, pooling layers facilitate the extraction of key diagnostic patterns from vibration signals or spectrograms, while effectively mitigating the influence of noise.
Fully Connected Layers: The fully connected layer is typically located at the back end of the network. Its main function is to utilize the features extracted by the preceding convolutional and pooling layers for classification, regression, or other tasks. In mechanical equipment fault detection, the fully connected layer transforms analyzed feature representations into diagnostic labels of health states, performing critical fault classification decisions.
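As a minimal numeric sketch of the convolution formula above (in the cross-correlation form conventionally used in CNNs), the following NumPy snippet applies a single kernel with bias to a toy input; the kernel and input values are arbitrary:

```python
import numpy as np

def conv2d_single(x, w, b):
    """Valid 2-D convolution with one kernel and a bias term, mirroring
    y(i, j) = sum_{m,n} x(i+m, j+n) * w(m, n) + b."""
    M, N = w.shape
    H, W = x.shape
    y = np.empty((H - M + 1, W - N + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(x[i:i + M, j:j + N] * w) + b
    return y

x = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input patch
w = np.array([[1.0, 0.0], [0.0, -1.0]])        # arbitrary 2x2 kernel
print(conv2d_single(x, w, b=0.5))              # 3x3 feature map
```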
3.2. The LSTM Algorithm
The LSTM network extends the conventional recurrent neural network (RNN) architecture by introducing memory cells along with input, output, and forget gates. These components enable the LSTM to effectively capture long-term dependencies and mitigate the vanishing gradient problem, thereby exhibiting superior performance in modeling long-sequence data [54,55]. The overall architecture of the LSTM network is illustrated in Figure 2.
Input Gate: The input gate computes an affine transformation of the current input $x_t$ and the hidden state $h_{t-1}$ output at the previous moment, transforming it via a sigmoid activation to the range [0, 1] to generate a gating signal. Its function is to determine which information is selected for storage in the cell state at the current time step. The specific calculation formula is as follows:

$$ i_t = \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right), $$

where $W_{xi}$ and $W_{hi}$ are the weight parameters of the input gate and $b_i$ is the bias parameter of the input gate. In mechanical equipment fault detection, the input gate mechanism prevents critical time-series information from being discarded, reduces sensitivity to transient irrelevant signals, and consequently extracts key features from long-term data trends.
Forget Gate: The forget gate determines, at each time step, which components of the previous cell state should be retained and which should be discarded through gated operations. This mechanism allows the model to dynamically regulate the information flow within the memory units, thereby adapting to variations in the input sequence. The gating signal is computed as follows:

$$ f_t = \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right), $$

where $W_{xf}$ and $W_{hf}$ are the weight parameters of the forget gate and $b_f$ is the bias parameter of the forget gate. In fault detection, the forget gate retains information relevant to fault patterns while filtering out random noise and irrelevant operational signals, effectively suppressing their interference with the judgment of the equipment's state.
Memory Cell: The memory cell constitutes the core component of the LSTM network, responsible for storing and maintaining long-term dependencies while dynamically regulating the flow of information through its gating mechanisms. Its operation integrates two primary sources of information: the retained state from the previous time step and the newly received input at the current time step. This design enables the model to selectively preserve or update information, thereby mitigating the vanishing gradient problem commonly encountered in conventional RNNs. The corresponding computational formulations are expressed as follows:

$$ \tilde{c}_t = \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, $$

where $\tilde{c}_t$ denotes the candidate input information and $c_t$ is the state information of the memory cell at the current time step. In equipment fault detection, signals collected by sensors may contain a large amount of noise and redundant features. This mechanism can adaptively filter important information, effectively mitigating key information loss in long time series.
Output Gate: The output gate primarily controls the information flow to the next time step. It dynamically selects which historical information to extract from the memory cells for output to the current hidden state. The relevant formulas are as follows:

$$ o_t = \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right), \qquad h_t = o_t \odot \tanh(c_t), $$

where $W_{xo}$ and $W_{ho}$ are the weight parameters of the output gate, $b_o$ is the bias parameter of the output gate, and $o_t$ is the gating signal of the output gate with a value range of [0, 1]. In mechanical equipment fault detection, the output gate extracts and outputs content related to fault features from the memory cells. This enables the real-time monitoring of operational status and the prediction of future failure risks.
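The four gate computations above can be traced in a few lines of NumPy, as sketched below; the toy dimensions and weight layout (one matrix per gate acting on the concatenated input and previous hidden state) are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step implementing the input, forget, and output gates
    and the memory-cell update described above. W maps [x_t; h_prev] to the
    gate pre-activations; shapes here are illustrative."""
    z = np.concatenate([x_t, h_prev])
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate cell input
    c_t = f_t * c_prev + i_t * c_tilde        # memory-cell state update
    h_t = o_t * np.tanh(c_t)                  # hidden state output
    return h_t, c_t

# Toy dimensions: 3 input features, 4 hidden units.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 7)) * 0.1 for k in "ifoc"}
b = {k: np.zeros(4) for k in "ifoc"}
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, b)
```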
3.3. The AE Algorithm
AEs are a class of unsupervised learning models composed of two symmetrical components: an encoder and a decoder [56]. They are widely employed for dimensionality reduction of high-dimensional data by learning compact latent representations. Given a high-dimensional input, the encoder compresses it into a low-dimensional latent space, while the decoder reconstructs the original input from this representation. The network is trained by minimizing the reconstruction error between the input and the reconstructed output, thereby ensuring that the latent space captures the most salient features of the data. In the context of fault detection, anomalies are identified by measuring deviations in reconstruction error between test samples and normal samples [57].
The structure of the AE is shown in Figure 3. For high-dimensional input data $X = \{x_1, x_2, \ldots, x_n\}$, where $n$ represents the number of data points, the corresponding low-dimensional representation is obtained through the encoder. The low-dimensional representation of the data is then input into the decoder to obtain the corresponding output data. Assuming the input data is $x$, the formulas for the encoding and decoding processes can be expressed as follows:

$$ h = f\left(W_1 x + b_1\right), \qquad \hat{x} = g\left(W_2 h + b_2\right), $$

where $h$ is the low-dimensional feature vector obtained by the encoder; $f$ and $g$ represent the activation functions of the encoder and decoder, respectively; $W_1$ and $b_1$ are the weights and biases of the encoder; $W_2$ and $b_2$ are the weights and biases of the decoder; and $\hat{x}$ denotes the output data obtained by the decoder.

During the training process, the loss function of the AE is generally the MSE loss function:

$$ L_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \left\| x_i - \hat{x}_i \right\|^2 . $$
The aforementioned function iteratively updates the network parameters through backpropagation combined with gradient descent, aiming to minimize the reconstruction error between the output and the input data. Upon convergence, the intermediate hidden layer encodes a compact, low-dimensional representation of the original high-dimensional input, effectively capturing its essential features. During inference, if the AE network encounters input data that substantially deviates from the distribution of the training set, its reconstruction performance deteriorates, leading to a pronounced increase in reconstruction error. By quantitatively comparing reconstruction errors across samples, anomalous instances can be distinguished from normal ones with high sensitivity.
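As a minimal illustration of this detection principle, the sketch below trains a small dense AE on normal samples only and uses the per-sample MSE as the anomaly score; the layer widths and helper names are our assumptions, not the paper's model:

```python
import torch
import torch.nn as nn

# Minimal dense autoencoder sketch: the encoder compresses, the decoder
# reconstructs, and per-sample MSE serves as the anomaly score.
ae = nn.Sequential(
    nn.Linear(1024, 128), nn.ReLU(),   # encoder: high-dim input -> latent code
    nn.Linear(128, 1024),              # decoder: latent code -> reconstruction
)
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(x_normal):              # x_normal: (batch, 1024), normal samples only
    opt.zero_grad()
    loss = loss_fn(ae(x_normal), x_normal)
    loss.backward()
    opt.step()
    return loss.item()

def anomaly_score(x):                  # higher score = worse reconstruction
    with torch.no_grad():
        return ((ae(x) - x) ** 2).mean(dim=1)
```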
3.4. The DC-LSTM-AE Algorithm
3.4.1. Model Training
To ensure that the DC-LSTM-AE model effectively captures both the spatial and temporal characteristics of vibration signals, the architecture integrates a five-layer convolutional encoder followed by an LSTM-based temporal feature extractor and a decoder for reconstruction. This study substitutes LeakyReLU for the traditional ReLU function in the activation layers to mitigate the representational capacity degradation caused by neuronal deactivation, which occurs when ReLU outputs zero for negative inputs. LeakyReLU addresses this issue by introducing a small non-zero gradient (typically 0.01 or 0.02) in the negative domain. Its mathematical formulation is as follows:

$$ \mathrm{LeakyReLU}(x) = \begin{cases} x, & x \ge 0, \\ \alpha x, & x < 0, \end{cases} $$

where $\alpha$ is a small constant; this ensures that when the input is negative, the neuron avoids complete deactivation by retaining a small gradient, thereby alleviating vanishing gradients and preserving the network's learning capacity. Each convolutional block ends with a pooling layer that maximally retains the most prominent features, enhancing the model's fault detection accuracy. After the convolutional stages, an LSTM network extracts temporal features, and a fully connected layer (FCL) then integrates the features and reduces their dimensionality.
In the decoder, the encoder's output features are first expanded via an FCL network, the temporal features are then restored using an LSTM network, and the spatial features of the samples are finally recovered via a five-layer transposed CNN to achieve sample reconstruction. Each layer of the transposed CNN likewise includes a normalization layer, an activation layer (again using LeakyReLU), and an unpooling layer.
With the model architecture established, the training process begins by preparing the input data in a format compatible with both the CNN and LSTM components. Continuous time-domain signals are segmented into fixed-length frames through a sliding-window approach, ensuring consistency with the model input size requirements and FFT preprocessing constraints. Based on FFT requirements, model input size constraints, and the trade-off between data overlap and information novelty per sample, the window width and step size are set to 2048 and 1024, respectively. This windowing process yields time-series normal samples. The obtained time-series normal signals $x_i$ are converted into the frequency domain via FFT to generate the training dataset $X = \{X_1, X_2, \ldots, X_n\}$, where $n$ denotes the number of normal samples:

$$ X_i^{(m)} = \sum_{k=0}^{l-1} x_i^{(k)}\, e^{-j 2\pi m k / l}, \qquad m = 0, 1, \ldots, l - 1, $$

where $X_i^{(m)}$ represents the $m$-th frequency value of the $i$-th sample, $x_i^{(k)}$ denotes the $k$-th time-series value of the $i$-th sample, and $l$ is the size of the sliding window. Due to the symmetry of the Fourier transform, we retain only half of the spectrum to reduce the number of network parameters. The training dataset $X$ is subjected to Z-score standardization [58] to obtain the standardized dataset $\tilde{X}$. The specific formula is as follows:

$$ \tilde{X}_i^{(j)} = \frac{X_i^{(j)} - \mu_j}{\sigma_j}, $$

where $\tilde{X}_i^{(j)}$ is the standardized value of the $j$-th feature in the $i$-th sample, $X_i^{(j)}$ is the $j$-th feature value of the $i$-th sample, $\mu_j$ denotes the mean value of the $j$-th feature computed over all training samples, and $\sigma_j$ denotes the corresponding standard deviation. The computed normalization parameters are stored for use during inference.
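The preprocessing chain can be sketched as follows, using the stated window width of 2048, step of 1024, half-spectrum FFT, and Z-score parameters fitted on the normal training set; taking the FFT magnitude and the helper name preprocess are our assumptions:

```python
import numpy as np

def preprocess(signal, win=2048, step=1024):
    """Sliding-window framing followed by a half-spectrum FFT, per the paper's
    preprocessing; returns the sample matrix X of shape (n_windows, win // 2)."""
    frames = [signal[s:s + win] for s in range(0, len(signal) - win + 1, step)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))[:, :win // 2]

# Fit Z-score parameters on normal training data only, then reuse at inference.
X_train = preprocess(np.random.randn(100_000))         # stand-in for a normal signal
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)  # stored for inference
X_std = (X_train - mu) / (sigma + 1e-12)               # standardized dataset
```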
The preprocessed training dataset $\tilde{X}$ is fed into the designed DC-LSTM-AE model for training, yielding the generated sample set $\hat{X}$. The mean squared error between the generated samples and the input samples is used as the training loss function, and an L2 regularization term is incorporated to constrain the model's complexity and improve its generalization capability. Training is terminated once the loss value converges to a stable level. The specific formula for the loss function is given as follows:

$$ L = \frac{1}{n} \sum_{i=1}^{n} \left\| \tilde{X}_i - \hat{X}_i \right\|^2 + \lambda \sum_{p=1}^{P} \theta_p^2, $$

where $P$ denotes the total number of parameters in the network and $\theta_p$ represents the $p$-th parameter. The first term, the mean squared error (MSE), quantifies the deviation between the input and its reconstruction, driving the model to progressively approximate the distribution of normal operating conditions and, thus, minimizing reconstruction errors for normal samples. Since the model is trained exclusively on normal data, abnormal samples, which deviate from the learned distribution, produce noticeably larger reconstruction errors, providing a clear discriminative basis for fault detection. The second term, L2 regularization, penalizes the squared magnitude of all network parameters to suppress overfitting, reduce noise sensitivity, and improve adaptability across varying operating environments. The regularization coefficient $\lambda$ directly influences the trade-off between model complexity and generalization: an excessively large value can lead to underfitting by overly constraining the parameters, while an overly small value may retain redundant features and reduce robustness. In this study, $\lambda$ is determined via cross-validation to achieve an optimal balance between representational capacity and generalization ability. The proposed loss design is computationally efficient and demonstrates strong stability in industrial fault detection tasks, particularly in noisy, multi-condition scenarios.
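In code, the composite loss can be written as below (a sketch; lam plays the role of the cross-validated coefficient λ, and the explicit parameter sum is equivalent in effect to optimizer weight decay):

```python
import torch

def dc_lstm_ae_loss(x, x_hat, model, lam=1e-4):
    """MSE reconstruction term plus an explicit L2 penalty over all network
    parameters, matching the loss defined above; lam is the coefficient."""
    mse = torch.mean((x - x_hat) ** 2)
    l2 = sum(torch.sum(p ** 2) for p in model.parameters())
    return mse + lam * l2
```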
Upon completion of training, all normal samples are reintroduced to the trained DC-LSTM-AE model to obtain the reconstructed dataset $\hat{X}$, and the reconstruction error for each sample is computed as follows:

$$ e_i = \frac{1}{f} \sum_{j=1}^{f} \left( \tilde{X}_i^{(j)} - \hat{X}_i^{(j)} \right)^2, $$

where $f$ is the number of features. Based on the reconstruction error sequence $\{e_1, e_2, \ldots, e_n\}$ obtained from all normal training samples, it is essential to establish a reasonable and robust decision threshold to enable the trained DC-LSTM-AE model to automatically differentiate between normal and abnormal inputs during inference. The choice of threshold directly impacts the accuracy and stability of the fault detection system: if the threshold is set too low, some normal samples may be misclassified as abnormal (increasing false positives), leading to unnecessary maintenance costs and downtime; conversely, an overly high threshold may cause actual abnormal samples to be overlooked (increasing false negatives), delaying fault diagnosis and potentially resulting in severe equipment failures. Therefore, the threshold determination process must balance high sensitivity with a low false alarm rate while maintaining robustness against the complex and variable noise conditions typically found in industrial environments.
In this study, the Pauta criterion is adopted for threshold calculation, the core idea of which is based on the statistical properties of the normal distribution. The Pauta criterion states that in a normal distribution, the vast majority of data (approximately 99.73%) falls within three standard deviations of the mean, and data points outside this range are highly likely to be caused by abnormal factors. Therefore, the threshold is defined as follows:

$$ T = \mu_e + 3\sigma_e, $$

where $T$ denotes the reconstruction error threshold, and $\mu_e$ and $\sigma_e$ denote the mean and standard deviation of the reconstruction errors over all normal training samples, respectively. This setting implies that if the reconstruction error of a test sample exceeds this range, its deviation from the normal-state distribution surpasses the statistically rare fluctuation interval; there is high confidence that the operating state corresponding to the sample is abnormal.
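Numerically, the threshold follows directly from the training-set reconstruction errors, as in the sketch below; the gamma-distributed stand-in data is only for illustration:

```python
import numpy as np

def pauta_threshold(errors):
    """Pauta (3-sigma) criterion: mean + 3 * std of the reconstruction
    errors computed on normal training samples."""
    return errors.mean() + 3.0 * errors.std()

errors = np.random.gamma(2.0, 0.01, size=5000)   # stand-in for training errors
T = pauta_threshold(errors)
is_faulty = errors > T                           # flags statistically rare samples
```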
3.4.2. Anomaly Detection Process
The specific workflow of the proposed mechanical equipment unknown fault detection method, based exclusively on normal samples, is illustrated in Figure 4. As shown in the flowchart, raw vibration signals collected from the equipment's sensors serve as the primary input to the diagnostic process. These signals are initially examined to reveal potential variations indicative of abnormal operating conditions.
Following data acquisition, the trained DC-LSTM-AE model and the reconstruction error threshold, both determined during the model training phase using the standardized dataset $\tilde{X}$, are retrieved from storage. The detection stage then proceeds as follows. The newly acquired time-domain vibration signals are first transformed into the frequency domain using the FFT described in Section 3.4.1, enabling the capture of periodic characteristics and frequency-related fault signatures that may not be evident in the time domain alone. To maintain statistical consistency with the training data, the resulting frequency-domain features are standardized using the mean and standard deviation parameters obtained during the Z-score normalization of the normal-sample training set.
The standardized frequency-domain samples are subsequently fed into the preserved DC-LSTM-AE model, which reconstructs the input signals based on learned normal patterns. For each test sample, the reconstruction error is computed as the squared difference between the original and reconstructed features, aggregated across all dimensions. This error is then compared against the precomputed threshold $T$, defined using the Pauta criterion to balance sensitivity and specificity in the classification decision.
If the reconstruction error does not exceed $T$, the operational state of the equipment is classified as normal. Conversely, when the reconstruction error surpasses this threshold, the sample is labeled as abnormal, triggering an automated alarm. This alarm acts as an immediate alert to on-site maintenance personnel, prompting timely inspection and corrective measures to prevent escalation of faults, reduce production downtime, and minimize the risk of safety incidents. The reliance on a statistically grounded threshold, combined with the end-to-end learned reconstruction capability of the DC-LSTM-AE, ensures that the system can detect unknown faults robustly, even in the absence of fault-specific training data, thereby enhancing its applicability in complex and variable industrial environments.
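Putting the stages of Figure 4 together, a compact inference routine might look as follows; it reuses the hypothetical preprocess() helper, stored normalization parameters, and threshold T from the earlier sketches:

```python
import torch

def detect(signal, model, mu, sigma, T, win=2048, step=1024):
    """End-to-end inference per Figure 4: frame -> FFT -> Z-score with stored
    training parameters -> reconstruct -> threshold. Reuses the hypothetical
    preprocess() helper and the Pauta threshold T from the sketches above."""
    X = (preprocess(signal, win, step) - mu) / (sigma + 1e-12)
    x = torch.tensor(X, dtype=torch.float32).unsqueeze(1)   # (n, 1, win // 2)
    with torch.no_grad():
        err = ((model(x) - x) ** 2).mean(dim=(1, 2)).numpy()
    return err > T                                          # True -> raise alarm
```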
5. Conclusions
This paper presents a novel unknown fault detection method for mechanical equipment that addresses the challenge of limited fault samples by training solely on normal samples, enabling accurate detection of unknown faults without relying on fault samples. To overcome the insufficient expressiveness of conventional autoencoders for high-dimensional nonlinear signals, the proposed DC-LSTM-AE model incorporates both CNN and LSTM modules to jointly extract spatial and temporal features, thereby improving its ability to detect unknown faults. To further improve training stability and model generalization, the input signals are transformed into the frequency domain using the FFT, and an L2 regularization term is introduced with dynamic adjustment via cross-validation. These enhancements make the model well suited for real-time deployment in resource-constrained industrial environments. The structure of the DC-LSTM-AE model is described in detail, and an unknown fault detection scheme for mechanical equipment based on normal samples is designed around this model, with a detailed elaboration of the detection process for unknown fault samples. To validate the effectiveness of this method, the gearbox dataset from Southeast University and the constant-speed water pump dataset collected from actual factories were selected for experimental validation. Comparative experiments were conducted with traditional autoencoder models and common unknown fault detection algorithms. The experimental results showed that, when trained only on normal samples, the proposed method outperformed the other algorithms in multiple metrics for unknown fault detection.
Owing to its strong feature extraction capability and generalizable architecture, the model exhibits high sensitivity to unknown faults and is not limited to specific equipment types. As long as sufficient normal-state time-series data are available, the method can be extended to other industrial machinery, such as motors, wind turbines, and compressors. The average inference time per sample was measured at 2.37 ms, ensuring real-time feasibility for industrial monitoring systems. Additionally, peak memory usage during inference was approximately 512 MB, demonstrating the model’s efficiency and suitability for deployment in resource-constrained industrial devices. Future research will focus on incorporating multidimensional data fusion techniques into the autoencoder framework. By integrating heterogeneous sensor modalities—such as vibration, temperature, and acoustic signals—the model’s perception and diagnostic capabilities can be further enhanced. This multidimensional fusion approach is expected to improve the recognition of abnormal patterns and unknown fault types, laying the foundation for a more robust and generalized deep-learning-based fault detection system in industrial environments.