Automatic Identification of Fetal Acidosis Based on Three-Stage Training and Meta-Feature Fusion

Wang, Haiyan; Yin, Yanxing; Zhang, Xin; Liu, Xiaotong; Zhao, Jian; Che, Na; Wang, Liu

doi:10.3390/app16042045

by Haiyan Wang ¹,*, Yanxing Yin ¹, Xin Zhang ², Xiaotong Liu ¹, Jian Zhao ¹, Na Che ¹ and Liu Wang ¹,*

¹ College of Computer Science and Technology, Changchun University, Changchun 130022, China
² College of Mathematics and Statistics, Changchun University, Changchun 130022, China
* Authors to whom correspondence should be addressed.

Appl. Sci. 2026, 16(4), 2045; https://doi.org/10.3390/app16042045

Submission received: 20 January 2026 / Revised: 13 February 2026 / Accepted: 17 February 2026 / Published: 19 February 2026

Abstract

Fetal cardiotocography (CTG) is widely used to assess fetal health during labor and to screen for fetal acidosis. However, CTG interpretation relies heavily on clinicians’ experience and is affected by subjectivity and inconsistency, which limit diagnostic reliability. Most existing artificial intelligence approaches simplify fetal acid–base assessment into a binary classification, making it difficult to distinguish acidosis severity and restricting the information available for refined clinical decision-making. To address these limitations, this study formulates a three-class classification task (normal, moderate acidosis, and severe acidosis) based on the CTU-CHB dataset, using umbilical artery blood pH as the reference standard. A signal-first, conditionally enhanced, three-stage training and meta-feature fusion framework is proposed. In stage A, a CNN-BiLSTM-attention network performs end-to-end modeling of fetal heart rate signals, while a recall feedback-driven dynamic weighted loss alleviates class imbalance and identifies difficult samples. Stage B incorporates relevant clinical detection information for these difficult samples and applies multimodal feature fusion to enhance discrimination. Stage C constructs meta-features from the outputs of the first two stages to adaptively fuse classification preferences and uncertainty. Experimental results demonstrate that the proposed framework achieves an accuracy of 82.80 ± 2.82% and an F1 score of 78.84 ± 2.96%, effectively mitigating class imbalance, improving the classification of difficult samples, and providing reliable support for clinical decision-making in fetal acidosis.

Keywords:

fetal acidosis classification; difficult samples; multimodal fusion; meta-feature fusion; category imbalance

1. Introduction

Fetal acidosis during labor is mainly caused by abnormalities in maternal–fetal oxygen exchange, involving factors such as maternal blood gas status, uterine blood supply, and impaired placental transport function. When any of these links is disturbed, fetal hypoxia can result, which in turn triggers acidosis [1]. Continuous monitoring of fetal status during labor has therefore been regarded as a key issue in perinatal medicine. Numerous studies have shown that severe intrapartum acidosis is strongly associated with a variety of adverse neonatal outcomes, including neonatal asphyxia, seizures, increased need for neonatal intensive care, and even progression to hypoxic–ischemic encephalopathy (HIE) with long-term neurological sequelae [2,3,4,5]. According to statistics, about 130 million newborns are born globally each year, of which about 4 million deaths occur during labor-related stages, and about 23% are closely related to intrapartum hypoxia [6]. Early and accurate identification of impending or existing fetal acidosis during labor is therefore important for clinicians to intervene in time and improve perinatal outcomes [7]. Cardiotocography (CTG) has become a key noninvasive monitoring technique for assessing intrauterine fetal status and screening for hypoxia and acidosis by continuously and synchronously monitoring fetal heart rate (FHR) and uterine contraction (UC) signals [8]. The clinical value of CTG relies, to a large extent, on the accurate interpretation of the characteristics of the FHR waveform [9]. However, traditional CTG interpretation has long relied on physicians’ subjective experience and suffers from poor inter-observer agreement [10]. To improve the standardization of interpretation, the International Federation of Gynecology and Obstetrics (FIGO) issued guidelines in 2015 that classify CTG patterns into three categories: normal, suspicious, and pathological [11]. 
Nevertheless, CTG is still an indirect assessment of fetal physiological status, and its diagnostic results usually require objective biological indicators for final validation. In this context, umbilical artery blood gas analysis, and, in particular, umbilical artery blood pH, has been recognized by the international perinatal community as an objective criterion for assessing the acid-base balance of the newborn at birth [12], which is a direct reflection of the degree of fetal acidosis at the end of labor [13].

In view of the above clinical importance and the establishment of objective diagnostic criteria, automatic CTG analysis techniques based on artificial intelligence have made great strides. A large number of studies have been devoted to extracting effective features from CTG signals for automated classification of fetal status. For example, Liang et al. [14] enhanced the data by Hermite interpolation and analyzed the fetal heart rate and contraction signals using a hybrid model of a one-dimensional convolutional neural network and a gated recurrent unit (1D-CNN+GRU) to achieve efficient classification of abnormal conditions in prenatal fetal monitoring. To overcome the subjectivity of manual interpretation, Sbrollini et al. [15] proposed a new automated algorithm for identifying and classifying fetal heart rate decelerations in CTGs. Liang et al. [16] proposed a hybrid model based on a one-dimensional convolutional neural network (1DCNN) and a bidirectional gated recurrent unit (BiGRU) to identify the key information in the fetal heart rate for physicians’ reference. Adhikari et al. [17] proposed a classification method for fetal acidosis based on hybrid features of fetal heart rate and contraction signal spectra. Lu et al. [18] proposed an artificial intelligence evaluation method based on a multivariate time series of fetal heart rate and contractions, which can be used to assess fetal health and provide objective support for clinical decision-making. Zhang et al. [19] proposed a multimodal fusion learning method that improves the diagnostic performance for fetal distress by combining signal and image data in a multimodal encoder network (MENet). Yefei et al. [20] used Mel-frequency cepstral coefficients (MFCCs) as input features and analyzed them with an intelligent assisted-diagnosis algorithm based on a bidirectional long short-term memory network (BiLSTM). SM et al. [21] proposed an artificial intelligence assessment method based on an enhanced VGGNet for automatic classification of fetal heart rate (FHR) signals. Liu et al. [22] combined a CNN-BiLSTM hybrid neural network with an attention mechanism and introduced discrete wavelet transform (DWT) features to enhance the automated diagnosis of fetal acidosis. Zhao et al. [23] transformed fetal heart rate signals into 2D time-frequency images by continuous wavelet transform (CWT) to capture the time-frequency features of the signal, and then used a CNN to learn effective features end-to-end, which avoids the complex manual feature engineering of traditional machine learning. Rao et al. [24] proposed a model based on a multiscale long short-term memory neural network that suppresses the interference of missing signals and artifacts through preprocessing, mitigates sample imbalance through data augmentation, and fuses information from different time scales to classify fetal heart rate automatically. Baghel et al. [25] proposed a 1D-CNN-based automatic diagnosis method for fetal acidosis to assist doctors’ decision-making. The above studies show that data-driven AI-based methods have strong potential for fetal heart monitoring signal analysis and fetal status discrimination. Liu et al. [26] used a multimodal two-branch fusion network that builds its two branches by signal slicing, extracts hypoxia features with an attention module, and fuses the mother’s electronic medical record, fetal heart rate features, and signal data, supplemented by label smoothing to optimize the model. Zhang et al. [27] proposed a clinically interpretable dual-stream AI architecture (DT-CTNet) that achieves transparent interpretation and high-precision classification in fetal distress diagnosis by processing multi-feature representations with a digital twin model and analyzing raw fetal heart rate signals with a case-tracking model. Zhang et al. [28] proposed the MMIF (multimodal medical information fusion) framework, which addresses unaligned multimodal data through a category-constrained parallel ViT model (CCPViT), introduces a cross-attention-based multimodal representation alignment network (MRAN) to learn deep cross-modal interactions, and finally designs a lightweight test model to transfer from multimodal training to unimodal diagnosis.

Despite the success of the above methods, they still face challenges on real-world data, mainly in two respects. The first is severe class imbalance: in real-world clinical datasets, normal samples account for the majority, while acidosis samples (especially severe acidosis) are sparse. This distribution tends to bias model training towards the majority class, thus weakening the ability to recognize the clinically significant minority classes. The second is the lack of targeted treatment for difficult samples: the data contain many low-confidence or easily confused samples, and generic model structures find it difficult to learn such samples effectively, which limits further improvement of overall performance. Existing studies have mainly tried to address these problems at three levels. At the data level, over-sampling or under-sampling strategies such as SMOTE [29,30] are used to adjust the class distribution, but this may introduce synthetic sample bias or discard valuable information. At the loss-function level, methods such as Focal Loss [31] and Class-Balanced Loss [32] are introduced to strengthen the model’s focus on minority classes or difficult samples, but these methods are sensitive to hyperparameters and may harm overall performance by focusing too much on minority classes. At the model level, strategies such as ensemble learning [33] or stacking [34] combine multiple base learners to improve the discrimination of difficult samples and reduce the risk of overfitting; however, traditional stacking usually requires multiple classifiers to be trained in parallel on the full dataset, lacks explicit optimization for difficult samples, and mostly fuses outputs by simple weighting or voting, which makes it difficult to exploit the complementarity and uncertainty information in the classification behaviors of different models.

To systematically address the above challenges, this paper proposes a signal-first, conditionally enhanced, three-stage training and meta-feature fusion framework using CTU-CHB, a real-world dataset that contains both raw fetal heart rate monitoring signals and synchronized clinical information. First, the raw fetal heart rate signals are preprocessed in a rigorous and standardized manner to suppress the interference of noise, artifacts and missing signals on model learning. Subsequently, in stage A, a deep model based on a one-dimensional convolutional neural network and a bidirectional long short-term memory network with a multi-head self-attention mechanism is constructed to perform end-to-end preliminary classification of all samples. A dynamic weighted loss function based on real-time recall is introduced in this stage, so that the model can dynamically adjust the loss weights according to the recognition difficulty of each category during training, thus alleviating category imbalance and automatically identifying difficult samples that are misclassified or have low classification confidence. On this basis, stage B carries out enhanced learning on these difficult samples: it introduces depthwise separable convolution and a squeeze-and-excitation attention module to mine more fine-grained temporal features, and integrates the corresponding clinical detection information to construct a multimodal classifier for the difficult samples, so as to improve the model’s ability to discriminate them. Finally, a meta-learner is designed in stage C. It does not directly process the original signals, but constructs meta-features (including category probability, confidence and uncertainty information) from the classification outputs of stage A and stage B on the same validation set, and learns the classification preferences of the two stages to adaptively fuse generalization ability with fine-grained discriminative ability, thereby generating a more robust final classification result.

The contributions of this paper are summarized as follows:

(1) A three-stage training and meta-feature fusion framework (TS-MFF) is proposed. The framework systematically divides the fetal acidosis classification task into three stages: stage A (basic learning), stage B (enhanced learning on difficult samples) and stage C (intelligent decision fusion). This provides a new learning paradigm for addressing category imbalance and difficult samples in medical signal analysis.
(2) An adaptive dynamic weighted loss function based on real-time recall is designed. During training, the loss function dynamically adjusts the weight of each category according to feedback on its recall, so that it adaptively focuses on the categories that are difficult to classify correctly and gradually transitions to the target weights, which mitigates the serious category imbalance in clinical data.
(3) A multimodal fusion classifier for difficult samples is constructed. For the difficult samples identified by the base model (samples with low classification confidence or easily confused samples), a multimodal training strategy that fuses depthwise separable convolution, the SE attention mechanism and clinical text features is designed, which enhances the model’s feature extraction and discrimination ability on difficult samples.
(4) An adaptive model fusion mechanism based on meta-feature fusion is proposed. By constructing and learning the classification probability, confidence and uncertainty of stage A and stage B as meta-features, an adaptive trade-off between the classification preferences and uncertainty of the two stages is realized, resulting in a final classification decision model that is more robust and accurate than traditional ensemble methods.

2. Materials and Methods

2.1. CTU-CHB Dataset

The CTU-CHB dataset was established by the Czech Technical University in Prague in cooperation with the University Hospital in Brno as a real-world database for fetal heart rate monitoring during labor [35]. The dataset contains 552 complete labor monitoring recordings with a signal sampling frequency of 4 Hz and an average recording duration of approximately 90 min, covering the key stages of the labor process. Compared with preprocessed or manually labeled datasets, CTU-CHB provides raw monitoring signals that are closer to a real clinical situation, which on the one hand improves the relevance of the study, and on the other hand increases the complexity of signal preprocessing and feature extraction.

It should be pointed out that contraction signals are missing in a large proportion of the CTU-CHB recordings [22], which makes it difficult to determine whether a deceleration of the fetal heart rate is associated with contractions. An important advantage of this dataset is that it also includes synchronized clinical text information (see Table 1), which allows the model’s output to be validated against objective criteria based on biochemical indicators. In response to the prevalence of missing contraction signals, studies have turned to extracting features from the fetal heart rate signal alone and using umbilical artery blood pH as the objective criterion (with a common threshold of pH ≤ 7.15 to define acidosis) to define or validate the classification results [14,22,23,24,25,36], which has become a commonly used way to assess the clinical relevance of such models.

Regarding the grading of acidosis, based on studies related to fetal electrocardiogram ST segment analysis, acidosis can be further subdivided into moderate acidosis (7.05 < pH ≤ 7.15) and severe acidosis (pH ≤ 7.05) [37]. In this paper, we labeled the samples according to this clinical grading standard and used the pH value of umbilical artery blood as the gold standard for classification. The specific classification was as follows: normal, pH > 7.15; moderate acidosis, 7.05 < pH ≤ 7.15; and severe acidosis, pH ≤ 7.05. Based on the above criteria, the CTU-CHB dataset contains 439 normal samples, 69 samples with moderate acidosis and 44 samples with severe acidosis (552 in total), showing a significant imbalance in the distribution of categories. Although umbilical arterial blood pH is an accepted objective indicator for defining acidosis classification and assessing neonatal outcome, pH and its highly correlated biochemical indicators (e.g., base excess, BE, and base deficit in extracellular fluid, BDecf) [38,39] were intentionally excluded from the input features in the multimodal modeling of stage B in this paper. These indicators are strongly correlated with the classification labels; if they were used directly as model inputs, the model could learn information that largely overlaps with the labels, leading to label leakage or overfitting and weakening the model’s ability to generalize in real clinical scenarios.
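The grading rule above can be written as a small labeling function. This is a minimal sketch: the thresholds are those stated in the text, while the function name and the string labels are illustrative choices rather than identifiers from the original study.

```python
def acidosis_label(ph: float) -> str:
    """Map an umbilical artery blood pH value to one of the three classes."""
    if ph > 7.15:
        return "normal"      # pH > 7.15
    if ph > 7.05:
        return "moderate"    # 7.05 < pH <= 7.15
    return "severe"          # pH <= 7.05
```

Both boundary values fall into the more severe class: a pH of exactly 7.15 is labeled moderate and a pH of exactly 7.05 is labeled severe, which matches the inequalities in the grading standard.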

2.2. Data Preprocessing

In this paper, the MATLAB R2024b platform is used to preprocess the fetal heart rate (FHR) signal, and the specific process is as follows. First, sampling points with a value of 0 in the original FHR signal are uniformly replaced by NaN, and missing-value detection is then carried out: when the duration of a run of consecutive missing samples is less than 15 s, linear interpolation is used for local restoration; if the duration exceeds 15 s, the segment is excluded [40]. In the spike artifact detection stage, a sampling point is regarded as a noise point if the absolute difference between it and its neighboring sampling point is greater than 25 bpm; at the same time, a region in which the differences between five consecutive beats are all less than 10 bpm is defined as a stable segment, and each noise point is replaced, by linear interpolation, with a smooth transition value between the preceding and the following stable segments [40]. In addition, out-of-range values are detected: when the FHR is above 200 bpm or below 50 bpm, Hermite spline interpolation is used to repair the affected samples [40]. To visualize the morphological differences in fetal heart rate signals across acidosis grades, Figure 1 shows typical fetal heart rate waveforms of the three types of samples after preprocessing.
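The first two preprocessing steps can be sketched in a few lines. The original pipeline was implemented in MATLAB; the Python version below is only an illustration, and the function names, the treatment of gaps of exactly 15 s (left unfilled here), and the handling of gaps at the signal edges are assumptions. The stable-segment replacement and the Hermite repair of out-of-range values are not reproduced.

```python
import math

FS = 4                # sampling frequency in Hz, as stated for the dataset
MAX_GAP = 15 * FS     # gaps shorter than 15 s (60 samples) are interpolated


def fill_short_gaps(fhr, max_gap=MAX_GAP):
    """Replace zeros with NaN, then linearly interpolate gaps shorter than max_gap.

    Longer gaps, and gaps touching either end of the signal, are left as NaN
    so that the caller can exclude those segments (an assumption of this sketch).
    """
    x = [math.nan if v == 0 else float(v) for v in fhr]
    n = len(x)
    i = 0
    while i < n:
        if not math.isnan(x[i]):
            i += 1
            continue
        start = i
        while i < n and math.isnan(x[i]):
            i += 1
        end = i  # first valid index after the gap, or n if the gap runs to the end
        if start > 0 and end < n and (end - start) < max_gap:
            left, right = x[start - 1], x[end]
            span = end - start + 1
            for k in range(start, end):
                x[k] = left + (right - left) * (k - start + 1) / span
    return x


def spike_indices(fhr, max_jump=25.0):
    """Indices whose absolute difference from the previous sample exceeds max_jump bpm."""
    return [
        i for i in range(1, len(fhr))
        if not math.isnan(fhr[i]) and not math.isnan(fhr[i - 1])
        and abs(fhr[i] - fhr[i - 1]) > max_jump
    ]
```

Note that a single isolated spike is flagged twice by `spike_indices`, once on the jump up and once on the jump back, so a full implementation would need the stable-segment logic described above to decide which samples to replace.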

In this paper, each sample was analyzed as a fragment of an FHR signal in the last 20 min before delivery, and the pH value of umbilical artery blood corresponding to this time window was used as the sample label. Data partitioning was performed at the sample level to strictly ensure that all samples from the same patient appeared in only one of the training or test sets, thus avoiding the sample leakage problem introduced by the simultaneous participation of the same patient’s data in both training and testing.

As shown in Table 2, the signal changes in the three categories of samples during the preprocessing process were comparatively analyzed. Overall, the categories were at similar levels in terms of the frequency of spike occurrence, signal repair strength, and repair duration, and no trend was observed for the preprocessing process to impose significantly stronger interventions on a particular category of samples. Although the moderate acidosis group was slightly higher in some of the noise-related metrics, its repair scale was in the same order of magnitude range as that of the normal and severe groups, suggesting that the preprocessing operation mainly targeted nonphysiological abnormal fluctuations rather than specific pathological patterns. Further, the mean fetal heart rate levels of all three categories of samples returned to consistent and reasonable physiological intervals after pretreatment, indicating that the signal cleaning process was more about correcting sensor artifacts and transient abnormalities rather than altering the structure of physiological differences between the original categories. Taken together, these results suggest that the preprocessing step maintained a relatively balanced processing intensity across categories and did not produce disproportionate signal corrections for a few categories of samples, thereby reducing the likelihood of category-selective bias introduced by preprocessing.

2.3. Description of the Framework

2.3.1. Convolutional Neural Network (CNN)

Convolutional neural networks (CNNs) are a class of deep learning models specialized in processing data with grid-like topology, and their core idea is to achieve feature extraction and weight sharing within the local receptive field by sliding the convolution kernel over the input data. In the stage A model of this paper, a three-layer one-dimensional convolutional neural network (1D CNN) is used to model the fetal heart rate time-series signal. Batch Normalization and Dropout regularization are sequentially introduced after each layer of convolutional operation, where Batch Normalization is used to normalize the intermediate layer activations to accelerate the model’s convergence, and Dropout randomly discards neurons with a probability of 0.4 to mitigate overfitting. The convolution module is mainly used to capture local morphological features and short-term change patterns in the fetal heart signals to provide high-quality local feature representations for subsequent time-series modeling.

2.3.2. Bidirectional Long Short-Term Memory Network (BiLSTM)

Long short-term memory (LSTM) is a variant of the recurrent neural network designed for sequential data. By introducing gating mechanisms such as input gates, forget gates, and output gates, it is capable of modeling long-distance temporal dependencies and effectively alleviates the vanishing gradient problem of traditional recurrent neural networks, as shown in Figure 2. In this paper, a bidirectional LSTM (BiLSTM) structure is adopted, in which the forward network processes the sequence from beginning to end, while the backward network processes it from end to beginning; the hidden states of the two directions are then concatenated to form a complete context representation. The input dimension of the BiLSTM is set to 128, which is consistent with the output of the previous transition layer, and the number of hidden units is 64 per direction, which corresponds to a bidirectional output dimension of 128. This module is mainly used to capture the long-term temporal patterns in fetal heart signals, such as the global features of heart rate variability and rhythmic changes, so as to compensate for the limitations of the convolutional network in modeling long-range dependencies.

2.3.3. Attention Mechanisms

Attention mechanisms are inspired by the selective attention of the human visual system, and are used in deep learning to achieve adaptive feature weighting and selection by explicitly modeling the correlation between input features. The multi-head self-attention mechanism is a core component of the Transformer architecture, which learns different attention patterns through multiple parallel attention heads to enhance the representation capability of the model. This paper introduces a multi-head self-attention module containing 8 attention heads in the stage A model, with the embedding dimension set to 128 to align with the output dimension of the BiLSTM. By calculating the correlation between time steps in the BiLSTM output sequence, this module generates a weighted contextual representation that reflects the relative importance of different temporal regions for the overall discrimination, thus realizing adaptive enhancement of key temporal features.
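To make the idea of correlating time steps concrete, the sketch below computes scaled dot-product self-attention for a single head in plain Python. It is a simplification rather than the module used in stage A: the learned query, key, value and output projections are replaced by identity mappings, and only one head is shown. The constants at the end simply record that 8 heads over a 128-dimensional embedding give 16 dimensions per head.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Single-head scaled dot-product self-attention with identity projections.

    seq: list of T time steps, each a list of d features.
    Returns (context, weights), where weights[t] is the attention distribution
    of time step t over all time steps.
    """
    d = len(seq[0])
    scale = math.sqrt(d)
    context, weights = [], []
    for query in seq:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in seq]
        w = softmax(scores)
        weights.append(w)
        context.append(
            [sum(w[t] * seq[t][j] for t in range(len(seq))) for j in range(d)]
        )
    return context, weights


EMBED_DIM, NUM_HEADS = 128, 8          # values stated in the text
HEAD_DIM = EMBED_DIM // NUM_HEADS      # per-head dimension
```

Each row of `weights` sums to 1, so every output time step is a convex combination of the input time steps, weighted by similarity to the query step.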

2.3.4. Difficult Sample Identification Mechanism

Difficult sample identification, as shown in Figure 3, is a key link in the three-stage training framework of this paper. Its goal is to automatically screen out the samples that the model has learned poorly from the training results of stage A, without introducing additional manual annotation, so as to provide a data basis for subsequent targeted optimization. In this paper, we adopt a dual-criteria strategy for determining difficult samples, which takes both misclassification and confidence into account. Specifically, the trained stage A model is first run in forward propagation on the original training set, the predicted probability of each sample for its true category is taken as a confidence indicator, and the average confidence over the training set is computed. Subsequently, samples that satisfy either of the following conditions are judged to be difficult samples: (1) the category predicted by the model is inconsistent with the true label; or (2) the model classifies the sample correctly, but the corresponding confidence is lower than the average confidence of the training set. This dual filtering mechanism covers both clearly erroneous samples and ambiguous samples located near the decision boundary, thus ensuring that the stage B model focuses on learning more challenging sample features and improving the model’s discriminative ability in complex clinical scenarios.
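The dual-criteria rule can be sketched as follows. The function name and the data layout (one list of class probabilities per sample) are illustrative assumptions; the two criteria themselves are those described above.

```python
def find_difficult_samples(probs, labels):
    """Return (indices of difficult samples, mean true-class confidence).

    probs:  list of per-sample class-probability lists from the stage A model.
    labels: list of true class indices.
    A sample is difficult if it is misclassified, or if it is classified
    correctly but its true-class probability is below the training-set mean.
    """
    confidences = [p[y] for p, y in zip(probs, labels)]
    mean_conf = sum(confidences) / len(confidences)
    difficult = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        predicted = max(range(len(p)), key=p.__getitem__)
        if predicted != y or confidences[i] < mean_conf:
            difficult.append(i)
    return difficult, mean_conf
```

Because the threshold is the mean confidence of the training set, the share of samples passed on to stage B depends on how the stage A confidences are distributed rather than on a fixed quota.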

2.3.5. Depthwise Separable Convolution Module

Depthwise separable convolution is an efficient factorization of the traditional convolution operation that decomposes the standard convolution into two separate steps, Depthwise Convolution and Pointwise Convolution. Among them, Depthwise Convolution performs a convolution operation on each input channel separately, while Pointwise Convolution realizes cross-channel feature fusion by 1 × 1 convolution. Compared to traditional convolution, depthwise separable convolution reduces the number of parameters and the computational complexity while still maintaining a strong feature extraction capability. In this paper, this module is used as the basic building block in the signal encoder of stage B. It is especially suitable for training scenarios in which the number of difficult samples is relatively small, where it helps to reduce the risk of overfitting and improve training efficiency.
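The parameter saving can be checked with simple arithmetic. The sketch below counts the weights of a standard one-dimensional convolution and of its depthwise separable counterpart; the channel counts and kernel size used in the example are hypothetical and are not taken from the stage B configuration.

```python
def standard_conv1d_params(c_in, c_out, k, bias=False):
    """Weights of a standard 1D convolution: one k-tap filter per (input, output) pair."""
    return c_in * c_out * k + (c_out if bias else 0)


def depthwise_separable_conv1d_params(c_in, c_out, k, bias=False):
    """Weights of a depthwise (per-channel k-tap) plus pointwise (1 x 1) convolution."""
    depthwise = c_in * k + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 128 output channels, kernel size 7.
standard = standard_conv1d_params(64, 128, 7)
separable = depthwise_separable_conv1d_params(64, 128, 7)
ratio = separable / standard  # equals 1/c_out + 1/k when bias terms are ignored
```

For this hypothetical layer the standard convolution has 57,344 weights and the separable version 8,640, roughly 15% of the original.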

2.3.6. Squeeze-and-Excitation Network (SE-Net)

A squeeze-and-excitation network (SE-Net) is a channel-level attention mechanism designed to enhance the network’s representation capability by explicitly modeling the importance weights of feature channels. The module consists of two main steps, Squeeze and Excitation: first, the global information of each channel is compressed into a scalar descriptor by global average pooling; subsequently, fully connected layers with a bottleneck structure are used to learn the nonlinear dependencies between channels and generate channel weights between 0 and 1. In the stage B model, SEBlock1D is embedded after the depthwise separable convolutional module, which enables the network to adaptively emphasize the frequency bands and temporal patterns that are more relevant to fetal acidosis, thus enhancing the model’s ability to characterize and identify subtle pathological changes in difficult samples.
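The two steps can be illustrated with a plain-Python sketch of a one-dimensional SE block. The function name, the weight layout and the omission of bias terms are assumptions made for brevity; this is not the SEBlock1D implementation of the stage B model.

```python
import math


def se_block_1d(features, w1, w2):
    """Squeeze-and-excitation over a C x T feature map (lists of lists).

    features: C channels, each a list of T values.
    w1: (C // r) x C weights of the reducing layer (bottleneck, ratio r).
    w2: C x (C // r) weights of the expanding layer.
    Returns (rescaled features, channel gates in (0, 1)).
    """
    # Squeeze: global average pooling over time, one scalar per channel.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every channel by its gate.
    scaled = [[g * v for v in ch] for g, ch in zip(gates, features)]
    return scaled, gates
```

With all-zero weights every gate equals 0.5, so each channel is simply halved; trained weights would instead raise the gates of informative channels and lower those of uninformative ones.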

2.3.7. Adaptive Weighted Loss Function

In order to effectively alleviate the category imbalance problem prevalent in fetal heart monitoring data, this paper designs a dynamic weighted loss function based on recall feedback. The method continuously monitors the model’s recall performance for each category during the training process and dynamically adjusts the weights of each category in the loss function accordingly, so that the model can adaptively focus on the categories and samples that are difficult to be correctly identified during the optimization process, thus improving the discriminative ability for clinically important minority categories. The overall process and weight update strategy are described below.

1. Weight Initialization and Updating Target

In the initial stage of training, the loss weights of all categories are initialized to 1 to prevent the model from being disturbed by unreasonable weights early in training. At the end of each training epoch, the target weight $W_{c}$ of each category is constructed from the recall $R_{c}$ (where $c$ denotes the category index) computed by the model on the current training set. The design of the target weights follows the principle of “the lower the recall, the higher the weight”, and they are computed as follows:

W_{c} = \frac{\exp(\alpha \cdot (1 - R_{c}))}{\frac{1}{C} \sum_{j = 1}^{C} \exp(\alpha \cdot (1 - R_{j})) + \epsilon}

(1)

where $C$ is the total number of categories, $\alpha$ is the amplification factor (set to 1.5 in this paper), and $\epsilon$ is a very small constant ($10^{-12}$) that prevents the denominator from being zero. The normalization in the denominator ensures that the average weight across categories is approximately 1, thus avoiding dramatic fluctuations in the overall loss scale due to changes in the weights.
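Equation (1) translates directly into code. The sketch below uses the values of the amplification factor and the small constant stated in the text; the function name and the example recall values used to describe it are illustrative.

```python
import math


def target_weights(recalls, alpha=1.5, eps=1e-12):
    """Target class weights from per-class recall, following Equation (1)."""
    raw = [math.exp(alpha * (1.0 - r)) for r in recalls]
    mean_raw = sum(raw) / len(raw)
    return [v / (mean_raw + eps) for v in raw]
```

For example, with hypothetical recalls of 0.9, 0.5 and 0.3 for the normal, moderate and severe classes, the class with the lowest recall receives the largest weight, while the weights still average to approximately 1.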

2. Smoothing Update Strategy

In order to improve the stability of the training process and prevent the weights from fluctuating drastically between neighboring epochs, this paper adopts a phased smoothing update strategy, so that the weights actually applied in the loss, denoted $\tilde{W}_{c}$, gradually transition from the initial weights $W_{initial}$ to the target weights $W_{c}$ of Equation (1). During the preset warm-up epochs (15 epochs), the weights are updated by linear interpolation:

\tilde{W}_{c} = (1 - \lambda) \cdot W_{initial} + \lambda \cdot W_{c}

(2)

\lambda = \frac{epoch}{warmup\_epochs}

(3)

This strategy aims to prevent the model’s basic ability to model the overall data distribution from being compromised by over-regulation of the weights during the initial learning phase.

At the end of the warm-up phase, the weight update is switched to the exponential moving average (EMA) form to further smooth out weight changes and enhance training stability:

\tilde{W}_{c}^{(t)} = \beta \cdot \tilde{W}_{c}^{(t - 1)} + (1 - \beta) \cdot W_{c}^{(t)}

(4)

where the superscripts $(t - 1)$ and $(t)$ denote the previous and the current epoch, $\tilde{W}_{c}^{(t - 1)}$ is the weight applied in the previous epoch, $W_{c}^{(t)}$ is the target weight computed from the current recall, and $\beta$ is the smoothing coefficient (set to 0.3).
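The two-phase schedule can be sketched as a single update function. Two points are assumptions of this sketch rather than statements from the text: epochs are counted so that the interpolation factor reaches 1 at the last warm-up epoch, and the EMA step is read as blending the weight applied in the previous epoch (coefficient β) with the current target weight (coefficient 1 − β).

```python
def smooth_weights(prev, target, epoch, warmup_epochs=15, beta=0.3, initial=1.0):
    """Weights applied in the loss for the given epoch.

    prev:   weights applied in the previous epoch (used after warm-up).
    target: target weights computed from the current per-class recall.
    """
    if epoch <= warmup_epochs:
        # Warm-up: linear interpolation from the initial weight to the target.
        lam = epoch / warmup_epochs
        return [(1.0 - lam) * initial + lam * t for t in target]
    # After warm-up: exponential moving average towards the target.
    return [beta * p + (1.0 - beta) * t for p, t in zip(prev, target)]
```

With β = 0.3 under this reading, 70% of each post-warm-up update comes from the current target, so the applied weights follow the recall feedback closely while still damping epoch-to-epoch jumps.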

3.: Integration with Loss Functions

Finally, the dynamically computed category weights W_{c} are introduced into the loss. For a training batch containing N samples, the weighted cross-entropy loss is defined as follows:

L = - \frac{1}{N} \sum_{i = 1}^{N} W_{y_{i}} \cdot \log ({\hat{y}}_{i, y_{i}})

(5)

where W_{y_{i}} is the weight of the true category y_{i} of the i-th sample and {\hat{y}}_{i, y_{i}} denotes the predicted probability that the i-th sample belongs to y_{i}. Through this mechanism, the model imposes a larger gradient penalty on misclassifications of low-recall categories during backpropagation, which guides the network to keep improving its recognition of difficult categories.
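The recall feedback weighting described in this subsection can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the initial weights are assumed to be uniform (all 1), epochs are assumed to be counted from 0, and the EMA step is read as a blend of the current weight with the target weight from Equation (1).

```python
import math

def target_weights(recalls, alpha=1.5, eps=1e-12):
    """Equation (1): boost low-recall classes, normalized so the mean weight is about 1."""
    raw = [math.exp(alpha * (1.0 - r)) for r in recalls]
    mean_raw = sum(raw) / len(raw)
    return [x / (mean_raw + eps) for x in raw]

def smoothed_weights(current, target, epoch, warmup_epochs=15, beta=0.3, initial=None):
    """Linear interpolation during warm-up (Eqs. 2-3), EMA afterwards (Eq. 4 as read here)."""
    if epoch < warmup_epochs:
        lam = epoch / warmup_epochs
        # Assumption: uniform initial weights unless the caller supplies others.
        start = initial if initial is not None else [1.0] * len(target)
        return [(1 - lam) * s + lam * t for s, t in zip(start, target)]
    return [beta * c + (1 - beta) * t for c, t in zip(current, target)]

def weighted_cross_entropy(probs, labels, weights):
    """Equation (5): batch mean of -W[y_i] * log(p_i[y_i])."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(p[y])
    return total / len(labels)

# Hypothetical per-class recalls (normal, moderate, severe) after one epoch.
recalls = [0.90, 0.30, 0.20]
targets = target_weights(recalls)
weights = smoothed_weights([1.0, 1.0, 1.0], targets, epoch=5)
loss = weighted_cross_entropy([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]], [0, 2], weights)
print(targets, weights, loss)
```

Because the weights are recomputed from the recalls observed on the training set, a class whose recall recovers automatically loses its boost in later epochs.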

2.3.8. Meta-Feature Fusion-Based Multi-Model Classification Fusion Framework

Stage C serves as the final decision fusion module of the three-stage classification framework. Its core goal is to output classification results that are more robust than those of any single model by learning the complementary relationships between the classification behaviors of different models. This stage does not process raw signals or clinical features directly; instead, it acts as an intelligent “decision coordinator” that models and fuses the outputs of stage A and stage B. A 29-dimensional meta-feature vector (shown in Table 3) is constructed from the classification results of stage A and stage B, which encodes a combination of category probability distributions, classification confidence, uncertainty measures, and model classification consistency. These meta-features are used as inputs for training the meta-feature fusion model. A leave-one-fold cross-validation strategy is used during training to enhance the generalization ability of the model across different data subsets.

To strictly prevent stage C from seeing any information that is only available at test time, the sample indexes of the validation set are saved for each fold during five-fold cross-validation; stage A and stage B output classification results for identical validation samples under the same fold; and stage C uses only the out-of-fold classification results of all folds, together with their sample indexes, to construct its meta-training set. This avoids leakage of test set information, enhances generalization across data subsets, and keeps the evaluation conditions consistent across stages.

This design improves the robustness and classification stability of the model, enables the model to adaptively balance the advantages of different models on different sample subsets, and corrects the possible bias of individual models through the learned fusion strategy, thus providing more reliable decision support for automated grading and clinical assessment of fetal acidosis.
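The out-of-fold construction described in this subsection can be illustrated as follows. It is a sketch under stated assumptions: the function name `leave_one_fold_splits`, the tuple layout of a meta-sample, and the toy values are hypothetical and only show how the target fold is kept out of the meta-training set.

```python
def leave_one_fold_splits(oof_by_fold):
    """For each target fold t: train on OOF meta-samples of folds f != t, evaluate on fold t."""
    splits = {}
    for t in oof_by_fold:
        train = [s for f, samples in oof_by_fold.items() if f != t for s in samples]
        splits[t] = (train, list(oof_by_fold[t]))
    return splits

# Toy data: each meta-sample is (sample_index, meta_features, label).
oof = {
    0: [(0, [0.7, 0.2, 0.1], 0), (1, [0.1, 0.6, 0.3], 1)],
    1: [(2, [0.3, 0.3, 0.4], 2)],
    2: [(3, [0.8, 0.1, 0.1], 0), (4, [0.2, 0.2, 0.6], 2)],
}
splits = leave_one_fold_splits(oof)
for t, (train, test) in splits.items():
    train_ids = {s[0] for s in train}
    test_ids = {s[0] for s in test}
    # No sample of the target fold appears in its own meta-training set.
    assert train_ids.isdisjoint(test_ids)
```

Saving the sample indexes per fold, as the text describes, is what makes this disjointness check possible.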

2.4. Algorithm Description

In this paper, we propose a signal-first, conditionally enhanced, three-stage learning framework to systematically address the prevalent category imbalance and difficult-sample classification problems in fetal heart monitoring signals, as shown in Figure 4. The training process follows a strict fold-wise out-of-fold (OOF) design to avoid information leakage. For each fold k (using five-fold cross-validation), the stage A base model is first trained on the training set Train_k of that fold; its CNN-BiLSTM-attention architecture extracts discriminative features from the fetal heart signals and outputs category probabilities and related information. A dynamic category-weighted loss function with an early stopping strategy is introduced to alleviate category imbalance during training. Subsequently, the difficult sample set D_k is selected based only on the output of stage A on Train_k (samples with below-average confidence together with misclassified samples), strictly without using any information from Val_k. The stage B difficult-sample model is trained on D_k; it models the signal features intensively with depthwise separable convolution and a squeeze-and-excitation attention mechanism, and fuses the difficult-sample signals with their corresponding clinical features. Although its training data are restricted to D_k, stage B generates probabilistic outputs for the entire validation set during inference so that unified meta-features can be constructed. For each fold, the probability vectors, confidence, entropy, consistency, and other statistics generated by stage A and stage B on the validation set Val_k are concatenated into meta-features and saved as fold-level OOF meta-samples, and only the class information of Val_k is used for training stage C.
Finally, the stage C meta-feature fusion network, MetaFusionModel, takes the 29-dimensional meta-features as input and makes adaptive decisions based on model classification preference and uncertainty. It is trained with the “leave-one-fold method”: for a target fold t, whose sample indexes are consistent with the validation set used by stage A and stage B, the meta-samples of fold t are set aside; all OOF meta-samples from folds f ≠ t are assembled to train a meta-model; and the trained meta-model then classifies the meta-features constructed from the outputs of A_t and B_t on fold t, on which the metrics are reported. This process ensures that stage C is exposed only to the OOF predictions of other folds during training and sees no label or prediction information from the target fold, thus avoiding cross-stage and cross-fold information leakage. Algorithm 1 gives the pseudo-code of the three-stage training.

Algorithm 1 Three-Stage Pipeline (module-level, with intra-module flow and inter-module data transfer)

Stage A—Base Signal Model
Inputs: X_signal
Outputs: P_A(all samples), D_k (hard-sample index set)
Model architecture: CNN → BiLSTM → Attention → FC → Softmax

1: Extract temporal features: F_conv = CNN(X_signal)
2: Model temporal dependency: F_seq = BiLSTM(F_conv)
3: Emphasize informative segments: F_att = Attention(F_seq)
4: Predict class probabilities: P_A = Softmax(FC(F_att))
5: Recall feedback dynamic class weighting
(1) Compute recall:

R_{c} = {TP}_{c} / ({TP}_{c} + {FN}_{c})

(2) Smooth recall over time to stabilize training:

{\tilde{R}}_{c} \leftarrow β {\tilde{R}}_{c} + (1 - β) R_{c}

(3) Compute a reference recall level across classes:

R_{ref} = Mean ({\tilde{R}}_{c})

(4) Adjust class weights according to recognition difficulty:

w_{c} \leftarrow w_{c} \times \exp (α (R_{ref} - {\tilde{R}}_{c}))

(5) Use the updated w_{c} in the loss function for subsequent training
6: After training, identify hard samples on the training set:
D_k = {misclassified samples ∪ low-confidence samples}

Stage B—Hard-Sample Fusion Model
Inputs: X_signal[D_k], X_tab[D_k]
Outputs: P_B(all samples)
Model architecture:
Signal branch: CNN → BiLSTM → Attention
Tabular branch: MLP
Fusion: Concatenation → FC → Softmax

1: Use only hard samples D_k selected by stage A for training
2: Extract signal features:
F_sig = Attention(BiLSTM(CNN(X_signal[D_k])))
3: Extract clinical features: F_tab = MLP(X_tab[D_k])
4: Fuse multimodal features: F_fuse = Concat(F_sig, F_tab)
5: Predict refined probabilities: P_B = Softmax(FC(F_fuse))

Stage C—Meta-Fusion Model
Inputs: P_A(all samples), P_B(all samples)
Outputs: Final prediction Y
Model architecture: Meta-learner (MLP)

1: Construct meta-features for each sample:
M = [P_A, P_B, confidence_A, confidence_B, prediction_agreement]
2: Train the meta-learner using out-of-fold predictions from stages A and B
3: For test samples: Y = MetaModel(M_test)
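Step 1 of stage C can be made concrete with a small sketch. The feature set below (two probability vectors, two confidences, two entropies, and an agreement flag, 11 values for three classes) is an assumption chosen for illustration; it is not the 29-dimensional vector of Table 3, whose exact composition is not reproduced here. The function name `meta_features` is hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a probability vector; zero entries contribute 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def meta_features(p_a, p_b):
    """Concatenate stage A / stage B probabilities with confidence, entropy, and agreement."""
    conf_a, conf_b = max(p_a), max(p_b)
    pred_a, pred_b = p_a.index(conf_a), p_b.index(conf_b)
    agreement = 1.0 if pred_a == pred_b else 0.0
    return list(p_a) + list(p_b) + [conf_a, conf_b, entropy(p_a), entropy(p_b), agreement]

# Stage A is confident in "normal"; stage B leans towards "severe acidosis".
m = meta_features([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
print(len(m), m[-1])
```

A disagreement flag of 0 combined with high entropy in one stage is the kind of signal a meta-learner can use to decide which stage to trust for a given sample.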

2.5. Overview of the Overall Framework

The three-stage learning framework proposed in this paper is not a simple stacking of multiple techniques, but a systematic structural design targeted at the two core problems in classifying fetal heart monitoring signals, namely severe category imbalance and insufficient discrimination of difficult samples. The overall method follows a unified signal analysis process: first, the original CTG signal is time-synchronized and quality-controlled to eliminate long consecutive missing segments; repairable short missing segments are interpolated and smoothed; and segmentation and windowing are then performed to extract features on a unified time scale. This preprocessing and feature construction not only provide stable inputs for the base model, but also lay the measurement basis for difficult sample identification and uncertainty modeling in the subsequent stages.

Stage A (base modeling) aims to establish an initial perception of all samples. To address the extreme scarcity of acidosis samples, the model introduces a dynamic weighted loss function based on real-time recall feedback during training, which adaptively boosts the gradient contribution of the minority classes without changing the original data distribution, so that the model attends to the minority classes without overly favoring the majority class. Through this mechanism, the model first develops a relatively balanced discriminative ability. However, boundary samples that are misclassified or predicted with low confidence still occur in stage A. These samples reflect fine-grained differences and decision uncertainty between classes, and they are the key target of subsequent optimization.

The core idea of stage B (difficult sample enhancement) is to strengthen the model specifically where stage A proved weak. By identifying difficult samples within the training set and constructing a specialized subtask, the model can learn finer discriminative boundaries in a more focused sample space. This stage combines structural optimization and feature enhancement mechanisms to improve responsiveness to pathological patterns, thus improving the recognition of minority classes, especially severe acidosis samples. It should be emphasized that stage B trains only on the subset of difficult samples but uniformly outputs predictive probabilities and uncertainty metrics for validation and test samples at inference time. These outputs are used for subsequent fusion modeling, which strengthens discriminative ability while maintaining strict isolation of the data flow and avoiding information leakage.

Stage C (meta-feature fusion) assumes the function of decision coordination. It no longer deals with the original signal directly, but performs modeling based on the meta-information of probability distribution, confidence and uncertainty output from stage A and stage B, and learns the reliability of different models under different sample conditions. Compared with the traditional simple voting or fixed weighting strategies, meta-fusion can dynamically adjust the decision weights based on the sample-level features, which improves the sensitivity of the minority class while maintaining the stability of the majority class. The three stages form a progressive structure: stage A solves the problem of “whether it can be recognized”, stage B solves the problem of “whether it is accurate”, and stage C solves the problem of “how to weigh”. Complementary functions and strict separation of data flow form a complete learning framework from coarse to fine, from local reinforcement to global coordination.

3. Experiments

3.1. Experimental Environment and Parameter Configuration

All experiments in this paper are completed on a computing platform equipped with an NVIDIA RTX 3090 GPU and an Intel Xeon Platinum 8358P CPU (15 vCPU @ 2.60 GHz). The model’s implementation is based on the PyTorch 2.7.1+cpu deep learning framework and is GPU-accelerated using CUDA 11.1 with cuDNN 8.0, and the experimental code runs in a Python 3.11 environment.

The network structure, hyperparameter configuration and training strategy of the model in each stage are given in detail in Table 4, and the input and output dimensions of each module are labeled to guarantee reproducibility and the logical integrity of the model structure. In stage A, the input of the one-dimensional convolutional layer is the preprocessed single-channel fetal heart signal; the feature dimension is expanded layer by layer to 64, mapped to 128 by the transition convolution, and fed to the BiLSTM module. The BiLSTM hidden dimension is set to 64, giving 128-dimensional temporal features after the two directions are concatenated. A multi-head self-attention mechanism then applies temporal weighting in this dimension, and a fully connected layer finally compresses the result into the three-class output. Stage B employs a depthwise separable convolution module whose number of channels is progressively expanded to 64, combined with an SE attention module that recalibrates weights in the channel dimension; the clinical tabular features are encoded into 64-dimensional hidden vectors by a two-layer fully connected network and concatenated with the signal branch outputs to achieve multimodal fusion via the 192-dimensional joint representation layer. The meta-network of stage C takes the 29-dimensional meta-feature vector as input, passes it sequentially through layers of 128, 64, and 32 dimensions, and finally outputs the three-class decision.

All experiments use a maximum of 500 training rounds (epochs) with early stopping, with the patience value set to 100, to prevent overfitting and enhance training stability. Model evaluation adopts a five-fold cross-validation strategy in which stage A, stage B and stage C always use the same validation set within each fold, which strictly avoids the risk of information leakage due to inconsistent data division and ensures the comparability of results across stages and the rigor of the evaluation process.
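The interaction between the epoch budget and the patience value can be sketched as below. The monitored quantity is not specified in the text, so this sketch assumes a validation metric where higher is better; the class and function names are hypothetical.

```python
class EarlyStopping:
    """Stop when the monitored metric has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=100):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch; return True when training should stop."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def run(metrics, max_epochs=500, patience=100):
    """Replay a list of per-epoch metrics; return (epochs actually run, best metric)."""
    stopper = EarlyStopping(patience)
    for epoch, value in enumerate(metrics[:max_epochs], start=1):
        if stopper.step(value):
            return epoch, stopper.best
    return min(len(metrics), max_epochs), stopper.best

# Small patience so the behaviour is visible on a short, made-up metric curve.
print(run([0.50, 0.60, 0.55, 0.58, 0.57], patience=2))
```

With a patience of 100 and a budget of 500 epochs, training runs for at least 100 epochs and stops early only after 100 consecutive epochs without improvement.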

3.2. Evaluation Criteria

To comprehensively evaluate the performance of the proposed three-stage learning framework on the fetal acidosis classification task, four widely used evaluation metrics are selected: F1 score, accuracy, recall and precision. These metrics characterize the classification performance of the model from different aspects. In view of the category imbalance in the CTU-CHB dataset, this paper focuses on the F1 score, which balances precision and recall and so prevents a bias towards the majority category from hiding poor performance on the clinically important minority categories. The definition and formula of each metric, and how it is applied in this paper, are given below.

1. Accuracy

Accuracy is the most intuitive evaluation index in the classification task, indicating the proportion of samples correctly classified by the model to the total samples. Its calculation formula is as follows:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(6)

2. Precision

Precision measures the reliability of the model’s positive predictions, i.e., the proportion of samples classified as positive that are actually positive. In this paper, precision is used to assess how reliably the model assigns samples to each category (especially the minority categories), so as to avoid unnecessary interventions triggered by false positives in clinical applications. The formula is as follows:

Precision = \frac{TP}{TP + FP}

(7)

3. Recall

Recall measures the model’s ability to cover the positive samples, i.e., the proportion of actually positive samples that are correctly classified. The level of recall directly reflects the sensitivity of the model to minority classes (e.g., severe acidosis). The formula is as follows:

Recall = \frac{TP}{TP + FN}

(8)

4. F1 Score

The F1 score is the harmonic mean of precision and recall. It reflects the overall performance of the model and is especially valuable on category-imbalanced data. The F1 score ranges from 0 to 1, and a higher value indicates a better balance between precision and recall. The formula is as follows:

F1-score = 2 \times \frac{Precision \times Recall}{Precision + Recall}

(9)

The macro-averaged F1 (Macro-F1) is the arithmetic mean of the F1 scores of the three categories and is used as a composite measure of how evenly the model discriminates among categories on unbalanced data.

TP (True Positive) is the number of positive samples correctly classified as positive, TN (True Negative) is the number of negative samples correctly classified as negative, FP (False Positive) is the number of negative samples incorrectly classified as positive, and FN (False Negative) is the number of positive samples incorrectly classified as negative.
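The metrics above can be computed directly from a multi-class confusion matrix by treating each category in turn as the positive class. The sketch below uses a made-up confusion matrix and hypothetical function names; it also shows why accuracy alone is misleading when a model predicts only the majority class.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(cm)
    results = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results.append((precision, recall, f1))
    return results

def accuracy(cm):
    """Share of all samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_metrics(cm)
    return sum(f1 for _, _, f1 in scores) / len(scores)

# Made-up model that predicts "normal" for every sample (rows: normal, moderate, severe).
majority_only = [[10, 0, 0], [5, 0, 0], [5, 0, 0]]
print(accuracy(majority_only), macro_f1(majority_only))
```

In this toy case accuracy is 0.5 while the F1 of both acidosis classes is 0, which is the situation Macro-F1 is meant to expose.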

3.3. Sensitivity Analysis

3.3.1. Sensitivity Analysis of Missing Value Duration Thresholds

To assess the impact of the missing value interpolation threshold on classification performance during fetal heart signal preprocessing, this study sets five control thresholds (5 s, 10 s, 20 s, 25 s, and 30 s) and compares them side by side with the 15 s threshold used in this paper, as shown in Table 5 (comparison of missing value duration thresholds). The experimental results show that accuracy under the control thresholds ranges from 78.81% to 80.81%, with no significant monotonic trend or drastic jumps, indicating that the model is not strongly sensitive to the setting of the missing threshold in a statistical sense. The 15 s threshold selected in this paper is a commonly used standard in the clinical literature and achieved the best accuracy, 82.80%, in this experiment. This threshold preserves the available signal segments while avoiding the false smoothing introduced by overly long interpolation, thereby balancing signal integrity and physiological realism.

3.3.2. Threshold Sensitivity Analysis for Difficult Sample Recognition

To assess threshold sensitivity, this study compared several confidence thresholds (0.3, 0.4, 0.5, and 0.6) and entropy thresholds (0.80, 0.85, 0.90, and 0.95) while always including misclassified samples, as shown in Table 6. The accuracy is 79.91–81.35% across confidence thresholds and 79.53–80.80% across entropy thresholds, with a limited range of fluctuation and no monotonic dependence or critical jumps, indicating that stage B is not strongly sensitive to the threshold setting. Based on this, this paper adopts the average confidence of the training set as a dynamic screening threshold. This threshold adjusts adaptively during training without manual presetting and stabilized in the interval 0.60–0.65 in the five-fold cross-validation, with a final accuracy of 82.80%, which is better than all fixed threshold combinations. Because entropy screening is highly correlated with confidence, no separate entropy threshold is set, which avoids redundant tuning.
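The dynamic screening rule (misclassified samples plus samples whose confidence falls below the training-set average) can be sketched as follows. The function name and the toy probabilities are hypothetical, and treating "below average" as a strict inequality is an assumption.

```python
def select_hard_samples(probs, labels):
    """Return (indexes of hard samples, mean-confidence threshold) for stage A training outputs."""
    confidences = [max(p) for p in probs]
    threshold = sum(confidences) / len(confidences)  # dynamic threshold: mean confidence
    hard = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        predicted = p.index(max(p))
        if predicted != y or confidences[i] < threshold:
            hard.append(i)
    return hard, threshold

# Toy stage A outputs on four training samples (three classes).
probs = [
    [0.90, 0.05, 0.05],  # correct and confident: kept out of D_k
    [0.50, 0.30, 0.20],  # correct but below-average confidence
    [0.20, 0.70, 0.10],  # misclassified (true class 0)
    [0.40, 0.35, 0.25],  # misclassified (true class 1)
]
labels = [0, 0, 0, 1]
hard, threshold = select_hard_samples(probs, labels)
print(hard, threshold)
```

Because the threshold is the mean of the current confidences, it moves with the model during training instead of being fixed in advance.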

3.4. Ablation Experiments

To systematically assess the contribution of each module to the overall performance of the model, a series of ablation experiments was designed, the results of which are shown in Table 7. In each row of the table, the per-category metrics are arranged from top to bottom in the order “Normal–Moderate Acidosis–Severe Acidosis” and report the F1 score, precision and recall of each category, followed by the macro-averaged F1 (Macro-F1) and the overall accuracy. The experiments clearly reveal the key role of each component in the proposed three-stage fusion framework (TS-MFF).

When only the unweighted cross-entropy loss is used (A + weightless), the model is completely biased towards the majority class (normal) and fails to recognize the moderate and severe acidosis categories (F1 of 0% in both cases), which directly reflects the dominant effect of severe category imbalance on the model. After the introduction of the dynamic weighting loss based on real-time recall feedback (A + Dynamic weighting), the model’s focus on the minority categories improves markedly: the F1 scores of moderate and severe acidosis rise to 28.41% and 23.62%, respectively, and the macro-averaged F1 increases from 29.53% to 46.97%. This suggests that the dynamic weighting strategy effectively mitigates the category imbalance problem and directs the model towards the clinically important minority category samples.

Ablation experiments for stage B further reveal the design-specific contributions and trade-offs in performance. When the depthwise separable convolution is removed, the macro-averaged F1 of the model reaches 48.67%, despite a decrease in overall accuracy. More critically, this configuration achieves an F1 score of 36.64% for severe acidosis, a clear improvement over the 23.62% of the dynamically weighted model (A + Dynamic weighting). Similarly, removing the SE attention module or restricting stage B to a signal-only model shows a consistent pattern: the F1 for severe acidosis reaches 38.28% and 24.34%, respectively, both higher than that of the dynamically weighted model (23.62%). Together, these results show that the enhanced design of “Depthwise Separable Convolution + SE Attention + Multimodal Fusion” adopted in stage B is central to improving fine-grained discrimination of the minority categories, especially severe acidosis. This focused optimization on difficult samples inevitably changes the decision boundary of the model, which may interfere with the classification of the far more numerous majority class samples and thus appears as a decrease in overall accuracy. This reflects a deliberate compromise in the design of stage B: by strengthening the model’s ability to discriminate the minority classes, it prioritizes classification performance for clinically high-risk categories (e.g., severe acidosis), even at some cost in overall accuracy.

The meta-feature fusion mechanism adopted in stage C effectively integrates the advantages of the first two stages. Compared with traditional ensemble methods (e.g., stacking, with a macro-averaged F1 of 47.20%), the meta-feature fusion method in this paper explicitly models the classification probability, confidence and uncertainty information in the outputs of stage A and stage B. As a result, the macro-averaged F1 is further improved to 53.78%, while the overall accuracy is restored to 82.80%. This result demonstrates that the meta-feature fusion mechanism not only combines the stability of the base model (stage A) with the discriminative power of the difficult sample enhancement model (stage B), but also makes adaptive adjustments at the decision level, which restores the overall classification accuracy and robustness of the model while enhancing its ability to recognize the minority classes.

3.5. Generalization Experiments

3.5.1. Dynamically Weighted Generalization Test

To verify the generalization performance of the proposed method under different data distributions and category imbalance scenarios, this paper further conducts generalization experiments on the MIT-BIH [41] arrhythmia dataset. This dataset is an authoritative public benchmark in the field of electrocardiographic (ECG) signal analysis and contains 48 dual-channel ambulatory ECG recordings (each about half an hour, sampled at 360 Hz) with beat-by-beat heartbeat types labeled by experts. The distribution of heartbeat types in the dataset is highly uneven: normal beats (N) predominate, while some types have very few samples. This distribution resembles the imbalance among fetal acidosis categories in the CTU-CHB dataset, and the dataset is therefore often used to assess the robustness and generalization ability of algorithms on real, imbalanced clinical data. In this paper, the continuous ECG signals are segmented with a 10 s window and a five-category task is defined. Since class F contains only two samples after segmentation, it cannot be split into meaningful training and testing partitions in cross-validation, and the model cannot learn this class reliably. Therefore, to safeguard the validity of the experiment and the interpretability of the results, this paper excludes category F from the subsequent analysis and uses only the remaining four categories to evaluate generalization performance.

SMOTE (Synthetic Minority Over-sampling Technique) is a classical data-level category balancing algorithm whose core idea is to alleviate the imbalance of the category distribution by interpolating between minority category samples in the feature space to synthesize new samples. Specifically, for each minority class sample, SMOTE randomly selects a sample from its k (k = 5) nearest neighbors and then generates a new sample by randomly selecting a point on the line segment connecting the two points. This method increases the diversity of minority class samples and avoids the risk of overfitting caused by simply copying samples. However, SMOTE may generate samples with no practical clinical significance, especially in high-dimensional or sparse feature spaces, and the synthetic samples may fall into the majority class region and introduce noise, thus reducing the generalization performance of the classifier on the real test set.
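The interpolation step of SMOTE can be written out in a few lines. This is a simplified sketch of the core idea only (one synthetic sample per call, brute-force neighbor search, squared Euclidean distance); the function name and the toy points are hypothetical, and it is not the implementation used in the experiments.

```python
import random

def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point between a random minority sample and one of its k nearest neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority samples by squared Euclidean distance to the chosen sample.
    others = [x for x in minority if x is not base]
    others.sort(key=lambda x: sum((a - b) ** 2 for a, b in zip(base, x)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment from base to neighbour
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Toy two-dimensional minority class.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
print(smote_sample(minority, k=5, rng=random.Random(42)))
```

Every synthetic point is a convex combination of two real minority samples, which is why it can land in a region occupied by the majority class when the minority samples are scattered.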

Focal Loss deals with class imbalance at the loss function level by dynamically adjusting per-sample loss weights so that the model focuses more on difficult-to-classify samples. It introduces a modulation factor (1 − p_{t})^{γ} on top of the standard cross-entropy loss, where p_{t} is the probability the model assigns to the true category and γ is the focusing parameter (in this paper, γ = 2). This factor decreases the loss contribution of easy-to-classify samples and relatively increases the weight of hard-to-classify samples. In addition, Focal Loss usually includes a category weighting factor α (α = 0.25) to further regulate the importance of different categories; in this experiment, α is set based on the category frequency. Focal Loss is particularly suitable for tasks with extreme category imbalance, but its performance is sensitive to the hyperparameters γ and α, and on severely imbalanced multi-class medical data it may still need to be combined with other strategies to stabilize the optimization direction.
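The effect of the modulation factor is easiest to see numerically. The sketch below evaluates the per-sample focal loss for one confident and one uncertain prediction; it uses a single scalar α for simplicity, whereas the experiment sets α from the category frequency, and the function name is hypothetical.

```python
import math

def focal_loss(p_t, gamma=2.0, alpha=0.25):
    """Per-sample focal loss: -alpha * (1 - p_t)^gamma * log(p_t)."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p_t):
    """Standard per-sample cross-entropy, for comparison."""
    return -math.log(p_t)

for p_t in (0.9, 0.1):
    # Ratio focal / cross-entropy equals alpha * (1 - p_t)^gamma.
    print(p_t, focal_loss(p_t), focal_loss(p_t) / cross_entropy(p_t))
```

With γ = 2, the modulation factor is 0.01 for p_t = 0.9 and 0.81 for p_t = 0.1, so the easy sample is down-weighted 81 times more strongly than the hard one.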

CB Loss (Class-Balanced Loss) is based on the concept of “effective number of samples” and derives the setting of class weights from the theoretical level. The loss function considers that as the number of samples increases, the marginal benefit of adding new samples decreases. Therefore, the category weight should be inversely proportional to the effective sample size of the category, which is calculated as follows:

w_{c} \propto \frac{1 - β}{1 - β^{n_{c}}}

(10)

where n_{c} is the number of samples for category c and β (β = 0.99) is a hyperparameter.

In this way, CB Loss gives higher loss weights to categories with small sample sizes, so that optimization pays more attention to them. Compared with simple inverse frequency weighting, CB Loss takes into account the information redundancy caused by data overlap and provides more reasonable weight estimates, but its theoretical assumptions may not hold when the sample size is very small or the feature overlap between categories is complex.
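Equation (10) can be evaluated directly for a given set of class counts. In the sketch below the counts are made up for illustration, the function name is hypothetical, and rescaling the weights so that they sum to the number of classes is an assumption (a common convention) rather than something stated in the text.

```python
def class_balanced_weights(counts, beta=0.99):
    """Weights proportional to (1 - beta) / (1 - beta^n_c), rescaled to sum to the number of classes."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

# Hypothetical class counts: a large majority class and two small minority classes.
counts = [400, 100, 20]
print(class_balanced_weights(counts))
```

For large n_c the term β^{n_c} approaches 0, so the raw weight saturates near 1 − β; this is how the "effective number of samples" stops rewarding ever larger classes.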

To test the robustness and generalizability of the proposed dynamic weighting strategy on different kinds of imbalanced sequence data, this study conducted comparative experiments on the publicly available MIT-BIH arrhythmia dataset. Table 8 reports the metrics: each column shows the F1 score, precision and recall of a method on each category, listed from top to bottom, followed by the macro-averaged F1 and the overall accuracy. All results are based on five-fold cross-validation, and the summary metrics are the mean and standard deviation of the per-fold results.

The data in the table show the advantage of this paper’s method in improving minority class recognition. On the CTU-CHB dataset, compared with the unweighted baseline, the macro-averaged F1 of the present method improves substantially, from 29.53% to 46.97%. Although overall accuracy decreases slightly (from 79.53% to 78.81%), this is an expected trade-off for improving the detection of the clinically critical minority categories (acidosis). The unweighted model was completely ineffective on these categories, with an F1 of 0% for both moderate and severe acidosis, whereas the present method improved the F1 scores for the two categories to 28.41% and 23.62%, respectively, demonstrating that dynamic weighting guides the model towards clinically important but scarce samples. On the MIT-BIH dataset, the macro-averaged F1 score of the present method, 83.59%, also outperforms all the other compared methods. In particular, its F1 score of 51.16% for the most difficult minority class exceeds SMOTE’s 36.64% and Focal Loss’s 0%. These class-by-class results confirm that the proposed dynamic weighting mechanism achieves an effective balance between generalizability and discriminative power on imbalanced data by adaptively adjusting the learning focus.

3.5.2. Statistical Testing

To quantitatively assess whether there is a significant performance difference between the dynamic weighting method in this paper and other class-imbalance handling strategies, this study uses the Wilcoxon signed-rank test for paired statistical analysis, as shown in Table 9. This test is a nonparametric method that does not assume normally distributed data and is suitable for comparing performance indicators with small samples and repeated cross-validation. Using the per-fold average F1 as the benchmark index, the scores of this paper’s method are paired fold by fold with those of the unweighted baseline, inverse frequency weighting, SMOTE, Focal Loss, and CB Loss over the five cross-validation folds. The results show that the p-value between this paper’s method and each of the compared methods is 0.03125 (less than 0.05) on both the CTU-CHB and MIT-BIH datasets, which reaches statistical significance. This result further supports the conclusion that the improvement in minority class recognition from the proposed dynamic weighting strategy is not an accidental fluctuation but a statistically significant and stable advantage.

3.6. Comparison of Existing Models

In this section, the proposed three-stage framework (TS-MFF) is compared with a variety of state-of-the-art temporal classification models, and all results are means ± standard deviations over five-fold cross-validation. In Table 10, “F1 Score”, “Accuracy”, “Precision”, and “Recall” are the arithmetic means of the corresponding indicators for each category on each fold, while “Macro-F1” is the macro-average (F1 scores are calculated category by category and then averaged over the categories). Accuracy is the proportion of all samples that are correctly categorized; precision is the proportion of samples predicted to be positive by the model that are actually positive. The overall accuracy and the macro-averaged precision take the same value (82.80 ± 2.82%). This is because, under the five-fold cross-validation setting, the majority “normal” samples dominate the validation set and the model predicts them very consistently, which leads to a numerical coincidence of the two metrics. The two are not equivalent in definition or calculation, but their agreement reflects the stability of the model’s predictions to a certain extent.

The proposed TS-MFF outperforms all compared models on several key metrics. Specifically, its F1 score reaches 78.84 ± 2.96%, a clear improvement over the best baseline model (ResNet, 73.80 ± 3.64%); its macro-averaged F1 is 53.78 ± 7.22%, also well above that of the other methods (at most 40.79 ± 9.11%). In addition, the proposed method leads in both accuracy (82.80 ± 2.82%) and precision (82.80 ± 2.82%), indicating that it improves recognition of the minority classes without sacrificing overall classification reliability. These results consistently indicate that the proposed three-stage fusion framework offers superior overall performance and better adaptation to imbalanced data in the fetal acidosis classification task.

3.7. Demonstrating the Confusion Matrix

To comprehensively assess the classification stability and error distribution of the three-stage framework on the whole dataset, Figure 5 shows the cumulative confusion matrices over five-fold cross-validation for stage A, stage B, and stage C; that is, all predictions on the five validation folds are pooled by true category. This cumulative view reduces single-fold randomness and reflects the systematic behavior of the model at each stage more objectively.

The cumulative confusion matrix for stage A shows that normal samples are identified with high accuracy, but many moderate and severe acidosis samples are missed, and most of these misclassifications fall into the normal category. This distribution confirms that, even with dynamic weighting, stage A still tends to classify acidosis samples with ambiguous boundaries as normal, which indicates that the base model does not yet adequately characterize the minority classes.

The cumulative confusion matrix for stage B shows significant changes: the number of correctly classified severe acidosis cases increases from 5 in stage A to 21, and the number of correctly classified moderate acidosis cases increases to 22. However, the number of misclassified normal samples is considerably higher, and some normal samples that were originally classified correctly are now misclassified as acidosis. This shift, which sacrifices part of the majority-class precision for minority-class recall, is a strategic trade-off that results from focusing on difficult samples in stage B and is highly consistent with the findings of the ablation experiments.

The cumulative confusion matrix for stage C shows that the recognition accuracy for normal samples reaches 97.0% (426/439) and that 15 cases of severe acidosis are correctly classified, slightly fewer than in stage B but still well above the baseline. Moderate acidosis remains the recognition bottleneck, with only 8 cases correctly classified and 58 cases misclassified as normal. Compared with stages A and B, stage C reduces the number of misclassified normal samples from 36 to 13 while retaining much of stage B’s advantage in recognizing severe acidosis, and the total number of diagonal (correctly classified) samples across the three categories increases from 420 in stage A to 449, a continued improvement in overall classification consistency. This progression confirms the central role of stage C meta-feature fusion in balancing minority-class sensitivity against majority-class stability.
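The stage C figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative.

```python
# Counts quoted for the stage C cumulative confusion matrix.
normal_total = 439
normal_correct = 426
moderate_correct = 8
severe_correct = 15

# Recognition accuracy (recall) of the normal class.
normal_recall = normal_correct / normal_total
# Normal samples assigned to either acidosis class.
normal_misclassified = normal_total - normal_correct
# Sum of the diagonal of the confusion matrix.
diagonal_total = normal_correct + moderate_correct + severe_correct

print(f"normal recall: {normal_recall:.1%}")          # 97.0%
print(f"misclassified normal: {normal_misclassified}")  # 13
print(f"diagonal total: {diagonal_total}")              # 449
```

The three derived values agree with the 97.0%, 13, and 449 reported for stage C.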

3.8. Weight Visualization

The dynamic weighted loss function proposed in this paper adjusts the loss weights according to the model’s current recall on each category, but the weights do not respond instantaneously to the recall of any single training round. As shown in Figure 6, the weight curves exhibit an obvious phase lag and amplitude smoothing relative to the recall curves. This asynchronous behavior comes from the two-stage smoothing constraints deliberately introduced in the weight updating strategy. In the early training period, linear interpolation moves the weights gradually from their initial values to the target values, so that the model is not drastically perturbed by the weights before its representation capability has been established. In the middle and late training periods, the update switches to an exponential moving average, so that the current weights inherit the accumulated information of historical values and form a sliding average of the target weights. This design makes the weights track the long-term statistical trend of category recognition difficulty rather than respond immediately to short-term fluctuations. The asynchronous mechanism suppresses training oscillations triggered by random batch fluctuations and prevents the model from sacrificing overall discriminative ability by focusing excessively on the minority classes. The fact that the weight curve in Figure 6 rises smoothly in the middle and late stages of training, instead of rising and falling with the low-frequency fluctuations of recall, is visual evidence of this smoothing and buffering effect.
The experimental results show that, guided by this weight allocation strategy, the stage A model gradually strengthens its ability to recognize the clinically critical minority categories without distorting its perception of the overall distribution, which provides a stable and effective feature foundation for the targeted optimization in stage B and the decision fusion in stage C.
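The two-phase smoothing described above can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the mapping from recall to target weight (`target_weights`), the warm-up length, and the EMA decay are all assumed values chosen only to show the linear-interpolation phase followed by the exponential-moving-average phase.

```python
def target_weights(recalls, base=1.0, gain=2.0):
    """Hypothetical target: classes with lower recall get larger weights."""
    return [base + gain * (1.0 - r) for r in recalls]


def update_weights(current, initial, recalls, epoch,
                   warmup_epochs=10, ema_decay=0.9):
    """Return the loss weights to use after observing per-class recall.

    warmup_epochs and ema_decay are assumed hyperparameters.
    """
    target = target_weights(recalls)
    if epoch < warmup_epochs:
        # Early phase: move linearly from the initial weights to the target.
        alpha = (epoch + 1) / warmup_epochs
        return [(1 - alpha) * w0 + alpha * t for w0, t in zip(initial, target)]
    # Later phase: exponential moving average of the target weights.
    return [ema_decay * w + (1 - ema_decay) * t for w, t in zip(current, target)]


# Toy run: the recall of class 2 oscillates between 0.2 and 0.6 every epoch.
initial = [1.0, 1.0, 1.0]
weights = list(initial)
history = []
for epoch in range(30):
    recalls = [0.95, 0.5, 0.2 if epoch % 2 == 0 else 0.6]
    weights = update_weights(weights, initial, recalls, epoch)
    history.append(weights[2])

# Largest epoch-to-epoch change of the class-2 weight in the EMA phase.
ema_steps = [abs(history[i] - history[i - 1]) for i in range(11, 30)]
print(f"max EMA step: {max(ema_steps):.3f}")
```

In this toy run the hypothetical target weight for class 2 jumps by 0.8 between consecutive epochs, whereas the smoothed weight moves by at most a tenth of that, which is the lag-and-damping behavior that the text attributes to Figure 6.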

4. Discussion

This paper proposes a classification framework based on three-stage training and meta-feature fusion for the automatic identification of fetal acidosis from fetal heart rate monitoring signals. Compared with existing studies, a core feature of this work is that it does not rely solely on structural model complexity to improve performance; instead, it systematically designs a staged modeling strategy, from basic learning through targeted enhancement to decision fusion, based on the characteristics of the data distribution and the learning behavior of the model. Stage A mitigates the severe class imbalance through a dynamic weighted loss based on recall feedback; stage B introduces multimodal enhancement training for difficult samples, combining depthwise separable convolution, SE attention, and clinical features to improve fine-grained discrimination; and stage C adopts a meta-feature fusion mechanism that adaptively fuses the classification preferences and uncertainty of the first two stages to achieve more robust decisions. Overall, the framework improves classification performance and provides a learnable solution to the imbalance and difficult-sample problems in medical signal classification.

The method in this paper systematically refines and extends previous work. In a previous study [42], the authors introduced an adaptive weighting adjustment strategy based on the error rate, which strengthened the model’s focus on misclassified and minority-class samples to alleviate class imbalance. However, that dynamic weighting mechanism lacked explicit smoothing constraints during weight updating, and its loss weights were sensitive to short-term performance fluctuations. This could lead to excessive weight changes during training and therefore posed a potential risk of instability on complex real-world data. To address these shortcomings, this paper retains the advantages of dynamic weighting while making the weight updating process more robust, so that the weights evolve gradually with the learning state of the model and maintain adaptive attention to difficult samples while suppressing training oscillations. In addition, whereas the previous work focused mainly on single-stage or sample-level error correction, this paper introduces the stage C meta-feature fusion strategy, which adaptively integrates the outputs of stage A and stage B at the decision level. This extends the approach from sample-level correction to decision-level fusion and improves the stability and generalization ability of the overall model.

To facilitate direct comparison with existing studies, the three-class task (normal, moderate acidosis, and severe acidosis) was reduced to the binary task commonly used in the literature by merging moderate and severe acidosis into a single acidosis category. In this setting, the proposed method achieved 83.70 ± 2.77% accuracy, an 80.46 ± 3.33% F1 score, 83.27 ± 4.85% precision, and 83.70 ± 2.77% recall. To evaluate its effectiveness, the method was compared with existing methods that are also based on the CTU-CHB dataset. As shown in Table 11, Cömert et al. used EMD + DWT + SVM and obtained 67.0% accuracy; Singh et al. used a HoloViz + CNN-based model and obtained 69.6% accuracy and a 66% F1 score; Liu et al. used CNN-BiLSTM-attention with DWT features and obtained 71.71% accuracy; and Kadarina et al. used SE-ResNet50 + DWT and obtained an F1 score of 72.67%. Compared with these methods, the proposed framework shows a substantial improvement in both accuracy and F1 score. This result suggests that, under the complex noise, class imbalance, and fuzzy decision boundaries prevalent in CTG signals, a single modeling strategy often struggles to mine the discriminative information adequately, whereas staged targeted enhancement and meta-feature fusion can exploit the complementarity of different models on global and difficult samples more effectively, thereby improving overall discriminative performance and stability.
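The reduction from three classes to two can be illustrated as below. The integer label encoding (0 = normal, 1 = moderate, 2 = severe) and the function names are assumptions for this sketch; the text specifies only that moderate and severe acidosis are merged into one acidosis category.

```python
# Assumed label encoding for the three-class task.
NORMAL, MODERATE, SEVERE = 0, 1, 2


def to_binary_label(label):
    """Map a three-class label to 0 (normal) or 1 (acidosis)."""
    return 0 if label == NORMAL else 1


def to_binary_probs(probs):
    """Merge moderate and severe probabilities into one acidosis probability."""
    return [probs[NORMAL], probs[MODERATE] + probs[SEVERE]]


three_class_labels = [0, 1, 2, 0, 2]
binary_labels = [to_binary_label(y) for y in three_class_labels]
print(binary_labels)  # [0, 1, 1, 0, 1]

print(to_binary_probs([0.5, 0.25, 0.25]))  # [0.5, 0.5]
```

Summing the two acidosis probabilities, rather than re-taking the argmax over three classes, means a record can be labeled acidosis even when “normal” is the single most probable class; which of the two rules was used for the reported binary results is not stated in the text.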

5. Conclusions

This study addresses the automatic classification of fetal acidosis from intrapartum fetal heart rate signals, with a particular focus on two intertwined challenges in clinical data: severe class imbalance and the presence of low-confidence or ambiguous samples. A three-stage training and meta-feature fusion framework (TS-MFF) is proposed, which progressively refines the decision boundary from coarse discrimination to fine-grained identification and, finally, to adaptive decision fusion.

In stage A, a CNN-BiLSTM-attention backbone is combined with a recall feedback dynamic weighted loss. Unlike static weighting or post hoc resampling, this mechanism continuously monitors per-class recall and smoothly adjusts the loss weights via linear interpolation and exponential moving average. It alleviates the dominance of normal samples without requiring synthetic data or manual threshold tuning, and simultaneously identifies difficult samples for subsequent targeted optimization. Stage B constructs a dedicated multimodal enhancer for these samples. By integrating depthwise separable convolutions, a squeeze-and-excitation attention module, and clinical tabular features, the model learns finer temporal patterns and cross-modal representations that are essential for distinguishing moderate and severe acidosis. Stage C no longer operates on raw signals but builds a meta-learner on top of the probabilistic outputs, confidence scores, and uncertainty estimates from the first two stages. A 29-dimensional meta-feature vector is constructed and used to adaptively fuse the complementary strengths of the base model and the hard-sample model, yielding final decisions that are more robust than those produced by simple ensemble strategies.

Although this paper validates the effectiveness of the proposed three-stage framework on the CTU-CHB dataset, several limitations remain to be addressed in subsequent work. First, the training and validation of the model rely entirely on the single public CTU-CHB dataset, which is limited in size and diversity; in addition, contraction signals are frequently missing in this dataset, which restricts the full potential of multisignal synergistic analysis. Second, the proposed three-stage process is more complex in training and inference than a single-stage approach and incurs higher computational overhead, so further research on inference acceleration and model lightweighting is needed for scenarios that require real-time bedside deployment.

The framework proposed in this paper offers useful ideas for addressing imbalance and difficult-sample classification in medical signal analysis. Future work will focus on three aspects: first, validating the model on more complete multimodal signals (e.g., contractions and maternal signs) and on larger, multicenter clinical datasets to assess its generalization ability and clinical robustness; second, exploring the association between the classification results and richer perinatal outcomes (e.g., neurodevelopmental prognosis) to build a risk assessment system with greater value for clinical decision support; and third, lightweighting and engineering the model and studying the feasibility of embedding it in portable monitoring devices or deploying it as a clinical auxiliary software tool, with the ultimate goal of safely and effectively translating AI-assisted diagnostic technology to real-world labor monitoring scenarios.

Author Contributions

Conceptualization, H.W. and L.W.; methodology, Y.Y.; software, Y.Y.; validation, H.W., L.W., X.Z., N.C., J.Z. and X.L.; formal analysis, Y.Y.; investigation, X.L., X.Z. and N.C.; data curation, L.W., J.Z. and X.L.; writing—original draft preparation, Y.Y.; writing—review and editing, Y.Y., L.W., J.Z. and H.W.; visualization, X.Z.; supervision, H.W.; project administration, H.W. and N.C.; funding acquisition, H.W. and L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the “Scientific Research Project of Jilin Provincial Department of Education” (Grant number JJKH20261281KJ), and “Jilin Province Science and Technology Development Plan Project” (Grant number YDZJ202201ZYTS549).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This paper uses publicly available datasets. The CTU-CHB dataset and the MIT-BIH arrhythmia dataset are available from the PhysioNet databases.

Acknowledgments

We would like to express our sincerest gratitude to all those who contributed to the completion of this study and the writing of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CTG	Cardiotocography
FHR	Fetal Heart Rate
UC	Uterine Contraction
FIGO	International Federation of Gynecology and Obstetrics
pH	Umbilical artery blood pH
HIE	Hypoxic–Ischemic Encephalopathy
CNN	Convolutional Neural Network
BiLSTM	Bidirectional Long Short-Term Memory
SE-NET	Squeeze-and-Excitation Network
DWT	Discrete Wavelet Transform
CWT	Continuous Wavelet Transform
SMOTE	Synthetic Minority Over-Sampling Technique
CB Loss	Class-Balanced Loss
EMA	Exponential Moving Average
OOF	Out-Of-Fold
TS-MFF	Three-Stage Meta-Feature Fusion

References

1. Bobrow, C.S.; Soothill, P.W. Causes and consequences of fetal acidosis. Arch. Dis. Child.-Fetal Neonatal Ed. 1999, 80, F246–F249.
2. Goodwin, T.M.; Belai, I.; Hernandez, P.; Durand, M.; Paul, R.H. Asphyxial complications in the term newborn with severe umbilical acidemia. Am. J. Obstet. Gynecol. 1992, 167, 1506–1512.
3. Williams, K.P.; Singh, A. The correlation of seizures in newborn infants with significant acidosis at birth with umbilical artery cord gas values. Obstet. Gynecol. 2002, 100, 557–560.
4. Fahey, J.; King, T.L. Intrauterine asphyxia: Clinical implications for providers of intrapartum care. J. Midwifery Women’s Health 2005, 50, 498–506.
5. van den Berg, P.P.; Nelen, W.L.; Jongsma, H.W.; Nijland, R.; Kollée, L.A.; Nijhuis, J.G.; Eskes, T.K. Neonatal complications in newborns with an umbilical artery pH < 7.00. Am. J. Obstet. Gynecol. 1996, 175, 1152–1157.
6. Lawn, J.E.; Cousens, S.; Zupan, J. 4 million neonatal deaths: When? Where? Why? Lancet 2005, 365, 891–900.
7. Ayres-de-Campos, D.; Arulkumaran, S. FIGO consensus guidelines on intrapartum fetal monitoring: Physiology of fetal oxygenation and the main goals of intrapartum fetal monitoring. Int. J. Gynecol. Obstet. 2015, 131, 5–8.
8. Hussain, N.M.; O’Halloran, M.; McDermott, B.; Elahi, M.A. Fetal monitoring technologies for the detection of intrapartum hypoxia-challenges and opportunities. Biomed. Phys. Eng. Express 2024, 10, 022002.
9. Ayres-de-Campos, D.; Spong, C.Y.; Chandraharan, E. FIGO consensus guidelines on intrapartum fetal monitoring: Cardiotocography. Int. J. Gynecol. Obstet. 2015, 131, 13–24.
10. Fergus, P.; Huang, D.-S.; Hamdan, H. Prediction of intrapartum hypoxia from cardiotocography data using machine learning. In Applied Computing in Medicine and Health; Elsevier: Amsterdam, The Netherlands, 2016; pp. 125–146.
11. Visser, G.H.; Ayres-de-Campos, D. FIGO consensus guidelines on intrapartum fetal monitoring: Adjunctive technologies. Int. J. Gynecol. Obstet. 2015, 131, 25–29.
12. Sehdev, H.M.; Stamilio, D.M.; Macones, G.A.; Graham, E.; Morgan, M.A. Predictive factors for neonatal morbidity in neonates with an umbilical arterial cord pH less than 7.00. Am. J. Obstet. Gynecol. 1997, 177, 1030–1034.
13. Gunaratne, S.A.; Panditharatne, S.D.; Chandraharan, E. Prediction of neonatal acidosis based on the type of fetal hypoxia observed on the cardiotocograph (ctg). Eur. J. Med. Health Sci. 2022, 4, 8–18.
14. Liang, H.; Lu, Y.; Liu, Q.; Fu, X. Fully automatic classification of cardiotocographic signals with 1D-CNN and bi-directional GRU. In Proceedings of the 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Glasgow, UK, 11–15 July 2022; pp. 4590–4594.
15. Sbrollini, A.; Carnicelli, A.; Massacci, A.; Tomaiuolo, L.; Zara, T.; Marcantoni, I.; Burattini, L.; Morettini, M.; Fioretti, S.; Burattini, L. Automatic identification and classification of fetal heart-rate decelerations from cardiotocographic recordings. In Proceedings of the 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Honolulu, HI, USA, 17–21 July 2018; pp. 474–477.
16. Liang, H.; Lu, Y. A CNN-RNN unified framework for intrapartum cardiotocograph classification. Comput. Methods Programs Biomed. 2023, 229, 107300.
17. Adhikari, A. Fetal Acidosis Prediction Using Attention Enhanced Convolutional Neural Networks. Master’s Thesis, Missouri University of Science and Technology, Rolla, MO, USA, 2025.
18. Lu, Y.; Liang, H.; Yu, Z.; Fu, X. MT-1DCG: A novel model for multivariate time series classification. In Proceedings of the International Conference on Intelligent Computing, Bhubaneswar, India, 16–17 December 2023; pp. 222–234.
19. Zhang, Y.; Zhao, Z.; Deng, Y.; Jiao, P. On multi-modal fusion learning in pathological diagnosis of fetal distress. In Proceedings of the 2023 IEEE International Conference on E-health Networking, Application & Services (Healthcom), Chongqing, China, 15–17 December 2023; pp. 119–124.
20. Yefei, Z.; Yanjun, D.; Xiaohong, Z.; Lihuan, S.; Zhidong, Z. Bidirectional long short-term memory-based intelligent auxiliary diagnosis of fetal health. In Proceedings of the 2021 IEEE Region 10 Symposium (TENSYMP), Jeju, Republic of Korea, 23–25 August 2021; pp. 1–5.
21. SM, S.M.A.M. A Deep Learning Based Approach for Detecting Fetal Stress Using CTG Signal. SSRN 2025. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5105665 (accessed on 16 February 2026).
22. Liu, M.; Lu, Y.; Long, S.; Bai, J.; Lian, W. An attention-based CNN-BiLSTM hybrid neural network enhanced with features of discrete wavelet transformation for fetal acidosis classification. Expert Syst. Appl. 2021, 186, 115714.
23. Zhao, Z.; Deng, Y.; Zhang, Y.; Zhang, Y.; Zhang, X.; Shao, L. DeepFHR: Intelligent prediction of fetal Acidemia using fetal heart rate signals based on convolutional neural network. BMC Med. Inform. Decis. Mak. 2019, 19, 286.
24. Rao, L.; Lu, J.; Wu, H.-R.; Zhao, S.; Lu, B.-C.; Li, H. Automatic classification of fetal heart rate based on a multi-scale LSTM network. Front. Physiol. 2024, 15, 1398735.
25. Baghel, N.; Burget, R.; Dutta, M.K. 1D-FHRNet: Automatic diagnosis of fetal acidosis from fetal heart rate signals. Biomed. Signal Process. Control 2022, 71, 102794.
26. Liu, M.; Xiao, Y.; Zeng, R.; Wu, Z.; Liu, Y.; Li, H. A multimodal dual-branch fusion network for fetal hypoxia detection. Expert Syst. Appl. 2025, 259, 125263.
27. Zhang, Y.; Deng, Y.; Zhang, X.; Jiao, P.; Zhang, X.; Zhao, Z. DT-CTNet: A clinically interpretable diagnosis model for fetal distress. Biomed. Signal Process. Control 2023, 86, 105190.
28. Zhang, Y.; Deng, Y.; Zhou, Z.; Zhang, X.; Jiao, P.; Zhao, Z. Multimodal learning for fetal distress diagnosis using a multimodal medical information fusion framework. Front. Physiol. 2022, 13, 1021400.
29. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
30. Bach, M.; Werner, A.; Palt, M. The proposal of undersampling method for learning from imbalanced datasets. Procedia Comput. Sci. 2019, 159, 125–134.
31. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
32. Cui, Y.; Jia, M.; Lin, T.-Y.; Song, Y.; Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9268–9277.
33. Sagi, O.; Rokach, L. Ensemble learning: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1249.
34. Pavlyshenko, B. Using stacking approaches for machine learning models. In Proceedings of the 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), Lviv, Ukraine, 21–25 August 2018; pp. 255–258.
35. Chudáček, V.; Spilka, J.; Burša, M.; Janků, P.; Hruban, L.; Huptych, M.; Lhotská, L. Open access intrapartum CTG database. BMC Pregnancy Childbirth 2014, 14, 16.
36. Zhao, Z.; Zhu, J.; Jiao, P.; Wang, J.; Zhang, X.; Lu, X.; Zhang, Y. Hybrid-FHR: A multi-modal AI approach for automated fetal acidosis diagnosis. BMC Med. Inform. Decis. Mak. 2024, 24, 19.
37. Vayssiere, C.; Haberstich, R.; Sebahoun, V.; David, E.; Roth, E.; Langer, B. Fetal electrocardiogram ST-segment analysis and prediction of neonatal acidosis. Int. J. Gynecol. Obstet. 2007, 97, 110–114.
38. Victory, R.; Penava, D.; Da Silva, O.; Natale, R.; Richardson, B. Umbilical cord pH and base excess values in relation to adverse outcome events for infants delivering at term. Am. J. Obstet. Gynecol. 2004, 191, 2021–2028.
39. Olofsson, P. Umbilical cord pH, blood gases, and lactate at birth: Normal values, interpretation, and clinical utility. Am. J. Obstet. Gynecol. 2023, 228, S1222–S1240.
40. Chudáček, V.; Huptych, M.; Koucký, M.; Spilka, J.; Bauer, L.; Lhotska, L. Fetal heart rate data pre-processing and annotation. In Proceedings of the 2009 9th International Conference on Information Technology and Applications in Biomedicine, Larnaka, Cyprus, 4–7 November 2009; pp. 1–4.
41. Mark, R.; Schluter, P.; Moody, G.; Devlin, P.; Chernoff, D. An annotated ECG database for evaluating arrhythmia detectors. In Proceedings of the IEEE Transactions on Biomedical Engineering; IEEE: New York, NY, USA, 1982; p. 600.
42. Wang, H.; Yin, Y.; Wang, L.; Wang, Y.; Liu, X.; Shi, L. Fetal Health Diagnosis Based on Adaptive Dynamic Weighting with Main-Auxiliary Correction Network. BioTech 2025, 14, 57.
43. Cömert, Z.; Yang, Z.; Velappan, S.; Boopathi, A.M.; Kocamaz, A.F. Performance evaluation of empirical mode decomposition and discrete wavelet transform for computerized hypoxia detection and prediction. In Proceedings of the 2018 26th Signal Processing and Communications Applications Conference (SIU), Izmir, Turkey, 2–5 May 2018; pp. 1–4.
44. Singh, H.D.; Saini, M.; Kaur, J. Fetal distress classification with deep convolutional neural network. Curr. Women’s Health Rev. 2021, 17, 60–73.
45. Kadarina, T.M.; Basari, B.; Gunawan, D.; Auzan, A. Scalogram-Based Multiclass Fetal State Classification Using Expert-Annotated CTG and SE-ResNet-50. Informatica 2025, 49.
46. Xu, L.; Wang, G.; Cao, Z.; Chen, Q.; Liu, G.; Wei, H. Research on multimodal deep learning based on CNN and ViT for intrapartum fetal monitoring. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkiye, 5–8 December 2023; pp. 4459–4464.
47. O’Sullivan, M.; Gabruseva, T.; Boylan, G.B.; O’Riordan, M.; Lightbody, G.; Marnane, W. Classification of fetal compromise during labour: Signal processing and feature engineering of the cardiotocograph. In Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 1331–1335.

Figure 1. Examples of preprocessed signals. (a) Normal samples after preprocessing; (b) moderate acidosis samples after preprocessing; (c) severe acidosis samples after preprocessing.

Figure 2. Diagram of LSTM mechanism.

Figure 3. Difficult sample identification process.

Figure 4. Flowchart of the three-stage learning framework model.

Figure 5. Cumulative sum of confusion matrices for five-fold cross-validation at each stage: (a) stage A confusion matrix; (b) stage B confusion matrix; (c) stage C confusion matrix.

Figure 6. Plot of weights vs. recall for each category.

Table 1. Clinical text information for the CTU-CHB dataset.

Information	Max	Min	Mean
Maternal age	46	18	29.7
Gestational age	43	37	40
Fetal weight	4750	1970	3407
Apgar 1 min	10	1	8.26
Apgar 5 min	10	4	9.07
Base deficit in extracellular fluid (BDecf, mmol/L)	26.11	−3.40	4.60
Base excess (BE)	−0.20	−26.80	−6.38
pH	7.47	6.85	7.23
Gravidity	11	1	1.4
pCO₂	12.30	0.70	7.07
Parity	7	0	0.4

Table 2. Quantitative summary of preprocessing.

Statistical Indicators	Normal	Moderate	Severe
Spike count (|ΔFHR| > 25 bpm) per record	109.26 ± 94.77	162.01 ± 111.92	117.18 ± 95.01
Repair count (points)	10,222.47 ± 4205.55	10,882.42 ± 4572.74	9965.05 ± 3454.40
Repair duration (s)	2555.62 ± 1051.39	2720.61 ± 1143.19	2491.26 ± 863.60
Mean raw FHR (bpm)	115.44 ± 17.93	117.19 ± 18.51	107.48 ± 29.61
Mean clean FHR (bpm)	134.29 ± 11.46	133.71 ± 11.31	135.35 ± 14.54

Table 3. Stage C input characteristics and dimension descriptions.

Features	Characterization	Dimension	Description
Basic probability	Category probability distributions for stages A and B	6	Raw classification probabilities for the three categories
Confidence and uncertainty	Classification confidence of stages A and B	2	Reflects how confident each model is in its classification
	Difference in confidence between the two models	1	Measures the difference in classification certainty
	Entropy of the stage A and stage B classifications	2	Measures the uncertainty of the models’ classification results
Classification consistency	Consistency of the classified categories between the two models	1	-
Probability difference	Maximum and mean of the absolute differences between the probability vectors of the two models	2	Reflects the maximum local difference and the overall average difference between the two probability distributions
Probability difference	Per-category difference in probabilities between the two models	3	-
Sorted indexes	Category indexes sorted by probability for stages A and B	6	-
One-hot encoding	One-hot encoding of the classified category for stages A and B	6	-
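The dimensions in Table 3 sum to 29 (6 + 2 + 1 + 2 + 1 + 2 + 3 + 6 + 6), which matches the input size of the first stage C layer in Table 4. The sketch below assembles such a vector from two three-class probability vectors. The exact definitions are assumptions where Table 3 is not explicit: confidence is taken as the maximum probability, entropy uses the natural logarithm, differences are computed as stage A minus stage B, and sorted indexes are in descending order of probability. The function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def meta_features(p_a, p_b):
    """Build a 29-dimensional meta-feature vector from stage A/B outputs."""
    n = len(p_a)  # three classes
    pred_a = max(range(n), key=lambda k: p_a[k])
    pred_b = max(range(n), key=lambda k: p_b[k])
    diffs = [a - b for a, b in zip(p_a, p_b)]
    abs_diffs = [abs(d) for d in diffs]

    feats = []
    feats += list(p_a) + list(p_b)                        # basic probabilities (6)
    feats += [max(p_a), max(p_b)]                         # confidence (2)
    feats.append(max(p_a) - max(p_b))                     # confidence difference (1)
    feats += [entropy(p_a), entropy(p_b)]                 # entropy (2)
    feats.append(1.0 if pred_a == pred_b else 0.0)        # consistency (1)
    feats += [max(abs_diffs), sum(abs_diffs) / n]         # max / mean abs difference (2)
    feats += diffs                                        # per-category difference (3)
    for p in (p_a, p_b):                                  # sorted indexes (6)
        order = sorted(range(n), key=lambda k: p[k], reverse=True)
        feats += [float(k) for k in order]
    for pred in (pred_a, pred_b):                         # one-hot predictions (6)
        feats += [1.0 if k == pred else 0.0 for k in range(n)]
    return feats


# Hypothetical outputs of stage A and stage B for one record.
vector = meta_features([0.7, 0.2, 0.1], [0.3, 0.5, 0.2])
print(len(vector))  # 29
```

In this example the two stages disagree (stage A favors class 0, stage B favors class 1), so the consistency feature is 0 and the probability-difference features are large, which is the kind of signal a meta-learner can use to decide which stage to trust.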

Table 4. Model parameter settings.

Stage A		Stage B		Stage C
Module	Parameters	Module	Parameters	Module	Parameters
Conv1d-1	(1, 16)	Depthwise separable convolutional block 1	(1, 32)	FC layer 1	(29, 128)
Dropout	0.4	SE attention block	(32, 32)	Dropout	0.3
Conv1d-2	(16, 32)	Depthwise separable convolutional block 2	(32, 64)	FC layer 2	(128, 64)
Conv1d-3	(32, 64)	BiLSTM	(64, 256)	FC layer 3	(64, 32)
Conv1d-Transition	(64, 128)	FC layer	(256, 128)	Output layer	(32, 3)
BiLSTM	(128, 128)	Tabular-encoding FC layer 1	(22, 128)	Optimizer	Adam
Multi-head attention	8 heads (128, 128)	Tabular-encoding FC layer 2	(128, 64)	Batch Size	32
FC layer	(256, 128)	Signal and tabular concatenation layer	(128 + 64, 192)	Three-stage Patience	100
Output layer	(128, 3)	FC layer	(192, 128)	Three-stage Epochs	500
Optimizer	Adam	Dropout	0.3	-	-
Initial learning rate	5 × 10⁻⁴	Output layer	(128, 3)	-	-
Weight decay	1 × 10⁻⁶	Optimizer	Adam	-	-

Table 5. Comparison of missing value duration thresholds.

Methods	Control Methods					This Paper’s Methods
Threshold	5 s	10 s	20 s	25 s	30 s	15 s
Accuracy	80.42%	79.71%	79.71%	78.81%	80.81%	82.80%

Table 6. Comparison of different selection methods for difficult samples.

Threshold Name	Confidence Level				Entropy Uncertainty				This Paper’s Methods
Threshold	0.3	0.4	0.5	0.6	0.8	0.85	0.9	0.95	confidence mean
Accuracy	79.91%	81.35%	80.62%	81.32%	80.80%	79.71%	79.53%	80.80%	82.80%
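Table 6 contrasts fixed confidence thresholds and entropy thresholds with a data-dependent threshold, the mean confidence. A minimal sketch of the latter is given below. It assumes that confidence is the maximum class probability from stage A and that samples below the mean confidence are flagged as difficult; the precise rule is the one summarized in Figure 3, and the function name is hypothetical.

```python
def select_difficult(probabilities):
    """Flag samples whose confidence is below the mean confidence.

    probabilities: list of per-sample class-probability vectors.
    Returns (indices of difficult samples, threshold used).
    """
    confidences = [max(p) for p in probabilities]
    threshold = sum(confidences) / len(confidences)
    difficult = [i for i, c in enumerate(confidences) if c < threshold]
    return difficult, threshold


# Hypothetical stage A outputs for four records.
stage_a_probs = [
    [0.90, 0.05, 0.05],
    [0.50, 0.25, 0.25],
    [0.75, 0.125, 0.125],
    [0.25, 0.50, 0.25],
]
difficult, threshold = select_difficult(stage_a_probs)
print(difficult, threshold)
```

Unlike a fixed cutoff such as 0.5, the mean adapts to how confident the model is overall, so the share of samples passed to stage B stays meaningful even when the confidence distribution shifts between folds.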

Table 7. Results of the ablation experiments.

Architecture	F1 Score (normal, moderate, severe)	Precision (normal, moderate, severe)	Recall (normal, moderate, severe)	Macro-F1	Accuracy
A + unweighted	88.60 ± 0.24% 0.00 ± 0.00% 0.00 ± 0.00%	79.53 ± 0.39% 0.00 ± 0.00% 0.00 ± 0.00%	100.00 ± 0.00% 0.00 ± 0.00% 0.00 ± 0.00%	29.53 ± 0.08%	79.53 ± 0.39%
A + Dynamic weighting	88.88 ± 1.60% 28.41 ± 5.82% 23.62 ± 8.51%	84.62 ± 2.36% 34.99 ± 8.25% 54.0 ± 27.84%	93.62 ± 0.90% 24.73 ± 6.10% 16.11 ± 6.19%	46.97 ± 2.54%	78.81 ± 1.27%
A + B (No Depth-Separable Convolution) + C	90.02 ± 1.37% 19.35 ± 7.42% 36.64 ± 13.40%	84.53 ± 2.59% 43.11 ± 13.51% 52.56 ± 17.38%	96.36 ± 2.20% 13.08 ± 5.42% 31.67 ± 18.89%	48.67 ± 6.64%	64.28 ± 2.02%
A + B (No SE-Net) + C	86.77 ± 4.99% 24.61 ± 14.54% 38.28 ± 19.01%	84.78 ± 3.08% 44.31 ± 35.18% 54.57 ± 26.25%	89.99 ± 11.35% 23.52 ± 16.15% 35.00 ± 23.80%	49.89 ± 9.16%	68.71 ± 4.64%
A + B (Pure signal) + C	88.89 ± 1.57% 24.05 ± 6.84% 24.34 ± 9.65%	83.95 ± 2.90% 45.82 ± 27.42% 60.67 ± 33.49%	94.53 ± 1.12% 20.4 ± 8.81% 16.11 ± 6.19%	45.76 ± 5.31%	75.01 ± 3.28%
A + B + (stacking)	89.93 ± 1.30% 13.77 ± 8.29% 37.88 ± 13.97%	82.65 ± 1.48% 34.67 ± 21.25% 83.33 ± 21.08%	98.64 ± 1.33% 8.68 ± 5.32% 25.00 ± 10.83%	47.20 ± 4.90%	68.70 ± 2.70%
TS-MFF	90.53 ± 1.64% 26.11 ± 4.46% 44.70 ± 21.84%	84.16 ± 1.64% 65.56 ± 19.37% 69.33 ± 21.75%	97.95 ± 2.20% 17.36 ± 5.62% 33.89 ± 19.59%	53.78 ± 7.22%	82.80 ± 2.82%

Table 8. Generalization of different methods for mitigating class imbalance across datasets.

Datasets	Architecture	F1 Score	Precision	Recall	Macro-F1	Accuracy
MIT-BIH	Unweighted	98.50 ± 0.06% 0.882 ± 1.41% 0.00 ± 0.00% 95.28 ± 1.11%	97.57 ± 0.16% 93.76 ± 4.54% 0.00 ± 0.00% 94.39 ± 1.60%	99.45 ± 0.21% 83.66 ± 3.98% 0.00 ± 0.00% 96.22 ± 1.43%	71.11 ± 0.64%	97.05 ± 0.28%
	Inverse Frequency Weighting	96.83 ± 0.68% 83.42 ± 5.28% 38.24 ± 5.93% 94.74 ± 1.50%	98.86 ± 0.21% 83.52 ± 10.36% 25.52 ± 4.44% 93.91 ± 2.89%	94.88 ± 1.25% 84.15 ± 3.37% 76.98 ± 10.23% 95.66 ± 1.42%	78.31 ± 2.83%	94.41 ± 1.08%
	Smote	96.53 ± 0.63% 85.94 ± 3.97% 36.64 ± 2.82% 96.52 ± 0.88%	99.28 ± 0.28% 82.53 ± 9.01% 23.56 ± 2.60% 96.01 ± 1.50%	93.95 ± 1.30% 90.60 ± 3.65% 85.48 ± 12.10% 97.06 ± 1.02%	78.91 ± 1.04%	93.99 ± 0.98%
	Focal Loss	98.42 ± 0.20% 85.85 ± 4.54% 0.00 ± 0.00% 95.68 ± 1.62%	97.38 ± 0.17% 89.23 ± 8.77% 0.00 ± 0.00% 96.61 ± 1.87%	99.49 ± 0.39% 83.16 ± 2.92% 0.00 ± 0.00% 94.81 ± 2.32%	70.72 ± 1.02%	96.71 ± 0.74%
	CB Loss (0.99)	98.51 ± 0.12% 88.17 ± 3.52% 0.00 ± 0.00% 95.49 ± 0.76%	97.61 ± 0.18% 92.23 ± 6.05% 0.00 ± 0.00% 94.68 ± 1.90%	99.43 ± 0.27% 84.63 ± 2.95% 0.00 ± 0.00% 96.86 ± 1.55%	71.31 ± 0.83%	97.42 ± 0.37%
	This paper proposes dynamic weighting	98.62 ± 0.13% 89.20 ± 1.90% 51.16 ± 1.45% 95.39 ± 0.72%	98.42 ± 0.09% 95.05 ± 2.72% 55.05 ± 8.00% 95.20 ± 2.27%	98.83 ± 0.31% 84.15 ± 3.37% 48.92 ± 4.74% 95.65 ± 1.35%	83.59 ± 0.63%	97.42 ± 0.17%
CTU-CHB	Unweighted	88.60 ± 0.24% 0.00 ± 0.00% 0.00 ± 0.00%	79.53 ± 0.39% 0.00 ± 0.00% 0.00 ± 0.00%	100.0 ± 0.00% 0.00 ± 0.00% 0.00 ± 0.00%	29.53 ± 0.08%	79.53 ± 0.39%
	Inverse Frequency Weighting	85.04 ± 2.63% 14.28 ± 11.80% 16.72 ± 13.28%	84.43 ± 2.28% 11.74 ± 9.63% 31.33 ± 35.77%	85.88 ± 5.49% 19.12 ± 16.65% 17.78 ± 19.37%	38.68 ± 3.55%	72.10 ± 3.00%
	Smote	83.55 ± 4.54% 9.52 ± 12.34% 24.21 ± 13.43%	84.79 ± 4.57% 8.44 ± 10.37% 28.06 ± 15.44%	83.85 ± 11.50% 15.16 ± 23.84% 28.06 ± 21.52%	39.09 ± 2.77%	70.64 ± 6.64%
	Focal-loss	80.72 ± 6.85% 21.71 ± 9.08% 24.62 ± 14.71%	85.26 ± 2.69% 22.23%± 8.76% 30.07 ± 18.71%	77.24 ± 10.90% 25.05 ± 15.42% 27.50 ± 17.04%	42.35 ± 4.12%	66.67 ± 8.45%
	Cbloss (0.99)	89.43 ± 0.99% 5.02 ± 6.17% 31.56 ± 11.45%	82.78 ± 1.36% 26.67 ± 38.87% 40.33 ± 18.51%	97.27 ± 1.16% 2.86 ± 3.50% 27.22 ± 11.17%	42.00 ± 2.90%	79.90 ± 1.27%
	This paper proposes dynamic weighting	88.88 ± 1.60% 28.41 ± 5.82% 23.62 ± 8.51%	84.62 ± 2.36% 34.99 ± 8.25% 54.0 ± 27.84%	93.62 ± 0.90% 24.73 ± 6.10% 16.11 ± 6.19%	46.97 ± 2.54%	78.81 ± 1.27%
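
Two of the static baselines in Table 8 have simple closed-form class weights, sketched below for orientation. The inverse-frequency form N / (K · n_c) is one common normalization and the class-balanced form (1 − β) / (1 − β^n_c) follows the usual definition of class-balanced loss with β = 0.99 as listed in the table; the exact normalization used in the experiments is not stated here, and the class counts are hypothetical.

```python
def inverse_frequency_weights(counts):
    """w_c = N / (K * n_c): rarer classes receive proportionally larger weights."""
    total = sum(counts)
    k = len(counts)
    return [total / (k * n) for n in counts]

def class_balanced_weights(counts, beta=0.99):
    """w_c proportional to (1 - beta) / (1 - beta**n_c), rescaled to sum to K."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]

# Hypothetical class counts for a three-class problem (majority class first).
counts = [400, 70, 30]

inv = inverse_frequency_weights(counts)
cb = class_balanced_weights(counts, beta=0.99)
print([round(w, 3) for w in inv])
print([round(w, 3) for w in cb])
```

With β = 0.99 the effective number of samples saturates near 1 / (1 − β) = 100, so the class-balanced weights are much flatter than the inverse-frequency weights once a class has a few hundred samples. Both schemes are fixed before training, in contrast to a weighting that is updated from recall feedback during training.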

Table 9. Wilcoxon signed-rank test p-values.

	Unweighted	Inverse Frequency Weighting	SMOTE	Focal Loss	CB Loss (0.99)
CTU-CHB p-value	0.03125	0.03125	0.03125	0.03125	0.03125
MIT-BIH p-value	0.03125	0.03125	0.03125	0.03125	0.03125
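
Every entry in Table 9 equals 0.03125 = 1/32. For an exact Wilcoxon signed-rank test this is the smallest one-sided p-value attainable with five untied paired differences that all share the same sign, and also the smallest two-sided p-value attainable with six such pairs. The number of paired runs and the sidedness of the test are not stated in the table, so both are assumptions in the sketch below, and the differences are hypothetical.

```python
from itertools import product

def exact_wilcoxon_p(diffs, two_sided=False):
    """Exact Wilcoxon signed-rank p-value for untied, non-zero differences."""
    n = len(diffs)
    # Rank the differences by absolute value (1 = smallest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Null distribution: each rank is positive or negative with probability 1/2.
    null = [
        sum(r for r, s in zip(range(1, n + 1), signs) if s)
        for signs in product((0, 1), repeat=n)
    ]
    upper = sum(w >= w_plus for w in null) / len(null)
    if not two_sided:
        return upper
    lower = sum(w <= w_plus for w in null) / len(null)
    return min(1.0, 2 * min(upper, lower))

# Hypothetical per-run improvements of one method over another (all positive).
five_runs = [1.2, 0.8, 2.5, 0.3, 1.9]
six_runs = five_runs + [0.6]

print(exact_wilcoxon_p(five_runs))                  # one-sided, n = 5
print(exact_wilcoxon_p(six_runs, two_sided=True))   # two-sided, n = 6
```

Identical p-values across all comparisons therefore indicate that one method won every paired run, not that the effect sizes are equal; with so few pairs the test cannot return anything smaller.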

Table 10. Comparison with existing time-series signal models.

Architecture	F1 Score	Accuracy	Precision	Recall	Macro-F1
TCN	70.81 ± 0.73%	78.99 ± 1.68%	65.41 ± 3.02%	78.99 ± 1.68%	31.58 ± 2.67%
M-scale CNN	71.85 ± 0.99%	79.53 ± 0.69%	67.98 ± 2.81%	79.53 ± 0.69%	33.22 ± 2.83%
FCN	72.91 ± 2.33%	79.35 ± 1.67%	68.33 ± 4.08%	79.35 ± 1.67%	36.27 ± 7.08%
ConvNeXt-1D	72.64 ± 0.87%	77.36 ± 2.17%	71.00 ± 2.26%	77.36 ± 2.17%	39.86 ± 3.60%
ResNet	73.80 ± 3.64%	76.45 ± 3.61%	71.92 ± 5.22%	76.45 ± 3.61%	40.79 ± 9.11%
TS-MFF	78.84 ± 2.96%	82.80 ± 2.82%	82.80 ± 2.82%	80.67 ± 4.54%	53.78 ± 7.22%
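
For the five baseline rows of Table 10 the Recall column is identical to the Accuracy column, which is what support-weighted averaging of per-class recall always produces. The sketch below illustrates this identity and contrasts it with Macro-F1. It assumes the F1 Score, Precision, and Recall columns are support-weighted averages, which the table does not state; the labels and predictions are hypothetical.

```python
def evaluate(y_true, y_pred, labels):
    """Return (accuracy, support-weighted recall, macro-F1)."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    weighted_recall = 0.0
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        support = sum(t == c for t in y_true)
        predicted = sum(p == c for p in y_pred)
        recall = tp / support if support else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        # Weight each class recall by its share of the samples.
        weighted_recall += (support / n) * recall
    return accuracy, weighted_recall, sum(f1_scores) / len(f1_scores)

# Hypothetical imbalanced labels; the model predicts the majority class only.
y_true = [0] * 8 + [1] + [2]
y_pred = [0] * 10

acc, w_recall, macro = evaluate(y_true, y_pred, labels=[0, 1, 2])
print(f"accuracy={acc:.4f} weighted_recall={w_recall:.4f} macro_f1={macro:.4f}")
```

Under that assumption, the gap between the weighted F1 Score and Macro-F1 in Table 10 measures how much of each model's score comes from the majority class alone.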

Table 11. Comparison of the downscaled results with existing models.

Reference	Method	Accuracy	F1 Score	Precision	Recall
Comert et al. [43]	EMD + DWT + SVM	67.0%	-	-	57.42%
Singh et al. [44]	HoloViz + CNN	69.6%	66%	63%	70%
Kadarina et al. [45]	SE-ResNet50 + DWT	-	72.67%	-	-
Ling Xu et al. [46]	CNN + ViT	-	78.0%	-	-
Liu et al. [22]	CNN-BiLSTM + Attention, DWT	71.71 ± 8.61%	-	-	75.23 ± 9.58%
O’Sullivan et al. [47]	ARMA + SVM	83.3%	-	-	82.6%
Proposed method	TS-MFF	83.7 ± 2.77%	80.46 ± 3.33%	83.27 ± 4.85%	83.70 ± 2.77%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Wang, H.; Yin, Y.; Zhang, X.; Liu, X.; Zhao, J.; Che, N.; Wang, L. Automatic Identification of Fetal Acidosis Based on Three-Stage Training and Meta-Feature Fusion. Appl. Sci. 2026, 16, 2045. https://doi.org/10.3390/app16042045

