1. Introduction
The fetal electrocardiogram (FECG) is a cornerstone signal for non-invasive prenatal evaluation of fetal cardiac health. It captures key physiological indicators, most notably fetal heart rate (FHR) and the morphology of the QRS complex, which provide a reliable basis for the early diagnosis of critical fetal distress conditions [1]. Scalp electrocardiography yields clear fetal electrocardiograms, but the signal can be acquired only during delivery and carries a relatively high risk of fetal infection. Non-invasive FECG extraction is typically implemented by acquiring mixed abdominal electrocardiogram (AECG) signals via patch-type electrodes affixed to the maternal abdomen, which constitutes the core of a non-invasive abdominal patch-based real-time FECG monitoring and analysis system. This approach is safe for the fetus, compatible with real-time processing, and feasible for long-term continuous monitoring, which has made it a key research focus in clinical prenatal fetal health assessment [
2]. The complete workflow of signal acquisition, FECG extraction, and real-time analysis in this system is illustrated in
Figure 1. The system is structured as a four-stage pipeline:
Figure 1a shows a wearable abdominal biosensor patch with textile-integrated hydrogel electrodes and a low-power wireless chip, which collects raw multi-channel AECG signals non-invasively and transmits data via Bluetooth;
Figure 1b shows a patient-facing mobile terminal that supports real-time FHR display and localized arrhythmia alerts;
Figure 1c shows an AI-powered IoMT cloud platform that performs secure AECG signal reconstruction and MECG–FECG separation; and
Figure 1d shows a bidirectional remote clinician portal that enables multi-modal fetal rhythm review and real-time prenatal health assessment. However, the amplitude of the FECG signal is only 1/10 to 1/5 of the maternal ECG (MECG), and the two overlap significantly in both time and frequency domains. Simultaneously, complex interferences such as baseline drift, electromyographic noise, and power line interference are superimposed, collectively rendering accurate FECG extraction a persistent challenge in biomedical signal processing [
3].
Conventional FECG extraction algorithms rely on traditional signal processing techniques whose core idea is to suppress MECG interference through linear separation or template matching. Blind source separation (BSS) and adaptive noise cancellation (ANC) are typical representatives. ANC was first systematically proposed by Widrow et al. [
4], achieving interference cancellation by constructing an MECG reference signal. However, limited by the nonlinear propagation characteristics of MECG from the maternal thorax to the abdomen, waveform differences between the reference signal and the actual interference led to suboptimal separation results. Independent component analysis (ICA), as a fundamental technique in BSS, was utilized by Zarzoso et al. [
5] to blindly separate the MECG and FECG based on statistical independence. Nevertheless, its effectiveness was compromised by sensitivity to non-stationary noise [
6]. More fundamentally, the linear mixing assumption underlying BSS and ICA does not hold for abdominal ECG recordings, where the maternal ECG propagates through a nonlinear, time-varying volume conductor from the thorax to the abdomen, causing waveform distortion and phase shifts that cannot be captured by linear demixing models [
7]. To optimize the robustness of traditional methods, Zhang et al. [
8] proposed an improved ANC scheme combining SVD with a smoothing window (SWSVD), enhancing interference suppression capability by adaptively constructing reference signals. However, performance still degraded significantly in low signal-to-noise ratio scenarios. Template matching (TM) and singular value decomposition (SVD) are also important branches of traditional methods. Cerutti et al. [
9] proposed a template subtraction method based on coherent averaging, which canceled MECG interference by leveraging an averaged MECG template. Nonetheless, its performance heavily relied on accurate detection of MECG QRS complexes. When maternal and fetal QRS complexes overlap in time, which occurs frequently because their heart rates differ by only 20 to 40 bpm on average, template subtraction either fails to remove the maternal component completely or inadvertently attenuates the fetal waveform, leading to missed or distorted fetal beats [
10]. Liu et al. [
11] proposed a method that combined RR interval smoothing with an SVD template to enhance FECG detection accuracy in overlapping regions through correction of falsely detected R-waves. Nevertheless, the generalization capability of manually designed template parameters remained limited. Kanjilal et al. [
12] employed SVD-based decomposition of the AECG matrix to extract the FECG from the residual after removing MECG-dominant components. However, it could not effectively separate fetal signals that overlap spectrally with MECG. Overall, traditional methods have simple structures and low computational cost. Nonetheless, they rely on handcrafted feature design and have poor adaptability to signal variations and noise.
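The template subtraction idea described above (coherent averaging of maternal beats followed by subtraction) can be illustrated with a minimal sketch. This is not the implementation of any cited work: the function names, the fixed window `half_width`, and the toy signal are assumptions made for illustration, and the maternal R-peak positions are taken as already known.

```python
def build_template(signal, peaks, half_width):
    """Coherent average of the windows centred on each maternal R-peak."""
    windows = [
        signal[p - half_width : p + half_width + 1]
        for p in peaks
        if p - half_width >= 0 and p + half_width < len(signal)
    ]
    if not windows:
        raise ValueError("no complete window around the given peaks")
    return [sum(w[k] for w in windows) / len(windows)
            for k in range(2 * half_width + 1)]


def subtract_template(signal, peaks, half_width):
    """Subtract the averaged maternal template at every maternal R-peak."""
    template = build_template(signal, peaks, half_width)
    residual = list(signal)
    for p in peaks:
        start = p - half_width
        if start < 0 or p + half_width >= len(signal):
            continue  # skip beats truncated by the record edges
        for k, value in enumerate(template):
            residual[start + k] -= value
    return residual


# Toy record (hypothetical): two identical maternal beats plus one small
# fetal spike located between them.
beat = [0.0, 1.0, 5.0, 1.0, 0.0]
aecg = [0.0] * 3 + beat + [0.0] * 3 + beat + [0.0] * 3
aecg[9] += 0.5  # fetal spike outside both maternal windows
residual = subtract_template(aecg, peaks=[5, 13], half_width=2)
```

In this toy case the maternal beats cancel and only the fetal spike survives. If the spike were placed inside a maternal window instead, part of it would leak into the averaged template and be subtracted from both beats, which is the overlap failure mode noted above.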
In recent years, deep learning has provided end-to-end solutions for FECG extraction. Mohebbian et al. [
13] introduced an attention-based CycleGAN, which employed attention masks to focus on fetal signal regions and reached a 99.70% F1-score. However, the full-layer attention design led to computational redundancy. Huang et al. [
14] proposed TCGAN, which utilized temporal convolution to enhance temporal feature capture capability and improved waveform detail preservation. Nevertheless, the instability of adversarial training easily leads to mode collapse, where the generator learns to produce only a limited subset of plausible FECG morphologies and fails to capture the full variability of real fetal ECG waveforms. This is particularly problematic for fetal monitoring, where subtle morphological changes in the P-wave, T-wave, or ST segment can carry diagnostic significance. Basak et al. [
15] developed 1D-CycleGAN, achieving morphology-preserving FECG extraction through spectral consistency constraints, further optimizing the waveform fidelity of GAN-based methods.
Beyond GAN-based approaches, attention mechanisms have been extensively investigated to enhance feature discrimination. Wang et al. [
16] proposed PA²Net, which integrated periodic-aware attention with residual connections and used KL divergence shared weights to improve the robustness of joint FECG detection, achieving a positive predictive value (PPV) of 99.74% on the Abdominal and Direct Fetal ECG Database (ADDB), though the stacking of multiple modules increased model complexity. Chen et al. [
17] developed a CNN–transformer hybrid model and compared the effectiveness of channel and spatial attention, verifying the advantage of channel attention for time-series signal adaptation. Wang et al. [
18] introduced ECA-Net, an efficient local cross-channel interaction strategy without dimensionality reduction, implementing efficient channel attention via 1D convolution and providing theoretical support for attention design. Wang et al. [
19] further proposed Correlation-Aware Attention CycleGAN, combining attention mechanisms with GANs to enhance feature discrimination capability. Nonetheless, the self-attention mechanism of transformers still suffers from the problem of computational cost increasing quadratically with sequence length.
Given the limitations of existing methods in handling complex noise and computational efficiency, alternative generative frameworks have been explored. Diffusion models have also been applied to signal reconstruction. Chen et al. [
20] proposed DIFF-FECG, a conditional diffusion-based method designed to handle non-Gaussian noise, achieving a Pearson correlation coefficient (PCC) of 0.92. Flow matching methods offer a promising alternative. Lipman et al. [
21] first proposed the generative modeling framework based on flow matching, which learns a deterministic vector field to achieve efficient mapping between distributions. Its single-step inference characteristic avoids the efficiency loss caused by multi-step iterations. However, existing flow matching methods lack specific feature enhancement design for the FECG extraction task, have limited ability to distinguish channel features between MECG and FECG, and struggle to further improve signal separation accuracy.
Despite these advances, several key limitations persist. Traditional methods lack robustness and have poor adaptability to MECG variations and complex noise. Deep learning methods suffer from a performance–efficiency imbalance: GANs and full-attention models are computationally complex, while diffusion models remain time-consuming for inference. The baseline flow matching model offers efficiency but lacks targeted attention enhancement, resulting in limited feature discrimination capability.
We propose a fetal electrocardiogram extraction algorithm based on minimal channel attention and flow matching (MCA-FM). We draw on the design ideas of hierarchical multi-kernel filtering (HNF) and diffusion-inspired conditional fusion (Diff) feature extraction modules found in [
20] and add a minimal channel attention mechanism and strong residual connection to achieve synergistic optimization of accuracy and robustness. Specifically, a precise mapping from AECG to FECG is constructed based on deterministic flow transformation. Inspired by the non-dimensionality-reduction cross-channel interaction idea of ECA-Net [
18], the newly added minimal channel attention module includes only global average pooling and a single linear layer. It uses learnable mixing weights to focus on key frequency channels related to the fetal signal, avoiding the redundant computation of full-layer attention. Enhanced residual connections fuse original features with attention-enhanced features, alleviating the gradient vanishing problem in deep models. Simultaneously, bridge variance scheduling (BVS) and target matching loss (TM) are introduced to ensure training stability and waveform morphology fidelity.
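The structure of the minimal channel attention module described above (global average pooling, a single linear layer, learnable mixing, and a residual connection) can be sketched in plain Python as follows. The text specifies only those ingredients, so the sigmoid gate, the scalar mixing weight `alpha`, and all names below are assumptions for illustration rather than the actual MCA-FM layer.

```python
import math


def minimal_channel_attention(features, weight, bias, alpha):
    """Sketch of a channel-attention block with a residual connection.

    features: list of C channels, each a list of T samples.
    weight:   C x C matrix of the single linear layer (no dimensionality reduction).
    bias:     list of C biases.
    alpha:    scalar mixing weight for the attended branch (assumed form).
    """
    # 1) Global average pooling: one descriptor per channel.
    pooled = [sum(channel) / len(channel) for channel in features]

    # 2) Single linear layer followed by a sigmoid gate (gate choice is an assumption).
    gates = []
    for row, b in zip(weight, bias):
        z = sum(w * p for w, p in zip(row, pooled)) + b
        gates.append(1.0 / (1.0 + math.exp(-z)))

    # 3) Residual fusion: original features plus the re-weighted features.
    return [[x + alpha * g * x for x in channel]
            for channel, g in zip(features, gates)]


# Example with two channels and an all-zero linear layer, so every gate is 0.5.
x = [[1.0, 3.0], [2.0, 2.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
y = minimal_channel_attention(x, zeros, [0.0, 0.0], alpha=1.0)
```

With `alpha = 0` the block reduces to the identity, which shows how the residual path preserves the original features when the attention branch contributes nothing.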
The main contributions of this paper are summarized as follows:
Novel Generative Modeling for FECG Extraction: We formulate non-invasive FECG extraction as a continuous-time generative process using flow matching. By learning a deterministic mapping via a probabilistic path, the model effectively isolates clean fetal signals from highly complex maternal mixtures, overcoming the limitations of linear assumptions in traditional blind source separation methods.
Minimal Channel Attention for Feature Discrimination: We introduce a minimal channel attention module tailored for physiological time-series signals. By capturing cross-channel dependencies without dimensionality reduction, this design enhances the model’s ability to discriminate between MECG and FECG features while avoiding the computational redundancy of full-layer attention, balancing accuracy against feature-extraction cost.
Real-World Clinical Robustness and Efficient Inference: Validated on ADDB and BDDB via leave-one-out cross-validation, our method maintains consistent waveform reconstruction and R-peak detection across diverse subjects under severe interference such as uterine contractions. Leveraging single-step flow matching inference, it achieves linear time complexity and ultra-fast execution, making it well-suited for real-time prenatal monitoring and deployment on resource-constrained portable devices.
The remainder of this paper is organized as follows.
Section 2 details the proposed MCA-FM framework, including the experimental datasets, preprocessing steps, model design, training strategy, inference process, and evaluation metrics.
Section 3 presents the experimental results, including waveform reconstruction and R-peak detection performance, visualization verification, statistical validation, ablation studies, sensitivity analysis, normalization strategy comparison, cross-dataset generalization, evaluation on synthetic data, consistency of fetal physiological parameters across datasets, and resource footprint analysis.
Section 4 discusses the advantages and limitations of the proposed method in comparison with existing approaches. Finally,
Section 5 concludes the paper and outlines directions for future research.
3. Results
All experiments were conducted on a unified platform to ensure reproducibility. The hardware configuration included a Lenovo Legion Y9000P IRX8 laptop equipped with 16.0 GB DDR5 RAM and an NVIDIA GeForce RTX 4050 Laptop GPU (6 GB VRAM) for accelerated deep learning. The software environment comprised 64-bit Windows 11 Home (version 25H2), Python 3.9 within a Conda virtual environment, and PyTorch 2.7.1 as the core deep learning framework.
To comprehensively verify the performance of the proposed model, systematic experiments were conducted on the ADDB and BDDB clinical datasets. Consistent with the experimental design outlined in
Section 2, the experiments used leave-one-out cross-validation and evaluated performance from multiple dimensions. Evaluation metrics included waveform reconstruction quality (MAE, SPC, PCC) and R-peak detection accuracy (Sen, PPV, F1), as well as Pearson correlation (
r), root mean square error (RMSE) for heart rate (HR) estimation, and mean difference with 95% limits of agreement (LoA) for R-to-R wave (RR) interval estimation. These multi-dimensional clinical metrics assess morphology, localization, consistency, and generalization for rigorous clinical evaluation. The specific results are as follows.
3.1. Waveform Reconstruction Performance
Building on the designed preprocessing pipeline, we evaluated the overall performance of the MCA-FM model on real clinical datasets. In addition to ADDB, we tested the model on BDDB, which features a broader variety of interferences, to examine its generalization ability.
Table 1 and
Table 2 present the quantitative evaluation of the proposed model’s FECG waveform reconstruction quality on the ADDB and BDDB datasets, respectively. The core metrics are MAE, SPC, and PCC. The model demonstrated excellent FECG extraction capability on both types of real datasets.
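For reference, two of the waveform metrics named above, MAE and PCC, can be computed as in the following sketch, which uses their standard textbook definitions. The function names and the toy waveforms are illustrative assumptions; SPC is omitted because its definition is not given in this section.

```python
import math


def mean_absolute_error(reference, estimate):
    """Mean absolute difference between two equally long waveforms."""
    return sum(abs(r - e) for r, e in zip(reference, estimate)) / len(reference)


def pearson_correlation(reference, estimate):
    """Pearson correlation coefficient between two equally long waveforms."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_e = sum(estimate) / n
    cov = sum((r - mean_r) * (e - mean_e) for r, e in zip(reference, estimate))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_e = sum((e - mean_e) ** 2 for e in estimate)
    return cov / math.sqrt(var_r * var_e)


# Toy example: the estimate has the right shape but twice the amplitude.
scalp_fecg = [0.0, 1.0, 0.0, -1.0]
extracted = [0.0, 2.0, 0.0, -2.0]
mae = mean_absolute_error(scalp_fecg, extracted)
pcc = pearson_correlation(scalp_fecg, extracted)
```

The toy example shows why the two metrics are reported together: PCC is insensitive to a global amplitude scale, whereas MAE penalizes it.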
The ADDB dataset (
Table 1) consists of recordings from five subjects, with detailed waveform reconstruction metrics reported for each. The model achieved excellent reconstruction performance, with mean PCC of 0.94 and MAE of 0.21, indicating high morphological agreement with the direct scalp ECG reference. The mean SPC of 0.94 confirmed strong frequency-domain consistency. Among individual subjects, r01, r04, r07, and r08 exhibited particularly high performance (PCC ≥ 0.96), while r10 displayed relatively lower PCC (0.86), possibly due to higher noise levels and atypical signal characteristics in that recording.
The BDDB dataset (
Table 2), comprising recordings from 12 subjects with significant physiological interferences, such as uterine contractions and electrode displacement, more closely resembled practical clinical scenarios. Despite these challenges, the model maintained good waveform reconstruction stability, achieving mean PCC of 0.91, SPC of 0.92, and MAE of 0.27 ± 0.125. Most subjects attained PCC above 0.92, with B2_Labour_11 reaching the highest PCC of 0.97. However, subject B2_Labour_03 exhibited substantially lower performance (PCC = 0.53, MAE = 0.66). Further analysis revealed that all four abdominal channels of this subject had severely degraded signal quality, with the best channel achieving a PCC of only 0.52 and no channel exceeding 0.60. The fetal frequency band signal-to-noise ratio was approximately −0.1 dB, indicating that fetal ECG energy was nearly indistinguishable from background noise.
3.2. R-Peak Detection Performance
R-peak detection accuracy is a key indicator for clinical application, and performance across individual subjects reflects the model’s adaptability to varying signal quality.
Table 3 and
Table 4 present the detailed R-peak classification results (TP, FP, FN, TN) for each subject in the ADDB and BDDB datasets, respectively. From these, Sen, PPV, and F1 metrics were calculated. The model achieved consistently high detection accuracy across subjects, as further illustrated by the confusion matrices in
Figure 3.
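The step from classification counts to Sen, PPV, and F1 follows the standard definitions, sketched below. The counts in the example are hypothetical and are not taken from Table 3 or Table 4.

```python
def detection_metrics(tp, fp, fn):
    """Return (sensitivity, positive predictive value, F1) from R-peak counts."""
    if tp == 0:
        return 0.0, 0.0, 0.0  # no correct detections: all three metrics are zero
    sen = tp / (tp + fn)          # fraction of true R-peaks that were detected
    ppv = tp / (tp + fp)          # fraction of detections that were true R-peaks
    f1 = 2 * sen * ppv / (sen + ppv)  # harmonic mean of Sen and PPV
    return sen, ppv, f1


# Hypothetical counts for illustration only.
sen, ppv, f1 = detection_metrics(tp=90, fp=10, fn=10)
```

TN does not enter any of the three metrics, so they remain meaningful even though non-peak samples vastly outnumber R-peaks.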
In the ADDB dataset (
Table 3,
Figure 3), the R-peak detection performance across the five subjects was generally balanced. Subjects r01 and r07 had zero false positives (FP = 0) and zero false negatives (FN = 0), with completely accurate classification. Subjects r04 and r08 had only two false positives each, with missed detections ≤ 3. Subject r10 had six false positives and eight missed detections, possibly related to signal baseline drift caused by fetal movement in that recording, yet its F1-score was still 98.80%, indicating very high overall R-peak detection accuracy. The overall confusion matrix shows that the model’s true positive rate (TPR) and true negative rate (TNR) on ADDB were both close to 100%, with only a small number of misjudgments in high-interference segments.
In the BDDB dataset (
Table 4,
Figure 3), among the 12 subjects, B2_Labour_01 and B2_Labour_08, which had excellent waveform reconstruction metrics, also achieved completely accurate R-peak detection. Eleven subjects had F1-scores ≥ 97.10%, maintaining a high level of detection accuracy. Only B2_Labour_03 performed markedly worse (FP = 143, FN = 176). A waveform comparison with the best-performing subject confirmed that QRS complexes were almost entirely obscured by noise in this recording, resulting in the high false positive and false negative counts. Such recordings are typically classified as invalid in clinical practice, which highlights the need for a signal quality assessment front-end in real-world wearable monitoring. The remaining subjects had false positives/false negatives ≤ 31, indicating that the model was robust to signals from most clinical individuals.
3.3. Visualization Verification
To further corroborate the quantitative results with intuitive evidence, two representative subjects from the BDDB dataset were selected for layered validation: subject B2_Labour_04 (low SNR with weak FECG amplitude, a typical challenging scenario) was used to demonstrate the model’s FECG extraction effect and waveform consistency, while subject B2_Labour_11 (strong uterine contraction interference) was adopted to verify the R-peak localization accuracy.
3.3.1. Visual Analysis of Extracted FECG Waveforms
Figure 4 illustrates the FECG extraction process using subject B2_Labour_04 as an example. The raw AECG signal (
Figure 4a) contains strong MECG components, background noise, and weakly expressed fetal ECG signals. After processing by the proposed model, the maternal component (
Figure 4b) was effectively separated and the clean FECG signal (
Figure 4c) was successfully extracted, with morphological features clearly restored. Compared with the true scalp FECG signal (
Figure 4d), the extracted FECG achieved high consistency in both time and amplitude.
To further evaluate the model’s temporal stability over a longer duration,
Figure 5 presents a continuous 30 s overlay of the extracted FECG and the ground-truth scalp FECG for the same subject, divided into three 10 s panels for visual clarity. The two waveforms exhibit strong agreement in both QRS timing and overall morphology, with no visible degradation or drift over the extended recording period, confirming the model’s consistent performance.
3.3.2. Verification of R-Wave Localization Accuracy
Based on the extraction results from subject B2_Labour_11, the model’s R-peak detection performance is verified (
Figure 6).
Figure 6a shows the true FECG signal and expert-annotated R-peaks (red dots);
Figure 6b shows the extracted FECG and predicted R-peaks (purple dots). In the displayed segment, the predicted R-peaks align closely with the true R-peaks, with no missed or false detections and a time offset below 50 ms (clinically acceptable). This agrees with the subject-level quantitative metrics (Sen = 99.70%, PPV = 99.40%, F1 = 99.50%). Because this recording contains strong uterine contraction interference, the result suggests that the minimal channel attention module focuses on fetal QRS features and suppresses background interference.
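One common way to turn annotated and predicted R-peak positions into TP, FP, and FN counts is to pair peaks that fall within a fixed tolerance window. The sketch below assumes a 50 ms tolerance, borrowed from the offset bound quoted above, and a simple greedy pairing; the matching rule actually used in this work is not stated in this section, and the peak positions are hypothetical.

```python
def match_r_peaks(reference, predicted, fs_hz, tol_ms=50.0):
    """Greedy one-to-one matching of R-peaks given as sample indices.

    Returns (tp, fp, fn). A predicted peak counts as a true positive when it
    lies within tol_ms of a not-yet-matched reference peak.
    """
    tol = tol_ms * fs_hz / 1000.0  # tolerance expressed in samples
    ref = sorted(reference)
    pred = sorted(predicted)
    i = j = tp = 0
    while i < len(ref) and j < len(pred):
        offset = pred[j] - ref[i]
        if abs(offset) <= tol:
            tp += 1      # matched pair
            i += 1
            j += 1
        elif offset < 0:
            j += 1       # prediction precedes every remaining reference peak
        else:
            i += 1       # reference peak has no prediction close enough
    return tp, len(pred) - tp, len(ref) - tp


# Hypothetical peak positions at a 200 Hz sampling rate (tolerance = 10 samples).
tp, fp, fn = match_r_peaks([100, 300, 500], [102, 305, 700], fs_hz=200)
```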
3.4. Statistical Validation of Cross-Dataset Stability
To statistically verify the cross-dataset stability of MCA-FM, we compared the PCC and F1 scores between ADDB (
n = 5) and BDDB (
n = 12) using the Mann–Whitney U test (independent samples, two-sided). The results are visualized in
Figure 7. No statistically significant differences were found for either PCC (
p = 0.1037) or F1 (
p = 0.2226), indicating that the model maintains consistent performance across the two independent clinical datasets despite the larger variability in BDDB.
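With group sizes as small as n = 5 and n = 12, a two-sided Mann–Whitney U test can be evaluated exactly by enumerating all ways of assigning the pooled ranks to the smaller group (6188 assignments here). The sketch below illustrates that procedure on made-up numbers; whether the reported p-values came from an exact or an asymptotic variant is not stated, and the function names are assumptions.

```python
from itertools import combinations


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_exact(group_a, group_b):
    """Return (U statistic of group_a, exact two-sided p-value)."""
    ranks = midranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    offset = n_a * (n_a + 1) / 2
    u_obs = sum(ranks[:n_a]) - offset
    mean_u = n_a * n_b / 2
    dev_obs = abs(u_obs - mean_u)
    extreme = total = 0
    # Enumerate every possible rank assignment for group_a under the null.
    for idx in combinations(range(n_a + n_b), n_a):
        u = sum(ranks[i] for i in idx) - offset
        total += 1
        if abs(u - mean_u) >= dev_obs - 1e-9:
            extreme += 1
    return u_obs, extreme / total


# Made-up values for illustration only (not the per-subject PCC or F1 scores).
u_stat, p_value = mann_whitney_exact([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

For larger samples the enumeration becomes impractical, and library routines such as `scipy.stats.mannwhitneyu` are the usual choice.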
3.5. Ablation Study
To evaluate the contribution of each key component of MCA-FM, we conducted ablation experiments on the ADDB dataset by systematically removing or replacing six components: minimal channel attention (MCA), enhanced residual connection, bridge variance scheduling (BVS), target matching (TM) loss, single-step inference (replaced with 20-step Euler ODE solver), and instance normalization (IN). All ablations were performed under the same leave-one-subject-out cross-validation protocol as the full model. The results are summarized in
Table 5.
Removing MCA leads to a slight decrease in Sen (99.64% to 99.54%), PPV (99.67% to 99.61%), and F1 (99.66% to 99.58%), indicating that MCA primarily contributes to improving recall and overall detection accuracy by enhancing the selection of key fetal features through cross-channel attention. Removing the enhanced residual connection causes a clear drop in F1 (99.66% to 99.46%) and Sen (99.64% to 99.50%), confirming its role in stabilizing gradient flow in deeper networks and preserving effective features that would otherwise be weakened during training.
Replacing BVS with a linear schedule results in a small but consistent degradation across all metrics, most notably F1, which decreases from 99.66% to 99.46%, and Sen, which decreases from 99.64% to 99.55%. This demonstrates that the parabolic variance schedule provides a measurable positive contribution, particularly to the overall F1 score.
Removing instance normalization from all HNF blocks leads to a consistent drop in R-peak detection performance: F1 falls from 99.66% to 99.58%, Sen from 99.64% to 99.58%, and PPV from 99.67% to 99.58%. These results confirm that instance normalization, applied internally on deep feature maps, makes a meaningful contribution to detection accuracy, likely by stabilizing the feature distributions across different input segments. This finding also validates the complementary design where input-level Z-score normalization handles global amplitude variations while internal IN enhances feature-level stability [
27].
In stark contrast, replacing the target matching loss with a standard noise-prediction loss causes a catastrophic collapse: Sen drops from 99.64% to 37.10%, PPV from 99.67% to 38.68%, and F1 from 99.66% to 35.96%. This unequivocally demonstrates that direct regression of the clean FECG signal is essential for this task. Finally, multi-step inference (20 steps) achieves nearly identical accuracy to the single-step full model (for example, F1 99.55% versus 99.66%) but increases inference time from 6.05 ms/segment to 33.31 ms/segment, validating the efficiency of our single-step design.
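The trade-off between single-step and 20-step inference can be illustrated with a toy Euler solver. The sketch assumes an idealized straight-line velocity field that points from the source to the target and does not change along the path; in MCA-FM the field is a trained network, so this shows only why one Euler step can suffice when the learned path is close to straight, and why cost grows with the number of steps.

```python
def euler_integrate(x0, velocity, n_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with explicit Euler."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy source and target vectors (hypothetical stand-ins for AECG and FECG).
source = [0.0, 1.0, -1.0]
target = [0.5, 0.2, 0.3]
calls = {"n": 0}


def straight_line_field(x, t):
    """Constant velocity along the straight path from source to target."""
    calls["n"] += 1  # each call stands in for one network forward pass
    return [b - a for a, b in zip(source, target)]


one_step = euler_integrate(source, straight_line_field, n_steps=1)
calls_one = calls["n"]
twenty_steps = euler_integrate(source, straight_line_field, n_steps=20)
calls_twenty = calls["n"] - calls_one
```

Both runs reach the same endpoint up to floating-point error, but the 20-step run evaluates the field 20 times instead of once.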
3.6. Sensitivity Analysis of BVS Parameter
Bridge variance scheduling (BVS) introduces a parabolic noise schedule parameterized by σmax, which controls the peak noise level at the midpoint t = 0.5. To investigate the robustness of MCA-FM to this hyperparameter, we evaluated the model on ADDB with σmax ∈ {0.1, 0.3, 0.5, 0.7, 1.0} while keeping all other settings unchanged. The results are summarized in
Table 6 and visualized in
Figure 8.
The model achieves the best performance at the default σmax = 0.5, with PCC = 0.944 ± 0.050, F1 = 99.66 ± 0.49%, and MAE = 0.206 ± 0.066. When σmax deviates from this value, performance degrades moderately: PCC decreases by approximately 0.04 to 0.05, F1 drops by about 0.6 to 1.6 percentage points, and MAE increases by roughly 0.05 to 0.07. These results confirm that σmax = 0.5 provides a balanced trade-off between diffusion strength and signal preservation, and that the model maintains reasonable performance within a moderate range (0.3 to 0.7).
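As an illustration of what a parabolic schedule with its peak at t = 0.5 can look like, the sketch below uses the simplest parabola that vanishes at both endpoints and reaches σmax at the midpoint. The exact functional form of BVS is not given in this section, so this formula and the function name are assumptions.

```python
def bridge_sigma(t, sigma_max=0.5):
    """Assumed parabolic schedule: 0 at t = 0 and t = 1, sigma_max at t = 0.5."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return 4.0 * sigma_max * t * (1.0 - t)


# Noise level along the path for the values of sigma_max tested above.
schedule = {
    s_max: [bridge_sigma(k / 4, s_max) for k in range(5)]
    for s_max in (0.1, 0.3, 0.5, 0.7, 1.0)
}
```

Under this form no noise is injected at either end of the path, so the endpoints stay pinned to the data, while σmax alone sets how far intermediate states are perturbed.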
3.7. Effect of Normalization
To justify the choice of Z-score normalization, we compared it with two alternative preprocessing strategies on the ADDB dataset: min–max normalization (scaling to [0, 1]) and no normalization (filtering only). All other experimental settings (model architecture, training hyperparameters, and five-fold leave-one-subject-out cross-validation) were kept identical. The results are summarized in
Table 7.
Z-score normalization yields the highest PCC (0.944 ± 0.050) and F1 (99.66 ± 0.49%), slightly outperforming min–max (PCC = 0.935 ± 0.014, F1 = 99.18 ± 0.35%) and no normalization (PCC = 0.936 ± 0.052, F1 = 99.15 ± 0.30%). These results indicate that Z-score normalization provides a modest but consistent advantage in waveform reconstruction and R-peak detection, likely because centering and scaling the input signals stabilizes training. The primary objective of MCA-FM is to recover the morphological shape of the FECG waveform and to detect R-peaks, both of which are largely invariant to global amplitude scaling, so discarding absolute amplitude through Z-score normalization does not compromise the core task. Therefore, we adopt Z-score normalization as the default preprocessing method.
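The two normalization strategies compared above can be written per segment as in the following sketch. Applying them segment by segment and using the population standard deviation are assumptions, as are the function names; the section specifies only the target range [0, 1] for min–max scaling.

```python
import statistics


def z_score(segment):
    """Centre a segment to zero mean and scale it to unit standard deviation."""
    mean = statistics.fmean(segment)
    std = statistics.pstdev(segment)  # population std (assumed convention)
    if std == 0.0:
        return [0.0 for _ in segment]  # constant segment: nothing to scale
    return [(v - mean) / std for v in segment]


def min_max(segment):
    """Scale a segment linearly to the range [0, 1]."""
    low, high = min(segment), max(segment)
    if high == low:
        return [0.0 for _ in segment]  # constant segment: map to zero
    return [(v - low) / (high - low) for v in segment]


segment = [2.0, 4.0, 6.0]
z = z_score(segment)
m = min_max(segment)
```

Both transforms are affine, so they preserve waveform shape and R-peak locations; they differ in how strongly a single large artifact compresses the rest of the segment, since min–max depends only on the two extreme samples.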
3.8. Cross-Dataset Generalization Analysis
To evaluate the generalization capability of MCA-FM across different datasets, we conducted a cross-dataset validation experiment. The model trained on ADDB was tested on BDDB, and the model trained on BDDB was tested on ADDB, using the same preprocessing pipeline and leave-one-out protocol. The results are summarized in
Table 8.
When trained on ADDB and tested on BDDB, the model maintained an F1 score of 96.98% and PCC of 0.89, demonstrating that even with only five training subjects, the model generalizes reasonably well to a larger and more challenging dataset. When trained on BDDB and tested on ADDB, the model achieved excellent performance (PCC = 0.97, F1 = 99.53%), indicating that the richer training data from BDDB transfers effectively to the smaller ADDB dataset. These cross-dataset results confirm that MCA-FM learns transferable features rather than overfitting to dataset-specific characteristics.
3.9. Evaluation on Synthetic Data with Controlled Interferences
To further evaluate the model’s robustness under controlled interference conditions beyond the real-world clinical datasets, we tested MCA-FM on the synthetic FECGSYNDB database. The results for five representative interference scenarios are summarized in
Table 9.
Across the five interference scenarios, the model maintained reasonable performance, with PCC ranging from 0.81 to 0.92 and F1 from 85.64% to 95.13%. The variation in performance across cases reflects the differing severity of the simulated interferences. Case 1 (fetal movement) and Case 2 (MHR/FHR accelerations/decelerations) achieved the highest F1 scores (94.36% and 95.13%, respectively). Case 3 (uterine contractions) yielded the lowest PCC (0.81) but still maintained an F1 above 91%. These results on synthetic data with controlled interferences complement the evaluations on ADDB and BDDB and further demonstrate the model’s ability to handle diverse physiological artifacts.
3.10. Consistency of Fetal Physiological Parameters (HR and RR Intervals) Across Datasets
HR and RR intervals are the most direct and critical clinical physiological indicators for fetal cardiac monitoring, and their estimation accuracy and consistency across datasets are important benchmarks for evaluating the clinical applicability of FECG extraction models. Based on the FECG extraction and R-peak detection results on the ADDB and BDDB clinical datasets, we performed correlation analysis and Bland–Altman agreement analysis on the predicted and ground-truth HR and RR interval values, with the visualization results shown in
Figure 9 and
Figure 10, respectively. Note that the coordinate axes of the HR and RR Bland–Altman plots differ because heart rate is measured in beats per minute while RR interval is measured in milliseconds; this reflects the typical magnitude of each metric rather than a difference in estimation quality.
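The relation between detected R-peaks, RR intervals in milliseconds, and heart rate in beats per minute can be made explicit with a short sketch. The function names and the peak positions are illustrative; the 200 Hz sampling rate in the example matches the segment definition given in the resource analysis.

```python
def rr_intervals_ms(peak_indices, fs_hz):
    """RR intervals in milliseconds from consecutive R-peak sample indices."""
    return [(b - a) * 1000.0 / fs_hz
            for a, b in zip(peak_indices, peak_indices[1:])]


def heart_rate_bpm(rr_ms):
    """Instantaneous heart rate in beats per minute from RR intervals in ms."""
    return [60000.0 / rr for rr in rr_ms]


# Hypothetical fetal R-peaks, 100 samples apart at 200 Hz.
rr = rr_intervals_ms([0, 100, 200], fs_hz=200)
hr = heart_rate_bpm(rr)
```

Because HR is inversely proportional to RR, the same timing error maps to numerically different spreads on the two scales, which is why the Bland–Altman axes differ.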
Figure 9 presents the HR correlation and agreement analysis across the ADDB and BDDB datasets.
Figure 9a and
Figure 9c are scatter plots of predicted HR (
HRpred) versus ground-truth HR (
HRECG) for the ADDB and BDDB datasets, respectively, demonstrating strong linear correlation between the predicted and true HR values (ADDB:
r = 0.907, RMSE = 4.26 bpm; BDDB:
r = 0.908, RMSE = 5.51 bpm;
p < 0.001), with the sample points clustered around the identity line (y = x).
Figure 9b,d are Bland–Altman plots illustrating the mean difference and 95% limits of agreement (LoA) between
HRpred and
HRECG: the ADDB dataset has a mean difference of −0.045 bpm and LoA of [−8.405, 8.315] bpm, while the BDDB dataset has a mean difference of 0.011 bpm and LoA of [−10.797, 10.818] bpm. The HR difference values of both datasets are tightly distributed around the zero line, with no obvious systematic bias, indicating high agreement in HR estimation across different clinical datasets.
Figure 10 shows the RR interval estimation consistency across the ADDB and BDDB datasets via Bland–Altman plots.
Figure 10a is the Bland–Altman plot for the ADDB dataset, where the mean difference between predicted and ground-truth RR intervals is −0.02 ms with 95% LoA of [−29.56, 29.52] ms;
Figure 10b is the Bland–Altman plot for the BDDB dataset, where the mean difference is −0.04 ms with 95% LoA of [−34.76, 34.67] ms. The two datasets exhibit negligible overall systematic bias (mean difference ≈ −0.03 ms) and tight 95% LoA (≈±30–35 ms), with residual errors randomly distributed around the zero line without trending deviation. This confirms that the model has high temporal fidelity in recovering fetal heart rate variability and that the RR interval estimation results have good consistency across different clinical datasets.
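The Bland–Altman quantities reported above (mean difference and 95% limits of agreement) follow the usual construction: the bias is the mean of the paired differences, and the limits are the bias plus or minus 1.96 standard deviations of those differences. A minimal sketch with hypothetical HR values is given below; the use of the sample standard deviation is an assumption.

```python
import statistics


def bland_altman(predicted, reference):
    """Return (mean difference, lower LoA, upper LoA) for paired measurements."""
    diffs = [p - r for p, r in zip(predicted, reference)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (assumed convention)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired heart rates in bpm, for illustration only.
hr_pred = [121.0, 119.0, 141.0, 139.0]
hr_ref = [120.0, 120.0, 140.0, 140.0]
bias, loa_low, loa_high = bland_altman(hr_pred, hr_ref)
```

A bias near zero with symmetric limits, as in this toy case, is the pattern described above: no systematic over- or under-estimation, with the width of the limits set by the random error.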
3.11. Time Complexity, Resource Footprint, and Real-Time Performance Analysis
Beyond reconstruction and detection accuracy, the practical deployment of a fetal ECG extraction model in real-time monitoring scenarios also depends on its computational efficiency and resource requirements. This section evaluates the time complexity, inference efficiency, memory footprint, and energy consumption of the MCA-FM model on both GPU and CPU, with the quantitative results summarized in
Table 10.
Table 10 presents the total inference time, mean and median latency per segment, throughput, real-time factor (RTF), time complexity, peak memory usage, and energy consumption per segment for both GPU and CPU inference on the ADDB and BDDB datasets. The signal segment length is uniformly set to 1000 samples (5 s at a sampling rate of 200 Hz). RTF is defined as the ratio of signal duration to inference time; RTF > 1 indicates that the model’s inference speed exceeds real-time requirements.
On GPU, the model achieves a mean inference latency of approximately 4.5 ms per 5 s segment with an RTF exceeding 1100, far surpassing real-time requirements. The peak GPU memory usage is 41.3 MB, and the energy consumption is 220–519 mJ per segment. On CPU, the mean latency is approximately 61–63 ms per segment, corresponding to an RTF of approximately 80, which still exceeds real-time requirements by a substantial margin. The peak CPU memory usage is 666.5–674.6 MB, and the energy consumption is 548–1342 mJ per segment. The model has a linear time complexity of O(N·L), and the mean inference time per segment on both platforms is far lower than the actual segment duration (5 s). These results demonstrate that MCA-FM offers excellent computational efficiency and real-time performance. The low GPU memory footprint and moderate energy consumption further indicate that the model is suitable for both high-throughput clinical workstation deployment and resource-constrained edge computing scenarios, providing a solid technical foundation for practical clinical application.
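The real-time factor follows directly from the definition above (signal duration divided by inference time). The sketch below reproduces the order of magnitude of the reported values from the stated segment length, sampling rate, and approximate latencies; the function name is illustrative, and 62 ms is used as a representative point in the reported 61–63 ms CPU range.

```python
def real_time_factor(n_samples, fs_hz, latency_ms):
    """Ratio of signal duration to inference time; values above 1 beat real time."""
    duration_s = n_samples / fs_hz
    return duration_s / (latency_ms / 1000.0)


# 1000-sample segments at 200 Hz (5 s each), with the approximate mean latencies.
rtf_gpu = real_time_factor(1000, 200, latency_ms=4.5)
rtf_cpu = real_time_factor(1000, 200, latency_ms=62.0)
```

These evaluate to roughly 1111 on GPU and roughly 81 on CPU, consistent with the RTF figures quoted above.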
4. Discussion
This study proposes MCA-FM, which rests on two core improvements: minimal channel attention and flow matching. The model achieves efficient non-invasive FECG extraction with strong waveform reconstruction, R-peak detection, and generalization performance. Below, we compare MCA-FM with mainstream methods, examine the contribution of each improvement, and discuss the model’s limitations and future research directions.
To ensure a fair comparison, all baseline methods were reimplemented and retrained under the same preprocessing pipeline, segmentation scheme, sampling rate, and leave-one-out cross-validation protocol as MCA-FM. The comparison includes representative methods from multiple categories: ICA-based (SA-KICA [
31]), diffusion-based (DIFF-FECG [
20]), GAN-based (1D-CycleGAN [
19]), CNN–transformer [
17], and time-frequency domain approaches [
26]. The complete results are summarized in
Table 11.
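The leave-one-out protocol can be illustrated with a short split generator. This is a sketch under an assumption: the text does not state which unit is held out, so subject-level folds are assumed here, and the subject labels are hypothetical placeholders.

```python
def leave_one_subject_out(subject_ids):
    """Yield (train_ids, test_id) pairs, holding out one subject per fold."""
    for held_out in subject_ids:
        train = [s for s in subject_ids if s != held_out]
        yield train, held_out


# Hypothetical subject labels; ADDB is described in the text as having 5 subjects.
addb_subjects = ["s1", "s2", "s3", "s4", "s5"]
folds = list(leave_one_subject_out(addb_subjects))
```

Each baseline would then be retrained once per fold on the training subjects and evaluated on the held-out subject, so that all methods see identical splits.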
Kernel-based blind source separation approaches, such as SA-KICA [
31], extend traditional ICA by incorporating spectral attention to enhance separation. However, their underlying linear mixing assumption limits performance under nonlinear interference. On ADDB, SA-KICA achieves PCC = 0.73 ± 0.064 and MAE = 0.52 ± 0.071, substantially worse than MCA-FM’s PCC = 0.94 ± 0.050 and MAE = 0.21 ± 0.066. On BDDB, the extracted FECG amplitude is also severely distorted. MCA-FM abandons the linear assumption, learns a single-step deterministic mapping via CFM, and directly models the nonlinear AECG-FECG relationship, making it more robust to nonlinear interference.
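For reference, the two waveform metrics quoted here can be computed as in the following minimal sketch. It assumes the standard definitions of the Pearson correlation coefficient and the mean absolute error; any amplitude normalization applied before computing MAE in this study is not reproduced, and the toy signals are illustrative only.

```python
import math


def pcc(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mae(x, y):
    """Mean absolute error between two equal-length signals."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Toy example: the estimate has the right shape but half the amplitude.
reference = [0.0, 0.1, 1.0, -0.3, 0.0]
estimate = [0.5 * v for v in reference]
shape_score = pcc(reference, estimate)
amplitude_error = mae(reference, estimate)
```

In this toy case the PCC stays at 1 because correlation is insensitive to scaling, while the MAE is nonzero. This is why amplitude distortion of the kind noted for SA-KICA is better reflected in MAE than in PCC.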
As a conditional diffusion-based method, DIFF-FECG [
20] handles non-Gaussian noise through multi-step denoising. However, its iterative inference process leads to a higher computational cost. Under our standardized evaluation protocol, it achieves PCC = 0.74 ± 0.142 and F1 = 92.20 ± 6.70% on ADDB, and PCC = 0.69 ± 0.162 and F1 = 87.46 ± 12.42% on BDDB, both considerably lower than MCA-FM. In contrast, MCA-FM’s single-step flow matching inference not only achieves higher accuracy but also offers substantially faster inference.
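The cost difference between single-step and iterative sampling can be illustrated with a toy example. The sketch below is not the MCA-FM or DIFF-FECG implementation: `oracle_velocity` is a hypothetical stand-in for a trained velocity network, the toy target (0.2 times the conditioning signal) is arbitrary, and the step count of 50 is chosen only for illustration. It assumes a straight-line flow path, for which a single Euler step from t = 0 already reaches the target.

```python
def euler_sample(velocity, x0, cond, n_steps):
    """Integrate dx/dt = velocity(x, t, cond) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    evals = 0
    for k in range(n_steps):
        t = k * dt
        v = velocity(x, t, cond)
        evals += 1  # one network evaluation per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, evals


def oracle_velocity(x, t, cond):
    # Hypothetical stand-in for a trained network: it points straight at a
    # toy target derived from the conditioning signal.
    target = [0.2 * c for c in cond]
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


cond = [0.0, 1.0, -0.5, 0.25]   # toy conditioning signal
x0 = [0.3, -0.1, 0.7, 0.0]      # fixed starting sample standing in for noise
one_step, evals_one = euler_sample(oracle_velocity, x0, cond, n_steps=1)
many_step, evals_many = euler_sample(oracle_velocity, x0, cond, n_steps=50)
```

Both runs end at the same point, but the multi-step run needs 50 evaluations of the velocity function instead of one. Inference cost scales with the number of steps, which is the source of the speed advantage described above.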
Generative adversarial networks (GANs) [
19] risk mode collapse under low SNR, which can erase weak features (e.g., P-waves and T-waves). On ADDB, 1D-CycleGAN achieves F1 = 98.91 ± 1.48% for R-peak detection, while MCA-FM reaches 99.66 ± 0.49%. The gain is attributed to minimal channel attention, which dynamically focuses on the fetal QRS frequency band (10–15 Hz) and thereby avoids the feature loss associated with adversarial training. On BDDB, the gap widens slightly (from 0.75 to 0.80 percentage points), with 1D-CycleGAN dropping to F1 = 96.57 ± 7.89% compared with MCA-FM’s 97.37 ± 6.43%. MCA-FM’s smaller spread on both datasets points to greater stability under strong physiological interference.
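The R-peak F1 scores compared here depend on how detected peaks are matched to reference annotations. The following is a minimal sketch of one common scheme, greedy matching within a tolerance window. The tolerance of 10 samples (50 ms at 200 Hz) is an assumption for illustration and is not taken from this study, and the peak positions are toy values.

```python
def match_peaks(reference, detected, tolerance):
    """Greedily match sorted peak indices within +/- tolerance samples."""
    tp = 0
    i = j = 0
    while i < len(reference) and j < len(detected):
        diff = detected[j] - reference[i]
        if abs(diff) <= tolerance:
            tp += 1          # matched pair: true positive
            i += 1
            j += 1
        elif diff < 0:
            j += 1           # detection too early to match: false positive
        else:
            i += 1           # reference peak has no detection: false negative
    fp = len(detected) - tp
    fn = len(reference) - tp
    return tp, fp, fn


def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy annotations (sample indices); the third detection is 15 samples off.
reference_peaks = [100, 240, 380, 520, 660]
detected_peaks = [102, 238, 395, 521, 661]
tp, fp, fn = match_peaks(reference_peaks, detected_peaks, tolerance=10)
f1 = f1_score(tp, fp, fn)
```

In this toy case one misplaced detection counts as both a false positive and a false negative, so F1 falls to 0.8 even though four of five beats are found.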
The CNN–transformer hybrid model [
17] combines the local feature extraction capability of CNNs with the global context modeling of transformers. However, the standard self-attention mechanism brings considerable computational overhead when processing long input sequences, which limits its efficiency for real-time monitoring scenarios. Under our standardized evaluation, it achieves PCC = 0.83 ± 0.061 and F1 = 98.17 ± 0.66% on ADDB, and PCC = 0.85 ± 0.109 and F1 = 96.80 ± 5.96% on BDDB, trailing MCA-FM on both datasets, most clearly in PCC on ADDB (0.83 vs. 0.94). This suggests that, while hybrid architectures are promising, the lightweight channel attention in MCA-FM provides a more effective and efficient alternative for fetal ECG extraction.
Time-frequency domain methods [
26] convert AECG to 2D representations, introducing redundancy and higher complexity. On ADDB, their MAE (0.27 ± 0.065) is 29% higher than MCA-FM’s (0.21 ± 0.066), and on BDDB their F1 of 97.27 ± 6.46% is marginally below MCA-FM’s 97.37 ± 6.43%. The 2D transformation also increases the number of parameters and processing steps without a corresponding gain in extraction quality. MCA-FM directly optimizes 1D signal mapping without domain transformation, achieving a better accuracy–efficiency balance.
Specifically, the minimal channel attention module in MCA-FM follows ECA-Net’s dimensionality-reduction-free cross-channel interaction [
18]. This design avoids unnecessary computational burden while significantly enhancing feature selection. After adding this module, PCC on ADDB increased to 0.94 ± 0.050, and SPC reached 0.94 ± 0.044, confirming its effectiveness in improving the frequency-domain consistency of fetal QRS complexes. The enhanced residual connection draws on the periodic-aware residual design of PA²Net [16] for FECG detection. It retains the identity mapping inside the HNF block and adds a new linear fusion of “original feature + weighted residual” (with a weighting coefficient of 0.05) after the attention module. The resulting R-peak detection F1 reaches 97.37 ± 6.43% on BDDB, demonstrating its ability to stabilize gradient flow and avoid loss of effective features under strong interference.
The remaining false positives and false negatives observed in the confusion matrices are predominantly concentrated in low-quality segments where the fetal ECG is nearly obscured by transient artifacts. Integrating a signal quality assessment module to identify and flag such segments before processing would be a practical direction for reducing these errors. The BDDB dataset is particularly informative for assessing clinical robustness, as its greater diversity of physiological interferences provides a more rigorous test of generalization. The fact that MCA-FM maintains high PCC and F1 on BDDB underscores its practical potential in challenging clinical scenarios.
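The two components discussed in this paragraph can be sketched in a few lines. This is a simplified, framework-free illustration and not the MCA-FM implementation: the kernel values and toy feature map are hypothetical, and the fusion is assumed to take the form "input + 0.05 × attention output", which is one reading of the "original feature + weighted residual" description.

```python
import math


def channel_attention(features, kernel):
    """ECA-style attention on a [channels][time] feature map.

    Each channel is summarized by its mean, a 1D convolution mixes
    neighboring channel summaries (no dimensionality reduction), and a
    sigmoid turns the result into per-channel weights.
    """
    n = len(features)
    pooled = [sum(ch) / len(ch) for ch in features]  # global average pooling
    half = len(kernel) // 2
    weights = []
    for c in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = c + k - half
            if 0 <= idx < n:  # zero padding at the channel edges
                acc += w * pooled[idx]
        weights.append(1.0 / (1.0 + math.exp(-acc)))
    return [[w * v for v in ch] for w, ch in zip(weights, features)]


def weighted_residual_fusion(features, attended, weight=0.05):
    """Assumed fusion form: original feature + weight * attention output."""
    return [
        [x + weight * a for x, a in zip(fx, fa)]
        for fx, fa in zip(features, attended)
    ]


# Toy feature map with 3 channels and 4 time steps; kernel values are arbitrary.
features = [[1.0] * 4, [0.0] * 4, [2.0] * 4]
attended = channel_attention(features, kernel=[0.25, 0.5, 0.25])
fused = weighted_residual_fusion(features, attended, weight=0.05)
```

Because the attention weights come from a small 1D convolution over channel summaries, the module adds only as many parameters as the kernel has taps. With a small fusion weight, the original feature dominates the output, which is consistent with the stabilizing role described above.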
As shown in the quantitative comparison results in
Table 11, although MCA-FM outperforms existing methods on most metrics, several limitations remain. The experiments are based on two clinical datasets of healthy full-term pregnancies, ADDB (5 subjects) and BDDB (12 subjects), supplemented by the FECGSYNDB synthetic dataset for controlled interference evaluation. Real pathological scenarios such as preterm birth (<37 weeks) or fetal arrhythmias (e.g., supraventricular tachycardia) are not included. In such clinical scenarios, the FECG amplitude is lower (about 0.1–0.3 mV, roughly 50% lower than that of full-term fetuses) and waveform distortion is more severe, so model performance may decrease, by an estimated 10–15%. In addition, the assumption of short-term noise stationarity within each 5 s window, while common in fetal ECG processing, has not been formally verified. These limitations indicate directions for further optimization of the model, which will be addressed in future research.
In future work, we hope to collaborate with obstetric departments to collect AECG data spanning 24 to 42 weeks of gestation (including 30 preterm cases and 20 arrhythmia cases), and to introduce domain adaptation training (e.g., domain-adversarial neural networks) to reduce the distribution shift between healthy and pathological data. We aim to further enhance the model’s clinical value and promote the adoption of non-invasive fetal monitoring technology in primary healthcare institutions.