2.2.2. Backbone 1D U-Net
The model backbone adopts a 1D U-Net with a four-layer encoder–decoder structure. The U-Net architecture is commonly used for 2D image segmentation [24]. However, its symmetric encoder–decoder structure and its core design of skip connections can be adapted effectively to one-dimensional temporal waveform signals. In this study, the 2D convolutions in the encoder–decoder convolutional blocks are replaced with 1D convolutions, each with a kernel size of 3 and a stride of 1, making the network suitable for processing one-dimensional temporal blood pressure waveforms. Downsampling in the encoder is performed using 1D max-pooling with a pool size of 2 and a stride of 2. Let $C_{in}$ denote the number of input channels, $C_{out}$ denote the number of output channels, and $k$ denote the convolution kernel size. Neglecting bias terms, the number of parameters of the two convolution operations can then be expressed as follows:
$$P_{1D} = C_{in} \times C_{out} \times k, \qquad P_{2D} = C_{in} \times C_{out} \times k \times k$$
When $k = 3$ is used, as in this study, the parameter size and computational cost of a 1D convolution are much lower than those of a 2D convolution: the ratio $P_{1D}/P_{2D} = 1/k = 1/3$ theoretically reduces the convolutional overhead by approximately 67%. This approach not only allows local fine physiological features, such as the pulse upstroke and rebound waves, to be captured precisely along the temporal dimension, but also avoids the redundant computations associated with 2D convolutions. Therefore, this design achieves both a lightweight model and effective feature extraction at the backbone-network level.
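The parameter comparison above can be checked numerically. The following is a minimal sketch, not the authors' code: the helper name `conv_params` is hypothetical, bias terms are omitted by default, and the 32-to-64 channel pair is chosen only as an illustrative layer.

```python
def conv_params(c_in: int, c_out: int, k: int, dims: int, bias: bool = False) -> int:
    """Learnable parameters of a convolution whose kernel has width k in each of `dims` axes."""
    weights = c_in * c_out * k ** dims
    return weights + (c_out if bias else 0)


# Illustrative layer (assumption): 32 input channels, 64 output channels, kernel size 3.
p1d = conv_params(32, 64, 3, dims=1)
p2d = conv_params(32, 64, 3, dims=2)
reduction = 1 - p1d / p2d

print(p1d, p2d, f"{reduction:.1%}")  # 6144 18432 66.7%
```

The ratio depends only on the kernel size, so the same reduction of roughly 67% holds for any channel pair when the kernel size is 3 and biases are ignored.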
First, the preprocessed blood pressure signal is fed into the encoder, where it is downsampled layer by layer through four convolutional blocks (ConvBlocks). The number of signal channels doubles at each layer, increasing from 32 to 256. Each convolutional block consists of two 1D convolutions, batch normalization, and ReLU activation. During encoding, the signal’s temporal dimension is progressively compressed through pooling operations, extracting deep features at different scales. Subsequently, at the bottleneck layer, the deep features output by the encoder are further fused and enhanced to improve representational capacity. The decoder restores the signal’s scale layer by layer through upsampling. To compensate for the information loss caused by deep downsampling, the decoder features are concatenated with those of the corresponding encoder layers via skip connections. The four-layer structure balances feature extraction depth against parameter volume: it is deep enough that long-term blood pressure trends are not missed, while avoiding the overfitting often caused by deeper networks. This structure not only extracts local details from the blood pressure signal but also preserves the overall trend of changes, providing fundamental features for subsequent waveform reconstruction.
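To make the encoder's progressive compression concrete, the sketch below traces the shape of each stage. It rests on assumptions not stated in the text: the convolutions use "same" padding so that only pooling changes the length, a pooling step follows each of the four blocks, and the input length of 1024 samples is hypothetical. The function name `encoder_shapes` is also illustrative.

```python
def encoder_shapes(length: int, channels=(32, 64, 128, 256), pool: int = 2):
    """Return the (channels, length) of each encoder stage output and the bottleneck input length.

    Assumes 'same' padding, so the kernel-3, stride-1 convolutions keep the length
    and only the pooling step (size 2, stride 2) halves it.
    """
    shapes = []
    for c in channels:
        shapes.append((c, length))  # feature map available to the skip connection
        length //= pool             # max-pooling halves the temporal length
    return shapes, length


skips, bottleneck_len = encoder_shapes(1024)  # 1024 is a hypothetical input length
print(skips)           # [(32, 1024), (64, 512), (128, 256), (256, 128)]
print(bottleneck_len)  # 64
```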
2.2.3. Blood Pressure Value Output Network
The blood pressure output network consists of an LSTM based on a fused multi-scale features module, a global average pooling module, and a two-layer fully connected prediction head. This architecture leverages the middle-to-deep layer skip connection features and bottleneck layer features from the 1D U-Net encoder to effectively capture multi-scale temporal information from the pulse wave, enabling efficient blood pressure regression prediction. The detailed structure is illustrated in Figure 3.
- (a) LSTM Based on Fused Multi-Scale Features
To better handle information from different temporal scales in the 1D U-Net module, this study designs an LSTM temporal modeling module based on multi-scale feature fusion. Unlike existing multi-scale feature fusion architectures, the proposed fusion strategy is not a simple stacking of existing network modules. Instead, it is a selective mid-to-deep feature reuse and lightweight fusion strategy tailored for radar-based pulse-wave blood pressure regression.
Traditional multi-scale feature fusion architectures usually enhance feature representation by introducing additional scale-construction pathways or scale-selection mechanisms. The typical logic of FPN-style architectures is to construct a multi-level output feature pyramid: deep features are progressively upsampled through a top-down pathway and fused with the corresponding shallow features via lateral connections, commonly using element-wise addition. This structure promotes layer-wise interaction among features at different levels. However, it usually requires additional upsampling operations, lateral convolutional transformations for channel alignment, and layer-wise fusion computation. It is therefore more suitable for visual tasks, such as object detection and image segmentation, where multi-scale output feature maps are required [25]. Inception-style architectures usually introduce multiple parallel convolutional branches within the same layer. These branches extract feature representations with different receptive fields using convolutional kernels of different sizes or pooling paths, and their outputs are then fused through channel-wise concatenation. Their multi-scale representation capability mainly comes from the additionally constructed parallel branches, which increase convolutional computation, the number of parameters, and intermediate feature storage cost [26]. In addition, dynamic weighting fusion methods, such as gating mechanisms or adaptive scale-selection mechanisms, usually require an extra weight-generation module to adaptively assign the importance of features from different scales according to the input features. Although these methods improve fusion flexibility, they also introduce additional parameters, weight computation procedures, and training complexity [27,28].
Unlike the above multi-scale fusion methods, the proposed method does not construct an additional top-down feature pyramid, introduce new parallel multi-branch convolutional structures, or employ a dynamic weight-generation module. Instead, it directly reuses the hierarchical temporal features already formed by the encoder of the 1D U-Net during forward propagation. In this design, three feature maps at different scales are reused: the skip connection features e3 and e4 and the bottleneck feature b. Here, e1, e2, e3, and e4 represent the output features that the successive stages of the U-Net encoder pass through the skip connections, while b corresponds to the bottleneck feature at the end of the encoder. These features contain temporal information at different levels. The shallow features e1 and e2 are not fused in this study because they mainly retain local details and low-level waveform information, which may introduce redundancy and noise into blood pressure regression. In contrast, the mid- to deep-layer skip connection features e3 and e4 and the bottleneck feature b provide stronger abstract temporal representations, making them more suitable for subsequent temporal modeling and regression prediction. Therefore, the proposed method is not a naive stacking of U-Net features, but a task-oriented selective multi-scale feature reuse strategy for blood pressure regression.
In terms of the fusion operation, this study adopts a “concatenate first, then project” strategy. Specifically, e3, e4, and b are first concatenated along the channel dimension to preserve the diversity and complementarity of features from different scales. Then, a standard learnable convolution is used for channel-level projection and fusion, thereby achieving cross-scale information integration with relatively low computational overhead. The fused single-stream feature is subsequently fed into the LSTM module for unified temporal dependency modeling, allowing the model to capture the dynamic evolution of pulse-wave features over time. Overall, through selective feature reuse and single-stream lightweight fusion, this strategy preserves multi-scale temporal information while reducing additional structural overhead, thereby achieving a balance between feature representation capability and model complexity.
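The "concatenate first, then project" step can be sketched in NumPy as follows. This is an illustration under several assumptions, not the authors' implementation: the text does not say how e3, e4, and b are aligned in time before concatenation, so nearest-neighbour resampling to the coarsest length is assumed here; the projection is assumed to be a pointwise (kernel size 1, no bias) convolution, which reduces to one matrix multiplication per time step; and the bottleneck width of 512, the output width of 128, and all tensor lengths are hypothetical.

```python
import numpy as np


def resample_nearest(x: np.ndarray, length: int) -> np.ndarray:
    """Nearest-neighbour resampling of a (C, T) feature map to (C, length). Assumed alignment step."""
    idx = np.arange(length) * x.shape[1] // length
    return x[:, idx]


def concat_then_project(features, weight: np.ndarray) -> np.ndarray:
    """Concatenate (C_i, T_i) features on the channel axis, then apply a pointwise projection."""
    target_len = min(f.shape[1] for f in features)  # align to the coarsest scale (assumption)
    aligned = [resample_nearest(f, target_len) for f in features]
    stacked = np.concatenate(aligned, axis=0)       # shape: (sum of C_i, target_len)
    return weight @ stacked                         # shape: (C_out, target_len)


rng = np.random.default_rng(0)
e3 = rng.standard_normal((128, 256))
e4 = rng.standard_normal((256, 128))
b = rng.standard_normal((512, 64))                  # bottleneck width 512 is hypothetical
w = rng.standard_normal((128, 128 + 256 + 512)) * 0.01  # stand-in for learned projection weights
fused = concat_then_project([e3, e4, b], w)
print(fused.shape)  # (128, 64)
```

The fused single-stream feature, transposed to shape (time, channels), would then be the input sequence of the LSTM.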
- (b) Global Average Pooling
After multi-scale feature fusion and LSTM modeling, the feature tensor still retains its temporal dimension. If it were fed directly into fully connected layers, the high dimensionality would make the model sensitive to the length of the input sequence, so an effective global aggregation mechanism is required. This study therefore uses Global Average Pooling (GAP) [29] to compress the temporal features output by the LSTM into a fixed-dimensional global representation. After multi-scale feature fusion and LSTM temporal modeling, the network obtains the temporal hidden representation:
$$H \in \mathbb{R}^{B \times C \times T}$$
where $B$ represents the batch size, $C$ represents the number of channels, and $T$ represents the temporal dimension length.
GAP averages along the temporal dimension, aggregating the response strength of each channel over the entire duration and thereby providing a global feature description of the waveform. It effectively consolidates global temporal information, preventing the model from relying excessively on specific time points. Rather than flattening the temporal features output by the LSTM and feeding them directly into fully connected layers, this study applies GAP along the temporal dimension to average each channel independently. GAP itself introduces no learnable parameters and compresses the feature dimension from $C \times T$ to $C$. If the number of hidden units in the subsequent fully connected layer is $M$, the number of parameters can be reduced from $C \times T \times M$ to $C \times M$, thereby substantially reducing the parameter size of the regression head and the overall model complexity. Moreover, GAP adapts well to sequences of varying lengths, enhancing the model’s robustness. The global average pooling along the temporal dimension is calculated as follows:
$$z_{b,c} = \frac{1}{T} \sum_{t=1}^{T} H_{b,c,t}$$
Here, $z \in \mathbb{R}^{B \times C}$ denotes the features after pooling.
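A small NumPy sketch of temporal global average pooling and of the parameter saving it brings to the first fully connected layer is given below. The tensor layout (batch, channels, time), the helper names, and all sizes are illustrative assumptions rather than values reported in the study, and bias terms are excluded from the weight counts.

```python
import numpy as np


def global_average_pool(h: np.ndarray) -> np.ndarray:
    """Average a (batch, channels, time) tensor over the time axis, giving (batch, channels)."""
    return h.mean(axis=-1)


def fc_weight_count(in_features: int, hidden: int) -> int:
    """Weights of one fully connected layer, bias excluded."""
    return in_features * hidden


batch, channels, steps, hidden = 4, 128, 64, 64  # hypothetical sizes
h = np.ones((batch, channels, steps))
z = global_average_pool(h)

flat = fc_weight_count(channels * steps, hidden)  # flatten, then fully connected
pooled = fc_weight_count(channels, hidden)        # GAP, then fully connected
print(z.shape, flat, pooled)  # (4, 128) 524288 8192

# The pooled width does not depend on the sequence length.
z_long = global_average_pool(np.ones((batch, channels, 2 * steps)))
print(z_long.shape)  # (4, 128)
```

With these sizes the first-layer weight count falls by a factor equal to the temporal length, here 64.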
- (c) Fully Connected Layer
The pooled global features are fed into a two-layer fully connected network for blood pressure regression prediction. The first layer performs high-order feature fusion and dimensional transformation through a nonlinear mapping, using the ReLU activation function to enhance the model’s ability to fit complex signal distributions. In addition, Dropout is introduced to randomly suppress the responses of some neurons, reducing the risk of overfitting and improving the model’s generalization and robustness across radar signals from different individuals. The second layer maps the fused features directly to a two-dimensional output corresponding to SBP and DBP, achieving continuous blood pressure prediction with a simple and efficient output structure for the blood pressure regression task [30].
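The forward pass of such a two-layer head can be sketched in NumPy as follows. This is a minimal illustration under assumptions: the hidden width of 64, the input width of 128, the dropout placement after the ReLU, the inverted-dropout scaling, and the output column order (SBP first, DBP second) are all hypothetical, and the random weights stand in for trained parameters.

```python
import numpy as np


def regression_head(z, w1, b1, w2, b2, drop_p=0.0, rng=None):
    """Two-layer head: linear -> ReLU -> optional inverted dropout -> linear, output (batch, 2)."""
    hidden = np.maximum(z @ w1 + b1, 0.0)
    if drop_p > 0.0 and rng is not None:            # training mode only
        mask = rng.random(hidden.shape) >= drop_p   # keep each unit with probability 1 - drop_p
        hidden = hidden * mask / (1.0 - drop_p)     # rescale so the expected activation is unchanged
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
z = rng.standard_normal((4, 128))                   # pooled features, hypothetical sizes
w1, b1 = rng.standard_normal((128, 64)) * 0.1, np.zeros(64)
w2, b2 = rng.standard_normal((64, 2)) * 0.1, np.zeros(2)

pred = regression_head(z, w1, b1, w2, b2)           # inference: dropout disabled
sbp, dbp = pred[:, 0], pred[:, 1]                   # column order is an assumption
train_pred = regression_head(z, w1, b1, w2, b2, drop_p=0.5, rng=rng)
print(pred.shape, train_pred.shape)  # (4, 2) (4, 2)
```

Dropout is applied only when a probability and a random generator are supplied, mirroring the usual distinction between training and inference.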