3.1. Overall Architecture
We now present the two modules that address the limitations discussed above. DMSB targets the limited receptive field and parameter inefficiency of multi-scale CNNs; CTA targets the dilution of softmax attention by common-mode background responses. The two modules are cascaded into a DMSC Block and stacked via residual connections [33], forming the complete DMSCNet framework.
Given a sensor sequence $X \in \mathbb{R}^{T \times C}$ of length $T$ with $C$ channels, DMSCNet maps it to an activity class probability distribution $\hat{y} \in \mathbb{R}^{K}$, where $K$ is the number of classes. The backbone consists of two cascaded residual blocks (ResBlocks) [33], each containing two cascaded DMSC Blocks with a skip connection. In ResBlock 1 the input and output channel dimensions differ, so a $1 \times 1$ convolution projects the input to 128 channels; ResBlock 2 takes 128 channels as both input and output, so its skip connection reduces to an identity mapping. The residual backbone is followed by global average pooling (GAP) [34], layer normalization (LN) [35], Dropout, and a fully connected classification layer that outputs the prediction. The overall architecture and the internal expansion of a DMSC Block are illustrated in Figure 1.
Within each DMSC Block, DMSB precedes CTA in a cascaded arrangement: the local features extracted by the convolutional layers form the basis for the queries and keys of attention computation, ensuring that attention operates on discriminative intermediate representations rather than on low-level raw inputs. Global average pooling compresses the feature map into a 128-dimensional vector, eliminating positional bias along the temporal dimension—a property particularly important for sensor data in which the same activity does not start at a fixed position.
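The position-insensitivity of global average pooling can be checked on a toy example. The sketch below is illustrative only and is not the authors' implementation: the 8-step, 2-channel window and the circular shift are assumptions chosen so that the arithmetic is exact, whereas a real activity shift inside a sensor window is not circular.

```python
def global_average_pool(feature_map):
    """Average a (T x C) feature map over time, returning a length-C vector."""
    steps = len(feature_map)
    channels = len(feature_map[0])
    return [sum(row[c] for row in feature_map) / steps for c in range(channels)]


def circular_shift(feature_map, offset):
    """Rotate the time axis by `offset` steps (idealized shift of the activity)."""
    offset %= len(feature_map)
    return feature_map[-offset:] + feature_map[:-offset]


# Toy window (hypothetical values): a short burst at steps 2-3, flat elsewhere.
window = [[4.0, 3.0] if t in (2, 3) else [0.0, 1.0] for t in range(8)]

pooled = global_average_pool(window)
pooled_shifted = global_average_pool(circular_shift(window, 3))
print(pooled, pooled_shifted)
```

Both calls return the same pooled vector, because averaging over time discards where in the window the burst occurred.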
3.2. Dilated Multi-Scale Branch Block (DMSB)
InceptionTime [7] partially mitigates the limitation of a single receptive field through multi-branch parallel convolutions, but its branches operate directly on the raw input, which reduces parameter efficiency when the input channel count is large (for example, the 21 channels of RealWorld). DMSB extends this design with a shared bottleneck layer [33] and dilated convolutions [36], further widening multi-scale receptive field coverage under a controlled parameter budget. DMSB consists of a shared bottleneck layer, three parallel dilated convolution branches, a pooling bypass, and SE-based channel recalibration [11]. Its internal structure is shown on the right of Figure 1.
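Bottleneck layer: Let $H \in \mathbb{R}^{T \times C_{\text{in}}}$ denote the input tensor of DMSB. A shared convolution first compresses it to 32 channels, yielding $B \in \mathbb{R}^{T \times 32}$. This operation unifies the branch inputs to 32 channels, decoupling the branches' parameter count from the input channel count $C_{\text{in}}$.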
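The decoupling can be illustrated with a simple weight count. This is a sketch under stated assumptions rather than the published configuration: the bottleneck is taken to be a pointwise (kernel size 1) convolution, biases are ignored, and `KERNEL_SIZE = 5` is a hypothetical branch kernel used only for illustration. The 32 bottleneck channels and the 32 output channels per branch come from the text.

```python
def conv1d_weights(in_channels, out_channels, kernel_size):
    """Number of weights in a bias-free 1D convolution."""
    return in_channels * out_channels * kernel_size


BOTTLENECK_CHANNELS = 32  # from the text
BRANCH_OUT_CHANNELS = 32  # from the text: four branches x 32 = 128 channels
KERNEL_SIZE = 5           # hypothetical branch kernel size (assumption)


def bottleneck_weights(c_in):
    # Assumption: the shared bottleneck is a pointwise convolution.
    return conv1d_weights(c_in, BOTTLENECK_CHANNELS, 1)


def branch_weights(kernel_size):
    # Each branch reads the 32-channel bottleneck output, never the raw input,
    # so this count does not depend on the input channel count.
    return conv1d_weights(BOTTLENECK_CHANNELS, BRANCH_OUT_CHANNELS, kernel_size)


for c_in in (21, 128):
    print(c_in, bottleneck_weights(c_in), branch_weights(KERNEL_SIZE))
```

Only the bottleneck term grows with the input channel count; the per-branch term stays fixed for any input width.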
Dilated convolution branches: Three parallel branches apply 1D convolutions to $B$, each with its own kernel size $k$ and dilation rate $d$. Because a dilated convolution spans $(k-1)d + 1$ steps, the three settings correspond to receptive fields of 7, 9, and 9 steps. These settings follow the typical temporal structure of human activities [36]. Taking the 50 Hz sampling rate of UCI-HAR as an example, one step corresponds to 20 ms. A receptive field of 7 then covers about 140 ms, capturing short-range transient events such as a single heel-strike impulse within the gait cycle. A receptive field of 9 covers about 180 ms, corresponding to the transition interval between adjacent limb-motion segments. Different $k$–$d$ combinations maintain similar receptive fields while concentrating each branch's convolutional weights on distinct periodic components in the frequency domain, avoiding feature redundancy across branches. Although a single DMSB has a limited receptive field, four stacked DMSC Blocks accumulate a receptive field of approximately 57 steps along the mixed-path route, covering about 44% (57/128) of the 128-step input window. This allows the model to capture long-range activity patterns, such as sit-to-stand transitions that unfold over several seconds.
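The receptive-field arithmetic above can be reproduced in a few lines. In the sketch below, the `(k, d)` pairs in `HYPOTHETICAL_BRANCHES` are an assumption: they are one of several combinations that yield receptive fields of 7, 9, and 9 steps and are not confirmed as the settings used in DMSB. The 50 Hz rate, the 57-step accumulated field, and the 128-step window are taken from the text.

```python
def receptive_field(kernel_size, dilation):
    """Steps spanned by one dilated 1D convolution: (k - 1) * d + 1."""
    return (kernel_size - 1) * dilation + 1


def coverage_ms(steps, sampling_rate_hz):
    """Nominal duration of `steps` samples, counting 1000 / rate ms per step."""
    return steps * 1000.0 / sampling_rate_hz


# Hypothetical (kernel size, dilation) pairs, chosen only because they
# reproduce the receptive fields quoted in the text.
HYPOTHETICAL_BRANCHES = [(7, 1), (5, 2), (3, 4)]

fields = [receptive_field(k, d) for k, d in HYPOTHETICAL_BRANCHES]
durations = [coverage_ms(rf, 50) for rf in fields]  # UCI-HAR: 50 Hz
window_fraction = 57 / 128  # accumulated field over the input window

print(fields, durations, round(window_fraction, 3))
```

Under these assumptions, a small kernel with a large dilation and a large kernel with no dilation reach similar spans, which is the property the text relies on when it describes branches with similar receptive fields but different sampling patterns.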
Pooling branch: An additional branch bypasses the bottleneck layer and operates directly on the raw input $H$, applying max pooling followed by a convolution that projects to 32 channels. Max pooling retains the positional information of signal peaks, which helps capture impulse-type features such as elevator vibrations in USC-HAD, and it complements the convolution branches, whose output is smoothed by the bottleneck.
Feature fusion and SE channel attention: The outputs of the four branches are concatenated along the channel dimension to form a 128-channel feature map. The SE module [11] generates a 128-dimensional channel descriptor via global average pooling, produces a 128-dimensional channel weight vector through a two-layer fully connected network with a reduction ratio of 16, and applies channel-wise scaling to the concatenated feature map. Because the discriminative contribution of the concatenated channels varies substantially, this recalibration dynamically suppresses low-discriminability channels without changing the feature dimension. The recalibrated features pass through Batch Normalization and ReLU activation, yielding the DMSB output $M \in \mathbb{R}^{T \times 128}$.
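The squeeze-and-excitation step can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the ReLU and sigmoid activations follow the standard SE design and are an assumption here, biases are omitted, and the weights are random placeholders rather than trained values. The 128 channels and the reduction ratio of 16 (hidden width 8) come from the text.

```python
import math
import random


def se_recalibrate(feature_map, w1, w2):
    """Scale each channel of a (T x C) map by a learned weight in (0, 1).

    w1 has shape C x (C // r) and w2 has shape (C // r) x C; biases are omitted.
    """
    steps, channels = len(feature_map), len(feature_map[0])
    hidden_width = len(w1[0])
    # Squeeze: global average pooling over time gives one value per channel.
    descriptor = [sum(row[c] for row in feature_map) / steps for c in range(channels)]
    # Excitation: FC -> ReLU -> FC -> sigmoid (standard SE choice, assumed here).
    hidden = [max(0.0, sum(descriptor[c] * w1[c][j] for c in range(channels)))
              for j in range(hidden_width)]
    weights = [1.0 / (1.0 + math.exp(-sum(hidden[j] * w2[j][c] for j in range(hidden_width))))
               for c in range(channels)]
    # Scale: channel-wise multiplication leaves the feature shape unchanged.
    scaled = [[row[c] * weights[c] for c in range(channels)] for row in feature_map]
    return scaled, weights


random.seed(0)
T, C, R = 16, 128, 16  # T is a hypothetical toy length; C and R follow the text
x = [[random.gauss(0.0, 1.0) for _ in range(C)] for _ in range(T)]
w1 = [[random.gauss(0.0, 0.1) for _ in range(C // R)] for _ in range(C)]
w2 = [[random.gauss(0.0, 0.1) for _ in range(C)] for _ in range(C // R)]

y, channel_weights = se_recalibrate(x, w1, w2)
print(len(y), len(y[0]), len(w1[0]))
```

The output keeps the 128-channel shape of the input, and each channel is attenuated by its own weight.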
3.3. Contrastive Temporal Attention (CTA) Module
Standard softmax attention [21] assigns a single set of weights over all time steps. When sensor data contain baseline drift and inter-subject variability, background responses in the attention matrix cannot be automatically suppressed, and time steps unrelated to activity discrimination still receive non-negligible attention weights. Motivated by differential attention [32], CTA adopts a dual query–key–value framework and performs a weighted subtraction between two attention paths. The subtraction cancels common-mode background responses jointly captured by the two paths, while amplifying discriminative differential-mode signals. CTA further introduces a depth-adaptive contrastive strength together with a learnable dynamic update mechanism: shallow CTA layers apply a mild subtraction to preserve feature transformation capacity, while deep CTA layers apply a stronger subtraction to refine attention allocation and use output scaling to avoid over-perturbing already stabilized high-level features.
Notation: Let the input sequence to CTA be $M \in \mathbb{R}^{T \times d_{\text{model}}}$, the output of DMSB, with $T$ as the sequence length and $d_{\text{model}} = 128$. The number of heads is $h$. Because the query and key are split into two paths (each containing $h$ heads), the per-head dimension is $d = d_{\text{model}} / (2h)$. The value is split into $h$ heads of dimension $2d = d_{\text{model}} / h$, preserving the input–output dimensionality.
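Dual attention computation: $M$ is linearly projected to obtain the query $Q$ and the key $K$, which are then split along the head dimension into two groups: $(Q_1, Q_2)$ and $(K_1, K_2)$. The value $V$ is split as specified in the Notation. The two attention matrices are computed as
$$A_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right), \quad i \in \{1, 2\},$$
where $d$ is the per-head dimension of the queries and keys.
The core output of CTA is the weighted contrast of the two attention matrices:
$$O = (A_1 - \lambda A_2)\, V,$$
where $\lambda$ is the contrastive strength.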
The two projection weights are independently initialized and jointly trained, naturally differentiating during training into overlapping but non-identical attention patterns: each path locks onto a set of discriminative positions in the sequence, while both produce a certain level of low-level response over non-discriminative regions. The subtraction cancels the low-level common-mode responses shared by the two paths. The discriminative positions that each path locks onto are preserved with opposite signs, yielding a signed bipolar encoding of the sequence. This mechanism is driven entirely by end-to-end training and does not rely on external noise annotations.
Intuitively, the two attention paths can be understood as two observers that both respond to the persistent background of the sequence (the baseline drift, inter-subject offsets, and device-orientation effects that span all time steps) but lock onto slightly different discriminative moments. Because the background is shared by both paths, it appears with nearly the same magnitude in the two attention matrices and is largely cancelled by the weighted subtraction. The discriminative moments each path attends to are not shared, so they survive the subtraction: positions emphasized by the first path remain positive, and those emphasized by the second path become negative. The result is a signed bipolar map in which background is suppressed toward zero and the two complementary sets of discriminative positions are separated by sign. This is why subtracting two attention paths suppresses common-mode background while preserving, and even sharpening, activity-related information. A schematic of this process is shown in Figure 2.
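The cancellation can be made concrete on a single attention row. The following sketch is a toy illustration and not the authors' implementation: the 8-step score vectors, the peak positions, and the function name `contrast` are hypothetical, and the strength 0.56 is borrowed from the deepest-layer initial value quoted in this section.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = math.fsum(exps)
    return [e / total for e in exps]


def contrast(scores_1, scores_2, lam):
    """Weighted subtraction of two attention rows: a1 - lam * a2."""
    a1, a2 = softmax(scores_1), softmax(scores_2)
    return [p - lam * q for p, q in zip(a1, a2)]


# Both paths share the same flat background score (common mode) ...
background = [0.5] * 8
scores_1 = list(background)
scores_2 = list(background)
# ... but each path locks onto a different discriminative time step.
scores_1[2] += 3.0
scores_2[5] += 3.0

row_1 = softmax(scores_1)
bipolar = contrast(scores_1, scores_2, lam=0.56)
print([round(v, 3) for v in bipolar])
```

In this toy case, the background entries are identical in both rows, so the subtraction scales them by the factor $1 - \lambda$. The step preferred by the first path stays positive and the step preferred by the second path turns negative.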
Depth-adaptive contrastive strength $\lambda_{\text{init}}$: $\lambda_{\text{init}}$ is designed under two principles. Shallow features are not yet fully abstracted and carry high information density. At shallow depth, therefore, the subtraction should remain mild so that CTA retains strong feature transformation capacity and does not disrupt useful low-level patterns. Deep features are already highly abstracted and can tolerate a more aggressive subtraction that refines the attention allocation. At the same time, the overall output magnitude should converge moderately, so that stabilized representations are not over-perturbed. Based on these priors, $\lambda_{\text{init}}$ takes a monotonically increasing exponential form with respect to the depth index $l$:
$$\lambda_{\text{init}}(l) = 0.8 - 0.6\,\exp\!\big(-0.3\,(l - 1)\big).$$
This function takes the value 0.20 at $l = 1$ and 0.56 at $l = 4$, satisfying the prior constraint of smaller values at shallow depth and larger values at deeper depth. At shallow depth, $\lambda_{\text{init}}$ is small, the subtraction is mild, and the output scaling factor $1 - \lambda_{\text{init}}$ is large (0.80 at $l = 1$), so CTA contributes substantially to feature transformation. At deeper depth, $\lambda_{\text{init}}$ is large and the subtraction is more aggressive, but the output magnitude after $(1 - \lambda_{\text{init}})$ scaling is small (0.44 at $l = 4$), avoiding over-modification of stabilized high-level features. The three constants $(0.8, 0.6, 0.3)$ control the upper bound, decay amplitude, and decay rate of the function, respectively. The effect of these constants on performance is small; a sensitivity analysis reported in Section 4.5 confirms that perturbing each of them leaves the F1-score essentially unchanged.
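The quoted values can be checked numerically. The sketch below assumes the constants 0.8, 0.6, and 0.3 and a 1-based depth index over the four DMSC Blocks; these follow the schedule used in differential attention [32] and reproduce the values 0.20, 0.56, 0.80, and 0.44 stated above. The function and argument names are illustrative.

```python
import math


def lambda_init(depth, upper=0.8, amplitude=0.6, rate=0.3):
    """Initial contrastive strength for a 1-based depth index (assumed form)."""
    return upper - amplitude * math.exp(-rate * (depth - 1))


for depth in range(1, 5):
    lam = lambda_init(depth)
    # The output scaling factor is the complement of the initial strength.
    print(f"depth {depth}: lambda_init = {lam:.2f}, output scale = {1 - lam:.2f}")
```

The schedule rises monotonically toward the upper bound, so the subtraction strengthens with depth while the output scale shrinks.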
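The contrastive strength $\lambda$ uses $\lambda_{\text{init}}$ as its initial bias and is computed in the forward pass as
$$\lambda = \exp(\lambda_{q_1} \cdot \lambda_{k_1}) - \exp(\lambda_{q_2} \cdot \lambda_{k_2}) + \lambda_{\text{init}},$$
where $\lambda_{q_1}, \lambda_{k_1}, \lambda_{q_2}, \lambda_{k_2}$ are independent learnable parameter vectors, initialized from a zero-mean Gaussian distribution and kept separate from the query/key matrices, dedicated to modulating the contrastive strength. $\lambda_{\text{init}}$ serves only as the initial training bias; the final value of $\lambda$ is driven by the learnable term $\exp(\lambda_{q_1} \cdot \lambda_{k_1}) - \exp(\lambda_{q_2} \cdot \lambda_{k_2})$ and converges in a data-driven manner. The $\lambda$ parameters of each DMSC Block are mutually independent and converge during training to contrastive strengths that match the feature abstraction level of that layer. This yields differentiated modulation across depths at the data-driven level, rather than sharing a single $\lambda$ across all layers.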
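How the learnable term perturbs the initial bias can be sketched as follows. The re-parameterization used here, a difference of two exponentiated dot products added to the initial value, follows differential attention [32] and is an assumption about the exact form used in CTA. The vector length of 16, the standard deviation of 0.1, and the helper names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_strength(lq1, lk1, lq2, lk2, lam_init):
    """lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lam_init (assumed form)."""
    return math.exp(dot(lq1, lk1)) - math.exp(dot(lq2, lk2)) + lam_init


def gaussian_vector(size, std, rng):
    return [rng.gauss(0.0, std) for _ in range(size)]


rng = random.Random(0)
size = 16  # hypothetical vector length
params = [gaussian_vector(size, 0.1, rng) for _ in range(4)]

# With all-zero vectors the two exponentials cancel and lambda equals the bias.
zeros = [0.0] * size
at_zero = contrastive_strength(zeros, zeros, zeros, zeros, lam_init=0.2)
# With small zero-mean Gaussian vectors lambda starts close to the bias.
at_start = contrastive_strength(*params, lam_init=0.2)
print(at_zero, round(at_start, 3))
```

Because the two exponentials nearly cancel at initialization, each block starts near its depth-dependent bias and then moves away from it only as training updates the four vectors.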
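CTA sublayer output: The core CTA output is passed through RMSNorm [37], multiplied by the scaling factor $(1 - \lambda_{\text{init}})$, and added to the residual to form the complete CTA sublayer output $Z$:
$$Z = M + (1 - \lambda_{\text{init}})\,\mathrm{RMSNorm}(O),$$
where $M$ is the sublayer input, $O$ is the core CTA output, and $\lambda_{\text{init}}$ is the depth-dependent initial contrastive strength.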
The scaling factor decreases with depth, so the magnitude of the CTA sublayer output declines progressively at deeper layers. This stabilizes gradient propagation while preventing deep attention from over-perturbing mature feature representations.
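The effect of the depth-dependent scaling on the update magnitude can be sketched as below. This is an illustrative sketch, not the authors' code: the RMSNorm here has no learnable gain, the toy core output is hypothetical, and the residual is set to zero only so that the size of the attention update is visible in isolation. The values 0.20 and 0.56 are the initial strengths quoted for the shallowest and deepest layers.

```python
import math


def rms_norm(vector, eps=1e-6):
    """Divide a feature vector by its root mean square (no learnable gain)."""
    rms = math.sqrt(sum(v * v for v in vector) / len(vector) + eps)
    return [v / rms for v in vector]


def cta_sublayer(residual, core_output, lam_init):
    """Per time step: residual + (1 - lam_init) * RMSNorm(core output)."""
    scale = 1.0 - lam_init
    return [[r + scale * n for r, n in zip(res_row, rms_norm(core_row))]
            for res_row, core_row in zip(residual, core_output)]


def row_rms(row):
    return math.sqrt(sum(v * v for v in row) / len(row))


# Toy core output (hypothetical values): two time steps, four features.
core = [[3.0, -1.0, 2.0, 0.5], [0.1, 0.2, -0.3, 0.4]]
zeros = [[0.0] * 4 for _ in core]

shallow = cta_sublayer(zeros, core, lam_init=0.20)
deep = cta_sublayer(zeros, core, lam_init=0.56)
print(round(row_rms(shallow[0]), 3), round(row_rms(deep[0]), 3))
```

After normalization every time step has unit RMS, so the size of the update added to the residual is set by the scaling factor alone, regardless of how large the raw attention output was.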
The CTA sublayer is followed by a standard Pre-Norm FFN sublayer [38]. This sublayer first applies LayerNorm, then a linear transformation expanding the feature dimension from 128 to 256, followed by GELU activation [39] and Dropout. A second linear transformation projects the dimension back from 256 to 128, followed by a final Dropout, and the result is added to the input through a residual connection. The $2\times$ expansion of the intermediate dimension provides sufficient expressive capacity for the non-linear transformation. CTA does not apply a causal mask, so every time step in the sequence can attend to the global context, consistent with the bidirectional dependency of signals within an HAR window. The overall topology of a DMSC Block places DMSB first, followed by the CTA sublayer and the FFN sublayer in sequence, with residual connections spanning the entire Block. The complete structure is shown in Figure 1.