DMSCNet: A Dilated Multi-Scale Contrastive Attention Network for Sensor-Based Human Activity Recognition

Wu, Qingshan; Chu, Shengguang; Li, Kewen; Wang, Liechong

doi:10.3390/app16126037


by Qingshan Wu, Shengguang Chu ^*, Kewen Li and Liechong Wang

Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China

^* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 6037; https://doi.org/10.3390/app16126037 (registering DOI)

Submission received: 23 April 2026 / Revised: 26 May 2026 / Accepted: 10 June 2026 / Published: 15 June 2026

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

Wearable-sensor human activity recognition (HAR) plays a key role in health monitoring, elderly care, and human–computer interaction. Deep learning dominates the field, but two limitations remain. CNNs with fixed kernels cannot capture cross-scale temporal events such as gait cycles and postural transitions in a single layer, and softmax attention on small sensor datasets is often diluted by common-mode background responses across the sequence. We propose DMSCNet, an end-to-end framework with two modules. The Dilated Multi-Scale Branch Block (DMSB) combines a shared bottleneck, parallel dilated convolutions, a pooling bypass, and SE-based channel recalibration to widen the temporal receptive field under a controlled parameter budget. The Contrastive Temporal Attention (CTA) module adopts a dual-path differential design, in which the two paths learn overlapping but non-identical attention patterns and their subtraction suppresses shared low-level responses while preserving the discriminative positions each path locks onto, encoded with opposite signs. DMSB and CTA are cascaded into a DMSC Block and stacked residually. On UCI-HAR, USC-HAD, and RealWorld, DMSCNet reaches F1-scores of 97.65%, 91.80%, and 99.05%, outperforming nine baselines. Ablations confirm that SE acts along the channel axis and CTA along the temporal axis, and visualization reveals a dynamic–static dichotomy together with a signed bipolar encoding pattern produced by the dual-path subtraction.

Keywords: human activity recognition; wearable sensors; time series classification; dilated convolution; multi-scale feature extraction; attention mechanism

1. Introduction

Human activity recognition (HAR) aims to automatically identify ongoing actions or behaviors from time-series signals acquired by sensors. As a key technology connecting the physical world with intelligent systems, HAR plays a foundational role in applications including health monitoring, fall detection and rehabilitation for the elderly, smart homes, sports analytics, and natural human–computer interaction [1,2]. The proliferation of the Internet of Things and wearable devices has made continuous activity data collection via personal devices feasible, further broadening the practical scope of HAR.

Among various sensing modalities, HAR based on wearable inertial sensors (accelerometers, gyroscopes, magnetometers, etc.) offers notable advantages over vision-based approaches: it does not rely on external cameras; is unaffected by lighting, viewpoint, or occlusion; better preserves user privacy; and supports all-day, low-power continuous monitoring [3]. With standardized integration of multi-axis inertial sensors into smartphones and smartwatches, the barrier to data acquisition has continued to fall [4]. Research on wearable-sensor HAR has accumulated a substantial body of work over the past decade, evolving from handcrafted feature classification to end-to-end deep learning methods based on CNNs, RNNs, and their hybrid architectures [5].

Early sensor-based HAR relied on handcrafted features with classical classifiers, whose dependence on domain expertise limited cross-scenario generalization. With the rise of deep learning, CNNs became the dominant backbone for HAR, owing to their local inductive bias and robustness on small datasets. Existing deep methods nevertheless still face two structural limitations.

At the convolutional feature extraction level, CNNs with fixed kernel sizes [6] struggle to jointly model cross-scale temporal events such as gait cycles (approximately 0.5–1.0 s) and postural transitions (several seconds) within a single layer. Existing multi-scale methods [7,8] and dilated-convolution-based temporal methods [9,10] partially alleviate this issue through parallel convolutional branches, yet these branches typically operate directly on the raw input, which yields low parameter efficiency when the number of sensor channels is large.

At the attention modeling level, channel attention [11,12] and multivariate temporal attention [13,14] have been introduced to HAR to enhance discriminative capability. The standard softmax self-attention used in these methods is easily diluted on small sensor datasets by common-mode background responses arising from inter-subject variability, baseline drift, and device orientation changes. The attention weights of discriminative time steps are obscured by broadly distributed noise, making it difficult to highlight the key segments on which activity recognition truly depends.

To address these two issues, we propose DMSCNet (Dilated Multi-Scale Contrastive Attention Network), an end-to-end deep learning framework for wearable-sensor HAR. Its core design centers on two complementary modules. The Dilated Multi-Scale Branch Block (DMSB) compresses the input channel dimension through a shared bottleneck layer, extracts cross-scale temporal features in parallel using multiple dilated convolution branches with different receptive fields, preserves pre-bottleneck information through a pooling bypass, and performs dynamic recalibration of discriminative channels via SE channel attention. This module strikes a balance between parameter efficiency and receptive field coverage. The Contrastive Temporal Attention (CTA) module adopts a dual-path differential structure. By suppressing common-mode background responses jointly captured by the two paths and amplifying discriminative differential-mode signals, it addresses the tendency of standard softmax attention to be diluted by noise on small sensor datasets. DMSB and CTA are cascaded into a DMSC Block, and stacked DMSC Blocks form the complete DMSCNet network via residual connections.

This dual-path design distinguishes CTA from existing HAR attention: unlike channel attention (SE, ECA) that recalibrates along the channel axis, or single-path softmax attention whose single weight distribution is easily diluted by common-mode background, CTA contrasts two temporal paths so that their subtraction cancels shared background and preserves each path’s discriminative positions as a signed bipolar encoding. Originally proposed for autoregressive language models, dual-path differential attention is, to our knowledge, applied here to sensor-based HAR for the first time.

The main contributions of this paper are summarized as follows:

(1): We propose DMSCNet, an end-to-end deep learning framework for wearable-sensor HAR. The framework unifies dilated multi-scale convolution and Contrastive Temporal Attention within the cascaded structure of a DMSC Block, enabling simultaneous improvements at both the feature extraction and temporal selection levels.
(2): We design the DMSB module, which combines a shared bottleneck, multi-branch dilated convolutions, a pooling bypass, and SE-based channel recalibration to systematically extend multi-scale receptive field coverage under a controlled parameter budget, providing a unified feature extraction solution for cross-scale temporal events in HAR.
(3): We design the CTA module, which introduces a dual-path differential attention mechanism to sensor-based time-series classification. CTA improves upon standard self-attention by suppressing common-mode background responses and amplifying discriminative differential-mode signals, and we evaluate it on three public HAR datasets.

The remainder of this paper is organized as follows. Section 2 reviews work related to DMSCNet. Section 3 presents the overall architecture of DMSCNet and the design of its core modules. Section 4 reports detailed experimental results, ablation analyses, and visualization analyses on three public HAR datasets. Section 5 concludes the paper.

2. Related Work

This section reviews the work most relevant to DMSCNet. We first summarize the evolution of CNN-based HAR, covering classical convolutional architectures as well as their multi-scale and dilated extensions (Section 2.1). We then examine attention mechanisms introduced into HAR and identify the core limitation of single softmax attention on small-scale sensor data (Section 2.2). Finally, we review recent advances in differential attention and analyze its applicability to sensor-based HAR (Section 2.3).

2.1. CNN-Based HAR

Wearable-sensor human activity recognition has undergone a paradigm shift from handcrafted features to deep learning. Several systematic surveys published in recent years [15,16,17,18] have traced this transition. Early studies relied mainly on handcrafted time-domain, frequency-domain, and statistical features, paired with classical classifiers such as support vector machines, random forests, and naive Bayes for activity recognition [1,2]; Anguita et al. achieved a robust baseline on UCI-HAR using a multiclass SVM [3]. Handcrafted feature design, however, depends heavily on domain expertise and performs poorly in cross-subject and cross-placement generalization. With the success of deep learning in image and sequence modeling, end-to-end feature learning gradually became the mainstream in HAR. Ronao and Cho applied one-dimensional convolutional neural networks to smartphone sensor signals, offering the first systematic demonstration that deep CNNs outperform handcrafted features [4]. Yang et al. proposed a multi-channel temporal CNN that further treats multi-axis sensor channels as independent input channels under a unified model [19]. These early studies established CNNs as the standard end-to-end feature extractor for HAR.

Building on this foundation, general CNN architectures for time-series classification were transferred to HAR. Wang et al. proposed the fully convolutional network (FCN) and the deep residual network (ResNet) as two classical baselines for time-series classification [6]: the former uses a simple structure of three 1D convolutional layers followed by global average pooling to map raw sensor sequences directly to class space; the latter adds residual connections on top of FCN to mitigate vanishing gradients in deep networks. DeepSense [20] proposed a unified convolutional–recurrent framework for multi-modal sensor data, demonstrating the potential of end-to-end deep learning for fusing heterogeneous inputs such as accelerometer and gyroscope signals. These works collectively drove the transition from handcrafted features to end-to-end learning and laid the groundwork for later multi-scale and attention-based methods.

To overcome the receptive field limitation of a single kernel size, multi-scale convolution and temporal dilated convolution have emerged as two important directions. InceptionTime [7] transferred the Inception module from computer vision to 1D time-series classification, extracting complementary temporal features from the same input through parallel kernels of different sizes, and established a strong baseline on UCR time-series benchmarks. InnoHAR [8] combined Inception modules with GRU units, unifying multi-scale convolution and temporal modeling within a single HAR framework for the first time. The other direction originates from the temporal convolutional network (TCN) [9], which uses dilated convolutions to achieve exponential receptive field growth under linear parameter cost, offering a more efficient alternative to recurrent networks for long-range temporal modeling. TCN-Inception [10] further combines the dilated convolutions of TCN with the multi-scale structure of Inception, and is among the most representative multi-scale temporal convolution methods in current HAR research.

Beyond pure convolutional methods, CNN–RNN hybrids have been extensively explored to jointly model spatial correlation and long-range temporal dependency. DeepConvLSTM [5] stacks LSTM layers on top of convolutional feature extraction and is a classical hybrid model in HAR, adopted in many follow-up works. HAR tasks, however, typically use fixed-length sliding windows of around 128 time steps as input, which is much shorter than sequences in natural language or video classification. On such short sequences, the long-range dependency advantage of LSTM/GRU has little room to take effect, and the sequential recurrent structure further imposes costs on training efficiency and parallelism. Recent HAR research has accordingly been shifting back from recurrent structures to fully convolutional or attention-based designs.

Two structural limitations remain in existing CNN-based methods for HAR. First, convolutional layers dominated by fixed kernel sizes struggle to jointly model cross-scale temporal events such as gait cycles (approximately 0.5–1.0 s) and postural transitions (several seconds) within the same layer. Second, the branches of existing multi-scale methods typically operate directly on the raw input, yielding low parameter efficiency when the number of sensor channels is large. These limitations motivate introducing dilated convolutions and parameter sharing into multi-scale convolutional frameworks.

2.2. Attention Mechanisms for HAR

Adding attention on top of a CNN backbone has become a dominant technique for strengthening discriminative capability in HAR. The Transformer architecture proposed by Vaswani et al. [21], centered on self-attention, achieved a breakthrough in natural language processing and subsequently drove the widespread adoption of attention in vision and temporal modeling. Channel attention methods dynamically recalibrate multi-channel feature maps to amplify discriminative channels and suppress redundant ones, a natural fit for multi-sensor HAR inputs. SE networks [11] were first proposed for vision tasks and learn channel weights through a global average pooling step followed by a fully connected bottleneck; SE is the most widely used channel attention module. CBAM [22] adds a spatial attention branch on top of SE, forming a two-dimensional attention along both channel and spatial axes. ECA-Net [12] replaces the fully connected bottleneck with a 1D convolution, substantially reducing parameter count while maintaining performance. These modules are easy to embed into any CNN backbone and are used in HAR to handle weight differences among heterogeneous sensor channels such as accelerometers, gyroscopes, and magnetometers. Their scope, however, is limited to the channel axis and does not address temporal discrimination directly.

Attention-based methods designed for HAR introduce more explicit attention structures at the temporal or multivariate level. MuVAN [13] designs a multi-view attention mechanism for multivariate time series and replaces convolutional feature extraction with a pure attention architecture; without a strong local inductive bias, however, this approach has limited discriminative learning capacity on small-scale sensor data and tends to underfit in fine-grained tasks. Mahmud et al. introduced self-attention into wearable-sensor HAR and explored the feasibility of a pure attention architecture for the task [23]. Li et al. proposed a two-stream convolution-augmented Transformer that combines the local inductive bias of convolutions with the global modeling of self-attention [14], and Dirgová Luptáková et al. systematically evaluated Transformer architectures for wearable HAR [24]. Recent work extends the combination of attention and CNNs in several directions. Suh et al. proposed TASKED, which couples a Transformer with adversarial learning and self-knowledge distillation for cross-subject generalization [25]. Wang et al. proposed an attention-based multi-feature extraction framework that models correlations across feature subspaces [26]. Mekruksavanich and Jitpattanakul proposed Att-ResBiGRU, combining residual structures with attention for position-independent HAR [27]. Among hybrid architectures, CNN-A-BiLSTM [28] introduces attention into a CNN + BiLSTM framework to improve robustness under noisy data; TECA-HAR [29], a recent method, couples ECA channel attention with a CNN–LSTM hybrid and performs consistently across several HAR benchmarks. Taken together, these works indicate that combining attention with a CNN backbone is more effective for HAR than pure attention architectures, because HAR datasets are relatively small and the local inductive bias of convolutions provides indispensable regularization.

Despite their structural differences, these attention methods all rely on a single softmax attention: one shared weight distribution over all time steps. In sensor-based HAR, inter-subject variability, baseline drift, and device orientation changes produce background responses that span the entire sequence. Under limited training data, standard softmax attention cannot distinguish these common-mode backgrounds from discriminative temporal patterns, and the attention weights of key time steps are diluted by background responses. How to suppress common-mode backgrounds and highlight discriminative differential-mode signals within the attention mechanism itself remains an open problem.

2.3. Differential Attention

A large body of work has accumulated around improvements to standard self-attention. One representative direction targets computational efficiency: Sparse Transformer [30] restricts attention to sparse connection patterns and reduces complexity for long sequences, while Linear Attention [31] rewrites self-attention in a linearized form through kernel methods, scaling computation linearly with sequence length. These methods target the efficiency of long sequences and do not address the dilution of attention weights by common-mode responses on small-scale sensor data.

Differential Transformer [32], proposed by Ye et al., addresses this problem differently. Its core idea is to split standard self-attention into two independent paths and take their difference as the final output: the two paths learn overlapping but non-identical attention patterns from similar inputs, and the subtraction cancels the background responses shared by both paths while preserving the discriminative differential-mode signals. Experiments show that this mechanism alleviates the attention dilution from irrelevant context in large language models and delivers consistent improvements on long-text understanding and hallucination reduction.

Differential attention was originally proposed for autoregressive large language models and was validated on long-context, causally masked text sequences. Applications of differential attention to sensor-based HAR remain very limited. HAR and NLP scenarios differ in several key respects: sequences are shorter (typically around 128 time steps), channels are more numerous and multi-modal, no causal structure is required (bidirectional attention is applicable), and the training data volume is orders of magnitude smaller than LLM pretraining corpora. Adapting the dual-path contrast of differential attention to the bidirectional short-sequence setting of sensor-based HAR is a promising direction.

CTA differs from existing HAR attention mechanisms in three ways. First, channel attention such as SE and ECA reweights features along the channel axis, while CTA works along the temporal axis to select discriminative time steps, so the two are complementary and DMSCNet uses both. Second, standard single-path softmax attention produces one non-negative weight distribution that cannot separate common-mode background from discriminative patterns, whereas CTA contrasts two paths whose subtraction cancels the shared component and amplifies the differential one. Third, the differential attention of Differential Transformer was designed for long, causally masked language sequences, while HAR windows are short, bidirectional, and multi-channel, and HAR datasets are far smaller. CTA adapts the dual-path contrast to this setting and adds a depth-adaptive contrastive strength suited to shallow HAR stacks.

3. Methods

This section presents the DMSCNet framework in three parts. Section 3.1 describes the overall architecture and the data flow through the network. Section 3.2 details the DMSB module, which addresses the limited receptive field and low parameter efficiency of existing multi-scale CNNs. Section 3.3 details the CTA module, which suppresses the common-mode background responses that dilute standard softmax attention on small-scale sensor data.

3.1. Overall Architecture

We now present the two modules that address the limitations discussed above. DMSB targets the limited receptive field and parameter inefficiency of multi-scale CNNs; CTA targets the dilution of softmax attention by common-mode background responses. The two modules are cascaded into a DMSC Block and stacked via residual connections [33], forming the complete DMSCNet framework.

Given a sensor sequence $X \in \mathbb{R}^{T \times C}$ of length $T$ with $C$ channels, DMSCNet maps it to an activity class probability distribution $\hat{y} \in \mathbb{R}^{K}$, where $K$ is the number of classes. The backbone consists of two cascaded residual blocks (ResBlocks) [33], each containing two cascaded DMSC Blocks with a skip connection. ResBlock 1 uses a $1 \times 1$ convolution to project the input to 128 channels, as its input and output channel dimensions differ; ResBlock 2 takes 128 channels as both input and output, and its skip connection reduces to an identity mapping. The residual backbone is followed by global average pooling (GAP) [34], layer normalization (LN) [35], Dropout, and a fully connected classification layer that outputs the prediction. The overall architecture and the internal expansion of a DMSC Block are illustrated in Figure 1.

Within each DMSC Block, DMSB precedes CTA in a cascaded arrangement: the local features extracted by the convolutional layers form the basis for the queries and keys of attention computation, ensuring that attention operates on discriminative intermediate representations rather than on low-level raw inputs. Global average pooling compresses the $T \times 128$ feature map into a 128-dimensional vector, eliminating positional bias along the temporal dimension. This property is particularly important for sensor data in which the same activity does not start at a fixed position.

3.2. Dilated Multi-Scale Branch Block (DMSB)

InceptionTime [7] partially mitigates the limitation of a single receptive field through multi-branch parallel convolutions, but its branches operate directly on the raw input, which reduces parameter efficiency when the input channel count is large (for example, the 21 channels of RealWorld). DMSB extends this design with a shared bottleneck layer [33] and dilated convolutions [36], further widening multi-scale receptive field coverage under a controlled parameter budget. DMSB consists of a shared bottleneck layer, three parallel dilated convolution branches, a pooling bypass, and SE-based channel recalibration [11]. Its internal structure is shown on the right of Figure 1.

Bottleneck layer: Let $H \in \mathbb{R}^{T \times C_{in}}$ denote the input tensor of DMSB. A shared $1 \times 1$ convolution first compresses it to 32 channels, yielding $B \in \mathbb{R}^{T \times 32}$. This operation unifies the branch inputs to 32 channels, decoupling the branches’ parameter count from $C_{in}$.
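The decoupling can be illustrated by tallying convolution weights with and without the shared bottleneck. The sketch below is illustrative only and is not taken from the paper: it counts weights alone (biases and normalization ignored), takes the three branch kernel sizes (7, 5, 3) and the 32-channel branch width from this section, and uses hypothetical helper names.

```python
def conv1d_weights(c_in, c_out, kernel):
    """Weight count of a 1D convolution (bias terms ignored)."""
    return c_in * c_out * kernel


BRANCH_KERNELS = (7, 5, 3)  # kernel sizes of the three dilated branches
BRANCH_WIDTH = 32           # output channels per branch


def branch_weights_direct(c_in):
    """Branches applied straight to the C_in-channel input (no bottleneck)."""
    return sum(conv1d_weights(c_in, BRANCH_WIDTH, k) for k in BRANCH_KERNELS)


def branch_weights_bottleneck(c_in):
    """Return (bottleneck weights, branch weights) with a shared 1x1 bottleneck."""
    bottleneck = conv1d_weights(c_in, BRANCH_WIDTH, 1)
    branches = sum(conv1d_weights(BRANCH_WIDTH, BRANCH_WIDTH, k) for k in BRANCH_KERNELS)
    return bottleneck, branches


for c_in in (21, 128):
    bottleneck, branches = branch_weights_bottleneck(c_in)
    print(f"C_in={c_in}: direct={branch_weights_direct(c_in)}, "
          f"bottleneck={bottleneck}, branches={branches}")
```

Under this count the branch term is the same for every $C_{in}$, and only the $1 \times 1$ bottleneck scales with the input width. The saving relative to the direct design therefore depends on $C_{in}$ and is largest for wide inputs such as the 128-channel feature maps passed between blocks.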

Dilated convolution branches: Three parallel branches apply 1D convolutions to $B$ with different kernel sizes $k$ and dilation rates $d$: $(k_{1}, d_{1}) = (7, 1)$, $(k_{2}, d_{2}) = (5, 2)$, and $(k_{3}, d_{3}) = (3, 4)$. With the receptive field of a dilated convolution given by $1 + (k - 1) d$, these correspond to receptive fields of 7, 9, and 9 steps. These settings follow the typical temporal structure of human activities [36]. Taking the 50 Hz sampling rate of UCI-HAR as an example, one step corresponds to 20 ms. A receptive field of 7 then covers about 140 ms, capturing short-range transient events such as a single heel-strike impulse within the gait cycle. A receptive field of 9 covers about 180 ms, corresponding to the transition interval between adjacent limb-motion segments. Different $k$–$d$ combinations maintain similar receptive fields while concentrating each branch’s convolutional weights on distinct periodic components in the frequency domain, avoiding feature redundancy across branches. Although a single DMSB has a limited receptive field, four stacked DMSC Blocks accumulate a receptive field of approximately 57 steps along the mixed-path route, covering about 44% of the 128-step input window. This allows the model to capture long-range activity patterns, such as sit-to-stand transitions that unfold over several seconds.
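The per-branch receptive fields and their durations can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the standard single-layer formula $1 + (k - 1) d$ for a dilated convolution; the function name is hypothetical.

```python
def receptive_field(kernel, dilation):
    """Temporal span, in steps, covered by one dilated 1D convolution."""
    return 1 + (kernel - 1) * dilation


SAMPLING_RATE_HZ = 50                 # UCI-HAR sampling rate quoted in the text
BRANCHES = [(7, 1), (5, 2), (3, 4)]   # (kernel size, dilation rate) per branch

for k, d in BRANCHES:
    steps = receptive_field(k, d)
    span_ms = steps * 1000 / SAMPLING_RATE_HZ
    print(f"k={k}, d={d}: {steps} steps, {span_ms:.0f} ms")
```

The loop reports 7, 9, and 9 steps, that is 140 ms, 180 ms, and 180 ms at 50 Hz.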

Pooling branch: An additional branch bypasses the bottleneck layer and operates directly on the raw input $H$, applying max pooling with $k = 3$ followed by a $1 \times 1$ convolution that projects to 32 channels. Max pooling retains the positional information of signal peaks, which helps capture impulse-type features such as elevator vibrations in USC-HAD and complements the convolution branches, whose output is smoothed by the bottleneck.

Feature fusion and SE channel attention: The outputs of the four branches are concatenated along the channel dimension to form a 128-dimensional feature map. The SE module [11] generates a 128-dimensional channel descriptor via global average pooling, produces a channel weight vector $w \in \mathbb{R}^{128}$ through a two-layer fully connected network with a reduction ratio of 16, and applies channel-wise scaling to the concatenated feature map. The discriminative contribution of the concatenated channels varies substantially, and channel recalibration dynamically suppresses low-discriminability channels without changing the feature dimension. The recalibrated features pass through Batch Normalization and ReLU activation, yielding the DMSB output $M \in \mathbb{R}^{T \times 128}$.
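The squeeze-and-excitation step can be sketched in plain Python to make the data flow explicit. This is not the authors’ implementation: it assumes the standard SE layout (FC, ReLU, FC, sigmoid), omits biases, uses random weights in place of trained ones, and works on nested lists for readability where a real model would use a tensor library.

```python
import math
import random


def se_recalibrate(features, w1, w2):
    """Scale each channel of a T x C feature map by a weight in (0, 1).

    features: T rows of C floats; w1: C x (C // r) squeeze weights;
    w2: (C // r) x C excitation weights. Biases are omitted in this sketch.
    """
    t, c = len(features), len(features[0])
    # Squeeze: global average pooling over time, one descriptor per channel.
    descriptor = [sum(row[j] for row in features) / t for j in range(c)]
    # Excite: FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, sum(descriptor[j] * w1[j][m] for j in range(c)))
              for m in range(len(w1[0]))]
    weights = [1.0 / (1.0 + math.exp(-sum(hidden[m] * w2[m][j] for m in range(len(hidden)))))
               for j in range(c)]
    # Channel-wise scaling leaves the T x C shape unchanged.
    scaled = [[row[j] * weights[j] for j in range(c)] for row in features]
    return scaled, weights


random.seed(0)
T, C, R = 8, 128, 16  # window length (toy value), channels, reduction ratio
x = [[random.gauss(0, 1) for _ in range(C)] for _ in range(T)]
w1 = [[random.gauss(0, 0.1) for _ in range(C // R)] for _ in range(C)]
w2 = [[random.gauss(0, 0.1) for _ in range(C)] for _ in range(C // R)]
y, w = se_recalibrate(x, w1, w2)
print(len(y), len(y[0]), round(min(w), 3), round(max(w), 3))
```

With a reduction ratio of 16, the 128-dimensional descriptor passes through an 8-unit hidden layer before being expanded back to 128 channel weights.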

3.3. Contrastive Temporal Attention (CTA) Module

Standard softmax attention [21] assigns a single set of weights over all time steps. When sensor data contain baseline drift and inter-subject variability, background responses in the attention matrix cannot be automatically suppressed, and time steps unrelated to activity discrimination still receive non-negligible attention weights. Motivated by differential attention [32], CTA adopts a dual query–key–value framework and performs a weighted subtraction between two attention paths. The subtraction cancels common-mode background responses jointly captured by the two paths, while amplifying discriminative differential-mode signals. CTA further introduces a depth-adaptive contrastive strength $\lambda$ together with a learnable dynamic update mechanism: shallow CTA layers apply a mild subtraction to preserve feature transformation capacity, while deep CTA layers apply a stronger subtraction to refine attention allocation and use output scaling to avoid over-perturbing already stabilized high-level features.

Notation: Let the input sequence to CTA be $M \in \mathbb{R}^{L \times d}$, the output of DMSB, with $L = T$ as the sequence length and $d = 128$. The number of heads is $h = 4$. Because the query and key are split into two paths (each containing $h$ heads), the per-head dimension is $d_{h} = d / (2h) = 16$. The value vector is split into $h$ heads of dimension $2 d_{h} = 32$, preserving the input–output dimensionality.

Dual attention computation: $M$ is linearly projected to obtain the query $Q = M \cdot W_{Q} \in \mathbb{R}^{L \times d}$ and the key $K = M \cdot W_{K} \in \mathbb{R}^{L \times d}$, which are then split along the head dimension into two groups: $Q = [Q_{1}; Q_{2}]$ and $K = [K_{1}; K_{2}]$. The value $V = M \cdot W_{V} \in \mathbb{R}^{L \times d}$ is split as specified in the Notation. The two attention matrices are computed as

$$A_{1} = \mathrm{softmax}\left(\frac{Q_{1} K_{1}^{\top}}{\sqrt{d_{h}}}\right), \quad A_{2} = \mathrm{softmax}\left(\frac{Q_{2} K_{2}^{\top}}{\sqrt{d_{h}}}\right) \tag{1}$$

The core output of CTA is the weighted contrast of the two attention matrices:

$$\mathrm{CTA}(M) = (A_{1} - \lambda \cdot A_{2}) \cdot V \tag{2}$$
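Equations (1) and (2) can be traced for a single head with a small plain-Python sketch. It is an illustration under stated assumptions rather than the paper’s code: queries, keys, and values are random stand-ins for the projected features, $\lambda$ is fixed at an arbitrary 0.2, the helper names are hypothetical, and nested lists replace tensors for readability.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def attention_map(q, k):
    """softmax(Q K^T / sqrt(d_h)), applied row by row (Equation (1))."""
    d_h = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    return [softmax([s / math.sqrt(d_h) for s in row]) for row in matmul(q, k_t)]


def cta_head(q1, k1, q2, k2, v, lam):
    """(A1 - lam * A2) V for one head (Equation (2)); also returns the signed map."""
    a1, a2 = attention_map(q1, k1), attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    return matmul(diff, v), diff


def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
L, D_H, LAM = 6, 16, 0.2  # toy sequence length, per-head width, fixed lambda
out, diff = cta_head(random_matrix(L, D_H), random_matrix(L, D_H),
                     random_matrix(L, D_H), random_matrix(L, D_H),
                     random_matrix(L, 2 * D_H), LAM)
print(len(out), len(out[0]))
print([round(sum(row), 6) for row in diff])
```

Each softmax row sums to 1, so every row of the contrasted map sums to $1 - \lambda$, and unlike a softmax map its individual entries may be negative.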

The projection weights of the two paths are independently initialized and jointly trained, naturally differentiating during training into overlapping but non-identical attention patterns: each path locks onto a set of discriminative positions in the sequence, while both produce a certain level of low-level response over non-discriminative regions. The subtraction $A_{1} - \lambda \cdot A_{2}$ cancels the low-level common-mode responses shared by the two paths. The discriminative positions that each path locks onto are preserved with opposite signs, yielding a signed bipolar encoding of the sequence. This mechanism is driven entirely by end-to-end training and does not rely on external noise annotations.

Intuitively, the two attention paths can be understood as two observers that both respond to the persistent background of the sequence (the baseline drift, inter-subject offsets, and device-orientation effects that span all time steps) but lock onto slightly different discriminative moments. Because the background is shared by both paths, it appears with nearly the same magnitude in $A_{1}$ and $A_{2}$ and is largely cancelled by $A_{1} - \lambda \cdot A_{2}$. The discriminative moments each path attends to are not shared, so they survive the subtraction: positions emphasized by $A_{1}$ remain positive, and those emphasized by $A_{2}$ become negative. The result is a signed bipolar map in which background is suppressed toward zero and the two complementary sets of discriminative positions are separated by sign. This is why subtracting two attention paths suppresses common-mode background while preserving, and even sharpening, activity-related information. A schematic of this process is shown in Figure 2.
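A toy numeric case shows the cancellation. The two attention rows below are hand-made for illustration and do not come from the paper: each has a uniform background of 0.1 and a single peak at a different time step, and the two $\lambda$ values are arbitrary choices, with 1.0 included only as the limiting case of full cancellation.

```python
def contrast(a1, a2, lam):
    """Element-wise A1 - lam * A2 for one attention row."""
    return [x - lam * y for x, y in zip(a1, a2)]


# Hand-made attention rows over six time steps (each sums to 1):
# a shared background of 0.1 plus one peak per path.
a1 = [0.1, 0.1, 0.5, 0.1, 0.1, 0.1]  # path 1 locks onto step 2
a2 = [0.1, 0.1, 0.1, 0.1, 0.5, 0.1]  # path 2 locks onto step 4

for lam in (0.5, 1.0):
    print(lam, [round(v, 2) for v in contrast(a1, a2, lam)])
```

In this toy case the background shrinks to $(1 - \lambda) \cdot 0.1$, the step favored by path 1 stays positive, and the step favored by path 2 turns negative. How closely trained attention maps follow this idealized picture is an empirical matter.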

Depth-adaptive contrastive strength

λ

:

λ_{init}

is designed under two principles. Shallow features are not yet fully abstracted and carry high information density. At shallow depth, therefore, the subtraction should remain mild so that CTA retains strong feature transformation capacity and does not disrupt useful low-level patterns. Deep features are already highly abstracted and can tolerate a more aggressive subtraction that refines the attention allocation. At the same time, the overall output magnitude should converge moderately, so that stabilized representations are not over-perturbed. Based on these priors,

λ_{init}

takes a monotonically increasing exponential form with respect to depth:

λ_{init} (l) = 0.8 - 0.6 \cdot exp (- 0.3 l), l \in {0, 1, 2, 3}

(3)

This function takes the value 0.20 at $l = 0$ and 0.56 at $l = 3$, satisfying the prior constraint of smaller values at shallow depth and larger values at deeper depth. At shallow depth, $\lambda_{init}$ is small, the subtraction is mild, and the output scaling factor $(1 - \lambda_{init})$ is large (0.80 at $l = 0$), so CTA contributes substantially to feature transformation. At deeper depth, $\lambda_{init}$ is large and the subtraction is more aggressive, but the output magnitude after $(1 - \lambda_{init})$ scaling is small (0.44 at $l = 3$), avoiding over-modification of stabilized high-level features. The three constants $\{0.8, 0.6, 0.3\}$ control the upper bound, decay amplitude, and decay rate of the function respectively. The effect of these constants on performance is small; a sensitivity analysis reported in Section 4.5 confirms that perturbing each by $\pm 10\%$ leaves the F1-score essentially unchanged.
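The schedule in Equation (3) is easy to tabulate. The following is a minimal sketch, not the authors' code; the function and argument names (`lambda_init`, `upper`, `amplitude`, `rate`) are illustrative choices that mirror the three constants described above.

```python
import math

def lambda_init(depth, upper=0.8, amplitude=0.6, rate=0.3):
    """Equation (3): initial contrastive strength for the block at `depth`."""
    return upper - amplitude * math.exp(-rate * depth)

# Tabulate the schedule and the residual scaling factor (1 - lambda_init)
schedule = [(l, lambda_init(l), 1.0 - lambda_init(l)) for l in range(4)]
for l, lam, scale in schedule:
    print(f"l={l}  lambda_init={lam:.2f}  scale={scale:.2f}")
```

Running it reproduces the endpoint values quoted in the text (0.20 and 0.80 at $l = 0$, 0.56 and 0.44 at $l = 3$) and shows the intermediate depths rising monotonically between them.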

The contrastive strength $\lambda$ uses $\lambda_{init}$ as its initial bias and is computed in the forward pass as

$$\lambda = \exp(q_{1}^{\top} k_{1}) - \exp(q_{2}^{\top} k_{2}) + \lambda_{init} \tag{4}$$

where $q_{1}, k_{1}, q_{2}, k_{2} \in \mathbb{R}^{d_{h}}$ are independent learnable parameter vectors, initialized from a zero-mean Gaussian distribution and kept separate from the query/key matrices, dedicated to modulating the contrastive strength. $\lambda_{init}$ serves only as the initial training bias; the final value of $\lambda$ is driven by the learnable term $\exp(q_{1}^{\top} k_{1}) - \exp(q_{2}^{\top} k_{2})$ and converges in a data-driven manner. The $q_{1}, k_{1}, q_{2}, k_{2}$ parameters of each DMSC Block are mutually independent and converge during training to contrastive strengths that match the feature abstraction level of that layer. This yields differentiated modulation across depths at the data-driven level, rather than sharing a single $\lambda$ across all layers.
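To illustrate why $\lambda_{init}$ acts as the starting point of $\lambda$, the sketch below evaluates Equation (4) with small random vectors. It is an illustration under stated assumptions: the head dimension (`d_h = 32`) and the initialization standard deviation (0.1) are not given in the text and are chosen here only for demonstration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_strength(q1, k1, q2, k2, lam_init):
    """Equation (4): lambda = exp(q1.k1) - exp(q2.k2) + lambda_init."""
    return math.exp(dot(q1, k1)) - math.exp(dot(q2, k2)) + lam_init

def gaussian_vector(dim, std, rng):
    """Zero-mean Gaussian initialisation; `std` is an assumed value."""
    return [rng.gauss(0.0, std) for _ in range(dim)]

rng = random.Random(0)
d_h = 32            # assumed head dimension, for illustration only
lam_init = 0.2      # value of Equation (3) at depth l = 0
vectors = [gaussian_vector(d_h, 0.1, rng) for _ in range(4)]
lam = contrastive_strength(*vectors, lam_init)
print(f"lambda at initialisation: {lam:.3f} (lambda_init = {lam_init})")
```

With small zero-mean vectors both exponentials are close to 1, so their difference is close to 0 and $\lambda$ starts near $\lambda_{init}$; training then moves it through the four vectors.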

CTA sublayer output: The core CTA output is passed through RMSNorm [37], multiplied by the scaling factor $(1 - \lambda_{init})$, and added to the residual to form the complete CTA sublayer output $O_{1}$:

$$O_{1} = (1 - \lambda_{init}) \cdot \mathrm{RMSNorm}(\mathrm{CTA}(\mathrm{LN}(M))) + M \tag{5}$$

The scaling factor $(1 - \lambda_{init})$ decreases with depth, so the magnitude of the CTA sublayer output declines progressively at deeper layers. This stabilizes gradient propagation while preventing deep attention from over-perturbing mature feature representations.
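The effect of the scaling factor in Equation (5) can be checked numerically. The sketch below is a simplified illustration for a single time step: it assumes an RMSNorm without learnable gain, and `cta_out` is a hypothetical stand-in for $\mathrm{CTA}(\mathrm{LN}(M))$ rather than an actual attention output.

```python
import math

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_norm(x, eps=1e-8):
    """RMSNorm without a learnable gain (a gain of 1 is assumed here)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def cta_sublayer(m, cta_out, lam_init):
    """Equation (5) for one time step; returns (O_1, scaled attention branch)."""
    branch = [(1.0 - lam_init) * v for v in rms_norm(cta_out)]
    return [b + r for b, r in zip(branch, m)], branch

m = [0.5, -1.0, 0.25, 2.0]          # hypothetical residual input
cta_out = [3.0, -6.0, 1.5, 0.0]     # hypothetical attention output
for lam_init in (0.20, 0.56):
    o1, branch = cta_sublayer(m, cta_out, lam_init)
    print(f"lambda_init={lam_init:.2f}  branch RMS={rms(branch):.2f}")
```

Because RMSNorm fixes the root-mean-square of the branch at 1, the branch added to the residual has RMS $(1 - \lambda_{init})$ regardless of the raw attention magnitude, which is how the depth schedule controls its contribution.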

The CTA sublayer is followed by a standard Pre-Norm FFN sublayer [38]. This sublayer first applies LayerNorm, then a linear transformation expanding the feature dimension from 128 to 256, followed by GELU activation [39] and Dropout. A second linear transformation projects the dimension back from 256 to 128, followed by a final Dropout, and the result is added to the input through a residual connection. The $2\times$ expansion of the intermediate dimension provides sufficient expressive capacity for the non-linear transformation. CTA does not apply a causal mask, so every time step in the sequence can attend to the global context, consistent with the bidirectional dependency of signals within an HAR window. The overall topology of a DMSC Block places DMSB first, followed by the CTA sublayer and the FFN sublayer in sequence, with residual connections spanning the entire Block. The complete structure is shown in Figure 1.

4. Experiments and Results

We evaluate DMSCNet on three public HAR datasets through comparison experiments, ablation studies, and visualization analyses. Section 4.1 and Section 4.2 describe the datasets and implementation details. Section 4.3 compares DMSCNet against nine representative HAR baselines, and Section 4.4 re-evaluates all methods under a cross-subject protocol. Section 4.5 isolates the contributions of the SE channel attention and the CTA temporal attention. Section 4.6 analyzes the learned feature space, classification decisions, and attention behavior to examine how DMSCNet operates internally, and Section 4.7 tests robustness to input perturbations.

4.1. Datasets

We evaluate DMSCNet on three publicly available wearable-sensor activity recognition datasets: UCI-HAR [40], USC-HAD [41], and RealWorld [42]. The three datasets differ in the number of subjects, activity granularity, and sensor configuration, providing complementary perspectives for assessing recognition accuracy and generalization. For UCI-HAR we adopt the official partition (7352 training samples and 2947 test samples). For USC-HAD and RealWorld we randomly partition each into 70% training and 30% test. We further hold out 15% of the training set as a validation set for early stopping and model selection; the test set is used only for final evaluation after training. Note that the official UCI-HAR split is already subject-independent (disjoint subjects in training and test), whereas the random 70/30 split used for USC-HAD and RealWorld is not; we therefore additionally evaluate cross-subject generalization in Section 4.4. Table 1 summarizes the three datasets.
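The random partition described above (70% training, 30% test, then 15% of the training portion held out for validation) can be sketched as follows. This is an illustrative sketch only: whether the split is stratified and the exact seeding are not stated in the text, so the plain shuffled split and the `seed` argument here are assumptions.

```python
import random

def split_indices(n_samples, test_frac=0.30, val_frac=0.15, seed=0):
    """Random train/val/test split of sample indices.

    The test fraction is taken from all samples; the validation fraction
    is then taken from the remaining training pool.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = round(n_samples * test_frac)
    test, pool = idx[:n_test], idx[n_test:]
    n_val = round(len(pool) * val_frac)
    val, train = pool[:n_val], pool[n_val:]
    return train, val, test

train, val, test = split_indices(10_000)   # hypothetical window count
print(len(train), len(val), len(test))
```

For 10,000 windows this yields 5950 training, 1050 validation, and 3000 test samples. Note that such a window-level split can place windows from the same subject on both sides, which is the motivation for the cross-subject protocol in Section 4.4.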

UCI-HAR was collected from 30 subjects wearing a waist-mounted smartphone (Samsung Galaxy S II) and covers six daily activities. The sampling rate is 50 Hz, and each sample contains nine channels—tri-axial body acceleration, gravity acceleration, and angular velocity.

USC-HAD was collected from 14 subjects wearing a waist-mounted MotionNode inertial sensor at 100 Hz. It covers 12 fine-grained activities, in which pairs such as walking upstairs/downstairs and elevator up/down produce highly similar signal patterns that demand fine-grained discrimination.

RealWorld was collected from 15 subjects performing eight activities in natural settings, with tri-axial accelerometers deployed at seven body positions: chest, forearm, head, waist, thigh, shin, and upper arm. We concatenate the signals from the seven positions along the channel dimension to form a 21-channel input at 50 Hz, assessing recognition performance under multi-position sensor fusion.

4.2. Implementation Details

All experiments were conducted on the same platform to ensure fair comparison. The hardware comprises an Intel Core i5-12400F CPU, an NVIDIA GeForce RTX 4060 Ti 16 GB GPU, and 32 GB of RAM. The software environment consists of Ubuntu 20.04.6, Python 3.8, CUDA 12.3, and PyTorch 2.1.1. All baselines and ablation variants are trained from scratch using the same training protocol.

We use the AdamW optimizer with an initial learning rate of $3 \times 10^{-4}$ and a weight decay of $1 \times 10^{-3}$. The learning rate schedule combines a 15-epoch linear warmup with cosine annealing with warm restarts, stabilizing gradients in early training and enabling finer convergence in later epochs. The batch size is 16 and the maximum number of training epochs is 300; training terminates early when the validation F1-score does not improve for 35 consecutive epochs. Dropout is set to 0.1. The loss function is cross-entropy with label smoothing ($\varepsilon = 0.05$), which together with weight decay and Dropout forms a multi-level regularization scheme that mitigates overfitting under limited data. Unless otherwise stated, all results are reported as mean ± standard deviation over five independent runs with different random seeds (0–4) under a fixed data split.
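The schedule and loss described above can be written out in a few lines. The sketch below is framework-free and illustrative: the restart period (`cycle_len`) and the learning-rate floor (`min_lr`) are not specified in the text and are assumed values, and the label-smoothing formula follows the common convention of spreading $\varepsilon$ uniformly over all classes.

```python
import math

BASE_LR = 3e-4
WARMUP_EPOCHS = 15

def learning_rate(epoch, cycle_len=50, min_lr=0.0):
    """Linear warmup followed by cosine annealing with warm restarts.

    `cycle_len` and `min_lr` are assumed values; the text does not state
    the restart period or the floor of the schedule.
    """
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    t = (epoch - WARMUP_EPOCHS) % cycle_len      # position inside the cycle
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / cycle_len))
    return min_lr + (BASE_LR - min_lr) * cosine

def smoothed_cross_entropy(logits, target, eps=0.05):
    """Cross-entropy against a label-smoothed target distribution."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    log_probs = [v - log_z for v in logits]
    n = len(logits)
    # Smoothed target: eps spread uniformly, the remainder on the true class
    weights = [eps / n + (1.0 - eps) * (i == target) for i in range(n)]
    return -sum(w * lp for w, lp in zip(weights, log_probs))

for epoch in (0, 14, 15, 40, 65):
    print(epoch, f"{learning_rate(epoch):.2e}")
```

The learning rate climbs linearly to the base value over the first 15 epochs, decays along a half cosine within each cycle, and jumps back to the base value at every restart.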

For evaluation we report accuracy and F1-score. Accuracy measures the overall classification correctness across all test samples, while F1-score combines precision and recall and better reflects performance on minority-class activities under class imbalance. The two metrics jointly assess overall performance and class-level balance.
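A small example shows why the two metrics are reported together. The sketch below assumes macro-averaged F1 (the unweighted mean of per-class F1), which is a common choice but is not stated explicitly in the text; the labels are toy values.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy imbalanced example: class 1 is the minority class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(f"accuracy={accuracy(y_true, y_pred):.3f}  macro-F1={macro_f1(y_true, y_pred):.3f}")
```

Here a single error on the minority class leaves accuracy at 0.900 but pulls macro-F1 down to about 0.804, because the minority class F1 of 2/3 counts as much as the majority class.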

4.3. Comparison Experiments

To evaluate the overall performance of DMSCNet, we compare it against nine representative HAR methods on the three datasets, spanning classical convolutional baselines, multi-scale convolution, temporal convolution, attention networks, and CNN–RNN hybrids. FCN [6] is a classical fully convolutional baseline for time-series classification, with three convolutional layers followed by global average pooling. ResNet [6] extends FCN with residual connections to mitigate vanishing gradients in deep networks. DeepConvLSTM [5] stacks convolutional layers with LSTM layers to jointly model spatial correlation and long-range temporal dependency. InceptionTime [7] adopts the parallel multi-scale kernel structure of Inception networks. TCN-Inception [10] combines temporal convolutional networks with Inception modules. MuVAN [13] enhances feature representation for multivariate time series through a multi-view attention mechanism. CNN-A-BiLSTM [28] introduces attention into a CNN–BiLSTM hybrid to improve robustness to noisy data. TECA-HAR [29] couples ECA channel attention with a CNN–LSTM hybrid. HART [43] is a Transformer-based HAR architecture using sensor-wise tokenization and a lightweight Transformer encoder; we include it as a representative Transformer-based baseline and re-implement it under our unified protocol.

To ensure a fair comparison, we reproduce each baseline following the original authors’ public code and paper descriptions, adjusting only the input and output dimensions to match the channel count and class count of our datasets. The learning rate for each method is set to the optimal value reported in its original paper, or, when no value is specified, to the one used for DMSCNet. Batch size, training epochs, early stopping, and loss function follow the same configuration as DMSCNet.

As shown in Table 2, DMSCNet achieves the best performance on all six metrics across the three datasets. On UCI-HAR, DMSCNet reaches 97.65% F1-score, about 1.0 percentage point above the runner-up TCN-Inception. On USC-HAD, DMSCNet reaches 91.80% F1-score, about 1.28 percentage points above TCN-Inception. This dataset contains activity pairs with highly similar signals, such as walking upstairs/downstairs and elevator up/down, and the fact that the largest improvement occurs here indicates that CTA is effective at suppressing shared background responses and sharpening the discriminative differences between similar activities. This holds provided the sensor modality carries discriminative information for the pair (e.g., walking upstairs/downstairs); it does not extend to pairs whose distinguishing cues are largely absent from the modality itself, such as elevator up/down (Section 4.6.2). On RealWorld, which has a more complex sensor configuration in both position and count, DMSCNet still retains an advantage, showing strong generalization to multi-position sensor fusion.

Compared with multi-scale methods such as InceptionTime and TCN-Inception, DMSCNet shows a consistent advantage on all three datasets. This suggests that widening the receptive field through dilated convolutions is more efficient than simply diversifying kernel sizes, and that the synergy with CTA enables the model to locate discriminative time steps over a fuller range of the activity cycle. The attention-based methods MuVAN and CNN-A-BiLSTM lag noticeably in most cases (MuVAN drops to 69.10% F1-score on USC-HAD), indicating that pure or shallow attention mechanisms, without a strong local inductive bias, struggle to learn discriminative features on small-scale, short-window sensor data. Among the Transformer- and attention-based methods, HART outperforms the pure-attention MuVAN but remains below the convolutional baselines and DMSCNet on all three datasets, consistent with Section 2.2: on small, short-window HAR data the local inductive bias of convolutions remains important and a Transformer encoder alone is less data-efficient. DeepConvLSTM, as a representative CNN–RNN hybrid, falls below the convolutional methods on all three datasets, suggesting that LSTM-based sequential modeling does not necessarily carry an advantage on HAR’s fixed-length, short-window inputs. TECA-HAR, the most recent baseline (a CNN–LSTM hybrid with attention, published in 2025), performs steadily across the three datasets but remains below DMSCNet. The USC-HAD and RealWorld results above use a random split; we further assess cross-subject generalization in Section 4.4.

4.4. Cross-Subject Evaluation

The random 70/30 split used for USC-HAD and RealWorld in Section 4.3 allows windows from the same subject to appear in both the training and test sets, which can overestimate generalization to unseen users. Since subject variability is a central challenge in practical HAR, we evaluate all methods under a stricter cross-subject protocol using leave-one-subject-out (LOSO) cross-validation: 14 folds on USC-HAD and 15 folds on RealWorld. In each fold, the data of one subject are held out for testing, two further subjects form a subject-disjoint validation set for early stopping, and the remaining subjects are used for training, so that no subject appears in more than one split. All folds are fixed in advance and shared across methods, and all training hyperparameters follow Section 4.2. UCI-HAR is not re-evaluated because its official split is already subject-independent.
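The fold construction can be sketched as follows. This is an illustrative sketch: the text does not say how the two validation subjects are chosen in each fold, so taking the next two subjects in cyclic order is an assumption made here.

```python
def loso_folds(subjects, n_val=2):
    """Yield (train, val, test) subject lists for leave-one-subject-out.

    The rule for choosing validation subjects (the next `n_val` subjects
    in cyclic order) is an assumption made for this sketch.
    """
    n = len(subjects)
    for i, test_subject in enumerate(subjects):
        val = [subjects[(i + k) % n] for k in range(1, n_val + 1)]
        train = [s for s in subjects if s != test_subject and s not in val]
        yield train, val, [test_subject]

usc_had_subjects = list(range(1, 15))      # 14 subjects
folds = list(loso_folds(usc_had_subjects))
print(len(folds), "folds;", len(folds[0][0]), "training subjects per fold")
```

On USC-HAD this gives 14 folds with 11 training, 2 validation, and 1 test subject each, and every subject is tested exactly once. Fixing the fold list in advance, as described above, lets all methods share identical splits.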

Table 3 reports the results as mean ± standard deviation across folds. As expected, every method drops relative to the random split, confirming that the random partition overestimates cross-subject performance. Under LOSO, DMSCNet still achieves the best F1-score on both datasets (79.85% on USC-HAD and 91.50% on RealWorld), retaining its lead over all baselines. The fold-wise standard deviation also quantifies inter-subject variability: weaker models such as MuVAN and DeepConvLSTM exhibit the largest variance, whereas DMSCNet shows the smallest, indicating more stable generalization across users. A paired t-test across subject folds confirms that the improvement of DMSCNet over the strongest baseline (TCN-Inception) is statistically significant ($p < 0.05$) on both datasets for both metrics.

4.5. Ablation Study

To quantify the individual contributions and joint effects of the modules in DMSCNet, we conduct an ablation study on the three datasets. Starting from a baseline with no attention, we progressively add SE channel attention and CTA temporal attention. The no-attention variant removes both the SE module inside DMSB and the CTA module from the full model, keeping only the dilated multi-scale branches and the pooling branch as the backbone. The + SE variant enables only the SE channel attention inside DMSB, with CTA still disabled. The + CTA variant (the complete DMSCNet) cascades CTA temporal attention on top of the + SE variant. All other hyperparameters and training settings are identical across the three variants. The ablation results, reported in F1-score (%), are summarized in Table 4.

Starting from the full DMSCNet, we remove or replace one design choice at a time (Table 4).

Attention modules: Removing CTA lowers the F1-score on all three datasets, and additionally removing SE lowers it further, confirming that the two attention modules act complementarily. The effect is clearest on USC-HAD, where removing CTA alone drops F1 by 1.65 points and removing both drops it by 2.35 points; the larger contribution of CTA on USC-HAD is consistent with its fine-grained activity pairs, for which suppressing shared background responses is most beneficial. CTA produces a clear advantage in fine-grained scenarios where the two activities remain separable in the sensor signal (e.g., walking upstairs/downstairs); for pairs limited by the sensing modality itself, such as elevator up/down, attention cannot recover information the signal does not contain (Section 4.6.2).

DMSB structural components: Removing the shared bottleneck or the pooling bypass also reduces performance on all datasets, verifying that both contribute: the bottleneck improves parameter efficiency before the multi-scale branches, and the pooling bypass preserves peak/impulse information that the bottleneck-smoothed convolution branches tend to attenuate.

Dilation configuration: The chosen $(k, d) = (7, 1), (5, 2), (3, 4)$ outperforms disabling dilation (all dilation $= 1$), removing multi-scale (Single-scale, $k = 3$), and an alternative combination, showing that the gain stems jointly from the multi-scale design and the dilation; because dilation adds no parameters, this advantage is obtained at essentially the same parameter cost.

The contributions are most pronounced on UCI-HAR and USC-HAD; on RealWorld the differences are small and comparable to the run-to-run standard deviation, owing to the near-saturated information of its 21-channel multi-position input. SE performs static recalibration along the channel axis and CTA performs dynamic contrastive filtering along the temporal axis; the two act on different dimensions and complement each other.
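The statement that dilation widens the receptive field without adding parameters follows from two standard formulas, sketched below. The branch width (`channels = 32`) is an assumed value used only to make the parameter counts concrete; the $(k, d)$ pairs are those of the chosen configuration.

```python
def effective_span(kernel_size, dilation):
    """Number of time steps covered by one dilated 1-D convolution."""
    return dilation * (kernel_size - 1) + 1

def conv1d_params(kernel_size, c_in, c_out, bias=True):
    """Parameter count of a 1-D convolution; independent of dilation."""
    return kernel_size * c_in * c_out + (c_out if bias else 0)

branches = [(7, 1), (5, 2), (3, 4)]    # (k, d) pairs from the ablation
channels = 32                          # assumed branch width, illustrative
for k, d in branches:
    print(f"k={k} d={d}  span={effective_span(k, d)}  "
          f"params={conv1d_params(k, channels, channels)}")
```

The three branches cover 7, 9, and 9 time steps respectively. The $(3, 4)$ branch reaches a 9-step span with 3 taps per channel pair, whereas an undilated convolution would need $k = 9$, i.e., three times as many weights.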

To verify that the design does not rely on finely tuned constants, we perturb each of the three constants $a$, $b$, and $c$ of $\lambda_{init}(l) = a - b \cdot \exp(-c\,l)$ by $\pm 10\%$ while keeping the others fixed, and evaluate on USC-HAD (Figure 3). All perturbed configurations stay within 0.13 percentage points of the baseline F1-score, which is smaller than the run-to-run standard deviation, confirming that performance is insensitive to these constants. The baseline configuration $(a, b, c) = (0.8, 0.6, 0.3)$ yields the same USC-HAD F1-score (91.80%) as DMSCNet in Table 2, confirming that the sensitivity analysis uses the same model and setting as the main experiments.
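The six one-at-a-time perturbations can be enumerated programmatically. The sketch below only generates the perturbed schedules; it does not train any model, and the helper names are illustrative.

```python
import math

BASE = {"a": 0.8, "b": 0.6, "c": 0.3}

def lambda_init(depth, a, b, c):
    return a - b * math.exp(-c * depth)

def perturbed_configs(base, rel=0.10):
    """One-at-a-time +/- `rel` perturbations of each constant."""
    for name in base:
        for sign in (-1, 1):
            cfg = dict(base)
            cfg[name] = base[name] * (1 + sign * rel)
            label = f"{name}{'+' if sign > 0 else '-'}{round(rel * 100)}%"
            yield label, cfg

configs = dict(perturbed_configs(BASE))
for label, cfg in configs.items():
    values = [round(lambda_init(l, **cfg), 3) for l in range(4)]
    print(label, values)
```

Across all six configurations $\lambda_{init}$ stays strictly between 0 and 1 and keeps its increasing trend over depth, so the perturbations change the schedule quantitatively without altering its qualitative shape.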

4.6. Visualization Analysis

This section analyzes DMSCNet from three angles: feature space (t-SNE), classification decisions (confusion matrices), and attention behavior. Each angle is examined across UCI-HAR, USC-HAD, and RealWorld, which differ in channel count (9, 6, 21), class count (6, 12, 8), and sensor configuration (single waist sensor, single waist sensor, multi-position fusion).

4.6.1. Feature Space Distribution (t-SNE)

Figure 4, Figure 5 and Figure 6 show the feature distributions of the test samples from each dataset after the global average pooling layer of DMSCNet. On all three datasets, a clear spatial separation emerges between dynamic and static activities, indicating that the multi-scale dilated convolutions in DMSB and the temporal selection in CTA learn discriminative feature representations under different sensor configurations. Differences in cluster structure across the three datasets reflect their respective modality and task characteristics.

On UCI-HAR, Laying occupies an isolated region clearly separated from the other classes; Walking and Walk Down appear next to each other in the lower part of the plot, while Walk Up occupies the upper part alone; the Sitting and Standing clusters overlap noticeably in the center, reflecting the inherent similarity of these two activities at the waist acceleration level.

USC-HAD exhibits a more complex cluster structure owing to its larger class count and difficult pairs. The five walking activities, Run Fwd, Jump, and Sleep each form distinct clusters that separate cleanly, while Stand, Elev Up, and Elev Down show large overlap in the lower-central region and also overlap with some Walk Down samples. The physical meaning of this structural overlap is discussed further in the confusion matrix analysis in Section 4.6.2.

RealWorld’s eight classes form a compact overall distribution. Climb Down, Climb Up, Jumping, and Sitting each occupy distinct regions; Walking and Running appear next to each other in the upper right, and Lying and Standing are adjacent on the left, with the main contours of neighboring clusters still distinguishable. Overall, the inter-class separability on RealWorld under a 21-channel input surpasses that on UCI-HAR and USC-HAD.

To complement the qualitative t-SNE visualization with a quantitative measure, we compute cluster separability directly on the 128-dimensional GAP-layer features (not the 2-D projection) using the Silhouette score and the Davies–Bouldin index (Table 5). DMSCNet attains Silhouette scores of 0.659, 0.572, and 0.625 on UCI-HAR, USC-HAD, and RealWorld, indicating well-separated class clusters. USC-HAD shows the highest Davies–Bouldin index, consistent with its larger class count and the hard elevator pair, matching the cluster overlap observed in the t-SNE plot and the confusion matrix.
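For readers unfamiliar with the two separability measures, the sketch below implements their standard definitions in plain Python on toy 2-D points. It is an illustration of the metrics, not the authors' evaluation code; in practice the inputs would be the 128-dimensional pooled features and their class labels, and at least two classes are assumed.

```python
import math

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def _mean(values):
    return sum(values) / len(values)

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all samples (Euclidean distance)."""
    classes = sorted(set(labels))
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [_dist(p, q) for j, (q, q_lab) in enumerate(zip(points, labels))
                if q_lab == lab and j != i]
        if not same:               # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = _mean(same)            # mean distance to own cluster
        b = min(_mean([_dist(p, q) for q, q_lab in zip(points, labels) if q_lab == c])
                for c in classes if c != lab)   # nearest other cluster
        top = max(a, b)
        scores.append((b - a) / top if top > 0 else 0.0)
    return _mean(scores)

def davies_bouldin_index(points, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid gap."""
    classes = sorted(set(labels))
    members = {c: [p for p, q_lab in zip(points, labels) if q_lab == c] for c in classes}
    centroid = {c: [_mean(dim) for dim in zip(*members[c])] for c in classes}
    scatter = {c: _mean([_dist(p, centroid[c]) for p in members[c]]) for c in classes}
    worst = [max((scatter[i] + scatter[j]) / _dist(centroid[i], centroid[j])
                 for j in classes if j != i) for i in classes]
    return _mean(worst)

# Two tight, well-separated toy clusters in 2-D
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
print(silhouette_score(points, labels), davies_bouldin_index(points, labels))
```

A higher Silhouette score (maximum 1) and a lower Davies–Bouldin index (minimum 0) both indicate tighter, better-separated clusters; the toy clusters score roughly 0.90 and 0.10.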

4.6.2. Classification Decision Analysis (Confusion Matrices)

Figure 7, Figure 8 and Figure 9 present the normalized confusion matrices on the three datasets, confirming the class-level discriminability observed in the previous subsection and revealing the cause of the overlapping clusters in USC-HAD.

On UCI-HAR, the diagonal accuracy of all six classes is at or above 95.5%, and Walk Down and Laying reach 100%. The main misclassifications occur between Sitting and Standing, with mutual confusion of 2–3% in each direction; a small number of Sitting samples are classified as Laying, and a small number of Walking samples as Sitting. These errors concentrate among static postures with similar physical signals, echoing the overlap between Sitting and Standing clusters in the t-SNE plot.

On USC-HAD, most classes achieve accuracy above 97% (Sleep reaches 100%), with Stand at 94.0% being the only notable exception among the non-elevator classes. Elev Up and Elev Down, however, drop markedly to 45.9% and 61.8% respectively, with severe bidirectional confusion between them (Elev Up misclassified as Elev Down in 40.7% of cases, Elev Down as Elev Up in 22.9%) and a non-trivial fraction further classified as Sit or Stand.

The difficulty of recognizing Elev Up and Elev Down is a well-known issue in HAR, not a specific weakness of DMSCNet. All baselines we reproduce perform poorly on these two classes, typically treating the two directions as a single activity and further confusing them with Stand. The root cause is the modality limit of a waist-mounted inertial sensor in elevator scenarios. First, elevators operate mostly in a constant-velocity phase, during which the passenger is stationary relative to the cabin; accelerometer readings after gravity removal are close to zero and nearly indistinguishable from standing still. Second, the direction-specific acceleration impulses occur only during start and stop transients, which are often truncated or absent within the 128-step sliding window, leaving little discriminative information. Resolving this fundamentally requires auxiliary modalities such as a barometer or floor position; algorithmic improvements alone cannot exceed the information ceiling imposed by the modality itself. We emphasize that this degradation is a limitation of the sensing modality, not of CTA or DMSCNet specifically; the fine-grained effectiveness claimed in Section 4.3 and Section 4.5 is therefore conditional on modality-available information.

On RealWorld, the diagonal accuracy of all eight classes is at or above 97.8%, with Sitting and Standing reaching 99.5% and 99.4% respectively. Misclassifications are extremely sparse overall, the largest single off-diagonal entry is 1.2%, and no similar-activity pair exhibits notable confusion. This balanced high performance is consistent with the 21-channel multi-position sensor input.

A consistent pattern emerges across the three datasets: the misclassifications of DMSCNet concentrate on activity pairs with highly similar physical signals—small-fraction confusion between static postures on UCI-HAR, severe modality-induced confusion in the elevator scenario on USC-HAD, and highly balanced classification on RealWorld where sensor information is sufficient.
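The per-class figures quoted in this subsection are entries of row-normalized confusion matrices. A minimal sketch of that computation is given below; the labels are hypothetical and chosen so that two classes are visibly confused.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalised confusion matrix: entry [i][j] is the fraction of
    true-class-i samples predicted as class j."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in counts]

def most_confused_pair(matrix):
    """Largest off-diagonal entry as (rate, true_class, predicted_class)."""
    return max((matrix[i][j], i, j)
               for i in range(len(matrix))
               for j in range(len(matrix)) if i != j)

# Hypothetical labels: class 1 and class 2 are frequently confused
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 1]
cm = confusion_matrix(y_true, y_pred, 3)
print(cm[1], most_confused_pair(cm))
```

The diagonal entries are the per-class accuracies (recalls), and the off-diagonal entries in a row give the direction of each confusion, which is why the two elevator directions can show asymmetric error rates.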

4.6.3. Attention Mechanism Analysis

We analyze CTA behavior in two parts: class-level temporal response patterns (Figure 10, Figure 11 and Figure 12), followed by the dual-path $A_{1}$/$A_{2}$ outputs before and after the differential contrast (Figure 13, Figure 14 and Figure 15). All heatmaps are taken from the CTA output of the deepest DMSC Block.

Figure 10, Figure 11 and Figure 12 show the CTA outputs of the deepest DMSC Block for the 6 classes of UCI-HAR, the 12 classes of USC-HAD, and the 8 classes of RealWorld, respectively. Despite substantial differences in class granularity and sensor configuration, the class-level CTA responses exhibit a consistent “dynamic–static dichotomy.” Dynamic activities (the Walking series of UCI-HAR; the Walk series, Run Fwd, and Jump of USC-HAD; Climb Down/Up, Jumping, Running, and Walking of RealWorld) produce dispersed, locally concentrated red–blue alternating patterns corresponding to rapid changes within the gait cycle. Static activities (Sit/Stand across the three datasets; Lying on RealWorld) concentrate their responses on a few positions near the beginning of the sequence, with other positions near zero, indicating that CTA needs only a small number of temporal anchors to discriminate static scenarios. Extremely static scenarios (Laying on UCI-HAR; Sleep on USC-HAD) show response magnitudes that further decay to 1/5–1/20 of those of dynamic activities, approaching a uniform distribution; in such scenarios CTA does not need to focus on any specific time step to make a reliable prediction.

To quantify the attention behavior, we compute three metrics on the deepest CTA layer (Table 6): the entropy of the CTA response, the cosine similarity between the two paths $A_{1}$ and $A_{2}$, and the sparsity after subtraction. These corroborate the dynamic–static dichotomy. Dynamic activities exhibit mid-range entropy, reflecting dispersed responses across the gait cycle. Extremely static activities such as Sleep approach a uniform distribution (entropy 4.80) with an $A_{1}$–$A_{2}$ cosine close to 1 (0.996), indicating that both paths respond almost entirely to common-mode background, which the subtraction then cancels (sparsity 0.991). The elevator classes show the lowest entropy (Elev Down 2.84), quantifying the onset-spike-with-quiet-tail pattern concentrated at start/stop transients. The consistently high $A_{1}$–$A_{2}$ cosine on static classes quantifies the common-mode component that CTA suppresses, providing explicit numerical support for the common-mode-suppression mechanism.
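The three metrics can be illustrated on synthetic response vectors. The exact definitions used for Table 6 are not spelled out in the text, so the sketch below makes explicit assumptions: entropy is computed in nats on the absolute response normalized to sum to 1, and sparsity is the fraction of positions whose magnitude falls below a hypothetical threshold of $10^{-3}$.

```python
import math

def entropy(response):
    """Shannon entropy (nats) of |response| normalised to sum to 1."""
    total = sum(abs(v) for v in response)
    if total == 0:
        return 0.0
    probs = [abs(v) / total for v in response if v != 0]
    return -sum(p * math.log(p) for p in probs)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def sparsity(response, threshold=1e-3):
    """Fraction of positions whose magnitude is below `threshold`."""
    return sum(abs(v) < threshold for v in response) / len(response)

T = 128
uniform = [1.0 / T] * T                      # pure background response
spike = [0.9] + [0.1 / (T - 1)] * (T - 1)    # onset spike with a quiet tail
print(f"uniform entropy={entropy(uniform):.3f}  spike entropy={entropy(spike):.3f}")
print(f"cosine(uniform, uniform)={cosine_similarity(uniform, uniform):.3f}")
print(f"spike sparsity={sparsity(spike):.3f}")
```

Under these assumptions, a perfectly uniform response over 128 positions has entropy $\ln 128 \approx 4.85$, which gives a sense of scale for the near-uniform value reported for Sleep, while a spike-dominated response has much lower entropy and high sparsity.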

The consistent recurrence of this dichotomy across the three datasets indicates that the temporal selection behavior of CTA aligns with the physical characteristics of the activities, rather than being an incidental bias of the training data.

On USC-HAD, the CTA responses of Elev Up and Elev Down exhibit a pattern distinct from the other classes—an extremely strong local response at the start of the sequence, an order of magnitude higher than in other classes, with the remaining positions close to zero. This “onset spike with a long quiet tail” pattern aligns with the physics of elevator motion, where only the start and stop transients carry discriminable signals, and it echoes from the attention perspective the modality-limit analysis in Section 4.6.2: CTA tends to focus on the start and stop transients, but when the constant-velocity phase carries no usable information, attention alone cannot compensate for the information deficit of the modality itself.

Figure 13, Figure 14 and Figure 15 show the two independent attention paths $A_{1}$ and $A_{2}$, together with the CTA output after their differential contrast, on the three datasets. $A_{1}$ and $A_{2}$ use a sequential blue palette to display their respective positive attention weights, while CTA uses a divergent red–white–blue palette to show both positive and negative polarity responses.

Across the three datasets we observe the same behavior: the strong responses of $A_{1}$ and $A_{2}$ lie at different time steps along the sequence. After the subtraction $A_{1} - \lambda \cdot A_{2}$, the temporal patterns captured by the two paths are preserved in the CTA output with positive and negative polarities respectively. For the Walking sample of UCI-HAR, the strong responses of $A_{1}$ concentrate in the middle of the sequence and those of $A_{2}$ toward the end. On USC-HAD, $A_{1}$ and $A_{2}$ concentrate their strong responses in the front–middle and middle–rear segments respectively. On RealWorld, $A_{1}$ shows a broader response distribution, while $A_{2}$ produces concentrated responses in the latter part of the sequence. In the CTA outputs of all three datasets, the temporal signals from these two types of positions appear clearly distinguished by positive and negative polarity.

This observation reveals the key difference between CTA and a single softmax attention: differential attention does not compress the two paths into a single non-negative scalar; it preserves them as a signed bipolar encoding, with patterns from $A_{1}$ carrying positive polarity and patterns from $A_{2}$ carrying negative polarity. Positive and negative polarities carry complementary temporal information along the channel dimension for subsequent feature fusion. Combined with the F1-score improvements of CTA over SE-only attention reported in Section 4.5 (+0.70, +1.65, and +0.17 percentage points on UCI-HAR, USC-HAD, and RealWorld respectively), this analysis offers a visual explanation of how CTA improves upon standard self-attention.
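The signed bipolar encoding can be reproduced on a toy example. The sketch below applies the subtraction $A_{1} - \lambda \cdot A_{2}$ to one attention row; the score values are hypothetical and chosen so that both paths share a flat background while peaking at different time steps.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def differential_map(scores_1, scores_2, lam):
    """Signed attention row A1 - lam * A2 from two softmax paths."""
    a1, a2 = softmax(scores_1), softmax(scores_2)
    return [x - lam * y for x, y in zip(a1, a2)]

# Hypothetical scores over 6 time steps: a shared background level of 1.0,
# path 1 peaks at step 1 and path 2 peaks at step 4.
scores_1 = [1.0, 4.0, 1.0, 1.0, 1.0, 1.0]
scores_2 = [1.0, 1.0, 1.0, 1.0, 4.0, 1.0]
diff = differential_map(scores_1, scores_2, lam=0.56)
print([round(v, 3) for v in diff])
```

In the output, the position favored by path 1 stays strongly positive, the position favored by path 2 becomes strongly negative, and the background positions shared by both paths shrink toward zero by the factor $(1 - \lambda)$.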

4.7. Noise Robustness

To examine the stability of the dual-path subtraction under noisy and unstable signals, we add perturbations only to the test input on USC-HAD, while the model is trained on clean data, and compare the full DMSCNet with a variant without CTA (Figure 16). We use two perturbations of different nature: Gaussian white noise, which is independent at each time step, and low-frequency baseline drift, which spans the whole sequence and is therefore a common-mode component. The advantage of CTA grows with perturbation strength: the F1-score gap over the variant without CTA increases from $+1.65$ percentage points on clean data to $+5.05$ under Gaussian noise ($\sigma = 0.3$) and to $+10.28$ under baseline drift ($A = 0.5$). CTA degrades more slowly under both perturbations, and the gap widens markedly faster under baseline drift than under Gaussian white noise. This asymmetry is what the common-mode-suppression mechanism predicts: the subtraction cancels the component shared by both paths, such as sequence-wide baseline drift, whereas time-step-independent white noise is not common-mode and is therefore only partially mitigated. The result provides empirical evidence that CTA specifically suppresses common-mode interference and remains stable under noise.
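The two perturbation types can be generated as sketched below. Several details are assumptions made for illustration: the drift is modeled as a slow sinusoid with half a period per window, the same drift is added to every channel, and the window shape (6 channels, 128 steps) is a placeholder; the text specifies only the noise level $\sigma$ and the drift amplitude $A$.

```python
import math
import random

def add_gaussian_noise(window, sigma, rng):
    """Add i.i.d. zero-mean Gaussian noise to every sample of every channel."""
    return [[v + rng.gauss(0.0, sigma) for v in channel] for channel in window]

def add_baseline_drift(window, amplitude, cycles=0.5):
    """Add the same slow sinusoid to all channels (a common-mode term).

    The sinusoidal shape and `cycles` (periods per window) are assumed;
    the text only states that the drift is low-frequency.
    """
    length = len(window[0])
    drift = [amplitude * math.sin(2 * math.pi * cycles * t / length)
             for t in range(length)]
    return [[v + d for v, d in zip(channel, drift)] for channel in window]

rng = random.Random(0)
clean = [[0.0] * 128 for _ in range(6)]    # hypothetical 6-channel, 128-step window
noisy = add_gaussian_noise(clean, sigma=0.3, rng=rng)
drifted = add_baseline_drift(clean, amplitude=0.5)
print(len(noisy), len(noisy[0]), round(max(drifted[0]), 3))
```

The white noise differs at every time step and channel, whereas the drift is one smooth component spanning the whole window, which is the distinction the asymmetry argument above relies on.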

5. Conclusions

We propose DMSCNet, an end-to-end framework for wearable-sensor-based HAR that addresses two structural limitations of existing methods. DMSB widens the temporal receptive field under a controlled parameter budget through a shared bottleneck, parallel dilated convolutions, a pooling bypass, and SE channel recalibration. CTA introduces a dual-path differential structure that cancels common-mode background responses and preserves discriminative differential-mode signals as a signed bipolar encoding. DMSB and CTA are cascaded into a DMSC Block and stacked via residual connections. On the UCI-HAR, USC-HAD, and RealWorld datasets, DMSCNet achieves F1-scores of 97.65%, 91.80%, and 99.05%, respectively, outperforming nine representative baselines. Our interpretation of how CTA operates is further supported quantitatively by the cluster-separability and attention metrics (Table 5 and Table 6), which confirm the dynamic–static dichotomy and the common-mode-suppression behavior rather than relying on qualitative visualization alone.

Several directions remain open for future work. Multi-modal fusion beyond a single inertial sensor could resolve modality-limited activities. Self-supervised pretraining on large-scale unlabeled sensor streams offers a path toward stronger cross-subject generalization. Integration with streaming decoding would enable continuous real-time detection.

Author Contributions

Conceptualization, K.L. and Q.W.; methodology, Q.W.; software, Q.W. and S.C.; validation, Q.W., S.C. and L.W.; formal analysis, Q.W.; investigation, Q.W. and S.C.; resources, K.L.; data curation, Q.W. and S.C.; writing—original draft preparation, Q.W.; writing—review and editing, K.L., S.C. and L.W.; visualization, Q.W.; supervision, K.L.; project administration, K.L.; funding acquisition, K.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the major project of National Natural Science Foundation of China (51991365) and the Natural Science Foundation of Shandong Province of China (ZR2021MF082).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The UCI-HAR, USC-HAD, and RealWorld datasets used in this study are publicly available through their original publications.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

BiLSTM	Bidirectional Long Short-Term Memory
CBAM	Convolutional Block Attention Module
CNN	Convolutional Neural Network
ECA	Efficient Channel Attention
FFN	Feed-Forward Network
GELU	Gaussian Error Linear Unit
GRU	Gated Recurrent Unit
LSTM	Long Short-Term Memory
RMSNorm	Root Mean Square Normalization
RNN	Recurrent Neural Network
SE	Squeeze-and-Excitation
SVM	Support Vector Machine

References

Lara, O.D.; Labrador, M.A. A Survey on Human Activity Recognition Using Wearable Sensors. IEEE Commun. Surv. Tutor. 2013, 15, 1192–1209.
Bulling, A.; Blanke, U.; Schiele, B. A Tutorial on Human Activity Recognition Using Body-Worn Inertial Sensors. ACM Comput. Surv. 2014, 46, 33:1–33:33.
Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; Reyes-Ortiz, J.L. Human Activity Recognition on Smartphones Using a Multiclass Hardware-Friendly Support Vector Machine. In Proceedings of the International Workshop on Ambient Assisted Living (IWAAL), Vitoria-Gasteiz, Spain, 3–5 December 2012; pp. 216–223.
Ronao, C.A.; Cho, S.B. Human Activity Recognition with Smartphone Sensors Using Deep Learning Neural Networks. Expert Syst. Appl. 2016, 59, 235–244.
Ordóñez, F.J.; Roggen, D. Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition. Sensors 2016, 16, 115.
Wang, Z.; Yan, W.; Oates, T. Time Series Classification from Scratch with Deep Neural Networks: A Strong Baseline. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 1578–1585.
Ismail Fawaz, H.; Lucas, B.; Forestier, G.; Pelletier, C.; Schmidt, D.F.; Weber, J.; Webb, G.I.; Idoumghar, L.; Muller, P.A.; Petitjean, F. InceptionTime: Finding AlexNet for Time Series Classification. Data Min. Knowl. Discov. 2020, 34, 1936–1962.
Xu, C.; Chai, D.; He, J.; Zhang, X.; Duan, S. InnoHAR: A Deep Neural Network for Complex Human Activity Recognition. IEEE Access 2019, 7, 9893–9902.
Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271.
Al-qaness, M.A.A.; Dahou, A.; Trouba, N.T.; Abd Elaziz, M.; Helmi, A.M. TCN-Inception: Temporal Convolutional Network and Inception Modules for Sensor-Based Human Activity Recognition. Future Gener. Comput. Syst. 2024, 160, 375–388.
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542.
Yuan, Y.; Xun, G.; Ma, F.; Wang, Y.; Du, N.; Jia, K.; Su, L.; Zhang, A. MuVAN: A Multi-View Attention Network for Multivariate Temporal Data. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 717–726.
Li, B.; Cui, W.; Wang, W.; Zhang, L.; Chen, Z.; Wu, M. Two-Stream Convolution Augmented Transformer for Human Activity Recognition. AAAI Conf. Artif. Intell. 2021, 35, 286–293.
Zhang, S.; Li, Y.; Zhang, S.; Shahabi, F.; Xia, S.; Deng, Y.; Alshurafa, N. Deep Learning in Human Activity Recognition with Wearable Sensors: A Review on Advances. Sensors 2022, 22, 1476.
Saleem, G.; Bajwa, U.I.; Raza, R.H. Toward Human Activity Recognition: A Survey. Neural Comput. Appl. 2023, 35, 4145–4182.
Kaseris, M.; Kostavelis, I.; Malassiotis, S. A Comprehensive Survey on Deep Learning Methods in Human Activity Recognition. Mach. Learn. Knowl. Extr. 2024, 6, 842–876.
Gu, F.; Chung, M.H.; Chignell, M.; Valaee, S.; Zhou, B.; Liu, X. A Survey on Deep Learning for Human Activity Recognition. ACM Comput. Surv. 2022, 54, 177:1–177:34.
Yang, J.; Nguyen, M.N.; San, P.P.; Li, X.; Krishnaswamy, S. Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), Buenos Aires, Argentina, 25–31 July 2015; pp. 3995–4001.
Yao, S.; Hu, S.; Zhao, Y.; Zhang, A.; Abdelzaher, T. DeepSense: A Unified Deep Learning Framework for Time-Series Mobile Sensing Data Processing. In Proceedings of the 26th International Conference on World Wide Web (WWW), Perth, Australia, 3–7 April 2017; pp. 351–360.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
Mahmud, S.; Tonmoy, M.T.H.; Bhaumik, K.K.; Rahman, A.K.M.M.; Amin, M.A.; Shoyaib, M.; Khan, M.A.H.; Ali, A.A. Human Activity Recognition from Wearable Sensor Data Using Self-Attention. In Proceedings of the 24th European Conference on Artificial Intelligence (ECAI), Santiago de Compostela, Spain, 29 August–8 September 2020; pp. 1332–1339.
Dirgová Luptáková, I.; Kubovčík, M.; Pospíchal, J. Wearable Sensor-Based Human Activity Recognition with Transformer Model. Sensors 2022, 22, 1911.
Suh, S.; Rey, V.F.; Lukowicz, P. TASKED: Transformer-Based Adversarial Learning for Human Activity Recognition Using Wearable Sensors via Self-Knowledge Distillation. Knowl.-Based Syst. 2023, 260, 110143.
Wang, Y.; Xu, H.; Liu, Y.; Wang, M.; Wang, Y.; Yang, Y.; Zhou, S.; Zeng, J.; Xu, J.; Li, S. A Novel Deep Multifeature Extraction Framework Based on Attention Mechanism Using Wearable Sensor Data for Human Activity Recognition. IEEE Sens. J. 2023, 23, 7188–7198.
Mekruksavanich, S.; Jitpattanakul, A. Device Position-Independent Human Activity Recognition with Wearable Sensors Using Deep Neural Networks. Appl. Sci. 2024, 14, 2107.
Yin, X.; Liu, Z.; Liu, D.; Ren, X. A Novel CNN-based Bi-LSTM Parallel Model with Attention Mechanism for Human Activity Recognition with Noisy Data. Sci. Rep. 2022, 12, 7878.
Refat, M.A.R.; Hossain, M.P.; Islam, M.R.; Rahman, A.; Al Farid, F.; Karim, H.A.; Miah, A.S.M. A Hybrid LSTM-CNN Model with Efficient Channel Attention for Enhanced Human Activity Recognition Using Wearable Sensors. Discov. Appl. Sci. 2025, 8, 113.
Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating Long Sequences with Sparse Transformers. arXiv 2019, arXiv:1904.10509.
Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention. In Proceedings of the 37th International Conference on Machine Learning (ICML), PMLR, Vienna, Austria, 12–18 July 2020; pp. 5156–5165.
Ye, T.; Dong, L.; Xia, Y.; Sun, Y.; Zhu, Y.; Huang, G.; Wei, F. Differential Transformer. In Proceedings of the International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
Lin, M.; Chen, Q.; Yan, S. Network in Network. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014.
Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450.
Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
Zhang, B.; Sennrich, R. Root Mean Square Layer Normalization. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
Xiong, R.; Yang, Y.; He, D.; Zheng, K.; Zheng, S.; Xing, C.; Zhang, H.; Lan, Y.; Wang, L.; Liu, T. On Layer Normalization in the Transformer Architecture. In Proceedings of the 37th International Conference on Machine Learning (ICML), PMLR, Vienna, Austria, 12–18 July 2020; pp. 10524–10533.
Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2016, arXiv:1606.08415.
Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; Reyes-Ortiz, J.L. A Public Domain Dataset for Human Activity Recognition Using Smartphones. In Proceedings of the 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges, Belgium, 24–26 April 2013; pp. 437–442.
Zhang, M.; Sawchuk, A.A. USC-HAD: A Daily Activity Dataset for Ubiquitous Activity Recognition Using Wearable Sensors. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing (UbiComp), Pittsburgh, PA, USA, 5–8 September 2012; pp. 1036–1043.
Sztyler, T.; Stuckenschmidt, H. On-Body Localization of Wearable Devices: An Investigation of Position-Aware Activity Recognition. In Proceedings of the 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom), Sydney, Australia, 14–19 March 2016; pp. 1–9.
Ek, S.; Portet, F.; Lalanda, P. Transformer-based models to deal with heterogeneous environments in Human Activity Recognition. Pers. Ubiquitous Comput. 2023, 27, 2267–2280.

Figure 1. Overall architecture of DMSCNet. The (left) panel shows the backbone (two stacked ResBlocks and the classification head); the (right) panel expands the internal structure of a single DMSC Block (DMSB module, CTA sublayer, and FFN sublayer).

Figure 2. Schematic of the CTA module. Two attention paths $A_{1}$ and $A_{2}$ both respond to the shared, sequence-wide background but attend to different discriminative positions. The subtraction $A_{1} - \lambda \cdot A_{2}$ cancels the common-mode background and preserves the two paths’ discriminative positions with opposite signs, yielding a signed bipolar encoding (red: from $A_{1}$; blue: from $A_{2}$), which is then multiplied by $V$.
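The cancellation described in this caption can be illustrated with a toy single-query example. The sketch below is not the authors' implementation: the logits, the sequence length, the peak positions, and the choice of λ = 1 are hypothetical values picked so that both paths share an identical background, and a scalar per time step stands in for V.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def differential_attention(scores_1, scores_2, lam):
    """Return A1 - lam * A2 for one query row (hypothetical helper)."""
    a1 = softmax(scores_1)
    a2 = softmax(scores_2)
    return [p - lam * q for p, q in zip(a1, a2)]


T = 8                        # toy sequence length (assumption)
background = [0.5] * T       # shared, sequence-wide response
scores_1 = list(background)
scores_2 = list(background)
scores_1[2] += 3.0           # path 1 locks onto position 2
scores_2[5] += 3.0           # path 2 locks onto position 5

diff = differential_attention(scores_1, scores_2, lam=1.0)

# Weighted sum with one scalar value per time step (stand-in for V).
values = [float(t) for t in range(T)]
output = sum(w * v for w, v in zip(diff, values))

for t, w in enumerate(diff):
    print(f"t={t}: {w:+.4f}")
```

In this toy setting the background positions cancel to (numerically) zero, position 2 keeps a positive weight, and position 5 keeps a negative one. With λ below 1 the shared background would be scaled by (1 − λ) rather than removed entirely.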

Figure 3. Sensitivity of DMSCNet to the three constants $(a, b, c)$ of $\lambda_{\mathrm{init}}(l) = a - b \cdot \exp(-c\,l)$ on USC-HAD. Each constant is perturbed by $\pm 10\%$ with the others fixed. All curves stay within 0.13 percentage points of the baseline, smaller than the run-to-run standard deviation.
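The perturbation protocol in this caption can be sketched as follows. The constants used here are placeholders for illustration only, since this section does not state the values of a, b, and c, and whether the layer index l starts at 0 or 1 is likewise an assumption.

```python
import math


def lambda_init(layer, a, b, c):
    """Evaluate lambda_init(l) = a - b * exp(-c * l)."""
    return a - b * math.exp(-c * layer)


def perturbed_schedules(layers, a, b, c, rel=0.10):
    """Perturb one constant at a time by +/- rel, keeping the others fixed."""
    base = {"a": a, "b": b, "c": c}
    out = {"baseline": [lambda_init(l, a, b, c) for l in layers]}
    for name in base:
        for sign in (-1, +1):
            params = dict(base)
            params[name] = base[name] * (1 + sign * rel)
            label = f"{name}{'+' if sign > 0 else '-'}{round(rel * 100)}%"
            out[label] = [lambda_init(l, **params) for l in layers]
    return out


# Placeholder constants for illustration only; not the values used in the paper.
A, B, C = 0.8, 0.6, 0.3
schedules = perturbed_schedules(layers=[0, 1, 2, 3], a=A, b=B, c=C)
for label, values in schedules.items():
    print(label, [round(v, 4) for v in values])
```

Each of the six perturbed schedules would then be used to initialize λ for one training run, and the resulting F1-scores compared against the baseline schedule.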

Figure 4. t-SNE visualization of UCI-HAR test-set features.

Figure 5. t-SNE visualization of USC-HAD test-set features.

Figure 6. t-SNE visualization of RealWorld test-set features.

Figure 7. Normalized confusion matrix of DMSCNet on the UCI-HAR test set.

Figure 8. Normalized confusion matrix of DMSCNet on the USC-HAD test set.

Figure 9. Normalized confusion matrix of DMSCNet on the RealWorld test set.

Figure 10. Class-level CTA output heatmaps on UCI-HAR for the 6 activities (deepest DMSC Block).

Figure 11. Class-level CTA output heatmaps on USC-HAD for the 12 activities (deepest DMSC Block).

Figure 12. Class-level CTA output heatmaps on RealWorld for the 8 activities (deepest DMSC Block).

Figure 13. Comparison of $A_{1}$, $A_{2}$, and CTA attention for a Walking sample on UCI-HAR.

Figure 14. Comparison of $A_{1}$, $A_{2}$, and CTA attention for a representative sample on USC-HAD.

Figure 15. Comparison of $A_{1}$, $A_{2}$, and CTA attention for a representative sample on RealWorld.

Figure 16. Noise robustness on USC-HAD. Perturbations are added only to the test input, while the model is trained on clean data. (a) Gaussian white noise; (b) low-frequency baseline drift. The advantage of the full model (with CTA) over the variant without CTA widens as the perturbation increases, and more steeply under baseline drift than under Gaussian noise.
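A minimal sketch of the two test-time perturbations named in this caption is given below. The noise level, drift amplitude, drift frequency, and the sinusoidal drift model are all assumptions made for illustration; this section does not specify how the authors generated either perturbation.

```python
import math
import random


def add_gaussian_noise(signal, sigma, rng):
    """Add zero-mean white Gaussian noise with standard deviation sigma."""
    return [x + rng.gauss(0.0, sigma) for x in signal]


def add_baseline_drift(signal, amplitude, freq_hz, sampling_rate_hz, phase=0.0):
    """Add a slow sinusoidal drift (one assumed model of baseline wander)."""
    return [
        x + amplitude * math.sin(2 * math.pi * freq_hz * n / sampling_rate_hz + phase)
        for n, x in enumerate(signal)
    ]


rng = random.Random(0)      # fixed seed so the perturbation is repeatable
fs = 100                    # USC-HAD sampling rate in Hz (Table 1)

# Toy 2 Hz channel lasting two seconds, standing in for one sensor axis.
clean = [math.sin(2 * math.pi * 2.0 * n / fs) for n in range(2 * fs)]

noisy = add_gaussian_noise(clean, sigma=0.1, rng=rng)
drifted = add_baseline_drift(clean, amplitude=0.5, freq_hz=0.2, sampling_rate_hz=fs)
```

Both functions would be applied only to test windows, leaving the training data clean, which matches the protocol stated in the caption.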

Table 1. Overview of the datasets used in our experiments.

Dataset	Subjects	Channels	Sampling Rate (Hz)	Samples	Activity Classes
UCI-HAR	30	9	50	10,299	Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, Laying
USC-HAD	14	6	100	21,608	Walking Forward, Walking Left, Walking Right, Walking Upstairs, Walking Downstairs, Running Forward, Jumping, Sitting, Standing, Sleeping, Elevator Up, Elevator Down
RealWorld	15	21	50	24,665	Climbing Down, Climbing Up, Jumping, Lying, Standing, Sitting, Running, Walking

Table 2. Performance comparison on UCI-HAR, USC-HAD, and RealWorld, reported as mean ± std (%) over 5 independent runs with different random seeds under a fixed data split. UCI-HAR uses its official split; USC-HAD and RealWorld use a random 70/30 split. HART is re-implemented under the same protocol. The best value in each column is shown in bold.

Method	UCI-HAR F1	UCI-HAR Acc	USC-HAD F1	USC-HAD Acc	RealWorld F1	RealWorld Acc
FCN	93.10 ± 0.31	94.41 ± 0.35	88.60 ± 0.38	90.55 ± 0.42	98.62 ± 0.15	98.98 ± 0.12
ResNet	95.25 ± 0.25	96.18 ± 0.22	89.90 ± 0.31	91.68 ± 0.35	98.58 ± 0.13	98.92 ± 0.11
DeepConvLSTM	90.12 ± 0.58	92.45 ± 0.52	86.32 ± 0.68	89.10 ± 0.65	94.25 ± 0.39	95.10 ± 0.42
InceptionTime	95.30 ± 0.18	96.15 ± 0.21	89.45 ± 0.35	91.12 ± 0.38	98.85 ± 0.08	99.02 ± 0.10
TCN-Inception	96.65 ± 0.15	97.18 ± 0.18	90.52 ± 0.36	91.95 ± 0.32	99.01 ± 0.07	99.04 ± 0.08
MuVAN	84.22 ± 0.91	86.42 ± 0.85	69.10 ± 1.22	68.02 ± 1.15	94.18 ± 0.51	95.05 ± 0.55
CNN-A-BiLSTM	94.80 ± 0.24	95.65 ± 0.28	88.15 ± 0.41	90.40 ± 0.45	97.68 ± 0.14	98.15 ± 0.16
TECA-HAR	94.95 ± 0.21	95.72 ± 0.24	89.62 ± 0.34	91.48 ± 0.39	98.42 ± 0.11	98.88 ± 0.13
HART	92.45 ± 0.48	93.65 ± 0.45	86.75 ± 0.72	88.25 ± 0.65	97.52 ± 0.25	97.85 ± 0.22
DMSCNet	97.65 ± 0.14	97.78 ± 0.13	91.80 ± 0.22	92.52 ± 0.21	99.05 ± 0.12	99.12 ± 0.10

Table 3. Cross-subject performance under leave-one-subject-out (LOSO) evaluation, reported as mean ± std (%) across subject folds (14 for USC-HAD, 15 for RealWorld). UCI-HAR is omitted (its official split is already subject-independent). The best value in each column is shown in bold.

Method	USC-HAD F1	USC-HAD Acc	RealWorld F1	RealWorld Acc
FCN	73.12 ± 5.84	75.31 ± 5.21	85.11 ± 4.12	86.41 ± 3.85
ResNet	75.20 ± 4.91	76.84 ± 4.52	86.33 ± 3.56	87.12 ± 3.20
DeepConvLSTM	68.44 ± 6.88	71.05 ± 6.42	79.56 ± 5.65	81.25 ± 5.12
InceptionTime	74.91 ± 4.63	76.50 ± 4.15	87.15 ± 3.42	88.02 ± 3.10
TCN-Inception	76.45 ± 4.12	78.11 ± 3.95	88.20 ± 3.15	89.15 ± 2.85
MuVAN	58.33 ± 7.91	61.20 ± 7.25	76.91 ± 6.48	78.50 ± 6.10
CNN-A-BiLSTM	73.80 ± 5.20	74.15 ± 4.88	85.95 ± 3.90	86.20 ± 3.65
TECA-HAR	75.80 ± 4.60	77.20 ± 4.25	87.05 ± 3.55	87.90 ± 3.22
HART	73.80 ± 6.20	75.12 ± 5.85	89.65 ± 4.35	90.25 ± 4.15
DMSCNet	79.85 ± 3.22 *	81.42 ± 2.85 *	91.50 ± 2.45 *	92.35 ± 2.10 *

* indicates $p < 0.05$ in the paired t-test against the strongest baseline (TCN-Inception).
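The paired t-test over subject folds can be sketched with the standard library alone. The per-fold scores below are synthetic and are not taken from the paper, which reports only the mean and standard deviation across folds. The critical value 2.160 is the two-sided 5% threshold of Student's t distribution with 13 degrees of freedom, matching the 14 USC-HAD folds.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-fold differences (a - b) and its degrees of freedom."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)           # sample standard deviation
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Synthetic per-fold F1 scores for 14 folds; NOT taken from the paper.
model_f1 = [80.0, 78.5, 82.0, 79.0, 81.5, 77.0, 83.0,
            80.5, 79.5, 78.0, 82.5, 81.0, 76.5, 80.0]
baseline_f1 = [77.0, 76.0, 78.5, 77.5, 78.0, 74.5, 79.0,
               78.0, 75.5, 76.5, 78.0, 77.5, 74.0, 77.0]

t_stat, dof = paired_t_statistic(model_f1, baseline_f1)
T_CRIT_TWO_SIDED_05 = 2.160                  # critical value for 13 degrees of freedom
significant = abs(t_stat) > T_CRIT_TWO_SIDED_05
print(f"t = {t_stat:.2f}, dof = {dof}, significant at 0.05: {significant}")
```

Pairing by fold matters here because the large across-subject standard deviations in the table would otherwise mask a consistent per-subject gain. Where SciPy is available, `scipy.stats.ttest_rel` returns the same statistic together with an exact p-value.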

Table 4. Ablation study (F1-score %, mean ± std over 5 runs). Each variant modifies one design choice of the full DMSCNet. The full model is shown in bold.

Configuration	UCI-HAR	USC-HAD	RealWorld
DMSCNet (full)	97.65 ± 0.14	91.80 ± 0.22	99.05 ± 0.12
Attention modules
w/o CTA	96.95 ± 0.18	90.15 ± 0.31	98.88 ± 0.15
w/o CTA & SE (no attention)	96.15 ± 0.22	89.45 ± 0.35	98.58 ± 0.19
DMSB structural components
w/o Shared Bottleneck	97.10 ± 0.16	90.85 ± 0.26	98.92 ± 0.14
w/o Pooling Bypass	96.88 ± 0.19	90.32 ± 0.29	98.75 ± 0.16
Dilation configuration
$(k, d) = (7, 1) (5, 2) (3, 4)$ [ours]	97.65 ± 0.14	91.80 ± 0.22	99.05 ± 0.12
All dilation $= 1$	97.25 ± 0.18	90.88 ± 0.28	98.85 ± 0.15
Single-scale ($k = 3$)	97.05 ± 0.21	90.35 ± 0.32	98.78 ± 0.18
Alt. $(k, d) = (3, 1) (3, 2) (3, 4)$	97.45 ± 0.15	91.25 ± 0.25	98.96 ± 0.13
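The dilation rows of the table can be compared through the temporal span of each branch, which for a single dilated 1-D convolution is d·(k − 1) + 1 input steps. The short sketch below computes that span for each configuration. It assumes that the "All dilation = 1" variant keeps the kernel sizes 7, 5, and 3, which this section does not state explicitly.

```python
def dilated_span(kernel_size, dilation):
    """Number of input time steps covered by one dilated 1-D convolution."""
    return dilation * (kernel_size - 1) + 1


configs = {
    "ours": [(7, 1), (5, 2), (3, 4)],
    # Assumption: kernel sizes are kept and only the dilation is reset to 1.
    "all dilation = 1": [(7, 1), (5, 1), (3, 1)],
    "alt. k = 3": [(3, 1), (3, 2), (3, 4)],
}

spans = {
    name: [dilated_span(k, d) for k, d in branches]
    for name, branches in configs.items()
}

for name, values in spans.items():
    # A branch's weight count per channel pair equals k and ignores dilation.
    weights = sum(k for k, _ in configs[name])
    print(f"{name}: spans={values}, weights per channel pair={weights}")
```

Under these assumptions the proposed configuration covers 7, 9, and 9 steps, against 7, 5, and 3 steps when all dilations are 1 with the same number of weights, which is one way to read the claim that dilation widens the receptive field under a controlled parameter budget.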

Table 5. Cluster separability of the GAP-layer features, computed in the 128-dimensional feature space (not the 2-D t-SNE projection). Higher Silhouette and lower Davies–Bouldin indicate better separability (↑: higher is better; ↓: lower is better).

Dataset	Silhouette ↑	Davies–Bouldin ↓
UCI-HAR	0.659	0.457
USC-HAD	0.572	4.334
RealWorld	0.625	0.501
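Both indices can be computed directly from feature vectors and class labels. The sketch below is a plain-Python illustration on four toy 2-D points rather than the 128-dimensional GAP features, and it assumes Euclidean distance. In practice, scikit-learn's `silhouette_score` and `davies_bouldin_score` in `sklearn.metrics` provide equivalent functions.

```python
import math
from collections import defaultdict


def _group(points, labels):
    """Collect points by cluster label."""
    clusters = defaultdict(list)
    for p, y in zip(points, labels):
        clusters[y].append(p)
    return clusters


def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = _group(points, labels)
    scores = []
    for p, y in zip(points, labels):
        own = clusters[y]
        if len(own) == 1:
            scores.append(0.0)               # usual convention for singletons
            continue
        # Mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for lab, other in clusters.items() if lab != y
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def davies_bouldin(points, labels):
    """Davies-Bouldin index: mean over clusters of the worst similarity ratio."""
    clusters = _group(points, labels)
    centroids = {y: [sum(c) / len(ps) for c in zip(*ps)] for y, ps in clusters.items()}
    scatter = {
        y: sum(math.dist(p, centroids[y]) for p in ps) / len(ps)
        for y, ps in clusters.items()
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


# Toy data: two tight, well-separated clusters.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = [0, 0, 1, 1]
sil = silhouette(toy_points, toy_labels)
dbi = davies_bouldin(toy_points, toy_labels)
print(f"silhouette={sil:.3f}, davies_bouldin={dbi:.3f}")
```

Computing the indices on the features themselves, as the caption specifies, avoids the distortion of distances that a 2-D t-SNE projection introduces.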

Table 6. Quantitative analysis of CTA attention on USC-HAD (deepest CTA layer, one representative sample per class). Entropy measures the concentration of the temporal response; $A_{1}$–$A_{2}$ cosine measures the overlap (common-mode) shared by the two paths; sparsity is the fraction of near-zero entries after subtraction.

Activity	CTA Entropy	$A_{1}$ – $A_{2}$ Cosine	Sparsity
Walk Fwd	4.323	0.412	0.823
Walk Left	4.346	0.347	0.723
Walk Right	4.331	0.384	0.868
Walk Up	4.430	0.533	0.842
Walk Down	4.314	0.326	0.883
Run Fwd	4.411	0.510	0.857
Jump	4.454	0.662	0.619
Sit	4.568	0.991	0.215
Stand	4.240	0.642	0.884
Sleep	4.796	0.996	0.991
Elev Up	3.837	0.487	0.914
Elev Down	2.840	0.417	0.976
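The three columns of this table can be reproduced in form, though not in value, with a few lines of Python. The sketch below makes several assumptions that this section does not confirm: entropy is taken in nats over the absolute values of the signed response normalized to sum to one, the near-zero threshold `eps` is an arbitrary 1e-3, and the two toy attention rows are hypothetical.

```python
import math


def entropy(weights):
    """Shannon entropy in nats of |weights| normalised to sum to 1."""
    mags = [abs(w) for w in weights]
    total = sum(mags)
    if total == 0:
        return 0.0
    return -sum((m / total) * math.log(m / total) for m in mags if m > 0)


def cosine(u, v):
    """Cosine similarity between two attention rows."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def sparsity(weights, eps=1e-3):
    """Fraction of entries whose magnitude is below eps (assumed threshold)."""
    return sum(abs(w) < eps for w in weights) / len(weights)


# Hypothetical attention rows: same background, different peak positions.
a1 = [0.1, 0.1, 0.6, 0.1, 0.1]
a2 = [0.1, 0.1, 0.1, 0.1, 0.6]
cta = [p - q for p, q in zip(a1, a2)]        # subtraction with lambda = 1

print(f"entropy={entropy(cta):.3f}")
print(f"cosine={cosine(a1, a2):.3f}")
print(f"sparsity={sparsity(cta):.3f}")
```

Read this way, a cosine close to 1 means the two paths attend almost identically, so most of the response is common-mode, which is the pattern the table reports for the static Sit and Sleep classes.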
