Article

Silhouette-Based Cross-View Motion Gait Recognition via a Multi-Scale Temporal Difference Unit

1
Department of Physical Education, Northeastern University, Shenyang 110819, China
2
Sport Policy, Management and International Development, Moray House School of Education and Sport, University of Edinburgh, Edinburgh EH8 9YL, UK
3
School of Mechanical Engineering & Automation, Northeastern University, Shenyang 110819, China
4
Artificial Intelligence and Data Business Department, State Power Investment Corporation Digital Technology Co., Ltd., Beijing 102209, China
5
National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 102209, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(12), 2512; https://doi.org/10.3390/electronics15122512 (registering DOI)
Submission received: 27 April 2026 / Revised: 30 May 2026 / Accepted: 5 June 2026 / Published: 7 June 2026

Abstract

Gait is a behavioral biometric trait that enables non-invasive person recognition based on individual walking patterns. Camera-based gait acquisition is convenient, but silhouette sequences often contain substantial motion-irrelevant appearance information, such as body shape, clothing, and carried objects. To address this problem, a Multi-scale Temporal Difference Unit (MTDU) is proposed that computes tensor differences between frames at several temporal intervals, so as to extract dynamic features from gait image sequences. Experiments on the CASIA-B dataset show that the proposed method achieves Rank-1 accuracies of 97.7%, 94.6%, and 80.0% under NM, BG, and CL conditions, respectively. Ablation results further demonstrate that MTDU improves the mean accuracy from 84.7% to 90.8% compared with single-scale temporal differencing. MTDU shows potential for applications including gait detection and recognition in sports, gait-based identity authentication in security surveillance, and gait recovery assessment after sports injuries.

1. Introduction

Gait recognition, as a non-invasive biometric identification technology, typically requires only ordinary camera equipment for data acquisition and can be performed under long-range conditions, thereby significantly reducing dependence on subject cooperation. Unlike static image-based features such as face, fingerprint, and iris, gait features originate from an individual’s unique and temporally continuous movement patterns. Owing to their complexity and habitual nature, such dynamic behavioral biometric features are difficult to counterfeit effectively [1]. Therefore, this technology has important application value in public security surveillance and identity authentication [2].
In addition, gait recognition has shown cross-domain application potential. For example, in clinical medicine and sports science, the movement patterns of subjects can be quantitatively evaluated by combining surveillance cameras, gait analysis systems, and related auxiliary modules [3]. Specifically, in competitive sports scenarios such as badminton, the system can assist in identifying potential sports-injury risks in athletes and objectively analyzing the technical standardization of lower-limb actions, thereby providing data support for sports training and rehabilitation guidance.
Despite its significant advantages, gait recognition technology still faces many challenges in practical deployment and application. In theory, direct extraction of gait features from the temporal domain can obtain essential representations that are less affected by covariates such as viewpoint and clothing [4]. However, compared with audio and static images, video data exhibit higher dimensionality and computational complexity, while redundant information and noise are often hidden between consecutive frames. During vision-based gait acquisition, the system inevitably captures covariates unrelated to identity, such as clothing variation, carried objects, background interference, and walking-direction differences. These additional factors cause substantial visual-representation discrepancies for the same individual under different conditions, severely disturbing stable gait feature extraction and reliable matching, and significantly increasing the complexity and uncertainty of the recognition task. Therefore, how to efficiently and accurately capture and extract implicit gait-habit features from massive video data has become a core scientific problem to be solved in visual gait recognition.
Traditional gait recognition methods generally rely on handcrafted features, usually extracting spatiotemporal features from RGB images or binarized silhouette images. Representative studies include Han et al., who generated the Gait Energy Image (GEI) via temporal average pooling and extracted features accordingly [5], and Wang et al., who proposed a causal intervention method to encode gait sequences [6]. Such template-based methods can integrate sequence information of varying lengths to some extent, but their major drawback is insufficient modeling of dynamic spatiotemporal dependencies between frames. Consequently, under complex real-world scenarios such as cross-view settings, their recognition performance often degrades markedly, and they lack adequate adaptability and generalization to common practical noise (e.g., occlusion and illumination variation).
With the widespread adoption of convolutional neural networks (CNNs), the technical route of gait recognition has shifted from handcrafted feature extraction to using raw gait sequence data as input, explicitly modeling temporal information in network construction to improve recognition accuracy. For instance, Chao et al. treated gait sequences as sets of temporal information, extracted frame-level features with 2D CNNs, and aggregated them into sequence-level representations [7]. Hou et al. proposed the gait lateral network and set residual network, achieving highly discriminative compact-feature learning and collaborative optimization of silhouette-set information, respectively [8]. These methods verified the effectiveness of deep networks on mainstream datasets.
Existing methods can be broadly divided into two lines. One line focuses on silhouette appearance feature extraction, simplifying videos into images or templates and addressing external interference using strategies such as static learning and gait partitioning [9]. Although simple and efficient, this line overlooks dynamic features and thus faces difficulty in further performance improvement. The other line emphasizes spatiotemporal feature mining, including skeleton-based methods and deep temporal network methods. Skeleton-based methods can abstract motion features such as stride and joint angle. The OU-MVLP-Pose database released by An et al. provides data support [10], and Topham et al. [11] explored passive gait identification in realistic and uncontrolled environments using deep learning and spatiotemporal biometric representations. Such methods provide an effective solution for mitigating appearance-related interference, but they depend strongly on the accuracy of skeleton extraction algorithms.
In deep temporal networks, long short-term memory (LSTM) and 3D convolution are typical solutions. Sepas-Moghaddam and Etemad used bidirectional recurrent neural networks with attention mechanisms to learn local spatiotemporal relations [12]. Liao et al. integrated dynamic information using LSTM [13]. However, combining LSTM with CNN tends to increase network complexity and requires larger data volume and training expertise. For 3D convolution, Zhang et al. proposed GaitMGL [14], and Huang et al. adopted 3D local convolution and adversarial domain adaptation. Nevertheless, 3D convolution has a large parameter scale and cannot reuse pretrained weights from 2D CNNs, resulting in high training cost [15].
To address bottlenecks such as occlusion and cross-view variation, Fan et al. proposed GaitPart [16], which improves robustness by strengthening local spatial learning and short-range temporal modeling. However, this method still suffers from temporal information loss and insufficient modeling of global features.
Current technology still shows deficiencies in adapting to complex environments, extracting temporal features, and jointly optimizing model efficiency and generalization. Existing solutions generally follow two major routes. The first route integrates static features and performs secondary encoding with convolutional image models. Although computationally efficient to some extent, it inevitably loses global temporal information, which is precisely the key discriminative basis of gait recognition. The second route constructs deep temporal networks and reduces covariate interference using local-view strategies. However, such models often involve a large number of parameters, resulting in complex architectures and significantly increased training cost and computational burden. More importantly, both paradigms struggle to effectively balance covariate suppression and temporal information preservation, thus limiting robustness and generalizability in real complex scenarios.
This work argues that the core task of gait recognition is to extract dynamic motion parameters. However, gait image sequences acquired by imaging devices usually contain substantial information related to individual geometric shape, which constitutes key covariate interference in gait recognition. At present, CNN models that directly use raw gait images as input generally fail to effectively eliminate the negative impact of such geometric information on model complexity and recognition performance. On the other hand, skeleton-sequence-based methods can remove geometric information, but skeleton extraction itself relies on independent preprocessing models. Moreover, due to discrete inter-frame characteristics, the model still needs to infer dynamic motion parameters internally, which makes the architecture redundant and significantly increases training computation cost.
More specifically, the proposed MTDU differs essentially from existing temporal modeling and gait representation methods. Compared with 3D CNN or LSTM approaches, MTDU does not introduce additional heavy network branches to implicitly learn temporal dependencies from video sequences. Instead, it explicitly injects motion priors through temporal differencing, so that dynamic gait cues are highlighted before feature extraction. Compared with single-frame or single-scale differencing strategies, MTDU further introduces multi-scale temporal intervals controlled by K and S, enabling the model to capture both instantaneous motion changes and longer-period gait dynamics within a unified representation. In addition, unlike GaitPart, which improves robustness mainly by dividing silhouettes into local body regions and learning part-level representations, MTDU directly suppresses static appearance information such as body shape, clothing, and carried objects at the input feature level. Therefore, the main contribution of MTDU lies not merely in adding temporal information, but in providing a lightweight, explicit, and multi-scale motion-enhancement mechanism that improves the balance between temporal motion preservation, covariate suppression, and model efficiency.
The core innovation of this paper is the design of a silhouette-based multi-scale temporal differencer module. By performing difference operations on image sequences across different temporal scales and combining 1 × 1 convolution kernels for linear transformation and scaling in feature space, this module adaptively enhances dynamic motion parameter features, thereby providing effective support for subsequent gait feature extraction. The module has the following advantages:
(1)
Introducing adjacent frames and differential frames effectively suppresses geometric features in gait image sequences, reducing covariate interference during feature extraction.
(2)
Difference operations preserve dynamic motion parameters between adjacent frames while integrating temporal information into single-frame images. The differencing operation is parameter-free, and only the scale-fusion 1 × 1 convolution introduces a small number of learnable parameters.
By significantly compressing the feature dimensionality of gait image data while introducing only a small number of learnable parameters, the silhouette-based multi-scale temporal differencer greatly reduces computational complexity and improves recognition efficiency.

2. Gait Recognition Method

2.1. Overall Model Architecture

The gait recognition network proposed in this paper is designed to extract discriminative spatiotemporal features from gait silhouette sequences. As shown in Figure 1, the model adopts an end-to-end deep learning architecture and mainly consists of three core components: a Multi-scale Temporal Difference Unit (MTDU), a motion gait backbone network, and a set pooling module.
The model input is defined as a gait silhouette image sequence $X \in \mathbb{R}^{T \times 1 \times H \times W}$, where $T$ denotes the number of temporal frames, and $H$ and $W$ denote image height and width, respectively. As shown in Figure 1, the forward propagation process can be systematically divided into three stages: motion feature extraction, spatial feature mapping, and temporal feature aggregation.
First, the input sequence $X$ is fed into MTDU, which is designed to extract instantaneous motion information between adjacent and non-adjacent frames. By computing pixel-level differences between the current frame and historical frames at different temporal intervals, MTDU generates a differential feature map sequence $X_{diff}$ containing multi-scale motion information. Next, $X_{diff}$ is fed into the motion gait backbone network. This backbone network is composed of multiple convolutional layers, whose core function is to extract high-dimensional spatial abstract features from each frame of motion difference maps. After frame-wise processing by the backbone network, spatial information of original images is mapped into a set of feature vectors $F = \{f_1, f_2, \dots, f_T\}$, where each $f_t \in \mathbb{R}^C$. Finally, to obtain a feature representation that is insensitive to sequence length and has a global spatiotemporal receptive field, feature sequence $F$ is processed by the set pooling module. This module integrates statistical pooling and Temporal Pyramid Pooling (TPP), aggregating variable-length frame-level features into a fixed-dimensional global gait representation $V$. The representation is then fed to a fully connected classifier for identity recognition.
The overall computation flow can be formulated as:
$$X_{diff} = M(X), \quad F = B(X_{diff}), \quad V = P(F), \quad Y = \mathrm{Classifier}(V)$$
where $M(\cdot)$, $B(\cdot)$, and $P(\cdot)$ denote mapping functions of MTDU, the motion gait backbone network, and the set pooling module, respectively; $Y$ is the predicted probability output.

2.2. Multi-Scale Temporal Differencing

In silhouette-based gait recognition, a key issue for robustness improvement is how to effectively decouple identity-relevant motion features from binarized image sequences while suppressing static interference associated with covariates such as clothing and carried objects. Traditional temporal modeling methods (e.g., 3D CNN or LSTM) often process appearance and motion features jointly, making models prone to overfitting to static appearance silhouettes. To this end, we propose a lightweight and efficient Multi-scale Temporal Difference Unit (MTDU). Through explicit inter-frame differencing, this module forces the model to focus on dynamic changes along the temporal dimension, thereby extracting cleaner gait motion parameters.
Temporal differencing computes the pixel-wise intensity difference between frames separated by a temporal delay $\delta$. Let the input gait silhouette sequence be $X \in \mathbb{R}^{T \times C \times H \times W}$, where $T$ is the sequence length, $C$ is the channel number (typically $C = 1$ for binarized silhouettes), and $H$ and $W$ are the image height and width. For frame $x_t$, the core operation computes the pixel-wise difference between the current and historical frame.
Define temporal delay as $\delta$. The temporal differential feature $d_t^{(\delta)}$ is:
$$d_t^{(\delta)} = x_t - x_{t-\delta}$$
To handle sequence-boundary cases, when $t - \delta < 1$, zero-padding is adopted, i.e., $x_{t-\delta} = 0$. This simple linear operation forms the basis of MTDU. Its output $d_t^{(\delta)}$ is a tensor with the same resolution as the input but with an essentially changed value distribution.
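The differencing step with zero-padding can be sketched in a few lines. This is a minimal pure-Python illustration rather than the authors' implementation: the function name `temporal_difference` and the toy 1 × 4 frames are assumptions made for the example, and indices are 0-based, so the boundary condition becomes `t - delta < 0`.

```python
from typing import List

Frame = List[List[int]]  # one binarized silhouette: H rows of W pixels in {0, 1}


def temporal_difference(frames: List[Frame], delta: int) -> List[Frame]:
    """Return d_t = x_t - x_{t-delta} for every frame, zero-padding the past."""
    if delta < 1:
        raise ValueError("delta must be a positive integer")
    height, width = len(frames[0]), len(frames[0][0])
    zero_frame = [[0] * width for _ in range(height)]
    diffs = []
    for t, current in enumerate(frames):
        # Frames that would lie before the start of the sequence count as zeros.
        previous = frames[t - delta] if t - delta >= 0 else zero_frame
        diffs.append([
            [c - p for c, p in zip(cur_row, prev_row)]
            for cur_row, prev_row in zip(current, previous)
        ])
    return diffs


# Toy 1x4 "silhouette" whose occupied block shifts one pixel right per frame.
sequence = [
    [[1, 1, 0, 0]],
    [[0, 1, 1, 0]],
    [[0, 0, 1, 1]],
]
d1 = temporal_difference(sequence, delta=1)
d2 = temporal_difference(sequence, delta=2)
```

With zero-padding, the first `delta` outputs equal the input frames themselves, so the output sequence keeps the same length and resolution as the input.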
A major challenge in gait recognition is interference from appearance covariates, such as coats, backpacks, or body shape differences. These factors are mainly reflected in static silhouette shape. In traditional direct-silhouette input methods, neural networks tend to memorize these static shape cues, leading to substantial performance degradation in cross-clothing or cross-carrying scenarios.
The temporal differencer exploits the relatively static nature of covariates in short time windows. For static regions in the image (e.g., trunk center and stationary backpack regions), semantic occupancy remains nearly unchanged between adjacent frames, i.e., $x_t(i, j) \approx x_{t-\delta}(i, j)$. Therefore:
$$d_t^{(\delta)}(i, j) = x_t(i, j) - x_{t-\delta}(i, j) \approx 0$$
The model input is a binarized silhouette image, where silhouette pixels are assigned 1 and background pixels are assigned 0. As shown in Figure 2, for visualization, regions with value 1 after differencing are marked in red and regions with value −1 are marked in blue. It should be noted that temporal differencing does not completely eliminate all appearance-related information. As shown in Figure 2, the backpack is still partially visible after differencing because its outer boundary moves together with the human body and therefore produces nonzero edge responses. However, compared with the original silhouette, most interior static regions of the backpack and body are suppressed, since their pixel occupancy remains largely unchanged within a short temporal interval.
Through this operation, large amounts of static background and overlapping internal body regions are set to zero and removed. MTDU can be interpreted as a temporal high-pass filtering operation because it suppresses slowly varying static silhouette regions and highlights frame-to-frame changes, filtering out low-frequency static appearance information while retaining high-frequency dynamic change edges. This enables the model to focus on limb swing and trunk displacement, significantly reducing interference from external conditions such as clothing and improving generalization in complex scenarios.
Beyond noise suppression, a more important role of temporal differencing is explicit motion encoding. For binarized silhouettes, pixel values are in $\{0, 1\}$. The value space of $d_t^{(\delta)}$ expands to $\{-1, 0, 1\}$, with clear physical meaning:
  • Positive region ($d_t^{(\delta)} > 0$): $x_t(i, j) = 1$ and $x_{t-\delta}(i, j) = 0$. This corresponds to regions occupied at time $t$ but not at time $t - \delta$, i.e., new positions or leading edges of motion.
  • Negative region ($d_t^{(\delta)} < 0$): $x_t(i, j) = 0$ and $x_{t-\delta}(i, j) = 1$. This corresponds to regions left at time $t$ but occupied at time $t - \delta$, i.e., old positions or trailing edges.
  • Zero region ($d_t^{(\delta)} = 0$): Static or overlapping regions.
In this way, MTDU transforms motion information originally implicit across frames into spatial distribution features within a single frame. Furthermore, the area of nonzero pixels is related to the magnitude of inter-frame displacement and may reflect relative motion intensity under the same frame rate and view. As shown in Figure 3, when motion speed is low, inter-frame overlap is large and nonzero pixels are fewer in the difference map; when motion speed is high, overlap is smaller, and nonzero pixels are more.
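The three-valued coding and its link to motion intensity can be checked on a toy example. The sketch below is illustrative only: the helper names `difference_map` and `region_counts` and the 1 × 6 frames are assumptions, and the "speed" is simulated by shifting the occupied block by one or three pixels.

```python
from typing import Dict, List

Frame = List[List[int]]


def difference_map(current: Frame, previous: Frame) -> Frame:
    """Pixel-wise x_t - x_{t-delta} for two binarized frames."""
    return [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(current, previous)]


def region_counts(diff: Frame) -> Dict[str, int]:
    """Count leading-edge (+1), trailing-edge (-1), and static (0) pixels."""
    flat = [value for row in diff for value in row]
    return {
        "leading": sum(value == 1 for value in flat),
        "trailing": sum(value == -1 for value in flat),
        "static": sum(value == 0 for value in flat),
    }


previous = [[1, 1, 1, 0, 0, 0]]
slow = [[0, 1, 1, 1, 0, 0]]  # block shifted by one pixel
fast = [[0, 0, 0, 1, 1, 1]]  # block shifted by three pixels

slow_counts = region_counts(difference_map(slow, previous))
fast_counts = region_counts(difference_map(fast, previous))
```

In this toy case the slow shift leaves two nonzero pixels and the fast shift leaves six, which matches the qualitative relation between displacement and nonzero area described above.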
This explicit extraction of motion parameters is highly efficient. Instead of requiring neural networks to implicitly learn the concept of motion via numerous 3D convolution parameters, MTDU directly injects motion priors. This not only reduces feature complexity to be learned and shortens training time, but also significantly decreases parameter count, making the network more lightweight.
Gait involves not only large-amplitude limb swings but also subtle periodic movements of the head, shoulders, and waist. These micro-motion features often contain highly discriminative identity information. However, such subtle spatial changes are easily smoothed out by pooling during repeated downsampling in deep networks.
Placed at the network front end, MTDU operates directly on original-resolution images and can sensitively capture tiny edge displacements. Even a few-pixel head tilt or waist twist can produce clear edge responses in differential maps. Meanwhile, as shown in Figure 4, differential operations approximately preserve human edge contour characteristics across views (because edges are the regions with strongest change). Thus, MTDU removes internal redundant filling while preserving high-frequency edge information that describes posture and motion details, providing rich and fine-grained input for subsequent feature extraction.
A single temporal delay $\delta$ is insufficient to comprehensively characterize complex gait motion. As illustrated in Figure 5, different body parts have different movement frequencies: hand and foot swings are faster and larger in amplitude, whereas trunk displacement is relatively smooth. Smaller $\delta$ values are suitable for capturing instantaneous rapid motion and fine detail changes, while larger $\delta$ values reflect longer-term motion trends and overall posture evolution.
To enhance adaptability to different motion speeds and amplitudes, we extend the basic differencer to a Multi-scale Temporal Difference Unit (MTDU). Two key hyperparameters are introduced: maximum difference size K and stride S. MTDU computes differences between the current frame and multiple historical frames in parallel, then stacks them along the channel dimension.
Specifically, for time step $t$, MTDU generates feature map $M_t$ as:
$$M_t = \mathrm{Concat}\left(d_t^{(1)}, d_t^{(1+S)}, \dots, d_t^{(1+(m-1)S)}\right)$$
where $m$ is the number of scales and satisfies $1 + (m - 1)S \le K$. Through this multi-scale design, the model can simultaneously observe motion states over different past time spans, constructing a comprehensive motion feature space containing short-term transient changes and long-term motion trends. This greatly enriches input information and enables adaptive learning of the optimal motion representation scale for different body parts.
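The delay schedule and channel stacking can be sketched as follows. This is a hedged pure-Python illustration, not the authors' code: `delays` and `multi_scale_difference` are hypothetical names, the nested lists stand in for tensors, and the inner list of `m` maps plays the role of the channel dimension.

```python
from typing import List

Frame = List[List[int]]


def delays(max_size: int, stride: int) -> List[int]:
    """Delays 1, 1+S, 1+2S, ... that do not exceed the maximum size K."""
    if max_size < 1 or stride < 1:
        raise ValueError("K and S must be positive integers")
    return list(range(1, max_size + 1, stride))


def multi_scale_difference(frames: List[Frame], max_size: int, stride: int) -> List[List[Frame]]:
    """For each time step, stack one zero-padded difference map per delay."""
    height, width = len(frames[0]), len(frames[0][0])
    zero_frame = [[0] * width for _ in range(height)]
    stacked = []
    for t, current in enumerate(frames):
        channels = []
        for delta in delays(max_size, stride):
            previous = frames[t - delta] if t - delta >= 0 else zero_frame
            channels.append([
                [c - p for c, p in zip(cur_row, prev_row)]
                for cur_row, prev_row in zip(current, previous)
            ])
        stacked.append(channels)  # m difference maps for time step t
    return stacked


sequence = [
    [[1, 1, 0, 0]],
    [[0, 1, 1, 0]],
    [[0, 0, 1, 1]],
]
M = multi_scale_difference(sequence, max_size=3, stride=1)
```

For K = 3 and S = 1 the schedule is [1, 2, 3], so each time step carries three stacked maps; a stride of 2 with K = 5 would give delays [1, 3, 5].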
After multi-scale differencing and stacking, channel dimension expands from C to C × m . We apply a 1 × 1 convolution followed by LeakyReLU to linearly fuse multi-scale channels and introduce nonlinearity after differencing (equivalent to a fully connected layer over channels in implementation).
Let $W$ be convolution kernel weights and $b$ be bias. The final MTDU output $Y_t$ is:
$$Y_t = \sigma\left(W * M_t + b\right)$$
where “*” denotes convolution and $\sigma(\cdot)$ is the LeakyReLU activation. This layer is crucial: by learning $W$, the network can automatically weight the importance of different delays $\delta$, for example, assigning higher weight to short-term differences in some motion phases and emphasizing long-term differences in others. Bias $b$ allows learning an activation threshold. In differential maps, tiny nonzero values may result from noise; bias plus activation can perform gating/denoising, suppressing irrelevant background perturbations and activating only salient motion regions. This layer can also flexibly adjust output channel dimensionality to match the input interface of subsequent backbone networks.
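Because a 1 × 1 convolution acts independently at each pixel, the fusion step reduces to a weighted sum over the stacked channels followed by the activation. The sketch below assumes a single output channel, hand-picked weights and bias, and a negative slope of 0.01 for LeakyReLU; none of these values are taken from the paper, and `fuse_scales` is a hypothetical name.

```python
from typing import List

Frame = List[List[float]]


def leaky_relu(value: float, negative_slope: float = 0.01) -> float:
    """LeakyReLU; the slope of 0.01 is an assumed default, not from the paper."""
    return value if value >= 0 else negative_slope * value


def fuse_scales(channels: List[Frame], weights: List[float], bias: float,
                negative_slope: float = 0.01) -> Frame:
    """1x1 convolution with one output channel, then LeakyReLU.

    Each output pixel is sigma(sum_k w_k * channel_k[i][j] + b).
    """
    if len(channels) != len(weights):
        raise ValueError("one weight per input channel is required")
    height, width = len(channels[0]), len(channels[0][0])
    return [
        [
            leaky_relu(
                sum(w * ch[i][j] for w, ch in zip(weights, channels)) + bias,
                negative_slope,
            )
            for j in range(width)
        ]
        for i in range(height)
    ]


# Three stacked difference maps (delays 1, 2, 3) for one time step.
channels = [
    [[0, -1, 0, 1]],
    [[-1, -1, 1, 1]],
    [[0, 0, 1, 1]],
]
fused = fuse_scales(channels, weights=[0.5, 0.25, 0.25], bias=-0.5)
```

With a negative bias, pixels that respond in only some scales are pushed toward zero or into the attenuated negative branch, while pixels that respond consistently across scales stay strongly positive, which is one way to read the gating role of the bias described above.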
In summary, MTDU is a preprocessing module integrating covariate suppression, explicit motion feature extraction, and multi-scale spatiotemporal aggregation. At very low computational cost, it effectively addresses appearance interference and motion information extraction difficulties in gait recognition. By combining mathematical differencing with learnable deep convolution, MTDU provides high-SNR and highly discriminative dynamic gait representations for subsequent backbone networks, forming the cornerstone of high-performance recognition in this model.

2.3. Motion Gait Backbone Network

After processing by MTDU, the original binarized silhouette sequence is transformed into a dynamic differential feature map sequence $Y = \{Y_1, Y_2, \dots, Y_T\}$. To map these low-level motion-edge cues into high-dimensional semantic features, we design a motion gait backbone network. We adopt a lightweight classical CNN backbone, similar to the one used in GaitSet, for fair comparison with prior works. The design choice prioritizes simplicity and reproducibility.
The backbone adopts a classical CNN architecture to extract spatial features frame by frame. Unlike 3D convolutional networks for video processing, this backbone runs as a 2D CNN with parameters shared along the temporal dimension. This means each frame’s differential map is processed with identical weights, ensuring consistency in feature extraction while significantly reducing parameter count. Specifically, the backbone consists of stacked Conv-BN-ReLU blocks and max pooling layers. Let the backbone contain $L$ layers, and let the transformation of layer $l$ be $F_l$. For input feature $h_t^{(l-1)}$ at frame $t$, output $h_t^{(l)}$ is computed as:
$$h_t^{(l)} = F_l\left(h_t^{(l-1)}\right) = \mathrm{ReLU}\left(\mathrm{BN}\left(W_l * h_t^{(l-1)}\right)\right)$$
where $W_l$ denotes convolution kernel weights, $\mathrm{BN}(\cdot)$ denotes batch normalization, and $\mathrm{ReLU}(\cdot)$ is the nonlinear activation function. To enlarge the receptive field and reduce spatial resolution, max pooling layers are inserted after specific convolutional blocks.
After deep processing by the backbone, input motion difference maps are mapped to a sequence of high-dimensional feature vectors. Finally, global average pooling (GAP) compresses spatial dimensions $H \times W$ to $1 \times 1$, obtaining frame-level feature representation $F = \{f_1, f_2, \dots, f_T\}$, where $f_t \in \mathbb{R}^D$ and $D$ is the channel dimension. This process effectively abstracts local motion details of each frame into compact semantic vectors, preparing for subsequent temporal aggregation.

2.4. Set Pooling

A major practical challenge in gait recognition is that input sequence length T is often variable and strongly affected by device frame rate and walking speed. Although recurrent neural networks (RNNs) or LSTMs can handle variable-length sequences, their serial computation is difficult to parallelize and they tend to forget early information in long sequences. Therefore, we adopt a set pooling strategy, where we combine order-invariant global pooling with coarse temporal partitioning to balance robustness and temporal structure preservation, generating sequence length-insensitive global representations via statistical aggregation.
The set pooling module aims to extract the most representative global gait patterns from frame-level sequence F. To balance robustness and temporal structure information, we combine global statistical pooling and Temporal Pyramid Pooling (TPP).
Global statistical pooling assumes that gait features are highly redundant over time and that certain key postures (e.g., maximum stride) are decisive for identity recognition. We use max pooling and mean pooling to aggregate whole-sequence information. Max pooling extracts the strongest feature activations in the sequence, capturing the most salient gait cues (e.g., extreme limb swing positions):
$$v_{max} = \max_{t = 1, \dots, T} f_t$$
Mean pooling computes the centroid of sequence features, reflecting average gait state and overall movement style:
$$v_{mean} = \frac{1}{T} \sum_{t=1}^{T} f_t$$
These two statistics characterize the gait set from the perspectives of saliency and overallness, and are completely independent of frame order, thus showing strong robustness to frame rate variation and random frame dropping.
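The order independence of the two statistics is easy to verify numerically. The following is a minimal sketch with made-up two-dimensional frame features; `max_pool` and `mean_pool` are illustrative names, and both operate element-wise over the feature dimension.

```python
from typing import List

Feature = List[float]


def max_pool(features: List[Feature]) -> Feature:
    """Element-wise maximum over all frames."""
    return [max(column) for column in zip(*features)]


def mean_pool(features: List[Feature]) -> Feature:
    """Element-wise mean over all frames."""
    return [sum(column) / len(features) for column in zip(*features)]


frame_features = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
shuffled = [frame_features[2], frame_features[0], frame_features[1]]

v_max = max_pool(frame_features)
v_mean = mean_pool(frame_features)
same_after_shuffle = (
    max_pool(shuffled) == v_max and mean_pool(shuffled) == v_mean
)
```

Reordering the frames leaves both vectors unchanged, which is the permutation invariance the text relies on.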
Although global pooling has permutation invariance, it completely loses temporal evolution information of gait cycles (e.g., order of stepping actions). To recover partial temporal structure while maintaining robustness, we introduce TPP.
TPP divides feature sequence $F$ into subregions at different scales along the temporal dimension and performs local aggregation in each subregion. Let the pyramid-level set be $S = \{1, 2, 4, \dots\}$. For scale $s \in S$, the time axis is evenly split into $s$ bins. The representation of bin $j$ ($1 \le j \le s$), $v_{s,j}$, is the average of frame features in that bin:
$$v_{s,j} = \mathrm{AvgPool}\left(\{f_t \mid t \in \mathrm{Bin}_{s,j}\}\right)$$
When $s = 1$, this degenerates into global average pooling and captures global information.
When $s = 2$, the sequence is split into first and second halves, roughly capturing initial and terminal gait states.
When $s = 4$, the sequence is further subdivided, enabling finer local temporal dependency capture.
Through this multi-scale partitioning, TPP forms a coarse-to-fine temporal feature pyramid. It enables the model to observe gait patterns at different temporal granularities, preserving local temporal order while maintaining tolerance to slight temporal shifts via local pooling.
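The pyramid partitioning can be sketched as below. The text says only that the time axis is split evenly, so the integer bin boundaries used here (`j * T // s`) are an assumption for sequences whose length is not divisible by `s`; `temporal_pyramid_pool` is a hypothetical name.

```python
from typing import List, Sequence

Feature = List[float]


def temporal_pyramid_pool(features: List[Feature],
                          levels: Sequence[int] = (1, 2, 4)) -> List[Feature]:
    """Average frame features inside each temporal bin of each pyramid level."""
    total = len(features)
    pooled = []
    for s in levels:
        for j in range(s):
            # Assumed boundary rule: bin j covers frames [j*T//s, (j+1)*T//s).
            start, end = (j * total) // s, ((j + 1) * total) // s
            bin_frames = features[start:end]
            if not bin_frames:
                raise ValueError("sequence is shorter than the number of bins")
            pooled.append([sum(col) / len(bin_frames) for col in zip(*bin_frames)])
    return pooled


# Four frames with one-dimensional features, in temporal order.
frame_features = [[1.0], [3.0], [5.0], [7.0]]
v_tpp = temporal_pyramid_pool(frame_features)
```

For levels {1, 2, 4} the result holds 1 + 2 + 4 = 7 bin vectors: the global mean, the two half-sequence means, and the four quarter-sequence means.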
The final global gait representation $V$ is the concatenation of all pooled features. Let $\mathrm{Concat}(\cdot)$ denote channel-wise concatenation:
$$V = \mathrm{Concat}\left(v_{max}, v_{mean}, v_{TPP}\right)$$
where $v_{TPP}$ contains flattened vectors of all sub-bin features $v_{s,j}$ across all scales $s \in S$.
With this design, the set pooling module maps an arbitrary-length input sequence $X$ into a fixed $D_{out}$-dimensional embedding vector $V$. If the frame feature dimension is $d$ and the pyramid levels are $\{1, 2, 4\}$, the TPP output dimension is $(1 + 2 + 4)d = 7d$; concatenating the max- and mean-pooled vectors yields $D_{out} = 7d + 2d = 9d$. This vector fuses salient features, average states, and multi-scale temporal structure information, providing a comprehensive and robust identity description for the final classifier. Compared with complex sequence-modeling networks, set pooling is not only computationally efficient but also superior for gait data that are highly periodic and susceptible to interference.
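The fixed output size can be confirmed by pooling sequences of different lengths. This sketch assumes pyramid levels {1, 2, 4}, integer bin boundaries for uneven splits, and synthetic features of dimension 3; `set_pool` is an illustrative name rather than the authors' function.

```python
from typing import List, Sequence

Feature = List[float]


def set_pool(features: List[Feature], levels: Sequence[int] = (1, 2, 4)) -> Feature:
    """Concatenate max pooling, mean pooling, and temporal pyramid pooling."""
    total = len(features)
    if total < max(levels):
        raise ValueError("sequence is shorter than the finest pyramid level")
    columns = list(zip(*features))
    v_max = [max(col) for col in columns]
    v_mean = [sum(col) / total for col in columns]
    v_tpp: Feature = []
    for s in levels:
        for j in range(s):
            # Assumed boundary rule for bins when T is not divisible by s.
            chunk = features[(j * total) // s:((j + 1) * total) // s]
            v_tpp.extend(sum(col) / len(chunk) for col in zip(*chunk))
    return v_max + v_mean + v_tpp


def make_sequence(length: int) -> List[Feature]:
    """Synthetic frame features of dimension d = 3, for illustration only."""
    return [[float(t), float(2 * t), 1.0] for t in range(length)]


short_embedding = set_pool(make_sequence(4))
long_embedding = set_pool(make_sequence(10))
```

Both embeddings have length 9 × 3 = 27 even though the inputs contain 4 and 10 frames, which is the length insensitivity the module is designed for.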

3. Experimental Results and Analysis

3.1. Experimental Setup

A GPU cloud server was used for deep learning training and ablation experiments. The server was equipped with a 25-core Xeon(R) Platinum 8470Q CPU and one RTX 5090 GPU with 32 GB VRAM. The operating system was Ubuntu 22.04, with CUDA 12.8 acceleration and GPU driver version 580.76.05. The Python version was 3.14.0, and the deep learning framework was PyTorch 2.7.0 with torchvision 0.22.0.
For MTDU settings, the maximum difference size K was set to 3 and the stride S to 1. In the triplet loss, the margin M was 0.2, and hyperparameters alpha and beta were both 0.1. The dropout rate was 0.4 and the batch size was 8. During training, the input gait sequence length was fixed to 40 frames. During testing, complete gait sequences were fed into the model without temporal sampling, since the set pooling module accepts variable-length sequences. The Adam optimizer was used in all experiments with a learning rate of 0.001, and training ran for 100,000 iterations.
To further clarify the lightweight characteristic of the proposed MTDU, we provide a parameter analysis of the module. The temporal differencing operation itself is parameter-free, since it only performs pixel-wise subtraction between silhouette frames at different temporal intervals. In our implementation, the silhouette input has one channel, and MTDU uses K = 3 and S = 1; therefore, three temporal difference maps are generated and stacked along the channel dimension. These multi-scale difference maps are then fused by a 1 × 1 convolution. When the output channel is set to one, this convolution contains only three weights and one bias, resulting in four learnable parameters in total.
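The parameter count above follows from the shape of a 1 × 1 convolution. The helper below is a small sketch of that arithmetic; `mtdu_parameter_count` is a hypothetical name, and the second call uses an output width of 32 purely as an illustrative value.

```python
def mtdu_parameter_count(in_channels: int, max_size: int, stride: int,
                         out_channels: int) -> int:
    """Learnable parameters of the 1x1 fusion convolution in MTDU.

    The differencing itself is parameter-free; only the fusion layer counts.
    """
    num_scales = len(range(1, max_size + 1, stride))  # delays 1, 1+S, ... <= K
    fused_in_channels = in_channels * num_scales      # C x m stacked channels
    weights = fused_in_channels * out_channels        # one weight per channel pair
    biases = out_channels
    return weights + biases


paper_setting = mtdu_parameter_count(in_channels=1, max_size=3, stride=1, out_channels=1)
wider_output = mtdu_parameter_count(in_channels=1, max_size=3, stride=1, out_channels=32)
```

With C = 1, K = 3, S = 1 and one output channel this gives 3 + 1 = 4 parameters, in agreement with the count stated in the text.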
The widely used CASIA-B gait database was adopted for evaluation [17]. This large-scale dataset contains gait data from 124 subjects, each recorded from 11 viewing angles. At each view, subjects were recorded under three walking conditions: normal walking (NM), carrying a bag (BG), and wearing a coat (CL). Per view, NM contains 6 sequences, while BG and CL each contain 2 sequences. Each subject therefore has 11 × (6 + 2 + 2) = 110 gait sequences in total.
To ensure fair evaluation, we followed the dataset’s standard protocol. Samples from the first 74 subjects (IDs 001–074) were used as the training set for parameter learning, and the remaining 50 subjects (IDs 075–124) formed the test set. During testing, for each test subject the first 4 NM sequences were used as gallery samples, and the remaining 6 sequences (2 NM, 2 BG, and 2 CL) were used as probe samples, so that robustness under different covariates could be evaluated.
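The protocol described above can be written down as a short sketch. The sequence labels such as `nm-01` and the three-digit view codes are assumed naming conventions used here for illustration; only the counts come from the text.

```python
# Assumed labels: 11 views from 0 to 180 degrees in 18-degree steps.
VIEWS = [f"{angle:03d}" for angle in range(0, 181, 18)]
# Sequences per walking condition at each view, as stated in the text.
CONDITIONS = {"nm": 6, "bg": 2, "cl": 2}
# The first 4 NM sequences form the gallery.
GALLERY_NAMES = {f"nm-{i:02d}" for i in range(1, 5)}


def sequences_for_subject():
    """Enumerate (sequence_name, view) pairs for one subject."""
    return [(f"{cond}-{i:02d}", view)
            for cond, n in CONDITIONS.items()
            for i in range(1, n + 1)
            for view in VIEWS]


train_ids = [f"{i:03d}" for i in range(1, 75)]    # subjects 001-074
test_ids = [f"{i:03d}" for i in range(75, 125)]   # subjects 075-124

seqs = sequences_for_subject()
gallery = [s for s in seqs if s[0] in GALLERY_NAMES]
probe = [s for s in seqs if s[0] not in GALLERY_NAMES]
print(len(seqs), len(train_ids), len(test_ids), len(gallery), len(probe))
```

Per test subject this gives 44 gallery and 66 probe sequence-view pairs, which together account for the 110 sequences per subject.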

3.2. Experimental Results

In gait recognition, a common evaluation metric is rank-1 identification accuracy from the Cumulative Matching Characteristic (CMC) curve, computed by a nearest-neighbor classifier. Specifically, for pedestrian $X_i$ with probe-view feature representation $F(X_i^p)$, view transformation first maps it to the gallery-view feature $F(X_i^g)$, and Euclidean distance is then used as the similarity measure.
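As a minimal sketch of this metric, the snippet below computes rank-1 accuracy with a nearest-neighbor search under Euclidean distance. The view transformation step is omitted, the feature vectors are toy values, and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank1_accuracy(probe, gallery):
    """probe, gallery: lists of (subject_id, feature_vector).
    A probe counts as correct when its nearest gallery sample
    belongs to the same subject. Returns accuracy in percent."""
    correct = 0
    for pid, pfeat in probe:
        nearest_id, _ = min(gallery, key=lambda g: euclidean(pfeat, g[1]))
        correct += (nearest_id == pid)
    return 100.0 * correct / len(probe)


# Toy data: two gallery subjects and three probe samples.
gallery = [("075", [0.0, 0.0]), ("076", [1.0, 1.0])]
probe = [("075", [0.1, 0.0]), ("076", [0.9, 1.2]), ("076", [0.2, 0.1])]
acc = rank1_accuracy(probe, gallery)
print(round(acc, 1))
```

In this toy case the third probe lies closer to the wrong subject, so two of three probes are matched correctly.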
To verify the effectiveness of the proposed method, comparisons were conducted on CASIA-B with several recent methods: GaitSet, GaitPart, GLN, SRN, and ST. These are representative and competitive methods from the literature that have been evaluated on CASIA-B, and they are therefore used as reference baselines for assessing the relative performance of the proposed MTDU method. The results are summarized in Table 1. Overall, the proposed method achieves high recognition accuracy under most views. For clearer analysis, the results in Table 1 are discussed from two aspects: walking conditions (NM, BG, CL) and overall performance across settings.
First, regarding external conditions, Table 1 shows that when testing changes from NM to BG or CL, model accuracy generally drops, indicating significant covariate interference on gait representation. The best reference model accuracies under NM, BG, and CL are 97.5%, 94.3%, and 79.9%, respectively. Under the same setting, the proposed method reaches 97.7%, 94.6%, and 80.0%, respectively. Compared with the strongest reference model in each condition, the proposed method achieves modest absolute gains of 0.2, 0.3, and 0.1 percentage points under NM, BG, and CL, respectively. Although these improvements are small, the mean accuracies under all three walking conditions show consistently positive gains, suggesting a stable performance trend at the condition level rather than an isolated improvement under a single setting. Nevertheless, since repeated experiments with different random seeds were not conducted in this study, these marginal gains should be interpreted cautiously. Therefore, the proposed method is better characterized as achieving competitive performance with slight improvements over the strongest reference models, rather than demonstrating a substantial performance advantage. The improvement is more evident when compared with GaitPart, especially under BG and CL conditions, whereas the margin over the strongest reference models remains relatively limited.
Second, in real applications, collected gait data usually come from arbitrary views and multiple external condition changes. Therefore, overall robustness across conditions is particularly important. Based on Table 1, the overall average accuracies across the three conditions for GaitPart, SRN, and ST are 88.3%, 89.8%, and 90.1%, respectively, whereas the proposed method achieves 90.8%. This corresponds to absolute gains of 2.5, 1.0, and 0.7 percentage points, respectively. This indicates that the proposed method achieves competitive robustness under BG and CL conditions, although its advantage over the strongest reference models is relatively modest.
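The overall averages quoted above follow directly from the per-condition means in Table 1. The short check below reproduces them; the dictionary layout and variable names are illustrative.

```python
# Per-condition mean rank-1 accuracies (NM, BG, CL) taken from Table 1.
CONDITION_MEANS = {
    "GaitPart": (96.2, 91.0, 77.6),
    "SRN": (97.5, 94.3, 77.7),
    "ST": (97.0, 93.3, 79.9),
    "MTDU": (97.7, 94.6, 80.0),
}


def overall(acc):
    """Average over the three walking conditions, rounded to one decimal."""
    return round(sum(acc) / len(acc), 1)


overall_acc = {m: overall(a) for m, a in CONDITION_MEANS.items()}
# Absolute gain of MTDU over each reference method, in percentage points.
gains = {m: round(overall_acc["MTDU"] - overall_acc[m], 1)
         for m in ("GaitPart", "SRN", "ST")}
print(overall_acc)
print(gains)
```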

3.3. Ablation Study

Ablation experiments were conducted on the proposed multi-scale temporal differencer to analyze two questions: whether explicit temporal differencing enhances motion information modeling for silhouette sequence input, and whether extending from single-scale to multi-scale differencing further improves feature discriminability and robustness to external covariates. The training strategy, loss function, backbone network, and set pooling structure were kept unchanged, and only the temporal differencing unit was replaced. Three models were constructed: a baseline without temporal differencing (Baseline), a baseline with single-scale temporal differencing (Baseline + TDU), and the full model with MTDU (Baseline + MTDU).
Performance was evaluated with rank-1 accuracy under the standard CASIA-B protocol and reported for NM, BG, and CL, together with the average to assess overall robustness. To ensure a fair comparison, all three variants in the ablation study share the same backbone network, set pooling module, classifier, loss function, optimizer, batch size, input sequence length, and training schedule. The only difference among them lies in the input temporal preprocessing strategy: Baseline directly uses the original silhouette sequence as input, Baseline + TDU uses single-scale temporal differencing, and Baseline + MTDU uses multi-scale temporal differencing. Therefore, the large performance gap between Baseline and Baseline + TDU is not caused by changes in network depth, training strategy, or feature aggregation, but by the introduction of explicit temporal motion cues.
Table 2 shows rank-1 accuracies of the three models under different conditions. Baseline reaches 76.1% under NM, but drops to 62.3% and 44.8% under BG and CL, indicating that relying only on appearance/static structure features is vulnerable to backpack occlusion and coat-induced shape variation. After introducing single-scale differencing (Baseline + TDU), the average accuracy increases from 61.1% to 84.7%, corresponding to an absolute gain of 23.6 percentage points. In terms of relative improvement, the CL condition shows the largest gain, increasing from 44.8% to 70.2%, corresponding to a relative improvement of 56.7%.
Compared with single-scale differencing, Baseline + MTDU further improves NM/BG/CL performance, with average accuracy reaching 90.8%. This gain is mainly reflected in BG and CL scenarios with stronger covariate interference. It indicates that multi-scale temporal delays can complement motion information that single adjacent-frame differencing cannot capture: shorter delays emphasize instantaneous edge changes and subtle swings, while longer delays help characterize more complete gait cycle dynamics and displacement trends. Through fusing multi-scale differential features, the model learns motion patterns at different temporal granularities, thereby maintaining more stable recognition under stronger appearance variation and more severe occlusion.
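The gains discussed in this subsection can be reproduced from Table 2. The worked arithmetic below uses only the table values; the symbols $\Delta$ and $r$ are notation introduced here for convenience.

```latex
\begin{align*}
\Delta_{\text{Mean}}^{\text{TDU}} &= 84.7 - 61.1 = 23.6 \\
r_{\text{CL}}^{\text{TDU}} &= \frac{70.2 - 44.8}{44.8} \approx 0.567 = 56.7\% \\
\Delta_{\text{NM}}^{\text{MTDU}} &= 97.7 - 94.3 = 3.4 \\
\Delta_{\text{BG}}^{\text{MTDU}} &= 94.6 - 89.7 = 4.9 \\
\Delta_{\text{CL}}^{\text{MTDU}} &= 80.0 - 70.2 = 9.8 \\
\Delta_{\text{Mean}}^{\text{MTDU}} &= 90.8 - 84.7 = 6.1
\end{align*}
```

Here the superscript TDU denotes the change from Baseline to Baseline + TDU, and MTDU denotes the change from Baseline + TDU to Baseline + MTDU, all in percentage points except the relative improvement $r$. The BG and CL gains of MTDU over TDU (4.9 and 9.8) exceed the NM gain (3.4), consistent with the observation that the benefit is concentrated in the conditions with stronger covariate interference.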
Overall, the multi-scale temporal differencer enhances feature discriminability and cross-condition robustness through multi-scale temporal modeling, providing higher-SNR dynamic input features for subsequent backbone and set pooling modules.

3.4. Prototype Demonstration and Potential Applications

The proposed gait recognition method performs well in controlled environments and also shows application potential in highly dynamic sports scenarios such as basketball and badminton [18]. For these two sports, this section builds an online recognition system suitable for different usage scenarios, as shown in Figure 6. The system uses a mobile phone for real-time video acquisition and pushes the image stream to an Ubuntu 22.04 server via Wi-Fi. On the server, YOLO11m performs human target detection, YOLO11s-seg extracts human silhouettes, and the open-source Faiss vector retrieval library serves as the database.
In sports broadcasting and analysis, frequent camera-angle switching, motion blur, and long-distance capture often cause traditional face-based recognition methods to fail. By using gait-specific spatiotemporal motion features as biometric fingerprints and extracting sequence features during running and defensive sliding, continuous cross-camera tracking and identity confirmation of specific players can be achieved [19,20,21]. The system first uses YOLO11m detection to obtain bounding-box image regions, then uses YOLO11s-seg to extract human silhouettes and construct silhouette frames, and finally applies motion gait recognition to generate gait sequence feature vectors. These vectors are compared with the athlete vector database for the current match to produce identity results. Because the method relies on gait rather than on facial or jersey cues, it can remain usable under back-facing postures, blurred faces, or occluded jersey numbers, providing data support for automatic highlight generation and tactical trajectory analysis. Figure 7 shows the UI during recognition. It should be clarified that the detection, segmentation, and retrieval modules described in this section (YOLO11m, YOLO11s-seg, and Faiss) are used only to construct a prototype online demonstration system and are not part of the core gait recognition model proposed in this paper. They serve as front-end preprocessing and retrieval components for practical demonstration purposes.
Gait feature vectors encode not only identity information but also rich semantics of movement states. By computing the feature distance between an athlete’s real-time gait feature vector and that athlete’s baseline-state vector, variation in gait patterns during competition can be quantitatively analyzed [22]. Larger feature deviations often indicate physical decline or compensatory movement and can assist coaches in making timely substitution decisions. In badminton training scenarios, comparing step-feature sequences between trainees and elite players and computing action similarity allows technical movements such as lunges and chassé steps to be guided more intuitively and standardized [23]. Figure 8 shows the UI during system operation.

4. Conclusions

To improve gait recognition performance under complex conditions such as cross-view and cross-appearance variation, this paper addresses temporal modeling of binarized human silhouette sequences and proposes a gait recognition method based on multi-scale temporal differencing. By applying temporal differencing to human silhouette sequences, the method explicitly emphasizes motion gait information and extracts dynamic motion parameters closely related to identity discrimination, thereby representing dynamic walking characteristics. Experiments on the CASIA-B dataset show that the proposed method extracts discriminative gait features and achieves competitive overall performance under the evaluated protocol, with relatively clear gains over GaitPart under the carrying-bag condition. However, the margin over the strongest reference models remains limited.
In addition, all experiments in this study were conducted on the CASIA-B dataset, which mainly reflects controlled cross-view and cross-covariate settings. Although the proposed method shows promising performance and potential applicability in scenarios such as sports motion analysis and cross-camera identity recognition, further validation on more challenging real-world datasets and practical environments is still needed. Future work will evaluate the proposed method in more realistic scenarios involving outdoor environments, viewpoint changes, and complex motion conditions to further verify its robustness and practical applicability.

Author Contributions

Data curation, B.Z. and Z.L.; formal analysis, B.Z. and Z.L.; methodology, B.Z., Z.L. and D.J.; software, B.Z. and D.J.; writing—original draft preparation, B.Z. and D.J.; writing—review and editing, B.Z., Q.M., J.Z. and Z.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Liaoning Provincial Doctoral Research Initiation Foundation, grant number 2025-BS-0118; the Fundamental Research Funds for the Central Universities, grant number N25WPY017; the Liaoning Provincial Education Science Planning Project, grant number JG24CB164; and the China Student Sports Federation, grant number L202503004. The APC was funded by the authors.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The CASIA-B dataset used in this study is publicly available from the Institute of Automation, Chinese Academy of Sciences, subject to its data access policy. The processed experimental results and implementation details are available from the corresponding author upon reasonable request.

Conflicts of Interest

The author Mr. Zihao Xiang, who is affiliated with the State Power Investment Corporation Digital Technology Co., Ltd., confirms that there are no existing conflicts of interest.

References

  1. Huang, Y.; Huang, P. Survey on Appearance-Based Gait Recognition. J. People’s Public Secur. Univ. China (Sci. Technol.) 2025, 31, 1–9.
  2. Jiang, D.; Wang, H.; Li, T.; Gouda, M.A.; Zhou, B. Real-Time Tracker of Chicken for Poultry Based on Attention Mechanism-Enhanced YOLO-Chicken Algorithm. Comput. Electron. Agric. 2025, 237, 110640.
  3. Khaliluzzaman, M.; Uddin, A.; Deb, K.; Hasan, M.J. Person Recognition Based on Deep Gait: A Survey. Sensors 2023, 23, 4875.
  4. Xu, J.; Zhao, X.; Qian, L. Gait Recognition Based on Key Point Motion Trajectory Modeling. J. Northeast. Univ. (Nat. Sci.) 2024, 45, 33–39.
  5. Han, J.; Bhanu, B. Individual Recognition Using Gait Energy Image. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 316–322.
  6. Wang, J.; Hou, S.; Guo, X.; Huang, Y.; Huang, Y.-Z.; Zhang, T.; Wang, L. GaitC3I: Robust Cross-Covariate Gait Recognition via Causal Intervention. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 8057–8070.
  7. Chen, J.; Wang, Z.; Zheng, C.; Zeng, K.; Zou, Q.; Xiong, Z. Understanding Dynamic Associations: Gait Recognition via Cross-View Spatiotemporal Aggregation Network. IEEE Trans. Circuits Syst. Video Technol. 2022, 1–15.
  8. Hou, S.; Cao, C.; Liu, X.; Huang, Y. Gait Lateral Network: Learning Discriminative and Compact Representations for Gait Recognition. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; Volume 12354, pp. 382–398.
  9. Dutta, S.; Mitra, A.; Paul, S. Gait Analysis in the Age of Artificial Intelligence: A Comprehensive Review of Advances, Challenges and Future Directions. Int. J. Innov. Comput. Appl. 2025, 15, 220–235.
  10. An, W.; Yu, S.; Makihara, Y.; Wu, X.; Xu, C.; Yu, Y.; Liao, R.; Yagi, Y. Performance Evaluation of Model-Based Gait on Multi-View Very Large Population Database with Pose Sequences. IEEE Trans. Biom. Behav. Identity Sci. 2020, 2, 421–430.
  11. Topham, L.K.; Khan, W.; Al-Jumeily, D.; Kolivand, H.; Aldhaibani, O.; Hussain, A. Enabling Passive Gait Identification in Realistic and Uncontrolled Environments Using Deep Learning and Spatiotemporal Biometrics. Int. J. Intell. Syst. 2026, 2026, 9024180.
  12. Sepas-Moghaddam, A.; Etemad, A. Deep Gait Recognition: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 264–284.
  13. Liao, R.; Cao, C.; Garcia, E.B.; Yu, S.; Huang, Y. Pose-Based Temporal-Spatial Network (PTSN) for Gait Recognition with Carrying and Clothing Variations. In Biometric Recognition; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2017; Volume 10568, pp. 474–483.
  14. Zhang, Z.; Wei, S.; Xi, L.; Wang, C. GaitMGL: Multi-Scale Temporal Dimension and Global–Local Feature Fusion for Gait Recognition. Electronics 2024, 13, 257.
  15. Huang, T.; Ben, X.; Gong, C.; Xu, W.; Wu, Q.; Zhou, H. GaitDAN: Cross-View Gait Recognition via Adversarial Domain Adaptation. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 8026–8040.
  16. Fan, C.; Peng, Y.; Cao, C.; Liu, X.; Hou, S.; Chi, J.; Huang, Y.; Li, Q.; He, Z. GaitPart: Temporal Part-Based Model for Gait Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 14225–14233.
  17. Castro, F.M.; Marín-Jiménez, M.J.; Mata, N.G.; Muñoz-Salinas, R. Fisher Motion Descriptor for Multiview Gait Recognition. Int. J. Pattern Recognit. Artif. Intell. 2017, 31, 1756002.
  18. Hsu, Y.-L.; Chang, H.-C.; Chiu, Y.-J. Wearable Sport Activity Classification Based on Deep Convolutional Neural Network. IEEE Access 2019, 7, 170199–170212.
  19. Wiles, T.M.; Kim, S.K.; Stergiou, N.; Likens, A.D. Pattern Analysis Using Lower Body Human Walking Data to Identify the Gaitprint. Comput. Struct. Biotechnol. J. 2024, 24, 281–291.
  20. Jiang, D.; Kong, L.; Wang, H.; Pan, D.; Li, T.; Tan, J. Precise Control Mode for Concrete Vibration Time Based on Attention-Enhanced Machine Vision. Autom. Constr. 2024, 158, 105232.
  21. Chen, J.; Cui, Y.; Wei, C.; Polat, K.; Alenezi, F. Advances in EEG-Based Emotion Recognition: Challenges, Methodologies, and Future Directions. Appl. Soft Comput. 2025, 180, 113478.
  22. Ferraz, A.; Duarte-Mendes, P.; Sarmento, H.; Valente-Dos-Santos, J.; Travassos, B. Tracking Devices and Physical Performance Analysis in Team Sports: A Comprehensive Framework for Research—Trends and Future Directions. Front. Sports Act. Living 2023, 5, 1284086.
  23. Seong, M.; Kim, G.; Yeo, D.; Kang, Y.; Yang, H.; DelPreto, J.; Matusik, W.; Rus, D.; Kim, S. MultiSenseBadminton: Wearable Sensor-Based Biomechanical Dataset for Evaluation of Badminton Performance. Sci. Data 2024, 11, 343.
Figure 1. Overall architecture of the model.
Figure 2. Visualization of differential operations.
Figure 3. Comparison under different speeds. (a) Low speed; (b) high speed.
Figure 4. Comparison of differential results between frontal and lateral views.
Figure 5. Comparison of differential results at different views with delta = 1, 2, and 3.
Figure 6. Schematic diagram of the online human motion gait recognition system.
Figure 7. User interface of the long-range athlete identification system for large-scale events.
Figure 8. User interface of the badminton motion gait-assisted training system.
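Table 1. Accuracy comparison under different viewing angles and conditions on the CASIA-B dataset (%).

| Condition | Method | 0° | 18° | 36° | 54° | 72° | 90° | 108° | 126° | 144° | 162° | 180° | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NM | GaitSet | 90.8 | 97.9 | 99.4 | 96.9 | 93.6 | 91.7 | 95.0 | 97.8 | 98.9 | 96.8 | 85.8 | 95.0 |
| NM | GaitPart | 94.1 | 98.6 | 99.3 | 98.5 | 94.0 | 92.3 | 95.9 | 98.4 | 99.2 | 97.8 | 90.4 | 96.2 |
| NM | GLN | 93.2 | 99.3 | 99.5 | 98.7 | 96.1 | 95.6 | 97.2 | 98.1 | 99.3 | 98.4 | 92.9 | 97.1 |
| NM | SRN | 94.4 | 99.3 | 99.4 | 98.7 | 96.8 | 96.8 | 97.5 | 98.5 | 99.5 | 98.8 | 92.3 | 97.5 |
| NM | ST | 95.3 | 99.2 | 99.1 | 98.3 | 95.4 | 94.4 | 96.5 | 98.9 | 99.4 | 98.2 | 92.0 | 97.0 |
| NM | MTDU (Ours) | 95.8 | 99.1 | 99.4 | 98.9 | 96.6 | 97.1 | 97.5 | 98.7 | 99.2 | 98.7 | 93.4 | 97.7 |
| BG | GaitSet | 83.8 | 91.2 | 91.8 | 88.8 | 83.3 | 81.0 | 84.1 | 90.0 | 92.2 | 94.4 | 79.0 | 87.2 |
| BG | GaitPart | 83.1 | 94.8 | 96.7 | 95.1 | 88.3 | 84.9 | 89.0 | 93.5 | 96.1 | 93.8 | 85.8 | 91.0 |
| BG | GLN | 91.1 | 97.7 | 97.8 | 95.2 | 92.5 | 91.2 | 92.4 | 96.0 | 97.5 | 95.0 | 88.1 | 94.0 |
| BG | SRN | 91.5 | 97.4 | 98.4 | 97.1 | 92.2 | 89.7 | 93.1 | 96.2 | 97.5 | 96.5 | 88.0 | 94.3 |
| BG | ST | 91.3 | 94.9 | 95.5 | 93.4 | 90.5 | 94.4 | 90.8 | 95.8 | 97.6 | 94.4 | 88.0 | 93.3 |
| BG | MTDU (Ours) | 91.8 | 96.5 | 98.2 | 96.4 | 92.8 | 93.4 | 93.2 | 96.4 | 97.6 | 96.3 | 88.2 | 94.6 |
| CL | GaitSet | 61.4 | 75.4 | 80.7 | 77.3 | 72.1 | 70.1 | 71.5 | 73.5 | 73.5 | 68.4 | 50.0 | 70.4 |
| CL | GaitPart | 70.7 | 85.5 | 86.9 | 83.3 | 77.1 | 72.5 | 76.9 | 82.2 | 83.8 | 68.2 | 66.5 | 77.6 |
| CL | GLN | 70.6 | 82.4 | 85.2 | 82.7 | 79.2 | 76.4 | 76.2 | 78.9 | 77.9 | 78.7 | 64.3 | 77.5 |
| CL | SRN | 69.2 | 82.5 | 84.0 | 81.0 | 78.6 | 76.3 | 78.6 | 82.8 | 80.5 | 76.8 | 64.7 | 77.7 |
| CL | ST | 73.0 | 96.4 | 95.6 | 82.7 | 76.8 | 74.3 | 77.1 | 80.7 | 79.6 | 77.6 | 64.7 | 79.9 |
| CL | MTDU (Ours) | 73.8 | 86.9 | 91.4 | 83.8 | 80.2 | 78.4 | 78.7 | 82.5 | 82.4 | 79.2 | 62.4 | 80.0 |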
Table 2. Rank-1 accuracy of ablation studies on different model configurations (%).

| Model | NM | BG | CL | Mean |
|---|---|---|---|---|
| Baseline | 76.1 | 62.3 | 44.8 | 61.1 |
| Baseline + TDU | 94.3 | 89.7 | 70.2 | 84.7 |
| Baseline + MTDU | 97.7 | 94.6 | 80.0 | 90.8 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, B.; Li, Z.; Ma, Q.; Zhang, J.; Xiang, Z.; Jiang, D. Silhouette-Based Cross-View Motion Gait Recognition via a Multi-Scale Temporal Difference Unit. Electronics 2026, 15, 2512. https://doi.org/10.3390/electronics15122512

