Bias-Alleviated Zero-Shot Sports Action Recognition Enabled by Multi-Scale Semantic Alignment

Qiang Zheng; Wen Qin; Fanyi Meng; Hongyang Liu

doi:10.3390/sym17111959

,

and

¹

College of Physical Education, Woosuk University, Jeonju 55338, Republic of Korea

²

Hebei Academy of Fine Arts, Shijiazhuang 050000, China

^*

Author to whom correspondence should be addressed.

Symmetry2025, 17(11), 1959;https://doi.org/10.3390/sym17111959

This article belongs to the Special Issue Application of Symmetry/Asymmetry and Machine Learning

Version Notes

Order Reprints

Abstract

Zero-shot action recognition remains challenging due to the visual–semantic gap and the persistent bias toward seen classes, particularly under the generalized setting where both seen and unseen categories appear during inference. To address these issues, we propose Multi-Scale Semantic Alignment framework for Zero-Shot Sports Action Recognition (MSA-ZSAR), a framework that integrates a multi-scale spatiotemporal feature extractor to capture both coarse and fine-grained motion dynamics, a dual-branch semantic alignment strategy that adapts to different levels of semantic availability, and a bias-suppression mechanism to improve the balance between seen and unseen recognition. This design ensures that the model can effectively align visual features with semantic representations while alleviating overfitting to source classes. Extensive experiments demonstrate the effectiveness of the proposed framework. MSA-ZSAR achieves 52.8% unseen accuracy, 69.7% seen accuracy, and 61.3% harmonic mean, consistently surpassing prior approaches. These results confirm that the proposed framework delivers balanced and superior performance in realistic generalized zero-shot scenarios.

Keywords:

sports action recognition; generalized zero-shot learning; transductive learning; multi-scale feature extraction; semantic alignment; source class bias

1. Introduction

Symmetry, understood as approximate invariance under left–right reflection, viewpoint rotation and temporal translation, is pervasive in sports motion. Effective recognition systems should enforce invariance or equivariance to these transformations while remaining sensitive to task-relevant asymmetries such as handedness, stance and implement orientation. In the generalized zero-shot setting, distributional asymmetry between seen and unseen classes and the frequent presence of class imbalance further complicate learning. To address these challenges, we adopt symmetry-consistent spatiotemporal encoding together with bias-suppression mechanisms to restore balance and enhance generalization.

With the development of sports technology, action recognition plays a crucial role in training, competitive analysis, and live event broadcasting [,,]. By automating the capture and understanding of movements, coaches can provide targeted guidance, athletes can optimize their training strategies, and event organizers can enhance the viewing experience. However, sports actions are highly complex and diverse. Sports differ markedly from one another, and even within the same discipline motion patterns evolve with individual idiosyncrasies and new techniques, which complicates recognition, especially when zero-shot models must label previously unseen actions. Although traditional supervised learning performs well on known actions, it relies on a large amount of labeled data and is difficult to cope with the rapid iteration of sports actions [,,]. Sports videos contain complex temporal dynamics, making the mapping between visual features and semantic descriptions more challenging [,]. Especially in the case of generalized zero-shot recognition, where both known and new actions need to be distinguished, capturing spatiotemporal features and achieving semantic alignment become key difficulties. At the same time, the continuous emergence of new actions leads to sample scarcity and class imbalance (as shown in Figure 1), further exacerbating model bias and generalization difficulties [,].

Figure 1. The source class bias problem in sports action recognition.

To address the challenges posed by complex temporal dynamics, diverse actions, and the continuous emergence of new categories in sports action recognition, this paper proposes a deep zero-shot action recognition network based on transductive learning. The network incorporates two key components: multi-scale spatio-temporal feature extraction and kinematic semantic mapping. First, a lightweight multi-scale feature extraction module is designed to capture detailed motion patterns and joint dynamics from high-dimensional features. This is achieved through parallel convolutional operations and channel attention mechanisms, which are further enhanced by the integration of 3D convolution models to improve spatio-temporal modeling and generalization capabilities. Second, a semantic mapping layer is introduced to project visual features into a high-dimensional kinematic semantic space, thereby enhancing the model’s generalization performance for unseen actions.

Additionally, embeddings encoding kinematic semantics for seen and unseen classes serve as the initialization of the classifier. By combining the classification loss of source classes with the deviation loss of target classes, the proposed method effectively mitigates the category bias issue commonly encountered in generalized zero-shot learning. Since most existing approaches rely on natural language semantic embeddings—which may not fully capture the spatio-temporal and kinematic characteristics of actions—this paper further introduces a dynamic alignment mechanism. This mechanism aims to reduce cross-modal discrepancies and promote balanced learning between known and novel categories, thereby improving the model’s practicality and robustness in generalized zero-shot recognition of sports actions.

The GZSL setting is adopted as it more faithfully reflects deployment conditions, wherein test instances may arise from both seen and unseen categories. In contrast, conventional ZSL evaluates only unseen classes and implicitly assumes the absence of seen categories at test time—an unrealistic premise that can inflate performance. By introducing a mixed-label search space, GZSL exposes source-class bias, which is specifically addressed through the proposed semantic alignment and bias-suppression mechanisms of MSA-ZSAR.

Two transductive learning strategies are proposed based on the availability of semantic information in the target categories: the semantic-guided transfer (SGT) method, which leverages semantic assistance, and the Semantic-Free Transfer (SFT) method, which operates under semantic deficiency. In the SGT method, pre-constructed kinematic semantic vectors are utilized, and the semantic embeddings of both source and target categories serve as the initialization parameters for the classifier. During the training process, the model is optimized using the classification loss for source categories and the deviation loss for target categories. By incorporating kinematic attributes, the method effectively reduces the discrepancy between visual features and semantic representations, thereby improving the generalization performance in recognizing novel actions.

In contrast, the SFT method generates pseudo-semantic embeddings through unsupervised clustering in the absence of target semantic information. During the training phase, it first acquires prior knowledge from source class data, and subsequently performs transductive learning by combining supervision signals from the source classes with pseudo-labels from the target classes, enabling gradual adaptation to target features. In the inference phase, a two-step procedure—classification followed by recognition—is employed. The method first distinguishes between source and target classes, and then identifies the specific category within the corresponding subspace using the nearest neighbor algorithm. This approach effectively mitigates category confusion and source class bias, thereby enhancing the zero-shot recognition capability for emerging sports actions. The main contributions of this work are summarized as follows:

This paper proposes a progressive MSA-ZSAR framework that decomposes training into source domain learning and transfer learning. The framework first establishes a stable visual–semantic mapping through regression-based alignment, and then extends it with an improved QFSL loss equipped with dynamic loss weighting. This design effectively mitigates class bias while enhancing generalization capability across seen and unseen categories.
To address varying levels of semantic availability, two complementary strategies are developed. SGT leverages pre-defined kinematic embeddings and joint optimization to align visual and semantic spaces, whereas SFT constructs pseudo-semantic prototypes via unsupervised clustering and employs a progressive classify-then-recognize scheme. This dual strategy ensures effective transfer in both rich and limited-semantics scenarios.
Beyond these modules, semantic prototypes are dynamically refined through the CPG mechanism, which expands the semantic space and captures intra-class diversity, particularly for unlabeled categories. Furthermore, a MME strategy aggregates predictions from multiple trained models or checkpoints, reducing variance and improving robustness. Together, these mechanisms yield superior accuracy and stability in generalized zero-shot sports action recognition.

We begin by framing the study’s context and importance. The remainder of the paper is structured as follows: Section 2 surveys related work, highlighting advances and open issues in deep learning for sports action recognition and zero-shot learning; Section 3 details the architecture and core components of the proposed Multi-Scale Semantic Alignment framework; Section 4 describes the experimental protocol and analyzes the results to demonstrate effectiveness and robustness while noting areas for refinement; Section 5 concludes the work and outlines directions for future research.

2. Related Work

2.1. Sports Action Recognition Based on Deep Learning

As computer vision and deep learning advance rapidly, sports action recognition is now widely employed for automated event analytics, athlete coaching, and enhancing spectator experiences [,]. Deep learning facilitates the automatic extraction of spatio-temporal features from video data, enabling precise identification of a wide range of movement actions and offering robust support for efficient sports data analysis and utilization. Kong et al. proposed a robust compressive tracker (CT) with scale adaptability and occlusion recovery capabilities, integrating candidate box generation with long-term recurrent region-guided convolutional networks to enable simultaneous athlete tracking and action recognition []. Tejero-de-Pablos et al. focused on Japanese kendo and developed a video summarization method based on athlete action cues, utilizing deep neural networks to extract action features and segment video clips into exciting and non-exciting segments []. Pratik V et al. introduced a super-long-term convolutional neural network (SLRCNN), which combines LRCNN for action classification with SRCNN to enhance the resolution and visual quality of handball videos []. Wang et al. put forward a spatial representation and motion attention fusion network (SRMA), employing a dual-branch attention module to jointly capture motion information and spatial features, thereby addressing the insufficient modeling of motion information []. Guo et al. devised the BiMACL method, which captures inter-frame correspondences through spatio-temporal attention and cross-Transformer modules while alleviating the interference of similar segments, thus improving recognition performance []. Nevertheless, most existing approaches rely heavily on large-scale labeled training data and are primarily optimized for known categories, resulting in limited recognition capability for novel classes with insufficient samples and a susceptibility to source class bias in the generalized zero-shot learning (GZSL) scenario.

2.2. Zero-Shot Learning and Generalized Zero-Shot Learning

ZSL aims to recognize samples of unseen classes through a shared semantic space during training; GZSL entails recognizing both seen and unseen categories at once, but often suffers from source class bias due to model preference for seen classes, leading to a significant decline in the recognition performance of target classes. To address this, Li et al. introduced the Entropy Guided Reinforced Partial Convolution Network (ERPCNet), which dynamically fuses local information based on semantic and visual correlations without the need for manual annotations, thereby improving the performance of both ZSL and GZSL, while also achieving fast convergence and good interpretability []. In addition, Zhang et al. developed Zero-shot Temporal Activity Detection (ZSTAD), which directly predicts activity instances in videos through an activity graph Transformer and leverages label embeddings to capture semantic commonalities between seen and unseen classes, enabling the effective detection of previously unseen activities []. An over-complete distribution generation scheme based on a CVAE was proposed by Keshari et al., which integrates OBTL and Center Loss to better separate classes and mitigate degradation on unseen categories []. Min et al. proposed the Domain-Aware Visual Bias Elimination Network (DVBE), which tackles both seen and unseen domains by learning complementary semantic-aligned and non-semantic-aligned visual embeddings. It integrates cross-attention–based statistical compression with adaptive-margin Softmax to sharpen inter-class separation, enabling accurate recognition of seen classes and entropy-based discovery of unseen ones []. Zhang et al. designed a dual-branch architecture that maps semantic and visual features into a shared embedding space, incorporating a pseudo-labeling strategy to reduce misclassification among unseen categories []. However, these approaches exhibit certain limitations in the domain of sports action recognition: many are developed on general object or activity datasets, thereby failing to capture the fine-grained spatio-temporal characteristics inherent to sports actions; moreover, in the GZSL setting, the issue of source class bias remains prevalent, which hampers recognition performance for unseen categories.

2.3. Transductive Zero-Shot Learning

Fu et al. proposed a heterogeneous multi-view hypergraph label propagation method that exploits complementary information from multiple semantic representations within a transductive embedding space and leverages the manifold structures across different representation spaces to enhance label propagation effectiveness []. Guo et al. tackled transductive zero-shot recognition by developing a joint learning strategy that constructs a shared model space (SMS), enabling efficient knowledge transfer through attribute-based relationships []. Xu et al. developed GAGCN, which builds visual association graphs to relate unseen actions to similar seen actions and utilizes hierarchical knowledge for robust visual-to-semantic knowledge transfer []. Rahman et al. proposed an improved approach for transductive zero-shot object detection that progressively updates model parameters through self-learning and hybrid pseudo-labeling, while mitigating catastrophic forgetting to reduce misclassification of unseen classes []. Tian et al. presented the Coupled Adversarial Graph Embedding (CAGE) framework, which, in a transductive setting, constructs structured graphs for both seen and unseen videos to support visual-to-semantic embedding, and incorporates adversarial constraints to extract discriminative features of unseen classes and refine shared information, thereby enhancing model adaptability and robustness [].

3. Proposed Method

3.1. Preparation Work

Generalized Zero-Shot Learning (GZSL) is an extension of Zero-Shot Learning (ZSL) [], aiming to recognize instances from both seen classes (denoted as

ys

) and unseen classes (denoted as

yu

) during the testing phase. In contrast to traditional ZSL, which evaluates performance solely on unseen classes

yu

, GZSL better reflects real-world scenarios, where instances from both seen and unseen classes typically coexist. During the training phase, the model is provided with a training set denoted as

D^{s} = {(x_{i}^{s}, y_{i}^{s})}_{i = 1}^{N_{s}}

, where

x_{i}^{s} \in X

represents visual features and

y_{i}^{s} \in Y^{s}

denotes class labels. Additionally, each class

y \in Y^{s} \cup Y^{u}

is associated with a semantic embedding vector

a_{y} \in A

, such as an attribute vector or word vector. During the testing phase, test set

D^{t e s t}

includes samples from

Y^{s} \cup Y^{u}

, requiring the model to perform predictions within the joint label space

Y^{s} \cup Y^{u}

. Since only labeled data from

ys

was utilized during training, the model tends to develop stronger discriminative capability for

ys

in the learned feature space, while exhibiting limited generalization performance for

yu

. This imbalance gives rise to the source bias problem, whereby the model is more prone to misclassifying samples from

yu

as

ys

during inference, resulting in a notable decline in recognition performance for the target classes.

To alleviate source class bias, GZSL typically relies on a shared semantic space to bridge the gap between seen and unseen classes. By learning a mapping function

F : X \to A

that projects visual features into this semantic space, knowledge transfer across categories can be effectively achieved. However, in tasks such as sports action recognition that demand the capture of fine-grained spatio-temporal features, relying solely on semantic space mapping proves insufficient to fully mitigate source class bias. Moreover, such approaches may be hindered by the inherent semantic–visual discrepancy. Accordingly, a central challenge tackled here is developing GZSL approaches that fuse semantic cues with the spatio-temporal dynamics of actions. For clarity and consistency, all abbreviations used in this paper are summarized in Table 1.

Table 1. List of abbreviations used in this paper.

3.2. The Design of the MSA-ZSAR Framework

The model proposed in this study is designed to address the issue of source bias in sports action recognition under the generalized zero-shot setting while effectively leveraging the spatio-temporal and semantic information present in video data. Figure 2 shows the model’s overarching architecture. It comprises four main components: a feature extraction module, a semantic-guided transfer (SGT), a Semantic-Free Transfer (SFT), and a joint classification and reasoning module. These components work in concert to handle both scenarios where target semantics are known and where they are unknown.

Figure 2. The Architecture of MSA-ZSAR Framework for Sports Action Recognition.

(1) Multi-scale Feature Extraction Module: The MSFE is inserted as a plug-in block atop a shared 3D-CNN stem. Each branch factors spatio-temporal processing (e.g., spatial then temporal) and varies temporal stride and dilation while keeping depth and channel budget lightweight. Branch outputs are normalized and passed through residual adapters, then fused by feature concatenation followed by a pointwise projection to unify channel dimensions. The fusion gate comprises a squeeze-and-excitation style channel re-weighting coupled with a temporal self-attention mask; the gate is conditioned on the fused representation and applied multiplicatively to suppress redundant responses while emphasizing salient motion cues. A temporal pyramid–style head aggregates signals at multiple temporal scopes using average pooling with nonoverlapping bins, and the result is projected, globally averaged, and layer-normalized to produce a compact video descriptor. The module reuses backbone parameters wherever possible, adds only shallow lateral paths, and is trained end-to-end with the rest of the network, thereby providing richer multi-granular features to downstream semantic alignment without redesigning the backbone.

(2) Semantic-Guided Transfer (SGT): The SGT module aims to reduce source-class bias in GZSL and achieve consistent recognition across seen and unseen categories. It takes multi-scale visual features extracted by the backbone and projects them into a semantic space, where alignment is performed with class-level semantic embeddings. To better capture semantic dependencies, a Transformer encoder is employed to model contextual relationships within semantic embeddings, thereby producing refined class prototypes. These semantic prototypes then guide cross-modal attention layers to progressively align visual features with their corresponding semantic space. In the second stage, referred to as semantic-guided transductive optimization, unlabeled target domain samples are introduced, and the joint loss function is employed to further refine this model:

L = L_{ce} + β (t) L_{p}

(1)

Among them,

L_{ce}

denotes the cross-entropy loss, which is employed to enhance the classification accuracy of seen categories;

L_{p}

represents the probability maximization (PM) loss, which mitigates source class bias by increasing the confidence in the predicted distribution for unseen categories.

β (t)

is a dynamic weight coefficient that gradually increases during training, enabling a balance between the performance of seen and unseen categories during the model convergence phase (detailed parameters of the loss function are presented in Section 3.3). During training, SGT adopts a source-domain learning strategy in which labeled samples are available only for seen classes. Optimization is carried out by a joint loss composed of cross-entropy (CE), probability maximization (PM), and mean squared error (MSE). The CE term ensures classification accuracy on seen categories, the PM term alleviates source-class bias by encouraging confident predictions, and the MSE term enforces regression of visual features toward semantic vectors to preserve intra-class compactness and inter-class separability. A dynamic weight schedule balances these losses throughout training to improve convergence and generalization.

At inference time, SGT directly leverages the semantic embeddings of both seen and unseen categories, without requiring additional transductive refinement. In this way, SGT provides a robust inductive pathway for semantic alignment, which is later complemented by the SFT branch when semantic availability is limited.

In addition, the multiple interaction layers refers to the stacked Transformer encoder blocks within the SGT module, where cross-modal attention is applied iteratively rather than in a single step. Semantic embeddings are treated as queries and visual features as keys and values, enabling progressive refinement of visual–semantic alignment across layers. In our implementation, three encoder layers are used, which allows the model to capture fine-grained dependencies and mitigate shallow alignment, ultimately improving recognition performance on both seen and unseen classes.

(3) Semantic-Free Transfer (SFT): The SFT module is introduced to address the challenge of source-class bias in GZSL scenarios where semantic annotations are missing or unreliable. Instead of relying on predefined semantic vectors, SFT derives pseudo semantics directly from data. In the initial stage, unsupervised clustering is applied to backbone-extracted visual features, producing cluster centroids that act as pseudo semantic prototypes. These prototypes can be further refined with prior knowledge such as word embeddings. This stage establishes a coarse-grained semantic structure that provides the foundation for subsequent learning. In the source-domain learning stage, labeled samples from seen classes are utilized to calibrate pseudo prototypes and align them with visual features. Optimization integrates cross-entropy for classification accuracy, probability maximization for bias suppression, and mean squared error for prototype regression, thereby progressively improving discriminability in the semantic-free setting.

The final transductive refinement stage incorporates unlabeled target-domain data to further enhance the robustness of the learned prototypes. By employing confidence-weighted pseudo labeling and iterative prototype updates, the model adaptively reshapes its semantic structure to accommodate unseen categories. This process mitigates the impact of noisy pseudo labels and strengthens stability in prediction. Through this multi-stage design, SFT gradually captures increasingly fine-grained categorical distinctions while reducing reliance on noisy or incomplete semantics. As a complementary counterpart to SGT, SFT provides an alternative inductive–transductive pathway that improves balance between seen and unseen recognition performance in generalized zero-shot sports action recognition.

Both SGT and SFT are employed to accommodate heterogeneous semantic availability in GZSL. When reliable descriptors (attributes or word embeddings) exist, SGT is used to realize fine-grained visual–semantic alignment via transformer-based contextual modeling. When semantics are incomplete, noisy, or absent, SFT is applied to construct and refine pseudo-semantic prototypes through unsupervised clustering and multi-stage optimization. By integrating these complementary branches, stable performance is maintained under weak semantics while available semantic knowledge is fully exploited.

Ground-truth (GT) semantics for seen classes are integrated to stabilize cross-modal alignment and to serve as a reference for bias suppression. During training, GT embeddings act as anchors that constrain the visual–semantic mapping, limiting drift when pseudo or noisy semantics are introduced by SFT and enabling effective guidance for SGT. This anchoring yields better-calibrated decision boundaries that generalize to unseen categories, while also providing a consistent benchmark for objective evaluation.

To further reinforce semantic–visual alignment in zero-shot recognition, we incorporate a Clustering-based Prototype Generation (CPG) mechanism that refines class prototypes by clustering visual embeddings to capture intra-class diversity and aligning them with semantic descriptors. For unseen categories, pseudo-prototypes are derived by transferring structural relationships from seen classes, thereby alleviating semantic sparsity and enhancing class discriminability. Moreover, to improve robustness and stability under GZSL, we introduce a Multi-model Ensemble Evaluation (MME) strategy, which aggregates predictions from multiple independently trained models or checkpoints. This ensemble approach effectively reduces variance across random splits, narrows the performance disparity between seen and unseen classes, and yields more reliable generalization performance.

3.3. Design of Loss Function

To effectively enhance the model’s generalization capability on both seen and unseen categories and to mitigate the inherent source-class bias in GZSL, we extend the conventional QFSL loss into a multi-stage and multi-objective formulation. The improved loss is structured around three progressive principles: first establishing a stable mapping, then performing domain adaptation, and finally dynamically balancing discrimination and generalization. Concretely, in the early training phase, a mean squared error (MSE) term is employed to regress visual embeddings toward their semantic prototypes, ensuring intra-class compactness and inter-class separability. In the subsequent stage, a cross-entropy (CE) loss combined with probability maximization (PM) is introduced to strengthen classification accuracy on seen classes while simultaneously increasing confidence for unseen categories, thereby alleviating the bias toward source domains. Finally, a dynamic weighting scheme adjusts the relative importance of CE and PM terms as training progresses, allowing the model to gradually shift from stability to generalization. This staged optimization not only refines semantic–visual alignment but also ensures that the model learns transferable representations that remain robust when evaluated on novel action categories.

Firstly, during the supervised learning stage in the source domain, visual features are aligned to the normalized semantic embedding space using the Semantic Regression Loss, which is mathematically defined as:

L_{reg} = \frac{1}{N} \sum_{i = 1}^{N} {\hat{z}}_{i} - z_{i 2}^{2}

(2)

here,

{\hat{z}}_{i} = \frac{f_{θ} (x_{i})}{‖ f_{θ} (x_{i}) ‖_{2}}

denotes the predicted semantic vector, and z_i represents the semantic embedding of the corresponding category; both are L2-normalized to maintain scale consistency. After a stable visual–semantic mapping is learned at this stage, the target domain features extracted from it are subjected to K-Means clustering to generate pseudo-semantic prototypes, which serve as the initialization for target class embeddings in SFT. Subsequently, transductive learning is performed within the extended semantic space to refine the target reference QFSL framework, which comprises a supervised classification loss and a target class probability maximization loss. The former applies label-smoothed cross-entropy to the seen class sample set S:

L_{ce} = \frac{1}{∣ S ∣} \sum_{i \in S} CE (softmax (o_{i}), y_{i}; ϵ)

(3)

here,

O_{i}

denotes the logits generated by the classifier,

y_{i}

represents the true label,

ϵ

is the smoothing parameter.

For the target class sample set T, the objective is to maximize the total probability mass assigned to the target class set

Y_{u}

, which effectively encourages the model to allocate greater attention to unseen classes during inference.

L_{p} = - \frac{1}{∣ T ∣} \sum_{i \in T} log \sum_{c \in Y^{u}} p_{i, c}

(4)

where

p_{i} = softmax (o_{i})

. By integrating the two aforementioned loss terms, the total loss function for transductive learning can be formulated as follows:

L = L_{ce} + β (t) L_{p}

(5)

Among them,

β (t) = min (2.0, 1.0 + t / 5)

represents a dynamic weight coefficient that increases as the training round t progresses. This design prioritizes the preservation of the discrimination ability for seen classes during the early stages of training, while gradually enhancing the adaptability to unseen classes in the middle and later stages, thereby effectively mitigating the category bias phenomenon in GZSL.

In terms of optimization strategies, this paper adopts the Adam optimizer to ensure more stable gradient updates. During multi-stage training, the learning rate in stage 1 is set to 0.8 of that in stage 0 to prevent excessive shifts in feature maps. Concurrently, the weight decay coefficient is progressively increased in subsequent stages to strengthen the regularization effect. To curb overfitting, the transformer module includes a dropout layer with p = 0.15. Additionally, Batch Normalization is applied in both the SGT and the SFT to enhance the stability of feature distributions.

It is worth emphasizing that this optimization system is applicable in both transductive learning settings. In the SGT mode, it directly aligns and classifies the real semantics of seen and target classes; in contrast, in the SFT mode, it generates proxy semantic prototypes via unsupervised clustering and progressively approximates the true semantic structure through a multi-stage optimization process. Experimental results demonstrate that this multi-objective, multi-stage loss design not only preserves high classification accuracy in the source domain but also significantly enhances the model’s generalization performance in the target domain, thereby effectively mitigating the source class bias issue in GZSL.

4. Experiments

4.1. Experimental Setup

To assess the proposed model’s effectiveness and generalization in athlete action recognition, this study selected two challenging and widely adopted benchmark datasets for video action recognition: HMDB51 and UCF101 [,]. HMDB51 consists of 6766 videos that represent 51 common human actions, including a variety of sports-related actions such as throwing, jumping, and hitting. UCF101 is a larger-scale dataset containing 13,320 videos that cover 101 action categories, such as swimming, basketball, gymnastics, and badminton, which comprehensively reflect the diversity and complexity of athletic movements.

The datasets were first divided in accordance with the standard GZSL protocol, where the split between seen and unseen classes was fixed in advance to avoid any label leakage or semantic bias. Training was conducted exclusively on seen classes, while unseen classes were strictly held out for evaluation. Within the seen classes, an 8:2 division was then applied to create training and validation subsets, ensuring that no unseen samples were ever introduced during training. Specifically, HMDB51 was partitioned into 26 seen and 25 unseen classes, and UCF101 into 51 seen and 50 unseen classes. This division procedure was repeated 15 times to enhance robustness and fairness, while maintaining consistent class frequency across seen and unseen sets. The final reported results are given as the average classification accuracy with standard deviation over the 15 splits, thereby ensuring both statistical reliability and methodological rigor. The results are averaged over 15 fixed seen/unseen class splits, with each split evaluated under five independent random seeds.

Furthermore, to enhance the semantic expression and generalization performance of the model in athlete action recognition, this study incorporates the Kinetics dataset as an external knowledge source. The Kinetics dataset encompasses a broader range of human activity categories and provides a richer source of prior knowledge for recognizing athletic actions. Specifically, Kinetics-400 consists of approximately 220,000 video clips, covering 400 action categories, while Kinetics-700 contains over 500,000 videos spanning 700 action types. By leveraging prior knowledge from this large-scale dataset, the model can achieve high recognition accuracy and robustness, even when confronted with unseen sports events or novel athletic actions.

In the context of GZSL, the model’s prediction scope during the testing phase is not confined to the source class set Ys observed during training, but also encompasses the unseen target class set Yu. This feature holds particular significance for athlete action recognition, as real-world sports scenarios demand that the model not only accurately identify common basic actions (e.g., running, jumping, throwing), but also generalize and recognize novel or complex actions that have not been previously encountered. Consequently, evaluating model performance based solely on the accuracy of source or target classes fails to fully reflect its effectiveness in practical applications.

H = \frac{2 \times A c c (Y s) \times A c c (Y u)}{A c c (Y s) + A c c (Y u)}

(6)

Among them,

A c c (Y s)

denotes the average accuracy of the model on the source classes, while

A c c (Y u)

denotes the average accuracy on the target classes. The harmonic mean effectively prevents the model from exhibiting a lopsided performance in recognizing source versus target classes, thereby yielding a more balanced and equitable assessment of performance across class types. Within the athlete action recognition setting, this evaluation more accurately characterizes generalization and robustness over heterogeneous actions, yielding consistent results for both seen and unseen classes.

The proposed method was designed and implemented based on the PyTorch framework (version 1.12.1), and random initialization of network parameters was ensured. All experiments were conducted on a Dell PowerEdge T630 workstation (Dell Inc., Round Rock, TX, USA) equipped with 128 GB of RAM and an NVIDIA RTX 3090 GPU with 24 GB of VRAM.

4.2. The Effectiveness of Dynamic Loss Weight $β (t)$

In zero-shot action recognition tasks, the training samples exhibit an inherent imbalance between source and target classes. Directly applying a fixed weight during optimization may lead the model to over-rely on the source classes, which in turn weakens its generalization capability on unseen classes. To address this issue, we introduce a dynamic loss weight

β (t)

into the loss function. This weight adaptively adjusts the optimization balance between source and target classes throughout the training process, enabling the model to progressively transfer knowledge from the source class space to the target classes.

To evaluate the effectiveness of the proposed strategy, this experiment designed four comparative schemes, including a fixed weight setting with

β = 1.0

, the linearly increasing strategy proposed in this paper defined as

β (t) = m i n (2.0, 1.0 + t / 5)

, a fixed higher weight with

β = 2.0

, and the removal of the target term retaining only the cross-entropy loss. The experiments were conducted on the UCF101 dataset under the generalized zero-shot learning protocol with 15 random splits. The visual features were extracted using R(2+1)D-18 as the visual encoder, and 300-dimensional semantic embeddings generated by Word2Vec were employed as auxiliary knowledge. In addition to conventional evaluation metrics, Acc(Ys), Acc(Yu), and the harmonic mean H-a newly designed metric, the Bias-Alleviated Recognition Score (BARS), was introduced to better assess model performance:

BARS = λ H + (1 - λ) F 1 (Y_{u})

(7)

Among them,

λ

takes a value of 0.5. This metric further incorporates the bias rates of both the source and target classes based on the harmonic mean, enabling a more comprehensive evaluation of the model’s ability to mitigate training bias and enhance recognition performance for the target class.

It can be observed from the Table 2 that different weighting strategies have a substantial impact on the model’s recognition performance. The upward arrow signifies that higher values are preferable, whereas the downward arrow indicates that lower values are more favorable. Fixed weights of

β = 1.0

and

β = 2.0

both demonstrate high accuracy on seen classes; however, they perform poorly in recognizing unseen classes, resulting in moderate values for the harmonic mean and BARS scores. The strategy of removing the target term further exacerbates category bias, leading to a significant decline in overall performance. In contrast, the linearly increasing weighting strategy

β (t)

proposed in this paper achieves the most balanced performance across both seen and unseen classes. Specifically, the accuracy on unseen classes improves to 50.5%, while the harmonic mean and BARS scores reach 65.0% and 80.0%, respectively, outperforming all other strategies significantly. These results confirm that dynamic loss weighting can effectively mitigate category bias and notably enhance the model’s generalization capability in generalized zero-shot action recognition tasks.

Table 2. The influence of different weighting strategies on the recognition performance of the model.

4.3. Quality of SFT Pseudo-Semantic Prototypes

To provide additional evidence of effectiveness, this work examines how pseudo-semantic prototypes influence performance in generalized zero-shot sports action recognition. Experiments are performed on HMDB51 and UCF101 under the generalized zero-shot setting with 15 random splits. The visual features are extracted using the R(2+1)D-18 network, and 300-dimensional semantic embeddings derived from Word2Vec are employed as auxiliary semantic knowledge.

As shown in Table 3, the experimental results indicate that the SFT strategy achieves a relatively optimal balance between the source and target classes. Specifically, the accuracy of the source class reaches 94.8%, demonstrating that this method maintains strong recognition performance on seen classes while effectively preventing overfitting on the source class. The accuracy of the target class improves to 42.3%, significantly surpassing the baseline model, which confirms the positive impact of pseudo-semantic prototypes on recognizing unseen classes. Moreover, the H of the source and target class accuracies reaches 58.5%, further illustrating that the proposed method effectively mitigates the performance imbalance between classes in the generalized zero-shot learning scenario.

Table 3. Comparison of Comprehensive Indicators of Different Pseudo-Semantic Prototypes.

Specifically, the performance of No Prototype and Random Prototype on the target class was relatively low (Yu < 35%), resulting in both the harmonic mean and BARS being at a low level. This indicates that in the absence of effective prototype constraints, the model tends to overly rely on source class embeddings, thereby amplifying the bias. In contrast, K-Means Prototype (SFT) and GMM Prototype significantly improved the recognition performance of the target class. Among them, K-Means achieved the best performance on Yu (42.3%) and BARS (76.3%), with the harmonic mean also increasing to 58.5%, demonstrating its strong generalization ability. This suggests that unsupervised clustering methods can generate more discriminative pseudo-semantic prototypes, thereby reducing the mismatch between visual features and semantic representations.

Further, the Bias and BARS in Figure 3 provide a more intuitive analysis of bias and balance. It can be observed that the Bias of No Prototype and Random Prototype is relatively large, indicating a significant imbalance between the source class and the target class. However, with the introduction of K-Means and GMM, the Bias decreases significantly, while the BARS index increases, verifying that the pseudo-semantic prototype effectively suppresses the source class bias while enhancing the performance of the target class.

Figure 3. Performance Comparison under Different Pseudo-Semantic Prototype Generation Strategies.

4.4. Multi-Scale Feature Extraction Module (MSFE) and Robustness

In generalized zero-shot action recognition, models typically perform well in controlled environments but often face various types of noise and disturbances—such as random occlusion, blurring, compression, and illumination changes—in real-world applications. To assess the robustness of the proposed MSFE, this experiment evaluates the model’s performance on both clean and perturbed samples.

Specifically, we select the accuracy of the target class, and the harmonic mean as the primary evaluation metrics to analyze the degree of performance degradation under varying perturbation intensities. Furthermore, to quantitatively assess model stability, we define

Δ h %

and

Δ a c c (t) %

as the relative degradation rates under perturbed conditions, which serve as indicators for measuring robustness improvement. Additionally, we incorporate the overall Precision, Recall, and F1-score metrics to evaluate the model’s generalization capability in noisy environments.

A comparative study against the backbone baseline indicates that the MSFE module markedly improves robustness to perturbations in Table 4. Under random occlusion, the baseline exhibits a decline of approximately 7.3% in unseen-class accuracy and 2.8% in harmonic mean, whereas the MSFE-enhanced model shows only a 2.1% and 0.8% decrease, respectively, with the harmonic mean remaining above 60%. These results suggest that MSFE preserves salient motion cues across multiple spatio-temporal scales, thereby mitigating the loss of discriminative capability.

Table 4. Comparative Analysis of Pseudo-Semantic Prototypes.

Beyond robustness, consistent gains in recognition quality are observed with MSFE relative to the baseline. The advantage is attributed to its established components: parallel spatio-temporal branches with heterogeneous receptive fields to capture complementary short- and long-range dynamics; feature concatenation followed by a pointwise projection to enable cross-scale interaction while controlling dimensionality; an attention-based fusion gate that combines channel re-weighting with temporal self-attention to emphasize informative motion patterns and suppress redundancy; and a temporal pyramid–style aggregation head to summarize multi-resolution dynamics prior to global averaging and normalization. Collectively, these design choices yield embeddings with improved class separability, reflected under moderate perturbations by precision, recall, and F1 scores of 93.7%, 90.1%, and 92.2, and translating into superior Acc(Yu) and H within the GZSL setting.

4.5. Bias Suppression Effects of SGT and SFT Branches and Joint Inference

To further verify the complementarity of semantic embeddings and pseudo-semantic prototypes in the generalized zero-shot learning task, this experiment evaluates three reasoning strategies: a single-branch model based solely on semantic embeddings of real categories, a single-branch model relying exclusively on pseudo-semantic prototypes, and a fusion model that combines the predictions of both branches through weighted averaging. By comparing the performance across seen classes, unseen classes, and overall accuracy, this analysis aims to reveal the distinct roles of semantic and pseudo-semantic features in model reasoning and provide a foundation for the subsequent design of more effective fusion mechanisms.

It can be seen from the Table 5 that the SGT-only model achieves the highest accuracy of 93.3% on seen classes. However, its recognition rate on unseen classes is only 41.6%, which results in a substantial drop in the overall harmonic mean to 57.5, indicating a strong bias toward source classes. In contrast, the SFT-only model improves the recognition rate on unseen classes to 44.1%, representing an increase of 2.5 percentage points compared to the SGT-only model, and raises the harmonic mean to 59.7. Nevertheless, its accuracy on seen classes declines slightly to 92.5%, and the model still fails to achieve a balanced performance across both seen and unseen classes. Notably, the naive-fusion strategy achieves a more favorable trade-off between the two tasks. It attains an accuracy of 95.8% on seen classes and significantly improves the recognition rate on unseen classes to 50.5%, surpassing the performance of the single-branch models. The harmonic mean rises to 65.0, indicating improved recognition of unseen classes and a reduction in source class bias.

Table 5. Comparison of Comprehensive Indicators of Different Pseudo-Semantic Prototypes.

To visually substantiate the proposed method’s effectiveness in mitigating bias, Figure 4 compares the performance of various reasoning strategies in the source and target classes in the GZSL setting. It can be seen that single-branch methods suffer from a notable performance discrepancy between source and target classes. Specifically, the SGT-only method favors seen classes, whereas SFT-only, although it improves the recognition of target classes, compromises the performance on source classes. This suggests that a single branch struggles to achieve a balanced performance between the two types of classes. The results also indicate a complementary relationship between semantic embedding and pseudo-semantic prototypes in the reasoning process. In contrast, the fusion strategy significantly boosts the performance on unseen classes while preserving the accuracy on source classes, thereby achieving a more balanced overall performance. The result points are closer to the ideal diagonal, suggesting that the model has effectively reduced the bias toward source classes and achieved a better equilibrium between source and target class performance.

Figure 4. Effect of Fusion on Bias Alleviation in GZSL.

4.6. Ablation Study

To further validate the contribution of each component in the proposed MSA-ZSAR framework, we conduct ablation experiments by incrementally removing or retaining individual modules, including the SGT, SFT, the CPG, and the MME. These studies are designed to reveal the independent role of each module in mitigating source-class bias, enhancing semantic–visual alignment, and improving overall generalization in the GZSL setting. Table 6 reports the results under the GZSL protocol. The baseline corresponds to the backbone with only multi-scale feature extraction, while subsequent rows progressively add SGT, SFT, CPG, and MME. Results are evaluated using Acc(Ys), Acc(Yu), and the H to provide a balanced assessment.

Table 6. Ablation study of MSA-ZSAR modules on GZSL performance.

The results indicate that each component yields a measurable, complementary gain. SGT improves recognition of unseen classes by tightening semantic–visual alignment: transformer-based contextualization reduces manifold mismatch and produces better-calibrated similarities, which directly benefits transfer beyond the source label space. SFT further alleviates source-class bias by constructing and refining data-driven pseudo semantics; this weakly supervised signal expands the effective semantic support where descriptors are sparse, increasing Acc(Yu) with only minor impact on seen-class precision. CPG contributes additional improvements by refining class prototypes: clustering consolidates intra-class structure and suppresses noisy idiosyncrasies, leading to embeddings with higher intra-class compactness and inter-class separability. Finally, MME stabilizes evaluation by aggregating complementary decision tendencies from independently trained models/checkpoints, thereby reducing variance across splits and narrowing the seen–unseen performance gap.

Table 7 shows a monotonic but moderate increase in parameters, GFLOPs, and latency as MSFE and subsequent alignment modules are added, reflecting the added multi-branch processing and fusion. Despite this overhead, the full model remains within sub-10 ms per-clip inference, indicating suitability for real-time use. Overall, the method achieves a favorable accuracy–efficiency balance, with modest complexity increases justified by the observed performance gains.

Table 7. Computational cost of different model variants.

Representative best–worst cases were examined to complement the quantitative results. When actions exhibit distinctive spatio-temporal cues and reliable semantics, the fused model is observed to rank the ground-truth class with a clear posterior margin, yielding an unseen-class accuracy of 50.5% and a harmonic mean of 65.0, compared with 41.6% and 57.5 for the SGT-only variant, which corresponds to gains of 8.9 percentage points in unseen-class accuracy and 7.5 in the harmonic mean. Under weaker semantics or fine-grained ambiguities, the SFT-only variant provides modest relief, improving unseen-class accuracy by 2.5 percentage points and the harmonic mean by 2.2 over SGT-only, while a slight reduction in seen-class accuracy is incurred, indicating the limits of pseudo-semantic guidance in the absence of strong descriptors. Robustness analyses further indicate that, under random occlusion, the baseline experiences larger performance drops—7.3% in unseen-class accuracy and 2.8 in the harmonic mean—whereas the MSFE-enhanced model degrades by only 2.1% and 0.8, with the harmonic mean remaining above 60. Under moderate perturbations, strong detection quality is retained, with precision of 93.7%, recall of 90.1%, and an F1 score of 92.2.

4.7. Comparative Experiments with Existing Methods

To further demonstrate the model’s efficacy and competitiveness in athlete action recognition, this study conducted comparative experiments with several representative zero-shot and generalized zero-shot action recognition methods. Specifically, FGGA integrates GAN-based feature generation with attention-enhanced graph convolution on knowledge graphs for zero-shot recognition []; DASZL introduces a compositional strategy for fine-grained recognition by modeling activities as dynamic action signatures []; GAGCN constructs a visually coherent graph through grouped attention convolution []; and The TLPK method proposes a transductive framework targeting generalized zero-shot sports action recognition []; SDR-CLIP construct a more effective semantic space for zero-shot video action recognition, enabling improved feature generation []; RGSCL proposed a reservation-based gate trained on fictive samples method and improves generative feature fidelity through hypersphere-based semantic contrast [].

For evaluation, two challenging video action recognition datasets, HMDB51 and UCF101, were utilized to assess target class recognition accuracy. Both word embeddings (W) and manually annotated attributes (A) served as semantic embedding approaches, and all methods were evaluated under the same experimental protocol specifically, 15 random splits of source and target classes to enable a fair comparison. The proposed method employs the R(2+1)D-18 network as the visual encoder and integrates semantic generation with bias suppression strategies to improve cross-class generalization. Mean accuracy on target classes is taken as the key metric, enabling a comprehensive assessment of performance and reliability under GZSL protocols.

As shown in Table 8, the proposed method achieves superior target-class recognition on both HMDB51 and UCF101 relative to representative baselines. On the HMDB51 dataset, the target class accuracy of the proposed method reaches 49.2%, approximately 3.6% higher than that of TLPK. For the UCF101 dataset, The algorithm shows an accuracy of 50.5%., outperforming the current best method, TLPK, which achieves 49.9%. Compared with TLPK, improved performance is observed across datasets, and, in contrast to DASZL—which relies on manually curated attributes—the method attains stronger cross-class generalization without additional annotation effort. These results indicate that the approach adopted in this study is better aligned with the demands of generalized zero-shot recognition. The performance gains are attributed to the synergy among the core components introduced in this paper. Multi-scale spatio-temporal encoding yields more transferable video representations, enhancing robustness to intra-class variability and temporal dynamics. The dual-branch design unifies semantic-guided alignment when reliable descriptors are available with semantic-free modeling when semantics are sparse, and their fusion reduces source-class bias while preserving discriminability on seen categories.

Table 8. Comparison of Comprehensive Indicators of Different Pseudo-Semantic Prototypes.

5. Conclusions

This paper addresses the challenges in sports action recognition posed by the complexity and diversity of actions, as well as the continuous emergence of unseen categories. To tackle these issues, we propose a generalized zero-shot action recognition framework, MSA-ZSAR, based on transductive learning. The framework enhances spatio-temporal modeling through a multi-scale feature extraction module and integrates both SGT and SFT modules, thereby maintaining strong generalization capability under varying levels of semantic availability. In addition, the improved loss design, incorporating cross-entropy, probability maximization, and mean squared error regularization, effectively mitigates the source-class bias problem. Extensive experiments demonstrate that the proposed method achieves a best overall accuracy of 50.5% and a harmonic mean of 65.0%, surpassing existing approaches and validating its effectiveness in balancing seen and unseen class recognition. These results also confirm improvements in bias reduction metrics such as the BARS, further supporting the robustness of our framework.

Author Contributions

Q.Z. and H.L. conceived and designed the study; Q.Z. developed the methodology and performed the experiments; F.M. implemented the software and contributed to data processing; W.Q. conducted the literature review and assisted in analysis; H.L. supervised the research and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Mitsuzumi, Y.; Kimura, A.; Irie, G.; Nakazawa, A. Cross-Action Cross-Subject Skeleton Action Recognition via Simultaneous Action-Subject Learning with Two-Step Feature Removal. In Proceedings of the 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 2182–2186. [Google Scholar]
Wang, G.; Guo, J.; Zhang, J.; Qi, X.; Song, H. Design of Human Action Recognition Method Based on Cross Attention and 2s-AGCN Model. In Proceedings of the 2024 IEEE 6th International Conference on Civil Aviation Safety and Information Technology (ICCASIT), Hangzhou, China, 23–25 October 2024; pp. 1341–1345. [Google Scholar]
Chen, Y.; Zhang, J.; Wang, Y. Human action recognition and analysis methods based on OpenPose and deep learning. In Proceedings of the 2024 International Conference on Integrated Circuits and Communication Systems (ICICACS), Raichur, India, 23–24 February 2024; pp. 1–5. [Google Scholar]
Huang, S. Sports Action Recognition and Standardized Training using Puma Optimization Based Residual Network. In Proceedings of the 2024 International Conference on Intelligent Algorithms for Computational Intelligence Systems (IACIS), Hassan, India, 23–24 August 2024; pp. 1–4. [Google Scholar]
Kim, J.S. Efficient human action recognition with dual-action neural networks for virtual sports training. In Proceedings of the 2022 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Yeosu, Republic of Korea, 26–28 October 2022; pp. 1–3. [Google Scholar]
Kim, J.S. Action Recognition System Using Full-body XR Devices for Sports Metaverse Games. In Proceedings of the 2024 15th International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 16–18 October 2024; pp. 1962–1965. [Google Scholar]
Zhou, S. A survey of pet action recognition with action recommendation based on HAR. In Proceedings of the 2022 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Niagara Falls, ON, Canada, 17–20 November 2022; pp. 765–770. [Google Scholar]
Sheikh, A.; Kuhite, C.; Chamate, S.; Raut, V.; Huddar, R.; Bhoyar, K.K. Sports Recognition in Videos Using Deep Learning. In Proceedings of the 2024 2nd International Conference on Emerging Trends in Engineering and Medical Sciences (ICETEMS), Nagpur, India, 22–23 November 2024; pp. 375–379. [Google Scholar]
Mandal, D.; Narayan, S.; Dwivedi, S.K.; Gupta, V.; Ahmed, S.; Khan, F.S.; Shao, L. Out-of-distribution detection for generalized zero-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9985–9993. [Google Scholar]
Qiao, W.; Yang, G.; Gao, M.; Liu, Y.; Zhang, X. Research on Sports Action Analysis and Diagnosis System Based on Grey Clustering Algorithm. In Proceedings of the 2023 Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC), Dalian, China, 14–16 April 2023; pp. 472–475. [Google Scholar]
Liu, Y.; Yuan, J.; Tu, Z. Motion-driven visual tempo learning for video-based action recognition. IEEE Trans. Image Process. 2022, 31, 4104–4116. [Google Scholar] [CrossRef] [PubMed]
Tu, Z.; Xie, W.; Dauwels, J.; Li, B.; Yuan, J. Semantic cues enhanced multimodality multistream CNN for action recognition. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 1423–1437. [Google Scholar] [CrossRef]
Kong, L.; Huang, D.; Qin, J.; Wang, Y. A joint framework for athlete tracking and action recognition in sports videos. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 532–548. [Google Scholar] [CrossRef]
Tejero-de Pablos, A.; Nakashima, Y.; Sato, T.; Yokoya, N. Human action recognition-based video summarization for RGB-D personal sports video. In Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 July 2016; pp. 1–6. [Google Scholar]
Pratik, V.; Palani, S. Super Long Range CNN for Video Enhancement in Handball Action Recognition. In Proceedings of the 2024 3rd International Conference on Artificial Intelligence For Internet of Things (AIIoT), Vellore, India, 3–4 May 2024; pp. 1–6. [Google Scholar]
Wang, K. Research on Application of Computer Vision in Movement Recognition System of Sports Athletes. In Proceedings of the 2024 International Conference on Electronics and Devices, Computational Science (ICEDCS), Marseille, France, 23–25 September 2024; pp. 999–1004. [Google Scholar]
Guo, H.; Yu, W.; Yan, Y.; Wang, H. Bi-directional motion attention with contrastive learning for few-shot action recognition. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 5490–5494. [Google Scholar]
Li, Y.; Liu, Z.; Yao, L.; Wang, X.; McAuley, J.; Chang, X. An entropy-guided reinforced partial convolutional network for zero-shot learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5175–5186. [Google Scholar] [CrossRef]
Zhang, L.; Chang, X.; Liu, J.; Luo, M.; Li, Z.; Yao, L.; Hauptmann, A. TN-ZSTAD: Transferable network for zero-shot temporal activity detection. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3848–3861. [Google Scholar] [CrossRef] [PubMed]
Keshari, R.; Singh, R.; Vatsa, M. Generalized zero-shot learning via over-complete distribution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 13300–13308. [Google Scholar]
Min, S.; Yao, H.; Xie, H.; Wang, C.; Zha, Z.J.; Zhang, Y. Domain-aware visual bias eliminating for generalized zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 12664–12673. [Google Scholar]
Zhang, L.; Wang, P.; Liu, L.; Shen, C.; Wei, W.; Zhang, Y.; Van Den Hengel, A. Towards effective deep embedding for zero-shot learning. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2843–2852. [Google Scholar] [CrossRef]
Fu, Y.; Hospedales, T.M.; Xiang, T.; Gong, S. Transductive multi-view zero-shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 2332–2345. [Google Scholar] [CrossRef] [PubMed]
Guo, Y.; Ding, G.; Jin, X.; Wang, J. Transductive zero-shot recognition via shared model space learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
Xu, Y.; Han, C.; Qin, J.; Xu, X.; Han, G.; He, S. Transductive zero-shot action recognition via visually connected graph convolutional networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 3761–3769. [Google Scholar] [CrossRef] [PubMed]
Rahman, S.; Khan, S.; Barnes, N. Transductive learning for zero-shot object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 6082–6091. [Google Scholar]
Tian, Y.; Huang, Y.; Xu, W.; Kong, Y. Coupling Adversarial Graph Embedding for transductive zero-shot action recognition. Neurocomputing 2021, 452, 239–252. [Google Scholar] [CrossRef]
Zhang, L.; Zhang, G. Semantic feedback for generalized zero-shot learning. In Proceedings of the 2024 4th International Conference on Neural Networks, Information and Communication Engineering (NNICE), Guangzhou, China, 19–21 January 2024; pp. 298–302. [Google Scholar]
Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2556–2563. [Google Scholar]
Soomro, K.; Zamir, A.R.; Shah, M. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar] [CrossRef]
Sun, B.; Kong, D.; Wang, S.; Li, J.; Yin, B.; Luo, X. GAN for vision, KG for relation: A two-stage network for zero-shot action recognition. Pattern Recognit. 2022, 126, 108563. [Google Scholar] [CrossRef]
Kim, T.S.; Jones, J.; Peven, M.; Xiao, Z.; Bai, J.; Zhang, Y.; Qiu, W.; Yuille, A.; Hager, G.D. Daszl: Dynamic action signatures for zero-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 1817–1826. [Google Scholar]
Su, T.; Wang, H.; Qi, Q.; Wang, L.; He, B. Transductive learning with prior knowledge for generalized zero-shot action recognition. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 260–273. [Google Scholar] [CrossRef]
Gowda, S.N.; Sevilla-Lara, L. Telling stories for common sense zero-shot action recognition. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 4577–4594. [Google Scholar]
Shang, J.; Niu, C.; Tao, X.; Zhou, Z.; Yang, J. Generalized zero-shot action recognition through reservation-based gate and semantic-enhanced contrastive learning. Knowl.-Based Syst. 2024, 301, 112283. [Google Scholar] [CrossRef]

Figure 1. The source class bias problem in sports action recognition.

Figure 2. The Architecture of MSA-ZSAR Framework for Sports Action Recognition.

Figure 3. Performance Comparison under Different Pseudo-Semantic Prototype Generation Strategies.

Figure 4. Effect of Fusion on Bias Alleviation in GZSL.

Table 1. List of abbreviations used in this paper.

Abbreviation	Full Term
ZSL	Zero-Shot Learning
GZSL	Generalized Zero-Shot Learning
MSA-ZSAR	Multi-Scale Semantic Alignment framework for Zero-Shot Sports Action Recognition
SGT	Semantic-Guided Transfer
SFT	Semantic-Free Transfer
QFSL Loss	Quasi-Fully Supervised Learning Loss
CPG	Clustering-based Prototype Generation
MME	Multi-model Ensemble Evaluation
CE	Cross-Entropy
PM	Probability Maximization
MSE	Mean Squared Error
BARS	Bias-Alleviated Recognition Score
Acc(Ys)	Accuracy on Seen Classes
Acc(Yu)	Accuracy on Unseen Classes
H	Harmonic Mean

Table 2. The influence of different weighting strategies on the recognition performance of the model.

Strategy ( $β$ )	Acc(Ys) ↑	Acc(Yu) ↑	H ↑	P(Yu) ↑	R(Yu) ↑	F1(Yu) ↑	BARS ↑
1.0	90.5 ± 1.6	42.2 ± 2.0	55.7 ± 4.5	95.4 ± 0.7	82.2 ± 2.5	90.8 ± 0.8	73.2 ± 2.3
min(2.0, 1.0+t/5)	95.9 ± 1.2	50.5 ± 1.9	65.0 ± 1.5	98.9 ± 0.5	91.3 ± 1.7	95.0 ± 1.1	80.0 ± 1.9
2.0	92.1 ± 0.9	44.1 ± 0.8	57.5 ± 2.4	96.7 ± 0.5	83.7 ± 3.7	89.6 ± 2.0	73.5 ± 3.5
Remove the target item	85.2 ± 1.2	39.1 ± 2.1	53.2 ± 4.2	94.4 ± 1.3	80.8 ± 1.6	88.8 ± 2.9	71 ± 2.4

Table 3. Comparison of Comprehensive Indicators of Different Pseudo-Semantic Prototypes.

Strategy	Acc(Ys) ↑	Acc(Yu) ↑	H ↑	P(Yu) ↑	R(Yu) ↑	F1(Yu) ↑
No prototype	95.1 ± 1.8	28.4 ± 3.1	43.7 ± 2.4	92.0 ± 1.7	70.2 ± 5.1	79.7 ± 4.2
Random prototype	93.7 ± 2.4	34.9 ± 2.7	50.6 ± 2.5	95.5 ± 2.0	75.1 ± 1.5	84.2 ±2.5
SFT	94.8 ± 2.1	42.3 ± 1.7	58.5 ± 1.8	98.9 ± 1.9	90.4 ± 1.9	94.5 ± 1.4
GMM Prototype	94.5 ± 23	41.7 ± 2.3	57.9 ± 3.4	97.5 ± 0.7	89.1 ± 3.1	93.1 ± 0.8

Table 4. Comparative Analysis of Pseudo-Semantic Prototypes.

Model	Acc(Yu)-Clean ↑	Acc(Yu)-Perturbed ↑	$Δ Acc$ ↓	H-Clean ↑	H-Perturbed ↑	$Δ H$ ↓	P ↑	R ↑	F1 ↑
Baseline	50.5 ± 2.2	43.2 ± 1.9	7.3	41.3 ± 3.8	37.5 ± 5.6	2.8	90.5 ± 1.1	86.4 ± 2.9	87.5 ± 2.0
MSFE	50.8 ± 1.5	48.9 ± 1.7	2.1	65.2 ± 2.1	62.4 ± 1.8	0.8	93.7 ± 2.1	90.1 ± 3.2	92.2 ± 1.8

Table 5. Comparison of Comprehensive Indicators of Different Pseudo-Semantic Prototypes.

Strategy	Acc(Ys) ↑	Acc(Yu) ↑	H ↑
SGT-only	93.3 ± 1.8	41.6 ± 2.3	57.5 ± 1.8
SFT-only	92.5 ± 1.0	44.1 ± 2.1	59.7 ± 2.4
Naive-fusion	95.8 ± 1.2	50.5 ± 1.5	65.0 ± 1.9

Table 6. Ablation study of MSA-ZSAR modules on GZSL performance.

Model Variant	Acc(Ys)	Acc(Yu)	H
Baseline	90.1 ± 2.5	42.7 ± 2.0	47.7 ± 3.1
+ SGT	92.3 ± 2.3	45.4 ± 2.4	53.4 ± 2.2
+ SFT	91.6 ± 1.8	46.1 ± 2.7	55.8 ± 2.4
+ CPG	92.0 ± 0.7	48.7 ± 3.0	58.9 ± 2.1
+ MME (Full Model)	95.9 ± 1.2	50.8 ± 1.5	65.0 ± 1.5

Table 7. Computational cost of different model variants.

Model Variant	Params (M)	GFLOPs	Latency (ms)
Baseline	33.1	28.5	6.8
Baseline + MSFE	36.4	32.7	7.5
Full Model	41.2	36.1	8.9

Table 8. Comparison of Comprehensive Indicators of Different Pseudo-Semantic Prototypes.

Method	Model	Word Embeddings	HMDB51 (ACC(Yu) ↑)	UCF101 (ACC(Yu) ↑)	HMDB51 (H ↑)	UCF101 (H ↑)
FGGA []	I3D	W	33.9 ± 2.8	30.1 ± 3.2	N/A	N/A
DASZL []	TSM	A	N/A	48.9 ± 2.9	N/A	45.3 ± 2.8
GAGCN []	Ts+Pose	W	31.1 ± 3.4	31.8 ± 3.7	N/A	N/A
TLPK []	R3D	W	45.6 ± 2.5	49.9 ± 2.3	37.1 ± 10.8	51.8 ± 4.4
SDR-CLIP []	I3D	W	38.9 ± 3.5	44.8 ± 3.2	43.1 ± 5.3	41.8 ± 2.2
RGSCL []	STC	W	32.8 ± 2.5	28.8 ± 3.4	40.1 ± 7.2	39.5 ± 5.8
Proposed	R(2+1)D-18	W	49.2 ± 3.4	50.5 ± 1.5	64.2 ± 0.7	65.0 ± 1.5

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Bias-Alleviated Zero-Shot Sports Action Recognition Enabled by Multi-Scale Semantic Alignment

Abstract

1. Introduction

2. Related Work

2.1. Sports Action Recognition Based on Deep Learning

2.2. Zero-Shot Learning and Generalized Zero-Shot Learning

2.3. Transductive Zero-Shot Learning

3. Proposed Method

3.1. Preparation Work

3.2. The Design of the MSA-ZSAR Framework

3.3. Design of Loss Function

4. Experiments

4.1. Experimental Setup

4.2. The Effectiveness of Dynamic Loss Weight $β (t)$

4.3. Quality of SFT Pseudo-Semantic Prototypes

4.4. Multi-Scale Feature Extraction Module (MSFE) and Robustness

4.5. Bias Suppression Effects of SGT and SFT Branches and Joint Inference

4.6. Ablation Study

4.7. Comparative Experiments with Existing Methods

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

Bias-Alleviated Zero-Shot Sports Action Recognition Enabled by Multi-Scale Semantic Alignment

Abstract

1. Introduction

2. Related Work

2.1. Sports Action Recognition Based on Deep Learning

2.2. Zero-Shot Learning and Generalized Zero-Shot Learning

2.3. Transductive Zero-Shot Learning

3. Proposed Method

3.1. Preparation Work

3.2. The Design of the MSA-ZSAR Framework

3.3. Design of Loss Function

4. Experiments

4.1. Experimental Setup

4.2. The Effectiveness of Dynamic Loss Weight β ( t )

4.3. Quality of SFT Pseudo-Semantic Prototypes

4.4. Multi-Scale Feature Extraction Module (MSFE) and Robustness

4.5. Bias Suppression Effects of SGT and SFT Branches and Joint Inference

4.6. Ablation Study

4.7. Comparative Experiments with Existing Methods

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

4.2. The Effectiveness of Dynamic Loss Weight $β (t)$