SoccerDETR: Real-Time Soccer Object Detection via Visual State Space Models with Semantic-Aware Feature Fusion

Zhou, Dongyang; Li, Yuheng

doi:10.3390/technologies14030142

by Dongyang Zhou ¹ and Yuheng Li ²,*

¹ Xi’an Thermal Power Research Institute Co., Ltd., Xi’an 710054, China
² Key Laboratory of Internet Information Retrieval of Hainan Province, Hainan University, Haikou 570228, China
* Author to whom correspondence should be addressed.

Technologies 2026, 14(3), 142; https://doi.org/10.3390/technologies14030142

Submission received: 8 February 2026 / Revised: 20 February 2026 / Accepted: 23 February 2026 / Published: 27 February 2026

Abstract

Real-time object detection in soccer videos presents significant challenges due to the dynamic nature of matches, varying object scales, and the stringent requirement for efficient processing. In this work, we define real-time detection as inference at 30 frames per second (FPS) or faster, which is the minimum requirement for smooth video processing and live broadcast applications. While transformer-based detectors have achieved remarkable accuracy, their quadratic computational complexity limits their use in real-time applications. We propose SoccerDETR, a novel real-time detection framework that integrates MobileMamba-based visual state space models with an efficient transformer encoder for soccer object detection. Our approach introduces four key innovations: (1) a MobileMamba backbone leveraging selective state space modeling to achieve linear computational complexity while maintaining global receptive fields; (2) a Semantic-aware Dynamic Feature Fusion Module (SDFM) that adaptively aggregates multi-scale features through progressive semantic injection; (3) a Spatial-Channel Synergistic Attention (SCSA) mechanism that explores the synergistic effects between spatial and channel attention for enhanced feature representation; and (4) a Separable Dynamic Decoder that employs dynamic convolution attention to replace traditional cross-attention, significantly reducing computational overhead. Additionally, we design a Scale-Aware Focal Loss (SAFL) that addresses the class imbalance and scale variation problems inherent in soccer scenarios. Extensive experiments on the Soccana and SoccerNet datasets demonstrate that SoccerDETR achieves state-of-the-art performance with 94.2% mAP@50 on Soccana and 91.8% mAP@50 on SoccerNet, while maintaining a real-time inference speed of 78 FPS on a single NVIDIA RTX 4090 GPU with a batch size of 1 and an input resolution of 640 × 640. Our method outperforms existing approaches by 2.3–5.7% in mAP while being 1.5–3.2× faster, demonstrating the effectiveness of state space models for efficient sports video object detection. Comprehensive ablation studies validate the effectiveness of each proposed component, and cross-dataset experiments demonstrate strong generalization capability.

Keywords:

soccer object detection; real-time object detection; state space models; transformer; multi-scale feature fusion; attention mechanism

1. Introduction

Soccer, as the world’s most popular sport with an estimated four billion fans globally [1], generates vast amounts of video content that requires automated analysis for applications ranging from tactical analysis to broadcast enhancement [2]. The global sports analytics market is projected to be worth $5.2 billion by 2027 [3], with soccer video analysis representing a significant portion of this growth. Object detection in soccer videos, particularly the localization of players, referees, and the ball, serves as a fundamental task for downstream applications, including player tracking, action recognition, event spotting, and automated highlight generation [4]. These applications have profound implications for coaching staff seeking tactical insights, broadcasters aiming to enhance viewer experience, and sports scientists analyzing player performance metrics.

However, soccer video analysis presents unique and formidable challenges that distinguish it from general object detection tasks: (1) significant scale variations between players at different distances from the camera, where a player near the camera may occupy thousands of pixels while distant players span merely dozens; (2) frequent occlusions in crowded scenes, particularly during set pieces, corner kicks, and goal-mouth scrambles where multiple players cluster together; (3) the extremely small size of the ball relative to the image resolution, often appearing as a mere 10–30 pixel blob in broadcast footage; (4) rapid motion blur affecting both players during sprints and the ball during powerful shots; (5) varying lighting conditions across different stadiums and times of day; and (6) the stringent requirement for real-time processing to enable live analysis and instant replay generation [5]. These challenges collectively demand detection systems that are simultaneously accurate, robust, and computationally efficient.

Recent advances in object detection have been dominated by two paradigms: CNN-based detectors such as the YOLO series [6,7,8] and transformer-based approaches like DETR [9] and its variants [10,11]. While YOLO models excel in speed through their single-stage architecture and optimized implementations, they rely on hand-crafted components like Non-Maximum Suppression (NMS) that introduce latency and can lead to suboptimal performance in crowded scenes where multiple players overlap. The NMS post-processing step, while effective for suppressing duplicate detections, operates as a greedy algorithm that may inadvertently remove valid detections in dense scenarios [12,13]. Transformer-based detectors elegantly eliminate NMS through set prediction with Hungarian matching but suffer from quadratic computational complexity $O(N^{2})$ with respect to sequence length, severely limiting their applicability in real-time scenarios where processing speed is paramount [14].
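The greedy behavior of NMS described above can be illustrated with a minimal sketch. The box coordinates, scores, the 0.5 IoU threshold, and the function names `iou` and `greedy_nms` are all hypothetical choices made for illustration; they are not taken from any detector evaluated in this paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, visiting boxes from highest score down."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical scene: two players standing close together, one far away.
boxes = [(100, 100, 160, 260), (115, 100, 175, 260), (400, 120, 450, 250)]
scores = [0.92, 0.88, 0.90]
kept = greedy_nms(boxes, scores)  # the second player (index 1) is suppressed
```

In this toy scene the two nearby boxes overlap with an IoU of 0.6, so the lower-scoring one is discarded even though it corresponds to a different player. Set prediction avoids this step entirely.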

The recently proposed RT-DETR [14] addresses this dilemma by designing an efficient hybrid encoder that decouples intra-scale and cross-scale feature interaction, achieving real-time performance while maintaining end-to-end detection capabilities. This architectural innovation, along with related advances in efficient neural network training [15], demonstrated that transformer-based detectors can compete with and even surpass YOLO models in the speed–accuracy trade-off. However, the transformer backbone still incurs substantial computational costs when processing the high-resolution inputs typical in sports broadcasting, where 1080p or even 4K resolution is standard. The self-attention mechanism, despite its effectiveness in capturing global dependencies, becomes a computational bottleneck as image resolution increases.

Meanwhile, State Space Models (SSMs), particularly Mamba [16], have emerged as a promising alternative that achieves linear complexity $O(N)$ while effectively capturing long-range dependencies through selective state space mechanisms [17,18,19]. Unlike transformers that compute pairwise attention between all tokens, SSMs process sequences through a recurrent formulation that maintains a compressed hidden state, enabling efficient processing of long sequences. The selective mechanism in Mamba allows the model to dynamically adjust its information propagation based on input content, achieving content-aware filtering that rivals the expressiveness of attention mechanisms. Recent works have successfully adapted Mamba to visual recognition tasks, demonstrating competitive performance with significantly reduced computational overhead [20].

In this paper, we propose SoccerDETR, a novel detection framework that synergistically combines the efficiency of state space models with the accuracy of transformer-based detection; it is specifically tailored for the demanding requirements of soccer video analysis. As illustrated in Figure 1, our framework builds upon the successful RT-DETR architecture while introducing four key innovations that collectively address the computational and accuracy challenges:

(1) MobileMamba Backbone: We adopt MobileMamba [20] as our feature extraction backbone, which employs the Selective Scan 2D (SS2D) mechanism to process visual features with linear complexity. The SS2D module innovatively scans image patches along four directions (horizontal left-to-right, horizontal right-to-left, vertical top-to-bottom, and vertical bottom-to-top) and merges the results through a learnable aggregation mechanism. This multi-directional scanning strategy enables comprehensive spatial understanding without the quadratic overhead of self-attention, making it particularly suitable for processing high-resolution soccer broadcast footage. The MobileMamba architecture further incorporates Multi-Receptive Field Feature Interaction (MRFFI) modules that capture both local details and global context efficiently.

(2) Semantic-aware Dynamic Feature Fusion Module (SDFM): We incorporate the SDFM [21] to achieve adaptive multi-scale feature aggregation that is crucial for detecting objects across the extreme scale range present in soccer videos. Unlike conventional feature pyramid networks (FPNs) that use fixed fusion strategies with simple element-wise addition or concatenation, SDFM dynamically adjusts fusion weights based on semantic content through a learned attention mechanism. This enables more effective information flow between different resolution levels, ensuring that small objects like the ball receive adequate feature support from high-resolution feature maps while large objects benefit from the rich semantic information in low-resolution features.

(3) Spatial-Channel Synergistic Attention (SCSA): We integrate SCSA [22] into our feature enhancement pipeline to boost the discriminative power of extracted features. SCSA consists of two complementary components: Shareable Multi-Semantic Spatial Attention (SMSA) and Progressive Channel-wise Self-Attention (PCSA). SMSA captures multi-scale spatial information through parallel depthwise convolutions with varying kernel sizes, enabling the model to attend to objects of different scales simultaneously. PCSA computes channel-wise dependencies through an efficient self-attention mechanism that progressively refines channel representations. These two components work synergistically, with spatial attention guiding channel recalibration and channel attention informing spatial focus, achieving superior feature enhancement with minimal computational overhead.

(4) Separable Dynamic Decoder: We adopt the Separable Dynamic Decoder [23] that fundamentally reimagines the query–feature interaction mechanism in transformer decoders. Traditional multi-head cross-attention computes attention weights between all query–feature pairs, resulting in complexity $O(N \cdot M)$, where $N$ is the number of object queries and $M$ is the number of feature tokens. The Separable Dynamic Decoder replaces this with dynamic convolution attention, which generates query-specific convolution kernels and applies them to feature maps through separable convolutions. This innovative design reduces the decoder complexity from $O(N \cdot M)$ to $O(N + M)$, enabling efficient processing while maintaining the expressive power needed for accurate detection.
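To give a feel for the gap between the two growth rates, the sketch below counts the dominant terms for one plausible configuration. The token count uses the 640 × 640 input and the strides {8, 16, 32} stated elsewhere in the paper; the value of 300 object queries is an assumption, and the helper `token_count` is hypothetical. These are asymptotic term counts, not FLOPs, since constants and the channel dimension are ignored.

```python
def token_count(image_size, strides):
    """Total number of feature tokens over all pyramid levels."""
    return sum((image_size // s) ** 2 for s in strides)


num_queries = 300                            # assumed N, not given in the text
num_tokens = token_count(640, (8, 16, 32))   # M: 80*80 + 40*40 + 20*20

pairwise_terms = num_queries * num_tokens    # grows like N * M
separable_terms = num_queries + num_tokens   # grows like N + M
```

Under these assumptions the pairwise count is 2,520,000 against 8,700 for the additive form, which is why the query–feature interaction is a natural target for optimization.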

Furthermore, we propose a Scale-Aware Focal Loss (SAFL) that combines the class-balancing properties of focal loss [24] with scale-adaptive weighting to address the inherent class imbalance and scale variation in soccer detection scenarios. Standard focal loss effectively handles the foreground–background imbalance by down-weighting easy examples, but it treats all positive samples equally regardless of their scale. In soccer detection, this leads to suboptimal performance with regard to small objects like the ball, which are inherently harder to detect but equally important for downstream applications. Our SAFL explicitly up-weights the loss contribution from small objects, ensuring that the model dedicates sufficient learning capacity to these challenging cases.

Our main contributions can be summarized as follows:

We propose SoccerDETR, a detection framework that integrates visual state space models with transformer-based detection for soccer video analysis. To the best of our knowledge, this is among the first works to combine SSM backbones with transformer-based detection heads in the sports domain (see Table 1 for a systematic comparison). Unlike prior works that simply applied general-purpose detectors to sports scenarios, our architecture is purpose-built to address the unique challenges of soccer detection: extreme scale variation (20–100× between players and balls), dense player clustering, and real-time processing requirements. The adoption of MobileMamba as the backbone achieves linear computational complexity $O (N)$ while maintaining global receptive fields through multi-directional selective scanning, achieving significant computational savings compared to the quadratic $O (N^{2})$ complexity of transformer-based detectors.
We design a novel feature enhancement pipeline that synergistically combines SDFM and SCSA modules in a principled manner. The key architectural insight is that SDFM first performs content-aware multi-scale fusion through learned channel attention weights, and SCSA subsequently enhances the fused features through joint spatial-channel modulation, where spatial attention guides channel recalibration and vice versa. This two-stage enhancement is specifically motivated by the soccer detection scenario, where objects span extreme scale ranges and appear in cluttered backgrounds.
We introduce a Scale-Aware Focal Loss (SAFL) with a formally defined scale-adaptive weighting function $w_{scale} = (s_{max} / (s + \epsilon))^{\beta}$ that explicitly addresses the scale imbalance problem inherent in soccer detection. We provide gradient analysis showing how SAFL modulates the gradient magnitude for objects at different scales, ensuring that small objects (e.g., balls) receive proportionally larger gradient signals during training.
We conduct extensive experiments on Soccana and SoccerNet datasets with statistical significance testing (paired t-tests, where each training run with a different random seed constitutes one sample) and 5-fold cross-validation, demonstrating state-of-the-art performance with 94.2% mAP@50 on Soccana and 91.8% mAP@50 on SoccerNet. Comprehensive ablation studies validate the contribution of each proposed component, and cross-dataset experiments demonstrate strong generalization capability.
We provide detailed computational complexity analysis including theoretical complexity bounds, actual FLOPs/Params/FPS measurements, and per-component latency breakdown, offering actionable insights for future research in efficient sports video analysis.
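The scale-adaptive weight in the third contribution can be sketched numerically. The weighting formula follows the definition given above; everything else is an assumption made for illustration: $s$ is taken to be an object size in pixels, $\beta = 0.5$, the focal-loss constants are the common defaults $\alpha = 0.25$ and $\gamma = 2$, and the weight is combined with the focal term by simple multiplication.

```python
import math


def scale_weight(s, s_max, beta=0.5, eps=1e-6):
    """Scale-adaptive weight w_scale = (s_max / (s + eps)) ** beta."""
    return (s_max / (s + eps)) ** beta


def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """Standard focal loss for the probability assigned to the true class."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def scale_aware_focal_loss(p_t, s, s_max, beta=0.5):
    """Assumed combination: the focal term multiplied by the scale weight."""
    return scale_weight(s, s_max, beta) * focal_loss(p_t)


# Hypothetical sizes in pixels: a 20 px ball and a 200 px player,
# both predicted with the same confidence.
ball = scale_aware_focal_loss(p_t=0.6, s=20.0, s_max=200.0)
player = scale_aware_focal_loss(p_t=0.6, s=200.0, s_max=200.0)
```

With these illustrative values the ball's loss is larger than the player's by a factor of about $\sqrt{10}$, so its gradient contribution grows by the same factor.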

2. Related Work

2.1. Object Detection in Sports Videos

Object detection in sports videos has attracted considerable attention due to its practical applications in performance analysis, broadcast enhancement, and automated content generation. The unique characteristics of sports videos, including fast motion, frequent occlusions, and domain-specific object categories, have motivated the development of specialized detection approaches.

Early approaches relied on traditional computer vision techniques such as background subtraction, color-based segmentation, and hand-crafted features like HOG (Histogram of Oriented Gradients) and SIFT (Scale-Invariant Feature Transform) [5]. These methods, while computationally efficient, struggled with the variability in lighting conditions, camera angles, and player appearances that characterize broadcast soccer footage. Background subtraction techniques, in particular, failed when cameras panned or zoomed, which is common in soccer broadcasting.

With the advent of deep learning, CNN-based detectors have become the dominant paradigm for sports video analysis. Komorowski et al. [25] proposed FootAndBall, a fully convolutional network specifically designed for simultaneous player and ball detection in soccer videos. Their architecture employed a multi-scale feature pyramid to handle the extreme scale variations between players and the ball, achieving real-time performance on standard hardware. However, the fixed receptive fields of convolutional networks limited their ability to capture global context, which is important for disambiguating players in crowded scenes.

Naik et al. [26] employed YOLO-based detection with team classification for tactical analysis, demonstrating the practical utility of detection systems for coaching applications. Their work highlighted the importance of not just detecting players but also classifying them by team affiliation, which requires an understanding of jersey colors and spatial context. Subsequent works have explored semi-supervised learning approaches to reduce annotation requirements [27], recognizing that obtaining large-scale annotations for sports videos is labor-intensive and expensive.

The SoccerNet benchmark [2,4] has established standardized evaluation protocols for soccer video understanding, including object detection, tracking, action spotting, and camera calibration. The benchmark comprises over 500 complete soccer games from major European leagues, providing a comprehensive testbed for algorithm development. Recent SoccerNet challenges have demonstrated that while general-purpose detectors achieve reasonable performance, specialized architectures that address soccer-specific challenges such as small ball detection and player occlusion can yield significant improvements [27]. The challenge results have consistently shown that ball detection remains the most challenging aspect, with even state-of-the-art methods achieving significantly lower accuracy on balls compared to players.

Beyond soccer, object detection has been extensively studied in other sports, including basketball [28], tennis, and American football. These works have revealed common challenges across sports domains, including fast motion, occlusion, and the need for real-time processing. However, soccer presents unique challenges due to the large field of play, the small size of the ball relative to the image, and the high density of players during certain game situations.

Table 1 provides a systematic comparison of representative methods for sports video object detection, highlighting the key differences in backbone architecture, detection paradigm, multi-scale fusion strategy, attention mechanism, and loss function design. As shown, most existing methods either rely on CNN backbones with limited global receptive fields or transformer backbones with quadratic complexity. To the best of our knowledge, SoccerDETR is among the first to integrate a state space model backbone with adaptive multi-scale fusion and synergistic attention, while introducing a scale-aware loss specifically designed for the extreme scale variations in soccer scenarios.

Table 1. Systematic comparison of representative detection methods for sports video analysis. “SSM” denotes the State Space Model. “Adaptive” indicates content-aware dynamic fusion. “Synergistic” refers to joint spatial-channel attention with mutual interaction.

Method	Backbone	Paradigm	Fusion	Attention	Scale Loss
Faster R-CNN [29]	CNN	Two-stage	FPN	None	No
FCOS [30]	CNN	One-stage	FPN	None	No
YOLOv8 [8]	CNN	One-stage	PANet	None	No
YOLOv10 [6]	CNN	One-stage	PANet	None	No
DETR [9]	CNN+Trans.	End-to-end	None	Self-Attn	No
Def. DETR [10]	CNN+Trans.	End-to-end	MS Def.	Deformable	No
DINO [11]	CNN+Trans.	End-to-end	MS Def.	Deformable	No
RT-DETR [14]	CNN+Trans.	End-to-end	Hybrid	Self-Attn	No
FootAndBall [25]	CNN	One-stage	FPN	None	No
Ours	SSM	End-to-end	SDFM	SCSA	SAFL

2.2. Real-Time Object Detection

Real-time object detection has been a central focus of computer vision research, driven by applications in autonomous driving, robotics, and video surveillance. The fundamental challenge lies in achieving high accuracy while maintaining low latency, as many applications require processing video streams at 30 FPS or higher.

The YOLO (You Only Look Once) series [6,7,8,31,32,33,34] has dominated real-time object detection through continuous architectural innovations and engineering optimizations. The key insight of YOLO is to formulate detection as a single regression problem, predicting bounding boxes and class probabilities directly from full images in a single evaluation. This contrasts with two-stage detectors like Faster R-CNN [29] that begin by generating region proposals and then classify each proposal.

YOLOv4 [31] introduced numerous training tricks and architectural improvements including CSPDarknet backbone, PANet neck, and mosaic data augmentation. YOLOX [32] adopted an anchor-free design and decoupled detection head, simplifying the architecture while improving performance. YOLOv6 [33] focused on industrial deployment, optimizing for various hardware platforms including edge devices. YOLOv7 [34] proposed Extended Efficient Layer Aggregation Networks (E-ELAN) for more effective feature aggregation.

YOLOv8 [8] introduced a unified framework supporting detection, segmentation, and pose estimation, with an anchor-free design and decoupled head that improved both accuracy and training stability. YOLOv9 [7] proposed Programmable Gradient Information (PGI) to address information bottleneck issues in deep networks, ensuring that gradient information flows effectively through the network during training. Most recently, YOLOv10 [6] eliminated NMS through consistent dual assignments during training, achieving truly end-to-end detection while maintaining competitive speed.

Despite these advances, YOLO models still rely on hand-crafted components and heuristics that may not generalize well across domains. The anchor-based versions require careful anchor design for each dataset, while even anchor-free versions use fixed assignment strategies that may not be optimal for all scenarios.

Transformer-based detectors, initiated by DETR (DEtection TRansformer) [9], reformulate detection as a set prediction problem using a Transformer encoder–decoder architecture with bipartite matching loss. This elegant formulation eliminates the need for NMS and anchor design, enabling truly end-to-end training and inference. However, the original DETR suffered from slow convergence (requiring 500 epochs) and poor performance on small objects.
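The one-to-one assignment behind set prediction can be shown on a tiny example. The sketch below finds the minimum-cost matching by brute force rather than with the Hungarian algorithm, which is only practical for a handful of objects; the cost values and the function name `match_predictions` are hypothetical.

```python
from itertools import permutations


def match_predictions(cost):
    """Brute-force minimum-cost one-to-one assignment.

    cost[i][j] is the cost of assigning prediction i to ground truth j.
    Assumes at least as many predictions as ground truths; predictions
    left unmatched would be treated as "no object".
    """
    num_pred, num_gt = len(cost), len(cost[0])
    best_total, best_pairs = float("inf"), None
    # Try every way of choosing a distinct prediction for each ground truth.
    for preds in permutations(range(num_pred), num_gt):
        total = sum(cost[p][g] for g, p in enumerate(preds))
        if total < best_total:
            best_total = total
            best_pairs = [(p, g) for g, p in enumerate(preds)]
    return best_pairs, best_total


# Hypothetical costs for 3 predictions and 2 ground-truth objects.
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.7, 0.6],
]
pairs, total = match_predictions(cost)
```

Because each ground-truth object is matched to exactly one prediction, duplicates are penalized during training and no NMS step is needed at inference. Practical implementations typically solve the same problem in polynomial time, for example with `scipy.optimize.linear_sum_assignment`.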

Subsequent works have addressed these limitations through various innovations. Deformable DETR [10] introduced deformable attention that attends to a sparse set of sampling points around reference points, reducing computational complexity while improving small object detection. Conditional DETR [35] proposed conditional cross-attention that conditions spatial queries on decoder embeddings, accelerating convergence. DAB-DETR [36] used dynamic anchor boxes as queries, providing better spatial priors. DN-DETR [37] introduced query denoising training that adds noise to ground-truth boxes and trains the model to reconstruct them, significantly accelerating convergence. DINO [11] combined denoising training with contrastive denoising and mixed query selection, achieving state-of-the-art accuracy among DETR variants.

RT-DETR [14] was the first transformer-based detector to reach real-time speed, which it achieved by designing an efficient hybrid encoder that decouples intra-scale and cross-scale feature interaction. The hybrid encoder first processes each scale independently with self-attention, then fuses information across scales through a lightweight cross-scale fusion module. Combined with uncertainty-minimal query selection that identifies the most informative queries, RT-DETR demonstrated that transformers can compete with YOLO in speed while maintaining superior accuracy. RT-DETRv2 [38] further improved performance through bag-of-freebies training techniques.

2.3. State Space Models for Vision

State Space Models (SSMs) have emerged as an efficient alternative to Transformers for sequence modeling, offering linear complexity while maintaining the ability to capture long-range dependencies. The theoretical foundation of SSMs lies in classical control theory, where continuous-time linear systems are used to model dynamic processes.

The S4 (Structured State Space) model [39] introduced a breakthrough in efficient long-range sequence modeling by parameterizing the state matrix with a specific structure (HiPPO—High-order Polynomial Projection Operators) that enables stable training and efficient computation. S4 achieved state-of-the-art results on the Long-Range Arena benchmark, demonstrating the ability to model dependencies spanning thousands of time steps with linear complexity.

Mamba [16] further improved efficiency and expressiveness through selective state space mechanisms with hardware-aware implementations. The key innovation of Mamba is making the SSM parameters input-dependent, enabling content-aware filtering that allows the model to selectively propagate or forget information based on the input content. This selectivity is crucial for modeling complex sequences where different parts of the input have varying importance. Mamba also introduced a hardware-aware algorithm that leverages GPU memory hierarchy for efficient computation, achieving significant speedups over naive implementations.
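The idea of input-dependent SSM parameters can be made concrete with a toy scalar recurrence. This is a minimal sketch and not Mamba's implementation: the state is a single number, the weights `w_delta`, `w_b`, and `w_c` are hypothetical scalars standing in for learned projections, and the discretization uses the simplified forms $\bar{A} = \exp(\Delta A)$ and $\bar{B} = \Delta B$.

```python
import math


def softplus(z):
    """Smooth positive function used here to keep the step size above zero."""
    return math.log1p(math.exp(z))


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Toy scalar selective SSM: one pass over the sequence, O(N) time.

    All weights are hypothetical scalars; a real model learns per-channel
    projections and keeps a vector-valued state.
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)   # input-dependent step size
        a_bar = math.exp(delta * a)     # discretized decay, between 0 and 1
        b_bar = delta * (w_b * x)       # input-dependent input projection
        h = a_bar * h + b_bar * x       # recurrent state update
        ys.append((w_c * x) * h)        # input-dependent readout
    return ys


ys = selective_scan([1.0, 0.5, 0.0])
```

Each token is visited once and only the running state `h` is carried forward, which is where the linear cost in sequence length comes from. The hardware-aware parallel scan used in practice computes the same recurrence more efficiently.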

For visual recognition, adapting SSMs to 2D images requires addressing the fundamental challenge that images are not naturally sequential. Vision Mamba (Vim) [17] proposed bidirectional scanning that processes the flattened patch sequence in both forward and backward directions, capturing dependencies both ways along the scanning path. VMamba [18] introduced the 2D Selective Scan (SS2D) mechanism that scans image patches along four directions (left-to-right, right-to-left, top-to-bottom, and bottom-to-top) and merges the results. This multi-directional scanning ensures that each patch can aggregate information from all spatial directions, approximating the global receptive field of self-attention.
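How a 2D grid of patches becomes four 1D sequences can be illustrated on a small grid. The sketch below assumes the row-wise and column-wise traversals and their reverses, as listed in the Introduction; the function name `scan_orders` and the dictionary keys are hypothetical, and the learnable merging of the four outputs is not shown.

```python
def scan_orders(height, width):
    """Return four 1D visiting orders over an H x W grid of patches.

    Each order is a list of (row, col) positions.
    """
    row_major = [(r, c) for r in range(height) for c in range(width)]
    col_major = [(r, c) for c in range(width) for r in range(height)]
    return {
        "left_to_right": row_major,
        "right_to_left": row_major[::-1],
        "top_to_bottom": col_major,
        "bottom_to_top": col_major[::-1],
    }


orders = scan_orders(2, 3)
# orders["left_to_right"] starts (0, 0), (0, 1), (0, 2), ...
# orders["top_to_bottom"] starts (0, 0), (1, 0), (0, 1), ...
```

Every order visits each patch exactly once, so running a 1D scan over all four costs four linear passes rather than one quadratic attention computation.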

MobileMamba [20] proposed a lightweight architecture specifically designed for efficient visual recognition on resource-constrained devices. The key innovation is the Multi-Receptive Field Feature Interaction (MRFFI) module that combines the Long-Range Wavelet Transform-Enhanced Mamba (WTE-Mamba) for capturing global dependencies, Efficient Multi-Kernel Depthwise Convolution (MK-DeConv) for local feature extraction, and Eliminate Redundant Identity components for efficiency. MobileMamba achieves state-of-the-art efficiency–accuracy trade-offs, outperforming both CNN-based and Transformer-based lightweight models.

Recent works have explored applying visual SSMs to various downstream tasks, including object detection [40], semantic segmentation, and video understanding. These works have demonstrated that SSMs can serve as effective backbones for dense prediction tasks, offering a compelling alternative to transformers when computational efficiency is paramount. The success of visual SSMs motivates our exploration of integrating them into the detection framework for efficient soccer video analysis.

2.4. Attention Mechanisms for Object Detection

Attention mechanisms have proven crucial for enhancing feature representations in detection networks, enabling models to focus on relevant regions and channels while suppressing irrelevant information. The development of attention mechanisms has progressed from simple channel attention to sophisticated spatial-channel interactions.

Squeeze-and-Excitation (SE) networks [41] introduced channel attention through global average pooling followed by a bottleneck MLP that produces channel-wise scaling factors. This simple yet effective mechanism enables the network to recalibrate channel responses based on global context, improving feature discriminability with minimal computational overhead. SE attention has been widely adopted in various detection architectures as a plug-and-play module.
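The squeeze, excitation, and rescaling steps can be written out in a few lines. This is a minimal sketch on plain Python lists with hypothetical weights and no bias terms; the function name `se_scale` and the tiny two-channel input are chosen only for illustration.

```python
import math


def se_scale(feature_map, w1, w2):
    """Apply SE-style channel attention.

    feature_map: list of C channels, each a flat list of spatial values.
    w1: bottleneck weights, shape (C_reduced, C).
    w2: expansion weights, shape (C, C_reduced).
    Returns the rescaled feature map and the per-channel gates.
    """
    # Squeeze: global average pooling gives one number per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_map]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Rescale: multiply every value in a channel by that channel's gate.
    scaled = [[g * v for v in ch] for g, ch in zip(gates, feature_map)]
    return scaled, gates


# Hypothetical input: 2 channels with 2 spatial positions each.
feature_map = [[1.0, 3.0], [0.0, 0.0]]
scaled, gates = se_scale(feature_map, w1=[[1.0, 0.0]], w2=[[1.0], [-1.0]])
```

The gates depend only on channel-wise averages, so the mechanism adds a negligible number of operations relative to the convolutions around it.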

CBAM (Convolutional Block Attention Module) [42] extended SE by combining channel and spatial attention sequentially. The spatial attention module computes attention weights based on the spatial distribution of features, enabling the network to focus on informative regions. However, the sequential application of channel and spatial attention may not fully exploit their interactions.

ECA-Net [43] proposed efficient channel attention without dimensionality reduction, using 1D convolution to capture local cross-channel interactions. This design avoids the information loss caused by the bottleneck in SE networks while maintaining efficiency. The kernel size of the 1D convolution is adaptively determined based on channel dimension, enabling flexible modeling of channel dependencies.
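The adaptive kernel size can be computed directly from the channel count. The sketch below uses the mapping $k = \left| \log_2(C)/\gamma + b/\gamma \right|_{odd}$ with the defaults $\gamma = 2$ and $b = 1$ from the ECA-Net paper; the function name `eca_kernel_size` is hypothetical.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Map a channel count to the nearest odd 1D kernel size."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1


k_small = eca_kernel_size(64)    # fewer channels, smaller kernel
k_large = eca_kernel_size(256)   # more channels, larger kernel
```

Wider layers therefore get a larger interaction range across neighboring channels without any manual tuning.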

Recent work has explored the synergistic effects between different attention types. Coordinate Attention [44] factorizes channel attention into two 1D feature encoding processes that capture long-range dependencies along horizontal and vertical directions separately. This design preserves precise positional information that is lost in global pooling operations.

SCSA (Spatial-Channel Synergistic Attention) [22] proposed a novel attention mechanism that explores the synergistic effects between spatial and channel attention. SCSA consists of two components: Shareable Multi-Semantic Spatial Attention (SMSA) that captures multi-scale spatial information through parallel depthwise convolutions with varying kernel sizes, and Progressive Channel-wise Self-Attention (PCSA) that computes channel-wise dependencies through an efficient self-attention mechanism. The key insight is that spatial attention can provide discriminative priors that guide channel recalibration, while channel attention can inform which spatial regions are most relevant. This synergistic interaction achieves superior performance compared to sequential or parallel application of independent attention modules.

2.5. Multi-Scale Feature Fusion

Multi-scale feature fusion is essential for detecting objects across a wide range of scales, which is particularly important in soccer detection where players and balls exhibit extreme scale variations. The goal is to combine high-resolution features that preserve spatial details with low-resolution features that capture semantic information.

Feature Pyramid Networks (FPNs) [45] introduced a top-down pathway with lateral connections that builds a feature pyramid with strong semantics at all scales. The top-down pathway upsamples spatially coarser but semantically stronger feature maps, which are then combined with bottom-up features through lateral connections. FPN has become a standard component in modern detection architectures.

PANet (Path Aggregation Network) [46] enhanced FPN by adding a bottom-up path augmentation that propagates low-level features to higher levels, enabling better localization. The additional bottom-up pathway shortens the information path between low-level and high-level features, facilitating the flow of accurate localization signals.

BiFPN (Bidirectional Feature Pyramid Network) [47] proposed weighted bidirectional feature fusion that learns the importance of different input features. Unlike FPN and PANet, which treat all input features equally, BiFPN assigns learnable weights to each input during fusion, enabling the network to emphasize more informative features.
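The learnable weighting in BiFPN can be sketched with its fast normalized fusion rule, $O = \sum_i w_i I_i / (\epsilon + \sum_j w_j)$ with $w_i \geq 0$. The example treats feature maps as flat lists; the input values, the weights, and the function name `fast_normalized_fusion` are hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature vectors with non-negative normalized weights."""
    w = [max(0.0, wi) for wi in weights]   # ReLU keeps each weight >= 0
    norm = eps + sum(w)
    length = len(features[0])
    return [
        sum(wi * f[k] for wi, f in zip(w, features)) / norm
        for k in range(length)
    ]


# Two hypothetical input features; the second is weighted three times as much.
fused = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], weights=[1.0, 3.0])
```

Unlike this scheme, where the weights are fixed after training, content-aware fusion recomputes the weights for every input.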

The Semantic-aware Dynamic Feature Fusion Module (SDFM) [21] takes a different approach by dynamically adjusting fusion weights based on semantic content. SDFM uses channel attention to compute content-aware fusion weights, enabling adaptive information flow between different resolution levels. This is particularly beneficial for soccer detection, where the optimal fusion strategy may vary depending on the scene content (e.g., crowded penalty area vs. sparse midfield).

3. Method

3.1. Overall Architecture

The overall architecture of SoccerDETR is illustrated in Figure 1. Given an input image

I \in R^{H \times W \times 3}

, our framework consists of four main components: (1) a MobileMamba backbone for efficient feature extraction with linear complexity; (2) an Efficient Transformer Encoder with SDFM for adaptive multi-scale feature fusion; (3) SCSA modules for synergistic spatial-channel feature enhancement; and (4) a Separable Dynamic Decoder for efficient query–feature interaction and final detection.

The processing pipeline proceeds as follows. First, the input image is processed by the MobileMamba backbone, which extracts multi-scale features

{P_{3}, P_{4}, P_{5}}

with strides

{8, 16, 32}

, respectively. These features capture information at different spatial resolutions, with

P_{3}

preserving fine-grained details suitable for small object detection and

P_{5}

encoding high-level semantic information for recognizing object categories.

The multi-scale features are then fed into the SDFM module, which performs adaptive feature fusion based on semantic content. Unlike conventional FPN, which uses fixed fusion weights, SDFM dynamically adjusts the contribution of each scale based on the input content, enabling more effective information aggregation across scales.

Following SDFM, the SCSA modules enhance the fused features through synergistic spatial-channel attention. The spatial attention component identifies informative regions while the channel attention component recalibrates channel responses, with both components working collaboratively to boost feature discriminability.

The enhanced features are processed by the Efficient Transformer Encoder, which performs intra-scale self-attention to capture global dependencies within each scale. The encoder output is then used for Uncertainty-minimal Query Selection, which identifies the most informative feature locations as initial object queries.

Finally, the Separable Dynamic Decoder refines the object queries through dynamic convolution attention, producing the final detection results, including bounding box coordinates and class probabilities. The entire pipeline is trained end-to-end with our proposed Scale-Aware Focal Loss.

3.2. MobileMamba Backbone with SS2D

Traditional transformer backbones suffer from quadratic complexity

O (N^{2})

due to self-attention operations, where N is the number of tokens (patches). For a typical input image of size

640 \times 640

with patch size

16 \times 16

, this results in

N = 1600

tokens, leading to attention matrices of size

1600 \times 1600 = 2.56

million elements. This quadratic scaling becomes prohibitive for the high-resolution inputs common in sports broadcasting.

We address this limitation by adopting MobileMamba [20] as our backbone, which leverages the Selective State Space Model (S6) to achieve linear complexity

O (N)

. The key insight is that State Space Models process sequences through a recurrent formulation that maintains a fixed-size hidden state, avoiding the need to compute pairwise interactions between all tokens.

The continuous-time state space model is defined by the following linear ordinary differential equations:

\begin{matrix} h^{'} (t) & = A h (t) + B x (t) \\ y (t) & = C h (t) + D x (t) \end{matrix}

(1)

where

h (t) \in R^{N}

is the hidden state that summarizes the history of the input sequence,

x (t) \in R

is the scalar input at time t,

y (t) \in R

is the scalar output, and

A \in R^{N \times N}

,

B \in R^{N \times 1}

,

C \in R^{1 \times N}

,

D \in R

are learnable parameters. The state matrix

A

governs the dynamics of the hidden state, determining how information is propagated and decayed over time.

For discrete sequence processing in neural networks, we discretize Equation (1) using the zero-order hold (ZOH) method with step size

Δ

:

\begin{matrix} \bar{A} & = exp (Δ A) \\ \bar{B} & = {(Δ A)}^{- 1} (exp (Δ A) - I) \cdot Δ B \approx Δ B \end{matrix}

(2)

The exact ZOH discretization of

\bar{B}

involves the matrix inverse

{(Δ A)}^{- 1} (exp (Δ A) - I)

, which is computationally expensive and numerically unstable for large state dimensions. The approximation

\bar{B} \approx Δ B

is derived from the first-order Taylor expansion of the matrix exponential:

exp (Δ A) \approx I + Δ A

when

∥ Δ A ∥ ≪ 1

. Substituting this into the exact formula yields

\bar{B} \approx {(Δ A)}^{- 1} (Δ A) \cdot Δ B = Δ B

. In practice, the step size

Δ

is initialized to small values (typically

10^{- 3}

to

10^{- 1}

) and the state matrix

A

is parameterized with bounded eigenvalues, ensuring that

∥ Δ A ∥ ≪ 1

holds throughout training. This simplification reduces the discretization from

O (N^{3})

(matrix inversion) to

O (N)

(element-wise scaling), which is critical for maintaining the overall linear complexity of the SSM. The discretized parameters

\bar{A}

and

\bar{B}

define the discrete-time dynamics.
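The quality of the approximation in Equation (2) can be checked numerically. The following is a minimal sketch for a single diagonal entry of A, so that the matrix inverse reduces to a scalar division; the values of `a`, `b`, and the step sizes are illustrative assumptions, not settings from our experiments.

```python
import math

def zoh_discretize(a: float, b: float, delta: float) -> tuple[float, float]:
    """Exact zero-order-hold discretization for one diagonal entry of A (Eq. 2)."""
    a_bar = math.exp(delta * a)
    # (delta*a)^{-1} (exp(delta*a) - 1) * (delta*b); expm1 keeps precision for small delta*a
    b_bar = math.expm1(delta * a) / (delta * a) * (delta * b)
    return a_bar, b_bar

def approx_b_bar(b: float, delta: float) -> float:
    """First-order approximation B_bar ~= delta * B."""
    return delta * b

a, b = -1.0, 1.0  # illustrative values only
for delta in (1e-3, 1e-2, 1e-1):
    _, exact = zoh_discretize(a, b, delta)
    approx = approx_b_bar(b, delta)
    rel_err = abs(approx - exact) / abs(exact)
    print(f"delta={delta:g}  exact={exact:.6f}  approx={approx:.6f}  rel_err={rel_err:.2%}")
```

To first order the relative error grows like |Δa|/2, which is why the approximation is only tight in the regime ∥ΔA∥ ≪ 1.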

The discretized state space model then becomes a linear recurrence:

\begin{matrix} h_{t} & = \bar{A} h_{t - 1} + \bar{B} x_{t} \\ y_{t} & = C h_{t} + D x_{t} \end{matrix}

(3)

This recurrence can be computed in

O (N)

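time for a sequence of N tokens (N here is the sequence length, as in the self-attention analysis, not the state dimension of Equation (1), which is treated as a constant), compared to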

O (N^{2})

for self-attention. Moreover, the recurrence can be parallelized using the associative scan algorithm, enabling efficient GPU implementation.

The key innovation of Mamba lies in making the parameters

B

,

C

, and

Δ

input-dependent, enabling selective information propagation:

\begin{matrix} B_{t} & = {Linear}_{B} (x_{t}) \in R^{N \times 1} \\ C_{t} & = {Linear}_{C} (x_{t}) \in R^{1 \times N} \\ Δ_{t} & = softplus ({Linear}_{Δ} (x_{t}) + {Broadcast}_{Δ} ({Parameter}_{Δ})) \in R^{+} \end{matrix}

(4)

where

{Linear}_{B}

,

{Linear}_{C}

, and

{Linear}_{Δ}

are learned linear projections, and

softplus (x) = log (1 + e^{x})

ensures the positivity of the step size. This selectivity allows the model to dynamically adjust its information propagation based on the input content, enabling content-aware filtering that is crucial for distinguishing relevant information from irrelevant information.
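As a rough illustration of Equations (3) and (4), the sketch below runs the selective recurrence sequentially over a scalar input sequence with a diagonal state matrix. The projection weights `w_b`, `w_c`, `w_delta`, and `b_delta` are hypothetical stand-ins for the learned linear layers, and the loop is a plain sequential scan rather than the parallel associative scan used in practice.

```python
import math

def softplus(v: float) -> float:
    return math.log1p(math.exp(v))

def selective_scan(xs, a_diag, w_b, w_c, w_delta, b_delta, d=0.0):
    """Run Eq. (3) with the input-dependent B_t, C_t, Delta_t of Eq. (4).

    xs: scalar inputs; a_diag: diagonal of A (negative entries for stability).
    w_b, w_c, w_delta, b_delta: hypothetical projection weights.
    """
    n = len(a_diag)
    h = [0.0] * n
    ys = []
    for x in xs:
        delta = softplus(w_delta * x + b_delta)      # Delta_t > 0
        b_t = [w * x for w in w_b]                   # B_t = Linear_B(x_t)
        c_t = [w * x for w in w_c]                   # C_t = Linear_C(x_t)
        for i in range(n):
            a_bar = math.exp(delta * a_diag[i])      # A_bar = exp(Delta A)
            b_bar = delta * b_t[i]                   # B_bar ~= Delta B
            h[i] = a_bar * h[i] + b_bar * x          # state update
        ys.append(sum(c * hi for c, hi in zip(c_t, h)) + d * x)
    return ys

ys = selective_scan(
    xs=[1.0, 0.0, 0.5, -0.5],
    a_diag=[-1.0, -0.5],
    w_b=[0.3, -0.2],
    w_c=[0.5, 0.1],
    w_delta=1.0,
    b_delta=-2.0,
)
print(ys)
```

The loop visits each token once and keeps only a fixed-size state `h`, so the cost grows linearly with the sequence length.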

For 2D visual features, we employ the Selective Scan 2D (SS2D) mechanism as shown in Figure 2. The challenge in applying SSMs to images is that images are inherently 2D structures, while SSMs are designed for 1D sequences. SS2D addresses this by scanning image patches along multiple directions and aggregating the results.

Given a feature map

F \in R^{H \times W \times C}

, SS2D first flattens the spatial dimensions into sequences along four scanning directions:

Direction 1: Left-to-right, top-to-bottom (row-major order).
Direction 2: Right-to-left, bottom-to-top (reverse row-major).
Direction 3: Top-to-bottom, left-to-right (column-major order).
Direction 4: Bottom-to-top, right-to-left (reverse column-major).

Each directional sequence is processed by the selective SSM independently:

F_{o u t} = Merge (\sum_{d = 1}^{4} SSM ({Scan}_{d} (F)))

(5)

where

{Scan}_{d}

denotes scanning along direction d, SSM applies the selective state space model, and Merge reshapes the output back to 2D and aggregates the four directional outputs through summation followed by layer normalization.

This multi-directional scanning ensures that each spatial location can aggregate information from all directions, approximating the global receptive field of self-attention while maintaining linear complexity. The four directions provide complementary views of the spatial structure, with horizontal scans capturing row-wise dependencies and vertical scans capturing column-wise dependencies. The complete S6 processing flow is illustrated in Figure 3.
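The four scan orders and the merge step of Equation (5) can be sketched on a single-channel feature map as follows. The helper names `scan_orders` and `ss2d_merge` are illustrative, the default `ssm` is an identity placeholder for the selective SSM, and the layer normalization after summation is omitted for brevity.

```python
def scan_orders(height: int, width: int) -> list[list[int]]:
    """Return four index permutations over a row-major flattened H x W grid."""
    row_major = [r * width + c for r in range(height) for c in range(width)]
    col_major = [r * width + c for c in range(width) for r in range(height)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]

def ss2d_merge(flat, height, width, ssm=lambda seq: seq):
    """Scan in four directions, apply `ssm` to each sequence, scatter back, and sum."""
    out = [0.0] * (height * width)
    for order in scan_orders(height, width):
        processed = ssm([flat[i] for i in order])
        for pos, idx in enumerate(order):
            out[idx] += processed[pos]   # undo the scan permutation
    return out

orders = scan_orders(2, 3)
for d, order in enumerate(orders, start=1):
    print(f"Direction {d}: {order}")
```

With a causal `ssm` such as a running sum, every output position ends up depending on every input position once the four directions are summed, which is the sense in which the scan approximates a global receptive field.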

The MobileMamba backbone is organized into four stages with progressively increasing channel dimensions and decreasing spatial resolutions. Each stage consists of multiple SS2D blocks with residual connections. We extract features from the last three stages as

{P_{3}, P_{4}, P_{5}}

with strides

{8, 16, 32}

and channel dimensions

{256, 512, 1024}

respectively. These multi-scale features provide a rich representation suitable for detecting objects across the extreme scale range present in soccer videos.

3.3. Semantic-Aware Dynamic Feature Fusion Module

Multi-scale feature fusion is critical for detecting objects with varying sizes in soccer videos, where players may span hundreds of pixels while the ball occupies merely 10–30 pixels. Traditional feature pyramid networks use fixed fusion strategies (e.g., element-wise addition) that treat all spatial locations and channels equally. However, the optimal fusion strategy may vary depending on the scene content and the objects present.

We incorporate the Semantic-aware Dynamic Feature Fusion Module (SDFM) [21] to achieve adaptive feature aggregation based on semantic content. SDFM learns to dynamically adjust fusion weights through a channel attention mechanism, enabling content-aware information flow between different resolution levels.

As illustrated in Figure 4, given features from two adjacent scales

F_{h i g h} \in R^{C \times H \times W}

(higher resolution, lower semantic level) and

F_{l o w} \in R^{C \times H^{'} \times W^{'}}

(lower resolution, higher semantic level), SDFM first aligns their spatial dimensions through bilinear interpolation:

F_{l o w}^{u p} = Upsample (F_{l o w}, (H, W))

(6)

The aligned features are concatenated along the channel dimension to form a joint representation:

F_{c a t} = Concat (F_{h i g h}, F_{l o w}^{u p}) \in R^{2 C \times H \times W}

(7)

Channel attention weights are computed through a squeeze-and-excitation style mechanism:

z = GAP (F_{c a t}) \in R^{2 C}

(8)

w_{c} = σ ({Conv}_{1 \times 1} (BN ({Conv}_{1 \times 1} (z)))) \in R^{2 C}

(9)

where GAP denotes Global Average Pooling that compresses spatial dimensions,

σ

is the Sigmoid function that normalizes weights to

[0, 1]

, and BN represents Batch Normalization for training stability. The two

1 \times 1

convolutions form a bottleneck that first reduces dimensionality for efficiency and then restores it.

The attention weights are split into two parts corresponding to the two input features:

w_{high}, w_{low} = Split (w_{c}), \quad w_{high}, w_{low} \in R^{C}

(10)

The fused feature is obtained through adaptive weighting:

F_{f u} = w \cdot (w_{h i g h} ⊙ F_{h i g h}) + (1 - w) \cdot (w_{l o w} ⊙ F_{l o w}^{u p})

(11)

where

w \in [0, 1]

is a learnable scalar parameter initialized to 0.5, ⊙ denotes channel-wise multiplication (broadcasting the channel weights across spatial dimensions), and the complementary weighting

(1 - w)

ensures that the total contribution is normalized.

The learnable weight w allows the network to balance the contribution of high-resolution details against high-level semantics. Because w is a single scalar shared across all inputs, it settles during training on the balance that best serves the overall detection objective: a larger w emphasizes high-resolution features, which favors small objects (e.g., balls), while a smaller w leans on semantic features, which favors large objects (e.g., players). The input-dependent part of the fusion comes from the channel weights, which are recomputed for every image.
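The fusion of Equations (7)–(11) can be sketched on nested lists as follows. This is a simplified illustration under stated assumptions: the Conv-BN-Conv bottleneck of Equation (9) is replaced by a plain sigmoid on the pooled vector, the low-resolution input is assumed to be upsampled already, and the function names are hypothetical.

```python
import math

def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))

def global_avg_pool(feat):
    """feat: C channels, each an H x W nested list -> list of C channel means (Eq. 8)."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]

def sdfm_fuse(f_high, f_low_up, w=0.5):
    """Fuse two aligned C x H x W features following Eqs. (7)-(11).

    Simplification: the bottleneck of Eq. (9) is replaced by a sigmoid on the
    pooled vector, so no learned weights are involved in this sketch.
    """
    c = len(f_high)
    z = global_avg_pool(f_high) + global_avg_pool(f_low_up)   # length 2C
    w_c = [sigmoid(v) for v in z]
    w_high, w_low = w_c[:c], w_c[c:]                           # Eq. (10)
    return [
        [
            [
                w * w_high[k] * hv + (1.0 - w) * w_low[k] * lv  # Eq. (11)
                for hv, lv in zip(h_row, l_row)
            ]
            for h_row, l_row in zip(f_high[k], f_low_up[k])
        ]
        for k in range(c)
    ]

f_high = [[[2.0, 2.0], [2.0, 2.0]]]      # C=1, H=W=2
f_low_up = [[[0.0, 0.0], [0.0, 0.0]]]    # already upsampled to H x W
fused = sdfm_fuse(f_high, f_low_up)
print(fused)
```

The channel weights are recomputed from each input, whereas the scalar `w` stays fixed once training has finished.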

We apply SDFM in a top-down manner, progressively fusing features from

P_{5}

to

P_{4}

and then from the fused

P_{4}

to

P_{3}

. This hierarchical fusion ensures that high-level semantic information is propagated to all scales while preserving the spatial details in high-resolution features.

3.4. Spatial-Channel Synergistic Attention

To further enhance feature representations, we integrate the Spatial-Channel Synergistic Attention (SCSA) [22] module, which explores the synergistic effects between spatial and channel attention mechanisms. Unlike previous approaches that apply spatial and channel attention sequentially or in parallel without interaction, SCSA enables the two attention types to inform and enhance each other.

As shown in Figure 5, SCSA comprises two components: Shareable Multi-Semantic Spatial Attention (SMSA) and Progressive Channel-wise Self-Attention (PCSA).

Shareable Multi-Semantic Spatial Attention (SMSA) captures multi-scale spatial information through parallel depthwise convolutions with varying kernel sizes. The key insight is that objects of different scales require different receptive fields for effective attention computation. SMSA addresses this by applying several kernel sizes in parallel, as follows.

Given input features

F \in R^{B \times C \times H \times W}

, SMSA first splits the channels into groups and applies average pooling along different spatial axes to reduce the computational cost:

F_{h} = {AvgPool}_{W} (F) \in R^{B \times C \times H \times 1}

(12)

F_{w} = {AvgPool}_{H} (F) \in R^{B \times C \times 1 \times W}

(13)

Multi-scale spatial features are extracted through parallel depthwise 1D convolutions:

S_{k} = {DWConv 1 d}_{k} (F_{h}) \oplus {DWConv 1 d}_{k} (F_{w}), k \in {3, 5, 7, 9}

(14)

where ⊕ denotes broadcasting and element-wise addition to restore the 2D spatial structure.

The multi-scale features are concatenated and normalized:

S = σ (GroupNorm (Concat (S_{3}, S_{5}, S_{7}, S_{9})))

(15)

The use of multiple kernel sizes (3, 5, 7, 9) enables SMSA to capture spatial patterns at different scales: small kernels focus on fine-grained details suitable for small objects, while large kernels capture broader context for large objects.

Progressive Channel-wise Self-Attention (PCSA) computes channel-wise dependencies through an efficient self-attention mechanism. Unlike standard self-attention, which operates on spatial tokens, PCSA treats channels as tokens and computes attention across the channel dimension:

\begin{matrix} Q & = {DWConv}_{1 \times 1} (GroupNorm (AvgPool (F))) \in R^{B \times C \times 1 \times 1} \\ K & = {DWConv}_{1 \times 1} (GroupNorm (AvgPool (F))) \in R^{B \times C \times 1 \times 1} \\ V & = {DWConv}_{1 \times 1} (F) \in R^{B \times C \times H \times W} \end{matrix}

(16)

The channel attention is computed through element-wise operations:

C = σ (AvgPool (Q ⊙ K)) ⊙ V

(17)

This formulation is more efficient than standard self-attention because it avoids computing the full attention matrix. The element-wise product

Q ⊙ K

captures channel-wise correlations, and the subsequent average pooling and sigmoid produce channel attention weights.

The final output combines spatial and channel attention synergistically:

F_{o u t} = S ⊙ C ⊙ F

(18)

The element-wise multiplication of spatial attention

S

, channel attention

C

, and input features

F

enables the two attention types to jointly modulate the features. Spatial attention identifies where to focus, channel attention determines which features are important, and their combination provides a comprehensive attention mechanism that enhances feature discriminability for detection.

We apply SCSA after each SDFM fusion stage, enhancing the fused multi-scale features before they are processed by the transformer encoder. This placement ensures that the encoder receives well-refined features with enhanced spatial and channel discriminability.

3.5. Efficient Transformer Encoder

Following the feature enhancement by SDFM and SCSA, we employ an Efficient Transformer Encoder adapted from RT-DETR [14] to capture global dependencies within each scale. The encoder is designed to balance computational efficiency with representational power.

The encoder consists of L Transformer layers, each containing multi-head self-attention (MHSA) and feed-forward network (FFN):

\begin{matrix} F^{'} & = MHSA (LN (F)) + F \\ F_{o u t} & = FFN (LN (F^{'})) + F^{'} \end{matrix}

(19)

where LN denotes Layer Normalization and the residual connections facilitate gradient flow.

To reduce computational cost, we apply the encoder only to the highest-resolution feature map

P_{3}

and use lightweight cross-scale fusion for other scales. This design is motivated by the observation that high-resolution features benefit most from global context modeling, while lower-resolution features already have large receptive fields from the backbone.

The multi-head self-attention is computed as:

MHSA (F) = Concat ({head}_{1}, \dots, {head}_{h}) W^{O}

(20)

{head}_{i} = softmax (\frac{Q_{i} K_{i}^{⊤}}{\sqrt{d_{k}}}) V_{i}

(21)

where

Q_{i} = F W_{i}^{Q}

,

K_{i} = F W_{i}^{K}

,

V_{i} = F W_{i}^{V}

are the query, key, and value projections for head i, and

d_{k}

is the dimension per head.
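For reference, a single attention head from Equation (21) can be written in a few lines of plain Python. This sketch takes the projected matrices Q, K, and V as given (the projections and the output matrix W^O of Equation (20) are omitted), and the toy inputs are illustrative only.

```python
import math

def softmax(row):
    m = max(row)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(q, k, v):
    """Eq. (21): softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v)) for j in range(len(v[0]))])
    return out

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [1.0, 0.0]]      # identical keys -> uniform attention weights
v = [[2.0, 0.0], [0.0, 4.0]]
print(attention_head(q, k, v))
```

Each query is scored against every key, so the number of scores grows with the square of the number of tokens.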

3.6. Uncertainty-Minimal Query Selection

Following the encoder, we employ Uncertainty-minimal Query Selection [14] to identify the most informative feature locations as initial object queries. This selection mechanism replaces the learned object queries in the original DETR with content-aware queries derived from the encoder output.

For each spatial location in the encoder output, we compute a classification score and localization uncertainty:

s = sigmoid ({Linear}_{c l s} (F_{e n c})) \in R^{H \times W \times K}

(22)

u = {Linear}_{l o c} (F_{e n c}) \in R^{H \times W \times 4}

(23)

where K is the number of classes and the four-dimensional output represents bounding box coordinates.

The selection score combines classification confidence and localization certainty:

score = max_{k} (s_{k}) \cdot exp (- {∥ u ∥}_{2})

(24)

The top-N locations with highest scores are selected as initial queries:

Q_{i n i t} = TopK (F_{e n c}, score, N)

(25)

This selection mechanism ensures that the decoder focuses on the most promising locations, improving both accuracy and efficiency compared to using fixed learned queries.
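The scoring and top-N selection of Equations (24) and (25) can be sketched as follows. The class probabilities, localization outputs, and feature placeholders are made-up values for three locations, and the helper names are hypothetical.

```python
import math

def selection_scores(cls_probs, loc_outputs):
    """Eq. (24): max class probability times exp(-L2 norm of the localization output)."""
    scores = []
    for s, u in zip(cls_probs, loc_outputs):
        norm = math.sqrt(sum(x * x for x in u))
        scores.append(max(s) * math.exp(-norm))
    return scores

def select_top_n(features, scores, n):
    """Eq. (25): keep the n feature vectors with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [features[i] for i in ranked[:n]], ranked[:n]

# Three hypothetical locations, K=2 classes (values are illustrative only).
cls_probs = [[0.9, 0.1], [0.6, 0.2], [0.3, 0.8]]
loc_outputs = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.1, 0.0, 0.0, 0.0]]
features = ["f0", "f1", "f2"]
scores = selection_scores(cls_probs, loc_outputs)
queries, indices = select_top_n(features, scores, n=2)
print(indices, queries)
```

In this toy case the second location is dropped despite a moderate class score, because its large localization term suppresses the combined score.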

3.7. Separable Dynamic Decoder

Traditional transformer decoders employ multi-head cross-attention to enable object queries to attend to encoder features. Given N object queries and M feature tokens, the cross-attention complexity is

O (N \cdot M)

, which becomes a bottleneck when processing high-resolution features.

We adopt the Separable Dynamic Decoder [23] that replaces cross-attention with dynamic convolution attention, reducing complexity to

O (N + M)

. The key insight is that cross-attention can be approximated by query-specific convolutions applied to the feature map.

As illustrated in Figure 6, the decoder operates in three stages: Pre-Attention, Dynamic Convolution Attention, and Post-Attention.

Pre-Attention Stage: Proposal kernels derived from the selected queries interact with aggregated features through 2D dynamic convolution:

F_{b o x} = F_{a g g} * K_{p r o p}

(26)

where ∗ denotes dynamic convolution with query-specific kernels

K_{p r o p} \in R^{N \times k \times k \times C}

. Each query generates its own convolution kernel, enabling content-specific feature extraction.

Dynamic Convolution Attention (DyConvAtten): This module replaces the standard cross-attention mechanism. Given queries

Q \in R^{N \times d}

and feature values

V \in R^{H \times W \times d}

, the attention is computed as:

DyConvAtten (Q, V) = (V * r (Q W^{d})) * r (Q W^{p})

(27)

where

r (\cdot)

denotes a reshape operation that converts the linear projection output into convolution kernel format,

W^{d} \in R^{d \times (k^{2} \cdot g)}

generates depth-wise convolution kernels with g groups, and

W^{p} \in R^{d \times (d \cdot 1 \cdot 1)}

generates point-wise convolution kernels.

This separable formulation decomposes the attention into two sequential convolutions:

\begin{matrix} K^{d} & = r (Q W^{d}) \in R^{N \times g \times k \times k} \\ K^{p} & = r (Q W^{p}) \in R^{N \times d \times 1 \times 1} \\ F_{depth} & = V * K^{d} \quad \text{(depth-wise convolution)} \\ F_{out} & = F_{depth} * K^{p} \quad \text{(point-wise convolution)} \end{matrix}

(28)

The depth-wise convolution captures spatial patterns with query-specific kernels, while the point-wise convolution mixes channel information. This separable design reduces the parameter count and computational cost while maintaining expressive power.

Post-Attention Stage: Standard Multi-Head Self-Attention (MHSA) and Feed-Forward Network (FFN) refine the queries:

\begin{matrix} Q^{'} & = MHSA (Q + F_{o u t}) + Q + F_{o u t} \\ Q_{o u t} & = FFN (Q^{'}) + Q^{'} \end{matrix}

(29)

The self-attention among queries enables them to exchange information and avoid duplicate detections, while the FFN provides non-linear transformation for feature refinement.

The decoder is applied iteratively for

L_{d e c}

layers, with each layer refining the queries based on the encoder features. The final queries are used to predict bounding boxes and class probabilities through linear projection heads.

3.8. Scale-Aware Focal Loss

Soccer detection faces two types of imbalance: class imbalance (background vs. foreground) and scale imbalance (large players vs. small balls). Standard focal loss addresses class imbalance but treats all positive samples equally, regardless of their scale. We propose Scale-Aware Focal Loss (SAFL) that addresses both challenges simultaneously.

The standard Focal Loss [24] is defined as:

L_{F L} (p_{t}) = - α_{t} {(1 - p_{t})}^{γ} log (p_{t})

(30)

where

p_{t}

is the predicted probability for the ground-truth class,

α_{t}

is the class balancing factor (which is typically set higher for rare classes), and

γ

is the focusing parameter that down-weights easy examples. The term

{(1 - p_{t})}^{γ}

reduces the loss contribution from well-classified examples, focusing training on hard examples.

We extend focal loss with scale-aware weighting that up-weights small objects:

w_{s c a l e} = {(\frac{s_{m a x}}{s + ϵ})}^{β}

(31)

where

s = \sqrt{w \cdot h}

is the object scale computed as the geometric mean of bounding box width w and height h,

s_{m a x}

is the maximum scale in the dataset (computed from training set statistics),

ϵ

is a small constant (set to 1.0) for numerical stability, and

β

controls the scale sensitivity.

The scale weight

w_{s c a l e}

decreases monotonically with object scale (it is exactly inversely proportional only when β = 1 and ϵ is negligible): small objects receive higher weights, encouraging the model to focus on these challenging cases. The exponent

β

controls the strength of this reweighting:

β = 0

recovers standard focal loss, while larger

β

values more aggressively up-weight small objects.

The Scale-Aware Focal Loss is formulated as:

L_{S A F L} (p_{t}, s) = - w_{s c a l e} \cdot α_{t} {(1 - p_{t})}^{γ} log (p_{t})

(32)
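A minimal scalar sketch of Equations (30)–(32) is given below. The defaults γ = 2.0, β = 0.5, and ϵ = 1.0 follow the settings reported in this paper; the box sizes, the predicted probability, and s_max = 300 are illustrative assumptions.

```python
import math

def scale_weight(box_w, box_h, s_max, beta=0.5, eps=1.0):
    """Eq. (31): (s_max / (s + eps))^beta with s = sqrt(w * h)."""
    s = math.sqrt(box_w * box_h)
    return (s_max / (s + eps)) ** beta

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Eq. (30)."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def safl(p_t, box_w, box_h, s_max, alpha_t=0.25, gamma=2.0, beta=0.5, eps=1.0):
    """Eq. (32): focal loss multiplied by the scale weight."""
    return scale_weight(box_w, box_h, s_max, beta, eps) * focal_loss(p_t, alpha_t, gamma)

# Illustrative boxes: a 20 x 20 ball and a 100 x 400 player, with s_max = 300.
ball = safl(p_t=0.6, box_w=20, box_h=20, s_max=300.0, alpha_t=0.5)
player = safl(p_t=0.6, box_w=100, box_h=400, s_max=300.0, alpha_t=0.25)
print(f"ball loss={ball:.4f}  player loss={player:.4f}")
```

Setting `beta=0.0` makes `scale_weight` return 1 for every box, so `safl` reduces to `focal_loss`.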

Formal Definition of the SAFL Weighting Function. The scale-aware weighting function

w_{s c a l e} : R^{+} \to R^{+}

is a monotonically decreasing function of object scale s, defined as:

w_{s c a l e} (s; s_{m a x}, β, ϵ) = {(\frac{s_{m a x}}{s + ϵ})}^{β}, s > 0, β \geq 0, ϵ > 0

(33)

where the domain is restricted to positive object scales. The function satisfies the following properties: (i)

w_{s c a l e}

is continuous and differentiable for all

s > 0

; (ii)

{lim}_{s \to 0^{+}} w_{s c a l e} = {(s_{m a x} / ϵ)}^{β}

, providing a bounded maximum weight for the smallest objects; (iii)

w_{s c a l e} (s_{m a x}) = {(s_{m a x} / (s_{m a x} + ϵ))}^{β} \approx 1

for

ϵ ≪ s_{m a x}

, ensuring that the largest objects receive approximately unit weight; (iv)

β = 0

recovers standard focal loss as

w_{s c a l e} \equiv 1

.

Gradient Behavior Analysis. To understand how SAFL modulates the training dynamics, we analyze the gradient of

L_{S A F L}

with respect to the model logit z (where

p_{t} = σ (z)

and

σ

is the sigmoid function):

\frac{\partial L_{SAFL}}{\partial z} = - w_{scale} \cdot α_{t} {(1 - p_{t})}^{γ} [(1 - p_{t}) - γ p_{t} log (p_{t})] \cdot sign (y)

(34)

where

y \in {- 1, + 1}

is the ground-truth label and

sign (y)

accounts for the direction of the gradient. The key observation is that the gradient magnitude is scaled by

w_{s c a l e}

, which decreases as the object scale s grows. For a small ball with

s = 20

pixels and a large player with

s = 200

pixels (assuming

s_{m a x} = 300

,

β = 0.5

,

ϵ = 1.0

), the gradient ratio is:

\frac{| \nabla_{z} L_{S A F L} |_{s = 20}}{| \nabla_{z} L_{S A F L} |_{s = 200}} = \frac{w_{s c a l e} (20)}{w_{s c a l e} (200)} = {(\frac{200 + 1}{20 + 1})}^{0.5} \approx 3.09

(35)

This means that the ball receives approximately 3× larger gradient signals than the player, effectively compensating for the inherent difficulty of detecting small objects. The exponent

β

controls this amplification: larger

β

values produce more aggressive gradient amplification for small objects, while

β = 0

provides uniform gradients regardless of scale.
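The gradient scaling can be verified numerically. The sketch below applies the chain rule to Equation (32) with p_t = σ(z) for a positive sample, which gives the closed form −w_scale · α_t (1 − p_t)^γ [(1 − p_t) − γ p_t log(p_t)], and compares it with a central finite difference. The logit value and the difference step are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def w_scale(s, s_max=300.0, beta=0.5, eps=1.0):
    return (s_max / (s + eps)) ** beta

def safl_from_logit(z, s, alpha_t=0.25, gamma=2.0):
    p = sigmoid(z)
    return -w_scale(s) * alpha_t * (1.0 - p) ** gamma * math.log(p)

def safl_grad(z, s, alpha_t=0.25, gamma=2.0):
    """Chain-rule gradient of the loss w.r.t. the logit, for a positive sample."""
    p = sigmoid(z)
    return -w_scale(s) * alpha_t * (1.0 - p) ** gamma * ((1.0 - p) - gamma * p * math.log(p))

z, h = 0.3, 1e-6
numeric = (safl_from_logit(z + h, 20.0) - safl_from_logit(z - h, 20.0)) / (2 * h)
ratio = safl_grad(z, 20.0) / safl_grad(z, 200.0)
print(f"analytic={safl_grad(z, 20.0):.6f}  numeric={numeric:.6f}")
print(f"gradient ratio (s=20 vs s=200): {ratio:.2f}")
```

The ratio does not depend on the logit, since the two gradients differ only in the factor w_scale.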

For bounding box regression, we employ the Complete IoU (CIoU) loss [48] that considers overlap area, center distance, and aspect ratio:

L_{C I o U} = 1 - I o U + \frac{ρ^{2} (b, b^{g t})}{c^{2}} + α v

(36)

where

I o U

is the Intersection over Union between predicted and ground-truth boxes,

ρ (b, b^{g t})

is the Euclidean distance between box centers, c is the diagonal length of the smallest enclosing box covering both boxes, and

α

, v are the aspect-ratio terms (v measures aspect-ratio consistency and α is its trade-off weight):

v = \frac{4}{π^{2}} {(arctan \frac{w^{g t}}{h^{g t}} - arctan \frac{w}{h})}^{2}

(37)

α = \frac{v}{(1 - I o U) + v}

(38)

The aspect ratio term v penalizes predictions with incorrect aspect ratios, which is important for accurately localizing elongated objects like standing players.
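Equations (36)–(38) can be sketched for a single pair of boxes as follows. The (x1, y1, x2, y2) corner format and the guard for identical boxes are assumptions of this sketch rather than details stated above.

```python
import math

def ciou_loss(box, box_gt):
    """Eqs. (36)-(38) for boxes given as (x1, y1, x2, y2) with positive width and height."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = box_gt
    w, h = x2 - x1, y2 - y1
    w_gt, h_gt = gx2 - gx1, gy2 - gy1

    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + w_gt * h_gt - inter
    iou = inter / union

    # Squared centre distance over the squared diagonal of the smallest enclosing box.
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2

    v = 4 / math.pi ** 2 * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    denom = (1.0 - iou) + v
    alpha = v / denom if denom > 0 else 0.0   # identical boxes: v = 0 and IoU = 1
    return 1.0 - iou + rho2 / c2 + alpha * v

print(ciou_loss((0, 0, 10, 20), (0, 0, 10, 20)))   # perfect overlap
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))       # disjoint boxes
```

For disjoint boxes the IoU term saturates at 1, and the centre-distance term still provides a signal that depends on how far apart the boxes are.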

The total loss function combines classification and regression losses:

L_{t o t a l} = λ_{c l s} L_{S A F L} + λ_{b o x} L_{C I o U} + λ_{g i o u} L_{G I o U} + λ_{a u x} L_{a u x}

(39)

where

λ_{c l s} = 2.0

,

λ_{b o x} = 5.0

,

λ_{g i o u} = 2.0

, and

λ_{a u x} = 1.0

are balancing coefficients determined through validation.

L_{G I o U}

is the Generalized IoU loss [49] that provides additional supervision for non-overlapping boxes, and

L_{a u x}

represents auxiliary losses from intermediate decoder layers that provide deep supervision.

3.9. Training Algorithm

The complete training procedure of SoccerDETR is summarized in Algorithm 1. The algorithm follows the standard end-to-end detection training paradigm with Hungarian matching for label assignment.

The Hungarian matching algorithm finds the optimal bipartite assignment between predictions and ground-truth objects by minimizing a cost matrix that combines classification and localization costs:

C_{m a t c h} = λ_{c l s} C_{c l s} + λ_{b o x} C_{L 1} + λ_{g i o u} C_{G I o U}

(40)

where

C_{c l s}

is the classification cost (negative log probability),

C_{L 1}

is the L1 distance between predicted and ground-truth box coordinates, and

C_{G I o U}

is the negative GIoU.
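To make the matching step concrete, the sketch below finds the minimum-cost assignment by exhaustive search over a tiny cost matrix. It is a stand-in for the Hungarian algorithm, which reaches the same optimum in polynomial time; the cost values are hypothetical and would in practice come from Equation (40).

```python
from itertools import permutations

def optimal_assignment(cost):
    """Brute-force minimum-cost matching of ground-truth objects to predictions.

    cost[i][j] is the matching cost between prediction i and ground-truth j.
    Exhaustive search is only viable for tiny examples and assumes there are
    at least as many predictions as ground-truth objects.
    """
    num_pred, num_gt = len(cost), len(cost[0])
    best_total, best_pairs = float("inf"), []
    for pred_ids in permutations(range(num_pred), num_gt):
        total = sum(cost[p][g] for g, p in enumerate(pred_ids))
        if total < best_total:
            best_total = total
            best_pairs = [(p, g) for g, p in enumerate(pred_ids)]
    return best_pairs, best_total

# Hypothetical combined costs for 3 predictions and 2 ground-truth objects.
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
pairs, total = optimal_assignment(cost)
print(pairs, total)
```

Predictions left unmatched (here the third one) are treated as background when the classification loss is computed.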

Algorithm 1 SoccerDETR training procedure

Require: Training dataset D = {(I_i, Y_i)}_{i=1}^{N}, learning rate η, epochs E, batch size B
Ensure: Trained model parameters Θ
1: Initialize backbone with ImageNet weights; initialize other modules with Xavier
2: Set loss weights and SAFL parameters; initialize AdamW optimizer
3: for epoch = 1 to E do
4:   for each mini-batch {(I_j, Y_j)}_{j=1}^{B} do
5:     I_j ← Augment(I_j)   ▷ Mosaic, flip, color jitter
6:     {P_3, P_4, P_5} ← MobileMamba(I_j)   ▷ Backbone
7:     F_fused ← SDFM(P_3, P_4, P_5)   ▷ Multi-scale fusion
8:     F_enhanced ← SCSA(F_fused)   ▷ Feature enhancement
9:     F_enc ← Encoder(F_enhanced)
10:     Q_init ← QuerySelection(F_enc, N_query)
11:     \hat{b}, \hat{c} ← Decoder(Q_init, F_enc)   ▷ Iterative refinement
12:     π^* ← HungarianMatch(\hat{b}, \hat{c}, Y_j)
13:     L_total ← λ_cls L_SAFL + λ_box L_CIoU + λ_giou L_GIoU + λ_aux L_aux
14:     Θ ← Θ − η ∇_Θ L_total
15:   end for
16:   Adjust learning rate; validate and save checkpoint periodically
17: end for
18: return Θ

4. Experiments

4.1. Datasets

We evaluate SoccerDETR on two public soccer detection datasets that represent different aspects of the soccer detection challenge:

Soccana Player Ball Detection Dataset: This dataset contains 11,673 images collected from professional soccer matches across multiple leagues and seasons. The dataset includes three object categories: Player (0), Ball (1), and Referee (2). The images exhibit diverse scenarios including different camera angles (wide shots, medium shots, close-ups), varying lighting conditions (day games, night games under floodlights), different weather conditions, and varying player densities, from sparse counterattacks to crowded set pieces.

The dataset statistics reveal the scale imbalance challenge: players have an average bounding box area of 8500 pixels² (ranging from 500 to 50,000), while balls average only 450 pixels² (ranging from 100 to 2000). Referees fall in between with an average of 7200 pixels². This 19:1 ratio in average area between players and balls (about 4.3:1 in linear scale) motivates our scale-aware loss design.

We follow the standard 80%/10%/10% split for training (9338 images), validation (1167 images), and testing (1168 images). The splits are stratified to ensure similar class distributions across sets.

SoccerNet Dataset [2]: Originally designed for action spotting in soccer videos, the detection subset contains 19,786 annotated frames extracted from broadcast footage of professional matches. The dataset includes two categories: Ball (0) and Person (1), where Person encompasses both players and referees.

SoccerNet presents additional challenges compared to Soccana: (1) broadcast overlays including scoreboards, team logos, and replay indicators that may occlude objects; (2) varying video quality from different broadcasters; (3) more extreme scale variations due to the inclusion of wide-angle shots covering the entire field; and (4) motion blur from fast camera pans during exciting moments.

The dataset is split into training (15,829 frames), validation (1978 frames), and testing (1979 frames) following the official benchmark protocol. This larger dataset enables evaluation of model scalability and generalization.

4.2. Implementation Details

Our implementation is based on PyTorch 2.0 and the MMDetection framework. Training is conducted on 2 NVIDIA RTX 4090 GPUs with 24 GB memory each, using DistributedDataParallel for multi-GPU training.

AI-Assisted Writing Tools: In the preparation of this manuscript, we utilized several AI-assisted tools for language enhancement: ChatGPT (OpenAI, San Francisco, CA, USA, GPT-3.5/GPT-4) for improving sentence structure and clarity, DeepL (DeepL SE, Cologne, Germany) for translation refinement, and Grammarly Premium (Grammarly Inc., San Francisco, CA, USA) for grammar and style checking. These tools were used solely for language polishing and expression improvement. All AI-generated content was thoroughly reviewed, edited, and validated by the authors to ensure scientific accuracy and appropriateness.

Backbone Configuration: The MobileMamba backbone is initialized with ImageNet-1K pretrained weights. We use the MobileMamba-B2 variant with 4 stages, outputting features with channel dimensions [64, 128, 320, 512] at strides [4, 8, 16, 32]. We extract

{P_{3}, P_{4}, P_{5}}

from the last three stages with strides

{8, 16, 32}

.

Encoder Configuration: The Efficient Transformer Encoder consists of 1 layer with 8 attention heads and hidden dimension 256. We use pre-normalization and GELU activation in the FFN.

Decoder Configuration: The Separable Dynamic Decoder has 6 layers with 8 attention heads. The dynamic convolution kernel size is set to

3 \times 3

with 8 groups for depth-wise convolution. The number of object queries is set to 300.

Training Configuration: We use AdamW optimizer [50] with initial learning rate

1 \times 10^{- 4}

, weight decay

1 \times 10^{- 4}

, and

(β_{1}, β_{2}) = (0.9, 0.999)

. The learning rate follows a cosine annealing schedule with 500 warmup iterations where the learning rate linearly increases from

1 \times 10^{- 6}

to

1 \times 10^{- 4}

. Training is conducted for 300 epochs with a total batch size of 16 (8 per GPU across the 2 GPUs).
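The learning-rate schedule can be sketched as a function of the iteration index. The base rate, warmup start, and warmup length follow the values above; the final learning rate of zero, the per-iteration (rather than per-epoch) annealing, and the step count derived from the Soccana training split are assumptions made for this illustration.

```python
import math

def learning_rate(step, total_steps, base_lr=1e-4, warmup_start=1e-6,
                  warmup_steps=500, min_lr=0.0):
    """Linear warmup followed by cosine annealing.

    base_lr, warmup_start and warmup_steps follow the text; min_lr and
    total_steps are assumptions chosen for illustration.
    """
    if step < warmup_steps:
        frac = step / warmup_steps
        return warmup_start + frac * (base_lr - warmup_start)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# 300 epochs over 9338 Soccana training images at batch size 16 (last partial batch kept).
steps_per_epoch = math.ceil(9338 / 16)
total_steps = 300 * steps_per_epoch
for step in (0, 250, 500, total_steps // 2, total_steps):
    print(step, f"{learning_rate(step, total_steps):.2e}")
```

The warmup occupies less than one epoch under these assumptions, so almost the entire run follows the cosine curve.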

Data Augmentation: Input images are resized to $640 \times 640$ with the following augmentations:

Mosaic augmentation (probability 0.5): combines 4 images into one, increasing object diversity.
Random horizontal flip (probability 0.5).
Color jittering: brightness ±0.4, contrast ±0.4, saturation ±0.4, hue ±0.1.
Random scale: scale factor uniformly sampled from [0.5, 1.5].
Random crop: after scaling, crop to $640 \times 640$.

Loss Configuration: For SAFL, we set $γ = 2.0$ and $β = 0.5$, and compute $α_{t}$ based on inverse class frequency: $α_{\text{player}} = 0.25$, $α_{\text{ball}} = 0.5$, $α_{\text{referee}} = 0.25$ for Soccana; $α_{\text{person}} = 0.25$, $α_{\text{ball}} = 0.75$ for SoccerNet. The higher weight for ball reflects its lower frequency and higher detection difficulty.
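To make the role of $α_{t}$ and $γ$ concrete, the sketch below evaluates the standard class-weighted focal term, $-α_{t}(1 - p_{t})^{γ}\log p_{t}$, with the Soccana weights listed above. It deliberately omits the scale-aware factor controlled by $β$, whose definition is given in the method section and is not reproduced here, so it should not be read as the full SAFL. The function name is hypothetical.

```python
import math

GAMMA = 2.0  # focusing parameter (from the text)
ALPHA_SOCCANA = {"player": 0.25, "ball": 0.5, "referee": 0.25}  # from the text


def focal_term(p_true: float, cls: str, alpha=ALPHA_SOCCANA, gamma: float = GAMMA) -> float:
    """Class-weighted focal loss for a single prediction.

    `p_true` is the predicted probability of the ground-truth class.
    The scale-aware factor of SAFL is NOT included in this sketch.
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must lie in (0, 1]")
    return -alpha[cls] * (1.0 - p_true) ** gamma * math.log(p_true)


if __name__ == "__main__":
    for cls in ALPHA_SOCCANA:
        print(cls, focal_term(0.5, cls))
```

At equal confidence, a ball sample contributes twice the loss of a player sample under these weights, and well-classified samples are down-weighted by the $(1 - p_{t})^{γ}$ factor.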

Inference Configuration: During inference, we use a single scale of $640 \times 640$ without test-time augmentation. The confidence threshold is set to 0.3 for filtering low-confidence predictions. No NMS is required due to the end-to-end design. All speed measurements are conducted on a single NVIDIA RTX 4090 GPU (24 GB VRAM) (NVIDIA, Santa Clara, CA, USA) with an Intel Core i9-13900K CPU (Intel, Santa Clara, CA, USA), running CUDA 12.1 and PyTorch 2.0. We use FP32 precision for all experiments to ensure fair comparison across methods; FP16 mixed precision can further improve speed by approximately 20% but is not used in our reported results. We do not use TensorRT or ONNX optimization in the reported results to maintain consistency with baseline implementations. FPS is measured with a batch size of 1, averaged over 1000 iterations after 100 warmup iterations to ensure stable GPU frequency and eliminate cold-start effects. Latency is measured as end-to-end wall-clock time, including image preprocessing (resizing, normalization) and output postprocessing (confidence thresholding).

Baseline Training Configurations: To ensure fair comparison, all methods are trained with their officially recommended configurations on identical hardware using the same data splits and input resolution ($640 \times 640$). For CNN-based two-stage detectors (Faster R-CNN, Cascade R-CNN) and one-stage detectors (RetinaNet, FCOS), we train for 36 epochs using the SGD optimizer with learning rates of 0.02 and 0.01, respectively, with flip and scale augmentation. For YOLO-series methods (YOLOv7, YOLOv8-L, YOLOv9-E, YOLOv10-L), we train for 300 epochs using the SGD optimizer with learning rate 0.01, employing mosaic, flip, and HSV augmentation. For Transformer-based methods, DETR is trained for 500 epochs, Deformable DETR for 50 epochs, and DINO for 36 epochs, all using the AdamW optimizer with learning rates of around $1 \times 10^{-4}$ to $2 \times 10^{-4}$. RT-DETR-L and our SoccerDETR are both trained for 300 epochs using the AdamW optimizer with learning rate $1 \times 10^{-4}$ and mosaic, flip, and color augmentation.

4.3. Evaluation Metrics

We adopt standard COCO-style metrics [51] for comprehensive evaluation:

mAP@50: Mean Average Precision at IoU threshold 0.5, the primary metric for detection accuracy
mAP@50:95: Mean AP averaged over IoU thresholds from 0.5 to 0.95 with step 0.05, measuring localization precision
mAP@75: Mean AP at IoU threshold 0.75, emphasizing precise localization
${AP}_{S}$, ${AP}_{M}$, ${AP}_{L}$: AP for small (area $< 32^{2}$), medium ($32^{2} <$ area $< 96^{2}$), and large (area $> 96^{2}$) objects
Per-class AP: AP for each object category (Player, Ball, Referee)
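The quantities underlying these metrics can be illustrated with a short sketch: box IoU, the ten IoU thresholds used for mAP@50:95, and the area-based size buckets. This is not the evaluation code itself; the function names are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` in pixels, and the handling of areas exactly on a bucket boundary is an assumption.

```python
def iou(a, b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def size_bucket(area: float) -> str:
    """Assign an object to the small / medium / large bucket by pixel area."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:  # boundary handling is an assumption
        return "medium"
    return "large"


# IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP@50:95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

if __name__ == "__main__":
    print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
    print(size_bucket(15 * 15), size_bucket(50 * 50), size_bucket(100 * 100))
    print(IOU_THRESHOLDS)
```

A ball of 10 to 20 pixels per side has an area of at most 400 pixels and therefore always falls in the small bucket.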

We also report efficiency metrics:

FPS: Frames Per Second measured on a single GPU with batch size 1, averaged over 1000 iterations after 100 warmup iterations.
Params: Number of model parameters in millions.
GFLOPs: Giga Floating Point Operations for a single $640 \times 640$ input.
Latency: End-to-end inference latency in milliseconds.
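The timing protocol can be sketched as follows. This is a simplified harness under stated assumptions, not our benchmarking script: `run_once` is a hypothetical callable that processes one frame end to end (preprocessing, forward pass, postprocessing), and when it wraps GPU work it must block until the device has finished (for example by calling `torch.cuda.synchronize()`), otherwise the wall-clock time is underestimated.

```python
import time
from typing import Callable, Tuple


def measure_fps(run_once: Callable[[], object], warmup: int = 100, iters: int = 1000) -> Tuple[float, float]:
    """Return (frames per second, mean latency in ms) for a single-frame callable.

    Warmup iterations are executed first and excluded from the timing.
    """
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    if elapsed <= 0.0:
        raise RuntimeError("elapsed time too small to measure")
    return iters / elapsed, 1000.0 * elapsed / iters


if __name__ == "__main__":
    # Stand-in workload; replace with a real end-to-end inference call.
    fps, latency_ms = measure_fps(lambda: sum(range(10_000)), warmup=10, iters=100)
    print(f"{fps:.1f} FPS, {latency_ms:.3f} ms per frame")
```

By construction, the two returned values satisfy latency (ms) = 1000 / FPS, which is the relation linking the FPS and latency columns at batch size 1.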

4.4. Comparison with State-of-the-Art Methods

We compare SoccerDETR with 12 representative detection methods spanning four categories: two-stage detectors, one-stage detectors, YOLO-series detectors, and Transformer-based detectors. All methods are trained on the same datasets with their recommended configurations and evaluated under identical conditions.

As shown in Table 2, SoccerDETR achieves the best performance on the Soccana dataset with 94.2% mAP@50 and 67.8% mAP@50:95, outperforming the previous best method RT-DETR-L by 1.9% and 3.3%, respectively. The improvements are consistent across all object categories, with particularly notable gains on ball detection (88.4% vs. 85.2%, +3.2%). This demonstrates the effectiveness of our scale-aware design for small object detection.

The inference speed of 78 FPS surpasses RT-DETR-L (74 FPS) by 5.4%, validating the efficiency of our MobileMamba backbone and Separable Dynamic Decoder. Compared to SSD (45 FPS), SoccerDETR achieves 17.9% higher mAP@50 while being 73% faster, demonstrating an excellent accuracy-efficiency trade-off.
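The relative speed figures quoted above follow directly from the FPS values. The short sketch below reproduces the arithmetic; the helper name is hypothetical.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    print(round(relative_gain(78, 74), 1))  # SoccerDETR vs. RT-DETR-L
    print(round(relative_gain(78, 45)))     # SoccerDETR vs. SSD
```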

Compared to YOLO-series detectors, SoccerDETR shows consistent improvements across all metrics while maintaining competitive speed. The 3.0% improvement over YOLOv10-L in mAP@50 highlights the advantage of our end-to-end detection paradigm that eliminates NMS-induced errors. The improvement is more pronounced on ball detection (+5.3%), where NMS often fails due to the small size and potential overlap with player bounding boxes. To further illustrate the advantage of NMS-free detection in crowded scenes, we evaluated detection performance specifically on penalty area scenarios where player density is highest. On these challenging subsets, SoccerDETR achieves 91.8% mAP@50 compared to 87.2% for YOLOv10-L, a 4.6% improvement that is larger than the 3.0% gap on the full dataset. This confirms that end-to-end detection provides particular benefits in dense scenarios where NMS-based methods struggle with overlapping detections.

Two-stage detectors (Faster R-CNN, Cascade R-CNN) achieve reasonable accuracy but suffer from slow inference speed due to the region proposal stage. One-stage detectors (RetinaNet, FCOS) offer better speed but lag behind in accuracy, particularly for small objects.

Table 3 presents results on the more challenging SoccerNet dataset. SoccerDETR achieves 91.8% mAP@50, surpassing RT-DETR-L by 2.3%. The improvement is particularly pronounced for ball detection (84.7% vs. 81.4%, +3.3%), where our Scale-Aware Focal Loss effectively addresses the extreme scale imbalance. The consistent improvements across both datasets demonstrate the strong generalization capability of our approach.

The performance gap between Soccana and SoccerNet (94.2% vs. 91.8% mAP@50) reflects the additional challenges in SoccerNet, including broadcast overlays and more extreme scale variations. Nevertheless, SoccerDETR maintains its advantage over competing methods on both datasets. Figure 10 provides a visual comparison of mAP@50 across methods on both datasets.

Table 4 provides a detailed breakdown by object scale. SoccerDETR achieves the largest improvement on small objects (${AP}_{S}$: 72.4% vs. 65.8% for RT-DETR-L, +6.6%), validating the effectiveness of our scale-aware design. The improvements on medium and large objects are also significant (+3.7% and +1.3%, respectively), demonstrating that our approach benefits detection across all scales.

Figure 7 and Figure 8 present qualitative detection results on both datasets. SoccerDETR demonstrates robust detection across diverse scenarios. In penalty area situations with multiple overlapping players, SoccerDETR accurately detects individual players without false positives from NMS errors. Players at different distances from the camera are detected with consistent accuracy, from close-up shots to wide-angle views. The ball is accurately localized even when appearing as a small blob of 10–20 pixels, demonstrating the effectiveness of our scale-aware design. Partially occluded players and balls are detected with reasonable confidence, showing robustness to visual clutter. On SoccerNet, detections remain accurate despite scoreboards and logos that partially occlude the playing field.

4.5. Ablation Studies

We conduct comprehensive ablation studies on the Soccana dataset to analyze the contribution of each component and validate our design choices.

4.5.1. Component Analysis

Table 5 shows the incremental contribution of each component:

Effect of MobileMamba Backbone: Replacing the ResNet-50 backbone with MobileMamba improves mAP@50 by 1.6% while reducing parameters by 8.3% (42.0 M → 38.5 M) and increasing speed by 20.6% (68 → 82 FPS). This demonstrates that state space models can effectively capture visual features with superior efficiency compared to CNNs. The improvement is particularly notable for ball detection (+2.7%), suggesting that the global receptive field of SS2D helps detect small objects.

Effect of SDFM: Adding the Semantic-aware Dynamic Feature Fusion Module results in a 1.3% improvement in mAP@50 and 2.6% improvement in ball AP. The adaptive fusion mechanism enables more effective multi-scale feature aggregation compared to standard FPN, particularly benefiting small object detection where high-resolution features are crucial.

Effect of SCSA: The Spatial-Channel Synergistic Attention contributes 1.3% improvement in mAP@50 by enhancing feature discrimination through synergistic spatial and channel attention. The improvement is consistent across all object categories, indicating that SCSA provides general feature enhancement rather than scale-specific benefits.

Effect of Separable Dynamic Decoder: The decoder replacement improves mAP@50 by 0.6% while slightly reducing parameters (40.1 M → 39.8 M). The dynamic convolution attention provides effective query-feature interaction with lower computational cost than standard cross-attention.

Effect of Scale-Aware Focal Loss: The proposed SAFL provides 0.8% improvement in mAP@50 and 1.3% improvement in ball AP. This confirms that explicitly addressing scale imbalance through loss reweighting benefits small object detection.

The cumulative improvement from baseline to full model is 5.6% mAP@50 and 9.9% ball AP, demonstrating that all components contribute meaningfully to the final performance. Figure 9 visualizes the progressive improvement and component contributions across different metrics.
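As a consistency check, the per-component mAP@50 gains reported in this subsection can be summed and expressed as shares of the total. The sketch below only uses the increments stated in the text; the variable and function names are hypothetical.

```python
# mAP@50 gains (percentage points) attributed to each component in the text.
INCREMENTS = {
    "MobileMamba backbone": 1.6,
    "SDFM": 1.3,
    "SCSA": 1.3,
    "Separable Dynamic Decoder": 0.6,
    "SAFL": 0.8,
}


def cumulative_gain(increments=INCREMENTS) -> float:
    """Sum of the incremental gains, rounded to one decimal place."""
    return round(sum(increments.values()), 1)


def shares(increments=INCREMENTS) -> dict:
    """Fraction of the cumulative gain contributed by each component, in percent."""
    total = sum(increments.values())
    return {name: round(100.0 * value / total, 1) for name, value in increments.items()}


if __name__ == "__main__":
    print(cumulative_gain())
    print(shares())
```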

4.5.2. SAFL Parameter Analysis

Table 6 analyzes the sensitivity of SAFL parameters:

Effect of $β$ (scale sensitivity): Setting $β = 0$ (no scale awareness) degrades ball detection by 2.6% (88.4% → 85.8%), confirming the importance of scale-aware weighting. Increasing $β$ to 0.75 or 1.0 further improves ball AP but degrades player and referee detection, as the model over-focuses on small objects. The optimal $β = 0.5$ achieves the best balance.

Effect of $γ$ (focusing parameter): Lower $γ$ values (1.0, 1.5) reduce the focusing effect, leading to a slightly weaker overall performance. Higher $γ$ (3.0) over-focuses on hard examples, potentially ignoring easy but important examples. The standard $γ = 2.0$ provides the best trade-off.

4.5.3. Backbone Comparison

Table 7 compares different backbone architectures:

CNN backbones (ResNet, ConvNeXt): ResNet-50 provides a strong baseline but is limited by local receptive fields. Deeper ResNet-101 improves accuracy but significantly increases computation. ConvNeXt-T modernizes CNN design with larger kernels and achieves better accuracy than ResNet-50.

Transformer backbones (Swin, ViT): Swin Transformer achieves good accuracy through hierarchical design and shifted windows, but the quadratic complexity of attention limits its speed. ViT-B/16 has the highest parameter count and slowest speed due to global attention on all patches.

SSM backbones (VMamba, MobileMamba): VMamba-T demonstrates the potential of state space models for vision, achieving competitive accuracy with good efficiency. MobileMamba-B2 further improves both accuracy and efficiency through the MRFFI module and optimized architecture, achieving the best results across all metrics.

The comparison validates our choice of MobileMamba as the backbone: it achieves 2.4% higher mAP@50 than VMamba-T while being 8.3% faster, and 4.1% higher mAP@50 than Swin-T while being 50% faster.

4.5.4. Decoder Architecture Analysis

Table 8 compares different decoder architectures. The Separable Dynamic Decoder achieves the best accuracy while being the most efficient, validating the effectiveness of dynamic convolution attention for query–feature interaction.

4.5.5. Number of Queries Analysis

Table 9 shows the effect of the number of object queries. Performance improves as the number of queries increases from 100 to 300, then saturates. We use 300 queries as the default, balancing accuracy and efficiency.

4.6. Computational Complexity Analysis

We provide a detailed analysis of the computational complexity of SoccerDETR, covering theoretical complexity bounds, actual resource consumption, and per-component latency breakdown.

4.6.1. Hardware Setup

All efficiency measurements are conducted on a single NVIDIA RTX 4090 GPU (24 GB VRAM) with an Intel Core i9-13900K CPU (32 GB RAM), running CUDA 12.1 and PyTorch 2.0. FPS is measured with batch size 1 and input resolution $640 \times 640$, averaged over 1000 iterations after 100 warmup iterations to ensure stable measurements. FLOPs are computed using the fvcore library. Latency is measured as end-to-end wall-clock time, including pre-processing and post-processing.

4.6.2. Theoretical Complexity Comparison

Table 10 compares the computational complexity of different attention mechanisms:

Self-Attention: The quadratic complexity $O(N^{2} \cdot d)$ makes it prohibitive for high-resolution inputs. For $N = 1600$ tokens (640 × 640 image with 16 × 16 patches), this results in 2.56 M attention computations per layer.

Cross-Attention: The complexity $O(N \cdot M \cdot d)$ depends on both query count $N$ and feature token count $M$. With 300 queries and 1600 feature tokens, this results in 480 K attention computations.

Deformable Attention: By attending to only $K$ sampling points (typically $K = 4$), the complexity reduces to $O(N \cdot K \cdot d)$, significantly improving efficiency.

SS2D: The linear complexity $O(N \cdot d)$ enables efficient processing of long sequences. The four-directional scanning adds a constant factor of 4 but maintains linear scaling.

DyConvAtten: The separable design reduces complexity from multiplicative $O(N \cdot M)$ to additive $O(N + M)$, enabling efficient query–feature interaction.

The actual FLOPs measurements confirm the theoretical analysis: SS2D and DyConvAtten are significantly more efficient than standard attention mechanisms, enabling real-time processing of high-resolution soccer videos.
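The interaction counts quoted in this subsection can be reproduced with a short calculation. The sketch below counts pairwise interactions only and ignores the common factor $d$; the names are hypothetical, and for deformable attention and SS2D the count is taken over the 1600 feature tokens, which is an assumption about what $N$ denotes in those expressions.

```python
N_TOKENS = 1600       # feature tokens for a 640 x 640 input at stride 16 (from the text)
N_QUERIES = 300       # object queries (from the text)
K_POINTS = 4          # sampling points per token in deformable attention (from the text)
SCAN_DIRECTIONS = 4   # scanning directions in SS2D (from the text)


def interaction_counts(tokens: int = N_TOKENS, queries: int = N_QUERIES, k: int = K_POINTS) -> dict:
    """Interaction counts per layer for each mechanism, excluding the factor d."""
    return {
        "self_attention": tokens * tokens,        # quadratic in the token count
        "cross_attention": queries * tokens,      # every query attends to every token
        "deformable_attention": tokens * k,       # K sampled points per token (assumed)
        "ss2d": SCAN_DIRECTIONS * tokens,         # linear, times four scan directions
        "dyconv_atten": tokens + queries,         # additive separable design
    }


if __name__ == "__main__":
    for name, count in interaction_counts().items():
        print(f"{name}: {count:,}")
```

Under these assumptions, self-attention requires 2,560,000 interactions and cross-attention 480,000, matching the 2.56 M and 480 K figures above, while the linear and additive mechanisms stay in the low thousands.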

4.6.3. Overall Model Efficiency Comparison

Table 11 provides a comprehensive efficiency comparison. SoccerDETR achieves the lowest parameter count (39.8 M), lowest FLOPs (72.4 G), highest FPS (78), and lowest latency (12.8 ms) among all compared methods, while simultaneously achieving the highest mAP@50 (94.2%). Compared to RT-DETR-L, SoccerDETR reduces parameters by 5.2%, FLOPs by 16.0%, and latency by 5.2%, while improving mAP@50 by 1.9%. The efficiency gains are primarily attributed to the linear-complexity MobileMamba backbone (replacing the quadratic-complexity Transformer backbone) and the separable dynamic decoder (replacing standard cross-attention).

4.6.4. Per-Component Latency Breakdown

Table 12 provides a latency breakdown of SoccerDETR components. The backbone accounts for the largest portion (40.6%), followed by the decoder (21.9%) and encoder (16.4%). The SDFM and SCSA modules add only 14.1% overhead while providing significant accuracy improvements, demonstrating their efficiency.
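For reference, the percentage shares above can be converted to approximate per-component times using the 12.8 ms end-to-end latency reported in Section 4.6.3. The sketch below performs this conversion; the names are hypothetical, and the remaining share that the text does not itemize is reported as a single unattributed entry.

```python
TOTAL_LATENCY_MS = 12.8  # end-to-end latency of SoccerDETR (from the text)
SHARES_PCT = {           # latency shares (from the text)
    "backbone": 40.6,
    "decoder": 21.9,
    "encoder": 16.4,
    "SDFM + SCSA": 14.1,
}


def breakdown_ms(total: float = TOTAL_LATENCY_MS, shares=SHARES_PCT) -> dict:
    """Convert percentage shares to milliseconds and add the unattributed remainder."""
    ms = {name: total * pct / 100.0 for name, pct in shares.items()}
    ms["unattributed remainder"] = total * (100.0 - sum(shares.values())) / 100.0
    return ms


if __name__ == "__main__":
    for name, value in breakdown_ms().items():
        print(f"{name}: {value:.1f} ms")
```

The SDFM and SCSA share corresponds to about 1.8 ms per image, and the four listed shares sum to 93.0% of the total.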

4.7. Cross-Dataset Generalization

To evaluate the generalization capability of SoccerDETR, we conduct cross-dataset experiments where models are trained on one dataset and evaluated on the other.

Table 13 shows that SoccerDETR exhibits stronger cross-dataset generalization than competing methods. When trained on Soccana and tested on SoccerNet, SoccerDETR achieves 81.2% mAP@50, outperforming RT-DETR-L by 4.4%. This suggests that our model learns more transferable representations, likely due to the global receptive field of SS2D and the adaptive fusion of SDFM. Figure 10 provides a comprehensive visualization of multi-dimensional performance comparison and cross-dataset results.

Figure 10. Performance comparison visualization. (a) Multi-dimensional radar chart comparing SoccerDETR with competing methods across mAP@50, mAP@50:95, Precision, Recall, FPS, and parameter efficiency. (b) mAP@50 comparison across Soccana and SoccerNet datasets, demonstrating consistent improvements in SoccerDETR over all baselines.

4.8. Statistical Significance Testing

To ensure the reliability of our experimental results, we conduct statistical significance testing using paired t-tests between SoccerDETR and the strongest baseline (RT-DETR-L). Specifically, we train each model five times with different random seeds (seeds 0, 42, 123, 456, 789), and each training run constitutes one sample in the paired t-test. The pairing is based on the random seed, meaning we compare SoccerDETR trained with seed 0 against RT-DETR-L trained with seed 0, and so on. This design controls for random initialization effects and provides a fair comparison of the two architectures. The resulting five paired observations are used to compute the t-statistic and p-value. We report the mean and standard deviation of mAP across the five runs.
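The test statistic of the seed-paired design can be sketched with the Python standard library. The per-seed scores below are hypothetical placeholders chosen only to exercise the code; they are not results from this paper, and the function name is likewise hypothetical. The sketch returns the t-statistic and degrees of freedom; converting them to a p-value requires the Student t distribution, which a statistics package such as SciPy provides.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t-statistic and degrees of freedom for matched samples a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0.0:
        raise ValueError("differences have zero variance")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


if __name__ == "__main__":
    # Hypothetical scores for five seeds (placeholders, not measured values).
    model_a = [0.71, 0.74, 0.72, 0.75, 0.73]
    model_b = [0.69, 0.71, 0.70, 0.72, 0.70]
    t_stat, dof = paired_t(model_a, model_b)
    print(f"t = {t_stat:.2f}, degrees of freedom = {dof}")
```

With five seeds there are four degrees of freedom, so a two-sided test at the 0.001 level requires a t-statistic of roughly 8.6 or larger in absolute value.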

As shown in Table 14, all improvements of SoccerDETR over RT-DETR-L are statistically significant with $p < 0.001$, confirming that the observed performance gains are not due to random variation. The low standard deviations (0.22–0.33%) indicate that SoccerDETR training is stable across different random initializations.

4.9. Five-Fold Cross-Validation Analysis

To further validate the robustness of our results and mitigate potential bias from a single train/test split, we perform 5-fold cross-validation on the Soccana dataset. The dataset is divided into five non-overlapping folds at the image level with stratified sampling to maintain class distribution balance. Each fold uses 80% of images for training and 20% for testing, and we report the mean and standard deviation of mAP across the five folds. This provides an estimate of performance variance due to data splitting rather than random initialization.
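The fold construction can be illustrated with a simplified sketch. It partitions image indices into five disjoint folds after a seeded shuffle and yields the resulting train/test index lists; the class-stratified sampling described above is intentionally omitted, and the function name and seed are hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_splits(n_items: int, k: int = 5, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k non-overlapping folds.

    Simplification: indices are shuffled uniformly; no class stratification is applied.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for fold_id, (train, test) in enumerate(k_fold_splits(100)):
        print(fold_id, len(train), len(test))
```

Each image appears in exactly one test fold, so every fold trains on 80% of the images and tests on the remaining 20% when the dataset size is a multiple of five.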

Table 15 demonstrates that SoccerDETR achieves consistent performance across all five folds, with a mean mAP@50 of 94.1% and a standard deviation of only 0.23%. The per-class results are equally stable: ${AP}_{\text{Player}}$ varies by only 0.15%, ${AP}_{\text{Ball}}$ by 0.34%, and ${AP}_{\text{Referee}}$ by 0.20%. The cross-validation mean (94.1%) is consistent with the standard split result (94.2%), confirming that our reported results are representative and not an artifact of a favorable data split. The slightly higher variance in ball detection reflects the inherent difficulty and variability of small object detection across different subsets of the data.

4.10. Failure Case Analysis

Despite the strong overall performance, SoccerDETR has limitations in certain challenging scenarios. When the ball occupies fewer than 10 pixels in very wide-angle shots, detection accuracy drops significantly; this is a fundamental limitation of the input resolution rather than of the model architecture. During fast camera pans or powerful shots, motion blur can make the ball appear as an elongated streak that confuses the detector; temporal modeling could potentially address this limitation. Circular objects in the background, such as advertising logos and stadium lights, occasionally cause false positives, and additional context modeling could help distinguish the actual ball. When the ball is completely occluded by players, detection fails, which is expected because the model cannot detect objects that are not visible. These failure cases suggest directions for future improvement, including higher-resolution processing, temporal modeling, and enhanced context reasoning.

5. Discussion

Our experimental results demonstrate that SoccerDETR achieves state-of-the-art performance on soccer object detection while maintaining real-time inference speed, validating the effectiveness of integrating visual state space models with Transformer-based detection. The statistical significance testing ($p < 0.001$ for all metrics) and 5-fold cross-validation (standard deviation $\leq 0.35\%$) confirm the reliability and reproducibility of these results.

The success of the MobileMamba backbone confirms that state space models can effectively replace transformers for visual feature extraction, with the linear complexity of SS2D enabling efficient processing of high-resolution inputs essential for detecting small objects like balls. The global receptive field provided by multi-directional scanning captures comprehensive spatial dependencies analogous to transformer attention but with linear complexity. As shown in the computational complexity analysis (Table 11), SoccerDETR achieves the lowest FLOPs (72.4G) and highest FPS (78) among all compared methods, with the MobileMamba backbone contributing to a 16.0% reduction in FLOPs compared to RT-DETR-L.

Our Scale-Aware Focal Loss explicitly addresses the extreme scale variations in soccer detection (20–100× between players and balls). The gradient behavior analysis (Equation (34)) reveals that SAFL amplifies gradient signals for small objects by approximately 3× compared to large objects, effectively compensating for the inherent difficulty of detecting small balls. Ablation studies confirm that removing scale awareness ($β = 0$) degrades ball detection by 2.6%.

The combination of SDFM and SCSA provides complementary benefits through adaptive multi-scale fusion and synergistic spatial-channel attention, while adding only 14.1% latency overhead (1.8 ms per image). The systematic comparison in Table 1 highlights that SoccerDETR is the first method to combine an SSM backbone with adaptive multi-scale fusion and synergistic attention, filling a gap in the existing literature.

Cross-dataset experiments demonstrate stronger generalization than competing methods, attributed to the global receptive field learning transferable representations and the adaptive fusion adjusting to domain shifts.

Beyond soccer video analysis, the proposed SoccerDETR framework has potential applications in other computer vision inspection tasks that share similar challenges of multi-scale object detection with real-time requirements. First, our approach could be adapted for exterior cladding material detection in urban environments [57]. Similar to detecting players and balls at varying distances, identifying different cladding materials in street view images requires handling significant scale variations as buildings appear at different distances from the camera. The adaptive multi-scale fusion (SDFM) and scale-aware loss (SAFL) could be particularly beneficial for this task, enabling efficient processing of large-scale urban imagery. Second, the framework shows promise for safety inspection tasks such as non-PPE detection on construction sites [58]. This task involves detecting workers without proper safety equipment from both body-worn cameras and general surveillance footage. The challenges are analogous to soccer detection: workers appear at various scales, may be partially occluded, and real-time detection is crucial for immediate safety alerts. Our efficient backbone and scale-aware training strategy could enhance detection performance in such safety-critical applications. Third, the linear complexity of our MobileMamba backbone makes it suitable for processing high-resolution imagery in industrial quality inspection, where defects may be small relative to the overall image size, similar to ball detection in wide-angle soccer footage.

Several limitations suggest future directions: temporal modeling could improve detection consistency and handle motion blur; higher-resolution processing could address extremely small ball detection; domain adaptation techniques could enable transfer to other sports; and production deployment requires model quantization and hardware optimization.

6. Conclusions

We presented SoccerDETR, a novel real-time detection framework for soccer object detection that integrates visual state space models with Transformer-based detection. Our approach introduces four key innovations that collectively address the computational and accuracy challenges of soccer video analysis: a lightweight backbone with SS2D for linear-complexity feature extraction that captures global dependencies efficiently through multi-directional scanning; SDFM for semantic-aware multi-scale fusion that adaptively aggregates features based on content, enabling effective detection across extreme scale variations; SCSA for synergistic spatial-channel attention that enhances feature discriminability through collaborative spatial and channel attention mechanisms; and a Separable Dynamic Decoder for efficient query refinement that replaces quadratic cross-attention with linear dynamic convolution attention. Additionally, we proposed Scale-Aware Focal Loss to address the inherent scale imbalance in soccer scenarios, explicitly up-weighting small object losses to improve ball detection.

Extensive experiments on Soccana and SoccerNet datasets demonstrate that SoccerDETR achieves state-of-the-art performance with 94.2% and 91.8% mAP@50, respectively, while maintaining 78 FPS real-time inference speed on a single NVIDIA RTX 4090 GPU. The improvements are particularly notable for small object (ball) detection, where our scale-aware design provides 3.2–3.3% improvement over the previous best method. Comprehensive ablation studies validate the contribution of each proposed component, and cross-dataset experiments demonstrate strong generalization capability. Our work demonstrates the potential of visual state space models for efficient sports video object detection, showing that SSM-based backbones can effectively replace Transformers in detection frameworks while achieving competitive accuracy with improved efficiency.

Author Contributions

Conceptualization, D.Z. and Y.L.; methodology, D.Z.; software, D.Z.; validation, D.Z. and Y.L.; formal analysis, D.Z.; investigation, D.Z.; resources, Y.L.; data curation, D.Z.; writing—original draft preparation, D.Z.; writing—review and editing, Y.L.; visualization, D.Z.; supervision, Y.L.; project administration, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The Soccana Player Ball Detection Dataset is publicly available at https://huggingface.co/datasets/Adit-jain/Soccana_player_ball_detection_v1 (accessed on 15 January 2026). The SoccerNet Dataset is available at https://zenodo.org/records/7808511 (accessed on 15 January 2026).

Acknowledgments

We thank the creators of the Soccana and SoccerNet datasets for making their data publicly available. We acknowledge the use of AI-assisted tools including ChatGPT (OpenAI, v.5.2) and DeepL (DeepL SE, v.2025). All AI-generated content was carefully reviewed, edited, and validated by the authors to ensure accuracy and appropriateness.

Conflicts of Interest

Author Dongyang Zhou is employed by Xi’an Thermal Power Research Institute Co., Ltd. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DETR	DEtection TRansformer.
FPN	Feature Pyramid Network.
FPS	Frames Per Second.
mAP	mean Average Precision.
NMS	Non-Maximum Suppression.
SAFL	Scale-Aware Focal Loss.
SCSA	Spatial-Channel Synergistic Attention.
SDFM	Semantic-aware Dynamic Feature Fusion Module.
SS2D	Selective Scan 2D.
SSM	State Space Model.

References

FIFA. FIFA Big Count 2024: Global Football Participation Report. 2024. Available online: https://www.fifa.com/about-fifa/organisation/big-count (accessed on 15 January 2026).
Giancola, S.; Amine, M.; Dghaily, T.; Ghanem, B. Soccernet: A scalable dataset for action spotting in soccer videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1711–1721.
MarketsandMarkets. Sports Analytics Market by Component, Application, Sport Type, and Region—Global Forecast to 2027. 2023. Available online: https://www.marketsandmarkets.com/Market-Reports/sports-analytics-market-132532279.html (accessed on 15 January 2026).
Cioppa, A.; Giancola, S.; Deliege, A.; Kang, L.; Zhou, X.; Cheng, Z.; Ghanem, B.; Van Droogenbroeck, M. SoccerNet-Tracking: Multiple object tracking dataset and benchmark in soccer videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3491–3502.
Kamble, P.R.; Keskar, A.G.; Bhurchandi, K.M. A deep learning ball tracking system in soccer videos. Opto-Electron. Rev. 2019, 27, 58–69.
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458.
Wang, C.Y.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616.
Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 15 January 2026).
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision; Springer: Glasgow, UK, 2020; pp. 213–229.
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022.
Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569.
Hosang, J.; Benenson, R.; Schiele, B. Learning non-maximum suppression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4507–4515.
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 16965–16974.
Lv, C.; Lv, X.l.; Wang, Z.; Zhao, T.; Tian, W.; Zhou, Q.; Zeng, L.; Wan, M.; Liu, C. A focal quotient gradient system method for deep neural network training. Appl. Soft Comput. 2025, 184, 113704.
Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; pp. 1–15.
Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063.
Shen, Q.; Zhang, J. AI-Enhanced Disaster Risk Prediction with Explainable SHAP Analysis: A Multi-Class Classification Approach Using XGBoost. Res. Sq. 2025.
He, H.; Zhang, J.; Cai, Y.; Chen, H.; Hu, X.; Gan, Z.; Wang, Y.; Wang, C.; Wu, Y.; Xie, L. Mobilemamba: Lightweight multi-receptive visual mamba network. In Proceedings of the Computer Vision and Pattern Recognition Conference, Vancouver, BC, Canada, 17–24 June 2025; pp. 4497–4507.
Tang, L.; Zhang, H.; Xu, H.; Ma, J. Rethinking the necessity of image fusion in high-level vision tasks: A practical infrared and visible image fusion network based on progressive semantic injection and scene fidelity. Inf. Fusion 2023, 99, 101870.
Si, Y.; Xu, H.; Zhu, X.; Zhang, W.; Dong, Y.; Chen, Y.; Li, H. SCSA: Exploring the synergistic effects between spatial and channel attention. Neurocomputing 2025, 634, 129866.
Hu, J.; Cao, L.; Jin, X.; Zhang, S.; Ji, R. Universal Image Segmentation with Efficiency. In IEEE Transactions on Pattern Analysis and Machine Intelligence; IEEE: New York, NY, USA, 2025.
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
Komorowski, J.; Kurzejamski, G.; Sarwas, G. FootAndBall: Integrated player and ball detector. In Proceedings of the International Conference on Computer Vision Theory and Applications, Valletta, Malta, 27–29 February 2020; pp. 47–56.
Tanapatpiboon, A.; Kusakunniran, W.; Limroongreungrat, W. Deep learning-based detection of players and teams in soccer videos with positioning heatmap generation. Appl. Comput. Inform. 2025. ahead-of-print.
Maglo, A.; Cioppa, A.; Giancola, S.; Ghanem, B.; Van Droogenbroeck, M. Efficient tracking of team sport players with few game-specific annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, New Orleans, LA, USA, 19–20 June 2022; pp. 4475–4485.
Chen, S.; Sun, P.; Song, Y.; Luo, P. DiffusionDet: Diffusion model for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 19830–19843. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed]
Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3651–3660. [Google Scholar]
Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv 2022, arXiv:2201.12329. [Google Scholar] [CrossRef]
Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627. [Google Scholar]
Lv, W.; Zhao, Y.; Xu, S.; Wei, J.; Wang, G.; Cui, C.; Du, Y.; Dang, Q.; Liu, Y. RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv 2024, arXiv:2407.17140. [Google Scholar]
Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. RSMamba: Remote sensing image classification with state space model. In IEEE Geoscience and Remote Sensing Letters; IEEE: New York, NY, USA, 2024. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000. [Google Scholar] [CrossRef]
Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 658–666. [Google Scholar]
Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Computer Vision—ECCV 2014; Springer: Berlin, Germany, 2014; pp. 740–755. [Google Scholar]
Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
Wang, S. Domain adaptation using transformer models for automated detection of exterior cladding materials in street view images. Sci. Rep. 2025, 16, 2696. [Google Scholar] [CrossRef]
Wang, S. Domain-adaptive faster R-CNN for non-PPE identification on construction sites from body-worn and general images. Sci. Rep. 2026, 16, 4793. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Overall architecture of the proposed SoccerDETR framework. The input image is first processed by the MobileMamba backbone to extract multi-scale features $\{P_{3}, P_{4}, P_{5}\}$. These features are then enhanced through the SDFM module for adaptive multi-scale fusion and SCSA attention for synergistic spatial-channel enhancement. The Efficient Transformer Encoder processes the enhanced features, and Uncertainty-Minimal Query Selection identifies the most informative queries. Finally, the Separable Dynamic Decoder produces the detection results through efficient dynamic convolution attention.

Figure 2. Architecture of the SS2D block used in MobileMamba. The block consists of Layer Normalization (LN), SS2D Block with residual connection, and Feed-Forward Network (FFN). The SS2D Block internally comprises linear projection, depthwise convolution (DWConv) for local feature extraction, SiLU activation for non-linearity, and the core SS2D operation for global dependency modeling. The residual connections facilitate gradient flow during training.

Figure 3. Illustration of the S6 (Selective State Space) processing flow. The input features undergo scan expanding to generate four directional sequences, which are processed by the S6 Block. The S6 Block contains embedding layers, linear projections for computing input-dependent parameters $A$, $D$, $\Delta$, $B$, and $C$. The discretized state equations $\bar{A} = e^{\Delta A}$, $\bar{B} = \Delta B$ and the recurrence $h_{t} = \bar{A} h_{t-1} + \bar{B} x_{t}$, $y_{t} = C h_{t} + D x_{t}$ are computed efficiently using parallel scan. Finally, scan merging reconstructs the 2D feature map from the four directional outputs.
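
The recurrence in the Figure 3 caption can be illustrated with a small sequential reference. The sketch below assumes a diagonal state matrix and a single input channel, and it loops over time steps instead of using the parallel scan mentioned in the caption. The function name `s6_scan` and the list-based data layout are illustrative assumptions, not the authors' implementation.

```python
import math


def s6_scan(x, delta, A, B, C, D):
    """Sequential reference for the discretized selective state space recurrence.

    x, delta : length-T lists of floats (input and per-step step size).
    A        : length-N list, the diagonal of the state matrix (assumed diagonal).
    B, C     : length-T lists of length-N lists (input-dependent projections).
    D        : float skip coefficient.
    Returns the length-T list of outputs y_t.
    """
    n = len(A)
    h = [0.0] * n  # hidden state h_0 = 0
    ys = []
    for t in range(len(x)):
        for i in range(n):
            a_bar = math.exp(delta[t] * A[i])  # A_bar = exp(delta * A)
            b_bar = delta[t] * B[t][i]         # B_bar = delta * B
            h[i] = a_bar * h[i] + b_bar * x[t]  # h_t = A_bar h_{t-1} + B_bar x_t
        # y_t = C h_t + D x_t
        ys.append(sum(C[t][i] * h[i] for i in range(n)) + D * x[t])
    return ys
```

With `A = [0.0]` the state simply accumulates the inputs, which gives an easy sanity check; a practical implementation would replace the Python loops with a parallel scan over the sequence.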

Figure 4. Architecture of the Semantic-aware Dynamic Feature Fusion Module (SDFM). Given input features $F_{ir}$ and $F_{vi}$ from different scales, SDFM first concatenates them along the channel dimension and applies Global Average Pooling (GAP) to obtain a channel descriptor. This descriptor is processed through point-wise convolutions ($P_{w}$-Conv) with Batch Normalization (BN) and Sigmoid activation to generate channel attention weights. The attention weights are split and applied to the respective input features through element-wise multiplication. The weighted features are then fused through element-wise addition, with a learnable weight $w$ controlling the final fusion ratio to produce the output $F_{fu}$.
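
The data flow described in the Figure 4 caption can be sketched in a few lines. This is a simplified illustration under several assumptions that are not stated in the caption: the two inputs are already at the same resolution, the point-wise convolution is a single plain matrix with Batch Normalization omitted, and the learnable weight `w` enters as a convex combination (`w` and `1 - w`). The name `sdfm_fuse` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sdfm_fuse(f_a, f_b, weight, w=0.5):
    """Channel-attention fusion of two feature maps.

    f_a, f_b : lists of C channels, each a flat list of H*W values.
    weight   : (2C x 2C) matrix standing in for the point-wise conv (BN omitted).
    w        : scalar fusion ratio (assumed convex combination).
    """
    c = len(f_a)
    stacked = f_a + f_b                                   # concat along channels
    gap = [sum(ch) / len(ch) for ch in stacked]           # global average pooling
    logits = [sum(row[j] * gap[j] for j in range(2 * c)) for row in weight]
    attn = [sigmoid(z) for z in logits]                   # channel attention weights
    attn_a, attn_b = attn[:c], attn[c:]                   # split per input
    fused = []
    for k in range(c):
        fused.append([
            w * attn_a[k] * va + (1.0 - w) * attn_b[k] * vb  # weight, then add
            for va, vb in zip(f_a[k], f_b[k])
        ])
    return fused
```

With an all-zero `weight` matrix every attention weight is 0.5, so the output reduces to a scaled average of the two inputs, which makes the behaviour easy to check by hand.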

Figure 5. Architecture of the Spatial-Channel Synergistic Attention (SCSA) module. SCSA consists of two parallel branches: Spatial-wise Multi-Head Attention (SA) in the upper path and Channel-wise Self-Attention (CA) in the lower path. The SA branch splits input features along the channel dimension into multiple groups, applies X AvgPool operations along different axes, and processes through DWConv1d with multiple kernel sizes (3, 5, 7, 9) to capture multi-scale spatial patterns. The outputs are concatenated, normalized with GroupNorm, and passed through Sigmoid to produce spatial attention weights. The CA branch generates Query (Q), Key (K), and Value (V) through DWConv2d operations with GroupNorm, computes SE-style Attention via AvgPool and Sigmoid, and produces channel-refined features. The spatial and channel attention outputs are combined through element-wise multiplication (⊙) to produce the final enhanced features, exploiting the synergy between the two attention types.

Figure 6. Architecture of the Separable Dynamic Decoder. The decoder receives Aggregated Features from the encoder and Proposal Kernels from query selection as inputs. In the Pre-Attention stage, proposal kernels interact with aggregated features through 2D Dynamic Convolution to generate Box Features. These features are processed by DyConvAtten blocks that replace traditional cross-attention. The Post-Attention stage employs MHSA and FFN for query refinement. The bottom panels illustrate the key difference: traditional Multi-Head Cross-Attention (left) computes $\mathrm{softmax}\left(\frac{Q W^{q} (V W^{k})^{\top}}{\sqrt{d}}\right) V W^{v}$ with $O(N \cdot M)$ complexity, while Dynamic Convolution Attention (right) computes $(V * r(Q W^{d})) * r(Q W^{p})$ through separable operations with $O(N + M)$ complexity.

Figure 7. Qualitative detection results on the Soccana dataset. SoccerDETR accurately detects players, referees, and balls across diverse scenarios, including crowded scenes (rows 1–2), varying scales (row 3), and partial occlusions (row 4). The detection confidence scores demonstrate high precision for all object categories. Note the accurate ball detection even when the ball is small and partially occluded by players.

Figure 8. Qualitative detection results on the SoccerNet dataset. Our method successfully handles challenging cases including broadcast overlays (rows 1–2), extreme scale variations (row 2), and complex backgrounds (row 3). The ball detection remains robust even when the ball appears very small in wide-angle shots or is partially occluded. The Person category encompasses both players and referees in this dataset.

Figure 9. Ablation study visualization. (a) Progressive mAP improvement as each component is added to the baseline. (b) Component contribution heatmap showing the improvement of each module across different metrics, including mAP@50, mAP@50:95, and per-class AP.

Table 2. Comparison with state-of-the-art methods on the Soccana dataset. The best results are written in bold; the second-best results are underlined. All methods use input size $640 \times 640$.

Method	mAP@50	mAP@50:95	AP_Player	AP_Ball	AP_Ref	FPS
Faster R-CNN [29]	82.4	51.3	89.2	68.1	89.9	24
Cascade R-CNN [52]	84.7	54.6	90.8	71.5	91.8	18
RetinaNet [24]	81.6	49.8	88.4	66.9	89.5	32
FCOS [30]	83.2	51.7	89.6	69.2	90.8	35
YOLOv7 [34]	88.5	58.9	93.2	78.4	93.9	68
YOLOv8-L [8]	89.7	60.4	94.1	80.2	94.8	72
YOLOv9-E [7]	90.8	62.1	94.8	82.5	95.1	58
YOLOv10-L [6]	91.2	62.8	95.1	83.1	95.4	65
DETR [9]	79.8	47.5	86.9	64.2	88.3	28
Deformable DETR [10]	86.4	56.2	91.8	74.6	92.8	42
DINO [11]	89.2	59.8	93.8	79.5	94.3	35
RT-DETR-L [14]	92.3	64.5	95.8	85.2	95.9	74
SoccerDETR (Ours)	94.2	67.8	97.1	88.4	97.1	78

Table 3. Comparison with state-of-the-art methods on the SoccerNet dataset. The best results are written in bold; the second-best results are underlined.

Method	mAP@50	mAP@50:95	AP_Person	AP_Ball	FPS
Faster R-CNN [29]	78.6	46.2	85.3	64.8	24
Cascade R-CNN [52]	80.9	49.5	87.1	67.4	18
RetinaNet [24]	77.2	44.8	84.1	62.5	32
FCOS [30]	79.4	47.3	85.8	65.2	35
YOLOv7 [34]	84.6	53.8	89.4	74.2	68
YOLOv8-L [8]	86.2	55.7	90.8	76.5	72
YOLOv9-E [7]	87.5	57.4	91.6	78.8	58
YOLOv10-L [6]	88.1	58.2	92.1	79.6	65
DETR [9]	75.4	42.1	82.6	60.3	28
Deformable DETR [10]	82.6	51.8	88.2	70.4	42
DINO [11]	85.8	55.2	90.4	75.8	35
RT-DETR-L [14]	89.5	60.8	93.2	81.4	74
SoccerDETR (Ours)	91.8	64.2	95.4	84.7	78

Table 4. Detailed comparison on small, medium, and large objects (Soccana dataset).

Method	AP_S	AP_M	AP_L	mAP@50
Faster R-CNN	42.3	68.5	89.2	82.4
YOLOv8-L	58.6	78.4	94.1	89.7
YOLOv10-L	61.2	80.1	95.1	91.2
RT-DETR-L	65.8	82.6	95.8	92.3
SoccerDETR	72.4	86.3	97.1	94.2

Table 5. Ablation study on the contribution of each component. The baseline is RT-DETR with ResNet-50 backbone. Each row adds one component to the previous configuration.

Configuration	mAP@50	mAP@50:95	AP_Ball	Params (M)	FPS
Baseline (RT-DETR-R50)	88.6	58.2	78.5	42.0	68
+MobileMamba Backbone	90.2 (+1.6)	60.5 (+2.3)	81.2 (+2.7)	38.5	82
+SDFM	91.5 (+1.3)	62.8 (+2.3)	83.8 (+2.6)	39.2	80
+SCSA	92.8 (+1.3)	65.1 (+2.3)	85.9 (+2.1)	40.1	79
+Separable Dynamic Decoder	93.4 (+0.6)	66.4 (+1.3)	87.1 (+1.2)	39.8	78
+Scale-Aware Focal Loss	94.2 (+0.8)	67.8 (+1.4)	88.4 (+1.3)	39.8	78
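
Because each row of Table 5 adds one component to the previous configuration, the bracketed gains are differences between consecutive rows. The short check below recomputes them for the mAP@50 column; the values are copied from Table 5, and the helper names are illustrative.

```python
# (configuration, mAP@50) pairs copied from Table 5
ABLATION_MAP50 = [
    ("Baseline (RT-DETR-R50)", 88.6),
    ("+MobileMamba Backbone", 90.2),
    ("+SDFM", 91.5),
    ("+SCSA", 92.8),
    ("+Separable Dynamic Decoder", 93.4),
    ("+Scale-Aware Focal Loss", 94.2),
]


def incremental_gains(rows):
    """Return (name, gain over the previous row) for every row after the baseline."""
    return [
        (name, round(cur - prev, 1))
        for (_, prev), (name, cur) in zip(rows, rows[1:])
    ]


def total_gain(rows):
    """Gain of the final configuration over the baseline."""
    return round(rows[-1][1] - rows[0][1], 1)
```

The same helpers apply unchanged to the mAP@50:95 and AP_Ball columns if their values are substituted.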

Table 6. Ablation study on SAFL parameters. Default values: $\gamma = 2.0$, $\beta = 0.5$.

γ	β	mAP@50	AP_Player	AP_Ball	AP_Referee
1.0	0.5	93.5	96.8	87.1	96.6
1.5	0.5	93.8	96.9	87.6	96.9
2.0	0.0	93.1	97.2	85.8	96.3
2.0	0.25	93.7	97.0	87.5	96.6
2.0	0.5	94.2	97.1	88.4	97.1
2.0	0.75	93.9	96.5	88.6	96.6
2.0	1.0	93.4	95.8	88.9	95.5
3.0	0.5	93.6	96.4	87.9	96.5
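
To make the role of $\gamma$ in Table 6 concrete, the sketch below implements only the standard focal term from the focal loss of Lin et al. [24]. The scale-aware weighting controlled by $\beta$ is defined in Section 3.8 and is deliberately not reproduced here, so this is not the full SAFL; the function name is an illustrative choice.

```python
import math


def focal_loss(p_t, gamma=2.0):
    """Standard focal term -(1 - p_t)^gamma * log(p_t).

    p_t   : predicted probability of the ground-truth class, in (0, 1].
    gamma : focusing parameter; gamma = 0 recovers plain cross-entropy.
    """
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

Larger values of `gamma` shrink the loss of well-classified examples (high `p_t`) more strongly, which shifts the training signal towards hard examples.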

Table 7. Comparison of different backbone architectures with the same detection head.

Backbone	mAP@50	mAP@50:95	Params (M)	GFLOPs	FPS
ResNet-50 [53]	88.6	58.2	42.0	86.2	68
ResNet-101 [53]	89.4	59.8	61.0	112.4	52
Swin-T [54]	90.1	60.5	45.3	92.4	52
Swin-S [54]	91.2	62.1	68.5	124.8	38
ViT-B/16 [55]	89.4	59.2	86.6	124.5	38
ConvNeXt-T [56]	90.5	61.2	44.8	88.6	58
VMamba-T [18]	91.8	63.4	40.2	78.6	72
MobileMamba-B2	94.2	67.8	39.8	72.4	78

Table 8. Comparison of different decoder architectures.

Decoder Type	mAP@50	mAP@50:95	Params (M)	FPS
Standard Cross-Attention	93.1	65.8	42.3	72
Deformable Attention	93.5	66.2	41.8	74
Separable Dynamic Conv	94.2	67.8	39.8	78

Table 9. Effect of the number of object queries.

Num Queries	mAP@50	mAP@50:95	Params (M)	FPS
100	92.8	65.4	39.2	85
200	93.6	66.8	39.5	82
300	94.2	67.8	39.8	78
500	94.1	67.6	40.4	72

Table 10. Computational complexity comparison of attention mechanisms. N: number of tokens, M: number of feature tokens, K: number of sampling points per query in deformable attention, d: feature dimension.

Mechanism	Time Complexity	Space Complexity	Actual FLOPs (G)
Self-Attention	$O (N^{2} \cdot d)$	$O (N^{2})$	18.4
Cross-Attention	$O (N \cdot M \cdot d)$	$O (N \cdot M)$	12.6
Deformable Attention	$O (N \cdot K \cdot d)$	$O (N \cdot K)$	4.2
SS2D (Ours)	$O (N \cdot d)$	$O (N)$	2.8
DyConvAtten (Ours)	$O ((N + M) \cdot d)$	$O (N + M)$	1.6
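
A rough sense of the gap between the $O(N \cdot M)$ and $O(N + M)$ rows can be obtained by plugging in concrete sizes. The sketch below assumes strides of 8, 16, and 32 for the three feature levels (a common convention that is an assumption here) at the $640 \times 640$ input size, and 300 queries as in Table 9. It counts only the leading terms, without the factor $d$ or any constants, so the numbers are not FLOPs and are not comparable to the last column of Table 10.

```python
def feature_tokens(size=640, strides=(8, 16, 32)):
    """Total number of feature tokens over all levels (assumed strides)."""
    return sum((size // s) ** 2 for s in strides)


def interaction_counts(n_queries, m_tokens):
    """Leading-order term counts for the two decoder attention variants."""
    return {
        "cross_attention": n_queries * m_tokens,  # O(N * M)
        "dyconv_atten": n_queries + m_tokens,     # O(N + M)
    }
```

Under these assumptions `feature_tokens()` gives 8400 tokens, so the pairwise term is 2,520,000 against 8,700 for the additive term.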

Table 11. Comprehensive computational complexity comparison across methods. All measurements use input size $640 \times 640$ on a single NVIDIA RTX 4090 GPU. “Latency” denotes end-to-end inference time per image.

Method	Params (M)	GFLOPs	FPS	Latency (ms)	mAP@50
Faster R-CNN [29]	41.1	134.2	24	41.7	82.4
YOLOv8-L [8]	43.7	165.2	72	13.9	89.7
YOLOv9-E [7]	57.3	189.0	58	17.2	90.8
YOLOv10-L [6]	44.4	160.4	65	15.4	91.2
DETR [9]	41.3	86.0	28	35.7	79.8
Deformable DETR [10]	40.0	78.0	42	23.8	86.4
DINO [11]	47.0	98.5	35	28.6	89.2
RT-DETR-L [14]	42.0	86.2	74	13.5	92.3
SoccerDETR (Ours)	39.8	72.4	78	12.8	94.2
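
The Latency column in Table 11 is consistent with the reciprocal of the FPS column (1000 / FPS, in milliseconds). The helper below reproduces that conversion and applies the 30 FPS real-time threshold stated in the abstract; the function names are illustrative.

```python
REAL_TIME_FPS = 30  # real-time threshold stated in the abstract


def latency_ms(fps):
    """Per-image latency in milliseconds implied by a throughput in FPS."""
    return round(1000.0 / fps, 1)


def is_real_time(fps, threshold=REAL_TIME_FPS):
    """True if the throughput meets the real-time threshold."""
    return fps >= threshold
```

For example, 78 FPS corresponds to 12.8 ms per image, while the 28 FPS reported for DETR falls just below the real-time threshold.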

Table 12. Latency breakdown of SoccerDETR components (ms per image, measured on NVIDIA RTX 4090).

Component	Latency (ms)	Percentage
MobileMamba Backbone	5.2	40.6%
SDFM + SCSA	1.8	14.1%
Transformer Encoder	2.1	16.4%
Query Selection	0.4	3.1%
Separable Dynamic Decoder	2.8	21.9%
Prediction Heads	0.5	3.9%
Total	12.8	100%
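
The Percentage column in Table 12 is each component's latency divided by the total. The sketch below recomputes the shares from the per-component values copied from the table; the dictionary and function names are illustrative.

```python
# Per-component latency in ms, copied from Table 12
LATENCY_MS = {
    "MobileMamba Backbone": 5.2,
    "SDFM + SCSA": 1.8,
    "Transformer Encoder": 2.1,
    "Query Selection": 0.4,
    "Separable Dynamic Decoder": 2.8,
    "Prediction Heads": 0.5,
}


def latency_shares(parts):
    """Return each component's share of the total latency, in percent."""
    total = sum(parts.values())
    return {name: round(100.0 * ms / total, 1) for name, ms in parts.items()}
```

The backbone accounts for the largest share, followed by the decoder, which matches the ordering in the table.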

Table 13. Cross-dataset generalization results (mAP@50). Each column header gives the training dataset followed by the test dataset.

Method	Train: Soccana, Test: Soccana	Train: Soccana, Test: SoccerNet	Train: SoccerNet, Test: Soccana	Train: SoccerNet, Test: SoccerNet
YOLOv8-L	89.7	72.4	76.8	86.2
RT-DETR-L	92.3	76.8	79.5	89.5
SoccerDETR	94.2	81.2	83.6	91.8

Table 14. Statistical significance analysis. The results are reported as the mean ± standard deviation over 5 independent runs. p-values are computed using paired t-tests between SoccerDETR and RT-DETR-L.

Method	mAP@50 (Soccana)	mAP@50:95 (Soccana)	mAP@50 (SoccerNet)	mAP@50:95 (SoccerNet)
RT-DETR-L	92.3 ± 0.28	64.5 ± 0.35	89.5 ± 0.31	60.8 ± 0.38
SoccerDETR	94.2 ± 0.22	67.8 ± 0.30	91.8 ± 0.25	64.2 ± 0.33
p-value	<0.001	<0.001	<0.001	<0.001

Table 15. Five-fold cross-validation results on the Soccana dataset. Each fold uses 80% of the data for training and 20% for testing.

Fold	mAP@50	mAP@50:95	AP_Player	AP_Ball	AP_Referee
Fold 1	94.0	67.5	96.9	88.1	97.0
Fold 2	94.4	68.1	97.2	88.7	97.3
Fold 3	93.8	67.2	96.8	87.8	96.8
Fold 4	94.3	67.9	97.1	88.5	97.3
Fold 5	94.1	67.6	97.0	88.2	97.1
Mean ± Std	94.1 ± 0.23	67.7 ± 0.35	97.0 ± 0.15	88.3 ± 0.34	97.1 ± 0.20
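
The summary row of Table 15 can be recomputed from the per-fold entries. The sketch below does this for the mAP@50 column and assumes the sample standard deviation (n - 1 denominator), since the table does not state which estimator was used. Because the per-fold values in the table are rounded to one decimal, the recomputed standard deviation may differ from the reported one in the last digit.

```python
import statistics

# Per-fold mAP@50 values copied from Table 15
FOLD_MAP50 = [94.0, 94.4, 93.8, 94.3, 94.1]


def mean_and_std(values):
    """Return (mean, sample standard deviation) of the fold results."""
    return statistics.mean(values), statistics.stdev(values)
```

Calling `mean_and_std(FOLD_MAP50)` gives a mean that rounds to 94.1, in agreement with the table.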

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhou, D.; Li, Y. SoccerDETR: Real-Time Soccer Object Detection via Visual State Space Models with Semantic-Aware Feature Fusion. Technologies 2026, 14, 142. https://doi.org/10.3390/technologies14030142

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
