1. Introduction
Soccer, as the world’s most popular sport with an estimated four billion fans globally [
1], generates vast amounts of video content that requires automated analysis for applications ranging from tactical analysis to broadcast enhancement [
2]. The global sports analytics market is projected to be worth
$5.2 billion by 2027 [
3], with soccer video analysis representing a significant portion of this growth. Object detection in soccer videos, particularly the localization of players, referees, and the ball, serves as a fundamental task for downstream applications, including player tracking, action recognition, event spotting, and automated highlight generation [
4]. These applications have profound implications for coaching staff seeking tactical insights, broadcasters aiming to enhance viewer experience, and sports scientists analyzing player performance metrics.
However, soccer video analysis presents unique and formidable challenges that distinguish it from general object detection tasks: (1) significant scale variations between players at different distances from the camera, where a player near the camera may occupy thousands of pixels while distant players span merely dozens; (2) frequent occlusions in crowded scenes, particularly during set pieces, corner kicks, and goal-mouth scrambles where multiple players cluster together; (3) the extremely small size of the ball relative to the image resolution, often appearing as a mere 10–30 pixel blob in broadcast footage; (4) rapid motion blur affecting both players during sprints and the ball during powerful shots; (5) varying lighting conditions across different stadiums and times of day; and (6) the stringent requirement for real-time processing to enable live analysis and instant replay generation [
5]. These challenges collectively demand detection systems that are simultaneously accurate, robust, and computationally efficient.
Recent advances in object detection have been dominated by two paradigms: CNN-based detectors such as the YOLO series [
6,
7,
8] and transformer-based approaches like DETR [
9] and its variants [
10,
11]. While YOLO models excel in speed through their single-stage architecture and optimized implementations, they rely on hand-crafted components like Non-Maximum Suppression (NMS) that introduce latency and can lead to suboptimal performance in crowded scenes where multiple players overlap. The NMS post-processing step, while effective for suppressing duplicate detections, operates as a greedy algorithm that may inadvertently remove valid detections in dense scenarios [
12,
13]. Transformer-based detectors elegantly eliminate NMS through set prediction with Hungarian matching but suffer from quadratic computational complexity $\mathcal{O}(N^2)$ with respect to the sequence length $N$, severely limiting their applicability in real-time scenarios where processing speed is paramount [14].
The recently proposed RT-DETR [
14] addresses this dilemma by designing an efficient hybrid encoder that decouples intra-scale and cross-scale feature interaction, achieving real-time performance while maintaining end-to-end detection capabilities. This architectural innovation, along with related advances in efficient neural network training [
15], demonstrated that transformer-based detectors can compete with and even surpass YOLO models in the speed–accuracy trade-off. However, the transformer backbone still incurs substantial computational costs when processing the high-resolution inputs typical in sports broadcasting, where 1080p or even 4K resolution is standard. The self-attention mechanism, despite its effectiveness in capturing global dependencies, becomes a computational bottleneck as image resolution increases.
Meanwhile, State Space Models (SSMs), particularly Mamba [
16], have emerged as a promising alternative that achieves linear complexity $\mathcal{O}(N)$ while effectively capturing long-range dependencies through selective state space mechanisms [
17,
18,
19]. Unlike transformers that compute pairwise attention between all tokens, SSMs process sequences through a recurrent formulation that maintains a compressed hidden state, enabling efficient processing of long sequences. The selective mechanism in Mamba allows the model to dynamically adjust its information propagation based on input content, achieving content-aware filtering that rivals the expressiveness of attention mechanisms. Recent works have successfully adapted Mamba to visual recognition tasks, demonstrating competitive performance with significantly reduced computational overhead [
20].
In this paper, we propose SoccerDETR, a novel detection framework that synergistically combines the efficiency of state space models with the accuracy of transformer-based detection; it is specifically tailored for the demanding requirements of soccer video analysis. As illustrated in
Figure 1, our framework builds upon the successful RT-DETR architecture while introducing four key innovations that collectively address the computational and accuracy challenges:
(1) MobileMamba Backbone: We adopt MobileMamba [
20] as our feature extraction backbone, which employs the Selective Scan 2D (SS2D) mechanism to process visual features with linear complexity. The SS2D module innovatively scans image patches along four directions (horizontal left-to-right, horizontal right-to-left, vertical top-to-bottom, and vertical bottom-to-top) and merges the results through a learnable aggregation mechanism. This multi-directional scanning strategy enables comprehensive spatial understanding without the quadratic overhead of self-attention, making it particularly suitable for processing high-resolution soccer broadcast footage. The MobileMamba architecture further incorporates Multi-Receptive Field Feature Interaction (MRFFI) modules that capture both local details and global context efficiently.
(2) Semantic-aware Dynamic Feature Fusion Module (SDFM): We incorporate the SDFM [
21] to achieve adaptive multi-scale feature aggregation that is crucial for detecting objects across the extreme scale range present in soccer videos. Unlike conventional feature pyramid networks (FPNs) that use fixed fusion strategies with simple element-wise addition or concatenation, SDFM dynamically adjusts fusion weights based on semantic content through a learned attention mechanism. This enables more effective information flow between different resolution levels, ensuring that small objects like the ball receive adequate feature support from high-resolution feature maps while large objects benefit from the rich semantic information in low-resolution features.
(3) Spatial-Channel Synergistic Attention (SCSA): We integrate SCSA [
22] into our feature enhancement pipeline to boost the discriminative power of extracted features. SCSA consists of two complementary components: Shareable Multi-Semantic Spatial Attention (SMSA) and Progressive Channel-wise Self-Attention (PCSA). SMSA captures multi-scale spatial information through parallel depthwise convolutions with varying kernel sizes, enabling the model to attend to objects of different scales simultaneously. PCSA computes channel-wise dependencies through an efficient self-attention mechanism that progressively refines channel representations. These two components work synergistically, with spatial attention guiding channel recalibration and channel attention informing spatial focus, achieving superior feature enhancement with minimal computational overhead.
(4) Separable Dynamic Decoder: We adopt the Separable Dynamic Decoder [
23] that fundamentally reimagines the query–feature interaction mechanism in transformer decoders. Traditional multi-head cross-attention computes attention weights between all query–feature pairs, resulting in complexity $\mathcal{O}(N \cdot M)$, where $N$ is the number of object queries and $M$ is the number of feature tokens. The Separable Dynamic Decoder replaces this with dynamic convolution attention, which generates query-specific convolution kernels and applies them to feature maps through separable convolutions. This design reduces the decoder complexity below the $\mathcal{O}(N \cdot M)$ cost of cross-attention, enabling efficient processing while maintaining the expressive power needed for accurate detection.
Furthermore, we propose a Scale-Aware Focal Loss (SAFL) that combines the class-balancing properties of focal loss [
24] with scale-adaptive weighting to address the inherent class imbalance and scale variation in soccer detection scenarios. Standard focal loss effectively handles the foreground–background imbalance by down-weighting easy examples, but it treats all positive samples equally regardless of their scale. In soccer detection, this leads to suboptimal performance with regard to small objects like the ball, which are inherently harder to detect but equally important for downstream applications. Our SAFL explicitly up-weights the loss contribution from small objects, ensuring that the model dedicates sufficient learning capacity to these challenging cases.
Our main contributions can be summarized as follows:
We propose SoccerDETR, a detection framework that integrates visual state space models with transformer-based detection for soccer video analysis. To the best of our knowledge, this is among the first works to combine SSM backbones with transformer-based detection heads in the sports domain (see
Table 1 for a systematic comparison). Unlike prior works that simply applied general-purpose detectors to sports scenarios, our architecture is purpose-built to address the unique challenges of soccer detection: extreme scale variation (20–100× between players and balls), dense player clustering, and real-time processing requirements. The adoption of MobileMamba as the backbone achieves linear computational complexity $\mathcal{O}(N)$ while maintaining global receptive fields through multi-directional selective scanning, yielding significant computational savings compared to the quadratic $\mathcal{O}(N^2)$ complexity of transformer-based detectors.
We design a novel feature enhancement pipeline that synergistically combines SDFM and SCSA modules in a principled manner. The key architectural insight is that SDFM first performs content-aware multi-scale fusion through learned channel attention weights, and SCSA subsequently enhances the fused features through joint spatial-channel modulation, where spatial attention guides channel recalibration and vice versa. This two-stage enhancement is specifically motivated by the soccer detection scenario, where objects span extreme scale ranges and appear in cluttered backgrounds.
We introduce a Scale-Aware Focal Loss (SAFL) with a formally defined scale-adaptive weighting function that explicitly addresses the scale imbalance problem inherent in soccer detection. We provide gradient analysis showing how SAFL modulates the gradient magnitude for objects at different scales, ensuring that small objects (e.g., balls) receive proportionally larger gradient signals during training.
We conduct extensive experiments on Soccana and SoccerNet datasets with statistical significance testing (paired t-tests, where each training run with a different random seed constitutes one sample) and 5-fold cross-validation, demonstrating state-of-the-art performance with 94.2% mAP@50 on Soccana and 91.8% mAP@50 on SoccerNet. Comprehensive ablation studies validate the contribution of each proposed component, and cross-dataset experiments demonstrate strong generalization capability.
We provide detailed computational complexity analysis including theoretical complexity bounds, actual FLOPs/Params/FPS measurements, and per-component latency breakdown, offering actionable insights for future research in efficient sports video analysis.
3. Method
3.1. Overall Architecture
The overall architecture of SoccerDETR is illustrated in Figure 1. Given an input image $I$, our framework consists of four main components: (1) a MobileMamba backbone for efficient feature extraction with linear complexity; (2) an Efficient Transformer Encoder with SDFM for adaptive multi-scale feature fusion; (3) SCSA modules for synergistic spatial-channel feature enhancement; and (4) a Separable Dynamic Decoder for efficient query–feature interaction and final detection.
The processing pipeline proceeds as follows. First, the input image is processed by the MobileMamba backbone, which extracts multi-scale features at three strides. These features capture information at different spatial resolutions, with the highest-resolution level preserving fine-grained details suitable for small object detection and the lowest-resolution level encoding high-level semantic information for recognizing object categories.
The multi-scale features are then fed into the SDFM module, which performs adaptive feature fusion based on semantic content. Unlike conventional FPN, which uses fixed fusion weights, SDFM dynamically adjusts the contribution of each scale based on the input content, enabling more effective information aggregation across scales.
Following SDFM, the SCSA modules enhance the fused features through synergistic spatial-channel attention. The spatial attention component identifies informative regions while the channel attention component recalibrates channel responses, with both components working collaboratively to boost feature discriminability.
The enhanced features are processed by the Efficient Transformer Encoder, which performs intra-scale self-attention to capture global dependencies within each scale. The encoder output is then used for Uncertainty-minimal Query Selection, which identifies the most informative feature locations as initial object queries.
Finally, the Separable Dynamic Decoder refines the object queries through dynamic convolution attention, producing the final detection results, including bounding box coordinates and class probabilities. The entire pipeline is trained end-to-end with our proposed Scale-Aware Focal Loss.
3.2. MobileMamba Backbone with SS2D
Traditional transformer backbones suffer from quadratic complexity $\mathcal{O}(N^2)$ due to self-attention operations, where $N$ is the number of tokens (patches). For an input image of size $H \times W$ with patch size $P \times P$, this results in $N = HW / P^2$ tokens, leading to attention matrices with $N^2$ elements. This quadratic scaling becomes prohibitive for the high-resolution inputs common in sports broadcasting.
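To make the quadratic growth concrete, the following sketch counts tokens and attention-matrix entries for a given input size. The helper name `attention_cost` and the 640×640 and 1280×1280 inputs with 16×16 patches are illustrative assumptions, not settings taken from this paper.

```python
def attention_cost(height, width, patch):
    """Return (num_tokens, attention_entries) for one self-attention map.

    Assumes non-overlapping square patches that tile the image exactly.
    """
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens


for side in (640, 1280):  # illustrative resolutions, not the paper's settings
    n, entries = attention_cost(side, side, 16)
    print(f"{side}x{side}: {n} tokens, {entries:,} attention entries")
```

Doubling the side length quadruples the token count and multiplies the number of attention entries by sixteen, which is the scaling behavior the backbone is designed to avoid.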
We address this limitation by adopting MobileMamba [
20] as our backbone, which leverages the Selective State Space Model (S6) to achieve linear complexity $\mathcal{O}(N)$. The key insight is that State Space Models process sequences through a recurrent formulation that maintains a fixed-size hidden state, avoiding the need to compute pairwise interactions between all tokens.
The continuous-time state space model is defined by the following linear ordinary differential equations:
$$h'(t) = \mathbf{A}\,h(t) + \mathbf{B}\,x(t), \qquad y(t) = \mathbf{C}\,h(t) + D\,x(t), \tag{1}$$
where $h(t)$ is the hidden state that summarizes the history of the input sequence, $x(t)$ is the scalar input at time $t$, $y(t)$ is the scalar output, and $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, $D$ are learnable parameters. The state matrix $\mathbf{A}$ governs the dynamics of the hidden state, determining how information is propagated and decayed over time.
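For discrete sequence processing in neural networks, we discretize Equation (1) using the zero-order hold (ZOH) method with step size $\Delta$:
$$\bar{\mathbf{A}} = \exp(\Delta\mathbf{A}), \qquad \bar{\mathbf{B}} = (\Delta\mathbf{A})^{-1}\big(\exp(\Delta\mathbf{A}) - \mathbf{I}\big)\,\Delta\mathbf{B} \approx \Delta\mathbf{B}.$$
The exact ZOH discretization of $\bar{\mathbf{B}}$ involves the matrix inverse $(\Delta\mathbf{A})^{-1}$, which is computationally expensive and numerically unstable for large state dimensions. The approximation $\bar{\mathbf{B}} \approx \Delta\mathbf{B}$ is derived from the first-order Taylor expansion of the matrix exponential: $\exp(\Delta\mathbf{A}) \approx \mathbf{I} + \Delta\mathbf{A}$ when $\|\Delta\mathbf{A}\| \ll 1$. Substituting this into the exact formula yields $\bar{\mathbf{B}} \approx (\Delta\mathbf{A})^{-1}(\Delta\mathbf{A})\,\Delta\mathbf{B} = \Delta\mathbf{B}$. In practice, the step size $\Delta$ is initialized to small values and the state matrix $\mathbf{A}$ is parameterized with bounded eigenvalues, ensuring that $\|\Delta\mathbf{A}\| \ll 1$ holds throughout training. This simplification replaces a matrix inversion with an element-wise scaling, which is critical for maintaining the overall linear complexity of the SSM. The discretized parameters $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$ define the discrete-time dynamics.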
The discretized state space model then becomes a linear recurrence:
$$h_t = \bar{\mathbf{A}}\,h_{t-1} + \bar{\mathbf{B}}\,x_t, \qquad y_t = \mathbf{C}\,h_t + D\,x_t.$$
This recurrence can be computed in $\mathcal{O}(N)$ time for a sequence of length $N$, compared to $\mathcal{O}(N^2)$ for self-attention. Moreover, the recurrence can be parallelized using the associative scan algorithm, enabling efficient GPU implementation.
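The recurrence can be illustrated with a minimal sequential sketch. It assumes a diagonal state matrix (so the matrix exponential reduces to element-wise exponentials) and the simplified input discretization $\bar{\mathbf{B}} \approx \Delta\mathbf{B}$; the function name `ssm_scan` and the argument layout are hypothetical and chosen only for illustration.

```python
import math


def ssm_scan(xs, a_diag, b, c, d, delta):
    """Run h_t = A_bar * h_{t-1} + B_bar * x_t and y_t = C . h_t + D * x_t.

    Assumes a diagonal A given by `a_diag`, so each state evolves independently.
    """
    a_bar = [math.exp(delta * a) for a in a_diag]  # exp(delta * A) for diagonal A
    b_bar = [delta * bi for bi in b]               # first-order approximation delta * B
    h = [0.0] * len(a_diag)                        # fixed-size hidden state
    ys = []
    for x in xs:                                   # one pass: cost linear in len(xs)
        h = [ab * hi + bb * x for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)) + d * x)
    return ys


# An impulse input decays geometrically through the hidden state.
print(ssm_scan([1.0, 0.0, 0.0], a_diag=[-1.0], b=[1.0], c=[1.0], d=0.0, delta=0.1))
```

The loop touches each input once and keeps only the hidden state between steps, which is where the linear cost comes from; the parallel associative scan mentioned above computes the same values in a different order.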
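The key innovation of Mamba lies in making the parameters $\mathbf{B}$, $\mathbf{C}$, and $\Delta$ input-dependent, enabling selective information propagation:
$$\mathbf{B}_t = \mathrm{Linear}_{B}(x_t), \qquad \mathbf{C}_t = \mathrm{Linear}_{C}(x_t), \qquad \Delta_t = \mathrm{softplus}\big(\mathrm{Linear}_{\Delta}(x_t)\big),$$
where $\mathrm{Linear}_{B}$, $\mathrm{Linear}_{C}$, and $\mathrm{Linear}_{\Delta}$ are learned linear projections, and $\mathrm{softplus}$ ensures the positivity of the step size. This selectivity allows the model to dynamically adjust its information propagation based on the input content, enabling content-aware filtering that is crucial for distinguishing relevant information from irrelevant information.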
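A selective variant of the recurrence can be sketched for a single scalar channel as follows. The per-step projections are reduced to scalar weights (`w_b`, `w_c`, `w_delta`, `bias_delta`), which is a simplifying assumption for illustration; the function names are hypothetical and a diagonal state matrix is assumed.

```python
import math


def softplus(v):
    """Smooth positive function used to keep the step size above zero."""
    return math.log1p(math.exp(v))


def selective_scan(xs, a_diag, w_b, w_c, w_delta, bias_delta):
    """Selective SSM for one scalar channel with input-dependent B, C, delta."""
    h = [0.0] * len(a_diag)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x + bias_delta)  # input-dependent step size
        b_t = [w * x for w in w_b]                  # input-dependent B
        c_t = [w * x for w in w_c]                  # input-dependent C
        h = [math.exp(delta * a) * hi + delta * bi * x
             for a, hi, bi in zip(a_diag, h, b_t)]
        ys.append(sum(ci * hi for ci, hi in zip(c_t, h)))
    return ys


print(selective_scan([1.0, 0.0, 2.0], a_diag=[-1.0, -0.5],
                     w_b=[1.0, 0.5], w_c=[1.0, 1.0],
                     w_delta=0.3, bias_delta=0.0))
```

Because `delta`, `b_t`, and `c_t` change at every step, the model can write strongly to the state for some inputs and nearly ignore others, which a fixed-parameter recurrence cannot do.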
For 2D visual features, we employ the Selective Scan 2D (SS2D) mechanism as shown in
Figure 2. The challenge in applying SSMs to images is that images are inherently 2D structures, while SSMs are designed for 1D sequences. SS2D addresses this by scanning image patches along multiple directions and aggregating the results.
Given a feature map $X \in \mathbb{R}^{H \times W \times C}$, SS2D first flattens the spatial dimensions into sequences along four scanning directions:
Direction 1: Left-to-right, top-to-bottom (row-major order).
Direction 2: Right-to-left, bottom-to-top (reverse row-major).
Direction 3: Top-to-bottom, left-to-right (column-major order).
Direction 4: Bottom-to-top, right-to-left (reverse column-major).
Each directional sequence is processed by the selective SSM independently:
$$Y = \mathrm{Merge}\Big(\big\{\mathrm{SSM}\big(\mathrm{Scan}_d(X)\big)\big\}_{d=1}^{4}\Big),$$
where $X$ is the input feature map, $\mathrm{Scan}_d$ denotes scanning along direction $d$, SSM applies the selective state space model, and Merge reshapes the output back to 2D and aggregates the four directional outputs through summation followed by layer normalization.
This multi-directional scanning ensures that each spatial location can aggregate information from all directions, approximating the global receptive field of self-attention while maintaining linear complexity. The four directions provide complementary views of the spatial structure, with horizontal scans capturing row-wise dependencies and vertical scans capturing column-wise dependencies. The complete S6 processing flow is illustrated in
Figure 3.
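The four-direction scan and merge can be sketched on a single-channel grid. In this sketch a causal prefix sum stands in for the selective SSM (an assumption made so that the result is easy to verify by hand), layer normalization is omitted, and the names `scan_orders`, `prefix_sum`, and `ss2d` are hypothetical.

```python
def scan_orders(height, width):
    """Return the four visiting orders used to flatten a 2D grid."""
    row_major = [(i, j) for i in range(height) for j in range(width)]
    col_major = [(i, j) for j in range(width) for i in range(height)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


def prefix_sum(seq):
    """Causal stand-in for the SSM: each output only sees earlier inputs."""
    out, acc = [], 0.0
    for value in seq:
        acc += value
        out.append(acc)
    return out


def ss2d(grid, seq_model=prefix_sum):
    """Scan in four directions, run a 1D model, and sum the results in 2D."""
    height, width = len(grid), len(grid[0])
    merged = [[0.0] * width for _ in range(height)]
    for order in scan_orders(height, width):
        outputs = seq_model([grid[i][j] for i, j in order])
        for (i, j), value in zip(order, outputs):
            merged[i][j] += value  # summation merge (normalization omitted)
    return merged


print(ss2d([[1.0, 2.0], [3.0, 4.0]]))
```

With the prefix-sum stand-in, every output cell depends on every input cell, even though each individual scan is strictly causal. This is the sense in which the merged scans approximate a global receptive field.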
The MobileMamba backbone is organized into four stages with progressively increasing channel dimensions and decreasing spatial resolutions. Each stage consists of multiple SS2D blocks with residual connections. We extract features from the last three stages, each with its own stride and channel dimension. These multi-scale features provide a rich representation suitable for detecting objects across the extreme scale range present in soccer videos.
3.3. Semantic-Aware Dynamic Feature Fusion Module
Multi-scale feature fusion is critical for detecting objects with varying sizes in soccer videos, where players may span hundreds of pixels while the ball occupies merely 10–30 pixels. Traditional feature pyramid networks use fixed fusion strategies (e.g., element-wise addition) that treat all spatial locations and channels equally. However, the optimal fusion strategy may vary depending on the scene content and the objects present.
We incorporate the Semantic-aware Dynamic Feature Fusion Module (SDFM) [
21] to achieve adaptive feature aggregation based on semantic content. SDFM learns to dynamically adjust fusion weights through a channel attention mechanism, enabling content-aware information flow between different resolution levels.
As illustrated in Figure 4, given features from two adjacent scales, $F_h$ (higher resolution, lower semantic level) and $F_l$ (lower resolution, higher semantic level), SDFM first aligns their spatial dimensions through bilinear interpolation:
$$\hat{F}_l = \mathrm{Upsample}(F_l).$$
The aligned features are concatenated along the channel dimension to form a joint representation:
$$F_{cat} = \mathrm{Concat}(F_h, \hat{F}_l).$$
Channel attention weights are computed through a squeeze-and-excitation style mechanism:
$$\alpha = \sigma\Big(\mathrm{Conv}_{1\times1}\big(\mathrm{BN}\big(\mathrm{Conv}_{1\times1}(\mathrm{GAP}(F_{cat}))\big)\big)\Big),$$
where GAP denotes Global Average Pooling that compresses spatial dimensions, $\sigma$ is the Sigmoid function that normalizes weights to $(0, 1)$, and BN represents Batch Normalization for training stability. The two $1 \times 1$ convolutions form a bottleneck that first reduces dimensionality for efficiency and then restores it.
The attention weights are split into two parts corresponding to the two input features:
$$[\alpha_h, \alpha_l] = \mathrm{Split}(\alpha).$$
The fused feature is obtained through adaptive weighting:
$$F_{out} = w \cdot (\alpha_h \odot F_h) + (1 - w) \cdot (\alpha_l \odot \hat{F}_l),$$
where $w$ is a learnable scalar parameter initialized to 0.5, ⊙ denotes channel-wise multiplication (broadcasting the channel weights across spatial dimensions), and the complementary weighting $(1 - w)$ ensures that the total contribution is normalized.
The learnable weight w allows the network to adaptively balance the contribution of high-resolution details versus high-level semantics. During training, the network learns to adjust w based on the task requirements: for small object detection (e.g., balls), w may increase to emphasize high-resolution features, while for large object detection (e.g., players), w may decrease to leverage semantic features.
We apply SDFM in a top-down manner, progressively fusing features from the lowest-resolution level into the intermediate level, and then from the fused intermediate level into the highest-resolution level. This hierarchical fusion ensures that high-level semantic information is propagated to all scales while preserving the spatial details in high-resolution features.
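The adaptive weighting in one fusion step can be sketched as follows. Features are represented as lists of channels, each channel a flat list of spatial values that are assumed to be already aligned (interpolation is omitted). The `gate` argument is a hypothetical stand-in for the convolutional bottleneck, defaulting to a plain sigmoid, and the function names are illustrative.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def global_avg_pool(feat):
    """Average each channel over its spatial positions."""
    return [sum(channel) / len(channel) for channel in feat]


def sdfm_fuse(f_high, f_low_up, w=0.5,
              gate=lambda pooled: [sigmoid(p) for p in pooled]):
    """Fuse two aligned feature maps with channel weights and a scalar w.

    `gate` maps pooled channel statistics to attention weights; here it is a
    placeholder for the learned bottleneck, not the module's actual layers.
    """
    alpha = gate(global_avg_pool(f_high + f_low_up))  # concat along channels
    num_high = len(f_high)
    a_high, a_low = alpha[:num_high], alpha[num_high:]  # split the weights
    return [
        [w * ah * xh + (1.0 - w) * al * xl for xh, xl in zip(ch_h, ch_l)]
        for ah, al, ch_h, ch_l in zip(a_high, a_low, f_high, f_low_up)
    ]


print(sdfm_fuse([[2.0, -2.0]], [[4.0, -4.0]]))
```

Raising `w` toward 1 shifts the output toward the high-resolution input, which matches the role described for the learnable scalar.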
3.4. Spatial-Channel Synergistic Attention
To further enhance feature representations, we integrate the Spatial-Channel Synergistic Attention (SCSA) [
22] module, which explores the synergistic effects between spatial and channel attention mechanisms. Unlike previous approaches that apply spatial and channel attention sequentially or in parallel without interaction, SCSA enables the two attention types to inform and enhance each other.
As shown in
Figure 5, SCSA comprises two components: Shareable Multi-Semantic Spatial Attention (SMSA) and Progressive Channel-wise Self-Attention (PCSA).
Shareable Multi-Semantic Spatial Attention (SMSA) captures multi-scale spatial information through parallel depthwise convolutions with varying kernel sizes. The key insight is that objects of different scales require different receptive fields for effective attention computation. SMSA addresses this by using multiple kernel sizes simultaneously:
Given input features $X$, SMSA first splits the channels into groups and applies average pooling along different spatial axes to reduce the computational cost. Multi-scale spatial features are then extracted through parallel depthwise 1D convolutions, and the axis-wise responses are combined through broadcasting and element-wise addition (⊕) to restore the 2D spatial structure. Finally, the multi-scale features are concatenated and normalized to form the spatial attention map.
The use of multiple kernel sizes (3, 5, 7, 9) enables SMSA to capture spatial patterns at different scales: small kernels focus on fine-grained details suitable for small objects, while large kernels capture broader context for large objects.
Progressive Channel-wise Self-Attention (PCSA) computes channel-wise dependencies through an efficient self-attention mechanism. Unlike standard self-attention, which operates on spatial tokens, PCSA treats channels as tokens and computes attention across the channel dimension. The channel attention is computed through element-wise operations, which is more efficient than standard self-attention because it avoids computing the full attention matrix. The element-wise product captures channel-wise correlations, and the subsequent average pooling and sigmoid produce channel attention weights.
The final output combines spatial and channel attention synergistically:
$$Y = A_s \odot A_c \odot X.$$
The element-wise multiplication of spatial attention $A_s$, channel attention $A_c$, and input features $X$ enables the two attention types to jointly modulate the features. Spatial attention identifies where to focus, channel attention determines which features are important, and their combination provides a comprehensive attention mechanism that enhances feature discriminability for detection.
We apply SCSA after each SDFM fusion stage, enhancing the fused multi-scale features before they are processed by the transformer encoder. This placement ensures that the encoder receives well-refined features with enhanced spatial and channel discriminability.
3.5. Efficient Transformer Encoder
Following the feature enhancement by SDFM and SCSA, we employ an Efficient Transformer Encoder adapted from RT-DETR [
14] to capture global dependencies within each scale. The encoder is designed to balance computational efficiency with representational power.
The encoder consists of $L$ Transformer layers, each containing a multi-head self-attention (MHSA) block and a feed-forward network (FFN). Each block is combined with Layer Normalization (LN) and a residual connection, where the residual connections facilitate gradient flow.
To reduce computational cost, we apply the encoder only to the highest-resolution feature map and use lightweight cross-scale fusion for other scales. This design is motivated by the observation that high-resolution features benefit most from global context modeling, while lower-resolution features already have large receptive fields from the backbone.
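The multi-head self-attention is computed as:
$$\mathrm{MHSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i,$$
where $Q_i$, $K_i$, $V_i$ are the query, key, and value projections for head $i$, $W^{O}$ is the output projection, and $d_k$ is the dimension per head.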
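A single attention head can be sketched with plain lists. The sketch takes already projected queries, keys, and values as inputs; the learned projections and the concatenation across heads are omitted, and the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v):
    """Scaled dot-product attention for one head.

    q: list of query vectors; k, v: lists of key and value vectors.
    """
    d_k = len(k[0])
    outputs = []
    for query in q:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d_k)
                  for key in k]
        weights = softmax(scores)  # one weight per key
        outputs.append([sum(w * value[c] for w, value in zip(weights, v))
                        for c in range(len(v[0]))])
    return outputs


# Two identical keys receive equal weight, so the output averages the values.
print(attention_head([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]],
                     [[0.0, 0.0], [2.0, 4.0]]))
```

The nested loop over queries and keys is the source of the quadratic cost discussed earlier: every query is scored against every key.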
3.6. Uncertainty-Minimal Query Selection
Following the encoder, we employ Uncertainty-minimal Query Selection [
14] to identify the most informative feature locations as initial object queries. This selection mechanism replaces the learned object queries in original DETR with content-aware queries derived from the encoder output.
For each spatial location in the encoder output, we compute a $K$-dimensional classification score and a four-dimensional localization output, where $K$ is the number of classes and the four-dimensional output represents bounding box coordinates. The selection score combines classification confidence and localization certainty, and the top-$N$ locations with the highest scores are selected as initial queries.
This selection mechanism ensures that the decoder focuses on the most promising locations, improving both accuracy and efficiency compared to using fixed learned queries.
3.7. Separable Dynamic Decoder
Traditional transformer decoders employ multi-head cross-attention to enable object queries to attend to encoder features. Given $N$ object queries and $M$ feature tokens, the cross-attention complexity is $\mathcal{O}(N \cdot M)$, which becomes a bottleneck when processing high-resolution features.
We adopt the Separable Dynamic Decoder [
23] that replaces cross-attention with dynamic convolution attention, reducing this complexity. The key insight is that cross-attention can be approximated by query-specific convolutions applied to the feature map.
As illustrated in
Figure 6, the decoder operates in three stages: Pre-Attention, Dynamic Convolution Attention, and Post-Attention.
Pre-Attention Stage: Proposal kernels derived from the selected queries interact with aggregated features through 2D dynamic convolution with query-specific kernels. Each query generates its own convolution kernel, enabling content-specific feature extraction.
Dynamic Convolution Attention (DyConvAtten): This module replaces the standard cross-attention mechanism. Given the queries and the feature values, each query is linearly projected, and a reshape operation converts the projection outputs into convolution kernel format. One projection generates depth-wise convolution kernels with $g$ groups, and another generates point-wise convolution kernels.
This separable formulation decomposes the attention into two sequential convolutions. The depth-wise convolution captures spatial patterns with query-specific kernels, while the point-wise convolution mixes channel information. This separable design reduces the parameter count and computational cost while maintaining expressive power.
Post-Attention Stage: Standard Multi-Head Self-Attention (MHSA) and a Feed-Forward Network (FFN) refine the queries.
The self-attention among queries enables them to exchange information and avoid duplicate detections, while the FFN provides non-linear transformation for feature refinement.
The decoder is applied iteratively over multiple layers, with each layer refining the queries based on the encoder features. The final queries are used to predict bounding boxes and class probabilities through linear projection heads.
3.8. Scale-Aware Focal Loss
Soccer detection faces two types of imbalance: class imbalance (background vs. foreground) and scale imbalance (large players vs. small balls). Standard focal loss addresses class imbalance but treats all positive samples equally, regardless of their scale. We propose Scale-Aware Focal Loss (SAFL) that addresses both challenges simultaneously.
The standard Focal Loss [24] is defined as:
$$\mathrm{FL}(p_t) = -\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t),$$
where $p_t$ is the predicted probability for the ground-truth class, $\alpha_t$ is the class balancing factor (which is typically set higher for rare classes), and $\gamma$ is the focusing parameter that down-weights easy examples. The term $(1 - p_t)^{\gamma}$ reduces the loss contribution from well-classified examples, focusing training on hard examples.
We extend focal loss with scale-aware weighting that up-weights small objects:
$$\omega(s) = \left(\frac{s_{max}}{s + \epsilon}\right)^{\beta}, \qquad s = \sqrt{w \cdot h},$$
where $s$ is the object scale computed as the geometric mean of bounding box width $w$ and height $h$, $s_{max}$ is the maximum scale in the dataset (computed from training set statistics), $\epsilon$ is a small constant (set to 1.0) for numerical stability, and $\beta$ controls the scale sensitivity.
The scale weight is inversely related to object scale: small objects receive higher weights, encouraging the model to focus on these challenging cases. The exponent $\beta$ controls the strength of this reweighting: $\beta = 0$ recovers standard focal loss, while larger values more aggressively up-weight small objects.
The Scale-Aware Focal Loss is formulated as:
$$\mathcal{L}_{SAFL} = -\omega(s)\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t).$$
Formal Definition of the SAFL Weighting Function. The scale-aware weighting function $\omega(s)$ is a monotonically decreasing function of object scale $s$, defined as:
$$\omega(s) = \left(\frac{s_{max}}{s + \epsilon}\right)^{\beta}, \qquad s > 0,$$
where the domain is restricted to positive object scales. The function satisfies the following properties: (i) $\omega(s)$ is continuous and differentiable for all $s > 0$; (ii) $\lim_{s \to 0^{+}} \omega(s) = (s_{max}/\epsilon)^{\beta}$, providing a bounded maximum weight for the smallest objects; (iii) $\omega(s) \approx 1$ for $s \to s_{max}$, ensuring that the largest objects receive approximately unit weight; (iv) $\mathcal{L}_{SAFL}$ recovers standard focal loss as $\beta \to 0$.
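Gradient Behavior Analysis. To understand how SAFL modulates the training dynamics, we analyze the gradient of $\mathcal{L}_{SAFL}$ with respect to the model logit $z$ (where $p = \sigma(z)$ and $\sigma$ is the sigmoid function):
$$\frac{\partial \mathcal{L}_{SAFL}}{\partial z} = \omega(s)\,\alpha_t\,(1 - p_t)^{\gamma}\,\big[\gamma\,p_t \log p_t - (1 - p_t)\big]\,(2y - 1),$$
where $y \in \{0, 1\}$ is the ground-truth label and the factor $(2y - 1)$ accounts for the direction of the gradient. The key observation is that the gradient magnitude is scaled by $\omega(s)$, which decreases with object scale $s$. For a small ball with scale $s_{ball}$ and a large player with scale $s_{player}$ (with $s_{max}$, $\epsilon$, and $\beta$ fixed and equal $p_t$), the gradient ratio is:
$$\frac{\omega(s_{ball})}{\omega(s_{player})} = \left(\frac{s_{player} + \epsilon}{s_{ball} + \epsilon}\right)^{\beta}.$$
For the object scales and hyperparameters considered here, this means that the ball receives approximately 3× larger gradient signals than the player, effectively compensating for the inherent difficulty of detecting small objects. The exponent $\beta$ controls this amplification: larger values produce more aggressive gradient amplification for small objects, while $\beta = 0$ provides uniform gradients regardless of scale.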
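The scale-aware weighting can be sketched as follows, assuming the weight takes the form $(s_{max}/(s+\epsilon))^{\beta}$, which is consistent with the properties listed above. The function names, the focal loss defaults (`alpha_t=0.25`, `gamma=2.0`), and the example box sizes, `s_max`, and `beta` are illustrative assumptions rather than the paper's settings.

```python
import math


def scale_weight(width, height, s_max, beta, eps=1.0):
    """Weight that grows as the box scale sqrt(width * height) shrinks."""
    scale = math.sqrt(width * height)
    return (s_max / (scale + eps)) ** beta


def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Standard focal loss for the probability of the ground-truth class."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def safl(p_t, width, height, s_max, beta, alpha_t=0.25, gamma=2.0):
    """Focal loss multiplied by the scale weight (assumed functional form)."""
    return scale_weight(width, height, s_max, beta) * focal_loss(p_t, alpha_t, gamma)


# Hypothetical sizes: a 20x20 ball versus a 200x200 player.
ball = scale_weight(20, 20, s_max=400, beta=0.5)
player = scale_weight(200, 200, s_max=400, beta=0.5)
print(f"ball/player weight ratio: {ball / player:.2f}")
```

Because the weight multiplies the loss, it multiplies the gradient by the same factor, so the printed ratio is also the ratio of gradient magnitudes for two samples with equal predicted probability.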
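For bounding box regression, we employ the Complete IoU (CIoU) loss [48] that considers overlap area, center distance, and aspect ratio:
$$\mathcal{L}_{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v,$$
where $\mathrm{IoU}$ is the Intersection over Union between predicted and ground-truth boxes, $\rho(b, b^{gt})$ is the Euclidean distance between box centers, $c$ is the diagonal length of the smallest enclosing box covering both boxes, and $\alpha$, $v$ are aspect ratio consistency terms:
$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{v}{(1 - \mathrm{IoU}) + v}.$$
The aspect ratio term $v$ penalizes predictions with incorrect aspect ratios, which is important for accurately localizing elongated objects like standing players.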
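The CIoU terms can be computed directly for a pair of axis-aligned boxes. This is a minimal sketch of the standard formulation; the `(x1, y1, x2, y2)` box layout, the function name, and the small `tiny` constant added to avoid division by zero for identical boxes are assumptions of the sketch.

```python
import math


def ciou_loss(box, gt, tiny=1e-9):
    """Complete IoU loss for two boxes given as (x1, y1, x2, y2)."""
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]

    # Overlap area and IoU
    inter_w = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    inter_h = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = inter_w * inter_h
    iou = inter / (w * h + wg * hg - inter)

    # Squared distance between box centers
    rho2 = (((box[0] + box[2]) - (gt[0] + gt[2])) ** 2
            + ((box[1] + box[3]) - (gt[1] + gt[3])) ** 2) / 4.0

    # Squared diagonal of the smallest enclosing box
    c2 = ((max(box[2], gt[2]) - min(box[0], gt[0])) ** 2
          + (max(box[3], gt[3]) - min(box[1], gt[1])) ** 2)

    # Aspect ratio consistency term and its trade-off weight
    v = 4.0 / math.pi ** 2 * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1.0 - iou) + v + tiny)
    return 1.0 - iou + rho2 / c2 + alpha * v


print(ciou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0)))
```

Unlike plain IoU loss, the center-distance term still changes as disjoint boxes move closer together, which gives the regressor a useful signal before any overlap exists.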
The total loss function is a weighted combination of the classification and regression losses, with four balancing coefficients determined through validation. In addition to the SAFL classification loss and the CIoU regression loss, it includes the Generalized IoU loss [49], which provides additional supervision for non-overlapping boxes, and auxiliary losses from intermediate decoder layers, which provide deep supervision.
3.9. Training Algorithm
The complete training procedure of SoccerDETR is summarized in Algorithm 1. The algorithm follows the standard end-to-end detection training paradigm with Hungarian matching for label assignment.
The Hungarian matching algorithm finds the optimal bipartite assignment between predictions and ground-truth objects by minimizing a cost matrix that combines classification and localization costs. Each entry combines three terms: the classification cost (negative log probability), the L1 distance between predicted and ground-truth box coordinates, and the negative GIoU.
Algorithm 1 SoccerDETR training procedure
Require: Training dataset, learning rate, epochs E, batch size B
Ensure: Trained model parameters
1: Initialize backbone with ImageNet weights; initialize other modules with Xavier
2: Set loss weights and SAFL parameters; initialize AdamW optimizer
3: for epoch = 1 to E do
4:   for each mini-batch do
5:     {Mosaic, flip, color jitter}
6:     {Backbone}
7:     {Multi-scale fusion}
8:     {Feature enhancement}
9:
10:
11:     {Iterative refinement}
12:
13:
14:
15:   end for
16:   Adjust learning rate; validate and save checkpoint periodically
17: end for
18: return trained model parameters
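The label assignment step can be illustrated with a small sketch. The matching cost described above combines a classification cost (negative log probability), an L1 box distance, and the negative GIoU. In the code below, the weights `w_cls`, `w_l1`, and `w_giou` are hypothetical placeholders, and an exhaustive search over permutations stands in for the Hungarian algorithm: it returns the same optimal assignment but is only practical for a handful of boxes.

```python
import math
from itertools import permutations


def pair_cost(prob, l1_dist, giou, w_cls=1.0, w_l1=1.0, w_giou=1.0):
    """Cost of matching one prediction to one ground-truth object.

    prob is the predicted probability of the ground-truth class (0 < prob <= 1).
    The three weights are illustrative placeholders, not tuned values.
    """
    return w_cls * (-math.log(prob)) + w_l1 * l1_dist + w_giou * (-giou)


def optimal_assignment(cost):
    """Exhaustively find the minimum-cost one-to-one matching.

    cost[i][j] is the cost of assigning prediction i to ground truth j.
    Assumes at least as many predictions as ground-truth objects.
    Returns (prediction index matched to each ground truth, total cost).
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[i][j] for j, i in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    return list(best), best_total


# Three predictions, two ground-truth objects (toy cost values).
cost = [[0.9, 0.1],
        [0.2, 0.8],
        [0.5, 0.5]]
matched, total = optimal_assignment(cost)
print(matched, total)
```

Predictions left unmatched (here, the third one) are treated as background during loss computation in DETR-style training.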
4. Experiments
4.1. Datasets
We evaluate SoccerDETR on two public soccer detection datasets that represent different aspects of the soccer detection challenge:
Soccana Player Ball Detection Dataset: This dataset contains 11,673 images collected from professional soccer matches across multiple leagues and seasons. The dataset includes three object categories: Player (0), Ball (1), and Referee (2). The images exhibit diverse scenarios including different camera angles (wide shots, medium shots, close-ups), varying lighting conditions (day games, night games under floodlights), different weather conditions, and varying player densities, from sparse counterattacks to crowded set pieces.
The dataset statistics reveal the scale imbalance challenge: players have an average bounding box area of (ranging from 500 to 50,000), while balls average only (ranging from 100 to 2000). Referees fall in between with an average of . This 19:1 scale ratio between players and balls motivates our scale-aware loss design.
We follow the standard 80%/10%/10% split for training (9338 images), validation (1167 images), and testing (1168 images). The splits are stratified to ensure similar class distributions across sets.
SoccerNet Dataset [2]: Originally designed for action spotting in soccer videos, the detection subset contains 19,786 annotated frames extracted from broadcast footage of professional matches. The dataset includes two categories: Ball (0) and Person (1), where Person encompasses both players and referees.
SoccerNet presents additional challenges compared to Soccana: (1) broadcast overlays including scoreboards, team logos, and replay indicators that may occlude objects; (2) varying video quality from different broadcasters; (3) more extreme scale variations due to the inclusion of wide-angle shots covering the entire field; and (4) motion blur from fast camera pans during exciting moments.
The dataset is split into training (15,829 frames), validation (1978 frames), and testing (1979 frames) following the official benchmark protocol. This larger dataset enables evaluation of model scalability and generalization.
4.2. Implementation Details
Our implementation is based on PyTorch 2.0 and the MMDetection framework. Training is conducted on 2 NVIDIA RTX 4090 GPUs with 24 GB memory each, using DistributedDataParallel for multi-GPU training.
AI-Assisted Writing Tools: In the preparation of this manuscript, we utilized several AI-assisted tools for language enhancement: ChatGPT (OpenAI, San Francisco, CA, USA, GPT-3.5/GPT-4) for improving sentence structure and clarity, DeepL (DeepL SE, Cologne, Germany) for translation refinement, and Grammarly Premium (Grammarly Inc., San Francisco, CA, USA) for grammar and style checking. These tools were used solely for language polishing and expression improvement. All AI-generated content was thoroughly reviewed, edited, and validated by the authors to ensure scientific accuracy and appropriateness.
Backbone Configuration: The MobileMamba backbone is initialized with ImageNet-1K pretrained weights. We use the MobileMamba-B2 variant with 4 stages, outputting features with channel dimensions [64, 128, 320, 512] at strides [4, 8, 16, 32]. We extract features from the last three stages, with strides [8, 16, 32].
Encoder Configuration: The Efficient Transformer Encoder consists of 1 layer with 8 attention heads and hidden dimension 256. We use pre-normalization and GELU activation in the FFN.
Decoder Configuration: The Separable Dynamic Decoder has 6 layers with 8 attention heads. The dynamic convolution kernel size is set to with 8 groups for depth-wise convolution. The number of object queries is set to 300.
Training Configuration: We use AdamW optimizer [
50] with initial learning rate
, weight decay
, and
. The learning rate follows a cosine annealing schedule with 500 warmup iterations where the learning rate linearly increases from
to
. Training is conducted for 300 epochs with a total batch size of 16 (8 per GPU).
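The schedule described above (linear warmup over 500 iterations followed by cosine annealing) can be sketched as a plain function of the iteration index. The base and warmup-start learning rates used in the example call are placeholder values chosen only for illustration, and `min_lr` is an assumed floor of zero.

```python
import math


def lr_at(step, total_steps, base_lr, warmup_start_lr,
          warmup_steps=500, min_lr=0.0):
    """Learning rate at a given iteration: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp from warmup_start_lr up to base_lr.
        frac = step / warmup_steps
        return warmup_start_lr + frac * (base_lr - warmup_start_lr)
    # Cosine annealing from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


# 300 epochs over 9338 training images with batch size 16.
total_steps = 300 * (9338 // 16)
for step in (0, 250, 500, total_steps // 2, total_steps):
    print(step, lr_at(step, total_steps, base_lr=1e-4, warmup_start_lr=1e-6))
```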
Data Augmentation: Input images are resized to 640 × 640 with the following augmentations:
Mosaic augmentation (probability 0.5): combines 4 images into one, increasing object diversity.
Random horizontal flip (probability 0.5).
Color jittering: brightness ±0.4, contrast ±0.4, saturation ±0.4, hue ±0.1.
Random scale: scale factor uniformly sampled from [0.5, 1.5].
Random crop: after scaling, crop to 640 × 640.
Loss Configuration: For SAFL, we set , , and is computed based on inverse class frequency: , , for Soccana; , for SoccerNet. The higher weight for ball reflects its lower frequency and higher detection difficulty.
Inference Configuration: During inference, we use a single scale of 640 × 640 without test-time augmentation. The confidence threshold is set to 0.3 for filtering low-confidence predictions. No NMS is required due to the end-to-end design. All speed measurements are conducted on a single NVIDIA RTX 4090 GPU (24 GB VRAM) (NVIDIA, Santa Clara, CA, USA) with an Intel Core i9-13900K CPU (Intel, Santa Clara, CA, USA), running CUDA 12.1 and PyTorch 2.0. We use FP32 precision for all experiments to ensure fair comparison across methods; FP16 mixed precision can further improve speed by approximately 20% but is not used in our reported results. We do not use TensorRT or ONNX optimization in the reported results to maintain consistency with baseline implementations. FPS is measured with a batch size of 1, averaged over 1000 iterations after 100 warmup iterations to ensure stable GPU frequency and eliminate cold-start effects. Latency is measured as end-to-end wall-clock time, including image preprocessing (resizing, normalization) and output postprocessing (confidence thresholding).
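Because the detector is end-to-end, postprocessing reduces to a single confidence filter. The sketch below illustrates this step under an assumed prediction format (a list of dictionaries with `label`, `score`, and `box` keys); the format and the sample values are hypothetical.

```python
def filter_predictions(predictions, threshold=0.3):
    """Keep predictions whose confidence is at least `threshold`.

    No NMS step follows: each query yields at most one detection.
    """
    return [p for p in predictions if p["score"] >= threshold]


# Toy predictions (made-up values for illustration).
preds = [
    {"label": "Player", "score": 0.92, "box": (100, 200, 140, 300)},
    {"label": "Ball", "score": 0.41, "box": (320, 310, 332, 322)},
    {"label": "Referee", "score": 0.12, "box": (500, 180, 535, 270)},
]
kept = filter_predictions(preds)
print([p["label"] for p in kept])
```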
Baseline Training Configurations: To ensure fair comparison, all methods are trained with their officially recommended configurations on identical hardware using the same data splits and input resolution (). For CNN-based two-stage detectors (Faster R-CNN, Cascade R-CNN) and one-stage detectors (RetinaNet, FCOS), we train for 36 epochs using SGD optimizer with learning rates of 0.02 and 0.01 respectively, with flip and scale augmentation. For YOLO-series methods (YOLOv7, YOLOv8-L, YOLOv9-E, YOLOv10-L), we train for 300 epochs using SGD optimizer with learning rate 0.01, employing mosaic, flip, and HSV augmentation. For Transformer-based methods, DETR is trained for 500 epochs, Deformable DETR for 50 epochs, and DINO for 36 epochs, all using AdamW optimizer with learning rates of around to . RT-DETR-L and our SoccerDETR are both trained for 300 epochs using AdamW optimizer with learning rate and mosaic, flip, and color augmentation.
4.3. Evaluation Metrics
We adopt standard COCO-style metrics [
51] for comprehensive evaluation:
mAP@50: Mean Average Precision at IoU threshold 0.5, the primary metric for detection accuracy
mAP@50:95: Mean AP averaged over IoU thresholds from 0.5 to 0.95 with step 0.05, measuring localization precision
mAP@75: Mean AP at IoU threshold 0.75, emphasizing precise localization
$AP_S$, $AP_M$, $AP_L$: AP for small (area $< 32^2$ pixels), medium ($32^2 <$ area $< 96^2$ pixels), and large (area $> 96^2$ pixels) objects
Per-class AP: AP for each object category (Player, Ball, Referee)
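The scale-specific AP metrics depend on assigning each object to a size bucket by its pixel area. A minimal sketch of the COCO convention (thresholds at 32² and 96² pixels) is given below; the function name and the handling of boxes that fall exactly on a threshold are illustrative choices.

```python
def size_bucket(width, height):
    """Classify a bounding box as small, medium, or large by pixel area."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"


# A distant ball, a mid-field player, and a close-up player (toy sizes).
for w, h in [(15, 15), (50, 50), (100, 120)]:
    print((w, h), size_bucket(w, h))
```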
We also report efficiency metrics:
FPS: Frames Per Second measured on a single GPU with batch size 1, averaged over 1000 iterations after 100 warmup iterations.
Params: Number of model parameters in millions.
GFLOPs: Giga Floating Point Operations for a single input.
Latency: End-to-end inference latency in milliseconds.
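The FPS and latency protocol (warmup iterations followed by timed iterations at batch size 1) can be sketched with a generic timing helper. The helper name `benchmark` and the stand-in workload are assumptions; a real measurement would call the model on a preprocessed frame instead.

```python
import time


def benchmark(fn, warmup=100, iters=1000):
    """Return (fps, latency_ms) for repeated calls to fn.

    Warmup calls are excluded from timing to avoid cold-start effects.
    For GPU models, the device must be synchronized before reading the
    clock (for example with torch.cuda.synchronize() in PyTorch).
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return iters / elapsed, 1000.0 * elapsed / iters


# Stand-in workload; replace with a single-image forward pass.
fps, latency_ms = benchmark(lambda: sum(range(1000)))
print(f"{fps:.1f} FPS, {latency_ms:.4f} ms per call")
```

With batch size 1, the two numbers are reciprocal: 78 FPS corresponds to roughly 12.8 ms per frame.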
4.4. Comparison with State-of-the-Art Methods
We compare SoccerDETR with 12 representative detection methods spanning four categories: two-stage detectors, one-stage detectors, YOLO-series detectors, and Transformer-based detectors. All methods are trained on the same datasets with their recommended configurations and evaluated under identical conditions.
As shown in Table 2, SoccerDETR achieves the best performance on the Soccana dataset with 94.2% mAP@50 and 67.8% mAP@50:95, outperforming the previous best method RT-DETR-L by 1.9% and 3.3%, respectively. The improvements are consistent across all object categories, with particularly notable gains on ball detection (88.4% vs. 85.2%, +3.2%). This demonstrates the effectiveness of our scale-aware design for small object detection.
The inference speed of 78 FPS surpasses RT-DETR-L (74 FPS) by 5.4%, validating the efficiency of our MobileMamba backbone and Separable Dynamic Decoder. Compared to SSD (45 FPS), SoccerDETR achieves 17.9% higher mAP@50 while being 73% faster, demonstrating an excellent accuracy-efficiency trade-off.
Compared to YOLO-series detectors, SoccerDETR shows consistent improvements across all metrics while maintaining competitive speed. The 3.0% improvement over YOLOv10-L in mAP@50 highlights the advantage of our end-to-end detection paradigm that eliminates NMS-induced errors. The improvement is more pronounced on ball detection (+5.3%), where NMS often fails due to the small size and potential overlap with player bounding boxes. To further illustrate the advantage of NMS-free detection in crowded scenes, we evaluated detection performance specifically on penalty area scenarios where player density is highest. On these challenging subsets, SoccerDETR achieves 91.8% mAP@50 compared to 87.2% for YOLOv10-L, a 4.6% improvement that is larger than the 3.0% gap on the full dataset. This confirms that end-to-end detection provides particular benefits in dense scenarios where NMS-based methods struggle with overlapping detections.
Two-stage detectors (Faster R-CNN, Cascade R-CNN) achieve reasonable accuracy but suffer from slow inference speed due to the region proposal stage. One-stage detectors (RetinaNet, FCOS) offer better speed but lag behind in accuracy, particularly for small objects.
Table 3 presents results on the more challenging SoccerNet dataset. SoccerDETR achieves 91.8% mAP@50, surpassing RT-DETR-L by 2.3%. The improvement is particularly pronounced for ball detection (84.7% vs. 81.4%, +3.3%), where our Scale-Aware Focal Loss effectively addresses the extreme scale imbalance. The consistent improvements across both datasets demonstrate the strong generalization capability of our approach.
The performance gap between Soccana and SoccerNet (94.2% vs. 91.8% mAP@50) reflects the additional challenges in SoccerNet, including broadcast overlays and more extreme scale variations. Nevertheless, SoccerDETR maintains its advantage over competing methods on both datasets.
Figure 10 provides a visual comparison of mAP@50 across methods on both datasets.
Table 4 provides a detailed breakdown by object scale. SoccerDETR achieves the largest improvement on small objects ($AP_S$: 72.4% vs. 65.8% for RT-DETR-L, +6.6%), validating the effectiveness of our scale-aware design. The improvements on medium and large objects are also significant (+3.7% and +1.3%, respectively), demonstrating that our approach benefits detection across all scales.
Figure 7 and Figure 8 present qualitative detection results on both datasets. SoccerDETR demonstrates robust detection across diverse scenarios. In penalty area situations with multiple overlapping players, SoccerDETR accurately detects individual players without false positives from NMS errors. Players at different distances from the camera are detected with consistent accuracy, from close-up shots to wide-angle views. The ball is accurately localized even when appearing as a small blob of 10–20 pixels, demonstrating the effectiveness of our scale-aware design. Partially occluded players and balls are detected with reasonable confidence, showing robustness to visual clutter. On SoccerNet, detections remain accurate despite scoreboards and logos that partially occlude the playing field.
4.5. Ablation Studies
We conduct comprehensive ablation studies on the Soccana dataset to analyze the contribution of each component and validate our design choices.
4.5.1. Component Analysis
Table 5 shows the incremental contribution of each component:
Effect of MobileMamba Backbone: Replacing the ResNet-50 backbone with MobileMamba improves mAP@50 by 1.6% while reducing parameters by 8.3% (42.0 M → 38.5 M) and increasing speed by 20.6% (68 → 82 FPS). This demonstrates that state space models can effectively capture visual features with superior efficiency compared to CNNs. The improvement is particularly notable for ball detection (+2.7%), suggesting that the global receptive field of SS2D helps detect small objects.
Effect of SDFM: Adding the Semantic-aware Dynamic Feature Fusion Module results in a 1.3% improvement in mAP@50 and 2.6% improvement in ball AP. The adaptive fusion mechanism enables more effective multi-scale feature aggregation compared to standard FPN, particularly benefiting small object detection where high-resolution features are crucial.
Effect of SCSA: The Spatial-Channel Synergistic Attention contributes 1.3% improvement in mAP@50 by enhancing feature discrimination through synergistic spatial and channel attention. The improvement is consistent across all object categories, indicating that SCSA provides general feature enhancement rather than scale-specific benefits.
Effect of Separable Dynamic Decoder: The decoder replacement improves mAP@50 by 0.6% while slightly reducing parameters (40.1 M → 39.8 M). The dynamic convolution attention provides effective query-feature interaction with lower computational cost than standard cross-attention.
Effect of Scale-Aware Focal Loss: The proposed SAFL provides 0.8% improvement in mAP@50 and 1.3% improvement in ball AP. This confirms that explicitly addressing scale imbalance through loss reweighting benefits small object detection.
The cumulative improvement from baseline to full model is 5.6% mAP@50 and 9.9% ball AP, demonstrating that all components contribute meaningfully to the final performance.
Figure 9 visualizes the progressive improvement and component contributions across different metrics.
4.5.2. SAFL Parameter Analysis
Table 6 analyzes the sensitivity of SAFL parameters:
Effect of the scale-sensitivity exponent: Setting the exponent to zero (no scale awareness) degrades ball detection by 2.6% (88.4% → 85.8%), confirming the importance of scale-aware weighting. Increasing it to 0.75 or 1.0 further improves ball AP but degrades player and referee detection, as the model over-focuses on small objects. The optimal setting achieves the best balance.
Effect of $\gamma$ (focusing parameter): Lower values (1.0, 1.5) reduce the focusing effect, leading to slightly weaker overall performance. Higher $\gamma$ (3.0) over-focuses on hard examples, potentially ignoring easy but important examples. The standard $\gamma = 2.0$ provides the best trade-off.
4.5.3. Backbone Comparison
Table 7 compares different backbone architectures:
CNN backbones (ResNet, ConvNeXt): ResNet-50 provides a strong baseline but is limited by local receptive fields. Deeper ResNet-101 improves accuracy but significantly increases computation. ConvNeXt-T modernizes CNN design with larger kernels and achieves better accuracy than ResNet-50.
Transformer backbones (Swin, ViT): Swin Transformer achieves good accuracy through hierarchical design and shifted windows, but the quadratic complexity of attention limits its speed. ViT-B/16 has the highest parameter count and slowest speed due to global attention on all patches.
SSM backbones (VMamba, MobileMamba): VMamba-T demonstrates the potential of state space models for vision, achieving competitive accuracy with good efficiency. MobileMamba-B2 further improves both accuracy and efficiency through the MRFFI module and optimized architecture, achieving the best results across all metrics.
The comparison validates our choice of MobileMamba as the backbone: it achieves 2.4% higher mAP@50 than VMamba-T while being 8.3% faster, and 4.1% higher mAP@50 than Swin-T while being 50% faster.
4.5.4. Decoder Architecture Analysis
Table 8 compares different decoder architectures. The Separable Dynamic Decoder achieves the best accuracy while being the most efficient, validating the effectiveness of dynamic convolution attention for query–feature interaction.
4.5.5. Number of Queries Analysis
Table 9 shows the effect of the number of object queries. Performance improves as the number of queries increases from 100 to 300, then saturates. We use 300 queries as the default, balancing accuracy and efficiency.
4.6. Computational Complexity Analysis
We provide a detailed analysis of the computational complexity of SoccerDETR, covering theoretical complexity bounds, actual resource consumption, and per-component latency breakdown.
4.6.1. Hardware Setup
All efficiency measurements are conducted on a single NVIDIA RTX 4090 GPU (24 GB VRAM) with an Intel Core i9-13900K CPU (32 GB RAM), running CUDA 12.1 and PyTorch 2.0. FPS is measured with batch size 1, input resolution 640 × 640, averaged over 1000 iterations after 100 warmup iterations to ensure stable measurements. FLOPs are computed using the fvcore library. Latency is measured as end-to-end wall-clock time, including pre-processing and post-processing.
4.6.2. Theoretical Complexity Comparison
Table 10 compares the computational complexity of different attention mechanisms:
Self-Attention: The quadratic complexity makes it prohibitive for high-resolution inputs. For 1600 tokens (640 × 640 image with 16 × 16 patches), this results in 2.56 M attention computations per layer.
Cross-Attention: The complexity depends on both query count N and feature token count M. With 300 queries and 1600 feature tokens, this results in 480 K attention computations.
Deformable Attention: By attending to only K sampling points per query (typically $K = 4$), the complexity reduces to $O(NK)$, significantly improving efficiency.
SS2D: The linear complexity enables efficient processing of long sequences. The four-directional scanning adds a constant factor of 4 but maintains linear scaling.
DyConvAtten: The separable design reduces complexity from multiplicative $O(NM)$ to additive $O(N + M)$, enabling efficient query–feature interaction.
The actual FLOPs measurements confirm the theoretical analysis: SS2D and DyConvAtten are significantly more efficient than standard attention mechanisms, enabling real-time processing of high-resolution soccer videos.
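The counts quoted above follow from simple arithmetic, reproduced in the sketch below. The function names are illustrative. The deformable count uses an assumed K = 4 sampling points per query, and the separable count assumes the additive form N + M; both ignore constant factors and channel dimensions.

```python
def token_count(image_size=640, patch=16):
    """Number of feature tokens for a square image split into square patches."""
    return (image_size // patch) ** 2


def self_attention_pairs(n_tokens):
    """Every token attends to every token: quadratic growth."""
    return n_tokens * n_tokens


def cross_attention_pairs(n_queries, n_tokens):
    """Every query attends to every feature token: multiplicative growth."""
    return n_queries * n_tokens


def deformable_pairs(n_queries, k=4):
    """Each query attends to only k sampled points (k = 4 is an assumption)."""
    return n_queries * k


def separable_cost(n_queries, n_tokens):
    """Additive cost for a separable design (assumed N + M form)."""
    return n_queries + n_tokens


m = token_count()
print(m)                              # 1600
print(self_attention_pairs(m))        # 2560000
print(cross_attention_pairs(300, m))  # 480000
print(deformable_pairs(300))          # 1200
print(separable_cost(300, m))         # 1900
```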
4.6.3. Overall Model Efficiency Comparison
Table 11 provides a comprehensive efficiency comparison. SoccerDETR achieves the lowest parameter count (39.8 M), lowest FLOPs (72.4 G), highest FPS (78), and lowest latency (12.8 ms) among all compared methods, while simultaneously achieving the highest mAP@50 (94.2%). Compared to RT-DETR-L, SoccerDETR reduces parameters by 5.2%, FLOPs by 16.0%, and latency by 5.2%, while improving mAP@50 by 1.9%. The efficiency gains are primarily attributed to the linear-complexity MobileMamba backbone (replacing the quadratic-complexity Transformer backbone) and the separable dynamic decoder (replacing standard cross-attention).
4.6.4. Per-Component Latency Breakdown
Table 12 provides a latency breakdown of SoccerDETR components. The backbone accounts for the largest portion (40.6%), followed by the decoder (21.9%) and encoder (16.4%). The SDFM and SCSA modules add only 14.1% overhead while providing significant accuracy improvements, demonstrating their efficiency.
4.7. Cross-Dataset Generalization
To evaluate the generalization capability of SoccerDETR, we conduct cross-dataset experiments where models are trained on one dataset and evaluated on the other.
Table 13 shows that SoccerDETR exhibits stronger cross-dataset generalization than competing methods. When trained on Soccana and tested on SoccerNet, SoccerDETR achieves 81.2% mAP@50, outperforming RT-DETR-L by 4.4%. This suggests that our model learns more transferable representations, likely due to the global receptive field of SS2D and the adaptive fusion of SDFM.
Figure 10 provides a comprehensive visualization of multi-dimensional performance comparison and cross-dataset results.
Figure 10.
Performance comparison visualization. (a) Multi-dimensional radar chart comparing SoccerDETR with competing methods across mAP@50, mAP@50:95, Precision, Recall, FPS, and parameter efficiency. (b) mAP@50 comparison across Soccana and SoccerNet datasets, demonstrating consistent improvements in SoccerDETR over all baselines.
4.8. Statistical Significance Testing
To ensure the reliability of our experimental results, we conduct statistical significance testing using paired t-tests between SoccerDETR and the strongest baseline (RT-DETR-L). Specifically, we train each model five times with different random seeds (seeds 0, 42, 123, 456, 789), and each training run constitutes one sample in the paired t-test. The pairing is based on the random seed, meaning we compare SoccerDETR trained with seed 0 against RT-DETR-L trained with seed 0, and so on. This design controls for random initialization effects and provides a fair comparison of the two architectures. The resulting five paired observations are used to compute the t-statistic and p-value. We report the mean and standard deviation of mAP across the five runs.
As shown in Table 14, all improvements of SoccerDETR over RT-DETR-L are statistically significant, confirming that the observed performance gains are not due to random variation. The low standard deviations (0.22–0.33%) indicate that SoccerDETR training is stable across different random initializations.
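The seed-paired test described above can be sketched with the standard library alone. The per-seed scores below are toy numbers invented for illustration, not the reported results. The sketch computes only the t-statistic and compares it against the two-sided 5% critical value for 4 degrees of freedom (2.776) instead of computing an exact p-value.

```python
import math
import statistics


def paired_t(a, b):
    """Paired t-statistic and degrees of freedom for matched samples.

    a[i] and b[i] must come from the same random seed.
    Assumes the paired differences are not all identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Toy mAP values for five seeds (hypothetical, for illustration only).
model_a = [90.1, 90.6, 90.2, 90.8, 90.4]
model_b = [88.9, 89.0, 88.8, 89.5, 88.7]
t_stat, dof = paired_t(model_a, model_b)
T_CRIT_05_DF4 = 2.776  # two-sided 5% critical value of Student's t, 4 dof
print(t_stat, dof, abs(t_stat) > T_CRIT_05_DF4)
```

Pairing by seed removes the variance shared by both models under the same seed, which matters with only five runs per model.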
4.9. Five-Fold Cross-Validation Analysis
To further validate the robustness of our results and mitigate potential bias from a single train/test split, we perform 5-fold cross-validation on the Soccana dataset. The dataset is divided into five non-overlapping folds at the image level with stratified sampling to maintain class distribution balance. Each fold uses 80% of images for training and 20% for testing, and we report the mean and standard deviation of mAP across the five folds. This provides an estimate of performance variance due to data splitting rather than random initialization.
Table 15 demonstrates that SoccerDETR achieves consistent performance across all five folds, with a mean mAP@50 of 94.1% and a standard deviation of only 0.23%. The per-class results are equally stable: Player AP varies by only 0.15%, Ball AP by 0.34%, and Referee AP by 0.20%. The cross-validation mean (94.1%) is consistent with the standard split result (94.2%), confirming that our reported results are representative and not an artifact of a favorable data split. The slightly higher variance in ball detection reflects the inherent difficulty and variability of small object detection across different subsets of the data.
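A minimal sketch of the image-level fold construction is shown below. It produces five non-overlapping folds with an optional seeded shuffle; the stratified sampling by class distribution is not reproduced here, and the function name and seed argument are assumptions.

```python
import random


def k_fold_indices(n_items, k=5, seed=None):
    """Yield (train_idx, test_idx) pairs for k non-overlapping folds."""
    idx = list(range(n_items))
    if seed is not None:
        random.Random(seed).shuffle(idx)
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size


# 11,673 Soccana images split into five folds (about 80% train, 20% test).
for fold, (train, test) in enumerate(k_fold_indices(11673, k=5, seed=0)):
    print(fold, len(train), len(test))
```

Each image appears in exactly one test fold, so the mean and standard deviation across folds reflect variation due to the data split.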
4.10. Failure Case Analysis
Despite the strong overall performance, SoccerDETR has limitations in certain challenging scenarios. When the ball occupies fewer than 10 pixels in very wide-angle shots, detection accuracy drops significantly, which is a fundamental limitation of the input resolution rather than the model architecture. During fast camera pans or powerful shots, motion blur can make the ball appear as an elongated streak, confusing the detector, and temporal modeling could potentially address this limitation. Circular objects in the background such as advertising logos and stadium lights occasionally cause false positives, and additional context modeling could help distinguish the actual ball. When the ball is completely occluded by players, detection fails, which is expected behavior since the model cannot detect objects that are not visible. These failure cases suggest directions for future improvement, including higher-resolution processing, temporal modeling, and enhanced context reasoning.
5. Discussion
Our experimental results demonstrate that SoccerDETR achieves state-of-the-art performance on soccer object detection while maintaining real-time inference speed, validating the effectiveness of integrating visual state space models with Transformer-based detection. The statistical significance testing (significant for all metrics) and 5-fold cross-validation (mAP@50 standard deviation of 0.23%) confirm the reliability and reproducibility of these results.
The success of the MobileMamba backbone confirms that state space models can effectively replace transformers for visual feature extraction, with the linear complexity of SS2D enabling efficient processing of high-resolution inputs essential for detecting small objects like balls. The global receptive field provided by multi-directional scanning captures comprehensive spatial dependencies analogous to transformer attention but with linear complexity. As shown in the computational complexity analysis (Table 11), SoccerDETR achieves the lowest FLOPs (72.4 G) and highest FPS (78) among all compared methods, with the MobileMamba backbone contributing to a 16.0% reduction in FLOPs compared to RT-DETR-L.
Our Scale-Aware Focal Loss explicitly addresses the extreme scale variations in soccer detection (20–100× between players and balls). The gradient behavior analysis (Equation (34)) reveals that SAFL amplifies gradient signals for small objects by approximately 3× compared to large objects, effectively compensating for the inherent difficulty of detecting small balls. Ablation studies confirm that removing scale awareness degrades ball detection by 2.6%.
The combination of SDFM and SCSA provides complementary benefits through adaptive multi-scale fusion and synergistic spatial-channel attention, while adding only 14.1% latency overhead (1.8 ms per image). The systematic comparison in Table 1 highlights that SoccerDETR is the first method to combine an SSM backbone with adaptive multi-scale fusion and synergistic attention, filling a gap in the existing literature.
Cross-dataset experiments demonstrate stronger generalization than competing methods, attributed to the global receptive field learning transferable representations and the adaptive fusion adjusting to domain shifts.
Beyond soccer video analysis, the proposed SoccerDETR framework has potential applications in other computer vision inspection tasks that share similar challenges of multi-scale object detection with real-time requirements. First, our approach could be adapted for exterior cladding material detection in urban environments [
57]. Similar to detecting players and balls at varying distances, identifying different cladding materials in street view images requires handling significant scale variations as buildings appear at different distances from the camera. The adaptive multi-scale fusion (SDFM) and scale-aware loss (SAFL) could be particularly beneficial for this task, enabling efficient processing of large-scale urban imagery. Second, the framework shows promise for safety inspection tasks such as non-PPE detection on construction sites [
58]. This task involves detecting workers without proper safety equipment from both body-worn cameras and general surveillance footage. The challenges are analogous to soccer detection: workers appear at various scales, may be partially occluded, and real-time detection is crucial for immediate safety alerts. Our efficient backbone and scale-aware training strategy could enhance detection performance in such safety-critical applications. Third, the linear complexity of our MobileMamba backbone makes it suitable for processing high-resolution imagery in industrial quality inspection, where defects may be small relative to the overall image size, similar to ball detection in wide-angle soccer footage.
Several limitations suggest future directions: temporal modeling could improve detection consistency and handle motion blur; higher-resolution processing could address extremely small ball detection; domain adaptation techniques could enable transfer to other sports; and production deployment requires model quantization and hardware optimization.