1. Introduction
Player tracking systems used in sports offer numerous potential benefits. For broadcasters, the use of a player tracking system may provide important contextual data that enhances the overall viewing experience. The system can generate statistics based on player movement and potentially player actions, which can be presented to viewers. Action recognition assists in identifying key moments in a match, enabling the creation of highlight packages after the game. Sports teams can leverage data to manage player load and gain insight into the impact of player positioning on match outcomes, leading to improved tactical decision-making. Integrating human action recognition (HAR) with player tracking has the potential to add context to the data that is being collected and can be used for better motion predictions and statistical player analyses.
While radar–camera fusion has been explored in prior work for human activity recognition and object detection, its application to fine-grained soccer-specific action recognition in outdoor environments remains limited.
A mm-wave radar as a sensor has the benefit of being resistant to weather effects, lighting, and occlusion, as well as being able to provide accurate range and radial velocity measurements. Radar HAR methods typically analyze time–frequency micro-Doppler spectrograms, which capture the Doppler shifts caused by moving limbs. Existing studies use hand-crafted features from these spectrograms. Javier and Kim apply linear predictive coding on the envelopes of the radar spectrograms for HAR [
1]. This was achieved in an indoor environment using only radial movements with classes including walking, crawling, and boxing. The reported accuracy is over 85%, and it is noted that a long enough observation window is required to obtain high accuracies. Caglıyan and Gürbüz also used radar spectrogram envelopes for HAR using a BumbleBee Radar in an indoor environment [
2]. The authors used a treadmill to keep the subject at a constant distance from the sensor, and they classified between walking, jogging, crawling, and crawling at a near tangential angle. The precision was reported between 0.837 and 0.94. Zhang et al. applied a Bayesian network on spectral data [
3]. The dataset used was the Carnegie Mellon University Graphics Lab Motion Capture Database [
4]. The classes used are walk, run, and jump, and the reported classification accuracy ranged from 95.34% to 98.11%. Chae et al. combined range–Doppler and Doppler-time features to monitor head motions but did not provide numerical results [
5]. Lin et al. performed HAR using mm-wave radar in an indoor environment by fusing features across the time–frequency and range–angle domains [
6]. Classification was performed using CNN-BiLSTM and PCANet-based fusion. Combinations of certain actions were classified, and this includes bending, squatting, standing, walking, and falling with a reported accuracy of 99.75%.
Van Eeden et al. used mel-frequency cepstral coefficients (MFCCs) to distinguish humans from animals in the field [
7]. A Gaussian mixture hidden Markov model (GMM-HMM) approach was applied, and a classification accuracy between 75% and 90% was obtained when classifying between humans and different animals. Even though this paper does not focus on differentiating human actions, it still shows the potential of using MFCC data for motion classification in an operational environment. From the above existing research, it can be seen that radar-based HAR can achieve high accuracies for classifying coarse actions, but there is a lack of research on the more fine-grained actions that relate to sport, specifically soccer.
Using a camera for HAR has the benefit of providing rich spatial and contextual information regarding a player’s pose and environment. Vision-based HAR is a well-documented topic where deep learning is often used to learn spatio-temporal features automatically. Zhang et al. recognized activities by performing multitask learning with the addition of attribute regularization on the KTH [
8], UIUC [
9], and Olympic Sport datasets [
10], with their approach outperforming existing methods at the time [
9]. Le et al. performed continuous action and gesture recognition using a sliding window approach to analyze hand motions over time. They report classification accuracies of 0.95, 0.97, and 0.71 on the IPN [
11], UOW, and InHARD datasets [
12], respectively, for isolated actions [
12]. They did, however, report accuracies of only 0.57, 0.76, and 0.33 for continuous action classification on the same respective datasets.
Jeon et al. proposed a lightweight radar–camera fusion deep learning model for human activity recognition in a recent study [
13]. Their model achieved a 98.74% classification accuracy on actions such as answering a phone, drinking, taking off glasses, grabbing a handle, sitting, standing, pickup, fall, recovery, handshake, walking, running, entering, and exiting. Yi et al. performed HAR using a multimodal fusion model that also utilizes both camera and radar data [
14]. Multiple classes were defined, including hand movements, head movements, leg lifts, hand raises, squats, stooping motions, body twists, and walking. The system was tested in both normal and complex environments, and the method obtained a reported average recognition accuracy of 99.3%. Both of these studies include only a single co-located camera. Keyter and de Villiers performed action recognition in an outdoor environment on a small dataset for camera and radar fusion to determine which features are potentially suitable for the task [
15]. It was concluded that MFCC data for the radar and histogram of oriented gradients (HOG) features for a camera are potentially the most suitable features to use. They have classified four classes, namely walk, jog, dribble walk, and dribble jog.
Hu et al. performed pose estimation with radar as opposed to cameras to predict bone lengths and joint rotation angles by integrating deep learning with forward kinematics [
16]. A mean joint error of roughly 3.5 cm was achieved. Using pose estimation from radar can lead to improved accuracy of HAR and could be beneficial for soccer player action recognition. Rivas-Caceido et al. performed HAR using IMU sensors [17]. The data was collected by placing five IMUs on participants’ hands, knees, and chest. The sensors collected orientation, linear acceleration, and angular velocity information. They were able to classify multiple activities at an average accuracy of 93.5%. Even though this approach has some merit, the number of sensors required on multiple players makes it too invasive, and thus impractical, for soccer player action recognition.
Despite notable studies published on the broader subject of HAR, the studies are typically conducted under ideal conditions and without the intricacies of soccer player action recognition. Measurements are taken in controlled or ideal environments, which include being indoors or the use of treadmills to keep the target within a certain range bin for micro-Doppler processing. The existing research also focuses on actions that are easier to separate, with actions such as kicks and dribbling being excluded. Dribbling with a soccer ball is very similar to normal walking or running, with the detail lying in the micro-Doppler signatures of these actions, which means the features need to be handled in such a way that these fine details can be used to separate the classes. Hence, it would be beneficial to perform research in an outdoor environment with these actions that are more difficult to separate to determine the degree to which it is possible to accurately classify soccer player actions.
The existing literature also tends to focus on radial movement with regard to radar. Range–Doppler processing provides radial velocity, and classification tasks become more challenging when the targets move tangential to the radar, which ultimately minimizes the data present in the Doppler domain. Owing to this challenge, it becomes crucial to focus on the micro-Doppler effects that can be captured using mm-wave radar.
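The effect of motion direction on the Doppler domain can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes a point target, the standard relation between radial speed and Doppler shift, and a hypothetical 77 GHz carrier frequency, and the function name `doppler_shift_hz` is illustrative only.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def doppler_shift_hz(speed_mps, angle_deg, carrier_hz=77e9):
    """Doppler shift of a point target moving at `speed_mps`.

    `angle_deg` is the angle between the velocity vector and the radar
    line of sight (0 = radial, 90 = tangential). `carrier_hz` is an
    assumed carrier frequency, not a value taken from the paper.
    """
    wavelength = C / carrier_hz
    # Only the velocity component along the line of sight is observable.
    radial_speed = speed_mps * math.cos(math.radians(angle_deg))
    return 2.0 * radial_speed / wavelength


if __name__ == "__main__":
    for angle in (0, 45, 90):
        print(angle, doppler_shift_hz(3.0, angle))
```

Under these assumptions the bulk Doppler shift shrinks with the cosine of the angle and vanishes for purely tangential motion, which is why the limb-induced micro-Doppler components carry most of the remaining information in that case.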
The use of radar and camera sensors in a complementary manner could thus potentially leverage the strengths of each sensor while mitigating their shortcomings and ultimately improve soccer player action recognition. This is a field that has not been studied thoroughly. A synchronized multicamera and radar system will be used to create a dataset that can be used in further studies relating to this topic.
A conference paper by the authors [
15] presented an exploratory study focusing primarily on the analysis of radar and vision feature representations using a single smartphone camera and a loosely synchronized sensing setup. The objective of that work was to investigate the behavior of different feature types for soccer-related actions. In contrast, the work presented in this paper addresses a different research objective, namely the development and evaluation of a synchronized multimodal sensing framework for fine-grained soccer player action recognition. The current study introduces a multiview RGB camera setup synchronized with mm-wave radar, enabling accurate temporal alignment between sensing modalities. Furthermore, the scope of this work has been substantially expanded to include multimodal feature fusion, ablation studies across sensor configurations, direction-dependent performance analysis, which includes tangential motions, and computational cost evaluation. Consequently, while the two studies are related in terms of application domain, the present paper focuses on multimodal action recognition and system-level evaluation rather than feature exploration.
The main contributions of this work are summarized as follows:
Multimodal sensing framework: a synchronized radar–camera system, consisting of a single radar and multiple cameras at different positions, is developed for fine-grained soccer action recognition in an outdoor environment, capturing complementary motion and visual information.
Comprehensive feature analysis: a wide range of radar (micro-Doppler, mel-spectrogram, MFCC) and visual (HOG, optical flow, pose, bounding box) features are systematically evaluated.
Block-based radar feature representation: CFAR-based region-of-interest selection and Doppler-centered alignment of the dominant return are used to improve the consistency of micro-Doppler signatures and enable better separation of torso and limb motion.
Extensive ablation and modality comparison: the contributions of individual feature groups and sensing modalities are analyzed through controlled ablation studies, including radar-only, camera-only, and fused configurations.
Direction-aware evaluation: the impact of motion direction (e.g., radial, tangential, diagonal, and horizontal) on radar and fusion performance is investigated.
Computational cost analysis: a detailed comparison of classical and deep learning models is provided, highlighting trade-offs between accuracy and efficiency for real-world deployment.
2. Materials and Methods
In this section, we describe the proposed multimodal radar–camera action recognition pipeline consisting of data collection, radar processing and feature extraction, camera processing and feature extraction, temporal windowing and multimodal fusion, and classification and evaluation. An overview of the system can be seen in
Figure 1.
2.1. Data Collection and Experimental Setup
Radar data was collected using the IWR1443Boost mm-wave FMCW radar and DCA1000EVM data capture board (Texas Instruments, Dallas, TX, USA). Each clip contains 500 radar frames, recorded at a slow-time sampling period of 40 ms. Three global-shutter cameras were deployed to provide complementary RGB viewpoints. Each camera used the IMX296 imaging sensor (Sony Corporation, Tokyo, Japan), a global shutter camera from the Raspberry Pi camera suite (Raspberry Pi Ltd., Cambridge, UK). The cameras recorded at 25 fps to match the slow-time capture rate of the radar. The cameras were each placed at chest height with one co-located with the radar on the horizontal axis as seen in
Figure 2. The cameras are synchronized using software, and the radar is hardware synchronized to one of the cameras. Two participants with different body types and fitness levels were then tasked to perform actions related to soccer, as well as some used in the literature, that include walk, jog, dribble-walk, dribble-jog, jump, kick, and crawl. The difference in participants introduces subject variability in motion dynamics. Although the number of subjects is limited, additional variability is introduced through multiple repetitions and varying motion directions, which helps improve the robustness of the learned models. The actions were performed in different directions where the radar–camera pair is the reference point. These include radial, diagonal (45° and 135° from the radar–camera pair), horizontal, and tangential movements in both directions between the start and end points.
The placement of the cameras and radar, as well as the directions of motion, can be seen in
Figure 3, while
Table 1 describes the radar system specifications.
Table 2 and
Table 3 describe the dataset that was created.
2.2. Radar Processing and Feature Extraction
Figure 4 illustrates the processing chain used to obtain log-power range–Doppler frames ready for feature extraction from the raw ADC samples. The pipeline consists of fast-time DC removal, range windowing and FFT, slow-time mean removal (MTI), Doppler windowing and FFT, RX combining, and finally log-power conversion.
The radar front-end employs an FMCW architecture with a direct-conversion quadrature receiver, producing complex baseband signals where the in-phase (I) and quadrature (Q) components encode amplitude as well as phase information. While ideal receivers assume perfect orthogonality between these channels, practical systems exhibit amplitude and phase imbalance due to hardware imperfections. This leads to a distortion of the IQ constellation from a circular to an elliptical trajectory and introduces artefacts such as spectral distortion, ghost targets, and mirrored components in the micro-Doppler signature as described by Cardillo [
18].
In this work, range–Doppler processing is performed directly on the raw IQ data using a standard two-stage FFT pipeline. A fast-time FFT is applied along each chirp to obtain the range information, followed by a slow-time FFT across chirps to extract Doppler information. Before this, the data is conditioned using DC offset removal, static clutter suppression, windowing, and non-coherent integration across receive antennas to improve robustness under real measurement conditions.
No explicit hardware-level IQ calibration is applied. Instead, the classification models are trained on real measured data, allowing residual hardware non-idealities to be handled implicitly within the learning process.
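The processing chain described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the array layout `(n_rx, n_chirps, n_samples)`, the Hann windows, and the function name `range_doppler_log_power` are assumptions, since the paper does not specify the window type or data layout.

```python
import numpy as np


def range_doppler_log_power(adc, eps=1e-12):
    """Turn one frame of raw IQ samples into a log-power range-Doppler map.

    adc: complex array shaped (n_rx, n_chirps, n_samples) (assumed layout).
    Returns a real array shaped (n_chirps, n_samples) = (Doppler, range) in dB.
    """
    n_rx, n_chirps, n_samples = adc.shape
    # Fast-time DC removal: subtract the mean of each chirp.
    x = adc - adc.mean(axis=2, keepdims=True)
    # Range windowing and FFT along fast time (Hann window is an assumption).
    x = x * np.hanning(n_samples)[None, None, :]
    rng = np.fft.fft(x, axis=2)
    # Slow-time mean removal (simple MTI): suppress static clutter per range bin.
    rng = rng - rng.mean(axis=1, keepdims=True)
    # Doppler windowing and FFT along slow time, zero Doppler moved to the centre.
    rng = rng * np.hanning(n_chirps)[None, :, None]
    rd = np.fft.fftshift(np.fft.fft(rng, axis=1), axes=1)
    # Non-coherent RX combining: sum the power over receive antennas.
    power = (np.abs(rd) ** 2).sum(axis=0)
    # Log-power conversion.
    return 10.0 * np.log10(power + eps)
```

In this sketch a target that is static across chirps is removed by the slow-time mean subtraction, while a moving target appears as a peak offset from the central Doppler bin.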
2.2.1. CFAR-Based Block Localization
Given that the target does not remain within a specific range bin, as is often the case in the literature, the extended human target needs to be detected to generate the radar features. A CFAR-like detector is applied to each range–Doppler frame to achieve this [19]. For every cell $(r, d)$, a local noise estimate is computed using a 2D training window with guard cells removed. A detection is declared when
$$P(r, d) > \mu(r, d) + K,$$
where $P(r, d)$ is the log-power of the cell, $\mu(r, d)$ is the local mean, and $K$ is a constant offset. A non-maximum suppression filter retains only local peaks, producing a sparse detection mask.
A fixed-size window is slid across the RD frame. Each location is scored using a combination of RD amplitude and detection-mask activity. The window with the maximum score defines the block location, and the corresponding block is used for the block-level Doppler processing and for the per-range and block-based radar feature extraction described next. To stabilize the representation across frames, the block is re-centered such that the maximum-power Doppler bin (typically corresponding to torso motion) is aligned to the block center. This improves separability between torso motion and surrounding limb-induced micro-Doppler signatures.
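The detection and block-selection steps can be sketched as below. This is an illustrative sketch under stated assumptions, not the paper's code: the training and guard sizes, the offset in dB, the 3 × 3 suppression neighbourhood, the block size, and the scoring weight `mask_weight` are all hypothetical values, and the score (mean amplitude plus weighted detection count) is only one plausible way to combine amplitude and mask activity.

```python
import numpy as np


def cfar_mask(rd_db, train=4, guard=1, offset_db=8.0):
    """Cell-averaging CFAR-like detector on a log-power range-Doppler frame."""
    n_d, n_r = rd_db.shape
    half = train + guard
    # Edge padding is a simplification for cells near the frame border.
    padded = np.pad(rd_db, half, mode="edge")
    mask = np.zeros_like(rd_db, dtype=bool)
    for i in range(n_d):
        for j in range(n_r):
            win = padded[i:i + 2 * half + 1, j:j + 2 * half + 1]
            inner = win[train:-train, train:-train]  # guard cells + cell under test
            n_train = win.size - inner.size
            local_mean = (win.sum() - inner.sum()) / n_train
            mask[i, j] = rd_db[i, j] > local_mean + offset_db
    return mask


def local_peaks(rd_db, mask, size=3):
    """Keep detections that are the maximum of their size x size neighbourhood."""
    half = size // 2
    padded = np.pad(rd_db, half, mode="constant", constant_values=-np.inf)
    keep = np.zeros_like(mask)
    for i, j in zip(*np.nonzero(mask)):
        keep[i, j] = rd_db[i, j] >= padded[i:i + size, j:j + size].max()
    return keep


def select_block(rd_db, peaks, height=16, width=8, mask_weight=10.0):
    """Slide a height x width window and return the best-scoring location and block."""
    best, best_score = (0, 0), -np.inf
    for i in range(rd_db.shape[0] - height + 1):
        for j in range(rd_db.shape[1] - width + 1):
            amplitude = rd_db[i:i + height, j:j + width].mean()
            activity = peaks[i:i + height, j:j + width].sum()
            score = amplitude + mask_weight * activity
            if score > best_score:
                best, best_score = (i, j), score
    i, j = best
    return best, rd_db[i:i + height, j:j + width]
```

The explicit loops keep the logic readable; a production version would more likely use vectorized or convolution-based local averages.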
2.2.2. Radar Feature Extraction
Radar features are derived from both the full range–Doppler map and the CFAR-selected block. This results in three complementary representations: global Doppler features, block-level features, and per-range extended-target features. Together, these provide both coarse and fine-grained motion descriptors for the extended human target.
Global features summarize Doppler activity across all range bins and capture overall motion intensity. Block-level features restrict the analysis to a compact Doppler neighborhood around the torso bin, which produces the strongest and most stable radar return, with the torso centered to reduce temporal drift. Limb movements, such as those from arms and legs, generate additional micro-Doppler components that vary more rapidly and are distributed around the torso motion. Without recentering, these components may shift across Doppler bins due to variations in target motion, leading to inconsistent representations over time. By aligning the dominant Doppler component to the block center, the torso motion is stabilized, and the limb-induced micro-Doppler signatures become more symmetrically distributed around it. This improves the separability between the central (torso) and peripheral (limb) motion components and is expected to lead to more consistent and discriminative features for classification. Per-range features preserve the spatial structure of the extended human target by extracting a Doppler signature independently for each range column, enabling the model to separate torso motion from limb-induced micro-Doppler at neighboring ranges.
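The recentering step can be illustrated with a few lines of NumPy. The sketch assumes the block is stored as a (Doppler, range) array and uses a circular shift along the Doppler axis; the paper does not state whether a circular shift or a re-crop is used, so this choice and the name `recenter_block` are assumptions.

```python
import numpy as np


def recenter_block(block_db):
    """Shift the Doppler axis (axis 0) so the strongest bin sits at the centre.

    Returns the shifted block and the original index of the dominant Doppler bin.
    """
    doppler_profile = block_db.max(axis=1)   # strongest return per Doppler bin
    peak_bin = int(np.argmax(doppler_profile))
    centre = block_db.shape[0] // 2
    # Circular shift (an assumption); bins wrap around the block edges.
    return np.roll(block_db, centre - peak_bin, axis=0), peak_bin
```

The returned `peak_bin` could serve as a per-frame center Doppler descriptor, while the shifted block keeps the limb-induced components at fixed offsets relative to the torso return.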
The complete set of radar features is summarized in
Table 4.
All mel and MFCC features are computed using standard mel-filterbank processing and a DCT-II transform. For each frame t, all radar descriptors (global mel, global MFCCs, global mel energy, block mel, block MFCCs, center Doppler, and per-range mel) are concatenated into a single radar feature vector. These frame-level vectors are aggregated over temporal windows for multimodal fusion.
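As a rough illustration of the mel and MFCC computation, the sketch below builds a triangular mel filterbank and an orthonormal DCT-II matrix in NumPy. It assumes a one-sided power spectrum on a linear frequency axis from 0 to `f_max`; how the two-sided Doppler axis is mapped onto the mel scale is not stated in the text, and `f_max`, `n_mels`, and `n_mfcc` are hypothetical parameters.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_bins, f_max):
    """Triangular filters on a linear axis of n_bins points in [0, f_max]."""
    freqs = np.linspace(0.0, f_max, n_bins)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(f_max), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def dct2_matrix(n_out, n_in):
    """Orthonormal DCT-II basis, shape (n_out, n_in)."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in)) * np.sqrt(2.0 / n_in)
    basis[0] /= np.sqrt(2.0)
    return basis


def mel_and_mfcc(power_spectrum, n_mels=10, n_mfcc=6, f_max=1000.0, eps=1e-10):
    """power_spectrum: (n_frames, n_bins) -> log-mel (n_frames, n_mels), MFCC (n_frames, n_mfcc)."""
    fb = mel_filterbank(n_mels, power_spectrum.shape[-1], f_max)
    log_mel = np.log(power_spectrum @ fb.T + eps)
    mfcc = log_mel @ dct2_matrix(n_mfcc, n_mels).T
    return log_mel, mfcc
```

The same two functions could be applied to the global Doppler profile and to the block, with the resulting vectors concatenated per frame.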
Figure 5 shows the global mel spectrogram computed from the full Doppler spectrum. Due to the dominance of the torso return and the inclusion of background components, finer micro-Doppler signatures associated with limb motion are less clearly separated. In addition, the effective resolution is influenced by radar processing parameters, such as the number of Doppler bins and FFT configuration, which further limit the ability to resolve closely spaced motion components.
Figure 6 shows the mel spectrogram computed from a localized block centered on the detected region of interest. By focusing on this region and aligning the dominant Doppler component, background interference is reduced, and limb-induced micro-Doppler variations become more pronounced. This representation should therefore be more suitable for capturing fine-grained motion characteristics. To further improve the representation, the detected region is recentered such that the maximum-power Doppler component is aligned to the block center. This stabilizes the dominant torso return over time and provides a consistent reference point. Without this alignment, changes in target velocity shift the Doppler signature, which can blur the micro-Doppler patterns. By fixing the dominant return, limb-induced variations remain centered and become easier to distinguish.
Figure 6 shows the block-based mel representation.
2.3. Camera Processing and Feature Extraction
Camera-based features complement the radar modality by capturing appearance, motion, and articulated pose information. A summary of the camera features used can be seen in
Table 5. All features except full-frame HOG are computed from the player-centered region of interest (ROI) obtained from YOLO-based person detections [
22]. The pose, extracted from the ROI using the OpenCV library [
23], is normalized to a fixed body size to ensure scale- and translation-invariant articulation features.
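One simple way to obtain scale- and translation-invariant pose features is sketched below. It is an assumption-laden example: the keypoints are taken to follow the 17-point COCO ordering, the hip midpoint is used as the origin, and the torso length is used as the body-size reference. The paper does not state which keypoint layout or reference length is used.

```python
import numpy as np

# COCO keypoint indices (assumed layout, not confirmed by the paper).
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 5, 6, 11, 12


def normalize_pose(keypoints, eps=1e-6):
    """keypoints: (17, 2) pixel coordinates -> hip-centred, torso-scaled coordinates."""
    kp = np.asarray(keypoints, dtype=float)
    hip = 0.5 * (kp[L_HIP] + kp[R_HIP])
    shoulder = 0.5 * (kp[L_SHOULDER] + kp[R_SHOULDER])
    torso = np.linalg.norm(shoulder - hip)
    # Translate to the hip midpoint and divide by the torso length.
    return (kp - hip) / max(torso, eps)
```

With this normalization, the same articulation observed closer to or further from the camera maps to the same feature vector.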
Optical flow and pose estimation provide complementary motion representations that enhance the multimodal framework. Optical flow captures dense, pixel-level motion between consecutive frames, enabling the modeling of fine-grained movements such as limb dynamics. In contrast, radar captures motion via micro-Doppler signatures, providing robust measurements of radial velocity along the line of sight.
Pose estimation further contributes a structured representation of human body dynamics by encoding joint-level relationships. While radar reflects the underlying motion dynamics, pose features provide additional spatial context that is not directly observable from radar alone. These representations offer complementary perspectives: radar captures velocity-based motion information, optical flow captures detailed motion fields, and pose estimation captures high-level structural information.
Recent advances in optical flow estimation, such as PWC-Net [
24], and transformer-based approaches for pose estimation [
25] highlight the increasing capability of deep models to capture complex motion patterns. These developments further support the role of vision-based representations in modeling human motion within multimodal systems.
Table 5. Camera feature groups extracted per frame.
| Feature Group | Description |
|---|---|
| Bounding Box | Player center, width, height, and aspect ratio. |
| HOG (Full) | Histogram of oriented gradients over the full image [26]. |
| HOG (ROI) | HOG descriptor within the player bounding box. |
| Optical Flow | Dense optical-flow magnitude and orientation statistics [27]. |
| Pose Keypoints | Normalized 2D joint coordinates. |
Figure 7 displays the bounding box, pose estimation and optical flow visualization of a player.
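Two of the simpler feature groups in Table 5 can be sketched directly in NumPy. The example assumes a bounding box given as corner coordinates and a dense flow field stored as an `(H, W, 2)` array of per-pixel displacements; the specific statistics (mean, standard deviation, maximum, and a magnitude-weighted orientation histogram) and the width-over-height aspect ratio are illustrative choices, not the paper's exact definitions.

```python
import numpy as np


def bbox_features(x1, y1, x2, y2):
    """Centre (cx, cy), width, height, and aspect ratio (width / height, assumed)."""
    w, h = x2 - x1, y2 - y1
    return np.array([x1 + w / 2.0, y1 + h / 2.0, w, h, w / h])


def flow_statistics(flow, n_bins=8):
    """flow: (H, W, 2) array of (dx, dy) -> magnitude stats + orientation histogram."""
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx) % (2.0 * np.pi)
    # Orientation histogram weighted by magnitude, normalized to sum to one.
    hist, _ = np.histogram(ang, bins=n_bins, range=(0.0, 2.0 * np.pi), weights=mag)
    hist = hist / (hist.sum() + 1e-12)
    return np.concatenate([[mag.mean(), mag.std(), mag.max()], hist])
```

The flow field itself would come from a dense optical-flow estimator applied to the player ROI in consecutive frames.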
2.4. Temporal Windowing and Multimodal Fusion
Radar and camera features are temporally aligned by frame, ensuring that each timestamp contains a corresponding set of radar and camera descriptors. The features are aggregated into temporal windows of 192 frames, which corresponds to 7.68 s at 25 fps. A 50% overlap is used to preserve continuity between windows.
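The windowing can be sketched as follows. The helper name `sliding_windows` is hypothetical, and dropping the trailing frames that do not fill a complete window is an assumption; the paper does not say how the end of a clip is handled.

```python
import numpy as np


def sliding_windows(features, window=192, overlap=0.5):
    """features: (n_frames, n_features) -> (n_windows, window, n_features)."""
    hop = int(round(window * (1.0 - overlap)))
    starts = list(range(0, features.shape[0] - window + 1, hop))
    if not starts:
        # Clip shorter than one window: return an empty batch.
        return np.empty((0, window, features.shape[1]))
    return np.stack([features[s:s + window] for s in starts])
```

For a 500-frame clip, a 192-frame window with a hop of 96 frames yields four complete windows under this scheme, each spanning 192 / 25 = 7.68 s.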
Three separate fusion strategies are evaluated. Early fusion concatenates radar and camera features along the feature dimension within each window. Pooled statistics serve as input to an L1-regularized logistic regression, while the raw fused sequence is leveraged directly by the gated recurrent unit (GRU), long short-term memory (LSTM), 1D-CNN, and temporal convolutional network (TCN) models. Late fusion trains separate radar-only and camera-only logistic regression classifiers independently, combining their predicted probabilities via a weighted sum, with weights (radar: 0.2, camera: 0.8) selected via a sweep over the validation set. A hyperparameter sweep showed that the large per-range radar features negatively affect the fusion results if they are not reduced; hence, these features are collapsed for fusion, while the full features are used when only radar is available.
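The early- and late-fusion operations described above reduce to a concatenation and a weighted sum. The sketch below is illustrative: the pooled statistics are assumed to be the per-feature mean and standard deviation, which the paper does not specify, and the function names are hypothetical. The default weights are the ones reported in the text.

```python
import numpy as np


def early_fusion(radar_window, camera_window):
    """Concatenate (T, F_radar) and (T, F_camera) windows along the feature axis."""
    return np.concatenate([radar_window, camera_window], axis=1)


def pooled_statistics(window):
    """(T, F) window -> (2F,) vector of per-feature mean and standard deviation."""
    return np.concatenate([window.mean(axis=0), window.std(axis=0)])


def late_fusion(p_radar, p_camera, w_radar=0.2, w_camera=0.8):
    """Weighted sum of class probabilities shaped (n_samples, n_classes)."""
    fused = w_radar * p_radar + w_camera * p_camera
    return fused, fused.argmax(axis=1)
```

With weights of 0.2 and 0.8, a confident camera prediction overrides a confident but conflicting radar prediction, which matches the camera-dominated behaviour reported for this dataset.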
2.5. Classification and Evaluation Protocol
One classical model and several deep models are used for classification. Features were standardized and projected using principal component analysis (PCA) [
28], retaining 95% of the total variance, and split using the GroupKFold cross-validation implementation provided by scikit-learn (Version 1.3.2) [
29] over 5 folds (roughly 80% training/20% testing per fold) to avoid temporal leakage. Experiments were conducted with a fixed random seed of 42 for cross-validation, PCA, PyTorch (Version 2.4.1, CUDA 12.1), NumPy (Version 1.24.4), and Python (Version 3.8.20) random to ensure reproducibility. The classical model used is logistic regression [
30]. For deep sequence modeling, five architectures are evaluated to span the space of temporal modeling approaches. GRU [
31] and LSTM [
32] are employed as representative recurrent architectures capable of capturing long-range sequential dependencies, with bidirectional processing to exploit both past and future context within each window. A 1D-CNN [
33] is included as a computationally efficient convolutional baseline that captures local temporal patterns. TCN [
34] extends this approach with dilated causal convolutions and residual connections, providing an exponentially larger receptive field without the vanishing gradient challenges associated with recurrent architectures. Finally, a cross-modal transformer (CMT), built on the self-attention mechanism of Vaswani et al. [
35], is included as the sole fusion-aware architecture for radar–camera fusion. Its dual-stream encoder with bidirectional cross-attention is designed to model interactions between the radar and camera streams, making it the only classifier in this study that explicitly learns inter-modality relationships rather than treating the fused feature vector as a single undifferentiated input.
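The classical branch of this protocol can be sketched with scikit-learn. This is a minimal sketch under assumptions: the scaler and PCA are fitted inside each fold, the `saga` solver and `max_iter` value are illustrative choices, and `groups` is assumed to hold one clip identifier per window so that windows from the same clip never appear on both sides of a split.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cross_validate(X, y, groups, n_splits=5, seed=42):
    """Return an (n_splits, 2) array of (macro F1, balanced accuracy) per fold."""
    scores = []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups):
        model = make_pipeline(
            StandardScaler(),
            # Keep enough components to explain 95% of the variance.
            PCA(n_components=0.95, svd_solver="full", random_state=seed),
            LogisticRegression(penalty="l1", solver="saga", max_iter=5000,
                               random_state=seed),
        )
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append((f1_score(y[test_idx], pred, average="macro"),
                       balanced_accuracy_score(y[test_idx], pred)))
    return np.array(scores)
```

Fitting the preprocessing steps inside the loop is a design choice that keeps test-fold statistics out of the training pipeline.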
Table 6 shows the architecture specifications for the chosen classifiers, and
Table 7 shows the hyperparameters chosen for the deep classifiers.
Given that the dataset is unbalanced, classification accuracy alone is not a sufficient metric. Macro F1, balanced accuracy (bAcc), precision, recall, and confusion matrices are used to represent the performance of the classifiers.
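To make the difference between these metrics concrete, the following sketch computes macro F1 and balanced accuracy from a confusion matrix using their standard definitions. The helper names are hypothetical, and the code is an illustration rather than the evaluation code used in the study.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm


def macro_f1_and_balanced_accuracy(cm):
    tp = np.diag(cm).astype(float)
    support = cm.sum(axis=1)      # true samples per class
    predicted = cm.sum(axis=0)    # predicted samples per class
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    # Macro F1 averages per-class F1; balanced accuracy averages per-class recall.
    return f1.mean(), recall.mean()
```

Because both metrics average over classes rather than samples, a rare class such as kick contributes as much to the score as a frequent class such as walk.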
4. Discussion
Table 8 and
Table 9 indicate that certain feature groups demonstrate a strong standalone performance, suggesting that they encode motion patterns directly relevant to action discrimination without requiring complementary modalities. This is especially true for the camera features, with the HOG features for the entire frame performing the best, followed by the HOG features for the bounding box around the soccer player. Conversely, several feature groups show limited standalone performance, suggesting that they are either weak or require supplemental features for classification. This is the case with several radar features. The classification accuracy was low for the center Doppler of the targets, which relates to the main body sway, and for the mel energy. The spectrogram of the detected block in the range–Doppler map achieved the highest standalone accuracy amongst the radar features, indicating that it could be a strong feature. It should also be noted that the strongest standalone feature groups also tend to be those with the higher dimensionalities. In this test, the classical logistic regression achieved lower accuracies for all feature groups except for the HOG of the full frame, where it achieved the highest accuracy overall.
The ablation study in
Table 10 indicates that the omission of the HOG features for the full frame results in the most significant drop in performance in terms of the acc and bAcc metrics, with the omission of the HOG features for the region of interest showing the second most significant drop in performance. This corresponds to the data in
Table 9. It should be noted that the omission of radar features does not lead to a significant decrease in classification performance and in some cases results in a slight improvement. This is likely due to the strong discriminative power of the visual features in the current dataset, where camera-based information dominates the classification task. In contrast, radar features provide complementary motion information, which becomes more relevant in challenging scenarios such as variations in motion direction or conditions where visual data may be degraded. A general observation is that in all of the cases, the drops in accuracy and macro F1 score are modest, which suggests that discriminative information may be duplicated in multiple features and the dropping of any single feature has a minimal effect on performance. This points to robustness in the current selection of features.
From
Table 11, it can be seen that the radial direction achieves the highest classification accuracy for the radar modality. This is expected, as FMCW radar measures radial velocity directly, and motions aligned with the radar line of sight produce strong, well-separated Doppler signatures. In contrast, horizontal and tangential motions exhibit much weaker radial components, leading to poorer Doppler separability and consequently lower recognition performance. This especially affects the accuracy of the logistic regression and GRU results. These classifiers achieve their poorest accuracy for the horizontal motions, followed by the tangential motions. The diagonal motions produce intermediate results and are slightly poorer than the radial and combined motions. They do, however, outperform the combined case where the 1D-CNN is used. For all cases, with the exception of the combined motions, 1D-CNN achieves the best results, and deep models broadly outperform logistic regression across all directions, confirming that exploiting temporal patterns within the 192-frame window is critical for radar-based action recognition. Compared to
Table 8, there is an increase in the overall accuracy when the features are combined over the best case for a single feature group. Table 8 shows that the block spectrogram per range performs the best overall for radar and suggests that it is the strongest of the radar features.
Table 9 shows that HOG features for the full frame are the strongest features by themselves, achieving a high accuracy for all classifiers. From
Table 8,
Table 9 and
Table 12, it can be seen that radar by itself, though reasonable, does not classify nearly as accurately as a single camera. A single center camera under TCN achieves a macro F1 of 0.993 ± 0.015, compared to a maximum of 0.897 ± 0.033 for radar-only under the same classifier, a gap that persists across all deep architectures.
Table 13 suggests that adding a second camera at a different position yields further improvement, with the right and center camera achieving 0.996 ± 0.008 under both 1D-CNN and TCN. Even though adding radar to a single camera performs worse than adding a second camera, it can still prove beneficial, given that a single co-located radar–camera setup can be used to gain a slight boost in accuracy. Interestingly, when radar is combined with two cameras, classification performance matches or exceeds the three-camera camera-only configuration for all of the deep models, providing evidence that radar features carry complementary information not captured by additional viewpoints alone. Adding a third camera does not improve the best performance but does benefit logistic regression for both early and late fusion and also slightly improves the 1D-CNN results.
Prior radar-only approaches typically report classification accuracies in the range of 85–90% under controlled indoor conditions [
1,
2], which aligns with the radar-only performance observed in this work. More recent radar-based methods employing advanced feature fusion and deep learning have reported significantly higher accuracies, exceeding 98% in some cases [
6]. However, these results are generally obtained on constrained indoor datasets with relatively coarse activity classes and limited variability.
In contrast, camera-based methods often achieve near-perfect performance due to the richer spatial information available in visual data, particularly in controlled environments [
9,
12].
More recent multimodal radar–camera approaches report accuracies exceeding 98% by combining micro-Doppler and visual features [
13,
14]. However, these studies typically focus on relatively simple daily activities and are evaluated in controlled indoor environments, which limits the variability of the observed motion patterns.
In this work, the focus is on fine-grained soccer actions, which involve more complex motion patterns and object interactions. A GroupKFold strategy is also used to ensure proper separation between training and testing clips. Despite the increased task complexity, the proposed multimodal system achieves a macro F1-score of up to 0.998. Direct numerical comparison remains difficult due to differences in dataset characteristics and evaluation protocols.
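The clip-level separation described above can be sketched with scikit-learn's GroupKFold, using the clip identifier as the group so that all windows from one clip fall on the same side of each split. The data below are synthetic, and the number of clips, windows, features, classes, and folds, as well as the logistic regression classifier, are illustrative assumptions rather than the configuration used in this work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GroupKFold

# Synthetic stand-in data: every clip contributes several windows of one class.
rng = np.random.default_rng(0)
n_clips, windows_per_clip, n_features, n_classes = 24, 10, 8, 3
clip_labels = np.arange(n_clips) % n_classes
groups = np.repeat(np.arange(n_clips), windows_per_clip)  # clip id per window
y = np.repeat(clip_labels, windows_per_clip)
X = rng.normal(size=(y.size, n_features)) + y[:, None]  # class-dependent shift

scores = []
leaks = 0
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    # Count clips that appear on both sides of the split (should be zero).
    leaks += len(set(groups[train_idx]) & set(groups[test_idx]))
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    scores.append(f1_score(y[test_idx], y_pred, average="macro"))

print("clips shared between train and test:", leaks)
print("macro F1 per fold:", np.round(scores, 3))
```

Splitting by window index instead of by clip would place neighboring, highly correlated windows in both partitions and could inflate the reported scores.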
The confusion matrix in Figure 8 shows that the radar-only approach achieves over 0.85 per-class accuracy for all classes except walk. This class is most frequently misclassified as dribble walk, which is expected given the subtle differences in motion between these two activities and the limited micro-Doppler separation in radar data. Similarly, dribble walk is misclassified as crawl, walk, and jog, but to a lesser extent. Figure 9 shows that using a single camera already mitigates most of the shortcomings of radar, with only jog being misclassified as dribble walk and kick on a small number of occasions. Figure 10 shows that the combination of two cameras and a radar classifies soccer player actions extremely accurately with minimal misclassification.
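For reference, the per-class accuracy read from a confusion matrix is the diagonal entry divided by the row sum, that is, the fraction of windows of a given true class that were assigned to that class. The sketch below shows this computation on made-up labels for three of the classes; the counts are purely illustrative and are not results from this work.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


def per_class_accuracy(cm):
    """Diagonal divided by row sums; classes with no samples map to 0."""
    row_sums = cm.sum(axis=1)
    return np.divide(np.diag(cm), row_sums,
                     out=np.zeros(len(cm)), where=row_sums > 0)


class_names = ["walk", "dribble walk", "jog"]  # subset of classes, toy labels
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0, 2, 2])

cm = confusion_matrix(y_true, y_pred, len(class_names))
acc = per_class_accuracy(cm)
for name, value in zip(class_names, acc):
    print(f"{name}: {value:.2f}")
```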
The t-SNE embeddings in Figure 11 display partially overlapping clusters for the radar, consistent with the misclassifications seen in Figure 8. Although kick, jump, and crawl form distinct clusters, there is significant overlap among the walk, dribble walk, and dribble jog classes. In contrast, Figure 12 shows that the camera features produce tighter, more distinct clusters, indicating that visual data has higher discriminative power for the activities performed; some overlap remains, but to a significantly lesser extent. The fused representation exhibits the most clearly separated clusters. Although these clusters appear less compact than in the camera-only embeddings, there is less overall overlap. These observations correspond to the reported classification results.
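As an illustration of how such embeddings are typically produced, the following sketch projects synthetic feature vectors to two dimensions with scikit-learn's TSNE. The feature dimensionality, number of classes, perplexity, and random seed are assumptions chosen for the example and are not the settings used to generate the figures in this work.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in features: three classes drawn around separated centers.
rng = np.random.default_rng(0)
n_per_class, n_features, n_classes = 20, 16, 3
centers = rng.normal(scale=5.0, size=(n_classes, n_features))
X = np.vstack([c + rng.normal(size=(n_per_class, n_features)) for c in centers])
labels = np.repeat(np.arange(n_classes), n_per_class)

# Perplexity must be smaller than the number of samples (60 here).
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)

# Per-class centroids in the 2D embedding give a rough view of cluster layout.
for k in range(n_classes):
    print(k, embedding[labels == k].mean(axis=0))
```

Because t-SNE preserves local neighborhoods rather than global distances, cluster overlap in such plots is informative, whereas the spacing between well-separated clusters should be interpreted with caution.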
From the computational results reported in Table 14 and Table 15, it can be seen that the cost of radar processing is dominated by loading the data. This step is I/O-bound and depends primarily on storage throughput rather than computational complexity. Since the raw radar data files are large and stored on external media, data transfer time exceeds the time required for subsequent signal processing operations. Feature extraction steps incur negligible computational cost in comparison. CFAR detection incurs low computational cost due to the use of GPU processing in this case. Among the radar features, mel filterbank computation and MFCC extraction incur the highest cost. Camera processing is detection-limited, with object detection accounting for the majority of the computational cost. Most of the camera features have relatively low computation times, with the exception of pose estimation, which takes significantly longer than the other features. For both modalities, the feature extraction stages are computationally inexpensive and scale favorably. In the case of the radar, if data transfer rates could be improved, inference duration could be significantly reduced. Data processing can also be sped up by utilizing GPU processing for the creation of bounding boxes and pose estimation.
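A per-stage breakdown of this kind can be obtained by wrapping each pipeline stage in a timer, as in the minimal sketch below. The stage names and the two placeholder functions are hypothetical stand-ins for data loading and feature extraction and do not reproduce the processing chain used in this work.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Accumulate wall-clock time spent inside the block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


def load_frames(n):
    # Placeholder for reading raw radar data from storage (I/O-bound in practice).
    return [float(i) for i in range(n)]


def extract_features(frames):
    # Placeholder for a cheap feature computation on the loaded frames.
    return sum(frames) / len(frames)


with timed("load"):
    frames = load_frames(10_000)
with timed("features"):
    feature = extract_features(frames)

total = sum(timings.values())
for stage, seconds in timings.items():
    share = 100 * seconds / total if total > 0 else 0.0
    print(f"{stage}: {seconds:.6f} s ({share:.1f}% of total)")
```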
Table 16 shows that classical classifiers typically run faster than deep models, except when a single camera is used. The use of the full radar features also significantly slows down training for the radar-only case for the reasons given in Section 2.4. If the radar features are not collapsed, the computational cost for the radar-only case will be significantly lower than that of the fusion-based approaches. The evaluation times are, however, still similar.
5. Conclusions
This work investigates the effectiveness of combining mm-wave radar with multiview RGB cameras for fine-grained soccer player action recognition in an outdoor environment. By jointly analyzing global Doppler information and CFAR-localized extended-target representations, and integrating these with appearance-, motion-, and pose-based visual features, the proposed framework demonstrates that radar and cameras provide complementary and mutually reinforcing information for human action recognition.
The experimental results indicate that mm-wave radar is effective at capturing velocity-based motion characteristics, while camera data excels at distinguishing actions with subtle kinematic variations, such as dribbling versus normal walking or jogging. Although camera-only models achieve very high recognition performance under favorable conditions, radar–camera fusion consistently produces the most discriminative feature representations and the highest overall classification accuracy. This highlights the value of radar as an auxiliary sensing modality that enhances robustness, particularly in scenarios where visual sensing may be degraded by lighting, occlusion, or environmental factors.
The findings further indicate that even a single co-located radar–camera configuration can yield measurable gains, while combining radar with multiple camera viewpoints offers the strongest overall performance. While this study focused on a controlled set of soccer-specific actions and a limited number of subjects, the proposed framework provides a foundation for future work on more diverse player behaviors, larger-scale deployments, and more challenging environmental conditions.
These findings indicate a gap in current radar feature design for human action recognition, particularly in the development of features that capture articulated motion and temporal dynamics beyond global Doppler statistics.
The ablation results demonstrate that camera-based features consistently outperform radar-only features when evaluated in isolation, reflecting the rich spatial information captured by visual sensors. This performance gap is expected, as human actions are inherently defined by body pose and motion, which are directly observable in video but only indirectly inferred from radar reflections. These findings suggest that current radar feature representations do not yet fully exploit the information contained in range–Doppler measurements, highlighting a gap in radar feature design rather than a fundamental limitation of the sensing modality. Consequently, this work motivates further research into radar features that capture temporal motion patterns. Although radar achieves lower classification accuracy than camera-based modalities when used in isolation and does not yield substantial accuracy gains in fusion under ideal conditions, it provides a critical redundancy mechanism in scenarios where optical sensing is degraded. Camera performance is inherently susceptible to adverse environmental conditions such as poor lighting, motion blur, occlusion, and weather effects, to which FMCW radar operation is largely insensitive. In such conditions, the radar stream can maintain classification capability independently, ensuring system robustness beyond what the ablation results under controlled conditions alone suggest. The radar-only classification results presented in Table 12 consequently represent a minimum performance floor under complete camera failure, with the full multimodal system expected to operate significantly above this bound under normal conditions.
Limitations and Future Work
The dataset currently includes soccer actions performed by only two participants. Although cross-validation was employed and windows from the same recording clip were kept within the same partition to avoid temporal leakage, the limited number of subjects may restrict the generalization of the model to players with different body characteristics, playing styles, and movement patterns. Future work will therefore focus on expanding the dataset to include a larger and more diverse group of participants.
Additionally, the actions considered in this study were performed in a structured and instructed manner to ensure consistent data collection across modalities. In real match scenarios, player movements are more dynamic and unpredictable, and visual occlusions or environmental variations may occur. These factors introduce a domain gap between the controlled experimental setting and real-world deployment. Future work will investigate the application of the proposed multimodal sensing approach in less controlled, real-world match environments.
Further future work includes extending the approach to multiple simultaneous targets and applying it to real footage from sports matches, followed by an analysis of the results.