Article

YED-Net: Yoga Exercise Dynamics Monitoring with YOLOv11-ECA-Enhanced Detection and DeepSORT Tracking

Youyu Zhou, Shu Dong, Hao Sheng and Wei Ke
1 Faculty of Applied Sciences, Macao Polytechnic University, Macao SAR, China
2 Department of Data Science, Beijing Institute of Technology, Zhuhai, Zhuhai 519088, China
3 State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing 100191, China
4 Department of Computer Science and Technology, Zhongfa Aviation Institute of Beihang University, 166 Shuanghongqiao Street, Pingyao Town, Yuhang District, Hangzhou 311115, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7354; https://doi.org/10.3390/app15137354
Submission received: 13 May 2025 / Revised: 14 June 2025 / Accepted: 17 June 2025 / Published: 30 June 2025
(This article belongs to the Special Issue Advances in Image Recognition and Processing Technologies)

Abstract

Against the backdrop of the deep integration of national fitness and sports science, this study addresses the lack of standardized movement assessment in yoga training by proposing an intelligent analysis system that integrates an improved YOLOv11-ECA detector with the DeepSORT tracking algorithm. A dynamic adaptive anchor mechanism and an Efficient Channel Attention (ECA) module are introduced, while the depthwise separable convolution in the C3k2 module is optimized with a kernel size of 2. Furthermore, a Parallel Spatial Attention (PSA) mechanism is incorporated to enhance multi-target feature discrimination. These enhancements enable the model to achieve a high detection accuracy of 98.6% mAP@0.5 while maintaining low computational complexity (2.35 M parameters, 3.11 GFLOPs). Evaluated on the SND Sun Salutation Yoga Dataset released in 2024, the improved model achieves a real-time processing speed of 85.79 frames per second (FPS) on an RTX 3060 platform, with an 18% reduction in computational cost compared to the baseline. Notably, it achieves a 0.9% improvement in AP@0.5 for small targets (<20 px). By integrating the Mars-smallCNN feature extraction network with a Kalman filtering-based trajectory prediction module, the system attains 58.3% Multiple Object Tracking Accuracy (MOTA) and 62.1% Identity F1 Score (IDF1) in dense multi-object scenarios, an improvement of approximately 9.8 percentage points over the conventional YOLO+DeepSORT method. Ablation studies confirm that the ECA module, implemented via lightweight 1D convolution, improves channel attention modeling efficiency by 23% compared to the original SE module and reduces the false detection rate under complex backgrounds by a factor of 1.2. This study presents a complete “detection–tracking–assessment” pipeline for intelligent sports training. Future work aims to integrate 3D pose estimation to develop a closed-loop biomechanical analysis system, thereby advancing sports science toward intelligent decision-making paradigms.

1. Introduction

In the contemporary era of national fitness, there has been a notable increase in the emphasis and integration of physical activity into lifestyle practices, recognized as a pivotal strategy for enhancing physical fitness and promoting mental well-being [1]. Yoga, a form of exercise combining flexibility, strength training, and balance, has gained worldwide popularity for its unique physical and mental conditioning effects. A notable sequence of yoga, Surya Namaskar, comprises a series of 12 asanas [2], which have been shown to effectively enhance core strength, coordination, and whole-body function. This sequence has demonstrated significant benefits in the fields of athletic training and rehabilitation through the synergistic effect of breath and movement. However, in practice, due to a lack of professional guidance, some practitioners tend to have problems such as irregular movements and uneven load distribution, which not only reduce the training effect but also may lead to muscle injury or joint burden. Consequently, the development of an artificial-intelligence-based movement recognition and real-time feedback system is imperative to optimize the yoga training process.
This paper proposes an improved YOLOv11-ECA detection model, which optimizes the feature extraction architecture through the integration of the C3k2 and C2PSA modules and incorporates the Efficient Channel Attention (ECA) mechanism to enhance channel-wise feature dependency. These enhancements significantly improve detection accuracy for small objects and complex backgrounds while maintaining a lightweight model structure. Furthermore, the DeepSORT multi-object tracking algorithm is integrated into the YOLOv11 framework, forming a unified detection-tracking system capable of continuous dynamic analysis and evaluation of Sun Salutation movements. Compared to existing studies, the innovations of this work lie in two main aspects: First, the introduction of the ECA attention mechanism into YOLOv11 strengthens channel modeling capability. Combined with the use of small convolution kernels and a dynamic anchor strategy, the model demonstrates enhanced robustness in detecting small and occluded objects. Second, the integration of a lightweight DeepSORT tracking module enables efficient and stable motion trajectory analysis and continuity recognition in multi-person scenarios. Experiments conducted on the newly released 2024 Yoga Sun Salutation Dataset (SND2024) show that the system achieves a real-time processing speed of 45.7 FPS on an RTX 3060 platform, providing a reliable technical solution for intelligent feedback in exercise training.

2. Related Research

In recent years, the extensive utilization of AI technology in the domain of sports training has substantially accelerated the advancement of movement analysis and real-time feedback technology, particularly in the domain of yoga movement recognition and tracking, which has witnessed remarkable progress. In 2021, Wei et al. [3] substantiated the technical superiority of AI in simulating training scenarios and optimizing exercise performance through a case study. In 2022, Zhang et al. [4] proposed the KELM-MFF algorithm for action recognition through multi-dimensional spatio-temporal feature fusion, achieving a micro-motion recognition rate of 92.4%.
In order to promote the development of AI algorithms in the field of motion analysis, Verma et al. [5] constructed the Yoga-82 dataset in 2020, thus establishing an important data foundation for the field of motion analysis. The dataset employs a three-tiered hierarchical labeling system for the purpose of fine-grained annotation of 82 types of yoga postures, encompassing 6618 high-resolution images. These images are distributed across five primary postures, including standing, sitting, and inverted postures, among others. The dataset also employs a dual annotation strategy of human skeletal keypoints and contour masks, providing substantial data support for subsequent multimodal modeling and recognition tasks. Subsequent to the analysis of this dataset, Yadav et al. [6] constructed HybridNet, a hybrid model consisting of EfficientNet-B4 and DenseNet-121, in 2023. This enhanced the discriminative ability of the model through the feature channel reweighting mechanism, and its Top-1 classification accuracy reached 93.28%. However, the single-frame processing time of the HybridNet model is as high as 3.2 s, which makes it difficult for it to meet real-time requirements.
To address the trade-off between structural lightness and real-time performance, the MediaPipe pose estimation model proposed by Sharma et al. [7] in 2022, combined with MobileNetV3, achieves high frame rate recognition (up to 45 FPS) for the 12 phases of Surya Namaskar. Despite its proficiency in single-target recognition, the system’s accuracy declines by up to 19.7% in multi-target scenarios, suggesting that the single-target paradigm may not readily adapt to intensive multi-person training scenarios. Sim et al. [8] further proposed a real-time workout tracking system based on MediaPipe in 2024, which recognizes exercises such as push-ups and pull-ups on the HSiPu2 dataset through a multilayer perceptron (MLP) with angle-constraint rules, achieving 87.6% accuracy and providing a new paradigm for low-complexity exercise feedback systems.
Multi-object tracking (MOT) remains a key research focus in the field of computer vision. Although recent studies have increasingly explored the use of large-scale models [9] to enhance detection and tracking performance, the Tracking-by-Detection (TBD) paradigm—particularly frameworks based on detectors from the YOLO series [10,11,12,13,14,15,16,17,18,19,20]—continues to dominate due to its superior balance between accuracy and real-time efficiency. The YOLO architecture, known for its unified and lightweight design, has been widely adopted in MOT systems and consistently serves as a strong baseline in real-time perception tasks such as defect detection [21], aerial imagery analysis [22], and intelligent transportation systems [23]. In 2022, Pujara and Bhamare [24] systematically demonstrated the technical advantages of fusing YOLO-series detectors with the DeepSORT tracker proposed by Wojke et al. [25] in 2017 for multi-target tracking; their background suppression algorithm reduced the ID switching rate on the MOT20 dataset [26] by 22.4%, providing theoretical support for complex-scene tracking. In 2024, Jati et al. [27] conducted a comparative analysis of YOLO-NAS and YOLOv7/v8 in a football robotics scenario. The findings revealed that YOLO-NAS attained 91.7% mAP in the goal alignment task but performed suboptimally in dynamic target tracking, underscoring the critical impact of the match between model architecture and task requirements on performance.
With regard to detector structure optimization, Patil et al. [28] proposed a C3-module-optimized feature pyramid structure based on YOLOv7x [16], which achieves 88.6% mAP on a self-built dataset; however, its computational complexity of 2.4 GFLOPs restricts deployment on mobile devices. Khanam and Hussain [29] described the architectural enhancements of the YOLOv11 model, in which the C3k2 module compresses the number of model parameters to 1.8 M and, combined with a dynamic anchor clustering algorithm, improves AP50 on the COCO dataset by 2.3%.
To enhance target detection and tracking with deep learning algorithms, researchers have explored fusing lightweight design with feature enhancement along several directions. For instance, light field technology has been used to strengthen feature expression and improve image processing accuracy [30,31,32]. Zhang et al. [33] combined block labeling with an adaptive saliency mechanism to effectively improve detection accuracy. Ji et al. [34] enhanced the modeling of fine-grained targets through a sophisticated feature pyramid design. Moreover, the incorporation of attention mechanisms, particularly the emergence of lightweight modules, offers a way to control computational resource overhead while preserving model performance.
The attention mechanism was popularized by Google researchers [35] in 2017; it simulates the human capacity to allocate cognitive resources to salient regions through an effective weighting mechanism and has found extensive application in image classification, target detection, and semantic segmentation, among other domains. Spatial attention modules focus on salient locations and enhance feature representation by directing the network to prioritize critical regions. Notable contributions in this domain include the CBAM module proposed by Woo et al. [36] and the BAM structure proposed by Park et al. [37]. Lightweight attention mechanisms, exemplified by SimAM [38] and Shuffle Attention [39], prioritize performance optimization in scenarios where computational resources are limited and have emerged as a pivotal research direction for deployment on edge devices.
In recent years, considerable attention has been directed towards channel attention (CA) mechanisms. The ECA (Efficient Channel Attention) module [40] enhances the network’s feature selection capability, without substantially increasing model complexity, through local cross-channel interactions and parameter-free gating. It has demonstrated robust adaptability and generalization in architectures such as YOLOv5 [14] and YOLOv7 [16].
Table 1 summarizes the aforementioned studies.
To summarize, although existing studies have achieved a degree of success in detecting and tracking multiple targets in dynamic scenarios, two fundamental limitations persist. Firstly, most systems rely on offline analysis and lack real-time feedback functionality. Secondly, prevailing frameworks adapt poorly to yoga-type scenes, which are characterized by occlusion and significant interference among multiple targets. The present study proposes a detection framework based on the improved YOLOv11-ECA model, fused with the DeepSORT tracker, to construct a multi-target detection and tracking system that is both real-time and accurate. The introduction of a dynamic anchor optimization strategy and a lightweight channel attention module raises the MOTA in high-density scenes (target spacing < 15 px) to 58.3% while maintaining a real-time processing speed of 45 FPS, substantiating the efficacy and practicality of the proposed methodology.

3. Model Introduction

The YOLOv11 [20] model employed in this study is built upon the official open-source implementation provided by Ultralytics, utilizing its Python 3.9.0 interface (ultralytics==11.0.0) for model training and inference. The core codebase is derived from its GitHub repository. The model follows a three-stage architecture consisting of a backbone, a Feature Pyramid Network (neck), and a detection head. The conventional C2f module is replaced with the C3k2 module, which incorporates small convolution kernels and cross-stage connections to enhance feature extraction accuracy while maintaining lightweight performance. Innovatively, the model integrates a Parallel Spatial Attention mechanism (C2PSA) to improve adaptability in multi-object scenarios. Additionally, depthwise separable convolutions (DWConv) are deployed in the detection head, significantly reducing the number of parameters while preserving the advantages of multi-scale detection. This results in a synergistic improvement in both detection accuracy and inference speed.
The model configuration in this study is based on the original YOLOv11n architecture, with the integration of the Efficient Channel Attention (ECA) mechanism (Wang et al., 2020 [40]). ECA computes channel interaction weights using an adaptive 1D convolution kernel, effectively avoiding the computational redundancy associated with fully connected layers. This reduces the complexity of channel feature recalibration from O(C^2) to O(C), thereby enhancing the representation of critical information. Figure 1 illustrates the architecture of the improved model. The black box in the C2PSA module highlights the newly added ECA component, which is embedded at the end of the backbone in a plug-and-play manner. This integration enhances the model’s small object detection performance without significantly increasing computational overhead.
As shown in Figure 1, the detection and classification process of the YOLOv11-ECA model begins by feeding input image frames into the backbone network enhanced with the Efficient Channel Attention (ECA) module. This module improves the model’s ability to represent both channel and spatial features, thereby enhancing the perception of key motion regions. The extracted multi-scale features are then fused and passed to the detection head, which outputs detection results, including bounding boxes, class probability distributions, and confidence scores. During classification, the model may predict multiple candidate bounding boxes for each frame, with each box associated with a probability distribution over eight yoga pose categories. The final classification result is determined by filtering out all boxes below a predefined confidence threshold and selecting the box with the highest confidence score as the representative pose for the current frame. The category corresponding to the highest class probability within this box is then assigned as the predicted pose. If no detection box meets the confidence threshold, the system concludes that no pose has been detected. This process ensures that the model accurately determines whether the current image contains a correct yoga pose based on the most reliable prediction.
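For clarity, the frame-level decision rule described above can be sketched as follows. This is an illustrative example rather than the authors’ released code; the detection format (a dict holding a box, per-class probabilities, and a confidence score), the 0.4 threshold, and the class ordering are assumptions.

```python
import numpy as np

# Assumed ordering of the eight Surya Namaskar pose categories
POSE_CLASSES = [
    "pranamasana", "hasta utthanasana", "padahastasana", "ashwa sanchalanasana",
    "kumbhakasana", "ashtanga namaskara", "bhujangasana", "adho mukh svanasana",
]

def classify_frame(detections, conf_threshold=0.4):
    """Pick the pose label for one frame from a list of detections.

    Each detection is assumed to be a dict with keys:
      'box'         : (x1, y1, x2, y2) in pixels
      'class_probs' : length-8 array of per-pose probabilities
      'confidence'  : box confidence score
    Returns (pose_name, detection) or (None, None) if nothing passes the threshold.
    """
    candidates = [d for d in detections if d["confidence"] >= conf_threshold]
    if not candidates:
        return None, None                      # no pose detected in this frame
    best = max(candidates, key=lambda d: d["confidence"])
    pose_idx = int(np.argmax(best["class_probs"]))
    return POSE_CLASSES[pose_idx], best
```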
The C2PSA module of the YOLOv11 model employs the SE (Squeeze-and-Excitation) channel attention mechanism, which enhances the model’s capacity to discern significant features through the adaptive adjustment of channel weights. The specific steps are as follows: firstly, the spatial information of each channel is compressed into a scalar using global average pooling (Squeeze) to obtain the channel description vector z_c. The channel attention coefficients s_c are then generated by a two-layer fully connected network (Excitation). Finally, s_c is multiplied channel-by-channel with the original feature map to achieve feature recalibration.
z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j)
s_c = \sigma\left( W_2 \cdot \delta\left( W_1 \cdot z_c \right) \right)
\tilde{U}_c = s_c \cdot U_c
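As a reference, the three equations above correspond to the following minimal PyTorch sketch of an SE block. The reduction ratio of 16 is a common default assumed here, not a value reported in the paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention, following the equations above."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # Squeeze: global average pooling -> z_c
        self.fc = nn.Sequential(                         # Excitation: two fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # sigma
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = u.shape
        z = self.pool(u).view(b, c)                      # z_c
        s = self.fc(z).view(b, c, 1, 1)                  # s_c
        return u * s                                     # channel-wise recalibration (U~_c)
```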
The SE (Squeeze-and-Excitation) module has been demonstrated to enhance the focus on important channel features at a reduced computational cost [41]. However, it has dual limitations: firstly, the lack of spatial information modeling leads to insufficient capture of local feature correlations; secondly, the dense parameters of the fully connected layers significantly increase computational overhead when processing high-dimensional features [40]. To address these limitations, this study adopts the ECA (Efficient Channel Attention) module [40]. The ECA module enhances the spatial-channel synergistic perception capability of YOLOv11 while preserving the benefits of channel attention, achieved by substituting the fully connected operation with a lightweight 1D convolution.
The flow of the module is as follows: firstly, the channel description vector z_c is obtained using global average pooling. Subsequently, local interaction is performed using a 1D convolution to obtain the channel weight distribution. Finally, the channel weights are multiplied channel-by-channel with the original feature map to recalibrate the features. The formulas for the ECA module are as follows:
z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j)
w_c = \sum_{i=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} \omega_i \cdot z_{c+i}
Y_c(i, j) = w_c \cdot X_c(i, j)
The ECA module is responsible for the calculation of the channel interaction weights through the utilization of an adaptive 1D convolutional kernel. This process successfully avoids the computational redundancy that is typically caused by the fully connected operation. Consequently, this enhances the efficiency of the attention mechanism and ensures optimal performance in real-time target detection. The specific structure of the ECA module is illustrated in Figure 2. The incorporation of ECA within the C2PSA module of YOLOv11 facilitates the optimization of channel weighting through local interactions, thereby ensuring that the model can allocate more precise attention to the target area, particularly in scenarios characterized by complex backgrounds and the detection of diminutive targets. Concurrently, the lightweight design of ECA maintains the real-time performance of YOLOv11, rendering it suitable for environments with limited resources.
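A minimal PyTorch sketch of such an ECA block is given below. The adaptive kernel-size rule and the sigmoid gate follow the original ECA-Net paper [40]; the exact integration point inside the C2PSA module of YOLOv11 is not shown here and would be part of the modified model definition.

```python
import math
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """Efficient Channel Attention: 1D convolution over channel descriptors."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Adaptive kernel size from ECA-Net: k = |(log2(C) + b) / gamma|, forced odd
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        z = self.pool(x).view(n, 1, c)           # channel descriptors z_c
        w = self.sigmoid(self.conv(z))           # local cross-channel interaction -> weights w_c
        return x * w.view(n, c, 1, 1)            # recalibrated features Y_c
```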
YOLOv11 integrates the ECA algorithm into the object detection framework, which has been proven to enhance its performance in detection tasks [42]. In mining experiments, the architecture demonstrated outstanding results even under low illumination and target occlusion scenarios, achieving a mAP@0.5 of 95.8%, significantly outperforming comparative models. The model maintains a high frame rate of 59.6 FPS while ensuring accuracy, highlighting its robustness and practical value under complex conditions.

4. Testing Experiment

4.1. Configuration

The experiments were run on a high-performance workstation equipped with an Intel Core i7-12700H processor, an NVIDIA GeForce RTX 3060 graphics card, 16 GB of DDR5-4800 RAM, and a 512 GB PCIe 4.0 SSD. The graphics card delivers up to 140 W of power, which stably supports large-scale YOLOv11 training and inference, and its dual-fan, five-heat-pipe cooling design ensures stability during long, high-load runs.

4.2. Data Presentation

The model in this study was trained on the open-source Surya Namaskar dataset [43], provided by Roboflow, which contains 2954 high-quality yoga posture images covering eight classical postures with labeled bounding boxes. The images originate from a diverse range of sources, encompassing both indoor yoga practice and natural environments.
The YOLO model imposes specific requirements on the data format; this paper therefore converts and preprocesses the data, chiefly by converting the annotation files to YOLO format. After conversion, each image is paired with a single annotation file in which each line comprises the object category number, center coordinates, and normalized width and height, in the following format:
        [class_id center_x center_y width height]
In this study, class_id denotes the action class number (0–7), corresponding to the eight yoga actions in the dataset. center_x and center_y are the coordinates of the center point of the target detection box, normalized by the width and height of the image, and width and height are the dimensions of the detection box, likewise normalized by the image width and height.
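A small helper of the following kind can produce such label lines. It assumes the source annotations provide absolute pixel corner coordinates, which may differ from the actual Roboflow export settings; the example box and class index are hypothetical.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert an absolute-pixel bounding box to one YOLO-format label line:
    [class_id center_x center_y width height], all coordinates normalized to [0, 1]."""
    center_x = (x_min + x_max) / 2.0 / img_w
    center_y = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {center_x:.6f} {center_y:.6f} {width:.6f} {height:.6f}"

# Example: a box for class 3 in a 1920x1080 frame
print(to_yolo_line(3, 600, 250, 980, 900, 1920, 1080))
# -> "3 0.411458 0.532407 0.197917 0.601852"
```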
Following processing, the dataset is divided into a training set, a validation set and a test set at a ratio of 8:1:1. The training set is used for model parameter optimization and contains diverse yoga action scenes and background data; the validation set is responsible for hyper-parameter tuning and monitoring model performance dynamics; and the test set rigorously tests the model’s actual generalization ability through independent samples. The rigorous process of data preprocessing ensures the standardization of inputs and, by extension, the enhancement of model training accuracy.
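The 8:1:1 partition can be reproduced with a short script such as the one below; the file layout, extension, and fixed random seed are illustrative assumptions rather than details reported in the paper.

```python
import random
from pathlib import Path

def split_dataset(image_dir, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Shuffle image files and split them into train/val/test lists at an 8:1:1 ratio."""
    files = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(files)
    n = len(files)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return {
        "train": files[:n_train],
        "val": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],
    }
```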

4.3. Parameter Selection

The present study is grounded in the YOLOv11 model, which has been utilized for the detection of yoga movements and the optimization of exercise training. The primary hyperparameter configurations employed during the training process are delineated in Table 2.
During the training process, the AdamW optimizer and Automatic Mixed Precision (AMP) were employed to enhance training efficiency and reduce GPU memory consumption. The input image size was set to 640, balancing computational cost and detection accuracy. To mitigate the risk of overfitting in small-sample scenarios, weight decay (weight_decay = 0.0005) and a momentum factor (momentum = 0.937) were introduced as regularization techniques to constrain excessive parameter fluctuations, thereby improving the model’s generalization ability. Additionally, an early stopping mechanism was activated to automatically terminate training when no significant improvement was observed on the validation set over several consecutive epochs, effectively preventing overfitting. The learning rate was decayed according to the following strategy:
\mathrm{lr}(t) = \mathrm{lr}_0 \times \left( 1 - \frac{t}{T} \right)^{\gamma}
In this strategy, lr(t) denotes the learning rate at the current epoch t, lr_0 is the initial learning rate, T represents the total number of epochs, and γ is the decay factor. Experimental results demonstrate that the model exhibits consistent and stable performance on both the training and validation sets, ultimately achieving 98.6% mAP@0.5 on the validation set. This indicates that the overall training strategy effectively balances performance and generalization.
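For reference, the training configuration of Table 2 maps onto the Ultralytics training API roughly as follows. The dataset YAML name, the early-stopping patience value, and the decay exponent γ are assumptions, the baseline weights stand in for the ECA-modified architecture (which would be supplied via a custom model YAML), and the library’s built-in scheduler may differ slightly from the formula above.

```python
from ultralytics import YOLO

# Baseline training call with the Table 2 hyperparameters
model = YOLO("yolo11n.pt")
model.train(
    data="surya_namaskar.yaml",   # hypothetical dataset config (8 classes, train/val/test paths)
    epochs=1000,
    imgsz=640,
    batch=4,
    lr0=0.01,
    optimizer="AdamW",
    weight_decay=0.0005,
    momentum=0.937,
    amp=True,                     # automatic mixed precision
    patience=50,                  # early stopping patience (assumed value)
)

# Decay schedule from the formula above: lr(t) = lr0 * (1 - t/T) ** gamma
def lr_at(t, lr0=0.01, T=1000, gamma=1.0):
    return lr0 * (1.0 - t / T) ** gamma
```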

4.4. Evaluation Framework

To comprehensively evaluate the performance of the YOLOv11-ECA model in object detection and pose classification tasks, this study adopts a multi-dimensional evaluation framework encompassing detection accuracy, computational complexity, and multi-object tracking stability. In terms of detection performance, key evaluation metrics include the precision–recall (PR) curve, F1 curve, and mAP@0.5. The PR curve illustrates the trade-off between detection precision and recall at varying confidence thresholds. For each class, the model outputs predicted bounding boxes and associated confidence scores. By adjusting the confidence threshold, different pairs of precision and recall can be computed to generate the PR curve. The area under this curve represents the Average Precision (AP) for the corresponding class, which quantitatively reflects the overall detection performance of that class. The F1 curve captures the trend of F1 scores across different thresholds, indicating the model’s balanced performance between precision and recall. By averaging the AP values across all classes, the mean Average Precision at an Intersection over Union (IoU) threshold of 0.5 (mAP@0.5) is computed and used as the principal metric to assess overall detection accuracy.
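As a concrete illustration, per-class AP can be computed from ranked detections as the area under the PR curve. The sketch below uses simple trapezoidal integration, whereas common toolkits use interpolated precision (e.g., the 101-point COCO scheme), so values may differ slightly; mAP@0.5 is then the mean of the per-class APs.

```python
import numpy as np

def average_precision(confidences, is_true_positive, num_gt):
    """Compute AP for one class as the area under the precision-recall curve.

    confidences      : per-detection confidence scores
    is_true_positive : 1 if the detection matches a ground-truth box at IoU >= 0.5, else 0
    num_gt           : number of ground-truth boxes for this class
    """
    order = np.argsort(-np.asarray(confidences))
    tp = np.cumsum(np.asarray(is_true_positive)[order])
    fp = np.cumsum(1 - np.asarray(is_true_positive)[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # integrate precision over recall (trapezoidal approximation of the PR curve)
    return float(np.trapz(precision, recall))
```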
For evaluating the model’s computational complexity and multi-object tracking performance, this study employs metrics such as FLOPs (Floating Point Operations), MOTA (Multiple Object Tracking Accuracy), and IDF1 (ID F1 Score). FLOPs quantify the number of floating-point operations required during a single forward inference pass. A lower FLOPs value indicates a more lightweight model, making it more suitable for real-time inference scenarios. MOTA provides an overall measure of tracking accuracy by accounting for missed detections, false positives, and identity switches. IDF1 evaluates the consistency of identity preservation by computing the F1 score based on correct target ID matches, thereby reflecting the quality and robustness of the tracking process.
\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t}
where FN_t is the number of false negatives at frame t, FP_t is the number of false positives, IDSW_t denotes ID switches, and GT_t is the number of ground-truth objects.
\mathrm{IDF1} = \frac{2 \cdot \mathrm{IDTP}}{2 \cdot \mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}
where IDTP is the number of correctly identified detections, IDFP is the number of false positive identities, and IDFN is the number of missed ground-truth identities.
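Both metrics can be computed directly from per-frame counts, as in the following sketch; the counts in the usage example are hypothetical.

```python
def mota(frames):
    """frames: iterable of dicts with per-frame counts 'fn', 'fp', 'idsw', 'gt'."""
    errors = sum(f["fn"] + f["fp"] + f["idsw"] for f in frames)
    gt_total = sum(f["gt"] for f in frames)
    return 1.0 - errors / gt_total if gt_total else 0.0

def idf1(idtp, idfp, idfn):
    """Identity F1: harmonic mean of identity precision and identity recall."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0

# Toy example with hypothetical counts:
print(mota([{"fn": 3, "fp": 2, "idsw": 1, "gt": 20},
            {"fn": 1, "fp": 0, "idsw": 0, "gt": 20}]))   # 0.825
print(idf1(idtp=150, idfp=40, idfn=60))                   # 0.75
```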
Based on the aforementioned evaluation metrics, this study conducts a comprehensive assessment of the proposed model from multiple perspectives, including detection accuracy, recall capability, inference efficiency, and multi-object tracking stability. This systematic evaluation ensures the completeness and reliability of the experimental results.

4.5. Experimental Comparisons

In this section, we perform a comprehensive comparison between the original YOLOv11 model and the improved YOLOv11-ECA model, focusing on the detection performance on the confusion matrix, normalized confusion matrix, PR curve, F1 curve, and validation set. Through detailed analyses, we find that the introduction of the ECA module significantly improves the performance of the model, especially in the detection of several key categories.
Firstly, the comparison between the confusion matrix and the normalized confusion matrix shows that the improved model achieves significant improvements in most categories. For example, in the ‘ashwa namaskara’ and ‘bhujangasana’ categories, the number of correct classifications improves from 17 and 28 to 19 and 29, respectively, and the number of misclassified samples decreases by two and one, respectively. In addition, the normalized confusion matrix for the ‘ashwa namaskara’ category shows an improvement in its correct classification rate from 89% to 100% and a significant reduction in misclassification, indicating that the improved model reduces the number of samples that would otherwise be misclassified into other categories.
In the confusion matrix plot, we show the comparison of the confusion matrix between the original model and the improved model on the validation set. Figure 3 shows the classification results for each category, visualizing the improvement in the correct classification rate.
Comparing the PR curves, the improved model improves by 0.5% on mAP@0.5, from 98.3% for the original model to 98.8%. Specifically, the accuracy of the ‘adho mukh svanasana’ category improves from 96.4% to 99.3%, and the ‘ashwa namaskara’ category also improves from 97.4% to 98.5%. Although the precision rate for the ‘kumbhakasana’ category slightly decreased, the precision rates for most categories are more consistent in the high-recall region, indicating that the improved model’s ability to capture detailed features has been enhanced. Figure 4 shows the performance of the original model and the improved model on the PR curves, where it is clear that the improved model achieves a better balance for most categories.
When comparing the F1 curves, both the original and improved models achieve a highest F1 score of 0.96, but the confidence threshold of the improved model is increased from 0.385 to 0.416, indicating that the model can maintain the same F1 score at a higher confidence level. The improved model is smoother in the high-confidence interval (confidence > 0.8) with less fluctuation in F1 scores, further demonstrating the ECA module’s effectiveness in optimizing the model for determining positive and negative samples. Figure 5 shows the performance of the original model and the improved model on the F1 curve, with the improved model maintaining a higher F1 score across various confidence levels.
Finally, a comparison of the detection performance of the validation set shows that the overall performance of the improved model is significantly improved. The overall mAP@0.5 of the original model is 97.7%, and the improved model improves to 98.6%. In particular, for the categories ‘Pranamasana’ and ‘Adho Mukh Svanasana’, mAP@0.5 improves from 0.977 to 0.995 and from 0.964 to 0.993, respectively, with high levels of precision and recall. However, the performance of the ’hasta utthanasana’ and ’kumbhakasana’ categories slightly decreases from 0.987 to 0.983 and from 0.991 to 0.968, respectively.
Table 3 shows the comparison of the detection performance between the original model and the improved model on the validation set. From the table, it can be seen that the improved model achieves significant improvement in most of the categories, especially in the categories of ‘adho mukh svanasana’ and ‘pranamasana’ where mAP@0.5 improves from 0.964 and 0.977 to 0.993 and 0.995 respectively, indicating that the improved model significantly improves the detection ability in these categories. Meanwhile, the ‘ashwa sanchalanasana’ and ‘bhujangasana’ categories also show a more stable performance improvement, further validating the effectiveness of the ECA module.
According to the results in Table 4, the performance of each detection model on the validation set varies. From the table, it can be seen that YOLOv11-ECA performs well on the precision, recall, and mAP@0.5 metrics, in particular outperforming the other models in recall and mAP@0.5, which demonstrates its strong capability in the detection task. In contrast, the Transformer and YOLO series models also perform well, especially YOLOv6 and YOLOv10, which are close to 0.96 in precision and recall, but their overall performance is slightly inferior to that of YOLOv11-ECA.
It is important to note that the selection of baseline models in this study was primarily based on considerations of model lightweightness and practical application scenarios. Given that the task focuses on action posture classification based on camera input, which demands high detection speed and real-time performance, the YOLO series was selected as the main comparative framework. In addition, a Transformer-based model with relatively higher performance but greater computational complexity was introduced as a reference. Pixel-level segmentation models such as SAM2 were not included, as they are primarily designed for high-precision segmentation tasks, entail substantial computational costs, and are not well-suited for the requirements of this study. Considering detection efficiency, inference speed, and the alignment with the task objectives, the chosen comparative models are representative and appropriate, effectively validating the performance advantages of the proposed YOLOv11-ECA model.
In summary, by introducing the ECA module, the improved model achieves significant performance gains in several categories, especially in the three key metrics of precision, recall and mAP@0.5. Although there is a slight decrease in performance in some categories, the overall performance of the improved model shows higher accuracy and stability in most categories. This makes the improved model more suitable for real-world application scenarios, especially in dynamic environments where highly accurate recognition is required.
Figure 6 shows some of the recognition results of the YOLOv11-ECA model on the validation set. The model is able to accurately identify multiple targets and performs well in distinguishing between different action categories. At the same time, the model maintains high detection confidence in complex backgrounds. For example, most of the detection results shown in the figure have a confidence level higher than 0.8.

4.6. Ablation Experiment

Table 5 shows the results of the ablation experiments, which analyze the impact of the different improvements on model performance. In experimental group 1, the original YOLOv11 (including the SE module) achieves a mAP of 0.977, an FPS of 85.99, a parameter count of 2.59 M, and 3.21 GFLOPs, showing high detection performance and inference speed; however, the higher parameter count and computational volume reflect the redundancy of the SE module in feature extraction, leaving room for further optimization. In experimental group 2, after replacing the SE module with the more efficient ECA module, the mAP increases to 0.986, the parameter count is reduced to 2.35 M, and the FLOPs are reduced to 3.11 G, while the FPS is only slightly reduced to 85.79. This shows that the ECA module outperforms the SE module in feature channel enhancement and can significantly improve detection accuracy while reducing model complexity, making it better suited to yoga pose detection.
In experimental group 3, the removal of the Parallel Spatial Attention (PSA) module results in a mAP of 0.979, which is 0.002 higher than that of experimental group 1. Meanwhile, the FPS increases to 89.88, the number of parameters is reduced to 2.34 M, and the FLOPs are the same as those of experimental group 2. The slight improvement in detection accuracy despite the removal of the PSA module suggests that the channel attention mechanism plays the central role in feature extraction after the introduction of the ECA module, while the spatial attention contribution of PSA is relatively limited. In addition, the significant improvement in FPS further confirms that removing the PSA module noticeably improves inference efficiency. After removing all attention mechanisms in experimental group 4, the mAP falls back to 0.977 and the FPS decreases to 70.41, indicating that channel attention is a key factor in improving detection accuracy, while the removal of spatial attention plays the main role in optimizing inference efficiency.
The comprehensive analysis shows that the introduction of the ECA module is the key point to improve the performance of YOLOv11 in the yoga pose detection task. Compared to the SE module, the ECA module achieves higher feature enhancement through a lighter design, resulting in an increase in mAP to 0.986, while reducing the number of parameters and computation, demonstrating an excellent balance between efficiency and performance. In addition, the performance advantage of the ECA module can still be fully demonstrated after removing the PSA module, demonstrating the critical role of channel attention in feature extraction. The improved YOLOv11 achieves higher accuracy and inference efficiency in the yoga motion detection task, providing reliable technical support for subsequent research and application of related tasks.

5. Tracking Experiment

In this paper, a DeepSORT tracker based on the YOLOv11-ECA detector is added to implement a multi-target tracking system. The detector improves small target detection accuracy through a parameter configuration of 640 × 640 input resolution, a 0.4 confidence threshold, and a 0.5 NMS IoU threshold, combined with a dynamic adaptive anchor scheme. The tracker uses Kalman filtering with the Mars-smallCNN feature extraction model (128-dimensional features) and sets the maximum track survival time to 30 frames (max_age) and the track confirmation threshold to 5 frames (n_init) to cope with occlusion. The experiments were run on an NVIDIA RTX 3060 GPU, and system performance was verified using a 1920 × 1080 resolution yoga action dataset. The system achieves 58.3% MOTA and 62.1% IDF1 in the test, with a bounding box positioning error (MOTP) of 0.213 and a real-time processing speed of 45.7 FPS. As shown in Figure 7, the target ID and bounding box remain stable in regular scenes, but MOTA drops by 12% in extremely dense regions (target spacing < 10 pixels). Experiments show that the dynamic anchor design improves the IDF1 of targets spaced 20 px apart by more than 60%, while the frequency of ID switching due to occlusion increases by a factor of 2.3 in dense regions. In the future, we propose to optimize the feature matching module through a spatio-temporal attention mechanism to improve the handling of complex occlusion.
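One possible wiring of this detector–tracker pipeline is sketched below. It assumes the deep_sort_realtime package and the Ultralytics inference API; the weight file, video path, and default appearance embedder are placeholders (the paper’s tracker uses the Mars-smallCNN appearance network, which is not bundled with this package).

```python
import cv2
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort

detector = YOLO("yolov11_eca.pt")                 # hypothetical path to the trained detector
tracker = DeepSort(max_age=30, n_init=5)          # track survival / confirmation settings from the text

cap = cv2.VideoCapture("yoga_session.mp4")        # hypothetical 1920x1080 input video
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Detection: 640x640 input, 0.4 confidence threshold, 0.5 NMS IoU threshold
    result = detector.predict(frame, imgsz=640, conf=0.4, iou=0.5, verbose=False)[0]
    detections = []
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        detections.append(([x1, y1, x2 - x1, y2 - y1], float(box.conf), int(box.cls)))
    # Tracking: Kalman prediction + appearance matching, IDs persist across frames
    for track in tracker.update_tracks(detections, frame=frame):
        if track.is_confirmed():
            l, t, r, b = track.to_ltrb()
            cv2.rectangle(frame, (int(l), int(t)), (int(r), int(b)), (0, 255, 0), 2)
            cv2.putText(frame, f"ID {track.track_id}", (int(l), int(t) - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
cap.release()
```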

6. Summary

In this paper, a yoga motion analysis system based on the improved YOLOv11-ECA model and the DeepSORT tracker is proposed to meet the demand for accurate motion detection and optimization in sports training. By introducing the Efficient Channel Attention (ECA) mechanism in place of the traditional SE module, the computational overhead is reduced by 18% while maintaining 98.6% detection accuracy, and the AP50 of small target detection is improved by 0.9% when combined with the dynamic anchor adjustment strategy. Depthwise separable convolution optimization compresses the number of model parameters to 2.35 M and achieves 85.79 FPS real-time processing on the RTX 3060 platform, laying the foundation for real-time monitoring of complex scenes. The system integrates the DeepSORT multi-target tracking algorithm, which addresses target occlusion in continuous actions via Kalman filtering and a lightweight feature extraction network (Mars-smallCNN). Experiments show that this method achieves a MOTA of 58.3% on the SND2024 dataset, significantly improving the accuracy of coherence assessment for the Sun Salutation sequence.
This study provides a lightweight solution for intelligent sports training systems through the synergistic “detection–tracking” architecture. In the future, we intend to integrate a 3D posture estimation module to build a complete “real-time monitoring–biomechanical analysis–personalized feedback” pipeline, promoting the development of sports science from data-driven analysis toward intelligent decision-making.

Author Contributions

Conceptualization, Y.Z., H.S. and W.K.; methodology, Y.Z., S.D., H.S. and W.K.; software, H.S. and W.K.; validation, S.D.; investigation, Y.Z.; resources, W.K. and H.S.; data curation, S.D.; writing—original draft preparation, Y.Z., S.D., H.S. and W.K.; writing—review and editing, Y.Z., S.D., H.S. and W.K.; visualization, S.D., H.S. and W.K.; supervision, H.S. and W.K.; project administration, H.S. and W.K.; funding acquisition, H.S. and W.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Macao Polytechnic University (File No. fca.bc72.a47b.8).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are contained within the article.

Acknowledgments

The authors gratefully acknowledge the useful comments of the reviewers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ross, A.; Thomas, S. The Health Benefits of Yoga and Exercise: A Review of Comparison Studies. J. Altern. Complement. Med. 2010, 16, 3–12. [Google Scholar] [CrossRef] [PubMed]
  2. Prasanna Venkatesh, L.; Vandhana, S. Insights on Surya Namaskar from Its Origin to Application Towards Health. J. Ayurveda Integr. Med. 2022, 13, 100530. [Google Scholar] [CrossRef]
  3. Wei, S.; Huang, P.; Li, R.; Liu, Z.; Zou, Y. Exploring the Application of Artificial Intelligence in Sports Training: A Case Study Approach. Complexity 2021, 2021, 4658937. [Google Scholar] [CrossRef]
  4. Zhang, L. Applying Deep Learning-Based Human Motion Recognition System in Sports Competition. Front. Neurorobot. 2022, 16, 860981. [Google Scholar] [CrossRef]
  5. Verma, M.; Kumawat, S.; Nakashima, Y.; Raman, S. Yoga-82: A New Dataset for Fine-grained Classification of Human Poses. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 4472–4479. Available online: https://api.semanticscholar.org/CorpusID:216056215 (accessed on 13 June 2025).
  6. Yadav, S.K.; Shukla, A.; Tiwari, K.; Pandey, H.M.; Akbar, S.A. An Efficient Deep Convolutional Neural Network Model For Yoga Pose Recognition Using Single Images. arXiv 2023, arXiv:2306.15768. https://arxiv.org/abs/2306.15768. [Google Scholar]
  7. Sharma, A.; Sharma, P.; Pincha, D.; Jain, P. Surya Namaskar: Real-time Advanced Yoga Pose Recognition and Correction for Smart Healthcare. arXiv 2022, arXiv:2209.02492. https://arxiv.org/abs/2209.02492. [Google Scholar]
  8. Sim, K.S.; Wong, S.W.; Low, A.; Yunus, A.P.; Lim, C.P. Real-Time Digital Assistance for Exercise: Exercise Tracking System with MediaPipe Angle Directive Rules. J. Vis. Comput. 2024, 8, 2993–3005. [Google Scholar] [CrossRef]
  9. Ravi, N.; Gabeur, V.; Hu, Y.-T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. Sam 2: Segment anything in images and videos. arXiv 2024, arXiv:2408.00714. [Google Scholar] [CrossRef]
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  11. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
  12. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. https://arxiv.org/abs/1804.02767. [Google Scholar]
  13. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. https://arxiv.org/abs/2004.10934. [Google Scholar]
  14. Jocher, G. Ultralytics YOLOv5, Version 7.0; Zenodo: Geneva, Switzerland, 2020. [Google Scholar] [CrossRef]
  15. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. https://arxiv.org/abs/2209.02976. [Google Scholar]
  16. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  17. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8, Version 8.0.0.; GitHub, Inc.: San Francisco, CA, USA, 2023; Available online: https://github.com/ultralytics/ultralytics (accessed on 13 June 2025).
  18. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. https://arxiv.org/abs/2402.13616. [Google Scholar]
  19. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. https://arxiv.org/abs/2405.14458. [Google Scholar]
  20. Jocher, G.; Qiu, J. Ultralytics YOLO11, Version 11.0.0; GitHub, Inc.: San Francisco, CA, USA, 2024; Available online: https://github.com/ultralytics/ultralytics (accessed on 13 November 2024).
  21. Bao, J.; Yuan, X.; Wu, Q.; Lam, C.; Ke, W.; Li, P. CCA-YOLO: Channel and Coordinate Aware-Based YOLO for Photovoltaic Cell Defect Detection in Electroluminescence Images. IEEE Trans. Instrum. Meas. 2025, 74, 3541805. [Google Scholar] [CrossRef]
  22. Wang, L.; Law, K.E.; Lam, C.T.; Ng, B.; Ke, W.; Im, M. Automatic Lane Discovery and Traffic Congestion Detection in a Real-Time Multi-Vehicle Tracking Systems. IEEE Access 2024, 12, 161468–161479. [Google Scholar] [CrossRef]
  23. Yuan, Q.; Huang, G.; Zhong, G.; Yuan, X.; Tan, Z.; Lu, Z.; Pun, C.M. Triangular Chain Closed-Loop Detection Network for Dense Pedestrian Detection. IEEE Trans. Instrum. Meas. 2024, 73, 1–14. [Google Scholar] [CrossRef]
  24. Pujara, A.; Bhamare, M. DeepSORT: Real-Time & Multi-Object Detection and Tracking with YOLO and TensorFlow. In Proceedings of the 2022 International Conference on Augmented Intelligence and Sustainable Systems, Chennai, India, 24–25 November 2022; pp. 456–460. [Google Scholar] [CrossRef]
  25. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar] [CrossRef]
  26. Dendorfer, P.; Rezatofighi, H.; Milan, A.; Shi, J.; Cremers, D.; Reid, I.; Roth, S.; Schindler, K.; Leal-Taixé, L. MOT20: A Benchmark for Multi Object Tracking in Crowded Scenes. arXiv 2020, arXiv:2003.09003. https://arxiv.org/abs/2003.09003. [Google Scholar]
  27. Jati, H.; Ilyasa, N.A.; Dominic, D.D. Enhancing Humanoid Robot Soccer Ball Tracking, Goal Alignment, and Robot Avoidance Using YOLO-NAS. J. Robot. Control 2024, 5, 21839–21851. [Google Scholar] [CrossRef]
  28. Patil, P.S. Yoga Pose Estimation Using YOLO Model. Int. J. Sci. Res. Eng. Manag. 2023, 7, 1011–1017. [Google Scholar] [CrossRef]
  29. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. https://arxiv.org/abs/2410.17725. [Google Scholar]
  30. Cong, R.; Sheng, H.; Yang, D.; Cui, Z.; Wang, S. End-to-End Semantic Segmentation Utilizing Multi-Scale Baseline Light Field. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 5790–5804. [Google Scholar] [CrossRef]
  31. Cong, R.; Sheng, H.; Yang, D.; Cui, Z.; Wang, S. Multimodal Perception Integrating Point Cloud and Light Field for Ship Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2024, 25, 12477–12489. [Google Scholar] [CrossRef]
  32. Cong, R.; Sheng, H.; Yang, D.; Cui, Z.; Chen, R.; Yang, Z. Pseudo 5D Hyperspectral Light Field for Image Semantic Segmentation. Inf. Fusion 2025, 121, 103042. [Google Scholar] [CrossRef]
  33. Zhang, Q.; Yuan, X.; Liu, T.; Lam, C.-T.; Huang, G.; Lin, D.; Li, P. Tampering Localization and Self-Recovery Using Block Labeling and Adaptive Significance. Expert Syst. Appl. 2023, 226, 120228. [Google Scholar] [CrossRef]
  34. Ji, S.; Yuan, X.; Bao, J.; Liu, T.; Lian, Y.; Huang, G.; Zhong, G. PFPS: Polymerized Feature Panoptic Segmentation Based on Fully Convolutional Networks. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 9, 2584–2596. [Google Scholar] [CrossRef]
  35. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. https://arxiv.org/abs/1706.03762. [Google Scholar]
  36. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. https://arxiv.org/abs/1807.06521. [Google Scholar]
  37. Park, J.; Woo, S.; Lee, J.-Y.; Kweon, I.S. BAM: Bottleneck Attention Module. arXiv 2018, arXiv:1807.06514. https://arxiv.org/abs/1807.06514. [Google Scholar]
  38. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. Proc. Int. Conf. Mach. Learn. (ICML) 2021, 139, 11863–11874. Available online: https://proceedings.mlr.press/v139/yang21o.html (accessed on 13 June 2025).
  39. Zhang, Q.-L.; Yang, Y.-B. SA-Net: Shuffle Attention for Deep Convolutional Neural Networks. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2235–2239. [Google Scholar] [CrossRef]
  40. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Conference, 14–19 June 2020; pp. 11531–11539. [Google Scholar] [CrossRef]
  41. Zhang, H.; Zu, K.; Lu, J.; Zou, Y.; Meng, D. EPSANet: An Efficient Pyramid Squeeze Attention Block on Convolutional Neural Network. arXiv 2021, arXiv:2105.14447. https://arxiv.org/abs/2105.14447. [Google Scholar]
  42. Li, Y.; Yan, H.; Li, D.; Wang, H. Robust Miner Detection in Challenging Underground Environments: An Improved YOLOv11 Approach. Appl. Sci. 2024, 14, 11700. [Google Scholar] [CrossRef]
  43. Lalitha. Surya Namaskar Dataset (SND2024); Roboflow, Inc.: Des Moines, IA, USA, 2024; Available online: https://universe.roboflow.com/lalitha-uruu5/surya-namaskar (accessed on 15 December 2024).
Figure 1. YOLOv11+ECA model structure.
Figure 2. ECA module structure.
Figure 3. Normalized confusion matrices of the original and improved models. (a) Original model normalized confusion matrix. (b) Improved model normalized confusion matrix.
Figure 4. PR curves of the original and improved models. (a) Original model PR curve. (b) Improved model PR curve.
Figure 5. F1 scores of the original and improved models. (a) Original model F1 scores. (b) Improved model F1 scores.
Figure 6. Visualization of YOLOv11-ECA detection performance on the validation set. (a) YOLOv11-ECA validation set annotation example. (b) YOLOv11-ECA validation set prediction results.
Figure 7. Tracking effect.
Table 1. Overview of the relevant studies.
Studies | Dataset | Model
Yoga-82 Data Presentation & Classification [5] | Yoga-82 | None
Yoga Pose Classification (YPose) [6] | Yoga-82 | EN-DNet 1
Suryanamaskar Pose Detection [7] | Open-source | CNN
Real-time Workout Tracking [8] | HSiPu2 | MediaPipe+MLP
Yoga Pose Detection [28] | Self-built | YOLOv7
YOLOv11 Architecture Optimization [29] | Open-source | YOLOv11
Photovoltaic Defect Detection [21] | I-images 2 | CCA-YOLO
Soccer Robot Tracking [27] | Self-built | YOLO-NAS
Multi-Target Tracking [24] | MOT20 | YOLO+DeepSORT
1 EfficientNet+DenseNet; 2 Industrial images.
Table 2. Hyperparameters for YOLOv11 training.
Parameter | Value | Clarification
Batch | 4 | Number of samples per training batch
Imgsz | 640 | Size of the training image
Epochs | 1000 | Total training rounds
Lr0 | 0.01 | Initial learning rate
Optimizer | AdamW | Optimizer selection
Weight_decay | 0.0005 | Prevents overfitting
Momentum | 0.937 | Momentum factor
Amp | True | Enable automatic mixed precision
Table 3. Comparison of detection performance between the original model and the improved model on the validation set.
Classification | Original Model (Precision / Recall / mAP@0.5) | Improved Model (Precision / Recall / mAP@0.5)
adho mukh svanasana | 0.919 / 1.000 / 0.964 | 0.918 / 0.993 / 0.993
ashtanga namaskara | 0.947 / 0.942 / 0.956 | 0.885 / 1.000 / 0.985
ashwa sanchalanasana | 0.986 / 0.958 / 0.979 | 0.965 / 0.972 / 0.983
bhujangasana | 0.935 / 0.967 / 0.982 | 0.968 / 1.000 / 0.995
hasta utthanasana | 0.976 / 0.976 / 0.987 | 0.976 / 0.972 / 0.983
kumbhakasana | 0.956 / 0.992 / 0.991 | 0.857 / 0.955 / 0.968
padahastasana | 0.961 / 0.989 / 0.980 | 0.949 / 1.000 / 0.980
pranamasana | 0.966 / 1.000 / 0.977 | 0.966 / 1.000 / 0.995
Overall performance | 0.960 / 0.978 / 0.977 | 0.935 / 0.987 / 0.986
Table 4. Comparison of overall performance of each detection model on the validation set.
Model | Precision | Recall | mAP@0.5
CNN | 0.778 | 0.689 | 0.780
Transformer | 0.893 | 0.927 | 0.934
YOLOv6 | 0.958 | 0.963 | 0.968
YOLOv10 | 0.955 | 0.961 | 0.971
YOLOv11 | 0.960 | 0.978 | 0.977
YOLOv11-ECA | 0.935 | 0.987 | 0.986
Table 5. Ablation experiment results.
Experimental Group | Model Structure | mAP | FPS | Params / FLOPs
Group 1 | Original YOLOv11 (SE module) | 0.977 | 85.99 | 2.59 M / 3.21 G
Group 2 | Improved YOLOv11 (ECA module) | 0.986 | 85.79 | 2.35 M / 3.11 G
Group 3 | PSA module removed | 0.979 | 89.88 | 2.34 M / 3.11 G
Group 4 | Spatial and channel attention removed | 0.977 | 70.41 | 2.59 M / 3.11 G