1. Introduction
With the continued evolution of digital capture, video storage, and computer vision technologies, video analysis has gradually developed from a traditional post-match replay tool into an important technical foundation for performance analysis in modern competitive sport. The key value of performance analysis lies not in the simple recording of match events but in constructing sport-specific performance indicators that are interpretable, outcome-relevant, and useful for match interpretation and training decision making [
1]. In high-performance sporting environments, full-match video provision, profiling, performance reports, and visual feedback have become routine components of the workflow through which analysts support coaching staff, and both coaches and analysts have a strong demand for timely feedback and contextualized interpretation [
2,
3]. Empirical studies and integrative reviews have also shown that video feedback can facilitate motor learning, technical understanding, and reflective processes, although its effectiveness depends on feedback timing, implementation environment, and coach–athlete interaction [
4,
5]. Therefore, video analysis has become deeply embedded in pre-match preparation, in-game communication, post-match review, and subsequent training adjustment.
In basketball, video analysis is particularly important because tactical execution depends on rapid offensive–defensive transitions, multi-player coordination, and context-dependent decision making. Previous studies have shown that indicators such as assists, defensive rebounds, and successful field goals can distinguish between winning and losing performances in professional basketball, although these indicators should be interpreted in light of competition stage and game context [
6]. However, recent reviews have pointed out that basketball analysis still relies heavily on box-score or aggregated statistical indicators, with insufficient use of temporal relationships, tactical coordination, and contextual information embedded in raw game video [
7]. In training contexts, the combination of tactical tasks, video recording, and technical–tactical parameter analysis can help reveal differentiated features of athletes’ performance, perceived exertion, and mental load [
8]. Thus, finer-grained and more automated extraction of tactical segments from basketball game videos may provide practical support for more targeted training design and match review.
From the perspective of coaching practice, video feedback is not merely the viewing of match footage but a form of intervention that integrates information presentation, reflection facilitation, and training transfer. In basketball, where game tempo is fast and offensive–defensive transitions occur frequently, manual clipping can retain advantages in tactical interpretation and quality control, but it requires analysts to identify, select, and organize events based on a professional understanding of the game [
3,
9]. Time pressure has, therefore, become a key factor limiting the efficiency of performance analysis and feedback implementation [
2]. Improving the automation of tactical segment extraction is not simply a matter of replacing manual work but a response to the practical need for both rapid and interpretable feedback in high-level coaching environments.
The development of deep learning has provided new technical possibilities for automatic sports video recognition and automated clipping. Compared with general visual scenes, action recognition in adversarial team sports such as football and basketball is more challenging because actions occur at high speed, multi-agent interactions are dense, occlusion is frequent, and models must often process individual actions, multi-person interactions, and game-context information simultaneously [
10]. Reviews of computer vision in sport have shown that automated methods have been applied to player and ball detection, object tracking, action recognition, event classification, highlight extraction, automated annotation, and tactical analysis [
11,
12]. In basketball specifically, multi-person event detection has become an important benchmark scenario. For example, Ramanathan et al. constructed a basketball event dataset containing 11 event classes and approximately 14,000 densely annotated temporal instances, demonstrating that attention-based models can identify both events and key actors in complex multi-person videos [
13]. These findings suggest that basketball tactical clipping is not a conventional video editing task but a composite intelligent analysis task, involving object detection, action recognition, temporal event localization, and clip generation.
In terms of methodological foundations, deep learning has established a relatively clear framework for spatiotemporal modeling. The Long-term Recurrent Convolutional Network proposed by Donahue et al. demonstrated that the integration of convolutional neural networks and long short-term memory networks can process variable-length video inputs and learn temporal dynamics [
14]. Feichtenhofer et al. further proposed the SlowFast network, which models spatial semantics through a Slow pathway and captures fine-grained motion information through a Fast pathway, thereby improving performance in video action recognition and detection [
15]. For basketball tactical recognition, which requires the simultaneous processing of player positions, movement trajectories, interaction sequences, and event boundaries, the paradigm of spatial feature extraction plus temporal sequence modeling provides a feasible technical basis. Accordingly, combining an object detection module with an LSTM-based temporal model may support the identification of tactical events such as dribble hand-offs, pick-and-rolls, and horns offense. However, because such components are already established in the broader computer vision literature, their applied value in basketball tactical clipping should be examined through event-level evaluation and comparison with a simpler technical baseline.
In addition to recognition performance, the practical deployment of artificial intelligence in sports science also requires attention to interpretability, transparency, and ethical use. Recent reviews have emphasized that explainable artificial intelligence can improve the understandability and trustworthiness of sports analytics systems, particularly when automated outputs are used to support expert decision making [
16]. Ethical discussions of artificial intelligence in sport have also highlighted the need to ensure responsible implementation, human oversight, and appropriate interpretation of automated analytical outputs [
17]. These considerations are particularly relevant to the present study, because automatically generated clips are intended to support, rather than replace, expert coaching judgment. Therefore, an automated tactical clipping workflow should be evaluated not only by its processing speed but also by its recognition reliability, error patterns, practical usability, and need for expert review.
Nevertheless, the practical application of automated basketball tactical clipping remains limited by several factors. Basketball tactics involve severe occlusion, frequent changes in player spacing, diverse tactical variants, rapid transitions, and context-dependent event structures, all of which increase the difficulty of robust recognition and clip generation [
12]. Moreover, the evaluation of an automated clipping workflow should not be restricted to processing time alone. It should also consider event-level recognition performance, false-positive and false-negative patterns, baseline comparison, and the practical usability of generated clips for coaches. In coaching-oriented tactical review, extracted clips often need to preserve sufficient contextual information to support interpretation, which differs from the strict localization of minimal isolated action intervals. Existing studies have demonstrated the feasibility of object detection, action recognition, and event localization, but fewer studies have examined whether an automated tactical clipping workflow can reduce manual workload while also providing transparent event-level recognition evidence and a realistic assessment of coaching usability. Therefore, the present study focuses on the applied validation of an automated tactical clip extraction workflow rather than the proposal of a new deep learning architecture.
The purpose of this study was to develop and evaluate an application-oriented framework for automatically extracting three common offensive tactical clips from basketball game videos: dribble hand-off, pick-and-roll, and horns offense. Specifically, this study addressed the following research questions: (1) Can the automated workflow reduce the time required for tactical clip extraction compared with manual clipping? (2) What is the event-level tactical recognition performance of the full YOLOv8–LSTM workflow in terms of true positives, false positives, false negatives, precision, recall, and F1-score? (3) Does the full YOLOv8–LSTM workflow outperform a YOLOv8 plus rule-based baseline without LSTM temporal modeling? (4) How does the coach-rated quality of automatically generated clips compare with manually generated clips? (5) What limitations emerge when applying automated tactical recognition to real-game basketball videos? Based on these objectives, three working expectations were formulated: H1, the automated workflow would substantially reduce tactical clipping time compared with manual procedures; H2, the full YOLOv8–LSTM workflow would show better event-level recognition performance than the rule-based baseline without LSTM; and H3, the automated clips would receive lower expert-rated quality than manually generated clips because of the complexity of tactical event recognition and boundary interpretation in real-game basketball scenarios.
2. Materials and Methods
2.1. Study Design and Video Dataset
This study was designed as an application-oriented validation of an automated tactical clip extraction workflow for basketball game videos. Rather than proposing a new deep learning architecture, we aimed to evaluate whether an integrated workflow combining player detection, temporal modeling, basketball-specific rule constraints, and automatic clip generation could reduce manual clipping time while providing measurable event-level recognition performance and practical usability for coaching review.
Two independent sources of basketball video materials were used. First, a separate pool of 25 Chinese Basketball Association (CBA) professional basketball broadcast games was used during the prior development stage of the automated workflow. From these CBA games, 85 manually clipped tactical events were used as development clips for configuring the LSTM-based classifier, feature representation, rule-constrained workflow, and baseline settings.
Second, ten publicly accessible full-game broadcast videos of men’s basketball matches from the Paris 2024 Olympic Games were obtained from the Migu Video platform and used as the independent application-level evaluation sample. These videos were selected because they represented high-level international basketball competition and contained diverse offensive tactical situations, defensive pressure, player movement patterns, and broadcast camera conditions. The Olympic videos were used only for final application-level validation, processing efficiency comparison, event-level recognition evaluation, coach-rated clip quality assessment, and inter-rater agreement analysis. A summary of the development and evaluation data used in the automated workflow is presented in
Table 1.
To prevent data leakage, the ten Olympic full-game videos were not used during model configuration, hyperparameter selection, threshold adjustment, rule development, or baseline construction. All feature definitions, LSTM settings, tactical rule constraints, baseline settings, and event-matching procedures were fixed before the Olympic validation videos were processed. The distribution of tactical clips across the development and evaluation sets is presented in
Table 2.
The evaluation clips from the ten Olympic games were organized, clipped, annotated, and analyzed according to three predefined offensive tactical categories: dribble hand-off, pick-and-roll, and horns offense. No frame-level random partitioning was performed on the ten Olympic full-game videos, in order to preserve the integrity of continuous game contexts and to ensure that the reported results reflected application-level performance on held-out full-game materials.
2.2. Manual Reference Clipping, Annotation Protocol, and Consensus Procedure
Manual reference clips were generated using Hudl Sportscode version 12.52.0, a sport-specific video analysis tool used for event coding, clip capture, and video library organization in performance analysis workflows [
3,
9]. According to the official support documentation, Hudl Sportscode is compatible with macOS and requires appropriate hardware support for video analysis tasks [
18]. In this study, Hudl Sportscode was used to support manual clip extraction, tactical event tagging, and the construction of reference clips.
The use of manual reference clipping is consistent with previous performance analysis research showing that coaches and athletes value video-based feedback when it is practically interpretable and directly connected to training and competition needs [
19]. At the same time, automated video processing studies have shown that contextual cues and deep action recognition features can be used to support basketball highlight generation and sports video summarization [
20,
21]. Public multi-person sports datasets, such as MultiSports, further demonstrate the importance of fine-grained spatiotemporal annotation for evaluating sports action detection in complex multi-athlete scenarios [
22]. These studies informed the present annotation and validation strategy, while the current work focused specifically on application-level tactical clip extraction from full-game basketball broadcasts.
Three offensive tactical categories were selected for annotation and automated extraction: dribble hand-off, pick-and-roll, and horns offense. These tactical categories were selected because they are frequently used in high-level basketball games and involve clear multi-player coordination patterns. Before formal annotation, the analysts were familiarized with the operational definitions, inclusion criteria, exclusion criteria, and temporal boundary rules for each tactical category.
For the Olympic evaluation videos, each video was independently reviewed by eight trained video analysts before consensus discussion. Analysts first identified candidate tactical events according to the operational definitions and temporal annotation rules summarized in
Table 3. Inter-rater agreement was calculated using the independently generated labels before disagreement resolution. After reliability assessment, disagreements were resolved through consensus discussion. The final consensus labels were used as reference labels for event-level precision, recall, and F1-score calculation. Tactical events that remained ambiguous after consensus discussion were excluded from the event-level reference set.
It should be noted that the manual reference clips in this study were defined as coaching-oriented tactical analysis segments rather than as minimal isolated event intervals. To preserve tactical interpretability and practical usability, analysts were allowed to retain the tactical setup immediately preceding the focal action and, where appropriate, extend the clip until the immediate offensive sequence became interpretable for coaching review. For event-level recognition evaluation, these coaching-oriented clips were converted into event-level reference labels by assigning each clip a tactical category and a central offensive sequence identifier. The extended lead-in and follow-up phases were retained for coach-rated clip quality evaluation but were not used as strict temporal localization criteria.
The operational definitions and temporal annotation rules are summarized in
Table 3.
2.3. Automated Tactical Clip Extraction Workflow
The automated tactical clipping workflow consisted of video preprocessing, player detection, temporal feature extraction, tactical event recognition, and clip generation. The overall workflow is shown in
Figure 1.
First, video preprocessing was performed to prepare full-game video materials for automated recognition. This process included video segmentation, format standardization, frame-rate normalization, image-quality screening, and basic temporal alignment where necessary. These procedures were used to reduce interference caused by differences in video format, frame rate, resolution, and broadcast camera conditions.
Second, player detection was performed using a YOLOv8-based object detection module. YOLOv8 was selected because of its established use in real-time object detection tasks requiring a balance between processing speed and localization accuracy [
23]. The detection module was used to locate players in sampled video frames and generate bounding boxes, confidence scores, and frame-level positional information.
Third, temporal feature extraction was conducted on the basis of consecutive video frames. The system extracted player coordinates, detected-ball location cues, relative distances, movement directions, ball-handler-related interaction cues, and basic spatial relationship information. In the current workflow, the ball handler was inferred primarily from the spatial association between the detected ball and nearby players across consecutive frames. These features were then used for temporal modeling and rule-constrained tactical recognition.
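The ball-handler inference described above can be illustrated with a minimal sketch. It assumes that each frame provides a detected ball position (or none) and a dictionary of player positions, and that the handler is the player who is nearest to the ball in most frames of a short window. The function names, the distance threshold `max_dist`, and the vote share `min_share` are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import hypot


def nearest_player(ball, players, max_dist):
    """Return the id of the player closest to the ball, or None if nobody is within max_dist."""
    best_id, best_d = None, max_dist
    for pid, (x, y) in players.items():
        d = hypot(x - ball[0], y - ball[1])
        if d <= best_d:
            best_id, best_d = pid, d
    return best_id


def infer_ball_handler(frames, max_dist=60.0, min_share=0.6):
    """Vote for a ball handler over consecutive frames.

    frames: list of (ball_xy or None, {player_id: (x, y)}).
    max_dist and min_share are illustrative values, not settings from the paper.
    """
    votes = Counter()
    usable = 0
    for ball, players in frames:
        if ball is None or not players:
            continue  # skip frames without a ball detection
        usable += 1
        pid = nearest_player(ball, players, max_dist)
        if pid is not None:
            votes[pid] += 1
    if usable == 0 or not votes:
        return None
    pid, count = votes.most_common(1)[0]
    # Accept the candidate only if it wins a clear share of the usable frames.
    return pid if count / usable >= min_share else None


frames = [
    ((100, 100), {1: (110, 100), 2: (300, 300)}),
    ((105, 100), {1: (112, 101), 2: (290, 300)}),
    (None, {1: (114, 101), 2: (280, 300)}),  # ball occluded in this frame
    ((108, 100), {1: (115, 100), 2: (120, 100)}),
]
handler = infer_ball_handler(frames)
```

Voting across frames rather than relying on a single frame is one way to tolerate brief occlusions of the ball, which the text identifies as frequent in broadcast basketball footage.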
Fourth, tactical recognition was conducted for the three predefined offensive patterns. The recognition process combined LSTM-based temporal modeling with basketball-specific rule constraints. The rule constraints were designed to improve interpretability by incorporating tactical conditions, such as interpersonal distance thresholds, ball-handler interaction, screening relationships, horns alignment, and possession interruption.
Finally, once a potential tactical event was detected, the system generated candidate clips according to predefined tactical extraction rules designed to retain sufficient contextual information for coaching review. Candidate clips were organized by tactical category for subsequent comparison with manually generated reference clips, event-level recognition evaluation, baseline comparison, and coach-rated quality assessment.
The recognition and segment extraction procedures for the three tactical categories are illustrated in
Figure 2,
Figure 3 and
Figure 4.
Because basketball games may be interrupted by fouls, violations, timeouts, and other stoppages, the system incorporated a possession-interruption rule based on stoppage-related visual cues. These cues included referee-action signals and observable player-action or possession discontinuities associated with interruptions in play. When such a visual cue or possession interruption was detected, extraction of the current segment was terminated, and recognition resumed after play restarted. This rule was used to reduce the inclusion of invalid or incomplete clips. The termination procedure based on stoppage-related visual cues is illustrated in
Figure 5.
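As a rough illustration of how contextual padding and the possession-interruption rule might interact, the sketch below pads a detected event with lead-in and follow-up time, clamps the window to the video, and cuts the clip at the first stoppage that occurs after the event starts. The padding values and the function name `clip_window` are assumptions made for this example; the study does not report them.

```python
def clip_window(event_start, event_end, video_end,
                lead_in=3.0, follow_up=2.0, stoppages=()):
    """Return (start, end) in seconds for a candidate clip.

    lead_in and follow_up are hypothetical context paddings.
    stoppages is a sequence of timestamps at which play was interrupted.
    """
    start = max(0.0, event_start - lead_in)
    end = min(video_end, event_end + follow_up)
    for t in sorted(stoppages):
        # Terminate the segment at the first stoppage after the event begins.
        if event_start < t < end:
            end = t
            break
    return start, end


plain = clip_window(10.0, 14.0, video_end=100.0)
cut = clip_window(10.0, 14.0, video_end=100.0, stoppages=[5.0, 15.0])
```

In this sketch, a stoppage that precedes the event start is ignored, whereas one that falls inside the padded window shortens the clip.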
2.4. Model Implementation, Development Settings, and Technical Baseline
The automated clipping workflow included three computational components: player detection, temporal modeling, and rule-constrained tactical recognition. In addition, a technical baseline without LSTM temporal modeling was implemented to examine the contribution of the temporal modeling component.
2.4.1. Player Detection Module
Player detection was implemented using the Ultralytics YOLOv8 framework. In this study, the YOLOv8-based module was used as a detection component rather than as a newly proposed detection architecture. The detected player coordinates were used to calculate relative distances, movement directions, and interaction patterns among offensive and defensive players. Ball location cues used for ball handler inference were also extracted during the visual detection stage and subsequently combined with player position information for temporal modeling.
2.4.2. Temporal Modeling Module
Temporal modeling was performed using an LSTM-based module. The purpose of this module was to model the temporal evolution of player interactions across consecutive frames rather than to propose a new recurrent neural network architecture. The input features included player coordinates, detected-ball location cues, relative distances between the inferred ball handler and nearby teammates, movement direction, and tactical interaction cues.
The LSTM classifier used manually clipped tactical video clips as input. The input features were derived from YOLOv8-based player detection results and related spatial–temporal descriptors. Each clip was represented as a sequence of 32 time steps, with 39 spatial–temporal features per time step. The classifier was implemented as a three-class LSTM model for dribble hand-off, pick-and-roll, and horns offense recognition. The 85 development clips from the CBA data pool were used to configure the LSTM-based classifier and workflow settings, whereas the 75 Olympic evaluation clips were held out for final event-level evaluation.
The LSTM module was implemented as a two-layer network with 128 hidden units in each layer. The input sequence length was set to 32 time steps. After frame-rate normalization, frames were sampled at five-frame intervals, corresponding to a temporal resolution of approximately 0.20 s per time step at 25 frames per second and an input temporal window of approximately 6.4 s. The model was trained for 150 epochs, with a batch size of 16. Adam was used as the optimizer [
24], with an initial learning rate of 0.0001 that was reduced by half every 50 epochs. A dropout rate of 0.3 was applied to reduce overfitting.
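The sampling arithmetic and the learning-rate schedule stated above can be checked with a short sketch. It assumes zero-indexed epochs and a step decay applied after every 50 completed epochs; the function names are illustrative only.

```python
def time_step_seconds(frame_interval=5, fps=25.0):
    """Temporal resolution of one time step when every fifth frame is sampled."""
    return frame_interval / fps


def window_seconds(steps=32, frame_interval=5, fps=25.0):
    """Nominal duration covered by one input sequence."""
    return steps * time_step_seconds(frame_interval, fps)


def sampled_frame_indices(start_frame, steps=32, frame_interval=5):
    """Frame indices that make up one 32-step input sequence."""
    return [start_frame + i * frame_interval for i in range(steps)]


def learning_rate(epoch, base_lr=1e-4, step=50, factor=0.5):
    """Step decay: the rate is halved after every `step` completed epochs.

    Assumes epochs are counted from 0, which is not specified in the text.
    """
    return base_lr * factor ** (epoch // step)


step_s = time_step_seconds()        # 5 / 25 = 0.20 s
window_s = window_seconds()         # 32 * 0.20 = 6.4 s
indices = sampled_frame_indices(0)  # 0, 5, 10, ..., 155
rates = [learning_rate(e) for e in (0, 49, 50, 100, 149)]
```

Under these assumptions, the 150 training epochs are split into three stages with learning rates of 0.0001, 0.00005, and 0.000025.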
2.4.3. Tactical Rule Constraints
The tactical recognition module integrated temporal model outputs with basketball-specific rule constraints. For dribble hand-off, the system focused on close-range ball transfer between the ball handler and a nearby teammate. For pick-and-roll, the system considered the relationship between the ball handler, screener, and on-ball defender. For horns offense, the system considered the initial two-player high-position structure and the subsequent offensive initiation pattern. Candidate clips were generated only when the temporal model outputs and tactical rule constraints jointly satisfied the predefined recognition criteria.
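The joint decision described above can be sketched as a simple gate: a candidate is accepted only when the temporal model favors a category with sufficient confidence and the rule check for that category also holds. The confidence threshold, the distance threshold, the category names, and the hand-off rule shown here are hypothetical stand-ins, since the exact criteria are not reported in the text.

```python
from math import hypot


def is_close_range_transfer(prev_handler, next_handler, positions, max_dist):
    """Hypothetical hand-off rule: the ball handler changes and both players are close together."""
    if prev_handler is None or next_handler is None or prev_handler == next_handler:
        return False
    (x1, y1), (x2, y2) = positions[prev_handler], positions[next_handler]
    return hypot(x1 - x2, y1 - y2) <= max_dist


def accept_candidate(class_probs, rule_flags, min_prob=0.5):
    """Return the predicted category only if model output and rule check agree.

    class_probs: {category: probability} from the temporal model.
    rule_flags: {category: bool} results of the rule constraints.
    min_prob is an illustrative threshold, not a value from the study.
    """
    category = max(class_probs, key=class_probs.get)
    if class_probs[category] >= min_prob and rule_flags.get(category, False):
        return category
    return None


positions = {4: (10.0, 5.0), 7: (10.8, 5.3)}
rules = {"dribble_hand_off": is_close_range_transfer(4, 7, positions, max_dist=1.5)}
probs = {"dribble_hand_off": 0.72, "pick_and_roll": 0.20, "horns": 0.08}
decision = accept_candidate(probs, rules)
```

Requiring both conditions trades some recall for interpretability, because every accepted clip can be traced to an explicit tactical condition as well as a model score.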
2.4.4. Rule-Based Technical Baseline Without LSTM
To provide a minimal algorithmic baseline, a YOLOv8 plus rule-based recognition workflow without LSTM temporal modeling was implemented. This baseline used the same YOLOv8-based detection outputs, player position information, ball-handler cues, and basketball-specific rule constraints as the full workflow, but it did not use the LSTM temporal classifier. Candidate tactical events were generated using deterministic rule-based conditions related to close-range ball-handler interaction, screening relationships, horns alignment, and possession interruption.
The purpose of this baseline was to examine whether the addition of LSTM-based temporal modeling improved event-level tactical recognition performance beyond a straightforward YOLOv8 plus rule-based implementation. The baseline and the full YOLOv8–LSTM workflow were evaluated using the same Olympic evaluation set and the same event-matching procedure.
2.4.5. Data Augmentation
Data augmentation was applied to improve the robustness of the detection and recognition modules under different camera conditions, player movement patterns, and broadcast video variations. Moderate transformations, including random scaling, rotation, cropping, and brightness adjustment, were used to simulate common variations in broadcast basketball videos. The augmentation settings were fixed before system validation and were selected to avoid unrealistic distortions of player posture, court geometry, and tactical spatial relationships.
2.5. Software and Hardware Environment
The automated workflow was implemented in a Conda environment named yolov8. Python 3.9.25 was used for video data organization, feature processing, automated workflow implementation, baseline comparison, and statistical analysis. The player detection module was implemented using the Ultralytics YOLOv8 framework based on PyTorch. The confirmed software environment included PyTorch 2.7.1+cu128, CUDA Runtime 12.8, torchvision 0.22.1+cu128, Ultralytics YOLO 8.4.38, OpenCV 4.10.0.84, NumPy 1.26.4, pandas 2.3.3, SciPy 1.13.1, and scikit-learn 1.5.2. LabelMe 5.10.1 was used for manual bounding-box annotation where applicable [
25].
All automated processing experiments were conducted using the same hardware and software environment to ensure the comparability of processing time and system output evaluation. The automated workflow was run on a computing platform equipped with an Intel Core i9-14900HX processor at 2.20 GHz (Intel Corporation, Santa Clara, CA, USA), 16 GB RAM, and an NVIDIA GeForce RTX 4080 Laptop graphics processing unit (GPU) with 12 GB of video memory (NVIDIA Corporation, Santa Clara, CA, USA). The operating system was Microsoft Windows 11.
Manual clipping was conducted using Hudl Sportscode in the corresponding manual analysis environment. Automated clipping time was measured from the start of video processing to the completion of categorized clip output. Manual clipping time was measured as the time required by analysts to identify, mark, extract, and organize the corresponding tactical clips.
2.6. Evaluation Metrics and Event-Matching Procedure
The performance of the automated tactical clipping system was evaluated from five perspectives: processing efficiency, event-level tactical recognition performance, baseline comparison, coach-rated clip quality, and inter-rater agreement among manual video analysts. These indicators were selected to assess both the practical applicability and the quantitative recognition performance of the automated workflow in basketball video analysis.
2.6.1. Processing Efficiency
Processing efficiency was evaluated by comparing the time required for automated and manual clipping. For each tactical category, the average time required to extract clips from a single game was calculated. Because automated and manual clipping were performed on the same ten full-game videos, paired comparisons were used in the statistical analysis.
2.6.2. Event-Level Tactical Recognition Performance
To provide an objective quantitative assessment of tactical recognition accuracy, automatically generated clips were compared with manually annotated reference clips for each tactical category. A true positive (TP) was defined as an automatically generated clip that corresponded to a manually referenced tactical event of the same category under the predefined annotation rules. A false positive (FP) was defined as an automatically generated clip that did not correspond to a manually referenced target event. A false negative (FN) was defined as a manually referenced tactical event that was not detected by the automated workflow.
Event matching was performed according to the following procedure. First, each manually annotated event was assigned a tactical category and a central offensive sequence identifier. Second, each automatically generated clip was assigned a predicted tactical category. Third, an automatically generated clip was counted as a TP when its predicted category matched the manual reference category and both clips referred to the same offensive sequence. Fourth, when multiple automated clips corresponded to the same manually annotated event, only one clip was counted as a TP, whereas the remaining duplicate clips were counted as FPs. Fifth, an automatically generated clip without a corresponding manual reference event was counted as an FP. Sixth, a manually annotated event without any corresponding automated clip was counted as an FN. Extended lead-in and follow-up footage was not used as a strict temporal localization criterion but was considered in coach-rated clip quality evaluation.
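Precision, recall, and F1-score were calculated as follows:
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]
\[ F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]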
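The event-matching steps and the resulting metrics can be sketched in a few lines. The sketch assumes that every reference event and every automated clip is represented as a pair of offensive sequence identifier and tactical category, and that each reference pair is unique. The identifiers and category names are hypothetical.

```python
def match_events(reference, predicted):
    """Count TP, FP, and FN from (sequence_id, category) pairs.

    A predicted clip is a TP when the same pair exists in the reference set
    and has not been matched yet; duplicates, wrong categories, and clips
    without a reference event are FPs; unmatched reference events are FNs.
    """
    ref_set = set(reference)
    matched = set()
    tp = fp = 0
    for clip in predicted:
        if clip in ref_set and clip not in matched:
            matched.add(clip)
            tp += 1
        else:
            fp += 1
    fn = len(ref_set) - len(matched)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Standard definitions, with 0.0 returned when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


reference = [("g1-p012", "pick_and_roll"),
             ("g1-p030", "dribble_hand_off"),
             ("g1-p044", "horns")]
predicted = [("g1-p012", "pick_and_roll"),   # TP
             ("g1-p012", "pick_and_roll"),   # duplicate, counted as FP
             ("g1-p030", "horns"),           # wrong category, counted as FP
             ("g1-p099", "horns")]           # no reference event, counted as FP
tp, fp, fn = match_events(reference, predicted)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

Per-category results can be obtained in the same way by filtering both lists to a single tactical category before matching.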
2.6.3. Baseline Comparison
To evaluate the contribution of the LSTM-based temporal modeling component, the full YOLOv8–LSTM + rule-constrained workflow was compared with the YOLOv8 + rule-based baseline without LSTM. Both methods were evaluated on the same Olympic evaluation set using the same event-matching procedure and the same precision, recall, and F1-score calculations. This comparison was used to examine whether the integrated workflow provided additional event-level recognition value beyond a simpler rule-based implementation.
2.6.4. Coach-Rated Clip Quality
Clip quality was evaluated by basketball coaches using a 0–10 rating scale. The evaluation considered whether the generated clip correctly captured the target tactical event, whether the extracted segment preserved sufficient tactical context and had practically adequate boundaries for coaching review, whether the clip was usable for tactical review, and whether it could support coaching feedback. Higher scores indicated better clip quality and greater practical usability. The coach-rated quality of automatically generated clips was compared with that of manually generated clips.
2.6.5. Inter-Rater Agreement
To examine the reliability of manual reference judgments, inter-rater agreement among manual video analysts was assessed using Fleiss’ kappa. The agreement results were used to determine whether the manual clipping outputs were sufficiently consistent to serve as reference judgments for evaluating the practical quality and event-level recognition performance of automated clips.
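For reference, Fleiss' kappa can be computed from a table in which each row is one rated item and each column is one category, with cell values giving the number of raters who chose that category. The sketch below follows the standard definition; how candidate events were arranged into such a table in this study is not described, so the example table is purely illustrative.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[i][j] = raters assigning item i to category j.

    Every row must sum to the same number of raters n (n >= 2).
    Returns NaN when chance agreement equals 1, where kappa is undefined.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    # Proportion of all assignments that fall into each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement for each item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Illustrative table: four candidate events, eight raters, three categories.
table = [
    [8, 0, 0],
    [0, 8, 0],
    [1, 7, 0],
    [0, 0, 8],
]
kappa = fleiss_kappa(table)
```

Because Fleiss' kappa corrects for chance agreement using the overall category proportions, it is sensitive to unbalanced category frequencies, which is worth keeping in mind when one tactical category is much rarer than the others.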
2.6.6. Scope of Evaluation and Boundary Considerations
This study combined system-level application validation with event-level tactical recognition evaluation and baseline comparison. Accordingly, the evaluation included processing efficiency, TP, FP, FN, precision, recall, F1-score, coach-rated clip quality, and inter-rater agreement among manual analysts.
Strict temporal displacement metrics, such as start-point error, end-point error, or temporal intersection over union, were not applied in the present analysis. The manually generated reference clips were defined as coaching-oriented tactical analysis segments rather than minimal isolated event intervals. Their boundaries were designed to preserve tactically relevant pre-action and post-action context for practical review. Therefore, direct calculation of temporal boundary displacement between automated and manual clips could conflate genuine localization error with deliberate contextual extension and may not provide a valid measure of clip quality for the present task. Future studies designed specifically for temporal localization should establish separate minimal-event boundary annotations and then apply metrics such as start-point error, end-point error, or temporal intersection over union.
2.7. Statistical Analysis
Statistical analyses were conducted to compare automated and manual clipping procedures. Because both procedures were applied to the same ten full-game videos, the analyses were conducted within a paired-samples framework. The normality of paired differences was examined using the Shapiro–Wilk test. When the normality assumption was satisfied, paired-samples t-tests were used to compare automated and manual clipping times; when the assumption was violated, the Wilcoxon signed-rank test was applied. Because the comparisons were not based on independent groups, Levene’s test for homogeneity of variance was not applied.
For coach-rated clip quality, descriptive statistics included the mean, standard deviation, median, and interquartile range (IQR). Given that the ratings were recorded on a 0–10 scale and treated as ordinal data, the Wilcoxon signed-rank test was used to compare the quality scores of automated and manual clips. The significance level was set at α = 0.05. All statistical analyses were performed using Python 3.9.25, with SciPy 1.13.1 used for statistical testing and pandas 2.3.3 used for data organization and tabular processing. Event-level tactical recognition performance and baseline comparison results were summarized descriptively using TP, FP, FN, precision, recall, and F1-score for each tactical category.
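To make the logic of the Wilcoxon signed-rank test concrete, the sketch below computes an exact two-sided p-value by enumerating all sign patterns of the ranked differences. It is a pure-Python illustration, not the SciPy routine (`scipy.stats.wilcoxon`) that would normally be used, and the two score lists are hypothetical placeholders rather than the study ratings.

```python
from itertools import product


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def wilcoxon_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W_plus, p_value). Zero differences are dropped; feasible only
    for small n because all 2**n sign patterns are enumerated.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = abs(w_plus - total / 2)  # distance from the null expectation
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)


# Hypothetical per-game quality scores (0-10) for ten games.
auto_scores = [3, 4, 2, 5, 3, 6, 4, 3, 7, 1]
manual_scores = [8, 9, 8, 7, 9, 8, 10, 8, 9, 8]
w_plus, p_value = wilcoxon_exact(auto_scores, manual_scores)
```

With ten paired observations that all differ in the same direction, only the two most extreme sign patterns are at least as extreme as the observed one, so the exact two-sided p-value is 2/2^10 ≈ 0.00195. This is the smallest value the exact test can return for ten pairs.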
3. Results
3.1. Processing Efficiency of Automated and Manual Clipping
In this study, ten full-game videos of men’s basketball matches from the Paris 2024 Olympic Games were selected as the application-level validation materials. For the automated clipping condition, the proposed automated tactical clipping workflow was used to extract three types of offensive tactical clips: dribble hand-off, pick-and-roll, and horns offense. Each game was processed eight times with the automated workflow, and the processing times of the eight trials were recorded and averaged.
For the manual clipping condition, eight senior video analysts with substantial practical experience in professional basketball video analysis were invited to manually clip the same ten game videos using Hudl Sportscode. The manual clipping time of each analyst was recorded, and the mean manual clipping time per game was calculated.
3.1.1. Automated Clipping Efficiency
After applying the automated tactical recognition and clipping workflow, the system was able to generate video clips for the three predefined offensive tactical categories. The results showed that the average time required by the automated system to extract tactical clips from a single game was 3.12 min for dribble hand-off clips, 3.69 min for pick-and-roll clips, and 1.96 min for horns offense clips. The summary results of automated clipping efficiency across the three tactical categories are presented in Table 4.
3.1.2. Manual Clipping Efficiency
For manual clipping, eight experienced video analysts independently clipped the same ten game videos. The results showed that the average time required for manual clipping from a single game was 54.85 min for dribble hand-off clips, 67.22 min for pick-and-roll clips, and 35.01 min for horns offense clips. The summary results of manual clipping efficiency across the three tactical categories are presented in Table 5.
3.1.3. Comparison of Automated and Manual Clipping Efficiency
Statistical Comparison of Clipping Time
Figure 6 presents a comparison between automated and manual clipping efficiency in terms of processing time. The horizontal axis represents the three offensive tactical categories, whereas the vertical axis represents clipping time in minutes. For dribble hand-offs, pick-and-rolls, and horns offense, manual clipping required substantially more time than automated clipping. The automated system reduced the average clipping time by 94.31% for dribble hand-offs, 94.51% for pick-and-rolls, and 94.40% for horns offense.
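The reported reductions follow directly from the mean clipping times, as the short calculation below shows. The dictionary and function names are illustrative only.

```python
# Mean clipping times per game in minutes, as reported for the two conditions.
mean_minutes = {
    "dribble hand-off": (3.12, 54.85),  # (automated, manual)
    "pick-and-roll": (3.69, 67.22),
    "horns offense": (1.96, 35.01),
}


def percent_reduction(automated, manual):
    """Relative time saving of automated versus manual clipping, in percent."""
    return (manual - automated) / manual * 100.0


reductions = {
    tactic: round(percent_reduction(auto, manual), 2)
    for tactic, (auto, manual) in mean_minutes.items()
}
```

The resulting values are 94.31, 94.51, and 94.40 for dribble hand-offs, pick-and-rolls, and horns offense, respectively.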
Because automated and manual clipping were performed on the same ten game videos, paired-samples statistical tests were used to compare processing time between the two methods. The normality of paired differences was examined using the Shapiro–Wilk test. The paired differences met the normality assumption for all three tactical categories; therefore, paired-samples t-tests were used. The results showed that automated clipping required significantly less time than manual clipping for all three tactical categories.
Specifically, for dribble hand-offs, the mean processing time was 3.12 ± 0.48 min for automated clipping and 54.85 ± 4.17 min for manual clipping, p < 0.001. For pick-and-rolls, the mean processing time was 3.69 ± 0.56 min for automated clipping and 67.22 ± 5.06 min for manual clipping, p < 0.001. For horns offense, the mean processing time was 1.96 ± 0.34 min for automated clipping and 35.01 ± 2.87 min for manual clipping, p < 0.001. These findings indicate that the automated system substantially improved clipping efficiency compared with manual procedures. The paired comparison results are presented in Table 6.
3.2. Event-Level Tactical Recognition Performance of the Full Automated Workflow
To provide an objective quantitative assessment of tactical recognition performance, automatically generated clips from the full YOLOv8–LSTM + rule-constrained workflow were compared with manually annotated reference events for each tactical category. Event-level performance was evaluated using true positives (TPs), false positives (FPs), false negatives (FNs), precision, recall, and F1-score. The results are presented in Table 7.
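How automated clips are paired with reference events determines the TP, FP, and FN counts. The sketch below shows one plausible matching scheme and should not be read as the study's event-matching procedure: the one-to-one greedy matching, the `min_overlap` threshold, and the example intervals are all assumptions made for illustration.

```python
def match_events(predicted, reference, min_overlap=1.0):
    """Count TP, FP, FN by one-to-one matching of (start, end) intervals.

    A predicted clip matches an unused reference event when they overlap by
    at least `min_overlap` seconds (an illustrative threshold). Matching is
    greedy in order of predicted start time, so it is not guaranteed to find
    the assignment with the most matches.
    """
    used = set()
    tp = 0
    for p_start, p_end in sorted(predicted):
        best, best_overlap = None, 0.0
        for idx, (r_start, r_end) in enumerate(reference):
            if idx in used:
                continue
            overlap = min(p_end, r_end) - max(p_start, r_start)
            if overlap >= min_overlap and overlap > best_overlap:
                best, best_overlap = idx, overlap
        if best is not None:
            used.add(best)
            tp += 1
    fp = len(predicted) - tp  # predicted clips with no reference event
    fn = len(reference) - tp  # reference events with no predicted clip
    return tp, fp, fn


# Hypothetical clip intervals in seconds.
predicted_clips = [(10.0, 18.0), (40.0, 46.0), (90.0, 95.0)]
reference_events = [(8.0, 20.0), (60.0, 70.0), (92.0, 100.0)]
tp, fp, fn = match_events(predicted_clips, reference_events)
```

In this example, two predicted clips overlap a reference event, one predicted clip has no counterpart, and one reference event is missed, giving TP = 2, FP = 1, and FN = 1.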
For dribble hand-off clips, the full workflow identified 10 true-positive events, with 21 false positives and 10 false negatives. The corresponding precision, recall, and F1-score were 0.3226, 0.5000, and 0.3922, respectively. The relatively low precision indicates that over-detection remained a major limitation for this tactical category.
For pick-and-roll clips, the full workflow produced 9 true positives, 5 false positives, and 21 false negatives. The precision was 0.6429, whereas recall was 0.3000, resulting in an F1-score of 0.4091. This pattern indicates that pick-and-roll recognition achieved relatively higher correctness among detected clips but remained strongly constrained by missed tactical events.
For horns offense clips, the full workflow produced 16 true positives, 14 false positives, and 9 false negatives. The precision, recall, and F1-score were 0.5333, 0.6400, and 0.5818, respectively. Among the three tactical categories, horns offense showed the highest F1-score and comparatively better recognition performance.
Overall, the full automated workflow provided measurable but uneven event-level recognition performance across tactical categories. These results help explain why the automatically generated clips, although efficient to produce, still required expert review before practical coaching use.
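The reported precision, recall, and F1-score values can be reproduced from the TP, FP, and FN counts with the standard definitions. The function name below is illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1-score from event counts, rounded to 4 places."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return round(precision, 4), round(recall, 4), round(f1, 4)


# (TP, FP, FN) per tactical category for the full workflow, as reported above.
counts = {
    "dribble hand-off": (10, 21, 10),
    "pick-and-roll": (9, 5, 21),
    "horns offense": (16, 14, 9),
}
scores = {tactic: prf(*c) for tactic, c in counts.items()}
```

The contrast between the two weaker categories is visible in the counts themselves: dribble hand-off has more than twice as many false positives as true positives, whereas pick-and-roll has more than twice as many false negatives as true positives.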
3.3. Baseline Comparison and Ablation Analysis
To examine whether the LSTM-based temporal modeling component contributed additional recognition value beyond a straightforward rule-based implementation, the full YOLOv8–LSTM + rule-constrained workflow was compared with a YOLOv8 + rule-based baseline without LSTM. Both methods used the same Olympic evaluation set, the same manually annotated reference events, and the same event-matching procedure. The rule-based baseline used YOLOv8-based detection outputs, ball handler cues, interpersonal distance thresholds, screening-related rules, horns alignment rules, and possession interruption rules but did not include LSTM-based temporal sequence modeling.
As shown in Table 8, the full YOLOv8–LSTM workflow achieved higher F1-scores than the rule-based baseline across all three tactical categories. For dribble hand-off, the F1-score improved from 0.3390 to 0.3922. For pick-and-roll, the F1-score improved from 0.2632 to 0.4091. For horns offense, the F1-score improved from 0.3396 to 0.5818.
These findings suggest that LSTM-based temporal modeling contributed to improved event-level tactical recognition performance, particularly for pick-and-roll and horns offense. However, the absolute F1-scores remained limited, indicating that the full workflow should still be regarded as a preliminary candidate clip generation tool requiring expert human review rather than as a fully automated tactical analysis system.
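The size of the improvement in each category can be expressed as an absolute and a relative F1 gain, computed here from the reported scores. The variable names are illustrative.

```python
# F1-scores per tactic: (rule-based baseline without LSTM, full workflow).
f1_scores = {
    "dribble hand-off": (0.3390, 0.3922),
    "pick-and-roll": (0.2632, 0.4091),
    "horns offense": (0.3396, 0.5818),
}

gains = {
    tactic: {
        "absolute": round(full - base, 4),
        "relative_pct": round((full - base) / base * 100, 1),
    }
    for tactic, (base, full) in f1_scores.items()
}
```

The absolute gains are 0.0532, 0.1459, and 0.2422, corresponding to relative gains of about 15.7%, 55.4%, and 71.3% for dribble hand-off, pick-and-roll, and horns offense, respectively. This ordering is consistent with the larger contribution of temporal modeling noted for pick-and-roll and horns offense.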
3.4. Coach-Rated Clip Quality and Usability Evaluation
Although automated clipping substantially reduced processing time, and event-level recognition metrics provided objective evidence of tactical identification performance, these indicators alone are not sufficient to determine whether the generated clips are practically usable for basketball coaching. Therefore, coach-rated clip quality was further assessed to examine whether the automatically generated clips met practical coaching needs.
The evaluation focused on four aspects: whether the generated clip correctly captured the target tactical event, whether the extracted segment preserved sufficient tactical context and had practically adequate boundaries for coaching review, whether the tactical category was correctly identified, and whether the clip was usable for coaching feedback and tactical review. Each game was scored on a 0–10 scale, with higher scores indicating greater clip quality and practical usability.
The results showed that the mean coach-rated quality score for automated clipping was 3.8 ± 1.81, with a median of 3.5 and an interquartile range of 1.75. In contrast, the mean coach-rated quality score for manual clipping was 8.4 ± 1.07, with a median of 8.0 and an interquartile range of 1.00. Because coach ratings were based on a 0–10 ordinal scale, the Wilcoxon signed-rank test was used to compare the two clipping methods. The results indicated that coach-rated clip quality was significantly lower for automated clipping than for manual clipping, p = 0.002, as shown in Table 9 and Figure 7.
These results suggest that, although the automated system markedly improved processing efficiency and outperformed the rule-based baseline in event-level F1-score, the quality and practical usability of automatically generated clips remained substantially lower than those of manually generated clips. This finding is consistent with the event-level recognition and baseline comparison results. Specifically, the automated workflow still showed limited recognition reliability, with dribble hand-off extraction being more affected by over-detection and pick-and-roll extraction being more constrained by missed events. Horns offense, although comparatively better recognized, also remained imperfect. Therefore, the automated system should currently be regarded as a decision-support tool for preliminary candidate clip generation rather than a replacement for expert manual analysis.
3.5. Inter-Rater Agreement Among Manual Video Analysts
To examine the reliability of manual reference judgments, inter-rater agreement among the eight manual video analysts was assessed. The mean game-level Fleiss’ kappa coefficient was 0.81, indicating almost perfect agreement among analysts. At the game level, kappa values ranged from 0.72 to 0.91, suggesting that the manual clipping results were sufficiently consistent to serve as reference judgments for evaluating automated clip quality. The game-level Fleiss’ kappa coefficients are presented in Table 10.
These findings indicate that the manual clipping results were not based on the judgment of a single analyst, but were supported by relatively stable agreement among multiple experienced analysts. Therefore, the manually generated clips were used as the reference standard for both event-level tactical recognition evaluation and the assessment of the practical quality and usability of the automated clipping outputs.
3.6. Summary of Main Findings
Overall, the automated tactical clipping system demonstrated a substantial advantage in processing efficiency. Compared with manual clipping, the automated workflow reduced clipping time by more than 94% across all three offensive tactical categories (dribble hand-off, pick-and-roll, and horns offense). This finding indicates that automated clipping has practical value for reducing the workload of video analysts and improving the timeliness of basketball video feedback.
The event-level tactical recognition analysis showed that the full YOLOv8–LSTM workflow achieved measurable but uneven recognition performance across tactical categories. Horns offense exhibited the highest F1-score (0.5818), whereas dribble hand-off recognition was more affected by over-detection and pick-and-roll recognition was more constrained by missed events. The baseline comparison further showed that the full YOLOv8–LSTM workflow outperformed the rule-based baseline without LSTM across all three tactical categories. These results suggest that temporal modeling contributed to improved recognition performance, although the absolute performance remained limited.
The coach-rated clip quality results confirmed that the automated system remained inferior to manual clipping in terms of practical usability. The lower coach ratings suggest that automatically generated clips still had limitations in recognition reliability, contextual completeness, and coaching applicability. Therefore, the current system should be regarded as an auxiliary candidate clip generation tool requiring expert review, correction, and interpretation rather than as a full replacement for expert manual video analysis.
Finally, the almost perfect mean game-level inter-rater agreement among manual video analysts supports the use of manually generated clips as a reference standard for evaluating both event-level recognition performance and practical clip quality.
4. Discussion
4.1. Processing Efficiency: Practical Advantages of the Automated Clipping Workflow
The results of this study show that the automated tactical clipping workflow demonstrated a substantial advantage in processing efficiency compared with manual clipping. Across the three offensive tactical categories, automated clipping reduced processing time by more than 94%. Taking dribble hand-off clips as an example, the mean processing time of the automated workflow was 3.12 min, which accounted for only approximately 5.7% of the time required for manual clipping. Similar efficiency advantages were also observed for pick-and-roll and horns offense clips.
This efficiency advantage can be mainly attributed to the automated extraction of player-related visual features and the rule-constrained temporal analysis of tactical events. In conventional basketball video analysis workflows, analysts need to repeatedly observe full-game footage, identify target events, mark clip boundaries, assign labels, and organize clips. Although sport-specific video analysis tools such as Hudl Sportscode and Dartfish can support tagging, annotation, and video library construction, the identification and selection of tactical clips still depend largely on manual judgment [3,9,18]. By contrast, the automated workflow evaluated in this study generated candidate tactical clips through predefined recognition logic and automatic segment extraction, thereby reducing repetitive manual operations.
From an applied coaching perspective, this finding is important because tactical feedback often needs to be delivered within a limited time window. In pre-match preparation, post-match review, and potential near-real-time match support, reducing the time required for clip extraction may allow coaches and analysts to allocate more attention to tactical interpretation, diagnosis, and feedback design rather than to repetitive video searching and editing. Therefore, the current system has practical value as a preliminary candidate clip generation tool within basketball performance analysis workflows.
4.2. Event-Level Recognition Performance, Baseline Comparison, and Practical Usability
Although the automated workflow was clearly superior to manual clipping in terms of processing efficiency, the event-level tactical recognition results revealed substantial category-specific limitations. For dribble hand-off clips, the full YOLOv8–LSTM + rule-constrained workflow produced 10 true positives, 21 false positives, and 10 false negatives, with a precision of 0.3226, a recall of 0.5000, and an F1-score of 0.3922. This result indicates that over-detection remained a major limitation for this tactical category. For pick-and-roll clips, the full workflow produced 9 true positives, 5 false positives, and 21 false negatives. Although precision reached 0.6429, recall was only 0.3000, resulting in an F1-score of 0.4091. This pattern indicates that pick-and-roll recognition was more strongly constrained by missed tactical events. For horns offense clips, the full workflow produced 16 true positives, 14 false positives, and 9 false negatives, with a precision of 0.5333, a recall of 0.6400, and an F1-score of 0.5818. Among the three tactical categories, horns offense showed the best recognition performance but still remained imperfect.
The baseline comparison further clarifies the contribution of LSTM-based temporal modeling. Compared with the YOLOv8 + rule-based baseline without LSTM, the full YOLOv8–LSTM + rule-constrained workflow achieved higher F1-scores across all three tactical categories. The F1-score improved from 0.3390 to 0.3922 for dribble hand-off, from 0.2632 to 0.4091 for pick-and-roll, and from 0.3396 to 0.5818 for horns offense. These findings suggest that temporal modeling contributed additional recognition value beyond a straightforward rule-based implementation, particularly for tactical patterns that require interpretation of multi-player interaction sequences. However, the absolute F1-scores remained limited, indicating that the system is not yet sufficiently reliable for fully automated tactical analysis.
These quantitative findings help explain the lower coach-rated quality of automatically generated clips. The mean coach-rated quality score for automated clipping was 3.8, whereas the corresponding score for manual clipping was 8.4. This result indicates that the automated system cannot yet replace expert manual video analysis in complex basketball scenarios. Importantly, the lower practical quality was not attributable to a single error source. Instead, the TP/FP/FN results and precision–recall patterns indicate different recognition limitations across tactical categories: dribble hand-off extraction was more affected by over-detection, pick-and-roll extraction was more strongly constrained by missed events, and horns offense, although comparatively better recognized, still showed limited reliability.
The remaining gap between automated and manual clipping is also related to the nature of basketball tactical interpretation. Tactical events are not defined solely by isolated player positions or single-frame actions but by dynamic relationships among the ball handler, teammates, defenders, ball movement, and spatial context. Real-game broadcast video further introduces player occlusion, camera movement, rapid transitions, and incomplete visual information. In addition, coaching-oriented tactical clips often need to preserve sufficient contextual information before and after the focal action so that the offensive sequence can be interpreted meaningfully. Therefore, the practical quality of a generated clip depends not only on whether the target event is detected but also on whether the segment is contextually adequate for coaching review.
Accordingly, the current automated workflow should be regarded as a decision-support tool rather than a replacement for experienced video analysts. Its main value lies in quickly generating preliminary candidate clips that can be further reviewed, corrected, and interpreted by coaches or analysts. In practical use, the system may first provide a batch of candidate clips, after which analysts refine clip selection, remove false positives, recover missed events where necessary, and add tactical interpretation. This human-in-the-loop workflow is more realistic than full automation at the current stage.
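The review step described above can be sketched as a simple data flow. The `Clip` structure, the `review` function, and all example values are hypothetical and are not part of the evaluated system; they only illustrate how analyst decisions could be applied to a batch of automated candidates.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    tactic: str
    start: float          # seconds from the start of the game video
    end: float
    source: str = "auto"  # "auto" for system output, "manual" for analyst additions
    note: str = ""        # analyst's tactical interpretation


def review(candidates, rejected, missed, notes):
    """Apply analyst decisions to a batch of automated candidate clips.

    rejected: indices of candidates judged to be false positives
    missed:   clips the analyst adds for events the system did not find
    notes:    mapping from candidate index to an interpretation note
    """
    kept = []
    for i, clip in enumerate(candidates):
        if i in rejected:
            continue
        kept.append(Clip(clip.tactic, clip.start, clip.end, clip.source, notes.get(i, "")))
    kept.extend(missed)
    return sorted(kept, key=lambda c: c.start)


# Hypothetical batch: three candidates, one rejected, one missed event recovered.
candidates = [
    Clip("pick-and-roll", 120.0, 131.0),
    Clip("horns offense", 300.0, 312.0),
    Clip("dribble hand-off", 410.0, 418.0),
]
final_clips = review(
    candidates,
    rejected={2},
    missed=[Clip("pick-and-roll", 200.0, 212.0, source="manual")],
    notes={0: "high screen, defender goes under"},
)
```

Keeping the `source` field makes it possible to track afterwards how many delivered clips originated from the system and how many were added by the analyst.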
4.3. Application Value and Methodological Positioning
The primary contribution of this study lies in the application-oriented validation of an automated tactical clipping workflow for basketball game videos. Rather than proposing a new deep learning architecture, this study evaluated whether an integrated workflow based on player detection, temporal modeling, tactical recognition rules, and automatic clip generation could support basketball video analysis under real-game conditions.
Compared with conventional manual workflows, the proposed system has three practical implications. First, it can reduce the workload of video analysts by automatically generating candidate clips for selected offensive tactical categories. Second, it can improve the timeliness of video feedback by shortening the time required to locate and organize tactical clips. Third, it provides a structured technical pathway for linking player detection, tactical event recognition, clip generation, quantitative recognition evaluation, technical baseline comparison, and coach assessment in a single applied workflow.
Existing computational studies in basketball have explored related tasks such as tactical pattern recognition, player-relation modeling, and highlight generation; however, directly comparable workflows for automatically extracting coaching-oriented tactical clips from full-game broadcast videos remain limited. Therefore, the present study adopted manual clipping as the primary practical baseline for workflow efficiency and coach usability, while also introducing a YOLOv8 + rule-based baseline without LSTM as a minimal algorithmic comparison. This combined evaluation strategy helps position this study between practical sports video analysis and computational tactical recognition research.
The incorporation of interpersonal distance thresholds, possession interruption rules, and LSTM-based temporal modeling reflects an attempt to adapt automated clipping logic to basketball-specific tactical scenarios. The rule-based constraints improve interpretability because coaches and analysts can understand why the system identifies a segment as a potential dribble hand-off, pick-and-roll, or horns offense. The baseline comparison further suggests that temporal modeling can improve recognition performance beyond rule-based recognition alone. This is particularly important in applied sports settings, where users often require not only automated output but also interpretable reasoning and expert review behind the generated clips.
4.4. Limitations and Future Work
Several limitations should be acknowledged. First, although the workflow was developed using 25 CBA professional broadcast games and evaluated on 10 independent Olympic full-game videos, the final application-level validation sample remained limited in scale. The Olympic games represented high-level international competition, but the findings may not fully generalize to other competitions, teams, camera conditions, tactical styles, or levels of play. Future studies should expand both the size and diversity of validation datasets.
Second, the current system focused on only three offensive tactical categories: dribble hand-off, pick-and-roll, and horns offense. Although these actions are common in high-level basketball, they do not cover the full range of offensive and defensive tactical patterns. More complex tactical structures, such as off-ball screens, Spain pick-and-roll, transition offense, zone offense, and defensive rotations, were not included in the present validation.
Third, the automated recognition workflow still relied partly on predefined tactical rules, including interpersonal distance thresholds and possession interruption rules. These constraints improved interpretability, but they may also limit adaptability in non-standardized game scenarios. In complex real-game situations involving player occlusion, rapid transitions, visually cluttered scenes, or incomplete broadcast angles, false detections and missed detections may still occur. The event-level recognition analysis further showed category-specific limitations: dribble hand-off extraction was affected by false positives, whereas pick-and-roll extraction was more strongly constrained by false negatives. Moreover, despite the use of predefined annotation rules and strong inter-rater agreement, residual labeling uncertainty may remain because the interpretation of basketball tactical events can involve a degree of expert judgment.
Fourth, although the present study added event-level tactical recognition metrics, including TP, FP, FN, precision, recall, and F1-score, it did not apply strict temporal displacement metrics such as start-point error, end-point error, or temporal intersection over union. This decision was based on the task definition of the study. The manually generated reference clips were designed as coaching-oriented tactical analysis segments rather than as minimal isolated event intervals. Their boundaries intentionally preserved tactically relevant lead-in and continuation phases so that coaches could interpret the full offensive sequence. Therefore, direct calculation of temporal displacement between automated and manual clips could conflate genuine localization error with deliberate contextual extension and may not provide a valid measure of clip quality for the present task. Future studies designed specifically for temporal localization should establish separate minimal-event boundary annotations and then apply metrics such as start-point error, end-point error, or temporal intersection over union in a dedicated benchmark setting.
Fifth, the present study included a minimal technical baseline, namely YOLOv8 + rule-based recognition without LSTM, to examine the contribution of temporal modeling. However, the baseline comparison remained limited. This study did not compare the workflow with standard action recognition networks, transformer-based video models, temporal event detection methods, or large-scale sports video benchmarks. Therefore, the algorithmic conclusions should be interpreted cautiously. Future research should construct standardized benchmark settings, expand event-level annotations, and compare different recognition architectures under consistent task definitions and evaluation protocols.
Future work should also expand both the scale and diversity of basketball video datasets. Videos from different competitions, camera angles, levels of play, tactical styles, and team contexts should be included to improve the robustness and generalizability of automated tactical recognition. More advanced spatiotemporal modeling methods, such as attention mechanisms, graph neural networks, transformer-based video models, pose estimation [26], ball tracking, and multi-player interaction modeling, may help the system better capture tactical relationships among the ball handler, teammates, defenders, and spatial context.
Although the present workflow was developed and validated for basketball tactical clip extraction, its general logic—player detection, temporal modeling, rule-constrained event recognition, and candidate clip generation—may be adaptable to other team sports. However, such transferability would require the redefinition of sport-specific tactical events, annotation rules, and recognition criteria according to the characteristics of each sport. For example, the workflow may have potential application in football, handball, or volleyball if appropriate event definitions and tactical rules are established.
Finally, future systems should continue to adopt a human-in-the-loop workflow. Rather than aiming to fully replace video analysts, automated systems should first generate candidate clips, after which analysts and coaches refine clip selection, remove false positives, recover missed events, and add tactical interpretation. This approach may better align with practical coaching workflows and improve user trust. To enhance reproducibility, future studies should also provide source code, annotation protocols, event-matching rules, sample clips where permitted, and non-copyrighted benchmark materials.
5. Conclusions
This study developed and evaluated an application-oriented automated tactical clip extraction workflow for basketball game videos. It focused on validating whether an integrated workflow based on player detection, LSTM-based temporal modeling, basketball-specific rule constraints, and automatic clip generation could support basketball performance analysis under real-game conditions.
The results showed that the automated workflow had a substantial advantage in processing efficiency. Compared with manual clipping, the automated system reduced clipping time by more than 94% across the three offensive tactical categories of dribble hand-off, pick-and-roll, and horns offense. This finding suggests that automated tactical clipping can effectively reduce repetitive manual work and improve the timeliness of video feedback for coaching and match review.
Event-level tactical recognition evaluation provided a more objective assessment of the current system’s performance. For the full YOLOv8–LSTM + rule-constrained workflow, the F1-scores were 0.3922 for dribble hand-off, 0.4091 for pick-and-roll, and 0.5818 for horns offense, indicating measurable but uneven recognition performance across tactical categories. The corresponding TP, FP, and FN results further showed that dribble hand-off extraction was more affected by false positives, whereas pick-and-roll extraction was more strongly constrained by false negatives.
The baseline comparison showed that the full YOLOv8–LSTM workflow outperformed the YOLOv8 + rule-based baseline without LSTM across all three tactical categories. The F1-score increased from 0.3390 to 0.3922 for dribble hand-off, from 0.2632 to 0.4091 for pick-and-roll, and from 0.3396 to 0.5818 for horns offense. These findings suggest that LSTM-based temporal modeling contributed additional recognition value beyond a straightforward rule-based implementation. However, the absolute F1-scores remained limited, indicating that the current workflow still lacks sufficient reliability for fully automated tactical analysis.
Consistent with these quantitative findings, the coach-rated clip quality of the automated system remained markedly lower than that of manual clipping. This indicates that, although the system can rapidly generate candidate tactical clips, it cannot yet replace expert manual video analysis. At the current stage, the system is more suitable as an auxiliary tool for preliminary candidate clip generation, after which coaches or video analysts can review, correct, and interpret the generated clips before practical coaching use.
Overall, this study provides a practical validation of automated tactical clipping in basketball game video analysis. The findings support the feasibility of using automated workflows to improve clipping efficiency while also demonstrating the need to enhance tactical recognition reliability, reduce false detections and missed events, improve coaching-oriented clip usability, and establish more standardized benchmark settings for future evaluation before such systems can be more widely adopted in high-level coaching practice.