4.5.2. Ablation Experiment for BODY-ST-GCN
In this section, ablation experiments are designed to demonstrate the effectiveness of BODY-ST-GCN. Building upon the graph structure variants introduced above, four models are constructed based on different improvement strategies to verify the necessity of each component. The baseline ST-GCN is first considered as a reference model without any modifications. In addition, the TGF-ST-GCN, which has been described in the previous subsection, is included to further evaluate the contribution of graph structure fusion within the overall framework. Furthermore, a TSP-ST-GCN model is constructed by replacing the original TCN with the proposed TSP-TCN, aiming to verify the effectiveness of multi-scale temporal modeling. Finally, the proposed BODY-ST-GCN integrates both the TGF strategy in the GCN component and the TSP-TCN in the temporal component, so as to simultaneously enhance spatial and temporal modeling capabilities and achieve better performance in detecting OOW behaviors.
To clearly assess the learning and discrimination capability of each model, four ablation variants were evaluated on the test set of the 10-Behaviors dataset, and the high-dimensional features of all behavior classes were visualized using t-SNE.
Figure 17 presents the t-SNE feature distributions for each model: (a) baseline ST-GCN, (b) TGF-ST-GCN, (c) TSP-ST-GCN, and (d) BODY-ST-GCN. As shown in Figure 17a, the baseline ST-GCN forms clusters for most classes but exhibits loose intra-class distributions and a considerable number of outliers, particularly for “drop”, “fallingdown”, and “handwaving”. With enhanced spatial modeling, TGF-ST-GCN improves the compactness of most class clusters. TSP-ST-GCN strengthens temporal modeling and yields clearer class boundaries and fewer outliers than ST-GCN. Combining both spatial and temporal enhancement strategies, BODY-ST-GCN achieves the most compact intra-class clustering and the most distinct inter-class separation, with the fewest outliers across all behavior categories, indicating a more stable and discriminative feature space.
To further analyze the discriminative ability of each model, misclassified samples in the test set were marked with red crosses, as shown in Figure 18. Figure 18a–d correspond to ST-GCN, TGF-ST-GCN, TSP-ST-GCN, and BODY-ST-GCN, respectively. The baseline ST-GCN misclassified 81 samples, mainly from the “drop”, “fighting”, “handwaving”, “sitdown”, “staggering”, and “walking” classes. In contrast, all enhanced variants significantly reduced the number of misclassified samples. Benefiting from improved spatial and temporal modeling, TGF-ST-GCN and TSP-ST-GCN reduced the misclassification counts to 66 and 59, respectively. Finally, although BODY-ST-GCN still shows room for improvement in distinguishing “walking” from “drop”, it achieves the lowest number of misclassified samples among all models, further confirming its superior discriminative capability across behavior classes.
Table 7 presents the performance of the ablation models on the test set of the 10-Behaviors dataset. The results show that all models incorporating the proposed enhancements outperform the baseline ST-GCN to varying degrees. Specifically, BODY-ST-GCN, which integrates both the TGF and TSP strategies and strengthens spatial and temporal modeling simultaneously, achieves the highest values on all three reported classification metrics. Compared with ST-GCN, these three metrics improve by 6.4%, 6.6%, and 6.8%, respectively. Likewise, TGF-ST-GCN and TSP-ST-GCN, which introduce only spatial or temporal enhancement, also achieve noticeable improvements over the baseline.
From the perspective of computational efficiency, ST-GCN exhibits the lowest , yet its is lower than those of the enhanced models. TGF-ST-GCN attains the highest , primarily because TGF replaces the sparse matrix multiplication used in ST-GCN with adaptive adjacency learning and trainable node embeddings, leading to more regularized computations during inference. In addition, the G-MHSA module in TGF performs averaging in the temporal dimension, further reducing computational cost. The TSP strategy replaces the original large-kernel TCN in ST-GCN with parallel small-kernel branches, which not only lowers the computational burden but also improves GPU utilization, resulting in higher despite slightly increased . Although BODY-ST-GCN has a lower than TGF-ST-GCN, it consistently achieves superior macro-level performance across all evaluation metrics.
Figure 19 presents the confusion matrices on the 10-Behaviors dataset. As shown, the baseline ST-GCN achieves relatively high recognition rates for behaviors such as “sitdown”, “standup”, and “walking”, but exhibits considerable confusion among several classes. For example, 9.66% and 8.52% of “drop” samples are misclassified as “clapping” and “fallingdown”, while 1.13%, 2.26%, and 6.21% of “jumpup” samples are misclassified as “drop”, “handwaving”, and “standup”, respectively. With enhanced spatial modeling, TGF-ST-GCN significantly reduces misclassification across most behavior categories compared with the baseline. The TSP strategy enables the model to better adapt to actions with different temporal scales, leading to marked performance improvements for medium-duration behaviors such as “fallingdown” and “sitdown”, as well as short-duration behaviors such as “fighting”, demonstrating the effectiveness of TSP in strengthening temporal modeling. By jointly enhancing spatial and temporal modeling, BODY-ST-GCN achieves the most accurate and balanced classification performance. For the five key behaviors emphasized in this paper, namely “fallingdown”, “fighting”, “sitdown”, “standup”, and “walking”, BODY-ST-GCN attains classification accuracies exceeding 98%, consistently outperforming the models that incorporate only a single enhancement strategy. Although its accuracy for “sitdown” is slightly lower than that of the baseline, BODY-ST-GCN demonstrates superior and more stable recognition capability across all other behaviors, resulting in overall performance that far surpasses the baseline.
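The percentages quoted from the confusion matrices, the misclassification counts, and macro-level scores can all be derived from one matrix of raw counts. The following is a minimal sketch of that bookkeeping, assuming rows are true classes and columns are predicted classes, and assuming macro-averaged precision, recall, and F1 as the summary scores (the metric symbols are not reproduced in this section). The function names and the 3-class toy matrix are hypothetical and are not taken from the paper's results.

```python
from typing import List, Tuple

Matrix = List[List[int]]  # rows: true class, columns: predicted class (assumed layout)


def misclassified_count(cm: Matrix) -> int:
    """Total number of off-diagonal samples, i.e. wrong predictions."""
    return sum(v for r, row in enumerate(cm) for c, v in enumerate(row) if r != c)


def row_percentages(cm: Matrix) -> List[List[float]]:
    """Express each row as percentages of that true class (rows sum to 100)."""
    out = []
    for row in cm:
        total = sum(row)
        out.append([100.0 * v / total if total else 0.0 for v in row])
    return out


def macro_scores(cm: Matrix) -> Tuple[float, float, float]:
    """Unweighted mean of per-class precision, recall, and F1."""
    n = len(cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])                       # samples whose true class is k
        predicted = sum(cm[r][k] for r in range(n))  # samples predicted as k
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


if __name__ == "__main__":
    # Toy counts for three made-up classes, for illustration only.
    toy = [[8, 1, 1],
           [0, 10, 0],
           [2, 0, 8]]
    print(misclassified_count(toy))
    print(row_percentages(toy))
    print(macro_scores(toy))
```

Row percentages correspond to the per-class rates read off a normalized confusion matrix, while the off-diagonal total corresponds to the number of red crosses in a misclassification plot.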
The complete BODY-ST-GCN architecture is a hierarchical model composed of ten ST-GCN blocks. To more precisely determine the optimal insertion position of the TGF module, we construct a set of model variants that differ only in the location where TGF is integrated. The backbone can be conceptually divided into three stages. Blocks 1–4 form the shallow stage with 64 channels and are primarily responsible for extracting low-level action semantics. After the first spatiotemporal downsampling, Blocks 5–7 constitute the intermediate stage with 128 channels, where local semantic relationships are progressively aggregated. Following the second spatiotemporal downsampling, Blocks 8–10 comprise the deep stage with 256 channels, capturing high-level and global action semantics. Based on this hierarchical structure, six representative insertion positions are selected: Blocks 2 and 4 in the shallow stage, Blocks 5 and 7 in the intermediate stage, and Blocks 8 and 10 in the deep stage. These positions span the progression from low-level to high-level semantic learning and cover both spatiotemporal downsampling transitions within the network.
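The stage layout and the six candidate insertion positions described above can be written down as a small configuration. This is only an illustrative sketch: the `Stage` class, `stage_of`, and `tgf_variants` are hypothetical names, and the code encodes nothing beyond the block ranges, channel widths, and candidate blocks stated in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    first_block: int  # inclusive, 1-indexed
    last_block: int   # inclusive, 1-indexed
    channels: int


# Three stages of the ten-block backbone, as described in the text.
STAGES: Tuple[Stage, ...] = (
    Stage("shallow", 1, 4, 64),
    Stage("intermediate", 5, 7, 128),
    Stage("deep", 8, 10, 256),
)

# Candidate blocks at which the TGF module is inserted, one variant per block.
CANDIDATE_POSITIONS: Tuple[int, ...] = (2, 4, 5, 7, 8, 10)


def stage_of(block: int) -> Stage:
    """Return the stage that contains the given 1-indexed block."""
    for stage in STAGES:
        if stage.first_block <= block <= stage.last_block:
            return stage
    raise ValueError(f"block {block} is outside the 10-block backbone")


def tgf_variants() -> List[Tuple[int, str, int]]:
    """List (block, stage name, channel width) for every candidate variant."""
    return [(b, stage_of(b).name, stage_of(b).channels) for b in CANDIDATE_POSITIONS]


if __name__ == "__main__":
    for block, name, channels in tgf_variants():
        print(f"TGF at block {block}: {name} stage, {channels} channels")
```

Each stage contributes exactly two candidates, so the variants differ only in insertion depth and in the channel width seen by the inserted module.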
Table 8 summarizes the performance of the models with TGF inserted at different locations. As shown, the insertion position produces negligible changes in computational complexity, and all model variants sustain inference speeds of over 60
, indicating that real-time capability remains largely unaffected. However, the recognition accuracy varies markedly with the insertion position. The best values of the three classification metrics are achieved when TGF is inserted at Block 4, followed by insertion at Block 10. In contrast, inserting TGF too early (for example, at Block 2) or placing it within the intermediate stage leads to noticeable performance degradation. In summary, Block 4 is selected as the optimal insertion position for TGF. Although this configuration does not yield the highest
, it provides the best overall recognition accuracy, making it the most effective choice for the proposed architecture.
To evaluate the impact of convolutional kernel scales on temporal feature modeling, we take the single-scale convolution (kernel size = 9) used in ST-GCN as the baseline and construct several TCN variants with different temporal scales and kernel sizes within the BODY-ST-GCN framework. First, under the single-scale setting, we set the kernel size to 1 and to 5 to examine the effect of short and medium temporal windows on model performance. Then, we design dual-scale (1, 5) and tri-scale (1, 3, 5) structures to capture temporal dependencies at different scales simultaneously. Finally, these configurations are compared with the proposed TSP structure (3, 5, 7) to comprehensively assess its effectiveness.
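The idea of parallel temporal branches with different kernel sizes can be illustrated on a single one-dimensional sequence. The sketch below is not the paper's TSP-TCN: it assumes uniform averaging kernels in place of learned weights, zero padding so that every branch preserves the sequence length, and an element-wise mean as the fusion step, none of which is specified in this section. All function names are hypothetical.

```python
from typing import Dict, List, Sequence


def temporal_conv_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1D cross-correlation along time with zero padding; output length equals input length."""
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel length must be odd for symmetric padding")
    pad = k // 2
    padded = [0.0] * pad + [float(x) for x in signal] + [0.0] * pad
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(signal))
    ]


def multi_scale_branches(
    signal: Sequence[float], kernel_sizes: Sequence[int] = (3, 5, 7)
) -> Dict[int, List[float]]:
    """Run one branch per kernel size, using a uniform kernel as a stand-in for learned weights."""
    return {k: temporal_conv_same(signal, [1.0 / k] * k) for k in kernel_sizes}


def fuse_mean(branches: Dict[int, List[float]]) -> List[float]:
    """Fuse branch outputs by element-wise mean (an assumed fusion rule)."""
    outputs = list(branches.values())
    length = len(outputs[0])
    return [sum(out[t] for out in outputs) / len(outputs) for t in range(length)]


if __name__ == "__main__":
    impulse = [0, 0, 0, 1, 0, 0, 0]  # a single event at the middle frame
    branches = multi_scale_branches(impulse)
    for k, out in branches.items():
        print(k, [round(v, 3) for v in out])
    print([round(v, 3) for v in fuse_mean(branches)])
```

Feeding an impulse through the branches shows the temporal window each kernel size covers: the response spreads over 3, 5, and 7 frames, respectively, which is the sense in which a (3, 5, 7) configuration observes short, medium, and longer motion patterns in parallel.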
The ablation results are summarized in Table 9. They show that, although the single-scale models with short and medium temporal windows achieve higher real-time performance, their scores on the three classification metrics are significantly lower than those of the multi-scale models. The dual-scale model yields a certain performance gain, but still lags behind the baseline model with a long temporal window in terms of both accuracy and real-time performance. In contrast, the proposed TSP structure maintains competitive real-time performance while improving the three metrics by 2.5%, 2.6%, and 2.6%, respectively, over the baseline single-scale long-window model. Moreover, the TSP-based model exhibits markedly better real-time performance than the alternative tri-scale (1, 3, 5) configuration.
Figure 20a–e illustrate the detection results of the ablation models as a person transitions from standing to falling. Between Figure 20b and Figure 20c, the person exhibits motion blur, with a posture resembling the “sitdown” action. The baseline model, ST-GCN, misclassifies this phase as “sitdown,” only correctly identifying the “fallingdown” behavior between Figure 20d and Figure 20e, when the individual assumes a complete falling posture. In contrast, TGF-ST-GCN, TSP-ST-GCN, and BODY-ST-GCN avoid this misclassification. Notably, BODY-ST-GCN recognizes the person’s behavior as “staggering” in Figure 20b, demonstrating its ability to capture subtle movements, and continues to accurately detect “fallingdown” in the subsequent frames. Moreover, the confidence score gradually increases from 0.894 to 0.952 as the falling posture becomes more stable, reaching the highest value among all compared models.
Figure 21a–e present the detection results of the ablation models during the transition from “standup” to “sitdown”. In Figure 21a, all models detect the person’s behavior as “standup.” However, as time progresses and the person’s posture changes, the baseline model, ST-GCN, incorrectly classifies the “sitdown” behavior as “jumpup” in Figure 21c. In contrast, the other models, which incorporate different improvement strategies, successfully detect this transitional phase. While the person remains seated, TGF-ST-GCN and TSP-ST-GCN maintain high detection confidence but exhibit fluctuations due to minor changes in the seated posture. Meanwhile, BODY-ST-GCN not only avoids misclassification of the subject’s behavior but also achieves consistently high and stable confidence in recognizing the “sitdown” action. As shown in Figure 21d,e, when the seated posture becomes stable, the confidence scores for “sitdown” reach 0.934 and 0.941, respectively, which are the highest among all ablation models.
Figure 22a–e show the performance of the ablation models in detecting the “fighting” behavior. “Fighting” is a relatively complex interpersonal interaction, with most body movements concentrated in the arm region. As a result, the baseline model, ST-GCN, initially fails to detect this behavior accurately, misclassifying it as “standup.” Only as the person’s posture evolves does ST-GCN begin to correctly detect the “fighting” behavior. In contrast, TGF-ST-GCN accurately detects the “fighting” behavior in most frames, although it makes a misclassification in Figure 22c. TSP-ST-GCN captures more comprehensive features over longer time scales, but due to its limited spatial feature modeling ability, similar to ST-GCN, it misclassifies the behavior as “walking” in Figure 22a, before correctly identifying “fighting” in subsequent frames. BODY-ST-GCN, on the other hand, consistently detects the “fighting” behavior accurately across all frames. Although its detection confidence is lower in Figure 22a,b, it maintains a confidence level above 0.85 in subsequent frames. The detection performance of all models improves in the later frames, so the key differences are mainly reflected in early-stage recognition capability and confidence levels. Therefore, BODY-ST-GCN demonstrates superior robustness and discriminative ability in handling complex interactive behaviors such as “fighting”.
4.5.3. Ablation Experiments for FACE-ST-GCN
In this section, ablation experiments are designed to demonstrate the effectiveness of FACE-ST-GCN. Similarly to the ablation experiments conducted for BODY-ST-GCN, four deep learning models are constructed based on the different improvement strategies incorporated into FACE-ST-GCN, in order to validate the effectiveness of each strategy. ST-GCN is used as the baseline model without any improvement strategies, serving as a reference for performance comparison with the subsequent improved models. In addition, the TGF-ST-GCN, as described in the previous subsection, is included to evaluate the contribution of the TGF strategy within the facial feature recognition framework. Furthermore, a TAM-ST-GCN model is constructed by incorporating the TAM solely into the TCN component of the baseline model, aiming to assess whether the introduction of TAM can enhance the detection capability for different facial features. Finally, the proposed FACE-ST-GCN integrates both the TGF strategy and the TAM into the baseline model. By combining the advantages of these two strategies, this model is designed to enhance spatiotemporal modeling capabilities and further improve recognition performance for facial features.
Figure 23 shows the t-SNE visualizations of high-dimensional behavior features extracted by the four models on the test set of the Face-Normal dataset. Specifically, Figure 23a corresponds to the baseline ST-GCN, Figure 23b to TGF-ST-GCN, Figure 23c to TAM-ST-GCN, and Figure 23d to the proposed FACE-ST-GCN. As observed, the feature distributions of the four behavior classes in ST-GCN are the most dispersed, indicating limited discriminative capability. In contrast, both TGF-ST-GCN and TAM-ST-GCN exhibit improved feature compactness, with more evident clustering, particularly for the “closeeyes_yawn” and “closeeyes” classes. For FACE-ST-GCN, although a small number of outliers remain, the intra-class compactness is significantly enhanced compared with the other models, and the clusters corresponding to the “closeeyes_yawn” and “yawn” behaviors are more regular in shape with clearer inter-class boundaries.
Misclassified samples are marked with red crosses, and the misclassification results of the four ablation models are illustrated in Figure 24. Figure 24a corresponds to the baseline ST-GCN, which exhibits the largest number of misclassified samples, mainly concentrated in the “yawn,” “normal,” and “closeeyes” behavior classes, indicating that the original ST-GCN has limited capability in capturing high-dimensional facial behavior features and thus suffers from inferior classification performance. In comparison, TGF-ST-GCN and TAM-ST-GCN, shown in Figure 24b and Figure 24c, respectively, achieve better results, with 53 and 67 misclassified samples, validating the effectiveness of the TGF strategy and the TAM in improving classification performance. As shown in Figure 24d, the proposed FACE-ST-GCN achieves the best performance, further reducing the number of misclassified samples to 37, the lowest among all ablation models. These results demonstrate the complementary advantages of integrating the TGF strategy with the TAM, which jointly enhance the accuracy of facial behavior recognition.
Table 10 reports the performance comparison of the four ablation models. As shown in the table, although the models incorporating the proposed improvement strategies incur a slight reduction in real-time performance compared with the baseline ST-GCN, they achieve substantial gains in classification accuracy. Specifically, TGF-ST-GCN and TAM-ST-GCN enhance the original model from the spatial and temporal modeling perspectives, respectively, yielding improvements of approximately 10% in each of the three classification metrics. Building upon these results, FACE-ST-GCN integrates the advantages of both strategies and achieves the best overall performance, with improvements of 14.9%, 14.3%, and 14.7% in the three metrics over ST-GCN, respectively. These results demonstrate the effectiveness of the proposed method in improving classification performance.
Figure 25 presents the confusion matrices of the compared models. As can be observed, the baseline ST-GCN shows limited capability in distinguishing the four behavior classes, with particularly high misclassification rates for the “normal” and “yawn” behaviors. Specifically, 9.63% and 8.02% of the “normal” samples are misclassified as “closeeyes” and “yawn,” respectively, while 1.61% and 24.19% of the “yawn” samples are misclassified as “closeeyes” and “closeeyes_yawn.” These results indicate insufficient representation of key facial behavior features by the baseline model. After incorporating the TGF strategy, TGF-ST-GCN achieves a clear improvement in overall recognition accuracy; however, it still exhibits limited discriminative capability for long-duration behaviors such as “yawn.” By contrast, TAM-ST-GCN enhances temporal modeling through the TAM, enabling 82.80% of the “yawn” samples to be correctly recognized, although 16.67% are still misclassified as “closeeyes_yawn.” Overall, FACE-ST-GCN, which integrates the advantages of both TGF and TAM, delivers the best performance, reducing the misclassification rate of the challenging “yawn” behavior to only 5.38%.
Figure 26a–e illustrate the detection results of the four ablation models for facial behavior recognition. Among them, the baseline ST-GCN exhibits limited capability in continuous detection of facial behavior variations. Although it achieves a relatively high confidence score for the “closeeyes_yawn” behavior in Figure 26c, its detection confidence across the remaining frames is significantly lower than that of the other models. Moreover, it incorrectly classifies the “normal” state as “closeeyes” in Figure 26d, indicating insufficient temporal modeling ability. In contrast, TGF-ST-GCN, TAM-ST-GCN, and FACE-ST-GCN consistently achieve accurate recognition of all four facial behaviors. Notably, after enhancing temporal modeling through the introduction of the TAM, TAM-ST-GCN demonstrates a substantially higher confidence in detecting the “yawn” behavior compared with TGF-ST-GCN, thereby validating the effectiveness of the TAM in capturing long-term temporal dynamics. FACE-ST-GCN achieves the best detection performance among all compared models. Throughout the facial state transitions, it produces no false detections while consistently maintaining the highest confidence scores among all models. These results further demonstrate that the proposed method effectively integrates the advantages of the TGF strategy and the TAM, thereby enabling accurate and robust facial behavior recognition.