1. Introduction
Imitation is a high-level cognitive expression, especially in the performing arts, and its significance goes far beyond action replication [1]. In the animal exercise course, students are required to use their bodies to convey the characteristics, rhythm, energy, and emotional state of animals [2]. This training is not merely physical exercise; it also builds the ability to express the life state of another being through action, which is expressive imitation.
Expressive imitation is not just about reproducing the action path; it also concerns whether the imitator grasps the animal's action style, sense of rhythm, shifts of the center of gravity, muscle tension, and even emotional atmosphere. It is a fusion of behavior, performance, and intention. Teachers can perceive whether this complex level of imitation succeeds while watching a performance. However, teacher evaluation is often highly subjective, the scoring criteria are unstructured, and there are problems of poor consistency and difficulty of quantification [3]. This makes it difficult for traditional methods to achieve stable and objective evaluation.
In recent years, computer vision and machine learning (ML) have made significant progress in the fields of action recognition, pose estimation, and 3D behavior modeling [4,5]. Through 3D pose estimation tools such as MediaPipe, the pose motion trajectories in imitation behavior can be captured accurately [6]. However, existing research focuses more on what actions have taken place and less on how to use structured pose features to quantitatively describe and evaluate the expressive quality of imitation actions.
This study aims to address the aforementioned issues by developing a machine learning and structured pose data-based auxiliary evaluation framework for imitation (ML-PSDAEF). The framework first attempts to extract core features from pose data and model the teacher’s scoring decision process. It further explores whether expressive imitation behaviors exhibit quantifiable structural patterns and ultimately evaluates the effectiveness and generalizability of the AI-assisted scoring system. The main contributions of this study include the following:
1. For the first time, animal imitation is used as an expressive modeling task, and a machine learning framework is introduced to achieve objective prediction of the quality of expressive imitation.
2. Multi-model comparison and leave-one-subject-out cross-validation (LOSO-CV) are used to systematically verify the score prediction ability and generalization stability of machine learning models.
3. Through comparative analysis of recursive feature elimination with cross-validation (RFECV) and feature importance ranking, stable and interpretable core structural features are identified to achieve efficient, objective, and interpretable improvements in expressive evaluation.
2. Related Work
2.1. Cognitive and Expressive Mechanisms of Imitation
Imitation occupies an important place in human cognitive development [7]. It is not only a way to learn skills but also a process of internalizing and re-expressing the external world. In performing arts, imitation requires not just accurate action reproduction but also the perception, deconstruction, and style reconstruction of another's actions. As such, imitation behavior is viewed as a highly abstract bodily expression mechanism involving situational awareness, emotional transmission, and style construction [8]. Although imitation has been studied extensively in psychology, education, and neuroscience, most of the literature remains at the cognitive or behavioral level, and work on structured modeling of action expressiveness and quality assessment is limited. From an artificial intelligence perspective, there is a clear lack of systematic methods for structured data modeling and prediction of expressive imitation behavior.
2.2. Action Modeling Techniques in Artificial Intelligence
In recent years, action modeling and recognition techniques in AI have made significant advances. Algorithms such as OpenPose, MediaPipe, AlphaPose, and DeepLabCut can efficiently extract 2D or 3D pose key points from video [9,10,11,12,13]. This structured coordinate data provides strong support for tasks such as action recognition, behavior analysis, and emotion recognition. By treating key point sequences as data, researchers can describe, classify, and compare actions in a data-driven way [14]. In domains such as sports, rehabilitation, and dance analysis, these techniques have been widely applied, yielding good results in action accuracy, efficiency, and temporal alignment [15]. Action recognition research mainly addresses which action was performed, for example, classifying jumps, waves, or sitting [16]. Some studies have progressed toward action quality assessment (AQA), aiming to evaluate technical performance in sports or rehabilitation training [17]. Such systems typically combine pose trajectories, temporal modeling, and deep feature extraction with expert scores for supervised learning. However, these frameworks focus on execution correctness or degree of completion and do not address the semantic dimension of action expressiveness. Moreover, Liu et al. [18] proposed a weight-aware multi-source unsupervised domain adaptation method that adaptively estimates the marginal distribution discrepancies between different source subjects and the target subject, effectively enhancing the accuracy of cross-subject human motion intention recognition and action imitation. Meanwhile, Afzal et al. [19] reviewed cutting-edge research on the integration of AI and biomimetic robotics in ecological conservation, exploring strategies for using animal-inspired robots to simulate, monitor, and intervene in wildlife behavior, thereby offering new technical pathways for animal imitation research and ecological monitoring.
2.3. Current Status and Limitations of Expressive Imitation Modeling
Expressive modeling has recently become a hot topic in AI research on art, music, and performance. For example, in dance generation and speech emotion recognition, studies have introduced variables such as rhythm, tension, and style. Yet in the field of pose-based action imitation, especially open-ended imitation, research on expressive modeling remains sparse. Existing systems typically assume a standard action template and measure similarity via dynamic time warping (DTW), angular differences, or Euclidean distance [20]. These methods suit tasks with clear rules, such as gymnastics scoring or rehabilitation detection, but they lack effective modeling capability for highly subjective expressive imitation [21]. Meanwhile, modeling of teachers' assessments remains immature. Although sports scoring studies predict scores from video or pose data, they rely on strict rule-based criteria rather than expressive or aesthetic dimensions. Teachers' assessments of expressive imitation depend heavily on experience and perception, and there is no systematic research on their consistency, predictability, or learnability as structured patterns. Moreover, the expressiveness of animal action imitation is closely tied to realism, stylistic features, and state quality, dimensions that resist static rule definition; thus, no mature system can yet judge expressive quality automatically.
Given that existing research remains insufficient in structuring and quantifying expressive imitation scores, this study extends expressive modeling to open imitation tasks and constructs a score prediction framework with pose features and machine learning at its core. Combining expert score consistency analysis using the Intraclass Correlation Coefficient (ICC) with the RFECV method, the system verifies the structuredness, predictability, and interpretability of expressive imitation scores [22].
3. Methodology
This study proposes an expressive imitation prediction process based on pose structure features, focusing on whether an AI model can accurately predict the expressive imitation score given by the teacher from the pose features of an action sequence, thereby realizing regression-based score prediction.
3.1. Machine Learning and Structured Pose Data-Based Auxiliary Evaluation Framework
The overall research process is illustrated in Figure 1 and consists of four stages: data collection and preprocessing, verification of teacher score consistency, motion feature quantification, and AI modeling and validation.
3.2. Data Collection and Evaluation Consistency Validation
In this experiment, we recruited 10 undergraduate students majoring in acting, with diverse genders, ages, and heights, and with varying levels of acting experience and practice duration. The indicators used in this experiment, such as angle error, temporal rhythm characteristics, and overall posture differences, were determined based on the indicators actually employed in animal training courses. In terms of hardware, the experiment employed a camera capturing 1920 × 1080 color video and 1280 × 720 depth video at 60 fps; recording took place in an indoor performance studio of approximately 8 × 10 m, equipped with adjustable overhead lights to ensure uniform illumination. A full blackout curtain was hung in the background to minimize environmental interference and enhance pose extraction accuracy. In the imitation task, two animal video segments were employed as control benchmarks for data collection and evaluation consistency validation.
Figure 2 presents typical video screenshots of students imitating animal behaviors, including performance clips and corresponding reference images of animal movements. In addition, five teachers with experience in stage performance instruction were invited to evaluate each video using a 10-point scale. The scoring criteria focused on the extent to which students successfully conveyed the style, rhythm, and expressive state of the target animal. Each video’s final score was calculated as the average of the independent scores from the five teachers. The distribution of scores across samples was relatively balanced and approximately normal.
We used the ICC under the two-way random effects model to conduct the consistency analysis. The results are shown in Table 1. The ICC(2, k) value was 0.889, indicating a high level of agreement among raters. This suggests that the scoring system is both learnable and suitable for modeling. Accordingly, we define the average teacher rating for each video as the "true score" and use it as the ground-truth label for machine learning.
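To make the consistency check concrete, the following minimal Python sketch computes ICC(2, k) from a videos × raters score matrix via the standard two-way ANOVA decomposition (Shrout and Fleiss); the function name and array layout are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def icc2k(ratings: np.ndarray) -> float:
    """ICC(2, k): two-way random effects, average of k raters, absolute agreement.

    ratings: (n_targets, k_raters) matrix of scores (videos x teachers).
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-video means
    col_means = ratings.mean(axis=0)   # per-rater means

    # Mean squares from the two-way ANOVA decomposition.
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_error = np.sum((ratings - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Shrout & Fleiss (1979): ICC(2, k) for average ratings.
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
```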
3.3. Pose Feature Extraction and Scoring Model Construction
This study uses MediaPipe as the pose data extraction tool to obtain the three-dimensional skeletal keypoint trajectories of students' imitation behaviors. MediaPipe is a cross-platform, real-time pose estimation framework provided by Google; it is efficient, stable, and scalable, accurately identifying the spatial positions of 33 skeletal keypoints of the human body, making it suitable for performance tasks with rich action details.
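As an illustration, a minimal extraction loop using MediaPipe's legacy `solutions.pose` API is sketched below; the input file name and the choice of world (metric) landmarks are our assumptions.

```python
import cv2
import mediapipe as mp
import numpy as np

# Extract 33 3D pose landmarks per frame with MediaPipe Pose.
pose = mp.solutions.pose.Pose(static_image_mode=False, model_complexity=2)

frames = []
cap = cv2.VideoCapture("imitation_clip.mp4")  # illustrative file name
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.pose_world_landmarks:
        # 33 landmarks, each with x/y/z coordinates.
        frames.append([(lm.x, lm.y, lm.z)
                       for lm in result.pose_world_landmarks.landmark])
cap.release()
pose.close()

sequence = np.asarray(frames)  # shape: (num_frames, 33, 3)
```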
In order to eliminate differences in individual body shape, position, and orientation, all keypoint coordinates are normalized with respect to the body core after extraction. Specifically, in each frame, the midpoint of the left and right hip joints is first computed, and all 33 joint points are translated so that this midpoint becomes the origin, eliminating positional differences; then, the distance between the left and right shoulder joints is computed, and all coordinates are scaled proportionally using this as the standard scale to correct deviations caused by individual body size.
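A minimal sketch of this normalization step, assuming a `(num_frames, 33, 3)` array and MediaPipe's standard landmark indices for the hips and shoulders:

```python
import numpy as np

# MediaPipe Pose landmark indices (official landmark map).
L_HIP, R_HIP, L_SHOULDER, R_SHOULDER = 23, 24, 11, 12

def normalize_skeleton(seq: np.ndarray) -> np.ndarray:
    """Per-frame normalization of a (num_frames, 33, 3) keypoint sequence:
    translate so the hip midpoint is the origin, then scale by shoulder width."""
    hip_center = (seq[:, L_HIP] + seq[:, R_HIP]) / 2.0              # (N, 3)
    centered = seq - hip_center[:, None, :]                          # remove position
    shoulder_width = np.linalg.norm(
        centered[:, L_SHOULDER] - centered[:, R_SHOULDER], axis=1)   # (N,)
    return centered / shoulder_width[:, None, None]                  # remove body scale
```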
After completing data standardization, a prediction model for expressive imitation scores was constructed. To evaluate the modeling capabilities of different algorithms, several representative regression methods were selected for comparative experiments. First, Ridge Regression was employed as the linear baseline model. It introduces an L2 regularization term to mitigate multicollinearity among features and is suitable for scenarios with high feature dimensionality or redundancy [23].
Considering the possibility of complex nonlinear relationships between the scores and pose features, Support Vector Regression (SVR) with a Radial Basis Function (RBF) kernel was adopted to enhance nonlinear fitting capacity, making it suitable for small sample sizes and high-dimensional data [24,25]. In addition, Random Forest regression was used to model nonlinear interactions between features [26]. This ensemble-based method constructs multiple decision trees and demonstrates strong robustness and feature selection capability.
Finally, gradient boosting was introduced to improve overall prediction performance by iteratively optimizing residuals, which is particularly well suited to high-precision modeling tasks under limited-data conditions [27].
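For concreteness, the four regressors can be instantiated in scikit-learn roughly as follows; the hyperparameter values shown are illustrative defaults, not the configurations reported in Table 3.

```python
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Linear and kernel models benefit from feature standardization;
# tree ensembles are scale-invariant and used directly.
models = {
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "SVR (RBF)": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0)),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}
```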
3.4. Pose Feature Engineering
3.4.1. Baseline Modeling and Feature Ablation Design
In all formulas, we use the following unified mathematical notation:
- $S$: the student's skeleton sequence.
- $T$: the template's skeleton sequence.
- $N$: number of frames in the student's imitation sequence.
- $M$: number of frames in the template sequence.
- $\mathbf{p}_i^j$: three-dimensional spatial coordinate vector of the $j$-th body joint at frame $i$.
- $\|\mathbf{v}\|$: the L2 norm of vector $\mathbf{v}$, i.e., the Euclidean distance.
- $(a_k, b_k)$: the $k$-th aligned frame-index pair as determined by the DTW algorithm.
In the base feature set, we define eight local joint error angle features to quantify how accurately the imitator's joint flexions at key limbs (left/right elbows, knees, shoulders, and hips) reproduce the target animal's posture. These handcrafted features are variables manually defined based on domain knowledge (e.g., performance theory, biomechanics) to capture meaningful aspects of action execution. Specifically, each joint angle is computed from three points forming an angle, as shown in Equation (1), where $\mathbf{p}_b$ denotes the angle's vertex:

$$\theta = \arccos\left(\frac{(\mathbf{p}_a - \mathbf{p}_b) \cdot (\mathbf{p}_c - \mathbf{p}_b)}{\|\mathbf{p}_a - \mathbf{p}_b\|\,\|\mathbf{p}_c - \mathbf{p}_b\|}\right) \quad (1)$$
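A direct implementation of Equation (1), with a small epsilon and clipping added for numerical safety (our additions):

```python
import numpy as np

def joint_angle(p_a: np.ndarray, p_b: np.ndarray, p_c: np.ndarray) -> float:
    """Angle at vertex p_b formed by points p_a, p_b, p_c (Equation (1))."""
    u, v = p_a - p_b, p_c - p_b
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
```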
To further evaluate the dynamic reproduction between the imitation and the template actions, we compute, based on the temporal alignment results, the mean joint angle difference over all aligned frame pairs. We define this as the error angle, as shown in Equation (2), where $K$ is the number of aligned pairs and $\theta^{S}$, $\theta^{T}$ denote the corresponding joint angles in the student and template sequences:

$$E_{\text{angle}} = \frac{1}{K}\sum_{k=1}^{K}\left|\theta^{S}_{a_k} - \theta^{T}_{b_k}\right| \quad (2)$$

This metric reflects the precision of local dynamic imitation; smaller values indicate closer alignment with the target.
Additionally, we introduce three temporal–rhythmic features: the student's imitation frame count $N$, the frame count difference $N - M$ from the reference sequence length $M$, and their ratio $N/M$. These features quantify how well the imitation's rhythm matches the temporal structure of the original movement and are sensitive to deviations such as moving too quickly or too slowly.
To measure the overall spatial–structural alignment between the imitation and the target action, we define two global pose-difference metrics: the dynamic time warping cost (DTW cost) and the aligned pose error ($E_{\text{pose}}$). DTW aligns pose sequences of different lengths, with its cumulative cost defined by the recurrence relation in Equation (3), where $d(i, j)$ is the pose distance between frame $i$ of the student sequence and frame $j$ of the template:

$$D(i, j) = d(i, j) + \min\{D(i-1, j),\; D(i, j-1),\; D(i-1, j-1)\} \quad (3)$$

The terminal value $D(N, M)$ represents the global matching cost of the entire imitation in the temporal dimension.
Based on the temporal alignment, we compute the Euclidean distance between corresponding keypoints in each pair of aligned frames and then average these distances to obtain the aligned pose error, as shown in Equation (4), where $\mathbf{p}^{S}$ and $\mathbf{p}^{T}$ denote student and template joints and $J = 33$ is the number of keypoints:

$$E_{\text{pose}} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{J}\sum_{j=1}^{J}\left\|\mathbf{p}^{S,j}_{a_k} - \mathbf{p}^{T,j}_{b_k}\right\| \quad (4)$$

This metric reflects how closely the overall contour matches the target animal's morphology.
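The following sketch implements the DTW recurrence of Equation (3) and the aligned pose error of Equation (4), assuming `(frames, 33, 3)` arrays and mean per-joint Euclidean distance as the frame-level cost; the backtracking step that recovers the pairs $(a_k, b_k)$ is a standard addition not spelled out in the text.

```python
import numpy as np

def dtw_align(student: np.ndarray, template: np.ndarray):
    """DTW over two (frames, 33, 3) sequences.

    Returns the cumulative cost D(N, M) (Equation (3)) and the aligned
    frame-index pairs (a_k, b_k) recovered by backtracking.
    """
    N, M = len(student), len(template)
    d = np.array([[np.mean(np.linalg.norm(s - t, axis=1)) for t in template]
                  for s in student])            # frame-pair pose distances
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1],
                                            D[i - 1, j - 1])
    # Backtrack to recover the warping path.
    path, i, j = [], N, M
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return D[N, M], path[::-1]

def aligned_pose_error(student, template, path):
    """Mean per-joint Euclidean distance over aligned frames (Equation (4))."""
    return float(np.mean([np.mean(np.linalg.norm(student[a] - template[b], axis=1))
                          for a, b in path]))
```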
The Basic Feature Set captures three structural dimensions of imitation behavior, namely local dynamic precision, temporal rhythm control, and global morphological reconstruction. These features provide foundational expressive information for subsequent modeling. To assess the actual contribution of rhythm features in score prediction, an Ablation Feature Set was constructed by removing the temporal rhythm features from the original Basic Feature Set, while retaining only the local joint angle error and global pose difference metrics. This configuration helps reveal the significance of rhythm features in expressive understanding and predictive modeling.
3.4.2. Enhanced Experiments Based on Baseline
To further capture the expressive dimension of action, we developed an Augmented Feature Set by incorporating five descriptive features into the Basic Feature Set (a computational sketch follows the list). Specifically, these features include:
Mean velocity, which measures the average movement speed of keypoints across the entire motion sequence, reflecting the overall activity level;
Mean acceleration, which captures the rate of change of velocity over time, indirectly indicating the smoothness of action rhythm;
Impact intensity, which describes the magnitude of acceleration variation, used to evaluate action explosiveness and energy intensity;
Pose symmetry error, which evaluates execution symmetry and body balance by comparing horizontal coordinate differences of paired keypoints on both sides of the body;
Velocity symmetry error, which quantifies the velocity consistency between the left and right sides, reflecting movement coordination.
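A minimal sketch of these five descriptors, assuming a normalized `(N, 33, 3)` sequence; the limb-pair indices, the jerk-based definition of impact intensity, and the mirrored-x formulation of pose symmetry are our assumptions about plausible implementations:

```python
import numpy as np

LEFT  = [11, 13, 15, 23, 25, 27]   # left shoulder/elbow/wrist/hip/knee/ankle
RIGHT = [12, 14, 16, 24, 26, 28]   # right-side counterparts

def expressive_features(seq: np.ndarray, fps: float = 60.0) -> dict:
    vel = np.diff(seq, axis=0) * fps    # per-joint velocity
    acc = np.diff(vel, axis=0) * fps    # per-joint acceleration
    jerk = np.diff(acc, axis=0) * fps   # rate of change of acceleration
    return {
        "mean_velocity": np.linalg.norm(vel, axis=2).mean(),
        "mean_acceleration": np.linalg.norm(acc, axis=2).mean(),
        "impact_intensity": np.linalg.norm(jerk, axis=2).mean(),
        # After hip-centering, left x mirrors right x, so |x_L + x_R|
        # measures left-right positional asymmetry.
        "pose_symmetry_error": np.abs(seq[:, LEFT, 0] + seq[:, RIGHT, 0]).mean(),
        "velocity_symmetry_error": np.abs(
            np.linalg.norm(vel[:, LEFT], axis=2)
            - np.linalg.norm(vel[:, RIGHT], axis=2)).mean(),
    }
```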
To validate the efficacy of each formula and parameter, Table 2 summarizes the corresponding technical implementations and literature references for the baseline parameters in existing studies.
3.4.3. Model Optimization Strategy Based on Feature Selection
To reduce the interference of redundant features and potential noise, RFECV was applied to identify the optimal feature subset. To further evaluate the relative contribution of each of the 18 features to the prediction task, an independent feature importance analysis was also conducted. Specifically, a Random Forest regressor was trained on the entire dataset, and its built-in feature importance scores were extracted using the impurity decrease mechanism. These methods jointly enhanced both the predictive performance and the interpretability of the model.
4. Experimental Results and Analysis
To systematically validate the modeling value and adaptability of pose motion features in expressive score prediction, in this section we conduct experiments on three different feature combinations.
4.1. Experimental Setup and Evaluation Strategy
In our experimental design, considering the significant variability among students in body shape, performance style, and imitation strategy, all models were trained and tested using LOSO-CV. In each fold, all imitation videos from one student were used as the test set, while the remaining data were used for training. This strategy effectively evaluates model stability when applied to a new individual and is a widely adopted method for robustness validation in behavioral modeling. The evaluation metrics include mean absolute error (MAE) and Spearman's rank correlation coefficient ($\rho$). The parameter configurations for all models are presented in Table 3.
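A compact sketch of the LOSO-CV protocol with both metrics, using scikit-learn's `LeaveOneGroupOut`; pooling predictions across folds before scoring (rather than averaging per-fold scores) is our assumption.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneGroupOut

def loso_evaluate(model, X, y, student_ids):
    """Leave-one-subject-out CV: each fold holds out one student's videos."""
    preds = np.zeros_like(y, dtype=float)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=student_ids):
        model.fit(X[train_idx], y[train_idx])
        preds[test_idx] = model.predict(X[test_idx])
    rho, _ = spearmanr(y, preds)
    return mean_absolute_error(y, preds), rho
```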
4.2. Baseline Model Evaluation and Rhythm Feature Ablation Analysis
This experiment uses the complete Basic Feature Set to train four commonly used supervised regression models: Ridge Regression, Support Vector Regression (SVR with RBF kernel), Random Forest, and Gradient Boosting. The Ridge Regression results serve as the baseline. Table 4 shows the prediction performance of each model under the LOSO-CV framework.
The results indicate that the Ridge Regression model achieves the best performance across all evaluation metrics. Its Spearman correlation coefficient reaches 0.619, demonstrating that the model most closely approximates the ranking trend of the teacher's scores and exhibits strong trend modeling capability. The Random Forest model also performs comparably well, suggesting that nonlinear tree-based methods can capture structural patterns between features to a certain extent. In contrast, the performance of SVR and Gradient Boosting is relatively poor, possibly due to a mismatch between the RBF kernel and the underlying feature distribution.
4.3. Ablation Analysis of Duration Rhythm Features
To analyze the contribution of the duration rhythm feature set to model performance, we trained an ablation version of the model using only the local joint angle error and global posture difference features. The same four regression models were trained, and their performance changes were compared. The experimental results are shown in Table 5.
The results demonstrate that the duration rhythm features play a critical role in score prediction. After removing these features, the Spearman rank correlation coefficients of all models decreased significantly, in some cases approaching zero or even becoming negative. This indicates that the models can no longer effectively capture the score ranking trends, rendering the prediction performance nearly invalid. This effect is particularly evident in Ridge Regression and Support Vector Regression, suggesting that linear and kernel-based models are highly sensitive to rhythmic structure.
Moreover, an ablation study using only the kinematic feature subset shows that all models yield Spearman correlation coefficients near zero, suggesting that pose noise substantially degrades prediction performance. This finding underscores the inherent limitations of pose estimation algorithms based solely on RGB video, as their accuracy can be compromised by lighting variations, complex backgrounds, and partial occlusions.
4.4. Enhanced Feature Modeling Effect Evaluation
Table 6 reports the average Spearman's correlation coefficient and mean MAE of six regression models using the interaction feature set (rhythm × symmetry, 19 dimensions). The results indicate that Ridge Regression achieves the best performance in both correlation and error.
Based on the above experiments, this study further explored the impact of expressive features on the model’s predictive ability by constructing an Enhanced Feature Set containing 18 features. On top of the original Basic Feature Set, five additional indicators were introduced to reflect the tension, energy distribution, and spatial coordination of imitation movements. These additions were designed to better capture the deeper expressive dimension of students’ imitation behavior.
In addition, this experiment incorporated two mainstream ensemble learning models, XGBoost and LightGBM, to evaluate the adaptability and predictive stability of different model structures under high-dimensional feature conditions. All experiments were conducted using the LOSO-CV strategy to ensure result comparability and robustness. The experimental results are presented in Table 7.
Despite the introduction of more expressive features, the overall model performance did not significantly improve compared with the Basic Feature Set results. Among the tested models, Ridge Regression continued to perform best, with a mean absolute error (MAE) of 0.668 and a Spearman correlation coefficient of 0.606. These values remained high but slightly lower than those in the baseline setting. The newly introduced ensemble models, XGBoost and LightGBM, did not outperform traditional methods. This suggests that increasing the number of features does not necessarily enhance model performance and may introduce redundancy and noise. Under small-sample conditions, a high-dimensional feature space is more likely to cause overfitting, especially for complex nonlinear models.
4.5. Comparison of Modeling Performance Using Feature Selection Optimization
In the enhanced feature experiment, we introduced 18 posture-related variables to characterize the rhythm, morphology, and expressive aspects of the imitation action. However, the results show that some models, such as XGBoost and LightGBM, experienced performance degradation in the high-dimensional feature space. This suggests that the models may be affected by redundant information or collinearity among features.
Moreover, because the expressive dimension is inherently subjective and varies significantly across individuals, certain expressive features may introduce substantial noise, thereby reducing the stability of model predictions. To address these issues, this section introduces a feature selection mechanism aimed at identifying the core variables with the highest explanatory power and predictive contribution. This strategy is intended to improve both model performance and interpretability.
We use the RFECV method, with Ridge Regression as the base estimator, to iteratively remove feature variables that contribute the least to model prediction. The model adopts negative mean squared error (negative MSE) as the evaluation metric to identify the optimal subset of features.
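The selection step can be sketched as follows; `X`, `y`, `student_ids`, and `feature_names` are assumed from the preceding pipeline, and reusing the LOSO groups for the internal cross-validation is our assumption rather than a stated detail of the paper.

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut

# Ridge as the base estimator, negative MSE as the scoring metric.
selector = RFECV(
    estimator=Ridge(alpha=1.0),
    step=1,                                  # drop one feature per iteration
    cv=LeaveOneGroupOut().split(X, y, groups=student_ids),
    scoring="neg_mean_squared_error",
    min_features_to_select=1,
)
selector.fit(X, y)
print("optimal feature count:", selector.n_features_)
print("selected:", [f for f, keep in zip(feature_names, selector.support_) if keep])
```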
Figure 3 illustrates the cross-validation score trend as the number of selected features increases from 1 to 18. The results show that the model reaches its performance peak when only two features are used. Although the overall change in performance remains relatively stable beyond this point, no further significant improvement is observed. This suggests that the model’s expressive structure can be effectively represented in a low-dimensional feature space.
In addition, to intuitively evaluate the relative contribution of the 18 features in the prediction task, this study further adopts an independent feature importance analysis method. Specifically, a Random Forest regressor is trained using the entire dataset, and its built-in feature importance scores, which are based on the impurity decrease mechanism, are extracted. These scores quantify the relative contribution of each feature to the reduction of prediction error during the model’s decision-making process. A higher score indicates that the feature plays a more critical role in prediction.
Finally, all features are ranked and visualized according to their importance scores in order to identify the core variables that influence score prediction.
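A corresponding sketch of the impurity-based ranking; the tree count is an illustrative choice, and `X`, `y`, and `feature_names` are assumed from the earlier steps.

```python
from sklearn.ensemble import RandomForestRegressor

# Fit on the full dataset and rank features by impurity-based importance.
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X, y)
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda kv: kv[1], reverse=True)
for name, score in ranking:
    print(f"{name:28s} {score:.3f}")
```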
Figure 4 presents the importance ranking of all 18 features. It can be observed that the top three features are structural variables belonging to the duration rhythm feature category. In contrast, the expressive features appear near the bottom of the ranking, which suggests either high noise interference or low model sensitivity in this task. The global posture difference feature ranks in the middle and provides some auxiliary value for modeling.
Based on the ranking results, we selected the automatically screened features, num_frames and frame_ratio, along with the top three features ranked by feature importance, to build alternative score prediction models. The results are summarized in Table 8, which compares the Spearman rank correlation of different models across the two types of feature subsets.
It can be observed that fine-tuning these feature subsets leads to optimal results, with Ridge Regression achieving the highest Spearman correlation of 0.7297. This performance significantly surpasses that of the full feature set models. The duration rhythm features and postural structure variables demonstrate a high degree of score predictability, making them valuable indicators for expressive imitation assessment. These findings highlight the importance of effective feature selection and low-dimensional modeling in enhancing the stability and interpretability of AI-based scoring systems.
To validate the statistical significance of these performance differences, we conducted a Wilcoxon signed-rank test on the prediction errors of the optimal 2-feature model versus the 13-feature baseline model. Table 9 shows that the performance improvement of the 2-feature model over the baseline model is statistically significant (p < 0.01).
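For reference, the test can be reproduced with SciPy as sketched below; `err_2feat` and `err_13feat` are hypothetical arrays of per-video absolute prediction errors for the two models.

```python
from scipy.stats import wilcoxon

# One-sided paired test: does the 2-feature model yield smaller errors?
stat, p_value = wilcoxon(err_2feat, err_13feat, alternative="less")
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```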
5. Discussion
The effective prediction performance of machine learning indicates that, despite the inherent subjectivity in evaluations by multiple teachers, expressive imitation ratings are not purely based on artistic intuition. Instead, they exhibit clear structural patterns and learnable features. This finding provides both theoretical support and a practical foundation for the development of AI-assisted objective scoring tools in the field of arts education. A central innovation of this work is the incorporation of a semantic expressiveness dimension into action evaluation. By extracting high-level pose and temporal descriptors that quantify intent and emotion, we enable quantitative analysis of how effectively participants convey the underlying meaning of each movement.
To address the issues of excessive subjectivity and the lack of structured criteria in expressive imitation assessment, this study proposes an auxiliary scoring framework based on machine learning and structured pose data. Experimental results show that teacher ratings can be effectively predicted across multiple models. In particular, the Ridge Regression model, trained with the Basic Feature Set, achieved a Spearman rank correlation of up to 0.619. This suggests that the subjective perception of “likeness” in imitation is closely associated with certain structural patterns embedded in the pose data. Although expressive imitation involves emotion, style, and internal state, it also follows quantifiable rules at the level of physical movement.
The results from the feature ablation experiments underscore the critical role of duration rhythm features in modeling expressive scores. When these features were removed, prediction performance dropped significantly, with some models showing near-zero or even negative correlations. This implies that, in the absence of temporal rhythm information, the models lose their ability to perceive the ranking trend of scores. Furthermore, both the feature importance ranking and the RFECV results consistently identified rhythm length and structural alignment as top predictive features. These findings suggest that rhythm is not merely auxiliary timing information, but a core structural dimension in determining imitation quality.
Although this study incorporated additional expressive features such as tension, motion symmetry, and impact intensity, the overall performance did not exceed that of the Basic Feature Set. In particular, ensemble models like XGBoost and LightGBM demonstrated signs of overfitting. The feature importance analysis revealed that these expressive indicators were ranked relatively low, indicating that their contribution may be hindered by subjectivity and individual differences. This implies that adding high-dimensional features in small-sample settings does not necessarily enhance generalization and may instead introduce redundancy and reduce model robustness and interpretability.
Further analysis using RFECV and feature importance scores demonstrates that even when only the top two or three features are retained, several models are still able to maintain or even improve their predictive accuracy. This supports the feasibility of using compact and interpretable feature subsets for expressive score modeling. Such configurations not only reduce computational complexity but also enhance the potential for real-time feedback and classroom deployment. Moreover, the structural composition of the selected features provides a foundation for future work on constructing more semantically rich and generalizable expressive labels.
Because the model is closely aligned with the scoring method normally used in class, teachers can use our prediction results as a reference for their evaluations. Moreover, students can use the proposed model to self-assess their imitation behaviors, helping them optimize key expressive dimensions such as rhythm control and body tension.
6. Conclusions
This study proposed the ML-PSDAEF to assess expressive imitation quality based solely on temporal pose features. The results preliminarily validate the learning potential of artificial intelligence in imitation scoring tasks. The framework introduces a pose data modeling approach with strong interpretability and structural clarity, offering both theoretical foundations and practical support for the development of AI-assisted evaluation systems in arts education. Through comparative analysis of three types of feature sets (Basic, Ablation, and Enhanced) and evaluation of multiple ML models, the results show that structured posture data does contain key information usable for expressive rating prediction. In particular, the duration rhythm features show a significant predictive effect in rating modeling. Further feature selection experiments show that high-precision modeling of teacher ratings can be achieved with only a small number of key feature values. Since the actions in this study are relatively simple and performed frontally, the impact of rotations or non-frontal poses is minimal. To further enhance robustness, future experiments will incorporate improved normalization and alignment techniques. The present study represents a preliminary effort to validate the exploratory framework. Future research can further expand the generalization ability and semantic understanding depth of this method using larger sample sizes, more complex imitation task categories, and multimodal information fusion. However, challenges remain in ensuring robustness across diverse real-world conditions and reducing dependence on subjective labeling, which will be crucial for deploying AI-assisted evaluation systems in practical educational settings.