1. Introduction
Imitation is a high-level cognitive expression, especially in the performing arts, and its significance goes far beyond action replication [1]. In the animal exercise course, students are required to use their bodies to convey the characteristics, rhythm, energy, and emotional state of animals [2]. This training is not merely physical exercise; it also builds the ability to express the life state of another being through action, which is expressive imitation.
Expressive imitation is not just about reproducing the action path; it also concerns whether the imitator grasps the animal's action style, sense of rhythm, shifts of the center of gravity, muscle tension, and even emotional atmosphere. It is a fusion of behavior, performance, and intention. Teachers can perceive whether this complex level of imitation succeeds while watching a performance. However, teacher evaluation is often highly subjective, the scoring criteria are unstructured, and there are problems of poor consistency and difficulty of quantification [3]. This makes it difficult for traditional methods to achieve stable and objective evaluation.
In recent years, computer vision and machine learning (ML) have made significant progress in the fields of action recognition, pose estimation, and 3D behavior modeling [4,5]. Through 3D pose estimation tools such as MediaPipe, the pose motion trajectories in imitation behavior can be captured accurately [6]. However, existing research focuses more on what actions have taken place and less on how to use structured pose features to quantitatively describe and evaluate the expressive quality of imitation actions.
This study aims to address the aforementioned issues by developing a machine learning and structured pose data-based auxiliary evaluation framework for imitation (ML-PSDAEF). The framework first attempts to extract core features from pose data and model the teacher’s scoring decision process. It further explores whether expressive imitation behaviors exhibit quantifiable structural patterns and ultimately evaluates the effectiveness and generalizability of the AI-assisted scoring system. The main contributions of this study include the following:
1. For the first time, animal imitation is used as an expressive modeling task, and a machine learning framework is introduced to achieve objective prediction of the quality of expressive imitation.
2. Multi-model comparison and leave-one-subject-out cross-validation (LOSO-CV) are used to systematically verify the score prediction ability and generalization stability of machine learning models.
3. Through comparative analysis of recursive feature elimination with cross-validation (RFECV) and feature importance ranking, stable and interpretable core structural features are identified to achieve efficient, objective, and interpretable improvements in expressive evaluation.
2. Related Work
2.1. Cognitive and Expressive Mechanisms of Imitation
Imitation occupies an important place in human cognitive development [7]. It is not only a way to learn skills but also a process of internalizing and re-expressing the external world. In performing arts, imitation requires not just accurate action reproduction but also the perception, deconstruction, and style reconstruction of another's actions. As such, imitation behavior is viewed as a highly abstract bodily expression mechanism involving situational awareness, emotional transmission, and style construction [8]. Although imitation has been studied extensively in psychology, education, and neuroscience, most of the literature remains at the cognitive or behavioral level, and work on structured modeling of action expressiveness and quality assessment is limited. From an artificial intelligence perspective, there is a clear lack of systematic methods for structured data modeling and prediction of expressive imitation behavior.
2.2. Action Modeling Techniques in Artificial Intelligence
In recent years, action modeling and recognition techniques in AI have made significant advances. Algorithms such as OpenPose, MediaPipe, AlphaPose, and DeepLabCut can efficiently extract 2D or 3D pose key points from video [9,10,11,12,13]. This structured coordinate data provides strong support for tasks such as action recognition, behavior analysis, and emotion recognition. By treating key point sequences as data, researchers can describe, classify, and compare actions in a data-driven way [14]. In domains such as sports, rehabilitation, and dance analysis, these techniques have been widely applied, yielding good results in action accuracy, efficiency, and temporal alignment [15]. Action recognition research mainly addresses which action was performed, for example, classifying jumps, waves, or sitting [16]. Some studies have progressed toward action quality assessment (AQA), aiming to evaluate technical performance in sports or rehabilitation training [17]. Such systems typically combine pose trajectories, temporal modeling, and deep feature extraction with expert scores for supervised learning. However, these frameworks focus on execution correctness or degree of completion and do not address the semantic dimension of action expressiveness. Moreover, Liu et al. [18] proposed a weight-aware multi-source unsupervised domain adaptation method that adaptively estimates the marginal distribution discrepancies between different source subjects and the target subject, effectively enhancing the accuracy of cross-subject human motion intention recognition and action imitation. Meanwhile, Afzal et al. [19] reviewed cutting-edge research on the integration of AI and biomimetic robotics in ecological conservation, exploring strategies for using animal-inspired robots to simulate, monitor, and intervene in wildlife behavior, thereby offering new technical pathways for animal imitation research and ecological monitoring.
2.3. Current Status and Limitations of Expressive Imitation Modeling
Expressive modeling has recently become a hot topic in AI research on art, music, and performance. For example, in dance generation and speech emotion recognition, studies have introduced variables such as rhythm, tension, and style. Yet in the field of pose-based action imitation, especially open-ended imitation, research on expressive modeling remains sparse. Existing systems typically assume a standard action template and measure similarity via dynamic time warping (DTW), angular differences, or Euclidean distance [20]. These methods suit tasks with clear rules, such as gymnastics scoring or rehabilitation detection, but they lack effective modeling capability for highly subjective expressive imitation [21]. Meanwhile, modeling of teachers' assessments remains immature. Although sports scoring studies predict scores from video or pose data, they rely on strict rule-based criteria rather than expressive or aesthetic dimensions. Teachers' assessments of expressive imitation depend heavily on experience and perception, and there is no systematic research on their consistency, predictability, or learnability as structured patterns. Moreover, the expressiveness of animal action imitation is closely tied to realism, stylistic features, and state quality, dimensions that resist static rule definition; thus, no mature system can yet judge expressive quality automatically.
Given that existing research remains insufficient in structuring and quantifying expressive imitation scores, this study extends expressive modeling to open imitation tasks and constructs a score prediction framework with pose features and machine learning at its core. Combining expert score consistency analysis using the Intraclass Correlation Coefficient (ICC) with the RFECV method, the system verifies the structuredness, predictability, and interpretability of expressive imitation scores [22].
3. Methodology
This study proposes an expressive imitation prediction process based on pose structure features, focusing on whether an AI model can accurately predict the expressive imitation score given by the teacher from the pose features of an action sequence, thereby realizing regression-based score prediction.
3.1. Machine Learning and Structured Pose Data-Based Auxiliary Evaluation Framework
The overall research process is illustrated in Figure 1 and consists of four stages: data collection and preprocessing, verification of teacher score consistency, motion feature quantification, and AI modeling and validation.
3.2. Data Collection and Evaluation Consistency Validation
In this experiment, we recruited 10 undergraduate students majoring in acting, with diverse genders, ages, and heights, and with varying levels of acting experience and practice duration. The indicators used in this experiment, such as angle error, temporal rhythm characteristics, and overall posture differences, were determined based on the indicators actually employed in animal training courses. In terms of hardware, the experiment employed a camera capturing 1920 × 1080 color video and 1280 × 720 depth video at 60 fps; recording took place in an indoor performance studio of approximately 8 × 10 m, equipped with adjustable overhead lights to ensure uniform illumination. A full blackout curtain was hung in the background to minimize environmental interference and enhance pose extraction accuracy. In the imitation task, two animal video segments were employed as control benchmarks for data collection and evaluation consistency validation.
Figure 2 presents typical video screenshots of students imitating animal behaviors, including performance clips and corresponding reference images of animal movements. In addition, five teachers with experience in stage performance instruction were invited to evaluate each video using a 10-point scale. The scoring criteria focused on the extent to which students successfully conveyed the style, rhythm, and expressive state of the target animal. Each video’s final score was calculated as the average of the independent scores from the five teachers. The distribution of scores across samples was relatively balanced and approximately normal.
We used the ICC under the two-way random effects model to conduct the consistency analysis. The results are shown in Table 1. The ICC(2, k) value was 0.889, indicating a high level of agreement among raters. This suggests that the scoring system is both learnable and suitable for modeling. Accordingly, we define the average teacher rating for each video as the "true score" and use it as the ground-truth label for machine learning.
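To make the consistency check concrete, the following minimal Python sketch computes ICC(2, k) from a videos × raters score matrix via the standard two-way ANOVA decomposition (Shrout and Fleiss); the function name and array layout are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def icc2k(ratings: np.ndarray) -> float:
    """ICC(2, k): two-way random effects, average of k raters, absolute agreement.

    ratings: (n_targets, k_raters) matrix of scores (videos x teachers).
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-video means
    col_means = ratings.mean(axis=0)   # per-rater means

    # Mean squares from the two-way ANOVA decomposition.
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_error = np.sum((ratings - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Shrout & Fleiss (1979): ICC(2, k) for average ratings.
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
```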
3.3. Pose Feature Extraction and Scoring Model Construction
This study uses MediaPipe as the pose data extraction tool to obtain the three-dimensional skeletal keypoint trajectories of students' imitation behaviors. MediaPipe is a cross-platform, real-time pose estimation framework provided by Google; it is efficient, stable, and scalable, accurately identifying the spatial positions of 33 skeletal keypoints of the human body, making it suitable for performance tasks with rich action details.
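As an illustration, a minimal extraction loop using MediaPipe's legacy `solutions.pose` API is sketched below; the input file name and the choice of world (metric) landmarks are our assumptions.

```python
import cv2
import mediapipe as mp
import numpy as np

# Extract 33 3D pose landmarks per frame with MediaPipe Pose.
pose = mp.solutions.pose.Pose(static_image_mode=False, model_complexity=2)

frames = []
cap = cv2.VideoCapture("imitation_clip.mp4")  # illustrative file name
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.pose_world_landmarks:
        # 33 landmarks, each with x/y/z coordinates.
        frames.append([(lm.x, lm.y, lm.z)
                       for lm in result.pose_world_landmarks.landmark])
cap.release()
pose.close()

sequence = np.asarray(frames)  # shape: (num_frames, 33, 3)
```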
In order to eliminate differences in individual body shape, position, and orientation, all keypoint coordinates are normalized with respect to the body core after extraction. Specifically, in each frame, the midpoint of the left and right hip joints is first computed, and all 33 joint points are translated so that this midpoint becomes the origin, eliminating positional differences; then, the distance between the left and right shoulder joints is computed, and all coordinates are scaled proportionally using this as the standard scale to correct deviations caused by individual body size.
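A minimal sketch of this normalization step, assuming a `(num_frames, 33, 3)` array and MediaPipe's standard landmark indices for the hips and shoulders:

```python
import numpy as np

# MediaPipe Pose landmark indices (official landmark map).
L_HIP, R_HIP, L_SHOULDER, R_SHOULDER = 23, 24, 11, 12

def normalize_skeleton(seq: np.ndarray) -> np.ndarray:
    """Per-frame normalization of a (num_frames, 33, 3) keypoint sequence:
    translate so the hip midpoint is the origin, then scale by shoulder width."""
    hip_center = (seq[:, L_HIP] + seq[:, R_HIP]) / 2.0              # (N, 3)
    centered = seq - hip_center[:, None, :]                          # remove position
    shoulder_width = np.linalg.norm(
        centered[:, L_SHOULDER] - centered[:, R_SHOULDER], axis=1)   # (N,)
    return centered / shoulder_width[:, None, None]                  # remove body scale
```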
After completing data standardization, a prediction model for expressive imitation scores was constructed. To evaluate the modeling capabilities of different algorithms, several representative regression methods were selected for comparative experiments. First, Ridge Regression was employed as the linear baseline model. It introduces an L2 regularization term to mitigate multicollinearity among features and is suitable for scenarios with high feature dimensionality or redundancy [23].
Considering the possibility of complex nonlinear relationships between the scores and pose features, Support Vector Regression (SVR) with a Radial Basis Function (RBF) kernel was adopted to enhance nonlinear fitting capacity, making it suitable for small sample sizes and high-dimensional data [24,25]. In addition, Random Forest regression was used to model nonlinear interactions between features [26]. This ensemble-based method constructs multiple decision trees and demonstrates strong robustness and feature selection capability.
Finally, gradient boosting was introduced to improve overall prediction performance by iteratively optimizing residuals, which is particularly well suited to high-precision modeling tasks under limited-data conditions [27].
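For concreteness, the four regressors can be instantiated in scikit-learn roughly as follows; the hyperparameter values shown are illustrative defaults, not the configurations reported in Table 3.

```python
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Linear and kernel models benefit from feature standardization;
# tree ensembles are scale-invariant and used directly.
models = {
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "SVR (RBF)": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0)),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}
```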
3.4. Pose Feature Engineering
3.4.1. Baseline Modeling and Feature Ablation Design
In all formulas, we use the following unified mathematical notation:
- $S$: the student's skeleton sequence.
- $T$: the template's skeleton sequence.
- $N$: number of frames in the student's imitation sequence.
- $M$: number of frames in the template sequence.
- $\mathbf{p}_i^j$: three-dimensional spatial coordinate vector of the $j$-th body joint at frame $i$.
- $\|\mathbf{v}\|$: the L2 norm of vector $\mathbf{v}$, i.e., the Euclidean distance.
- $(a_k, b_k)$: the $k$-th aligned frame-index pair as determined by the DTW algorithm.
In the base feature set, we define eight local joint error angle features to quantify how accurately the imitator's joint flexions at key limbs (left/right elbows, knees, shoulders, and hips) reproduce the target animal's posture. These handcrafted features are variables manually defined based on domain knowledge (e.g., performance theory, biomechanics) to capture meaningful aspects of action execution. Specifically, each joint angle is computed from three points forming an angle, as shown in Equation (1), where $\mathbf{p}_b$ denotes the angle's vertex:

$$\theta = \arccos\left(\frac{(\mathbf{p}_a - \mathbf{p}_b) \cdot (\mathbf{p}_c - \mathbf{p}_b)}{\|\mathbf{p}_a - \mathbf{p}_b\|\,\|\mathbf{p}_c - \mathbf{p}_b\|}\right) \quad (1)$$
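A direct implementation of Equation (1), with a small epsilon and clipping added for numerical safety (our additions):

```python
import numpy as np

def joint_angle(p_a: np.ndarray, p_b: np.ndarray, p_c: np.ndarray) -> float:
    """Angle at vertex p_b formed by points p_a, p_b, p_c (Equation (1))."""
    u, v = p_a - p_b, p_c - p_b
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
```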
To further evaluate the dynamic reproduction between the imitation and the template actions, we compute, based on the temporal alignment results, the mean joint angle difference over all aligned frame pairs. We define this as the error angle, as shown in Equation (2), where $K$ is the number of aligned pairs and $\theta^{S}$, $\theta^{T}$ denote the corresponding joint angles in the student and template sequences:

$$E_{\text{angle}} = \frac{1}{K}\sum_{k=1}^{K}\left|\theta^{S}_{a_k} - \theta^{T}_{b_k}\right| \quad (2)$$

This metric reflects the precision of local dynamic imitation; smaller values indicate closer alignment with the target.
Additionally, we introduce three temporal–rhythmic features: the student's imitation frame count $N$, the frame count difference $N - M$ from the reference sequence length $M$, and their ratio $N/M$. These features quantify how well the imitation's rhythm matches the temporal structure of the original movement and are sensitive to deviations such as moving too quickly or too slowly.
To measure the overall spatial–structural alignment between the imitation and the target action, we define two global pose-difference metrics: the dynamic time warping cost (DTW cost) and the aligned pose error ($E_{\text{pose}}$). DTW aligns pose sequences of different lengths, with its cumulative cost defined by the recurrence relation in Equation (3), where $d(i, j)$ is the pose distance between frame $i$ of the student sequence and frame $j$ of the template:

$$D(i, j) = d(i, j) + \min\{D(i-1, j),\; D(i, j-1),\; D(i-1, j-1)\} \quad (3)$$

The terminal value $D(N, M)$ represents the global matching cost of the entire imitation in the temporal dimension.
Based on the temporal alignment, we compute the Euclidean distance between corresponding keypoints in each pair of aligned frames and then average these distances to obtain the aligned pose error, as shown in Equation (4), where $\mathbf{p}^{S}$ and $\mathbf{p}^{T}$ denote student and template joints and $J = 33$ is the number of keypoints:

$$E_{\text{pose}} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{J}\sum_{j=1}^{J}\left\|\mathbf{p}^{S,j}_{a_k} - \mathbf{p}^{T,j}_{b_k}\right\| \quad (4)$$

This metric reflects how closely the overall contour matches the target animal's morphology.
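The following sketch implements the DTW recurrence of Equation (3) and the aligned pose error of Equation (4), assuming `(frames, 33, 3)` arrays and mean per-joint Euclidean distance as the frame-level cost; the backtracking step that recovers the pairs $(a_k, b_k)$ is a standard addition not spelled out in the text.

```python
import numpy as np

def dtw_align(student: np.ndarray, template: np.ndarray):
    """DTW over two (frames, 33, 3) sequences.

    Returns the cumulative cost D(N, M) (Equation (3)) and the aligned
    frame-index pairs (a_k, b_k) recovered by backtracking.
    """
    N, M = len(student), len(template)
    d = np.array([[np.mean(np.linalg.norm(s - t, axis=1)) for t in template]
                  for s in student])            # frame-pair pose distances
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1],
                                            D[i - 1, j - 1])
    # Backtrack to recover the warping path.
    path, i, j = [], N, M
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return D[N, M], path[::-1]

def aligned_pose_error(student, template, path):
    """Mean per-joint Euclidean distance over aligned frames (Equation (4))."""
    return float(np.mean([np.mean(np.linalg.norm(student[a] - template[b], axis=1))
                          for a, b in path]))
```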
The Basic Feature Set captures three structural dimensions of imitation behavior, namely local dynamic precision, temporal rhythm control, and global morphological reconstruction. These features provide foundational expressive information for subsequent modeling. To assess the actual contribution of rhythm features in score prediction, an Ablation Feature Set was constructed by removing the temporal rhythm features from the original Basic Feature Set, while retaining only the local joint angle error and global pose difference metrics. This configuration helps reveal the significance of rhythm features in expressive understanding and predictive modeling.
3.4.2. Enhanced Experiments Based on Baseline
To further capture the expressive dimension of action, we developed an Augmented Feature Set by incorporating five descriptive features into the Basic Feature Set (a computational sketch follows the list). Specifically, these features include:
Mean velocity, which measures the average movement speed of keypoints across the entire motion sequence, reflecting the overall activity level;
Mean acceleration, which captures the rate of change of velocity over time, indirectly indicating the smoothness of action rhythm;
Impact intensity, which describes the magnitude of acceleration variation, used to evaluate action explosiveness and energy intensity;
Pose symmetry error, which evaluates execution symmetry and body balance by comparing horizontal coordinate differences of paired keypoints on both sides of the body;
Velocity symmetry error, which quantifies the velocity consistency between the left and right sides, reflecting movement coordination.
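A minimal sketch of these five descriptors, assuming a normalized `(N, 33, 3)` sequence; the limb-pair indices, the jerk-based definition of impact intensity, and the mirrored-x formulation of pose symmetry are our assumptions about plausible implementations:

```python
import numpy as np

LEFT  = [11, 13, 15, 23, 25, 27]   # left shoulder/elbow/wrist/hip/knee/ankle
RIGHT = [12, 14, 16, 24, 26, 28]   # right-side counterparts

def expressive_features(seq: np.ndarray, fps: float = 60.0) -> dict:
    vel = np.diff(seq, axis=0) * fps    # per-joint velocity
    acc = np.diff(vel, axis=0) * fps    # per-joint acceleration
    jerk = np.diff(acc, axis=0) * fps   # rate of change of acceleration
    return {
        "mean_velocity": np.linalg.norm(vel, axis=2).mean(),
        "mean_acceleration": np.linalg.norm(acc, axis=2).mean(),
        "impact_intensity": np.linalg.norm(jerk, axis=2).mean(),
        # After hip-centering, left x mirrors right x, so |x_L + x_R|
        # measures left-right positional asymmetry.
        "pose_symmetry_error": np.abs(seq[:, LEFT, 0] + seq[:, RIGHT, 0]).mean(),
        "velocity_symmetry_error": np.abs(
            np.linalg.norm(vel[:, LEFT], axis=2)
            - np.linalg.norm(vel[:, RIGHT], axis=2)).mean(),
    }
```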
To validate the efficacy of each formula and parameter, Table 2 summarizes the corresponding technical implementations and literature references for the baseline parameters in existing studies.
3.4.3. Model Optimization Strategy Based on Feature Selection
To reduce the interference of redundant features and potential noise, RFECV was applied to identify the optimal feature subset. To further evaluate the relative contribution of each of the 18 features to the prediction task, an independent feature importance analysis was also conducted. Specifically, a Random Forest regressor was trained on the entire dataset, and its built-in feature importance scores were extracted using the impurity decrease mechanism. These methods jointly enhanced both the predictive performance and the interpretability of the model.
4. Experimental Results and Analysis
To systematically validate the modeling value and adaptability of pose motion features in expressive score prediction, in this section we conduct experiments on three different feature combinations.
4.1. Experimental Setup and Evaluation Strategy
In our experimental design, considering the significant variability among students in body shape, performance style, and imitation strategy, all models were trained and tested using LOSO-CV. In each fold, all imitation videos from one student were used as the test set, while the remaining data were used for training. This strategy effectively evaluates model stability when applied to a new individual and is a widely adopted method for robustness validation in behavioral modeling. The evaluation metrics include mean absolute error (MAE) and Spearman's rank correlation coefficient ($\rho$). The parameter configurations for all models are presented in Table 3.
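A compact sketch of the LOSO-CV protocol with both metrics, using scikit-learn's `LeaveOneGroupOut`; pooling predictions across folds before scoring (rather than averaging per-fold scores) is our assumption.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneGroupOut

def loso_evaluate(model, X, y, student_ids):
    """Leave-one-subject-out CV: each fold holds out one student's videos."""
    preds = np.zeros_like(y, dtype=float)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=student_ids):
        model.fit(X[train_idx], y[train_idx])
        preds[test_idx] = model.predict(X[test_idx])
    rho, _ = spearmanr(y, preds)
    return mean_absolute_error(y, preds), rho
```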
4.2. Baseline Model Evaluation and Rhythm Feature Ablation Analysis
This experiment uses the complete Basic Feature Set to train four commonly used supervised regression models: Ridge Regression, Support Vector Regression (SVR with RBF kernel), Random Forest, and Gradient Boosting. The Ridge Regression results serve as the baseline. Table 4 shows the prediction performance of each model under the LOSO-CV framework.
The results indicate that the Ridge Regression model achieves the best performance across all evaluation metrics. Its Spearman correlation coefficient reaches 0.619, demonstrating that the model most closely approximates the ranking trend of the teacher's scores and exhibits strong trend modeling capability. The Random Forest model also performs comparably well, suggesting that nonlinear tree-based methods can capture structural patterns between features to a certain extent. In contrast, the performance of SVR and Gradient Boosting is relatively poor, possibly due to a mismatch between the RBF kernel and the underlying feature distribution.
4.3. Ablation Analysis of Duration Rhythm Features
To analyze the contribution of the duration rhythm feature set to model performance, we trained an ablation version of the model using only the local joint angle error and global posture difference features. The same four regression models were trained, and their performance changes were compared. The experimental results are shown in Table 5.
The results demonstrate that the duration rhythm features play a critical role in score prediction. After removing these features, the Spearman rank correlation coefficients of all models decreased significantly, in some cases approaching zero or even becoming negative. This indicates that the models can no longer effectively capture the score ranking trends, rendering the prediction performance nearly invalid. This effect is particularly evident in Ridge Regression and Support Vector Regression, suggesting that linear and kernel-based models are highly sensitive to rhythmic structure.
Moreover, an ablation study using only the kinematic feature subset shows that all models yield Spearman correlation coefficients near zero, suggesting that pose noise substantially degrades prediction performance. This finding underscores the inherent limitations of pose estimation algorithms based solely on RGB video, as their accuracy can be compromised by lighting variations, complex backgrounds, and partial occlusions.
4.4. Enhanced Feature Modeling Effect Evaluation
Table 6 reports the average Spearman's correlation coefficient and mean MAE of six regression models using the interaction feature set (rhythm × symmetry, 19 dimensions). The results indicate that Ridge Regression achieves the best performance in both correlation and error.
Based on the above experiments, this study further explored the impact of expressive features on the model’s predictive ability by constructing an Enhanced Feature Set containing 18 features. On top of the original Basic Feature Set, five additional indicators were introduced to reflect the tension, energy distribution, and spatial coordination of imitation movements. These additions were designed to better capture the deeper expressive dimension of students’ imitation behavior.
In addition, this experiment incorporated two mainstream ensemble learning models, XGBoost and LightGBM, to evaluate the adaptability and predictive stability of different model structures under high-dimensional feature conditions. All experiments were conducted using the LOSO-CV strategy to ensure result comparability and robustness. The experimental results are presented in Table 7.
Despite the introduction of more expressive features, the overall model performance did not significantly improve compared with the Basic Feature Set results. Among the tested models, Ridge Regression continued to perform best, with a mean absolute error (MAE) of 0.668 and a Spearman correlation coefficient of 0.606. These values remained high but slightly lower than those in the baseline setting. The newly introduced ensemble models, XGBoost and LightGBM, did not outperform traditional methods. This suggests that increasing the number of features does not necessarily enhance model performance and may introduce redundancy and noise. Under small-sample conditions, a high-dimensional feature space is more likely to cause overfitting, especially for complex nonlinear models.
4.5. Comparison of Modeling Performance Using Feature Selection Optimization
In the enhanced feature experiment, we introduced 18 posture-related variables to characterize the rhythm, morphology, and expressive aspects of the imitation action. However, the results show that some models, such as XGBoost and LightGBM, experienced performance degradation in the high-dimensional feature space. This suggests that the models may be affected by redundant information or collinearity among features.
Moreover, because the expressive dimension is inherently subjective and varies significantly across individuals, certain expressive features may introduce substantial noise, thereby reducing the stability of model predictions. To address these issues, this section introduces a feature selection mechanism aimed at identifying the core variables with the highest explanatory power and predictive contribution. This strategy is intended to improve both model performance and interpretability.
We use the RFECV method, with Ridge Regression as the base estimator, to iteratively remove feature variables that contribute the least to model prediction. The model adopts negative mean squared error (negative MSE) as the evaluation metric to identify the optimal subset of features.
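The selection step can be sketched as follows; `X`, `y`, `student_ids`, and `feature_names` are assumed from the preceding pipeline, and reusing the LOSO groups for the internal cross-validation is our assumption rather than a stated detail of the paper.

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut

# Ridge as the base estimator, negative MSE as the scoring metric.
selector = RFECV(
    estimator=Ridge(alpha=1.0),
    step=1,                                  # drop one feature per iteration
    cv=LeaveOneGroupOut().split(X, y, groups=student_ids),
    scoring="neg_mean_squared_error",
    min_features_to_select=1,
)
selector.fit(X, y)
print("optimal feature count:", selector.n_features_)
print("selected:", [f for f, keep in zip(feature_names, selector.support_) if keep])
```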
Figure 3 illustrates the cross-validation score trend as the number of selected features increases from 1 to 18. The results show that the model reaches its performance peak when only two features are used. Although the overall change in performance remains relatively stable beyond this point, no further significant improvement is observed. This suggests that the model’s expressive structure can be effectively represented in a low-dimensional feature space.
In addition, to intuitively evaluate the relative contribution of the 18 features in the prediction task, this study further adopts an independent feature importance analysis method. Specifically, a Random Forest regressor is trained using the entire dataset, and its built-in feature importance scores, which are based on the impurity decrease mechanism, are extracted. These scores quantify the relative contribution of each feature to the reduction of prediction error during the model’s decision-making process. A higher score indicates that the feature plays a more critical role in prediction.
Finally, all features are ranked and visualized according to their importance scores in order to identify the core variables that influence score prediction.
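A corresponding sketch of the impurity-based ranking; the tree count is an illustrative choice, and `X`, `y`, and `feature_names` are assumed from the earlier steps.

```python
from sklearn.ensemble import RandomForestRegressor

# Fit on the full dataset and rank features by impurity-based importance.
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X, y)
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda kv: kv[1], reverse=True)
for name, score in ranking:
    print(f"{name:28s} {score:.3f}")
```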
Figure 4 presents the importance ranking of all 18 features. It can be observed that the top three features are structural variables belonging to the duration rhythm feature category. In contrast, the expressive features appear near the bottom of the ranking, which suggests either high noise interference or low model sensitivity in this task. The global posture difference feature ranks in the middle and provides some auxiliary value for modeling.
Based on the ranking results, we selected the automatically screened features, num_frames and frame_ratio, along with the top three features ranked by feature importance, to build alternative score prediction models. The results are summarized in Table 8, which compares the Spearman rank correlation of different models across the two types of feature subsets.
It can be observed that fine-tuning these feature subsets leads to optimal results, with Ridge Regression achieving the highest Spearman correlation of 0.7297. This performance significantly surpasses that of the full feature set models. The duration rhythm features and postural structure variables demonstrate a high degree of score predictability, making them valuable indicators for expressive imitation assessment. These findings highlight the importance of effective feature selection and low-dimensional modeling in enhancing the stability and interpretability of AI-based scoring systems.
To validate the statistical significance of these performance differences, we conducted a Wilcoxon signed-rank test on the prediction errors of the optimal 2-feature model versus the 13-feature baseline model. Table 9 shows that the performance improvement of the 2-feature model over the baseline model is statistically significant (p < 0.01).
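For reference, the test can be reproduced with SciPy as sketched below; `err_2feat` and `err_13feat` are hypothetical arrays of per-video absolute prediction errors for the two models.

```python
from scipy.stats import wilcoxon

# One-sided paired test: does the 2-feature model yield smaller errors?
stat, p_value = wilcoxon(err_2feat, err_13feat, alternative="less")
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```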
5. Discussion
The effective prediction performance of machine learning indicates that, despite the inherent subjectivity in evaluations by multiple teachers, expressive imitation ratings are not purely based on artistic intuition. Instead, they exhibit clear structural patterns and learnable features. This finding provides both theoretical support and a practical foundation for the development of AI-assisted objective scoring tools in the field of arts education. A central innovation of this work is the incorporation of a semantic expressiveness dimension into action evaluation. By extracting high-level pose and temporal descriptors that quantify intent and emotion, we enable quantitative analysis of how effectively participants convey the underlying meaning of each movement.
To address the issues of excessive subjectivity and the lack of structured criteria in expressive imitation assessment, this study proposes an auxiliary scoring framework based on machine learning and structured pose data. Experimental results show that teacher ratings can be effectively predicted across multiple models. In particular, the Ridge Regression model, trained with the Basic Feature Set, achieved a Spearman rank correlation of up to 0.619. This suggests that the subjective perception of “likeness” in imitation is closely associated with certain structural patterns embedded in the pose data. Although expressive imitation involves emotion, style, and internal state, it also follows quantifiable rules at the level of physical movement.
The results from the feature ablation experiments underscore the critical role of duration rhythm features in modeling expressive scores. When these features were removed, prediction performance dropped significantly, with some models showing near-zero or even negative correlations. This implies that, in the absence of temporal rhythm information, the models lose their ability to perceive the ranking trend of scores. Furthermore, both the feature importance ranking and the RFECV results consistently identified rhythm length and structural alignment as top predictive features. These findings suggest that rhythm is not merely auxiliary timing information, but a core structural dimension in determining imitation quality.
Although this study incorporated additional expressive features such as tension, motion symmetry, and impact intensity, the overall performance did not exceed that of the Basic Feature Set. In particular, ensemble models like XGBoost and LightGBM demonstrated signs of overfitting. The feature importance analysis revealed that these expressive indicators were ranked relatively low, indicating that their contribution may be hindered by subjectivity and individual differences. This implies that adding high-dimensional features in small-sample settings does not necessarily enhance generalization and may instead introduce redundancy and reduce model robustness and interpretability.
Further analysis using RFECV and feature importance scores demonstrates that even when only the top two or three features are retained, several models are still able to maintain or even improve their predictive accuracy. This supports the feasibility of using compact and interpretable feature subsets for expressive score modeling. Such configurations not only reduce computational complexity but also enhance the potential for real-time feedback and classroom deployment. Moreover, the structural composition of the selected features provides a foundation for future work on constructing more semantically rich and generalizable expressive labels.
Because the model is closely aligned with the scoring method normally used in class, teachers can use our prediction results as a reference for their evaluations. Moreover, students can use the proposed model to self-assess their imitation behaviors, helping them optimize key expressive dimensions such as rhythm control and body tension.
6. Conclusions
This study proposed the ML-PSDAEF to assess expressive imitation quality based solely on temporal pose features. The results preliminarily validate the learning potential of artificial intelligence in imitation scoring tasks. The framework introduces a pose data modeling approach with strong interpretability and structural clarity, offering both theoretical foundations and practical support for the development of AI-assisted evaluation systems in arts education. Through comparative analysis of three types of feature sets (Basic, Ablation, and Enhanced) and evaluation of multiple ML models, the results show that structured posture data does contain key information usable for expressive rating prediction. In particular, the duration rhythm features show a significant predictive effect in rating modeling. Further feature selection experiments show that high-precision modeling of teacher ratings can be achieved with only a small number of key feature values. Since the actions in this study are relatively simple and performed frontally, the impact of rotations or non-frontal poses is minimal. To further enhance robustness, future experiments will incorporate improved normalization and alignment techniques. The present study represents a preliminary effort to validate the exploratory framework. Future research can further expand the generalization ability and semantic understanding depth of this method using larger sample sizes, more complex imitation task categories, and multimodal information fusion. However, challenges remain in ensuring robustness across diverse real-world conditions and reducing dependence on subjective labeling, which will be crucial for deploying AI-assisted evaluation systems in practical educational settings.