Figure 1.
Geographic distribution of data collection sites across four locations in Northern Europe. (a) Map showing one site in Norway and three sites in Sweden (an Icelandic horse facility, Saxtorp, and TorHall). (b) Environmental diversity across sites, showing variation in lighting (overcast to bright sunlight), terrain (outdoor dry paddocks to hilly pastures and woodlands), and background (open pasture to indoor arenas). Multi-site data collection introduces environmental variability that supports model generalization. Base map: ©OpenStreetMap contributors.
Figure 2.
Behavioral taxonomy consolidation workflow. The original ethogram comprised 13 functional behavior groups (top row), which were consolidated into 6 functionally coherent categories (middle row) following three principles: (1) maintaining the three-category parent taxonomy (affiliative, neutral, avoidant), (2) distinguishing initiating party (horse vs. human), and (3) differentiating active versus passive engagement within affiliative interactions. These 6 categories were then organized into 3 parent categories (bottom row) for hierarchical classification. Sample counts (n) and percentages shown at each level demonstrate the severe class imbalance: affiliative–active comprised 51.7% of samples while avoidant–human represented only 0.3%, creating a 166.6:1 imbalance ratio. Arrows indicate consolidation relationships between levels. The hierarchical decomposition enables specialized Stage 2 classifiers for each parent category, achieving 82.2%, 84.3%, and 98.9% balanced accuracy for affiliative, neutral, and avoidant sub-behaviors, respectively, culminating in 88.5% overall balanced accuracy for six-behavior classification.
Figure 3.
Distribution of behavioral categories in the dataset. (a) Three parent categories showing dominance of affiliative interactions (69.4%) and rarity of avoidant behaviors (3.8%). (b) Six fine-grained behaviors revealing extreme imbalance, with affiliative–active comprising over half of all samples while avoidant–human represents only 0.3%, creating a 166.6:1 imbalance ratio that necessitates specialized handling techniques.
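The imbalance figures in the caption can be reproduced from the per-class sample counts reported in Table 7; a minimal sketch, including the inverse-frequency class weighting commonly used for cost-sensitive training (the paper's exact weighting scheme may differ):

```python
# Per-class sample counts (from the Table 7 support column)
counts = {
    "affiliative-active": 25993, "affiliative-subtle": 8904,
    "neutral-horse": 8717, "neutral-human": 4746,
    "avoidant-horse": 1754, "avoidant-human": 156,
}

# Imbalance ratio: largest class over smallest class (~166.6:1)
ratio = max(counts.values()) / min(counts.values())

# Inverse-frequency class weights, total / (k * n_c): a common
# cost-sensitive scheme that up-weights rare classes.
total, k = sum(counts.values()), len(counts)
weights = {c: total / (k * n) for c, n in counts.items()}
```

Rare avoidant–human samples receive roughly 170x the weight of affiliative–active samples under this scheme.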
Figure 4.
Overview of the multi-modal hierarchical classification framework. Video frames undergo parallel processing through three feature extraction pipelines: YOLOv8 for spatial relationships (12 features), MediaPipe for human pose (15 features), and AP-10K for equine pose (8 features). Extracted features are concatenated into 35-dimensional vectors and processed through a two-stage classifier: Stage 1 discriminates parent categories (affiliative, neutral, avoidant), while Stage 2 classifies fine-grained sub-behaviors within each parent. The system achieves 73.2% balanced accuracy at Stage 1 and 88.5% overall for six-behavior classification.
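The two-stage routing described above can be sketched with placeholder classifiers; `stage1` and the `stage2` sub-models below are hypothetical stand-ins for the trained CatBoost models, keyed on a single distance feature for illustration only:

```python
def classify_hierarchical(features, stage1, stage2):
    """Two-stage cascade: Stage 1 picks the parent category,
    then the matching Stage 2 sub-classifier refines it."""
    parent = stage1(features)          # "affiliative" | "neutral" | "avoidant"
    return parent, stage2[parent](features)

# Hypothetical stand-ins for the trained models (illustration only):
stage1 = lambda f: "avoidant" if f["distance"] > 0.5 else "affiliative"
stage2 = {
    "affiliative": lambda f: ("affiliative-active" if f["distance"] < 0.1
                              else "affiliative-subtle"),
    "neutral":     lambda f: "neutral-horse",
    "avoidant":    lambda f: "avoidant-horse",
}
parent, behavior = classify_hierarchical({"distance": 0.7}, stage1, stage2)
```

The dictionary dispatch makes the design explicit: each parent category owns an independent sub-classifier that only ever sees samples routed to it.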
Figure 5.
Representative video frames illustrating the three parent behavioral categories with multi-modal feature extraction overlays. (a) Affiliative: Horse actively approaches crouching human with head lowered toward subject; overlapping YOLOv8 bounding boxes (blue/green) indicate minimal inter-subject distance, MediaPipe skeleton (cyan) captures engaged crouching posture, AP-10K keypoints (purple) show relaxed equine body orientation with forward-directed attention. (b) Neutral: Human standing passively while horse grazes independently with body oriented away; moderate spatial separation visible in bounding box positions, horse head lowered in grazing posture with neutral ear position. (c) Avoidant: Horse moving away from crouching human with increasing spatial separation; horse body oriented in opposite direction from human, demonstrating the withdrawal pattern characteristic of avoidant interactions. Overlaid annotations demonstrate the complementary behavioral information captured by each feature extraction modality.
Figure 6.
Pose validation mosaic demonstrating multi-modal feature extraction quality across 18 representative frames. Top three rows (green borders): Successful detection examples from three different interaction videos across the three parent behavioral categories (columns: affiliative, neutral, avoidant). All cells display YOLOv8 bounding boxes (blue/green), MediaPipe human skeleton (cyan), and AP-10K equine keypoints (purple) generated through real model inference. Normalized inter-subject distance d and the annotated behavior label are shown at the bottom of each cell, illustrating the characteristic distance progression from close proximity during affiliative interactions to increased separation during avoidant episodes. Bottom three rows (red borders): Challenging and failure cases illustrating nine distinct detection failure modes encountered across the dataset, including sparse equine keypoint estimation (intentional failure case examples), missed YOLOv8 detections, low detection confidence, MediaPipe skeleton failure, distant subjects occupying a small fraction of the frame, and multi-horse confusion. Each failure cell is annotated with a description of the specific issue. Together, the 18 panels validate that the feature extraction pipelines produce reliable detections under favorable conditions while transparently documenting failure modes and their visual characteristics.
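The normalized inter-subject distance d shown in each mosaic cell can be derived from the two YOLOv8 bounding-box centers; a sketch, assuming boxes as (x1, y1, x2, y2) pixel tuples and normalization by the frame diagonal (the paper's exact normalization is not specified here):

```python
import math

def normalized_distance(human_box, horse_box, frame_w, frame_h):
    """Euclidean distance between box centers, scaled to [0, 1]
    by the frame diagonal so d is resolution-independent."""
    cx = lambda b: (b[0] + b[2]) / 2
    cy = lambda b: (b[1] + b[3]) / 2
    dist = math.hypot(cx(human_box) - cx(horse_box),
                      cy(human_box) - cy(horse_box))
    return dist / math.hypot(frame_w, frame_h)

# Illustrative boxes on a 1280x720 frame
d = normalized_distance((100, 200, 200, 500), (700, 150, 1100, 600), 1280, 720)
```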
Figure 7.
Confusion matrix for Stage 1 parent category classification showing sample counts and proportions. The model achieves strong avoidant recall (85%) essential for safety monitoring while maintaining reasonable performance on majority classes. Primary confusion occurs between affiliative and neutral categories, reflecting the subtle distinction between passive positive presence and truly neutral observation. Red numbers indicate diagonal values (correct classification counts).
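Balanced accuracy, used throughout the results, is the mean of per-class recalls and can be read directly off a confusion matrix; a minimal sketch with illustrative counts (not the paper's exact matrix, though the avoidant row mirrors the 85% recall in the caption):

```python
def balanced_accuracy(cm):
    """Mean of per-class recalls; each row of cm is one true class."""
    recalls = [row[i] / sum(row) for i, row in enumerate(cm)]
    return sum(recalls) / len(recalls)

# Illustrative counts only; rows = true class, columns = predicted
cm = [
    [70, 25, 5],   # affiliative
    [20, 75, 5],   # neutral
    [10,  5, 85],  # avoidant: 85% recall, as in the caption
]
bal_acc = balanced_accuracy(cm)
```

Unlike raw accuracy, this metric gives the rare avoidant class equal weight, which is why the majority-class baseline scores only 33.3% despite 69.4% raw accuracy.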
Figure 8.
Confusion matrix for six-behavior hierarchical classification under oracle routing, showing sample counts. Under oracle routing, Stage 2 classifiers receive ground-truth parent assignments, producing a block-diagonal structure where all misclassifications occur within parent categories (affiliative–active vs. affiliative–subtle, neutral–horse vs. neutral–human). Near-perfect avoidant recognition (bottom two rows) confirms successful handling of severe class imbalance. Cross-parent cells are zero by construction; in cascaded deployment, Stage 1 errors would introduce off-diagonal entries.
Figure 9.
Top 15 features by importance for three-category parent classification, color-coded by source. YOLO spatial features dominate (8 of 15), with normalized distance, box positions, and box areas proving most discriminative. MediaPipe contributes shoulder width and body height (reflecting human posture), while AP-10K's horse ear angle ranks highly and is critical for avoidant detection. Importance is measured as the total gain (loss reduction) accumulated across all CatBoost tree splits.
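The ranking and source attribution behind this figure can be sketched in a few lines; the feature names and gain values below are hypothetical placeholders (with a trained CatBoost model they would come from `model.get_feature_importance()`), and sources are recovered by name prefix:

```python
# Hypothetical (feature, gain) pairs for illustration only
importances = {
    "yolo_norm_distance": 18.2, "yolo_horse_box_x": 9.1, "yolo_human_area": 7.5,
    "mp_shoulder_width": 6.8, "ap10k_ear_angle": 6.4, "mp_body_height": 5.2,
    "yolo_human_box_y": 4.9, "ap10k_head_elev": 3.1,
}

# Rank by gain, keep at most the top 15
top = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)[:15]

# Attribute each top feature to its modality via its name prefix
by_source = {}
for name, gain in top:
    src = name.split("_")[0]  # "yolo" | "mp" | "ap10k"
    by_source.setdefault(src, []).append(name)
```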
Figure 10.
Ablation study visualizing modality contributions. (a) Single-modality performance: no individual source exceeds 59% balanced accuracy; all fall substantially below the full model while still demonstrating discriminative power. YOLO performs best alone (58.9%), reflecting the primacy of spatial relationships. (b) Ablation analysis: removing AP-10K causes the largest degradation (−13.1 percentage points), confirming that equine body language provides the strongest behavioral signal. Removing YOLO produces intermediate degradation (−8.0 pp), while removing MediaPipe has the smallest impact (−4.8 pp). Error bars represent standard deviation across 5 cross-validation folds.
Figure 11.
Training convergence analysis for the two-stage hierarchical pipeline; the x-axis shows boosting iterations, not neural-network epochs. (a) Stage 1 loss decreases smoothly without overfitting; training and validation losses converge by boosting iteration 300. (b) Stage 1 balanced accuracy plateaus at 73.2% on validation with a minimal training–validation gap (<2%), indicating good generalization. (c) Stage 2 losses for all three sub-classifiers (2A: affiliative, 2B: neutral, 2C: avoidant) show faster convergence due to simpler binary tasks and more balanced class distributions. (d) Stage 2 accuracies reach 82.2%, 84.3%, and 98.9%, respectively, with Stage 2C (avoidant) achieving near-perfect performance due to distinct behavioral signatures; solid lines show training curves and dashed lines show validation curves for each sub-classifier.
Figure 12.
Cross-validation stability analysis showing performance consistency across 5 stratified folds. (a) Three-category classification achieves balanced accuracy of 73.2 ± 0.4%, with all folds within 2 percentage points of the mean, indicating robust generalization. (b) Six-behavior hierarchical classification achieves 88.5 ± 0.3% balanced accuracy with similarly low variance, confirming that performance is not driven by fortunate data splits but reflects genuine learning. Standard deviations represent variation across folds, not standard error. Low standard deviations (<1%) demonstrate model stability across diverse interaction contexts present in different folds.
Figure 13.
ROC curves for three-category parent classification using One-vs-Rest strategy. Each curve plots true positive rate against false positive rate across classification thresholds. The avoidant category achieves the highest AUC (0.936), indicating excellent discrimination despite severe underrepresentation. Affiliative (AUC = 0.891) and neutral (AUC = 0.869) show strong but slightly lower discrimination. Macro-average AUC of 0.899 substantially exceeds random classification baseline (0.500).
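Per-class AUC under a One-vs-Rest reduction can be computed without curve integration via the rank-comparison (Mann–Whitney U) formulation; a self-contained sketch with illustrative labels and scores:

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative
    (ties count half) -- the Mann-Whitney U formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One-vs-Rest: "avoidant" against the other two parent categories
y = [1, 0, 1, 0, 0]                    # 1 = avoidant, 0 = rest
p_avoidant = [0.9, 0.3, 0.4, 0.5, 0.1]  # illustrative scores
auc = roc_auc(y, p_avoidant)
```

Repeating this once per class and averaging the three values gives the macro-average AUC reported in the caption.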
Figure 14.
Precision–recall curves for three-category classification. Each curve shows precision–recall tradeoff across thresholds, with horizontal dashed line indicating no-skill baseline (class prevalence). (a) Affiliative achieves high average precision (AP = 0.81) and maintains >80% precision across most recall levels, benefiting from majority class status. (b) Neutral shows moderate performance (AP = 0.65) with precision dropping more steeply at high recall, reflecting confusion with affiliative. (c) Avoidant achieves excellent AP (0.94) maintaining >90% precision even at 100% recall, validating effective minority class learning through cost-sensitive training.
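Average precision summarizes each curve as the mean of precision at the ranks where a positive occurs; a minimal sketch on illustrative data (equivalent to the step-wise summation scikit-learn uses for `average_precision_score`):

```python
def average_precision(labels, scores):
    """AP = mean of precision@k over the ranks k at which a
    positive appears, after sorting by score (descending)."""
    order = sorted(zip(scores, labels), reverse=True)
    tp, precisions = 0, []
    for k, (_, y) in enumerate(order, start=1):
        if y == 1:
            tp += 1
            precisions.append(tp / k)
    return sum(precisions) / len(precisions)

# Illustrative binary labels and classifier scores
ap = average_precision([1, 0, 1, 1], [0.9, 0.8, 0.7, 0.6])
```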
Table 1.
Original 13-behavior ethogram with operational definitions, parent category assignments, and consolidation mapping to the final 6-category taxonomy.
| Behavior | Parent | Operational Definition | Consolidated To |
|---|---|---|---|
| Approach | Affiliative | Deliberate movement toward partner, reducing distance | Affiliative–active |
| Touch | Affiliative | Physical contact initiated by either species (nuzzle, stroke) | Affiliative–active |
| Mutual Grooming | Affiliative | Reciprocal grooming between horse and human | Affiliative–active |
| Follow | Affiliative | Sustained movement maintaining proximity with partner | Affiliative–active |
| Stand Together | Affiliative | Relaxed co-presence without active engagement | Affiliative–subtle |
| Explore Near | Affiliative | Environmental investigation while maintaining proximity | Affiliative–subtle |
| Graze | Neutral | Horse feeding behavior without human-directed attention | Neutral–horse |
| Self-Groom | Neutral | Horse self-maintenance (scratching, shaking, rolling) | Neutral–horse |
| Stand Still | Neutral | Horse stationary, aware of but not engaging human | Neutral–horse |
| Wait | Neutral | Human stationary, observing without initiating interaction | Neutral–human |
| Move Away | Avoidant | Deliberate increase in distance from partner | Avoidant–horse/human |
| Ear Pin | Avoidant | Ears flattened against head indicating stress/aggression | Avoidant–horse |
| Back Away | Avoidant | Backward movement maintaining visual contact | Avoidant–human |
Table 2.
Summary statistics of the annotated dataset.
| Characteristic | Value |
|---|---|
| Total temporal samples | 50,270 |
| Videos included | 28 |
| Recording locations | 4 (1 in Norway, 3 in Sweden) |
| Horse breeds represented | Multiple (including Icelandic horses) |
| Total video duration | 85 min |
| BORIS annotation events | 624 |
| Annotation hours | 220 |
| Original behavioral categories | 13 |
| Consolidated categories | 6 |
| Inter-rater reliability (κ) | 0.78 (95% CI: 0.74–0.82) |
| Videos double-coded | 5 (18% of dataset) |
| Sampling rate | 3 fps |
| Features per sample | 35 |
| Parent categories | 3 |
| Maximum class imbalance | 166.6:1 |
| Cross-validation folds | 5 (stratified) |
Table 3.
Summary of extracted features by source modality and semantic category.
| Source | Category | Count | Examples |
|---|---|---|---|
| YOLOv8 | Distance | 3 | Euclidean distance, normalized distance |
| | Position | 4 | Human/horse box centers |
| | Size | 2 | Bounding box areas |
| | Confidence | 3 | Detection scores, both-detected flag |
| MediaPipe | Head/Gaze | 2 | Nose position |
| | Upper Body | 5 | Shoulder positions, shoulder width |
| | Arms | 4 | Wrist positions |
| | Derived | 4 | Body center, height, speed |
| AP-10K | Head | 2 | Head position, elevation angle |
| | Body | 1 | Body orientation angle |
| | Ears | 2 | Ear angle, ear spread |
| | Composite | 3 | Alertness score, confidence |
| Total | | 35 | |
Table 4.
Computational infrastructure specifications.
| Component | Specification |
|---|---|
| Training GPU | NVIDIA RTX 3070 Ti (8 GB GDDR6X) |
| Inference GPU | NVIDIA RTX 3070 (8 GB GDDR6) |
| CPU | Intel Core i7-12800HX (16 cores) |
| RAM | 32 GB DDR5-4800 |
| Operating System | Ubuntu 22.04 LTS |
| Framework | PyTorch 1.12.1 + CUDA 11.6 |
| Feature Extraction Duration | ∼24 h (all 28 videos) |
| Classifier Training Duration | <2 min (all stages combined) |
| Peak Memory Usage | 6.2 GB (training), 4.8 GB (inference) |
| Pipeline Latency | 83 ms/frame (end-to-end at 12 fps) |
Table 5.
Classification algorithm comparison on three-category parent task using stratified 5-fold cross-validation. Balanced accuracy computed as mean of per-class recalls. All models trained with class weighting where supported. Bold indicates best performance.
| Algorithm | Bal. Acc. | Accuracy | Weighted F1 | Training Time |
|---|---|---|---|---|
| CatBoost | 73.2% | 63.8% | 65.8% | 35.6 s |
| HistGradientBoosting | 72.8% | 62.9% | 64.7% | 9.2 s |
| ExtraTrees | 69.2% | 61.4% | 62.8% | 9.1 s |
| Random Forest | 59.2% | 58.3% | 59.1% | 42.9 s |
| MLP (3 layers) | 55.0% | 56.7% | 57.2% | 28.4 s |
| K-Nearest Neighbors | 50.1% | 53.8% | 54.2% | 12.3 s |
| Logistic Regression | 47.2% | 51.9% | 52.4% | 64.8 s |
| LightGBM | 46.3% | 50.2% | 50.8% | 35.3 s |
| XGBoost | 46.2% | 50.1% | 50.6% | 19.2 s |
| Naive Bayes | 41.8% | 47.3% | 48.1% | 1.2 s |
| Decision Tree | 39.5% | 45.2% | 46.0% | 3.4 s |
| Baseline (majority) | 33.3% | 69.4% | — | — |
Table 6.
Overall classification performance comparing hierarchical vs. flat architectures across multiple validation strategies. Frame-level CV uses stratified 5-fold on frames (metrics computed via cross_val_predict); video-level CV uses grouped 5-fold on videos; LOVO uses leave-one-video-out. Bold indicates best performance per validation strategy.
| Approach | Validation | Bal. Acc. | Accuracy | Weighted F1 |
|---|---|---|---|---|
| Three-Category Classification | | | | |
| Hierarchical Stage 1 | Frame-level CV | 73.2% | 63.8% | 65.8% |
| Hierarchical Stage 1 | Video-level CV | 70.8 ± 1.2% | 61.2 ± 1.4% | 63.1 ± 1.3% |
| Hierarchical Stage 1 | LOVO | 69.5 ± 2.8% | 59.8 ± 3.1% | 61.7 ± 2.9% |
| Flat CatBoost | Frame-level CV | 73.2% | 63.1% | 65.1% |
| Naive (majority class) | — | 33.3% | 69.4% | — |
| Six-Behavior Classification | | | | |
| Hierarchical (proposed) | Frame-level CV | 88.5% | 80.9% | 81.8% |
| Hierarchical (proposed) | Video-level CV | 85.2 ± 1.1% | 77.3 ± 1.3% | 78.4 ± 1.2% |
| Hierarchical (proposed) | LOVO | 83.7 ± 2.4% | 75.8 ± 2.7% | 76.9 ± 2.5% |
| Flat CatBoost | Frame-level CV | 63.4% | 58.7% | 60.1% |
| Flat CatBoost | Video-level CV | 61.2 ± 1.8% | 56.3 ± 2.0% | 57.8 ± 1.9% |
| Flat CatBoost | LOVO | 59.7 ± 2.6% | 54.1 ± 2.8% | 55.6 ± 2.7% |
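The video-level CV rows above require that all frames from one video land in the same fold, preventing temporal leakage between training and test splits; a minimal grouped-fold assignment (round-robin over video ids — scikit-learn's GroupKFold does this more carefully by balancing group sizes):

```python
def grouped_folds(video_ids, n_folds=5):
    """Assign each frame to a fold by its video id, so no video
    is ever split across folds."""
    fold_of = {v: i % n_folds for i, v in enumerate(sorted(set(video_ids)))}
    return [fold_of[v] for v in video_ids]

# 28 videos' worth of frames, sketched with 3 frames per video
frames = [f"vid{v:02d}" for v in range(28) for _ in range(3)]
folds = grouped_folds(frames)
```

The gap between frame-level and video-level scores in the table reflects exactly the correlation that this grouping removes.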
Table 7.
Per-class performance metrics for six-behavior hierarchical classification. Support indicates number of test set samples per category.
| Behavior | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Affiliative–Active | 95.1% | 76.0% | 84.5% | 25,993 |
| Affiliative–Subtle | 56.4% | 88.4% | 68.4% | 8904 |
| Neutral–Horse | 94.0% | 78.3% | 85.3% | 8717 |
| Neutral–Human | 69.4% | 90.4% | 78.5% | 4746 |
| Avoidant–Horse | 100% | 99.7% | 99.8% | 1754 |
| Avoidant–Human | 97.0% | 98.1% | 97.5% | 156 |
| Macro Average | 85% | 88% | 86% | — |
| Weighted Average | 86% | 81% | 82% | 50,270 |
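The F1 column follows directly from precision and recall as their harmonic mean; a one-line check against the affiliative–active row:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Affiliative-active row from the table: P = 95.1%, R = 76.0%
score = f1(0.951, 0.760)  # ~0.845, matching the reported 84.5%
```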
Table 8.
Hierarchical classification performance summary across all taxonomic levels. Stage 1 discriminates between three parent categories, while Stage 2 sub-classifiers perform binary classification within each parent. The nested structure shows accuracy and F1 scores at each level, demonstrating the system’s ability to handle both coarse-grained and fine-grained behavioral distinctions under severe class imbalance.
| Category | Accuracy | F1 Score | Samples | Bal. Acc. |
|---|---|---|---|---|
| Overall Performance | | | | |
| OVERALL (Hierarchical) | 80.9% | 81.8% | 50,270 | 88.5% |
| Stage 1: Parent Categories | | | | |
| 3-way classification | 63.8% | 65.8% | 50,270 | 73.2% |
| Affiliative Behaviors | | | | |
| Combined | 79.2% | 80.4% | 34,897 | 82.2% |
| Affiliative–Active | 76.0% | 84.5% | 25,993 | — |
| Affiliative–Subtle | 88.4% | 68.4% | 8904 | — |
| Neutral Behaviors | | | | |
| Combined | 82.6% | 82.9% | 13,463 | 84.3% |
| Neutral–Horse | 78.3% | 85.3% | 8717 | — |
| Neutral–Human | 90.4% | 78.5% | 4746 | — |
| Avoidant Behaviors | | | | |
| Combined | 99.6% | 99.6% | 1910 | 98.9% |
| Avoidant–Horse | 99.7% | 99.8% | 1754 | — |
| Avoidant–Human | 98.1% | 97.5% | 156 | — |
Table 9.
Oracle vs. cascaded end-to-end performance on six-behavior classification. Oracle routing uses ground-truth parent labels for Stage 2 input; cascaded routing uses Stage 1 predictions. The substantial performance gap identifies Stage 1 parent classification as the primary bottleneck for end-to-end deployment.
| Routing | Bal. Acc. | Accuracy | Weighted F1 |
|---|---|---|---|
| Oracle (ground-truth parents) | 88.5% | 80.9% | 81.8% |
| Cascaded (Stage 1 predictions) | 62.9% | 52.9% | 55.0% |
| Performance drop | −25.6 pp | −28.0 pp | −26.8 pp |
| Cross-parent errors | 36.2% of samples misrouted | | |
Table 10.
Ablation study quantifying performance impact of removing each feature modality. Performance degradation measured relative to full 35-feature model (73.2% balanced accuracy).
| Feature Configuration | Bal. Acc. | Absolute | Relative |
|---|---|---|---|
| All features (35) | 73.2% | — | — |
| Ablation (remove one modality) | | | |
| Without AP-10K (27 features) | 60.1% | −13.1% | −17.9% |
| Without YOLO (23 features) | 65.2% | −8.0% | −10.9% |
| Without MediaPipe (20 features) | 68.4% | −4.8% | −6.6% |
| Single modality (use one only) | | | |
| YOLO only (12 features) | 58.9% | −14.3% | −19.5% |
| AP-10K only (8 features) | 55.7% | −17.5% | −23.9% |
| MediaPipe only (15 features) | 52.4% | −20.8% | −28.4% |
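The feature counts in the ablation rows can be reproduced by masking one modality's columns out of the 35-dimensional vector; a sketch assuming the concatenation order of Figure 4 (YOLO first, then MediaPipe, then AP-10K — the paper does not state the internal ordering explicitly):

```python
# Modality slices in the 35-dim feature vector (assumed ordering)
SLICES = {"yolo": range(0, 12), "mediapipe": range(12, 27), "ap10k": range(27, 35)}

def ablate(features, drop):
    """Return the feature vector with one modality's columns removed."""
    dropped = set(SLICES[drop])
    return [features[i] for i in range(35) if i not in dropped]

x = list(range(35))  # stand-in feature vector
```

Retraining the Stage 1 classifier on each masked vector yields the 27-, 23-, and 20-feature configurations in the table.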
Table 11.
Comparison with existing automated animal behavior recognition systems. Direct quantitative comparison is precluded by differences in tasks, datasets, and evaluation protocols. Our approach uniquely addresses cross-species interaction under severe class imbalance.
| Study | Task | Classes | Acc. | Method | Key Limitation |
|---|---|---|---|---|---|
| Bhave et al. [19] | Horse emotion (unsup.) | 8 | 55.5% | MoCo contrast. | Single-species only |
| Feighelstein et al. [49] | Cat pain (facial) | 2 | 72% | CNN + LDM (images) | Facial-only, no pose |
| Feighelstein et al. [18] | Horse emotion (facial) | 4 | 76% | CNN (images) | Facial-only, no pose |
| Corujo et al. [50] | Horse emotion (images) | 4 | 65% | CNN | Static images only |
| Broomé et al. [17] | Animal pain (survey) | — | — | Various CV | Review paper |
| Mathis et al. [20] | Animal pose track. | — | — | DeepLabCut | Tracking only |
| Pereira et al. [21] | Multi-animal track. | — | — | SLEAP | Tracking only |
| Temporal Baseline | Sliding window | 6 | 90.2% | Majority vote (5-frame) | Limited temporal modeling |
| This work | Human–horse interact. | 6 | 88.5% | Hierarchical | Frame-by-frame |
| | (parent categories) | 3 | 73.2% | multi-modal | Single subject |
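The temporal-baseline row applies a 5-frame sliding-window majority vote over per-frame predictions; a minimal sketch (the tie-breaking rule here — keeping the center frame's own label — is an assumption, as the paper does not specify one):

```python
from collections import Counter

def smooth_predictions(preds, window=5):
    """Majority vote over a centered sliding window of per-frame
    labels; ties fall back to the center frame's own prediction."""
    half = window // 2
    out = []
    for i in range(len(preds)):
        votes = Counter(preds[max(0, i - half): i + half + 1])
        top, top_n = votes.most_common(1)[0]
        tied = [c for c, n in votes.items() if n == top_n]
        out.append(preds[i] if len(tied) > 1 else top)
    return out

# A single-frame blip ("n" at index 2) is smoothed away
smoothed = smooth_predictions(["a", "a", "n", "a", "a", "n", "n", "n"])
```

Such smoothing suppresses isolated frame-level errors, which is consistent with the baseline's higher accuracy at the cost of blurring rapid behavioral transitions.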