AviaTAD-LGH: A Multi-Task Spatio-Temporal Action Detector with Lightweight Gradient Harmonization for Real-Time Avian Behavior Monitoring
Highlights
- We introduce a spatio-temporal action detection benchmark for wintering cranes, featuring dense annotations for six fine-grained behaviors in long, continuous ground-camera videos under cluttered, real-world conditions.
- We propose AviaTAD-LGH, a real-time multi-task spatio-temporal action detector trained with Lightweight Gradient Harmonization (LGH) to stabilize joint optimization and improve fine-grained behavior detection. On the wintering-crane benchmark, AviaTAD-LGH reaches 68.60% mAP, surpassing strong public baselines. Compared with a single-task variant, it improves mAP by 2.80 percentage points and boosts AP by more than 12 percentage points on challenging classes.
- AviaTAD-LGH enables efficient, individual-level behavior localization and recognition from continuous monitoring streams, supporting scalable video-based ecological monitoring and restoration assessment.
- LGH is a plug-and-play training strategy for multi-task video understanding, improving optimization without adding inference-time complexity and remaining suitable for edge-oriented real-time sensing applications.
Abstract
1. Introduction
2. Related Work
2.1. Datasets and Ecological Monitoring
2.2. Video Action Recognition
2.3. Multi-Task Learning and Gradient Conflicts
3. Materials and Methods
3.1. Dataset Collection and Annotation
3.2. AviaTAD-LGH Model Architecture
4. Results
4.1. Model Training and Validation
4.2. Main Results and Analysis
4.3. Practical Monitoring Evaluation
4.4. Optimization Analysis of LGH
4.5. Daily Activity Rhythms of the Hooded Crane
5. Discussion
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AviaTAD-LGH | Avian Temporal Action Detector with Lightweight Gradient Harmonization |
| LGH | Lightweight Gradient Harmonization |
| AP | Average Precision |
| mAP | Mean Average Precision |
| ROI | Region of Interest |
| SGD | Stochastic Gradient Descent |

| Behavior Type | Behavior Description |
|---|---|
| Feeding | Behaviors such as lowering the head to search, probing, pecking, digging with the bill, swallowing and other food-handling movements, drinking, adults feeding chicks, and chicks begging food from adults. |
| Vigilance | Behaviors such as stretching the neck, scanning the surroundings, and gazing into the distance. |
| Social behavior | Behaviors such as chasing among conspecifics, threatening or responsive calls, flights immediately before or after obvious provocative acts, and courtship behavior. |
| Maintenance | Behaviors such as preening, shaking the plumage, wing flapping, resting, and bathing. |
| Walk | Walking movements not associated with feeding and not accompanied by provocative or aggressive behavior. |
| Flight | Flying. |

| Model Name | Overall mAP | Gain over Baseline (pp) | Gain over Original (pp) | Feeding | Vigilance | Social | Maintenance | Walk | Flight |
|---|---|---|---|---|---|---|---|---|---|
| Original models (without LGH) | | | | | | | | | |
| VideoMAE-Base (baseline) | 0.6116 | 0.00 | - | 0.6315 | 0.4638 | 0.9502 | 0.6792 | 0.1071 | 0.8377 |
| ACRNN-SlowFast | 0.6143 | +0.27 | - | 0.8117 | 0.5099 | 0.8233 | 0.7127 | 0.1673 | 0.6607 |
| SlowOnly | 0.6255 | +1.39 | - | 0.8488 | 0.5115 | 0.9345 | 0.7155 | 0.1337 | 0.6088 |
| VideoMAE-Large | 0.6518 | +4.02 | - | 0.7510 | 0.4980 | 0.9161 | 0.6680 | 0.1614 | 0.9166 |
| SlowFast | 0.6580 | +4.64 | - | 0.8148 | 0.4778 | 0.9617 | 0.6934 | 0.2199 | 0.7806 |
| Multi-task learning models (+LGH) | | | | | | | | | |
| ACRNN-SlowFast +LGH | 0.6229 | +1.13 | +0.86 | 0.7020 | 0.4386 | 0.8535 | 0.6690 | 0.1983 | 0.8763 |
| VideoMAE-Base +LGH | 0.6333 | +2.17 | +2.17 | 0.7235 | 0.4594 | 0.9353 | 0.6530 | 0.1405 | 0.8881 |
| SlowOnly +LGH | 0.6717 | +6.01 | +4.62 | 0.8236 | 0.4940 | 0.9344 | 0.7076 | 0.2015 | 0.8690 |
| VideoMAE-Large +LGH | 0.6760 | +6.44 | +2.42 | 0.8534 | 0.5618 | 0.9241 | 0.6399 | 0.1622 | 0.9147 |
| AviaTAD-LGH (Ours) | 0.6860 | +7.44 | +2.80 | 0.8318 | 0.5477 | 0.9403 | 0.7046 | 0.2320 | 0.8595 |
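
The overall mAP and gain columns in the table above can be re-derived from the per-class AP values. The following is a minimal sketch that treats overall mAP as the unweighted mean of the six per-class APs and gains as differences in percentage points. It assumes the "original" counterpart of AviaTAD-LGH is the SlowFast row, whose mAP of 0.6580 is consistent with the reported +2.80. The names `AP`, `mean_ap`, and `gain_pp` are illustrative only.

```python
# Per-class AP values copied from the results table (0-1 scale).
CLASSES = ["Feeding", "Vigilance", "Social", "Maintenance", "Walk", "Flight"]
AP = {
    "VideoMAE-Base (baseline)": [0.6315, 0.4638, 0.9502, 0.6792, 0.1071, 0.8377],
    "SlowFast": [0.8148, 0.4778, 0.9617, 0.6934, 0.2199, 0.7806],
    "AviaTAD-LGH (Ours)": [0.8318, 0.5477, 0.9403, 0.7046, 0.2320, 0.8595],
}


def mean_ap(per_class):
    """Unweighted mean of per-class AP (assumed definition of overall mAP)."""
    return sum(per_class) / len(per_class)


def gain_pp(model, reference):
    """Difference in overall mAP, expressed in percentage points."""
    return 100.0 * (mean_ap(AP[model]) - mean_ap(AP[reference]))


def per_class_gain_pp(model, reference):
    """Per-class AP difference in percentage points, keyed by class name."""
    return {
        name: 100.0 * (m - r)
        for name, m, r in zip(CLASSES, AP[model], AP[reference])
    }


for name, values in AP.items():
    print(f"{name}: mAP = {mean_ap(values):.4f}")

print(f"Gain over baseline: {gain_pp('AviaTAD-LGH (Ours)', 'VideoMAE-Base (baseline)'):+.2f} pp")
for cls, delta in per_class_gain_pp("AviaTAD-LGH (Ours)", "SlowFast").items():
    print(f"  {cls}: {delta:+.2f} pp vs. SlowFast")
```

Small differences in the last digit can appear because the table's gains are computed from mAP values already rounded to four decimals.
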

| Configuration | mAP (%) | Gain over Baseline (pp) |
|---|---|---|
| Single-task baseline | 65.80 | - |
| +GradNorm | 66.21 | +0.41 |
| +PCGrad | 66.92 | +1.12 |
| +RI-PCGrad | 67.25 | +1.45 |
| +LGH | 68.60 | +2.80 |
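
For context on the gradient-conflict baselines compared above, the following is a minimal sketch of the projection step used by PCGrad for two tasks: when two task gradients have a negative inner product, each is projected onto the normal plane of the other before they are summed. This illustrates the baseline technique only, not LGH. The flat-list gradient representation and the function names are assumptions made for the example.

```python
def dot(u, v):
    """Inner product of two equal-length gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def pcgrad_project(g_i, g_j):
    """Remove from g_i its component along g_j, but only if they conflict."""
    inner = dot(g_i, g_j)
    if inner >= 0.0:
        # No conflict: the gradient is left untouched.
        return list(g_i)
    # inner < 0 implies g_j is non-zero, so the division is safe.
    scale = inner / dot(g_j, g_j)
    return [a - scale * b for a, b in zip(g_i, g_j)]


def pcgrad_two_tasks(g_main, g_aux):
    """Project each gradient against the other's original value, then sum."""
    p_main = pcgrad_project(g_main, g_aux)
    p_aux = pcgrad_project(g_aux, g_main)
    return [a + b for a, b in zip(p_main, p_aux)]


# Toy example with conflicting gradients (negative inner product).
g_main = [1.0, 0.0]
g_aux = [-1.0, 1.0]
print(pcgrad_project(g_main, g_aux))
print(pcgrad_two_tasks(g_main, g_aux))
```

After projection, the main-task gradient is orthogonal to the auxiliary gradient, so the combined update no longer moves against either task along the conflicting direction.
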

| Configuration | Aux. Motion Head | Weighting Strategy | mAP (%) | Over Baseline (pp) |
|---|---|---|---|---|
| Single-task baseline | ✗ | — | 65.80 | - |
| Motion head | ✓ | equal sum (λ = 1.0) | 66.01 | +0.21 |
| Motion head + static weighting | ✓ | fixed λ = 0.5 | 67.20 | +1.40 |
| Motion head + uncertainty weighting | ✓ | uncertainty-based | 67.76 | +1.96 |
| Motion head + LGH (Ours) | ✓ | conflict-aware reweighting | 68.60 | +2.80 |
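
The simpler weighting strategies in the ablation above can be sketched as follows, assuming the total loss is the main detection loss plus a weighted auxiliary motion loss. The uncertainty-based variant uses a commonly seen simplified form with one learnable log-variance `s` per task, which may differ from the exact formulation used in the experiments. The conflict-aware rule of LGH is not specified in this section and is deliberately left out. All function and argument names are hypothetical.

```python
import math


def weighted_sum(main_loss, aux_loss, lam):
    """Static weighting: lam = 1.0 is the equal sum, lam = 0.5 the fixed weight."""
    return main_loss + lam * aux_loss


def uncertainty_weighted(main_loss, aux_loss, s_main, s_aux):
    """Simplified uncertainty weighting (assumed form): exp(-s) * L + s per task.

    s_main and s_aux stand in for learnable log-variances; here they are
    plain floats so the arithmetic can be inspected directly.
    """
    return (math.exp(-s_main) * main_loss + s_main
            + math.exp(-s_aux) * aux_loss + s_aux)


# Toy loss values, for illustration only.
main_loss, aux_loss = 2.0, 1.0
print(weighted_sum(main_loss, aux_loss, lam=1.0))
print(weighted_sum(main_loss, aux_loss, lam=0.5))
print(uncertainty_weighted(main_loss, aux_loss, s_main=0.0, s_aux=0.0))
```

With both log-variances at zero, the uncertainty-weighted loss reduces to the equal sum, which makes the relationship between the table rows easier to see.
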

| | λ_base = 0.02 | λ_base = 0.05 | λ_base = 0.10 | λ_base = 0.20 |
|---|---|---|---|---|
| α = 0.1 | 67.23 | 67.81 | 68.14 | 67.52 |
| α = 0.2 | 67.41 | 67.95 | 68.60 | 67.73 |
| α = 0.3 | 67.18 | 67.86 | 68.26 | 67.44 |
| α = 0.5 | 67.01 | 67.53 | 67.89 | 67.12 |
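
Reading the rows of the sensitivity grid above as α and the four value columns as λ_base = 0.02, 0.05, 0.10, 0.20 (an assumption about the intended layout), the best setting and the overall spread can be checked with a short script. The variable and function names are illustrative.

```python
# mAP (%) values copied from the sensitivity table.
LAMBDA_BASE = [0.02, 0.05, 0.10, 0.20]
GRID = {
    0.1: [67.23, 67.81, 68.14, 67.52],
    0.2: [67.41, 67.95, 68.60, 67.73],
    0.3: [67.18, 67.86, 68.26, 67.44],
    0.5: [67.01, 67.53, 67.89, 67.12],
}


def best_setting(grid, lambdas):
    """Return (mAP, alpha, lambda_base) for the highest-scoring grid cell."""
    return max(
        (score, alpha, lam)
        for alpha, row in grid.items()
        for score, lam in zip(row, lambdas)
    )


def spread(grid):
    """Difference between the best and worst mAP across the whole grid."""
    values = [v for row in grid.values() for v in row]
    return max(values) - min(values)


score, alpha, lam = best_setting(GRID, LAMBDA_BASE)
print(f"best mAP {score:.2f} at alpha={alpha}, lambda_base={lam}")
print(f"spread across grid: {spread(GRID):.2f} pp")
```

Under this reading, every row peaks in the λ_base = 0.10 column, and the full grid varies by under two percentage points.
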

| Model | Params (M) | FLOPs (G) | Inference Speed (fps) |
|---|---|---|---|
| VideoMAE-Base | 86.6 | 180.1 | 38.5 |
| VideoMAE-Large | 304.4 | 597.2 | 14.2 |
| ACRNN-SlowFast | 35.8 | 71.3 | 45.6 |
| SlowOnly | 31.7 | 54.5 | 52.3 |
| SlowFast | 33.6 | 65.7 | 48.1 |
| AviaTAD-LGH (Ours) | 33.6 | 65.7 | 48.1 |
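
The efficiency figures above can be turned into per-item latency and relative compute cost. The sketch below assumes throughput is reported as items processed per second, so latency is simply 1000 / fps in milliseconds. The helper names are hypothetical.

```python
# (params in millions, FLOPs in G, inference speed in fps) from the table.
MODELS = {
    "VideoMAE-Base": (86.6, 180.1, 38.5),
    "VideoMAE-Large": (304.4, 597.2, 14.2),
    "ACRNN-SlowFast": (35.8, 71.3, 45.6),
    "SlowOnly": (31.7, 54.5, 52.3),
    "SlowFast": (33.6, 65.7, 48.1),
    "AviaTAD-LGH (Ours)": (33.6, 65.7, 48.1),
}


def latency_ms(fps):
    """Average time per processed item in milliseconds."""
    return 1000.0 / fps


def flops_ratio(model, reference):
    """How many times more FLOPs `model` needs than `reference`."""
    return MODELS[model][1] / MODELS[reference][1]


for name, (params, flops, fps) in MODELS.items():
    print(f"{name}: {latency_ms(fps):.2f} ms per item, {params} M params")

ratio = flops_ratio("VideoMAE-Large", "AviaTAD-LGH (Ours)")
print(f"VideoMAE-Large uses {ratio:.2f}x the FLOPs of AviaTAD-LGH")
```

The identical SlowFast and AviaTAD-LGH rows are consistent with the highlight that LGH changes training only and adds no inference-time cost.
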

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Xie, Z.; Jian, H.; Yang, W.; Fu, M.; Peng, W.; Eichhorn, M.P.; Crego, R.D.; Ning, X.; Du, J.; Wang, H. AviaTAD-LGH: A Multi-Task Spatio-Temporal Action Detector with Lightweight Gradient Harmonization for Real-Time Avian Behavior Monitoring. Sensors 2026, 26, 2088. https://doi.org/10.3390/s26072088

