Towards Sustainable Wildlife Conservation: Automatic Recognition of Endangered Animal Behavior Using a Multimodal Contrastive Learning Framework
Abstract
1. Introduction
- We monitor the behaviors of endangered protected animals, including several behaviors unique to giant pandas, helping researchers formulate science-based conservation and management measures for the effective protection of endangered species.
- To address data scarcity in this domain, we explore and adapt the prompt learning paradigm, which, to our knowledge, remains under-explored for video-based endangered animal behavior recognition.
- We propose ABCLIP, a multimodal architecture that achieves strong performance on action recognition by reformulating the task as a video–text multimodal learning problem supervised by natural-language signals (a minimal sketch of this formulation appears below).
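The following is a minimal, illustrative sketch (not the authors' exact implementation) of how action recognition can be cast as video–text matching: each behavior label is expanded into a sentence prompt, both modalities are embedded, and the predicted class is the prompt whose embedding is closest to the video embedding. All tensor shapes and names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def classify_by_similarity(video_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """video_emb: (B, D) video features; text_embs: (C, D), one per behavior prompt."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = video_emb @ text_embs.t()   # (B, C) cosine similarities
    return logits.argmax(dim=-1)         # index of the best-matching prompt

# Toy usage with random features standing in for encoder outputs.
videos = torch.randn(4, 512)             # 4 clips, 512-dim embeddings
prompts = torch.randn(11, 512)            # 11 behavior prompts
print(classify_by_similarity(videos, prompts))
```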
2. Materials and Methods
2.1. Animal Recording Environment
2.2. Dataset Statistics

2.3. Method
2.3.1. Textual Prompt Module
- (1) For regular training/evaluation.
- (2) For zero-shot evaluation.
- (3) For species-specific prompt ablation (see Section 4 for details and the template sketch after this list).
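Below is a hypothetical sketch of how the three prompt variants could be constructed; the template wording and function name are assumptions for illustration, not the paper's exact prompts.

```python
from typing import List, Optional

def build_prompts(behavior: str, species: Optional[str] = None, mode: str = "train") -> List[str]:
    """Return natural-language prompts for one behavior label (illustrative templates)."""
    if mode == "zero_shot":
        templates = ["a video of an animal {}."]
    elif mode == "species" and species is not None:
        templates = ["a video of a " + species + " {}."]
    else:  # regular training/evaluation
        templates = ["a video of the animal {}.",
                     "the animal is {} in the enclosure."]
    return [t.format(behavior) for t in templates]

print(build_prompts("feeding", species="giant panda", mode="species"))
# ['a video of a giant panda feeding.']
```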
2.3.2. Visual Prompt Module
- 1. Spatiotemporal Prompt
- 2. Multi-Frame Integration Prompt (a frame-integration sketch follows this list)
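The sketch below illustrates two multi-frame integration variants compared in the ablation study (mean pooling vs. a small temporal Transformer over per-frame features). Dimensions, layer counts, and class names are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MeanPoolIntegration(nn.Module):
    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) per-frame features -> (B, D) video feature
        return frame_feats.mean(dim=1)

class TransformerIntegration(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # Temporal self-attention across frames, then average into one vector.
        return self.encoder(frame_feats).mean(dim=1)

frames = torch.randn(2, 8, 512)                # 2 clips, 8 frames, 512-dim features
print(MeanPoolIntegration()(frames).shape)     # torch.Size([2, 512])
print(TransformerIntegration()(frames).shape)  # torch.Size([2, 512])
```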
2.3.3. Encoder Module
2.3.4. Feature Similarity Calculation Module
3. Results
3.1. Experimental Procedures
3.2. Comparison with Existing Models
3.3. Zero-Shot Behavior Recognition
- Baselines: We compared against two baselines, (1) Random Guess (chance level: 2/11 ≈ 18.18%) and (2) CLIP Zero-Shot, which uses the frozen pre-trained CLIP image and text encoders (without our task-specific fine-tuning) and the same prompt templates for nearest-neighbor matching (see the baseline sketch after this list).
- Results: As shown in Table 5, ABCLIP achieves a zero-shot recognition accuracy of 58.00%, significantly outperforming both baselines. This demonstrates that our multimodal fine-tuning enables effective transfer of learned visual–textual associations to novel behavior concepts.
- Analysis: The per-class breakdown indicates better performance on ‘playing’ (~65%) than on ‘defecating’ (~51%), likely due to the more distinctive and dynamic visual features of playful actions.
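A hedged sketch of the CLIP zero-shot baseline described above: frozen pre-trained encoders, behavior names inserted into a prompt template, and nearest-neighbor matching in the shared embedding space. The behavior list, template wording, and the random tensor standing in for preprocessed video frames are assumptions for illustration.

```python
import torch
import clip  # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)

device = "cpu"  # use "cuda" for GPU (note the model loads in half precision there)
model, preprocess = clip.load("ViT-B/32", device=device)

behaviors = ["walking", "feeding", "resting", "playing", "defecating"]
tokens = clip.tokenize([f"a video of an animal {b}." for b in behaviors]).to(device)

with torch.no_grad():
    text_feats = model.encode_text(tokens)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    # Stand-in for T preprocessed video frames: (T, 3, 224, 224).
    frames = torch.randn(8, 3, 224, 224, device=device)
    frame_feats = model.encode_image(frames)
    video_feat = frame_feats.mean(dim=0, keepdim=True)  # average frame features
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)

pred = (video_feat @ text_feats.t()).argmax(dim=-1)
print("predicted behavior:", behaviors[pred.item()])
```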
3.4. Ablation Study
3.5. Ablation on Loss Functions
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Black, S.A. Assessing presence, decline, and extinction for the conservation of difficult-to-observe species. In Problematic Wildlife II: New Conservation and Management Challenges in the Human-Wildlife Interactions; Springer: Cham, Switzerland, 2020; pp. 359–392. [Google Scholar]
- Songer, M.; Delion, M.; Biggs, A.; Huang, Q. Modeling impacts of climate change on giant panda habitat. Int. J. Ecol. 2012, 2012, 108752. [Google Scholar] [CrossRef]
- Smith, J.A.; Gaynor, K.M.; Suraci, J.P. Mismatch between risk and response may amplify lethal and non-lethal effects of humans on wild animal populations. Front. Ecol. Evol. 2021, 9, 604973. [Google Scholar] [CrossRef]
- Walker, K.A.; Trites, A.W.; Haulena, M.; Weary, D.M. A review of the effects of different marking and tagging techniques on marine mammals. Wildl. Res. 2011, 39, 15–30. [Google Scholar] [CrossRef]
- Jewell, Z. Effect of monitoring technique on quality of conservation science. Conserv. Biol. 2013, 27, 501–508. [Google Scholar] [CrossRef]
- Pimm, S.L.; Alibhai, S.; Bergl, R.; Dehgan, A.; Giri, C.; Jewell, Z.; Joppa, L.; Kays, R.; Loarie, S. Emerging technologies to conserve biodiversity. Trends Ecol. Evol. 2015, 30, 685–696. [Google Scholar] [CrossRef] [PubMed]
- Di Cerbo, A.R.; Biancardi, C.M. Monitoring small and arboreal mammals by camera traps: Effectiveness and applications. Acta Theriol. 2013, 58, 279–283. [Google Scholar] [CrossRef]
- Steenweg, R.; Hebblewhite, M.; Kays, R.; Ahumada, J.; Fisher, J.T.; Burton, C.; Townsend, S.E.; Carbone, C.; Rowcliffe, J.M.; Whittington, J.; et al. Scaling-up camera traps: Monitoring the planet’s biodiversity with networks of remote sensors. Front. Ecol. Environ. 2017, 15, 26–34. [Google Scholar] [CrossRef]
- Tabak, M.A.; Norouzzadeh, M.S.; Wolfson, D.W.; Sweeney, S.J.; Vercauteren, K.C.; Snow, N.P.; Halseth, J.M.; Di Salvo, P.A.; Lewis, J.S.; White, M.D.; et al. Machine learning to classify animal species in camera trap images: Applications in ecology. Methods Ecol. Evol. 2019, 10, 585–590. [Google Scholar] [CrossRef]
- Ng, X.L.; Ong, K.E.; Zheng, Q.; Ni, Y.; Yeo, S.Y.; Liu, J. Animal kingdom: A large and diverse dataset for animal behavior understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 19023–19034. [Google Scholar]
- Von Ziegler, L.; Sturman, O.; Bohacek, J. Big behavior: Challenges and opportunities in a new era of deep behavior profiling. Neuropsychopharmacology 2021, 46, 33–44. [Google Scholar] [CrossRef] [PubMed]
- Astill, J.; Dara, R.A.; Fraser, E.D.; Roberts, B.; Sharif, S. Smart poultry management: Smart sensors, big data, and the internet of things. Comput. Electron. Agric. 2020, 170, 105291. [Google Scholar] [CrossRef]
- Hou, J.; He, Y.; Yang, H.; Connor, T.; Gao, J.; Wang, Y.; Zeng, Y.; Zhang, J.; Huang, J.; Zheng, B.; et al. Identification of animal individuals using deep learning: A case study of giant panda. Biol. Conserv. 2020, 242, 108414. [Google Scholar] [CrossRef]
- Li, Q.; Chu, M.; Kang, X.; Liu, G. Temporal aggregation network using micromotion features for early lameness recognition in dairy cows. Comput. Electron. Agric. 2023, 204, 107562. [Google Scholar] [CrossRef]
- Liu, M.S.; Gao, J.Q.; Hu, G.Y.; Hao, G.F.; Jiang, T.Z.; Zhang, C.; Yu, S. MonkeyTrail: A scalable video-based method for tracking macaque movement trajectory in daily living cages. Zool. Res. 2022, 43, 343. [Google Scholar] [CrossRef]
- Zhang, Y.J.; Luo, Z.; Sun, Y.; Liu, J.; Chen, Z. From beasts to bytes: Revolutionizing zoological research with artificial intelligence. Zool. Res. 2023, 44, 1115. [Google Scholar] [CrossRef]
- Qiao, Y.; Guo, Y.; Yu, K.; He, D. C3D-ConvLSTM based cow behaviour classification using video data for precision livestock farming. Comput. Electron. Agric. 2022, 193, 106650. [Google Scholar] [CrossRef]
- Sun, G.; Liu, T.; Zhang, H.; Tan, B.; Li, Y. Basic behavior recognition of yaks based on improved SlowFast network. Ecol. Inform. 2023, 78, 102313. [Google Scholar] [CrossRef]
- Wang, Y.; Li, R.; Wang, Z.; Hua, Z.; Jiao, Y.; Duan, Y.; Song, H. E3D: An efficient 3D CNN for the recognition of dairy cow’s basic motion behavior. Comput. Electron. Agric. 2023, 205, 107607. [Google Scholar] [CrossRef]
- Cheng, M.; Yuan, H.; Wang, Q.; Cai, Z.; Liu, Y.; Zhang, Y. Application of deep learning in sheep behaviors recognition and influence analysis of training data characteristics on the recognition effect. Comput. Electron. Agric. 2022, 198, 107010. [Google Scholar] [CrossRef]
- Fuentes, A.; Yoon, S.; Park, J.; Park, D.S. Deep learning-based hierarchical cattle behavior recognition with spatio-temporal information. Comput. Electron. Agric. 2020, 177, 105627. [Google Scholar] [CrossRef]
- Li, C.; Xiao, Z.; Li, Y.; Chen, Z.; Ji, X.; Liu, Y.; Feng, S.; Zhang, Z.; Zhang, K.; Feng, J.; et al. Deep learning-based activity recognition and fine motor identification using 2D skeletons of cynomolgus monkeys. Zool. Res. 2023, 44, 967. [Google Scholar] [CrossRef]
- Wang, M.; Xing, J.; Liu, Y. Actionclip: A new paradigm for video action recognition. arXiv 2021, arXiv:2109.08472. [Google Scholar] [CrossRef]
- Wang, M.; Xing, J.; Mei, J.; Liu, Y.; Jiang, Y. ActionCLIP: Adapting Language-Image Pretrained Models for Video Action Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 625–637. [Google Scholar] [CrossRef]
- Ma, Y.; Xu, G.; Sun, X.; Yan, M.; Zhang, J.; Ji, R. X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval (Version 2). arXiv 2022, arXiv:2207.07285. [Google Scholar]
- Schiappa, M.C.; Rawat, Y.S.; Shah, M. Self-Supervised Learning for Videos: A Survey. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar] [CrossRef]
- Fernandes, A.F.A.; Dórea, J.R.R.; Rosa, G.J.D.M. Image analysis and computer vision applications in animal sciences: An overview. Front. Vet. Sci. 2020, 7, 551269. [Google Scholar]
- Sharma, A.; Jain, A.; Gupta, P.; Chowdary, V. Machine learning applications for precision agriculture: A comprehensive review. IEEE Access 2020, 9, 4843–4873. [Google Scholar] [CrossRef]
- Gao, T.; Fisch, A.; Chen, D. Making pre-trained language models better few-shot learners. arXiv 2020, arXiv:2012.15723. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Liu, D.; Hou, J.; Huang, S.; Liu, J.; He, Y.; Zheng, B.; Ning, J.; Zhang, J. LoTE-Animal: A long time-span dataset for endangered animal behavior understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 20064–20075. [Google Scholar]
- Gherardi, F.C. Operative versus conceptual classification of animal behaviour. Hist. Philos. Life Sci. 1983, 5, 87–99. [Google Scholar]
- Packard, J.M.; Ribic, C.A. Classification of the behavior of sea otters (Enhydra lutris). Can. J. Zool. 1982, 60, 1362–1373. [Google Scholar] [CrossRef]
- Wang, S. Development of an automated transformer-based text analysis framework for monitoring fire door defects in buildings. Sci. Rep. 2025, 15, 43910. [Google Scholar] [CrossRef] [PubMed]
- Wang, S.; Moon, S.; Eum, I.; Hwang, D.; Kim, J. A text dataset of fire door defects for pre-delivery inspections of apartments during the construction stage. Data Brief 2025, 60, 111536. [Google Scholar] [CrossRef] [PubMed]





| Protection Level | Common Name | Scientific Name | Genus | Family | Order |
|---|---|---|---|---|---|
| I | Giant panda | Ailuropoda melanoleuca | Ailuropoda | Ailuropodidae | Carnivora |
| II | Red panda | Ailurus fulgens | Ailurus | Ailuridae | Carnivora |
| II | Yellow-throated marten | Martes flavigula | Martes | Mustelidae | Carnivora |
| Behavior Category | Description |
|---|---|
| Locomotor Behavior | The animal moves by coordinating its limbs in a specific alternating sequence, generally staggered side to side and back to front. |
| Feeding Behavior | The behavior of an animal acquiring and ingesting food through actions such as chewing, gnawing, or swallowing; for example, a giant panda feeding on bamboo leaves or drinking water. |
| Communication Behavior | Behavior in which an animal communicates information by sight, sound, smell, or physical contact. For example, giant pandas engage in activities such as smelling and urine signing. |
| Resting Behavior | Behavior of animals at rest for recovery and energy replenishment, usually involving lying down or inactivity. |
| Eliminative Behavior | The behavior of animals excreting feces or urine. |
| Miscellaneous Behavior | Other behaviors that do not fall into the above categories, such as exploring and playing. |
| Species | Behavior | Videos (Train) | Videos (Test) | Avg. Duration (s) | Med. Duration (s) | Resolution | Frame Rate (fps) |
|---|---|---|---|---|---|---|---|
| Giant panda | Walking | 85 | 22 | 4.2 | 3.8 | 1920 × 1080 | 30 |
| Giant panda | Smelling | 92 | 24 | 5.1 | 4.5 | 1920 × 1080 | 30 |
| Giant panda | Drink water | 18 | 5 | 6.3 | 5.9 | 1920 × 1080 | 30 |
| Giant panda | Resting | 95 | 24 | 8.5 | 7.2 | 1920 × 1080 | 30 |
| Giant panda | Circumanal gland signing | 45 | 12 | 3.8 | 3.5 | 1920 × 1080 | 30 |
| Giant panda | Exploratory | 70 | 18 | 7.2 | 6.8 | 1920 × 1080 | 30 |
| Giant panda | Feeding | 105 | 27 | 9.1 | 8.5 | 1920 × 1080 | 30 |
| Giant panda | Trotting | 60 | 15 | 2.5 | 2.3 | 1920 × 1080 | 30 |
| Giant panda | Urine signing | 50 | 13 | 4.0 | 3.7 | 1920 × 1080 | 30 |
| Giant panda | Defecating | 25 | 7 | 5.5 | 5.0 | 1920 × 1080 | 30 |
| Giant panda | Playing | 30 | 8 | 6.8 | 6.2 | 1920 × 1080 | 30 |
| Other species (red panda; yellow-throated marten) | All behaviors | ≈782 | ≈196 | ≈5.0 | ≈4.5 | 1920 × 1080 | 30 |
| Total/average | - | ≈1595 | ≈399 | ≈5.5 | ≈5.0 | 1920 × 1080 | 30 |
| Model | Source | Top-1 Acc (%) | Top-5 Acc (%) |
|---|---|---|---|
| I3D | CVPR2017 | 77.15 | 98.34 |
| SlowOnly | ICCV2019 | 74.13 | 97.03 |
| SlowFast | ICCV2019 | 78.54 | 97.55 |
| TPN | CVPR2020 | 73.34 | 94.32 |
| TimeSformer (spaceOnly) | ICML2021 | 71.70 | 96.67 |
| TimeSformer (jointST) | ICML2021 | 75.64 | 97.86 |
| TimeSformer (divST) | ICML2021 | 77.24 | 97.99 |
| ABCLIP (ours) | Ours | 82.50 ± 0.35 | 99.25 ± 0.15 |
| Method | Zero-Shot Accuracy (%) |
|---|---|
| Random Guess | 18.18 |
| CLIP zero-shot (pre-trained) | 42.15 ± 1.20 |
| ABCLIP (ours) | 58.00 ± 1.05 |
| Video Prompt | Top-1 Acc (%) | Top-5 Acc (%) |
|---|---|---|
| Spatiotemporal Prompt | 77.37 ± 0.52 | 97.02 ± 0.30 |
| Multi-Frame Integration (Mean Pooling) | 80.88 ± 0.41 | 97.80 ± 0.25 |
| Multi-Frame Integration (Transf) | 82.20 ± 0.38 | 98.89 ± 0.19 |
| Multi-Frame Integration (Similar-Transf) | 82.50 ± 0.35 | 99.25 ± 0.15 |
| Accuracy | Unimodality | Multimodality |
|---|---|---|
| Top-1 | 79.40 ± 0.45 | 82.50 ± 0.35 |
| Top-5 | 97.20 ± 0.28 | 99.25 ± 0.15 |
| Loss Function | Top-1 Accuracy (%) | Top-5 Accuracy (%) |
|---|---|---|
| Cross-entropy (CE) | 81.23 ± 0.41 | 98.05 ± 0.21 |
| InfoNCE | 81.89 ± 0.38 | 98.67 ± 0.18 |
| KL divergence (ours) | 82.50 ± 0.35 | 99.25 ± 0.15 |
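The sketch below illustrates a KL-divergence contrastive loss over the video–text similarity matrix, in the spirit of the loss compared in the table above: the ground truth is a soft distribution that allows several positives per row (e.g., two clips of the same behavior in one batch). The temperature, batch construction, and function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def kl_contrastive_loss(video_feats, text_feats, labels, temperature=0.07):
    """Symmetric KL loss between softmaxed similarities and soft ground-truth targets."""
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.t() / temperature                      # (B, B) similarity matrix

    # Soft targets: 1 where the pair shares a behavior label, row-normalized.
    targets = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    targets = targets / targets.sum(dim=1, keepdim=True)

    loss_v2t = F.kl_div(F.log_softmax(logits, dim=1), targets, reduction="batchmean")
    loss_t2v = F.kl_div(F.log_softmax(logits.t(), dim=1), targets, reduction="batchmean")
    return (loss_v2t + loss_t2v) / 2

# Toy usage: 4 clips, two of which share the same behavior class.
feats_v = torch.randn(4, 512)
feats_t = torch.randn(4, 512)
labels = torch.tensor([0, 1, 1, 2])
print(kl_contrastive_loss(feats_v, feats_t, labels))
```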

