Cross-Modal and Contrastive Optimization for Explainable Multimodal Recognition of Predatory and Parasitic Insects

Mingyu Liu; Liuxin Wang; Ruihao Jia; Shiyu Ji; Yalin Wu; Yuxin Wu; Luozehan Xie; Min Dong

doi:10.3390/insects16121187

¹ China Agricultural University, Beijing 100083, China

² National School of Development, Peking University, Beijing 100871, China

* Author to whom correspondence should be addressed.

Insects 2025, 16(12), 1187; https://doi.org/10.3390/insects16121187 (registering DOI)

This article belongs to the Special Issue Artificial Intelligence (AI) and Insect Pests Management: Securing Food Security, Human Health, and Natural Resources

Simple Summary

Accurate identification of natural enemies is essential for ecological pest management, yet traditional methods relying on visual inspection often fail under complex field conditions. This study proposes a multimodal recognition framework, MAVC-XAI, which integrates visual appearance and acoustic signals to improve the detection of key natural enemy species in agricultural ecosystems. The model not only achieves high recognition accuracy but also provides interpretable visualizations to reveal ecologically meaningful diagnostic features. By supporting real-time monitoring and decision-making, this approach offers a practical and intelligent tool for reducing pesticide use, enhancing biological control, and promoting sustainable agricultural production.

Abstract

Natural enemies play a vital role in pest suppression and ecological balance within agricultural ecosystems. However, conventional vision-based recognition methods are highly susceptible to illumination variation, occlusion, and background noise in complex field environments, making it difficult to accurately distinguish morphologically similar species. To address these challenges, a multimodal natural enemy recognition and ecological interpretation framework, termed MAVC-XAI, is proposed to enhance recognition accuracy and ecological interpretability in real-world agricultural scenarios. The framework employs a dual-branch spatiotemporal feature extraction network for deep modeling of both visual and acoustic signals, introduces a cross-modal sampling attention mechanism for dynamic inter-modality alignment, and incorporates cross-species contrastive learning to optimize inter-class feature boundaries. Additionally, an explainable generation module is designed to provide ecological visualizations of the model’s decision-making process in both visual and acoustic domains. Experiments conducted on multimodal datasets collected across multiple agricultural regions confirm the effectiveness of the proposed approach. The MAVC-XAI framework achieves an accuracy of 0.938, a precision of 0.932, a recall of 0.927, an F1-score of 0.929, an mAP@50 of 0.872, and a Top-5 recognition rate of 97.8%, all significantly surpassing unimodal models such as ResNet, Swin-T, and VGGish, as well as multimodal baselines including MMBT and ViLT. Ablation experiments further validate the critical contributions of the cross-modal sampling attention and contrastive learning modules to performance enhancement. The proposed framework not only enables high-precision natural enemy identification under complex ecological conditions but also provides an interpretable and intelligent foundation for AI-driven ecological pest management and food security monitoring.
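As a quick consistency check on the reported metrics, the short sketch below recomputes F1 as the harmonic mean of the stated precision and recall. It assumes that precision, recall, and F1 were all aggregated in the same way; if the paper reports macro-averaged per-class scores, the published F1 need not equal this value exactly. The function name `f1_from_precision_recall` is illustrative and does not come from the paper.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract for MAVC-XAI.
reported_precision = 0.932
reported_recall = 0.927
reported_f1 = 0.929

# Assumption: the three metrics share one aggregation scheme.
derived_f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(f"derived F1 = {derived_f1:.3f}, reported F1 = {reported_f1:.3f}")
```

Under that assumption the derived value agrees with the reported F1-score to three decimal places, so the three figures are mutually consistent.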

Keywords:

insect pest management; natural enemy recognition; multimodal deep learning; cross-species contrastive learning; ecological monitoring and interpretation
