Abstract
Accurate and timely pest monitoring is essential for sustainable agriculture and effective crop protection. While recent deep-learning-based pest recognition systems have significantly improved accuracy, they are typically trained for fixed label sets and narrowly defined tasks. In this paper, we present RefPestSeg, a universal, language-promptable segmentation model designed specifically for pest monitoring. RefPestSeg can segment targets at any semantic level, such as species, genus, life stage, or damage type, conditioned on flexible natural language instructions. The model adopts a symmetric architecture with self-attention and cross-attention mechanisms that tightly align visual features with language embeddings in a unified feature space. To further enhance performance in challenging field conditions, we integrate an optimized super-resolution module to improve image quality and employ diverse data augmentation strategies to enrich the training distribution. A lightweight postprocessing step refines segmentation masks by suppressing highly overlapping regions and removing noise blobs introduced by cluttered backgrounds. Extensive experiments on a challenging pest dataset show that RefPestSeg achieves an Intersection over Union (IoU) of 69.08 while maintaining robustness in real-world scenarios. By enabling language-guided pest segmentation, RefPestSeg is a step toward more intelligent, adaptable monitoring systems that can respond to real-time agricultural demands without costly model retraining.
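
To make the evaluation metric and the postprocessing step concrete, the following is a minimal sketch, not the RefPestSeg implementation. It assumes binary masks stored as equal-sized lists of 0/1 rows, greedy score-ordered suppression of overlapping masks, and 4-connected blob removal. The function names, the thresholds `iou_thresh` and `min_area`, and the per-mask `scores` are hypothetical choices made for illustration only.

```python
from collections import deque


def mask_iou(a, b):
    """Intersection over Union of two binary masks (lists of 0/1 rows)."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Two empty masks have no union; report 0.0 by convention (assumption).
    return inter / union if union else 0.0


def suppress_overlaps(masks, scores, iou_thresh=0.5):
    """Greedy suppression: visit masks by descending score and keep a mask
    only if its IoU with every already-kept mask is at most iou_thresh.
    Returns the indices of the kept masks."""
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return kept


def remove_small_blobs(mask, min_area=3):
    """Return a copy of mask without 4-connected components that contain
    fewer than min_area pixels (treated here as background noise)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill collects one connected component.
            queue, blob = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(blob) >= min_area:
                for y, x in blob:
                    out[y][x] = 1
    return out
```

In this sketch, a predicted mask would first pass through `remove_small_blobs`, the surviving masks would be filtered with `suppress_overlaps`, and `mask_iou` against the ground-truth mask would give the per-image score. Whether the reported IoU is averaged per image or accumulated over the dataset is not stated here, so the sketch does not assume either.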