Article

RDM-YOLO: A Lightweight Multi-Scale Model for Real-Time Behavior Recognition of Fourth Instar Silkworms in Sericulture

School of Electrical and Information Engineering, Jiangsu University, Zhenjiang 212013, China
*
Author to whom correspondence should be addressed.
Agriculture 2025, 15(13), 1450; https://doi.org/10.3390/agriculture15131450
Submission received: 28 May 2025 / Revised: 16 June 2025 / Accepted: 3 July 2025 / Published: 5 July 2025
(This article belongs to the Section Digital Agriculture)

Abstract

Accurate behavioral monitoring of silkworms (Bombyx mori) during the fourth instar development is crucial for enhancing productivity and welfare in sericulture operations. Current manual observation paradigms face critical limitations in temporal resolution, inter-observer variability, and scalability. This study presents RDM-YOLO, a computationally efficient deep learning framework derived from the YOLOv5s architecture, specifically designed for the automated detection of three essential behaviors (resting, wriggling, and eating) in fourth instar silkworms. Methodologically, Res2Net blocks are first integrated into the backbone network to enable hierarchical residual connections, expanding receptive fields and improving multi-scale feature representation. Second, standard convolutional layers are replaced with distribution shifting convolution (DSConv), leveraging dynamic sparsity and quantization mechanisms to reduce computational complexity. Additionally, the minimum point distance intersection over union (MPDIoU) loss function is introduced to enhance bounding box regression efficiency, mitigating challenges posed by overlapping targets and positional deviations. Experimental results demonstrate that RDM-YOLO achieves 99% mAP@0.5 and 150 FPS inference speed on the constructed dataset, significantly outperforming the baseline YOLOv5s while reducing the model parameters by 24%. Specifically designed for deployment on resource-constrained devices, the model ensures real-time monitoring capabilities in practical sericulture environments.

1. Introduction

The domesticated silkworm (Bombyx mori) undergoes a complete metamorphosis comprising four life stages, namely: egg, larva (instar), pupa, and adult moth. During the larval phase, silkworms progress through five instars, with each stage characterized by molting and rapid growth. The fourth instar is a critical developmental juncture, during which larvae exhibit intensive feeding and physiological changes that directly impact cocoon quality and silk yield. Notably, silkworm pupae form within cocoons after the larval stage, representing a valuable byproduct in the silk-reeling process [1,2]. Compared to other instars, fourth instar larvae exhibit stable physical characteristics (e.g., body length of 18–22 mm, segmental pigmentation), facilitating the training and validation of automated recognition models. During this period, subtle behavioral patterns (e.g., resting, wriggling, and eating) are vital indicators of physiological health and environmental adaptability. Notably, prolonged rest periods might indicate suboptimal temperature or humidity conditions, while irregular feeding could signal nutritional deficiencies or increased susceptibility to diseases [3,4]. Therefore, timely and accurate monitoring of these behaviors is essential for optimizing husbandry practices, mitigating production losses, and maximizing silk production efficiency.
Traditional monitoring approaches require continuous tracking of individual behaviors, with a single experiment often spanning an extended period, which hinders large-scale phenotypic analysis. Meanwhile, manual observation is susceptible to variations in researcher experience [5,6], leading to low data consistency. Compounding these issues, human fatigue and perceptual biases frequently result in delayed interventions and potential economic losses. Consequently, there is an urgent need for automated, objective real-time monitoring systems to modernize sericulture practices and enhance operational efficiency.
Recent advancements in computer vision and deep learning have significantly transformed the analysis of animal behavior across multiple agricultural disciplines. Transformer-based models, such as vision transformers (ViTs), have substantially enhanced the capability of spatio-temporal feature modeling via the global attention mechanism [7]. Meanwhile, dual-path architectures, like SlowFast, have exhibited superior performance in motion modeling for behavior recognition tasks [8]. Additionally, Long Short-Term Memory (LSTM) networks remain widely adopted for temporal behavior analysis [9]. Nevertheless, these models typically demand substantial computational resources, making their deployment on edge devices in sericulture production challenging.
For livestock monitoring, You Only Look Once (YOLO) models have emerged as state-of-the-art solutions for posture and activity recognition [10,11]. Huang et al. [12] proposed a high-efficiency YOLO model to enhance pig behavior recognition [13], achieving a mean average precision (mAP) of 99.25% for standing, 98.41% for sitting, and 94.43% for recumbent postures. Similarly, Wang et al. [14] optimized YOLOv8n [15,16,17] by replacing the CIoU [18] loss with the normalized Wasserstein distance (NWD) loss [19,20], reducing the sensitivity to positional deviations in dairy cow detection. Their model achieved 93.90% accuracy while incorporating a temporal attention module into the backbone architecture to improve focus on dynamic cow behaviors [21].
The application of computer vision and deep learning has significantly reshaped animal behavior analysis throughout multiple agricultural fields, notably within entomological studies. Romano highlights novel automation, AI, and biomimetic engineering advancements for insect studies and management, emphasizing the potential of deep learning to revolutionize real-time monitoring and pest control [22]. Kariyanna & Sowjanya provide a comprehensive review of artificial intelligence applications in insect pest management, noting that deep learning models have been successfully applied to detect insect species, predict infestation patterns, and optimize pesticide application [23].
In sericulture, deep learning has exhibited substantial potential for silkworm monitoring. Wen et al. [24] adapted an improved YOLOv4 model [25,26,27] for silkworm detection by substituting the backbone network with MobileNetV3 [28,29], achieving notable improvements in detection accuracy. Tao et al. [30] utilized convolutional neural networks (CNNs) [31,32] for silkworm pupae classification, demonstrating strong generalization ability and achieving a classification accuracy of 98.5%. Xiong et al. [33] utilized the YOLOv4 model for silkworm egg detection, achieving a notable detection accuracy of 98.5%. However, existing YOLO-based models are predominantly designed for static trait analysis (e.g., silkworm pupae classification) and exhibit constrained adaptability to dynamic behaviors in crowded rearing environments. Despite these technological advancements, the application of deep learning to silkworm behavioral analysis remains largely underexplored, thereby offering significant opportunities for innovation in real-time monitoring system development. Thus, automation is not merely a technical substitution but a key solution to address the inefficiency of data acquisition and the standardization of phenotypic analysis in silkworm biology research.
YOLOv5s was chosen for its proven technical adaptability within the specific context of silkworm research. This suitability is validated by three key strengths, namely: its real-time processing capability, crucial for accurately capturing the silkworms’ dynamic movements and activities; its robustness against the inherent complexities of typical silkworm rearing environments, like occlusions by mulberry leaves; and its demonstrated superior performance on biological detection tasks similar in nature to tracking and analyzing silkworms. Therefore, this study presents RDM-YOLO, an improved YOLOv5s [34,35] variant specifically designed for the real-time detection of dynamic behaviors in fourth instar silkworms. Experimental results demonstrate that RDM-YOLO surpasses the baseline YOLOv5s model in both detection accuracy and inference efficiency, achieving a 24% reduction in model parameters, which facilitates its deployment on resource-constrained edge devices. This research not only advances the technical capabilities for dynamic behavior analysis in sericulture but also aligns with the core objectives of precision agriculture by enabling data-driven optimization of silkworm rearing management to enhance silk production efficiency. The overall workflow of the proposed methodology is illustrated in Figure 1.

2. Materials and Methods

2.1. Overview of the Model Improvements

The YOLOv5s model serves as a representative detection framework within the YOLOv5 series, effectively meeting the demands for efficient recognition of silkworm behaviors. The architecture of YOLOv5s comprises an input layer, a feature extraction layer, a feature fusion layer, and detection heads. Initially, mosaic data augmentation is employed at the input layer to enhance dataset diversity. Furthermore, the feature extraction layer serves to extract task-relevant features and transmit them to downstream layers for subsequent processing. Additionally, feature fusion plays a critical role in bridging feature extraction layers and detection heads, thereby enhancing the effective utilization of extracted features. Finally, the detection heads generate predictions from three feature layers operating at different scales.
To enable efficient and accurate real-time recognition of fourth instar silkworm behaviors, this paper proposes RDM-YOLO, a lightweight architecture derived from YOLOv5s, with its hierarchical network structure visualized in Figure 2.
To begin with, the integration of Res2Net blocks with nested residual connections in the backbone network establishes a hierarchical multi-scale feature representation [36,37]. This mechanism enables the model to concurrently capture fine-grained texture details (e.g., silkworm segmental pigmentation) and macro-level morphological patterns, significantly enhancing its ability to discern small silkworms and resolve overlapping targets in dense rearing conditions. Building upon this feature extraction capability, the architecture further incorporates distribution shifting convolution (DSConv) [38] to structurally optimize computational efficiency, achieving a 24% parameter reduction through dynamic sparse channel quantization. Finally, the minimum point distance intersection over union (MPDIoU) loss function refines bounding box regression by minimizing centroid deviations and spatial conflicts in overlapping scenarios [39,40]. This optimization ensures robust performance when silkworms are occluded by mulberry leaves or clustered together, a critical limitation of existing YOLO-based models.

2.2. Res2Net Module

Res2Net is a hierarchical multi-scale backbone architecture designed to enhance feature representation through multi-granularity residual connections. Distinct from conventional residual blocks, Res2Net partitions input features into s subsets (s > 1) through channel splitting after the initial 1 × 1 convolution. The subsets are denoted by xi, where i ∈ {1, 2, …, s}.
As depicted in Figure 3, the Res2Net architecture creates hierarchical residual connections to expand the receptive field of each network layer. The bottleneck block serves as a fundamental component in the Res2Net module, playing a critical role in enabling hierarchical multi-scale feature representation. As illustrated in Figure 3a, the bottleneck block typically consists of 1 × 1 convolution layers for dimensionality reduction and expansion, sandwiching a 3 × 3 convolution layer for feature extraction. This design reduces computational complexity while maintaining feature discriminability.
In the context of RDM-YOLO, the bottleneck block is integrated into the Res2Net architecture (Figure 3b) to partition input features into subsets, allowing each subset to undergo hierarchical residual connections. By enabling feature reuse and multi-granularity fusion, the bottleneck block enhances the model’s ability to capture fine-grained textures and global morphological patterns, which is crucial for detecting small silkworms and resolving overlapping targets in dense rearing environments. This design ensures efficient receptive field expansion without significant parameter overhead, contributing to the model’s lightweight and real-time performance.
For all subsets excluding x1, each subsequent subset xi is subjected to a respective 3 × 3 convolutional operation designated as Ki, with the output of Ki denoted as Yi [36]. Specifically, the feature subset xi is concatenated with the output of the preceding convolutional operation Ki−1 and then fed into Ki. Mathematically, Yi can be expressed as follows:
$$
Y_i =
\begin{cases}
x_i, & i = 1; \\
K_i(x_i), & i = 2; \\
K_i(x_i + Y_{i-1}), & 2 < i \le s.
\end{cases}
\tag{1}
$$
Notably, each 3 × 3 convolutional operator Ki can access feature information from all feature splits. Owing to the combinatorial explosion effects, the Res2Net module generates outputs with a wide variety of receptive field sizes and combinations, enabling more effective feature fusion.
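For concreteness, the following minimal PyTorch sketch reproduces the hierarchical split of Equation (1); the class name, the scale value of 4, and the omission of the surrounding 1 × 1 convolutions, batch normalization, and activations are simplifications for illustration rather than the exact RDM-YOLO implementation.

```python
# Minimal sketch of the Res2Net hierarchical split in Equation (1).
# Assumes the channel count is divisible by the scale s; names are illustrative.
import torch
import torch.nn as nn

class Res2NetSplit(nn.Module):
    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        assert channels % scale == 0, "channels must be divisible by scale"
        self.scale = scale
        width = channels // scale
        # One 3x3 convolution K_i per subset, excluding the first (identity) subset.
        self.convs = nn.ModuleList([
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
            for _ in range(scale - 1)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        subsets = torch.chunk(x, self.scale, dim=1)   # x_1 ... x_s
        outputs = [subsets[0]]                        # Y_1 = x_1
        prev = None
        for i, conv in enumerate(self.convs, start=1):
            # Y_2 = K_2(x_2); Y_i = K_i(x_i + Y_{i-1}) for 2 < i <= s
            prev = conv(subsets[i] if prev is None else subsets[i] + prev)
            outputs.append(prev)
        return torch.cat(outputs, dim=1)              # multi-scale fusion by concatenation
```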
In the C3 modules of the backbone, standard bottleneck blocks are replaced with Res2Net blocks. Specifically, in the feature layers at the 80 × 80, 40 × 40, and 20 × 20 resolutions, the inner 3 × 3 convolutional bottlenecks are substituted with Res2Net blocks. This modification preserves spatial resolution while enhancing multi-scale feature fusion.
It is crucial to ensure that the input and output channel dimensions of the Res2Net blocks align with those of the original bottleneck blocks in order to maintain the hierarchical structure of features within the backbone. The residual connections inherent in Res2Net effectively retain the skip connections present in the C3 module, thereby preventing information loss during feature extraction.
The cascaded hierarchical structure of the Res2Net block boosts receptive field diversity, capturing both fine-grained morphological details and global motion dynamics. By reusing intermediate features, Res2Net demonstrates superior computational efficiency compared to parallel multi-branch architectures, thereby reducing redundant parameters without compromising multi-scale representation capabilities. Additionally, hierarchical residual connections maintain contextual relationships among overlapping silkworms, enhancing occlusion robustness. Together, these features enable accurate detection of subtle behaviors and ensure real-time performance on resource-constrained devices.

2.3. Distribution Shifting Convolution Module

Distribution shifting convolution (DSConv) enhances computational efficiency through a flexible quantization mechanism. It decomposes the standard convolution kernel into two components, namely a variable quantized kernel (VQK) and distribution shifts. The VQK, composed of integer values with variable bit lengths, enables efficient multiplication and better memory management. The kernel distribution shifter (KDS) and channel distribution shifter (CDS) then align the distribution of the VQK with that of the original kernel via scaling and biasing operations. This decomposition and quantization strategy substantially enhances the model’s lightweight characteristics and real-time inference performance, as shown in Figure 4.
During the DSConv quantization procedure, two parallel computational pathways are executed [41]. The first pathway involves computing the dot product between the VQK and individual segments of the mantissa tensor. Concurrently, the KDS undergoes element-wise summation with the exponent tensor derived from the original convolution operation. These intermediate results are subsequently combined through multiplicative fusion to generate the final output tensor. The weight quantization mechanism operates through a two-stage transformation: Initially, all convolutional layer weights undergo linear scaling normalization, followed by nearest-integer quantization to produce discrete values (denoted as wq). These quantized weights are systematically encoded into the VQK structure, whose valid numerical ranges are mathematically defined as follows:
$$
\left\{\, w_q \in \mathbb{Z},\ b \in \mathbb{N} \;\middle|\; -2^{\,b-1} \le w_q \le 2^{\,b-1} - 1 \,\right\}
\tag{2}
$$
In this equation, wq denotes the parameter value within the tensor, while b represents the number of bits allocated for the network input [42]. The KDS establishes a tensor correspondence between the VQK and the original convolution through a dual-domain optimization framework. This process implements cross-domain tensor alignment by minimizing the relative entropy between the original weight distribution and the KDS parameterization, where both tensors are initialized through entropy-constrained optimization to achieve minimal distributional divergence. The operational equivalence is mathematically formalized as a functional equivalence between the permuted VQK configuration and the softmax-transformed original distribution, with the block-wise KDS derivation formally expressed as follows:
$$
\xi = \min_{\xi} \sum_{j} T_j \log\!\left(\frac{T_j}{I_j}\right)
\tag{3}
$$
The softmax of the original weight distribution Tj and the softmax of the scaled VQK Ij are mathematically formulated in the following equations, respectively:
$$
T_j = \frac{e^{w_j}}{\sum_{i} e^{w_i}}
\tag{4}
$$
$$
I_j = \frac{e^{\hat{\xi}\, w_{q_j}}}{\sum_{i} e^{\hat{\xi}\, w_{q_i}}}
\tag{5}
$$
In this formulation, j ∈ Z+ denotes the convolutional kernel index in the j-th block of the current layer, while i ∈ Z+ represents the index of the i-th layer. The symbol $\hat{\xi}$ denotes the bias term with a default value of 0, while wqj and wqi represent the initial values of the VQK parameters and the softmax output of the current block, respectively.
Through this synergistic mechanism between KDS and VQK, the convolutional quantization achieves parametric equivalence with the original convolution operation. The co-optimization of these components through the aforementioned optimization framework enables precise weight discretization while preserving representational fidelity.
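As a rough illustration of the weight discretization constrained by Equation (2), the sketch below scales a weight tensor, rounds it to the nearest integer, and clamps it to the b-bit range; the per-tensor max-absolute scaling and the function name are assumptions for clarity, not the exact block-wise DSConv procedure with learned KDS/CDS shifters.

```python
# Illustrative b-bit quantization consistent with the VQK range in Equation (2).
# DSConv itself operates block-wise with learned distribution shifters; this
# per-tensor variant is only a simplified stand-in.
import torch

def quantize_to_vqk(weights: torch.Tensor, bits: int = 4):
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = weights.abs().max() / qmax            # scaling factor aligning value ranges
    wq = torch.clamp(torch.round(weights / scale), qmin, qmax)
    return wq.to(torch.int8), scale               # integer VQK plus its scaling term
```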
The YOLOv5s architecture makes extensive use of C3 modules, which are bottleneck structures designed for feature fusion and downsampling. A standard C3 module combines convolutional layers, batch normalization, and activation functions to process features from different scales. The key modification involves substituting the conventional 3 × 3 convolutional layers within C3 with DSConv to achieve lightweight optimization.
To integrate DSConv, we first replace part of the 3 × 3 convolutional layers in the bottleneck blocks of the residual branch with DSConv layers, implemented with dynamic sparse quantization to decompose each kernel into a VQK and distribution shifters. Meanwhile, we ensure that the DSConv layers maintain the same input and output channel dimensions as the original 3 × 3 convolutions to preserve the C3 module's feature flow. Additionally, we retain the skip connections in the C3 module so that hierarchical feature fusion remains unaffected by the convolution replacement.
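A hedged sketch of this module swap is shown below; it assumes the Ultralytics YOLOv5 layout in which a C3 module stores its bottlenecks in the attribute m and each bottleneck wraps its 3 × 3 convolution in cv2.conv, and DSConv stands for any implementation of the operator in [42]. Attribute names may differ between YOLOv5 versions.

```python
# Sketch: replace the inner 3x3 convolutions of a C3 module's bottlenecks with DSConv.
# `dsconv_factory` is any callable returning a DSConv layer with a Conv2d-like signature.
import torch.nn as nn

def replace_bottleneck_convs(c3_module: nn.Module, dsconv_factory) -> None:
    for bottleneck in c3_module.m:                    # C3 keeps its bottlenecks in .m
        conv = bottleneck.cv2.conv                    # the 3x3 nn.Conv2d inside the Conv wrapper
        if conv.kernel_size == (3, 3):
            bottleneck.cv2.conv = dsconv_factory(
                conv.in_channels, conv.out_channels,  # channel dimensions are preserved
                kernel_size=3,
                stride=conv.stride,
                padding=conv.padding,
            )
```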

2.4. Minimum Point Distance Intersection over Union Module

In the baseline YOLOv5s model, CIoU is adopted as the IoU loss function. Derived from DIoU [43], CIoU incorporates the aspect ratio between the predicted and ground-truth bounding boxes to accelerate prediction frame regression. Nonetheless, this approach does not directly model the absolute width/height discrepancies between bounding boxes or account for their impacts on confidence score predictions, potentially leading to suboptimal training convergence. To address the regression challenges in both overlapping and non-overlapping bounding box scenarios, MPDIoU is introduced to optimize the similarity measurement between bounding boxes. By systematically integrating geometric discrepancies into the regression process, this approach enhances training efficiency and model stability. The distances between the top-left points and between the bottom-right points of the predicted bounding box A and the ground-truth bounding box B are calculated, respectively, as follows:
$$
d_1^2 = \left(x_1^B - x_1^A\right)^2 + \left(y_1^B - y_1^A\right)^2
\tag{6}
$$

$$
d_2^2 = \left(x_2^B - x_2^A\right)^2 + \left(y_2^B - y_2^A\right)^2
\tag{7}
$$
where $(x_1^A, y_1^A)$ and $(x_2^A, y_2^A)$ respectively represent the coordinates of the top-left and bottom-right points of the predicted box A; $(x_1^B, y_1^B)$ and $(x_2^B, y_2^B)$ respectively represent the coordinates of the top-left and bottom-right points of the ground-truth box B; and w and h are the width and height of the bounding box [44].
The IoU between boxes A and B, given in Equation (8), is defined as follows:

$$
\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}
\tag{8}
$$

Based on Equation (8), the MPDIoU can be formulated as follows:

$$
\mathrm{MPDIoU} = \mathrm{IoU} - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2}
\tag{9}
$$
The following is a formal explanation of how MPDIoU improves bounding box regression for overlapping silkworms. Suppose the body parts of two silkworms overlap (analogous to two individuals overlapping their hands)—traditional intersection over union (IoU) may treat them as a single entity due to the extensive overlapping region. In contrast, MPDIoU evaluates the edge positions (corner points) of the overlapping areas. For instance, if the predicted upper-left corner of the bounding box deviates from its true position (e.g., marking a point that should correspond to the silkworm’s head on its body instead), MPDIoU will penalize the model accordingly. This process compels the model to adjust the bounding box until the corner points align with the actual boundaries of the silkworm bodies. Consequently, this mechanism enables the model to discern the distinct boundaries of each silkworm body even in overlapping regions, akin to human visual perception.
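The corner-distance penalty can be made concrete with a small NumPy sketch of Equations (6)–(9); boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates, and norm_w and norm_h are the normalizing dimensions described in the text.

```python
# Minimal NumPy sketch of MPDIoU (Equations (6)-(9)); not the training loss code itself.
import numpy as np

def mpdiou(box_a: np.ndarray, box_b: np.ndarray, norm_w: float, norm_h: float) -> float:
    # Intersection and union areas for the plain IoU term (Equation (8)).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    iou = inter / (area_a + area_b - inter)
    # Squared distances between matching top-left and bottom-right corners (Equations (6)-(7)).
    d1_sq = (box_b[0] - box_a[0]) ** 2 + (box_b[1] - box_a[1]) ** 2
    d2_sq = (box_b[2] - box_a[2]) ** 2 + (box_b[3] - box_a[3]) ** 2
    norm = norm_w ** 2 + norm_h ** 2
    return iou - d1_sq / norm - d2_sq / norm       # Equation (9)

# Example: a prediction whose corners deviate from the ground truth is penalized
# even when the overlap (IoU) is large.
gt = np.array([10.0, 10.0, 60.0, 40.0])
pred = np.array([14.0, 12.0, 64.0, 42.0])
print(mpdiou(pred, gt, norm_w=640, norm_h=640))
```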
To integrate the MPDIoU loss function into the YOLOv5 head, it is necessary to replace the original CIoU/DIoU loss utilized in the bounding box regression stage. Initially, one must calculate the minimum point distances between the predicted and ground-truth bounding boxes, which include both top-left and bottom-right point distances. Subsequently, these distances should be incorporated into the IoU calculation to formulate MPDIoU. It is essential that this loss function is applied during the training phase to optimize bounding box regression effectively. The integration of the MPDIoU loss function introduces three fundamental improvements for silkworm behavior detection. First, overlap-related robustness is enhanced by directly penalizing corner misalignments using minimum point distances, which effectively discriminates adjacent silkworms in high-density rearing scenes. Second, experimental results show that the number of training epochs is reduced by 15% relative to CIoU, with model optimization significantly accelerated. Finally, scale invariance is ensured via normalization with the diagonal length of the smallest enclosing box, maintaining consistent detection accuracy across varying silkworm sizes. As presented in Table 1, MPDIoU outperforms baseline loss functions in crowded environments, achieving a superior mAP@0.5 of 98.5%.

3. Experimental Setup

3.1. Dataset

The experimental dataset was systematically compiled in the Agricultural Information Intelligent Detection Laboratory at Jiangsu University. Silkworms were reared under strictly controlled environmental parameters, as specified in Table 2.
Video recordings of fourth instar silkworms were acquired using a Nikon D5600 DSLR camera (Nikon Corporation, Tokyo, Japan) positioned 20 cm above the rearing trays. The camera operated at a resolution of 1920 × 1080 pixels and a frame rate of 30 frames per second, as shown in Figure 5. A total of 1200 min of video were collected, capturing various silkworm behaviors, such as resting, wriggling, and feeding.
Following rigorous selection, the raw video data were sampled at 40-frame intervals to extract 400 static images, balancing temporal diversity and minimizing redundant sampling. Each image was then labeled using the LabelMe tool (v1.8.6) to define bounding boxes and class labels (0: resting; 1: wriggling; 2: eating), with annotations saved in the YOLO format (txt files specifying class indices and bounding box coordinates). Preliminary statistics showed that the original class distribution was clearly imbalanced. Therefore, data augmentation techniques, including random flipping, rotation, and brightness adjustment, were applied to enhance model robustness: rotation improves angle invariance, brightness adjustment introduces lighting variability, and random flipping promotes scale-invariant feature learning. After augmentation, class 0 reached 14,000 labels, class 1 had 4500, and class 2 had 3000, yielding a more balanced distribution. This helped the model learn the features of each class more comprehensively during training, improved its generalization ability and detection accuracy across classes, and thus provided a more reliable basis for the subsequent experiments. The final dataset of 2400 images was randomly split into a training set (1680 images, 70%), a validation set (480 images, 20%), and a test set (240 images, 10%), as depicted in Figure 6.
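A minimal OpenCV sketch of the three augmentation operations is given below; the rotation and brightness ranges are illustrative assumptions rather than the exact values used in this study, and the geometric transforms would also have to be applied to the corresponding bounding-box labels.

```python
# Illustrative augmentation: random horizontal flip, rotation, and brightness scaling.
import cv2
import numpy as np

def augment(image: np.ndarray) -> np.ndarray:
    if np.random.rand() < 0.5:
        image = cv2.flip(image, 1)                          # random horizontal flip
    angle = np.random.uniform(-15, 15)                      # random rotation angle (degrees)
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    image = cv2.warpAffine(image, m, (w, h))
    factor = np.random.uniform(0.8, 1.2)                    # random brightness adjustment
    return np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)
```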

3.2. Implementation Details

The model was implemented in Python 3.9 with PyTorch 1.11.0 and CUDA 11.4. Key dependencies included OpenCV 4.7.0 for image processing and NumPy 1.23.5 for array operations. Hyperparameter settings were as follows: a batch size of 16, 150 training epochs, and optimization via the adaptive moment estimation (Adam) algorithm with an initial learning rate of 0.01, momentum of 0.937, and weight decay of 0.0005. All ablation and comparative experiments were conducted on a workstation equipped with an Intel Core i7-9750H, 64 GB RAM, and an NVIDIA GeForce RTX 3080 Ti GPU. Detailed implementation parameters are provided in Table 3.
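A hedged sketch of the optimizer configuration in Table 3 is shown below; the stand-in module is only a placeholder for the actual detector, and mapping the momentum value to Adam's beta1 follows common YOLOv5 practice and is an assumption here.

```python
# Adam configuration matching Table 3 (learning rate 0.01, momentum 0.937, weight decay 0.0005).
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, kernel_size=3)      # placeholder for the RDM-YOLO network
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.01,                                 # initial learning rate
    betas=(0.937, 0.999),                    # momentum of 0.937 mapped to beta1
    weight_decay=0.0005,
)
```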

3.3. Evaluation Metrics

In this study, the performance of the RDM-YOLO model was evaluated using standard metrics, including precision (P), recall (R), mean average precision (mAP), and frames per second (FPS) [45].
Precision (P) measures the proportion of true positive detections among all positive predictions, while recall (R) quantifies the ratio of true positive detections to all actual positive instances. Their formulations are provided in Equation (10) and Equation (11), respectively, where TP denotes true positives, FP denotes false positives, and FN denotes false negatives.
$$
\mathrm{Precision} = \frac{TP}{TP + FP}
\tag{10}
$$

$$
\mathrm{Recall} = \frac{TP}{TP + FN}
\tag{11}
$$
Average precision (AP) integrates precision and recall to comprehensively assess model performance across varying confidence thresholds. The mean average precision (mAP) is derived as the average of the APs across all categories, serving as an overall metric for multi-class detection performance. Specifically, mAP@0.5 denotes the mean average precision evaluated at an intersection over union (IoU) threshold of 0.5, while mAP@0.5:0.95 averages the precision over IoU thresholds from 0.5 to 0.95. The calculations for AP and mAP are presented in Equation (12) and Equation (13), where n denotes the number of categories.
$$
\mathrm{AP} = \int_{0}^{1} P(R)\, \mathrm{d}R
\tag{12}
$$

$$
\mathrm{mAP} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{AP}_i
\tag{13}
$$
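For reference, a small NumPy sketch of Equations (10)–(13) is given below; the trapezoidal integration of the precision–recall curve is one common way to approximate the AP integral and is used here for illustration only.

```python
# Illustrative computation of precision, recall, AP, and mAP (Equations (10)-(13)).
import numpy as np

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    order = np.argsort(recalls)                       # integrate P(R) over recall in [0, 1]
    return float(np.trapz(precisions[order], recalls[order]))

def mean_average_precision(per_class_ap: list) -> float:
    return float(np.mean(per_class_ap))               # average of APs over the n classes
```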
In addition to these metrics, FPS was used to evaluate inference speed, defined as the number of image frames processed per second. Its calculation is given by the following Equation (14):
$$
\mathrm{FPS} = \frac{1000}{\mathrm{PreTime} + \mathrm{InferTime} + \mathrm{NMSTime}}
\tag{14}
$$
where PreTime is the image preprocessing duration, InferTime is the network inference time, and NMSTime is the non-maximum suppression optimization time for prediction frames.
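Since all three timing components in Equation (14) are measured in milliseconds, the constant 1000 converts the per-frame latency into frames per second, as the short sketch below illustrates with assumed timings.

```python
# FPS from per-frame latency components (Equation (14)); the timings are hypothetical.
def frames_per_second(pre_ms: float, infer_ms: float, nms_ms: float) -> float:
    return 1000.0 / (pre_ms + infer_ms + nms_ms)

print(frames_per_second(1.2, 4.5, 1.0))   # ~149 FPS for 6.7 ms total latency
```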

4. Results

4.1. RDM-YOLO Ablation Study

In this section, a series of ablation experiments were conducted using YOLOv5s as the baseline model to systematically identify optimal architectural improvements. Training parameters were rigorously controlled across all experiments to ensure comparability. Specifically, the baseline YOLOv5s was incrementally augmented with Res2Net, DSConv, and MPDIoU to evaluate their individual and cumulative impacts on detection performance, model efficiency, and real-time capability.
The loss curves for the training and validation phases of the baseline YOLOv5s model are depicted in Figure 7. During the initial training epochs, both the training (train/box_loss, train/cls_loss, train/dfl_loss) and validation (val/box_loss, val/cls_loss, val/dfl_loss) losses start at relatively high values. For instance, at epoch 1, the train/box_loss is 4.6704 and the val/box_loss is 6.0327. As training progresses, the training losses steadily decline; by epoch 5, the train/box_loss drops to 1.468. The validation losses follow a similar decreasing trend, with some fluctuations. The decrease in both training and validation losses indicates that the model learns effectively and generalizes well to unseen validation data. This behavior of the loss curves provides evidence of a stable training process and the model's potential for good performance on the tasks of this study.
The results of the ablation studies are detailed in Table 4, which also analyzes the contribution of each module to the overall performance enhancement; RDM-YOLO demonstrates comprehensive superiority in detection performance (mAP@0.5), model efficiency (parameters), and real-time capability (FPS) over both the baseline YOLOv5s and the other improved variants. The baseline YOLOv5s achieves a mAP@0.5 of 98% with 6.7 M parameters and 135 FPS, while incremental improvements are observed in the modified models: YOLOv5s+Res2Net slightly increases mAP@0.5 to 98.1%, accompanied by reduced parameters and a marginal increase in FPS to 138; YOLOv5s+DSConv attains a notable mAP@0.5 of 98.6%, reflecting a 0.6% improvement alongside a significant parameter reduction and faster inference; and YOLOv5s+MPDIoU reaches 98.5% mAP@0.5, albeit with a slight compromise on speed.
In contrast, RDM-YOLO outperforms all models with an impressive mAP@0.5 of 99%, utilizing only 5.4 M parameters while achieving a frame rate of 150 FPS, establishing itself as the most balanced and advanced model among those tested. Figure 8 visualizes the trade-offs between model size, detection accuracy, and inference speed across architectures. These results underscore that RDM-YOLO achieves state-of-the-art performance by leveraging multi-scale feature extraction, lightweight convolution, and optimized bounding box regression.

4.2. Comparison Between Baseline YOLOv5s and RDM-YOLO

In this section, the detection results of YOLOv5s and RDM-YOLO for several complex silkworm-rearing scenarios are compared to evaluate the detection performance of the models.
Figure 9 presents the detection results of the two models for silkworm instances. In scenarios with relatively dense silkworm distributions, YOLOv5s exhibits misidentification issues. For example, there are cases where the correct label category should be '1' or '0', but YOLOv5s produces incorrect detections. In contrast, RDM-YOLO shows better performance: it identifies silkworm targets more accurately and reduces such errors. By leveraging techniques such as hierarchical multi-scale feature fusion (Res2Net blocks) and an appropriate loss function (MPDIoU), RDM-YOLO enhances the precision of category detection and bounding box regression, effectively dealing with the challenges posed by dense distributions and potential ambiguities in silkworm image detection, and outperforming YOLOv5s in the accuracy of silkworm target detection and localization. In handling motion blur resulting from rapid silkworm movements, DSConv's dynamic sparsity mechanism in RDM-YOLO selectively prioritizes salient features and suppresses noise, elevating precision from 96.3% to 97.7% in motion-blurred frames. Collectively, these improvements demonstrate that RDM-YOLO overcomes YOLOv5s' limitations in high-density, occluded, and dynamic scenarios, establishing a new benchmark for accuracy–efficiency trade-offs in agricultural vision systems and positioning it as a reliable tool for automated sericulture monitoring under real-world constraints.

4.3. Comparative Experiment

To validate the enhanced performance of the proposed algorithm, comparative experiments with other recent algorithms were conducted. To ensure a fair comparison, this study employs precision, mAP@0.5, mAP@0.5:0.95, parameters, and FPS as key evaluation indicators. The comparative results in Table 5 reveal RDM-YOLO's superiority over other YOLO versions, excelling in key performance metrics. These advancements, spanning detection accuracy, computational efficiency, and real-time applicability, demonstrate the model's technical improvements. In terms of mAP@0.5 and mAP@0.5:0.95, RDM-YOLO achieves state-of-the-art values of 99% and 92.1%, respectively, surpassing classical models, such as YOLOv3-tiny [46,47], YOLOv5s, YOLOv6s [48,49], and YOLOv7-tiny [50,51], and even the latest versions, like YOLOv8n [52,53], YOLOv9t [54], and YOLOv11s [55], which highlights its superior localization accuracy in object detection tasks.
Furthermore, RDM-YOLO excels in precision, achieving 97.7%, the highest among all listed models. It outperforms YOLOv8n, YOLOv11s, and YOLOv5s, demonstrating exceptional detection reliability. Notably, despite having a moderate parameter size of 5.4 M, lower than YOLOv3-tiny and YOLOv11s, RDM-YOLO maintains a remarkable balance between model complexity and performance.
Figure 10 presents a 3D performance matrix comparing precision and mAP@0.5 across YOLO detection algorithms, integrating parameter sizes, multi-dimensional metrics, and FPS rankings sorted in descending order. As shown in the figure, RDM-YOLO achieves the highest FPS of 150, significantly exceeding YOLOv6s, YOLOv8n, and YOLOv8-ghost [56]. This makes it the fastest model in terms of real-time inference while retaining top-tier accuracy. Compared to the high-speed YOLOv9t and YOLOv11s, RDM-YOLO still holds a slight edge, further solidifying its suitability for deployment in silkworm breeding environments.
In addition to the comparisons with YOLO series models, a further experiment comparing RDM-YOLO with a state-of-the-art transformer architecture was conducted, taking MobileViT as an example [57]. Transformer architectures have shown remarkable capabilities in various computer vision tasks due to their self-attention mechanisms, which can effectively capture long-range dependencies. MobileViT, in particular, is designed to strike a good balance between computational efficiency and performance, making it a strong competitor in the domain of lightweight and high-performing vision models.
The results of this experiment, illustrated in Figure 11, reveal that RDM-YOLO achieved an mAP@0.5 of 99%, while MobileViT reached 96.8%. When considering the more stringent mAP@0.5:0.95, RDM-YOLO still outperformed MobileViT with a score of 92.1%, compared to MobileViT's 89.8%. This indicates that RDM-YOLO has a superior ability to accurately detect and localize objects across a wide range of IoU values. In terms of FPS, on a standard GPU, RDM-YOLO achieved 150 FPS, whereas MobileViT achieved 135 FPS, showing that RDM-YOLO has an edge in real-time object detection. Regarding the parameter count, MobileViT is relatively lightweight with 5.6 M parameters; however, RDM-YOLO, despite its better performance in accuracy and speed, has only 5.4 M parameters, suggesting that it offers a highly efficient trade-off between complexity and performance.
By combining the highest mAP@0.5, mAP@0.5:0.95, precision, and FPS metrics, RDM-YOLO establishes itself as a robust and efficient solution for real-time detection, particularly benefitting precision agriculture applications that require instantaneous decision-making. Compared to existing architectures, its efficiency makes it ideal for deployment on edge-computing devices in smart farming systems.

5. Conclusions

To address the challenges of deploying fourth instar silkworm behavior detection models on mobile devices, this study presents RDM-YOLO, a lightweight architecture. Experimental results demonstrate that RDM-YOLO achieves a 24% reduction in parameter count compared to YOLOv5s, while enhancing mAP@0.5 by 1%. These findings validate the model’s improved detection accuracy and lightweight design.
The proposed RDM-YOLO paves the way for precision sericulture by enabling real-time monitoring of silkworm health, allowing farmers to dynamically adjust environmental parameters (e.g., temperature and humidity) to optimize cocoon quality and silk yield. Furthermore, the model’s modular architecture enhances its scalability: its lightweight design and adaptive feature extraction mechanisms can be extended to monitor diverse insect behaviors, such as honeybee pollination efficiency or locust swarm dynamics, thereby expanding its applicability to entomological research and agricultural pest management.
While this work demonstrates the technical feasibility of automated behavior analysis, it does not yet fully characterize the biological significance of observed behavioral patterns. This limitation arises from the study’s focus on algorithm development and validation within controlled rearing environments. Subsequent work will employ transcriptomic analysis (e.g., RNA-seq) to link specific behavioral anomalies to gene expression profiles, identifying physiological pathways underlying observed behaviors. Moreover, longitudinal studies across multiple rearing seasons will be conducted to correlate behavioral metrics with disease incidence, guided by sericulturists’ domain knowledge.
Future studies will incorporate infrared and multispectral imaging. Moreover, static image analysis limits the understanding of behavioral sequences. The current model does not incorporate temporal dimension information. In future work, LSTM or 3D convolutional layers will be integrated to develop a video-level behavior sequence analysis model. Last but not least, integrating transformer modules could enable activity prediction. In conclusion, RDM-YOLO bridges the gap between academic research and industrial applications, offering a scalable solution for modernizing sericulture through AI-driven automation.

Author Contributions

J.G.: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing—original draft, Writing—review and editing, visualization. J.S.: Supervision. X.W.: Data curation. C.D.: Resources. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study because it involved only non-invasive behavioral recognition of silkworms, which is not subject to institutional ethical review requirements for animal research.

Data Availability Statement

Our data are unavailable due to privacy or ethical restrictions.

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
YOLO: You Only Look Once
DSConv: Distribution shifting convolution
MPDIoU: Minimum point distance intersection over union
AP: Average precision
mAP: Mean average precision
CNNs: Convolutional neural networks
FPS: Frames per second
IoU: Intersection over union
TP: True positive
FP: False positive
FN: False negative
Adam: Adaptive moment estimation

References

  1. Xu, H.; Pan, J.; Ma, C.; Mintah, B.; Dabbour, M.; Huang, L.; Dai, C.; Ma, H.; He, R. Stereo-hindrance effect and oxidation cross-linking induced by ultrasound-assisted sodium alginate-glycation inhibit lysinoalanine formation in silkworm pupa protein. Food Chem. 2025, 463, 141284. [Google Scholar] [CrossRef]
  2. Xu, H.; Pan, J.; Dabbour, M.; Mintah, B.; Chen, W.; Yang, F.; Zhang, Z.; Cheng, Y.; Dai, C.; He, R. Synergistic effects of pH shift and heat treatment on solubility, physicochemical and structural properties, and lysinoalanine formation in silkworm pupa protein isolates. Food Res. Int. 2023, 165, 112554. [Google Scholar] [CrossRef] [PubMed]
  3. Chen, C.; Liu, X.; Liu, C.; Pan, Q. Development of the precision feeding system for sows via a rule-based expert system. Int. J. Agric. Biol. Eng. 2023, 16, 187–198. [Google Scholar] [CrossRef]
  4. Yuan, H.; Cai, Y.; Liang, S.; Ku, J.; Qin, Y. Numerical Simulation and Analysis of Feeding Uniformity of Viscous Miscellaneous Fish Bait Based on EDEM Software. Agriculture 2023, 13, 356. [Google Scholar] [CrossRef]
  5. Zhao, Z.; Jin, M.; Tian, C.; Yang, S. Prediction of seed distribution in rectangular vibrating tray using grey model and artificial neural network. Biosyst. Eng. 2018, 175, 194–205. [Google Scholar] [CrossRef]
  6. Yang, N.; Yuan, M.; Wang, P.; Zhang, R.; Sun, J.; Mao, H. Tea diseases detection based on fast infrared thermal image processing technology. J. Sci. Food Agric. 2019, 99, 3459–3466. [Google Scholar] [CrossRef] [PubMed]
  7. Yang, Y.; Xu, C.; Hou, W.; McElligott, A.; Liu, K.; Xue, Y. Transformer-based audio-visual multimodal fusion for fine-grained recognition of individual sow nursing behaviour. Artif. Intell. Agric. 2025, 15, 363–376. [Google Scholar] [CrossRef]
  8. Sun, G.; Liu, T.; Zhang, H.; Tan, B.; Li, T. Basic behavior recognition of yaks based on improved SlowFast network. Ecol. Inform. 2023, 78, 102313. [Google Scholar] [CrossRef]
  9. Kirsch, K.; Strutzke, S.; Klitzing, L.; Pilger, F.; Thöne-Reineke, C.; Hoffmann, G. Validation of a Time-Distributed residual LSTM–CNN and BiLSTM for equine behavior recognition using collar-worn sensors. Comput. Electron. Agric. 2025, 231, 109999. [Google Scholar] [CrossRef]
  10. Zhu, C.; Hao, S.; Liu, C.; Wang, Y.; Jia, X.; Xu, J.; Guo, S.; Huo, J.; Wang, W. An Efficient Computer Vision-Based Dual-Face Target Precision Variable Spraying Robotic System for Foliar Fertilisers. Agronomy 2024, 14, 2770. [Google Scholar] [CrossRef]
  11. Niu, Z.; Huang, T.; Xu, C.; Sun, X.; Taha, M.; He, Y.; Qiu, Z. A Novel Approach to Optimize Key Limitations of Azure Kinect DK for Efficient and Precise Leaf Area Measurement. Agriculture 2025, 15, 173. [Google Scholar] [CrossRef]
  12. Huang, L.; Xu, L.; Wang, Y.; Peng, Y.; Zou, Z.; Huang, P.; Ahmad, M. Efficient Detection Method of Pig-Posture Behavior Based on Multiple Attention Mechanism. Comput. Intell. Neurosci. 2022, 2022, 1759542. [Google Scholar] [CrossRef] [PubMed]
  13. Huang, W.; Zhu, W.; Ma, C.; Guo, Y.; Chen, C. Identification of group-housed pigs based on Gabor and Local Binary Pattern features. Biosyst. Eng. 2018, 166, 90–100. [Google Scholar] [CrossRef]
  14. Wang, Z.; Hua, Z.; Wen, Y.; Zhang, S.; Xu, X.; Song, H. E-YOLO: Recognition of estrus cow based on improved YOLOv8n model. Expert Syst. Appl. 2024, 238, 122212. [Google Scholar] [CrossRef]
  15. Zhao, Y.; Zhang, X.; Sun, J.; Yu, T.; Cai, Z.; Zhang, Z.; Mao, H. Low-Cost Lettuce Height Measurement Based on Depth Vision and Lightweight Instance Segmentation Model. Agriculture 2024, 14, 1596. [Google Scholar] [CrossRef]
  16. Zhang, T.; Zhou, J.; Liu, W.; Yue, R.; Shi, J.; Zhou, C.; Hu, J. SN-CNN: A Lightweight and Accurate Line Extraction Algorithm for Seedling Navigation in Ridge-Planted Vegetables. Agriculture 2024, 14, 1446. [Google Scholar] [CrossRef]
  17. Jiang, L.; Wang, Y.; Wu, C.; Wu, H. Fruit Distribution Density Estimation in YOLO-Detected Strawberry Images: A Kernel Density and Nearest Neighbor Analysis Approach. Agriculture 2024, 14, 1848. [Google Scholar] [CrossRef]
  18. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2022, 52, 8574–8586. [Google Scholar] [CrossRef]
  19. Romano, D. Novel automation, artificial intelligence, and biomimetic engineering advancements for insect studies and management. Curr. Opin. Insect Sci. 2025, 68, 101337. [Google Scholar] [CrossRef]
  20. Kariyanna, B.; Sowjanya, M. Unravelling the use of artificial intelligence in management of insect pests. Smart Agric. Technol. 2024, 8, 100517. [Google Scholar] [CrossRef]
  21. Wang, J.; Wei, T.; Song, Z.; Chen, R.; He, Q. Determination of the equivalent length for evaluating local head losses in drip irrigation laterals. Appl. Eng. Agric. 2022, 38, 49–59. [Google Scholar] [CrossRef]
  22. Jin, M.; Zhao, Z.; Che, S.; Chen, J. Improved piezoelectric grain cleaning loss sensor based on adaptive neuro-fuzzy inference system. Precis. Agric. 2022, 23, 1174–1188. [Google Scholar] [CrossRef]
  23. Fan, X.; Ding, W.; Qin, W.; Xiao, D.; Min, L.; Yuan, H. Fusing Self-Attention and CoordConv to Improve the YOLOv5s Algorithm for Infrared Weak Target Detection. Sensors 2023, 23, 6755. [Google Scholar] [CrossRef] [PubMed]
  24. Wen, C.; Wen, J.; Li, J.; Luo, Y.; Chen, M.; Xiao, Z.; Xu, Q.; Liang, X.; An, H. Lightweight silkworm recognition based on Multi-scale feature fusion. Comput. Electron. Agric. 2022, 200, 107234. [Google Scholar] [CrossRef]
  25. Pei, H.; Sun, Y.; Huang, H.; Zhang, W.; Sheng, J.; Zhang, Z. Weed Detection in Maize Fields by UAV Images Based on Crop Row Preprocessing and Improved YOLOv4. Agriculture 2024, 12, 975. [Google Scholar] [CrossRef]
  26. Zhang, F.; Chen, Z.; Ali, S.; Yang, N.; Fu, S.; Zhang, Y. Multi-class detection of cherry tomatoes using improved YOLOv4-Tiny. Int. J. Agric. Biol. Eng. 2023, 16, 225–231. [Google Scholar] [CrossRef]
  27. Ji, W.; Gao, X.; Xu, B.; Pan, Y.; Zhang, Z.; Zhao, D. Apple target recognition method in complex environment based on improved YOLOv4. J. Food Process Eng. 2021, 44, e13866. [Google Scholar] [CrossRef]
  28. Zhang, Z.; Lu, Y.; Zhao, Y.; Pan, Q.; Jin, K.; Xu, G.; Hu, Y. TS-YOLO: An All-Day and Lightweight Tea Canopy Shoots Detection Model. Agronomy 2023, 13, 1411. [Google Scholar] [CrossRef]
  29. Wang, Q.; Qin, W.; Liu, M.; Zhao, J.; Zhu, Q.; Yin, Y. Semantic Segmentation Model-Based Boundary Line Recognition Method for Wheat Harvesting. Agriculture 2024, 14, 1846. [Google Scholar] [CrossRef]
  30. Tao, D.; Qiu, G.; Li, G. A novel model for sex discrimination of silkworm pupae from different species. IEEE Access 2019, 7, 165328–165335. [Google Scholar] [CrossRef]
  31. Liu, J.; Abbas, I.; Noor, R. Development of Deep Learning-Based Variable Rate Agrochemical Spraying System for Targeted Weeds Control in Strawberry Crop. Agronomy 2021, 11, 1480. [Google Scholar] [CrossRef]
  32. Peng, Y.; Zhao, S.; Liu, J. Fused Deep Features-Based Grape Varieties Identification Using Support Vector Machine. Agriculture 2021, 11, 869. [Google Scholar] [CrossRef]
  33. Xiong, H.; Cai, J.; Zhang, W.; Hu, J.; Deng, Y.; Miao, J.; Tan, Z.; Li, H.; Cao, J.; Wu, X. Deep learning enhanced terahertz imaging of silkworm eggs development. iScience 2021, 24, 103316. [Google Scholar] [CrossRef]
  34. Xu, B.; Cui, X.; Ji, W.; Yuan, H.; Wang, J. Apple Grading Method Design and Implementation for Automatic Grader Based on Improved YOLOv5. Agriculture 2023, 13, 124. [Google Scholar] [CrossRef]
  35. Tao, T.; Wei, H. STBNA-YOLOv5: An Improved YOLOv5 Network for Weed Detection in Rapeseed Field. Agriculture 2025, 15, 22. [Google Scholar] [CrossRef]
  36. Gao, S.; Cheng, M.; Zhao, K.; Zhang, X.; Yang, M.; Torr, P. Res2Net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef]
  37. Zhang, Z.; Mamat, H.; Xu, X.; Aysa, A.; Ubul, K. FAS-Res2net: An Improved Res2net-Based Script Identification Method for Natural Scenes. Appl. Sci. 2023, 13, 4434. [Google Scholar] [CrossRef]
  38. Ju, H.; Fang, Y.; Yang, H.; Si, F.; Kang, K. Improved Lightweight YOLOv8 With DSConv and Reparameterization for Continuous Casting Slab Detection on Embedded Device. IEEE Trans. Instrum. Meas. 2024, 74, 5003712. [Google Scholar] [CrossRef]
  39. Ou, J.; Shen, Y. Underwater Target Detection Based on Improved YOLOv7 Algorithm With BiFusion Neck Structure and MPDIoU Loss Function. IEEE Access 2024, 12, 105165–105177. [Google Scholar] [CrossRef]
  40. Duan, Y.; Han, W.; Guo, P.; Wei, X. YOLOv8-GDCI: Research on the Phytophthora Blight Detection Method of Different Parts of Chili Based on Improved YOLOv8 Model. Agronomy 2024, 14, 2734. [Google Scholar] [CrossRef]
  41. Wan, J.; Xue, F.; Shen, Y.; Song, H.; Shi, P.; Qin, Y.; Yang, T.; Wang, Q. Automatic segmentation of urban flood extent in video image with DSS-YOLOv8n. J. Hydrol. 2025, 655, 132974. [Google Scholar] [CrossRef]
  42. do Nascimento, M.; Fawcett, R.; Prisacariu, V. DSConv: Efficient Convolution Operator. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5147–5156. [Google Scholar] [CrossRef]
  43. Mohammad, S.; Razak, M.; Rahman, A. 3D-DIoU: 3D Distance Intersection over Union for Multi-Object Tracking in Point Cloud. Sensors 2023, 23, 3390. [Google Scholar] [CrossRef]
  44. Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662. [Google Scholar] [CrossRef]
  45. Cho, Y. Weighted Intersection over Union (wIoU) for evaluating image segmentation. Pattern Recognit. Lett. 2024, 185, 101–107. [Google Scholar] [CrossRef]
  46. Yuan, H.; Liu, G.; Wang, Y. Defect detection of small targets on fabric surface based on improved YOLOv3-tiny. Manuf. Autom. 2022, 44, 172–176. [Google Scholar]
  47. Fu, L.; Feng, Y.; Wu, J.; Liu, Z.; Gao, F.; Majeed, Y.; Al-Mallahi, A.; Zhang, Q.; Li, R.; Cui, Y. Fast and accurate detection of kiwifruit in orchard using improved YOLOv3-tiny model. Precis. Agric. 2021, 22, 754–773. [Google Scholar] [CrossRef]
  48. Chowdhury, A.; Said, W.; Saruchi, S. Oil Palm Fresh Fruit Branch Ripeness Detection Using YOLOV6 Algorithm. In Intelligent Manufacturing and Mechatronics; Springer Nature: Singapore, 2024; pp. 187–202. [Google Scholar] [CrossRef]
  49. Wu, E.; Ma, R.; Dong, D.; Zhao, X. D-YOLO: A Lightweight Model for Strawberry Health Detection. Agriculture 2025, 15, 570. [Google Scholar] [CrossRef]
  50. Al Amoud, I.; Ramli, D. YOLOv7-Tiny and YOLOv8n Evaluation for Face Detection. In Proceedings of the 12th International Conference on Robotics, Vision, Signal Processing and Power Applications; Springer Nature: Singapore, 2024; pp. 477–483. [Google Scholar] [CrossRef]
  51. Kumar, A.; Kumar, A.; Jayakody, D. Ambiguous facial expression detection for Autism Screening using enhanced YOLOv7-tiny model. Sci. Rep. 2024, 14, 12241. [Google Scholar] [CrossRef]
  52. Yang, H.; Jiang, H.; Zheng, H.; Cheng, X.; Hu, J.; Du, Y.; Jiang, Z. HE-Yolov8n: An innovative and efficient method for detecting defects in Lithium battery shells based on Yolov8n. Nondestruct. Test. Eval. 2024. [Google Scholar] [CrossRef]
  53. Song, Y.; Wu, Z.; Zhang, S.; Quan, W.; Shi, Y.; Xiong, X.; Li, P. Estimation of Artificial Reef Pose Based on Deep Learning. J. Mar. Sci. Eng. 2024, 12, 812. [Google Scholar] [CrossRef]
  54. Chen, X.; Lin, C. EVMNet: Eagle visual mechanism-inspired lightweight network for small object detection in UAV aerial images. Digit. Signal Process. 2024, 158, 104957. [Google Scholar] [CrossRef]
  55. Liao, Y.; Li, L.; Xiao, H.; Xu, F.; Shan, B.; Yin, H. YOLO-MECD: Citrus Detection Algorithm Based on YOLOv11. Agronomy 2025, 15, 687. [Google Scholar] [CrossRef]
  56. Lan, Y.; Lv, Y.; Xu, J.; Zhang, Y.; Zhang, Y. Breast mass lesion area detection method based on an improved YOLOv8 model. Electron. Res. Arch. 2024, 32, 5846–5867. [Google Scholar] [CrossRef]
  57. Cao, Y.; Yin, Z.; Duan, Y.; Cao, R.; Hu, G.; Liu, Z. Research on improved sound recognition model for oestrus detection in sows. Comput. Electron. Agric. 2025, 231, 109975. [Google Scholar] [CrossRef]
Figure 1. Overall workflow of the proposed methodology.
Figure 2. Detailed architecture of the proposed RDM-YOLO.
Figure 3. Architecture of the bottleneck block and Res2Net module. (a) is the architecture of the bottleneck block; (b) is the architecture of the Res2Net module.
Figure 4. General idea of DSConv. The symbol ⊙ represents the Hadamard operator.
Figure 5. Video data acquisition of fourth instar silkworm behavior.
Figure 6. Visualization of dataset annotations and class distributions.
Figure 7. Loss curves of YOLOv5s.
Figure 8. Model efficiency and inference speed trade-offs.
Figure 9. Comparison of detection results between the baseline and proposed models.
Figure 10. Multi-model performance comparison of YOLO detection algorithms.
Figure 11. Comparative analysis of RDM-YOLO and MobileViT in detection accuracy, inference speed, and model complexity metrics.
Table 1. Performance comparison of bounding box regression losses.

Loss Function | Model | mAP@0.5 | Convergence Epochs | Overlap Robustness
CIoU | YOLOv5s-CIoU | 96.8% | 150 | Moderate
DIoU | YOLOv5s-DIoU | 97.1% | 145 | Moderate
MPDIoU | YOLOv5s-MPDIoU | 98.5% | 128 | High
Table 2. Environmental conditions and dietary configuration for rearing silkworms.

Environmental Conditions | Value
Temperature | 25 °C ± 1 °C
Relative humidity | 80% ± 5%
Light–dark cycle | 16:8
Diet | Fresh mulberry leaves derived from young shoots (10–15 cm in length) with tender leaves
Table 3. Detailed implementation parameters.

Parameter Category | Specific Parameter | Value
Software environment | Python version | 3.9
Software environment | PyTorch version | 1.11.0
Software environment | CUDA version | 11.4
Key dependencies | OpenCV version | 4.7.0
Key dependencies | NumPy version | 1.23.5
Training hyperparameters | Batch size | 16
Training hyperparameters | Training epochs | 150
Training hyperparameters | Optimizer | Adam
Training hyperparameters | Initial learning rate | 0.01
Training hyperparameters | Momentum | 0.937
Training hyperparameters | Weight decay | 0.0005
Hardware configuration | CPU | Intel Core i7-9750H
Hardware configuration | RAM | 64 GB
Hardware configuration | GPU | NVIDIA GeForce RTX 3080 Ti
Table 4. Results of all options replacing modules in YOLOv5s with Res2Net, DSConv, and MPDIoU; ↑ and ↓, respectively, represent the changes in metrics compared with the baseline YOLOv5s model.

Experiment | Model | mAP@0.5 | Parameters | FPS
1 | YOLOv5s | 98.0% | 6.7 M | 135 f·s−1
2 | YOLOv5s+Res2Net | 98.1% ↑ | 6.4 M ↓ | 138 f·s−1
3 | YOLOv5s+DSConv | 98.6% ↑ | 5.7 M ↓ | 144 f·s−1
4 | YOLOv5s+MPDIoU | 98.5% ↑ | 6.7 M | 132 f·s−1
5 | RDM-YOLO | 99.0% ↑ | 5.4 M ↓ | 150 f·s−1
Table 5. Comparison experiment.

Model | Parameters | Precision | mAP@0.5 | mAP@0.5:0.95 | FPS
YOLOv3-tiny | 8.7 M | 85.9% | 85.8% | 77.8% | 140 f·s−1
YOLOv5s | 6.7 M | 96.3% | 98.0% | 90.1% | 135 f·s−1
YOLOv6s | 4.1 M | 95.3% | 98.9% | 91.2% | 105 f·s−1
YOLOv7-tiny | 6.1 M | 92.0% | 97.2% | 89.5% | 118 f·s−1
YOLOv8n | 2.9 M | 97.6% | 98.9% | 90.9% | 114 f·s−1
YOLOv8-ghost | 1.6 M | 95.7% | 97.9% | 88.9% | 120 f·s−1
YOLOv9t | 2.6 M | 96.6% | 98.8% | 91.1% | 147 f·s−1
YOLOv11s | 9.4 M | 97.5% | 98.3% | 90.6% | 148 f·s−1
RDM-YOLO | 5.4 M | 97.7% | 99.0% | 92.1% | 150 f·s−1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
