3.1. Training Performance and Ablation Analysis of YOLO-PEDI
We next examined the training dynamics of YOLO-PEDI to assess how the proposed architecture learned to localize and identify pigeon eggs during optimization. To this end, training loss and mAP@50 were monitored throughout training, as these two metrics capture complementary aspects of model behavior, including bounding-box regression, feature discrimination and convergence stability. Together, they provide a direct readout of convergence speed, parameter efficiency and prediction consistency.
As shown in Figure 4A, the training-loss curve showed a rapid decline at the beginning of training, indicating that YOLO-PEDI quickly captured the basic geometric characteristics of pigeon eggs and established an effective localization representation. This early convergence suggests that replacing conventional convolution with Ghost modules preserved the structural cues required for bounding-box regression while reducing redundant computation. As training proceeded, the rate of decline became more gradual and the loss curve entered a stable optimization phase, consistent with progressive refinement of target localization under complex visual conditions. This transition likely reflects the contribution of CBAM, which enhanced the model’s ability to suppress interference from metal cage meshes and feather backgrounds while preserving informative edge features. In the later stage of training, the training loss converged to a low level and remained stable without obvious rebound, indicating that the lightweight architecture retained strong regression capability despite substantial model compression. A similarly favorable pattern was observed for mAP@50. Detection accuracy increased sharply during the early stage of training, reaching 85.0% within the first few training cycles, which points to efficient feature learning from the outset. With continued optimization, mAP@50 increased steadily, exceeded 95.0% as training progressed, and finally converged at 98.1%. The stability of the proposed YOLO-PEDI model is further supported by the smooth convergence of the mAP@50 curve shown in Figure 4B, which reached a steady plateau after approximately 30 epochs, with only minor oscillations of less than 0.5%. This stable convergence behavior indicates that the reported performance was obtained from a well-converged training process on the curated dataset. Nevertheless, repeated training with multiple random seeds and formal confidence-interval estimation would be required in future work to more rigorously quantify the statistical significance of small performance gains.
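The plateau criterion described above (oscillations below 0.5% after a given epoch) can be checked directly on a logged mAP@50 curve. The following is a minimal sketch, assuming the curve is available as a plain list of per-epoch percentages; the function name `plateau_start` and the sample values are hypothetical and are not taken from the training logs of this study.

```python
def plateau_start(values, tol):
    """Return the earliest index i such that max(values[i:]) - min(values[i:]) < tol.

    Returns None for an empty curve. The scan runs backwards and keeps the
    running maximum and minimum of the suffix seen so far.
    """
    start = None
    hi = lo = None
    for i in range(len(values) - 1, -1, -1):
        hi = values[i] if hi is None else max(hi, values[i])
        lo = values[i] if lo is None else min(lo, values[i])
        if hi - lo < tol:
            start = i
        else:
            break  # any earlier start would include this wider spread
    return start


if __name__ == "__main__":
    # Hypothetical mAP@50 values (percent), one per logged epoch.
    curve = [50.0, 85.0, 95.0, 97.9, 98.1, 98.0, 98.1]
    print(plateau_start(curve, tol=0.5))
```

Applied to the real per-epoch log, the returned index would give the epoch from which the curve stays within the stated tolerance.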
Notably, the curve remained smooth during the middle and late stages of training, with little evidence of the oscillatory behavior often observed in more heavily parameterized models. This stable high-level convergence indicates that the combination of Ghost-based lightweight feature generation and CBAM-guided feature refinement was sufficient to preserve the diversity and discriminative power of the learned representations, even under complex farm backgrounds. Taken together, these results show that YOLO-PEDI achieved rapid early convergence, stable subsequent optimization and sustained high detection accuracy. The training behavior further suggests that the proposed architectural modifications improved computational efficiency without compromising localization precision or feature discrimination. This combination of efficiency and stability supports the suitability of YOLO-PEDI for real-time pigeon egg inspection in practical farming environments.
To further validate this efficiency-accuracy balance in a broader context, our model was compared against several state-of-the-art architectures. As illustrated in Figure 4C, the position of YOLO-PEDI relative to the benchmark models highlights its favorable place on the accuracy-complexity frontier. Although the plot includes only a few representative architectures, these data points were selected to bracket the performance limits of both conventional lightweight CNNs and high-capacity transformer-based detectors. This visualization provides the context needed to show that YOLO-PEDI achieves a competitive mAP@50 while occupying an ultra-low computational region (4.0 GFLOPs), effectively bridging the gap between the high-precision requirements of egg inspection and the strict latency constraints of mobile robotic platforms.
We next performed ablation experiments to quantify the contribution of each architectural modification and to clarify how lightweight design and attention refinement jointly affected detection performance. Starting from the original YOLOv8n as the baseline, the Ghost module and CBAM were introduced sequentially under a controlled setting and evaluated on the same pigeon egg dataset. This design allowed the effects of parameter reduction and feature enhancement to be assessed independently and in combination, as summarized in Table 1.
The architectural evolution from YOLOv8n to YOLO-PEDI reflects a strategic balance between computational efficiency and feature representation capacity. The introduction of Ghost modules is theoretically motivated by the need to reduce redundant feature computations, resulting in a 50.8% decrease in parameter count compared with standard convolutional structures. However, such aggressive lightweight compression may inevitably weaken the representation of fine-grained spatial details, which are critical for detecting small pigeon eggs under cage-mesh interference. To alleviate this limitation, the CBAM block was introduced as an attention-guided semantic refinement module to re-weight feature importance across both channel and spatial dimensions. In this design, Ghost modules mainly provide structural compression, whereas CBAM contributes to feature restoration and target-focused representation. In addition, the loss-design strategy further improves optimization balance by enhancing the model’s robustness to small targets, partial occlusion, and complex backgrounds. This “compression–restoration–balancing” mechanism enables YOLO-PEDI to maintain discriminative power while substantially reducing computational cost, thereby achieving a favorable balance between real-time latency and detection accuracy for rail-mounted agricultural inspection robots.
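For reference, the channel and spatial re-weighting performed by CBAM can be written in its standard general form. This is the usual formulation of the module rather than a configuration confirmed for YOLO-PEDI; the 7×7 kernel in the spatial branch is the common default and is an assumption here.

```latex
\begin{aligned}
M_c(F)  &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big), & F'  &= M_c(F) \otimes F,\\
M_s(F') &= \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F');\ \mathrm{MaxPool}(F')])\big),          & F'' &= M_s(F') \otimes F'.
\end{aligned}
```

Here F is the input feature map, M_c is the channel attention map produced by a shared MLP over globally pooled descriptors, M_s is the spatial attention map produced by a convolution over channel-wise pooled maps, σ is the sigmoid function, and ⊗ denotes element-wise multiplication with broadcasting.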
The results reveal a clear functional complementarity between the two modules. Introducing the Ghost module markedly reduced both parameter count and computational cost, with decreases of 50.8% and 53.1%, respectively, relative to the baseline configuration. These reductions confirm the effectiveness of Ghost in replacing redundant convolutional operations with lightweight linear feature generation. This gain in efficiency, however, was accompanied by a modest decrease in mAP@50, indicating that lightweight compression alone led to some loss of discriminative capacity. When CBAM was further incorporated, detection accuracy recovered to 98.1% with only a marginal increase of 0.05 M parameters. This result suggests that the attention mechanism effectively compensated for the representational loss introduced by compression by strengthening informative edge textures and suppressing background interference through joint channel and spatial weighting. The trend shown in Figure 4D further supports this interpretation. Ghost primarily acted as a lightweight structural backbone that removed computational redundancy, whereas CBAM functioned as a feature refinement module that restored fine-grained discrimination. Their combination enabled YOLO-PEDI to maintain high detection accuracy while preserving a compact architecture and fast inference. Together, these findings indicate that the proposed design achieved a favorable balance between accuracy and efficiency, reaching an inference time of 0.8 ms without sacrificing detection reliability.
3.2. The Performance of Different Base Detection Algorithms
To rigorously evaluate the competitiveness of YOLO-PEDI for large-scale pigeon farm inspection, we performed a cross-architecture comparison using representative detectors from both conventional and emerging paradigms. The comparison included Faster R-CNN with a ResNet-50 backbone as a two-stage benchmark, YOLOv5s and YOLOv8n as widely used one-stage industrial baselines, and DETR and RT-DETR-L as transformer-based detectors with global modeling capability. All models were trained and tested under the same dataset distribution and hardware conditions (NVIDIA RTX 4060 GPU with 8 GB VRAM), and the quantitative results are summarized in Table 2.
The comparative results reveal clear differences in the trade-off between accuracy and efficiency across architectures. Faster R-CNN retained the classical advantage of two-stage detectors in feature refinement but showed a pronounced efficiency bottleneck in dense pigeon egg detection. Its region-proposal-based design resulted in a computational cost of 180.2 GFLOPs and an inference time of 85 ms, which is difficult to accommodate in dynamic inspection scenarios. Under mobile robot operation, such latency can lead to delayed response, repeated counting, and increased risk of missed detections. By contrast, YOLO-PEDI adopted a one-stage end-to-end detection framework and substantially reduced inference time to 0.8 ms through lightweight architectural redesign, demonstrating a clear advantage for edge-side deployment in resource-constrained inspection systems.
Transformer-based detectors, particularly RT-DETR-L, achieved the highest accuracy among the compared models, reaching a mAP@50 of 99.2%. This result reflects the strength of global self-attention in capturing targets under dense and cluttered backgrounds. However, this gain in accuracy came at the expense of substantially higher model complexity and memory demand, with a parameter size of 32.0 M. Rather than pursuing the highest single-metric performance, YOLO-PEDI was designed to achieve a more balanced solution for practical deployment. By introducing CBAM, the model compensated for the limited global modeling ability of conventional convolutional networks through channel and spatial attention. As a result, YOLO-PEDI maintained a mAP@50 of 98.1% while using only a small fraction of the parameters required by RT-DETR-L, indicating that attention-guided lightweight CNNs remain highly competitive for this task.
A key strength of YOLO-PEDI lies in its efficient handling of feature redundancy. Conventional convolutional layers often generate highly similar feature responses, leading to unnecessary computational overhead. By replacing part of this redundant computation with the Ghost module, YOLO-PEDI generated representative intrinsic features using a reduced number of standard convolutions and then produced additional feature maps through inexpensive linear operations. This design reduced the computational load to 4.0 GFLOPs, which was only 49.3% of that of YOLOv8n. Although a slight decrease in mAP@50-95 was observed under this aggressive compression, the reduction was limited relative to the substantial savings in computational resources. Given that pigeon egg inspection primarily requires reliable counting and status recognition rather than extremely strict localization precision, this trade-off is acceptable and practically meaningful for smart farming applications.
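The source of the savings can be illustrated by counting weights for a single layer. The sketch below compares a standard convolution with a Ghost-style replacement in which a reduced convolution produces the intrinsic feature maps and a depthwise operation produces the remainder. The expansion ratio of 2, the 3×3 cheap kernel, and the layer shape are illustrative assumptions, not values reported for YOLO-PEDI, and biases and normalization layers are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_conv_params(c_in, c_out, k, ratio=2, cheap_k=3):
    """Weights in a Ghost-style replacement for the same layer (bias ignored).

    ratio and cheap_k are illustrative defaults (assumptions), and this
    sketch requires c_out to be divisible by ratio.
    """
    if c_out % ratio != 0:
        raise ValueError("c_out must be divisible by ratio in this sketch")
    intrinsic = c_out // ratio
    # Primary convolution: produces only the intrinsic feature maps.
    primary = c_in * intrinsic * k * k
    # Cheap depthwise operations: one cheap_k x cheap_k filter per extra map.
    cheap = intrinsic * (ratio - 1) * cheap_k * cheap_k
    return primary + cheap


if __name__ == "__main__":
    c_in, c_out, k = 128, 256, 3  # hypothetical layer shape
    full = standard_conv_params(c_in, c_out, k)
    ghost = ghost_conv_params(c_in, c_out, k)
    print(f"standard: {full}, ghost: {ghost}, fraction kept: {ghost / full:.3f}")
```

With a ratio of 2, such a layer keeps roughly half of its weights. The whole-model reduction also depends on which layers are replaced, so this per-layer count should be read only as an order-of-magnitude illustration.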
The overall comparison further confirms this conclusion. As shown in Figure 4C, YOLO-PEDI remained comparable to mainstream benchmark models in detection accuracy, while showing clear advantages in model lightweighting and inference efficiency. This asymmetric performance profile is particularly valuable in real deployment scenarios, where hardware cost, power consumption and real-time response are all critical constraints. The compact design of YOLO-PEDI makes it suitable for deployment on low-cost embedded platforms, while its high processing speed can increase inspection coverage within a fixed patrol cycle. Taken together, these results show that YOLO-PEDI achieves a favorable balance between accuracy and speed in automatic pigeon egg detection and provides an effective technical solution for edge-side vision systems in future large-scale smart farms.
Furthermore, to address the limitation of benchmarking on a desktop GPU, the YOLO-PEDI model was exported to ONNX format for cross-platform profiling. The exported model has a compact footprint of 3.1 MB and a computational complexity of 4.0 GFLOPs, corresponding to only 49.3% of the computational cost of the baseline YOLOv8n. These hardware-agnostic metrics indicate that the proposed architecture is substantially lightweight and structurally suitable for resource-constrained deployment scenarios. However, the reported 0.8 ms inference time should be interpreted as the latency measured on the RTX 4060 desktop GPU used in this study, rather than as the actual latency on embedded edge devices. Direct profiling on target platforms such as Jetson Nano, Jetson Orin Nano, Raspberry Pi, or the final robot controller remains necessary to fully quantify real-world edge-device performance. Future work will therefore include embedded-platform benchmarking under practical robot operating conditions to more rigorously evaluate deployment efficiency.
3.3. Grad-CAM-Based Visualization of Pigeon Egg Localization
We employed Grad-CAM to visualize the feature activation patterns of YOLO-PEDI and to assess its ability to focus on target regions under complex breeding conditions. Representative activation maps are shown in Figure 5. In panoramic pigeon-house images, eggs occupy only a small portion of the field of view and are frequently affected by cage-mesh interference, background clutter, and illumination variation. Under these challenging conditions, YOLO-PEDI generated compact and spatially concentrated activation maps. The strongest responses were mainly aligned with the central regions and elliptical contours of the eggs, indicating that the model captured local features that are relevant for precise detection.
To further examine the interpretability gained from the architectural refinements, we conducted a comparative Grad-CAM analysis between the baseline model and YOLO-PEDI. As shown in Figure 5, the baseline model produced relatively diffuse activation patterns, with noticeable responses on non-target structural elements such as repetitive metal-mesh lines, cage edges, and background padding. This suggests that the baseline network was more susceptible to feature distraction caused by high-frequency farm artifacts.
After the integration of Ghost modules and CBAM, the activation pattern showed a clear transition from diffused background response to more localized target-focused attention. The high-response regions became more concentrated around the biological contours of pigeon eggs, while responses from shaded areas, metal cage edges, and other non-target regions were reduced. This comparison provides qualitative evidence that the attention-refinement mechanism helps suppress spurious background activations and improves the model’s focus on target-relevant features. Nevertheless, the present Grad-CAM analysis remains qualitative, and quantitative evaluation of activation overlap with ground-truth masks was not conducted in this study. Future work will introduce quantitative interpretability metrics, such as activation–mask overlap ratios or pointing-game analysis, to further evaluate the localization consistency of model attention.
3.4. ByteTrack-Based Continuous Counting of Pigeon Eggs
Although ByteTrack effectively handles short-term occlusion during continuous inspection, its trajectory memory can also introduce systematic errors in high-density cage environments. During long patrols, motion-prediction inertia may cause target IDs from one cage to drift into adjacent cages, leading to cumulative counting errors across cages. To overcome this limitation, we introduced a QR-code-forced threshold mechanism as a spatial anchor for trajectory reset and cage-level isolation.
The mechanism uses a front-end visual unit mounted on the inspection robot to identify cage QR codes in real time. Once a new cage code enters the predefined recognition region, all cached trajectories in the current ByteTrack tracker are immediately cleared and the counting results of the current cage are archived. In this way, each cage is treated as an independent spatiotemporal unit, thereby preventing cross-cage error accumulation. Quantitative evaluation showed that, before this mechanism was introduced, target misassociation produced an average cumulative counting error of approximately 15.0% per 50 cages. After QR-code-based forced thresholding was applied, cage attribution accuracy increased from 85.0% to 99.2%. These results indicate that the proposed strategy effectively eliminated counting bias caused by long-range trajectory drift and provided a reliable basis for cage-level digital management under a “one cage, one record” framework.
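The reset logic described above can be sketched as a small bookkeeping class that sits between the tracker and the production database. This is a minimal illustration under stated assumptions, not the system's implementation: the class name `CageCounter`, the cage labels, and the convention that the caller clears the tracker whenever `update` returns `True` are all hypothetical. The sketch also assumes each cage is visited once per patrol and that a frame without a decodable QR code keeps the current cage association.

```python
class CageCounter:
    """Counts unique tracker IDs per cage and resets at each new QR code."""

    def __init__(self):
        self.current_cage = None   # cage ID of the most recent valid QR code
        self.seen_ids = set()      # tracker IDs counted for the current cage
        self.archive = {}          # cage ID -> archived egg count

    def _archive_current(self):
        if self.current_cage is not None:
            self.archive[self.current_cage] = len(self.seen_ids)

    def update(self, qr_code, track_ids):
        """Process one frame.

        qr_code: decoded cage ID, or None if no code was read in this frame.
        track_ids: tracker IDs reported for this frame.
        Returns True when a hard reset occurred, so the caller can clear
        the tracker's cached trajectories.
        """
        if qr_code is not None and qr_code != self.current_cage:
            self._archive_current()
            self.current_cage = qr_code
            self.seen_ids = set()
            # IDs in this frame belong to trajectories about to be cleared,
            # so they are not carried into the new cage.
            return True
        if self.current_cage is not None:
            self.seen_ids.update(track_ids)
        return False

    def finish(self):
        """Archive the last cage and return all per-cage counts."""
        self._archive_current()
        return dict(self.archive)


if __name__ == "__main__":
    counter = CageCounter()
    frames = [("A01", []), (None, [1, 2]), ("A01", [2, 3]),
              ("A02", [3, 4]), (None, [7])]  # hypothetical frame stream
    for code, ids in frames:
        counter.update(code, ids)
    print(counter.finish())
```

Keeping the per-cage set of IDs separate from the tracker means that a drifting trajectory can at most affect the cage in which it occurs, which is the isolation property the mechanism is intended to provide.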
To further evaluate system performance under practical farming conditions, a 30 min inspection video was collected from a commercial pigeon house. During the test, the inspection robot moved at a constant speed while the onboard camera continuously captured cage images, and the video stream was processed online using the improved YOLO-PEDI model. Representative detection results are shown in Figure 6. The system maintained stable performance under complex backgrounds and partial occlusion. As shown in Figure 6, during robot movement, the optical axis frequently intersected the metal cage mesh at different angles, resulting in strong visual interference. Even when pigeon eggs were partially blocked by thick galvanized wires or occluded by the pigeon body, YOLO-PEDI remained able to localize the target accurately. This robustness is consistent with the enhanced feature selection enabled by the attention mechanism, which allowed the model to preserve informative texture and contour cues under challenging visual conditions. The resulting stable detections also provided reliable input for subsequent ByteTrack-based counting and association. In addition to counting, the system simultaneously performed egg-quality recognition. As shown in Figure 6, the model successfully distinguished normal pigeon eggs from broken eggs in dynamic inspection scenes. For the key production indicator of broken eggs, the system achieved a recall of 98.0%, indicating high sensitivity to fine defect features. This result suggests that the proposed model retained sufficient discriminative capacity to identify subtle structural abnormalities, such as cracks and surface damage, despite its lightweight design.
The overall results of dynamic inspection are summarized in Table 3. The system showed particularly strong performance in detecting high-value production indicators, especially broken eggs. Although dense occlusion occasionally caused trajectory interruption and affected cumulative counting accuracy, the overall counting accuracy still reached 80.9%. By introducing the QR-code-forced threshold mechanism, the system established a closed-loop association among video frames, cage numbers and detected targets. In field tests, with the support of supplemental illumination, the QR-code recognition rate remained above 99.5%, ensuring that each detected broken egg could be accurately traced back to its corresponding physical cage. This cage-level traceability mechanism effectively resolves a common limitation of traditional inspection, in which broken eggs can be detected but cannot be reliably assigned to a specific cage.
To complement the overall inspection results, we further quantified system performance in terms of tracking stability and processing speed using 15 inspection videos collected under different time periods and illumination conditions, comprising approximately 36,000 frames in total. The multi-object tracking accuracy (MOTA) remained stable at 88.6% when the patrol robot moved at a constant speed of 0.5–0.8 m/s. The variance of MOTA fluctuation remained below 0.04, even under frequent image jitter caused by rail joints, indicating strong tracking robustness during continuous motion. This stability is consistent with the secondary association mechanism in ByteTrack, which links low-confidence detections to historical trajectories and thereby reduces track fragmentation under sudden changes in illumination and partial occlusion.
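The secondary association mentioned above can be sketched in a few lines. In this simplified illustration, existing tracks are first matched to high-confidence detections, and the tracks left unmatched are then given a second chance against low-confidence detections. Two simplifications are made deliberately and should not be read as the tracker's actual behavior: greedy IoU matching is swapped in for the Hungarian assignment used by ByteTrack, and track boxes are passed in directly instead of being predicted by a Kalman filter. The thresholds and all names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, boxes, iou_min):
    """Greedily pair track IDs with box indices by descending IoU."""
    candidates = sorted(
        ((iou(tbox, dbox), tid, j)
         for tid, tbox in tracks.items()
         for j, dbox in enumerate(boxes)),
        reverse=True,
    )
    matches, used = {}, set()
    for overlap, tid, j in candidates:
        if overlap < iou_min:
            break  # remaining candidates overlap even less
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    unmatched = [tid for tid in tracks if tid not in matches]
    return matches, unmatched


def associate(tracks, detections, high=0.5, low=0.1, iou_min=0.3):
    """Two-stage association: high-score detections first, then low-score ones.

    tracks: dict mapping track ID -> box.
    detections: list of dicts with "box" and "score" keys.
    Returns (dict track ID -> matched detection, list of unmatched track IDs).
    """
    high_dets = [d for d in detections if d["score"] >= high]
    low_dets = [d for d in detections if low <= d["score"] < high]

    first, rest = greedy_match(tracks, [d["box"] for d in high_dets], iou_min)
    remaining = {tid: tracks[tid] for tid in rest}
    second, lost = greedy_match(remaining, [d["box"] for d in low_dets], iou_min)

    matched = {tid: high_dets[j] for tid, j in first.items()}
    matched.update({tid: low_dets[j] for tid, j in second.items()})
    return matched, lost
```

In this sketch, a partially occluded egg whose detection score drops below the high threshold can still extend its existing track in the second stage, which is the behavior credited above with reducing track fragmentation.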
The system also maintained favorable ID consistency and real-time performance. During a complete inspection trip, the average number of identity switches was only 1.2 per 1000 frames, indicating strong trajectory continuity and reliable counting uniqueness. In terms of computational efficiency, the average processing speed reached 68.4 FPS, with a peak speed of 82 FPS. This throughput exceeds the 30 FPS acquisition rate of the industrial camera and provides sufficient computational margin for additional edge-side functions, such as multi-sensor fusion and real-time warning. Together, these results demonstrate that the proposed system can deliver stable, traceable and real-time inspection performance in practical smart pigeon farming environments.
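For readers less familiar with the tracking metrics, the sketch below shows how MOTA and a per-1000-frame identity-switch rate are conventionally computed. It assumes the standard CLEAR MOT definition of MOTA was used in this study, which the text does not state explicitly, and the counts in the example are hypothetical placeholders rather than experimental data.

```python
def mota(false_negatives, false_positives, id_switches, ground_truth):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if ground_truth <= 0:
        raise ValueError("ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / ground_truth


def id_switch_rate(id_switches, frames, per=1000):
    """Identity switches normalized to a fixed number of frames."""
    if frames <= 0:
        raise ValueError("frames must be positive")
    return id_switches * per / frames


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(mota(false_negatives=12, false_positives=6, id_switches=2, ground_truth=200))
    print(id_switch_rate(id_switches=6, frames=5000))
```

Because missed detections, false positives and identity switches all enter the same numerator, MOTA can remain high even when a small number of identity switches occur, which is why the switch rate is reported separately.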
3.4.1. Analysis of Typical Failure Modes
To provide a more explicit understanding of the gap between the model-level mAP and the reported 80.9% cumulative counting accuracy, we conducted a systematic review of the recognition logs and representative video frames obtained during the field experiment. As illustrated in Figure 7, three major failure modes were identified: biological occlusion, motion blur, and environmental interference.
Biological occlusion was the dominant factor, accounting for approximately 70% of missed detections. In these scenarios, breeding pigeons physically covered the eggs to provide warmth or protection, causing the eggs to become completely invisible to the overhead camera. Therefore, even a highly accurate detection model cannot identify these targets when no visual evidence is available in the input image. Motion blur was another important contributor to counting errors. When the robot operated at speeds approaching 0.8 m/s, rail vibration and platform motion occasionally caused image jitter or blurred egg boundaries. This could destabilize the Kalman-filter-based prediction and data association process in ByteTrack, resulting in identity fragmentation, temporary target loss, or duplicate counts. Environmental interference also affected counting stability under certain viewing angles. Structural elements such as feeding troughs, thick galvanized wires, and nest padding created localized blind spots and partially obscured the elliptical contours of the eggs. These factors increased the difficulty of feature extraction in highly cluttered cage environments. The integration of the QR-code-based “hard reset” mechanism mitigated the impact of these failure events at the system level. By periodically reinitializing the spatial association at each cage unit, tracking interruptions were localized within individual cages, preventing error accumulation and propagation across adjacent cage units. This mechanism helped preserve the overall integrity of the production database under practical farm conditions.
3.4.2. System Robustness Under Challenging Field Conditions
The operational reliability of the YOLO-PEDI-based inspection framework was further analyzed under several challenging field conditions commonly encountered in commercial pigeon houses, including illumination heterogeneity, increased robot speed, and spatial-anchor degradation.
In multi-tier cage structures, illumination can vary considerably across different cage levels, with lower-tier cages often suffering from insufficient ambient light. To address this issue, the patrol robot was equipped with integrated LED arrays to provide supplementary illumination and maintain an adequate signal-to-noise ratio in poorly lit areas. In addition, the CBAM-enhanced backbone can strengthen the model’s attention to stable structural features, such as egg contours, while suppressing interference from low-contrast background textures. This design is expected to improve detection stability under heterogeneous illumination conditions.
Robot speed is another important factor affecting image quality and tracking continuity. During routine inspection, the robot operates at approximately 0.2 m/s, at which no obvious motion blur is typically observed. When the speed approaches 0.8 m/s, however, rail-induced vibration and platform motion may introduce motion blur or image jitter into the video stream. Under such conditions, ByteTrack’s Kalman-filter-based motion prediction can help maintain short-term trajectory continuity by estimating target locations from historical motion states, thereby reducing temporary target loss and identity switches. Nevertheless, excessively high speeds may still degrade image sharpness and counting stability, indicating that operational speed should be controlled within a reasonable range for reliable deployment.
The robustness of the spatial-anchor chain was also considered for partially occluded or degraded QR codes. In practical farm environments, QR labels may become partially unreadable because of dust, feathers, manure contamination, physical wear, or viewing-angle changes. To handle this situation, the system adopts a “graceful degradation” strategy. If a cage-level QR code cannot be decoded, the system temporarily relies on short-term visual tracking and maintains the current spatial association until the next valid QR code is detected. Once a clean QR code is recognized, the spatial mapping is reinitialized through the QR-code-based “hard reset” mechanism. This strategy localizes tracking interruptions within individual cage units and prevents identity drift or counting errors from propagating across the entire inspection row.