1. Introduction
With the continuous advancement of protected horticulture technologies, greenhouse-based fruit and vegetable production has become a key component of high-efficiency global agriculture [1]. Compared with open-field cultivation, greenhouses enable higher yields, extended supply periods, and more stable product quality through precise environmental regulation [2]. Nevertheless, diseases and insect pests remain critical limiting factors for greenhouse productivity. Under typical greenhouse ecological conditions—such as high humidity, low wind flow, and dense planting—pests and pathogens spread rapidly, exhibit strong concealment, and can cause severe damage within short time intervals [3]. Consequently, establishing real-time, accurate, and intelligent recognition systems for greenhouse diseases and pests is of great significance for improving management efficiency, reducing pesticide usage, and ensuring high-quality production.
Traditional disease and pest recognition in greenhouses has mainly relied on manual inspection and expert evaluation [3]. Although such approaches offer a certain degree of reliability, they exhibit several limitations. Manual scouting is labor-intensive and time-consuming, making it difficult to meet the high-frequency monitoring requirements of large-scale modern greenhouses [4]. Moreover, symptoms of diseases and pest infestations in greenhouses often appear similar, atypical in early stages, and concealed by complex plant structures. Variations in individual plant morphology or illumination conditions may lead to misjudgments [5]. The typically complex planting structures of modern greenhouses—such as multilayer leaf occlusions, mixed-cropping systems, reflective films, and supplemental lighting—further reduce the accuracy of manual inspection [6]. Meanwhile, rising labor costs and a decreasing number of agricultural protection specialists have made manual scouting insufficient for intelligent agricultural scenarios that require rapid response and high-precision identification [7].
To address these limitations, researchers have explored classical image-processing techniques, including color threshold segmentation, texture analysis, shape-based descriptors, edge operators, and handcrafted features [8]. Although these methods can identify certain typical symptoms under controlled conditions, their adaptability to complex greenhouse environments is limited [9]. For instance, color-thresholding approaches struggle with illumination-induced color shifts [10]; texture features are highly sensitive to leaf posture variations [11]; and tiny pests such as whiteflies and thrips exhibit small sizes and high morphological similarity, making them difficult to distinguish using handcrafted representations [12]. In addition, rule-based models are highly sensitive to dataset distribution and exhibit weak generalization, often requiring frequent parameter adjustments across crops, greenhouses, or different disease and pest development stages [13]. Thus, traditional image-processing methods are not suitable for long-term, stable deployment in practical greenhouse environments.

In recent years, the rapid development of deep learning has significantly advanced intelligent recognition technologies in agriculture. Convolution-based and Transformer-based models have achieved near-expert performance on benchmark datasets such as PlantVillage, and many studies have successfully captured lesion textures, color variations, and pest morphological details using convolutional features and attention mechanisms [14]. Concurrently, object detection frameworks such as YOLO, Faster R-CNN, DETR, and their variants have demonstrated outstanding performance in agricultural disease and pest detection tasks, enabling multi-object recognition in complex greenhouse scenes [15,16]. However, several challenges persist in applying deep learning to greenhouse disease and pest recognition. The greenhouse environment is characterized by strong specular reflections, complex backgrounds, and frequent occlusions, making it difficult for models to capture small-scale and low-contrast details [17]. Furthermore, mainstream deep-learning models typically rely on high-performance computing platforms, posing difficulties for deployment on mobile devices, edge-computing units, greenhouse robots, or smart cameras commonly used in greenhouses [18]. Large networks require high computational resources, making them unsuitable for prolonged operation on low-power devices [19]. Although lightweight models achieve faster inference, their representational capacity is limited and often results in reduced accuracy—particularly in tasks involving small-object detection [20].
Several recent studies have attempted to mitigate these issues. Gao et al. [21] proposed ACLW-YOLO, a lightweight tomato-fruit detection method based on an improved YOLOv11n architecture. The model was compressed to only 3.3 MB while maintaining an mAP of 95.2%, significantly improving deployment efficiency. Zhu et al. [22] developed a fruit disease and pest detection system based on knowledge graphs and deep learning, achieving a pest recognition accuracy of 94.9% on Raspberry Pi devices. Xu et al. [23] introduced CNNA, a lightweight tomato disease and pest classification network based on compressed ConvNeXt-Nano with multi-scale feature fusion and global channel attention, achieving 98.96% accuracy with substantially reduced model size and computational cost. Kong et al. [24] presented LCA-Net, which integrates cross-layer feature aggregation, channel–spatial attention, and Cut-Max cropping, achieving 83.8% accuracy for fine-grained recognition of 28 disease and pest categories. Zhang et al. [25] proposed an automatic greenhouse pest recognition system based on an improved YOLOv5 and machine vision framework, achieving an average accuracy of 96% and significantly enhancing tiny-pest detection capabilities, thereby providing real-time monitoring and decision support for greenhouse pest management.
To address the aforementioned challenges, a lightweight fruit and vegetable disease–pest recognition network named Light-HortiNet is proposed for resource-constrained greenhouse environments. The network is built upon a lightweight Mobile-Transformer architecture and incorporates cross-scale feature interaction and efficient attention mechanisms to maintain high recognition accuracy on low-power platforms.
A cross-scale lite attention module (CSLA) is designed, which performs cross-scale information fusion through low-rank decomposition and feature compression, thereby enhancing the model’s ability to capture fine-grained lesion and pest details.
A block-level substitution distillation mechanism (BLSD) is introduced, in which intermediate teacher features are used to improve the representational capability of lightweight models without increasing inference cost.
A small-object enhancement branch (SOEB) is constructed to strengthen the detection performance on targets of 5–20 pixels while preserving the lightweight structure. This mechanism is particularly effective for tiny pests such as whiteflies and aphids.
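To make the distillation idea behind BLSD concrete, the following minimal sketch (our illustration, not the paper's training code; the function name `blsd_step`, the MSE alignment loss, and the substitution probability `p_substitute` are simplifying assumptions) shows how a student block's output can be stochastically replaced by the matching teacher feature during training while an alignment loss pulls the student toward the teacher:

```python
import numpy as np

def blsd_step(student_feat, teacher_feat, p_substitute, rng):
    """One illustrative block-level substitution step: with probability
    p_substitute the student's block output is swapped for the teacher's
    feature (so downstream blocks train on cleaner inputs), and an MSE
    alignment loss pulls the student toward the teacher representation.
    At inference time only the student path is used, so cost is unchanged."""
    align_loss = float(np.mean((student_feat - teacher_feat) ** 2))
    out = teacher_feat if rng.random() < p_substitute else student_feat
    return out, align_loss
```

In a full pipeline, `p_substitute` would decay over training so that the student gradually takes over all blocks.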
4. Results and Discussion
4.1. Experimental Configuration
4.1.1. Hardware and Software Platform
The experimental hardware platform in this study was designed to cover a wide spectrum ranging from high-performance training servers to resource-constrained edge computing devices, ensuring deployability and stability across diverse application scenarios. During the high-performance training stage, NVIDIA A100 and RTX 4090 GPUs were utilized, providing powerful tensor-core computation capabilities and high-bandwidth memory resources, which enabled large-batch training and parallel multi-model experimentation. For edge inference evaluation, Jetson Nano and RK3588 were selected as representative low-power embedded platforms to validate real-time response capability in practical farmland and greenhouse environments, while a mobile-side NPU was further employed to assess inference latency and energy consumption on mobile devices. Through the cross-platform experimental design, systematic evaluation of performance consistency and deployment adaptability under heterogeneous computational conditions was achieved. The software environment was established based on the PyTorch 1.10.0 deep learning framework for model training and optimization, where the dynamic computation graph facilitated rapid prototyping and customized module development. During the inference acceleration stage, TensorRT was adopted for operator optimization and quantization to further reduce inference latency on edge devices, while ONNX Runtime was employed to enable cross-platform model inference, maintaining structural consistency between training and deployment pipelines and thereby reducing compatibility issues during model migration. The overall software stack covered training, exporting, optimization, and deployment stages, ensuring controllability and portability across the full model lifecycle.
4.1.2. Hyperparameter Settings
With respect to hyperparameter configuration, the dataset was divided into training, validation, and test sets with a ratio of 7:2:1. To ensure representational fairness across all categories, a stratified sampling strategy was employed during the partitioning process. This mechanism guaranteed that the distribution of disease and pest classes in each subset remained consistent with the overall dataset, effectively mitigating evaluation bias caused by potential class imbalance and ensuring that minority classes were sufficiently represented in the validation and test phases. During training, a batch size of 32 was adopted, the initial learning rate was set to , and a cosine annealing strategy was applied for dynamic learning rate adjustment, while the AdamW optimizer was employed to achieve more stable gradient update behavior. To enhance model generalization capability, five-fold cross-validation was introduced, where the dataset was partitioned into five subsets and one subset was selected as the validation set in each iteration while the remaining four subsets were used for training, and a total of five cycles were conducted to obtain more robust and unbiased performance estimation. The above combination of hyperparameters enabled stable convergence behavior and strong generalization performance under varying conditions, providing a reliable foundation for subsequent comparative experiments and deployment verification.
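The stratified 7:2:1 partition described above can be sketched as follows (a simplified illustration; the function name and the use of Python's standard `random` module are our own choices, not the paper's pipeline):

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.7, 0.2, 0.1), seed=42):
    """Split sample indices into train/val/test subsets while keeping the
    per-class proportions close to the global 7:2:1 ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                     # shuffle within each class
        n = len(idxs)
        n_train = round(n * ratios[0])
        n_val = round(n * ratios[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]        # remainder goes to test
    return train, val, test
```

Because the split is performed per class, a minority class with 50 samples still contributes roughly 35/10/5 samples to the three subsets.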
4.1.3. Baseline Models and Evaluation Metrics
The baseline models selected in this study covered representative lightweight detection and visual backbone networks commonly adopted in agricultural small-object scenarios, including YOLOv8n [49], YOLOv11n [50], YOLOX-Tiny [51], MobileViT [52], MobileNetV3 [53], EfficientDet-D0 [54], and Tiny-DETR [55]. To ensure the fairness and validity of the performance comparison, all baseline models were trained and evaluated under identical experimental conditions, strictly adhering to the same dataset partitioning strategies, input resolutions, and optimization hyperparameter settings.
Model performance evaluation was conducted using mAP@50 and mAP@50:95 for detection tasks; accuracy and F1-score for classification tasks; mAP-small and recall-small for small-object detection performance; inference speed metrics including FPS, model size (MB), and FLOPs; and real deployment latency and power consumption to assess comprehensive performance on edge devices. In mathematical definitions, the core evaluation formulas for detection and classification were uniformly expressed as follows:

$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$

$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P_i(R)\,dR,$

$\mathrm{FPS} = \frac{1}{T_{\mathrm{infer}}}, \qquad \mathrm{FLOPs} = \sum_{l=1}^{L} K_l^{2}\, C_{l-1}\, C_l\, H_l\, W_l.$

The variables in the above formulas were defined as follows: $TP$, $FP$, $TN$, and $FN$ denote the numbers of true positives, false positives, true negatives, and false negatives, respectively; $P_i(R)$ represents the precision curve of category $i$ at different recall rates $R$; $N$ denotes the number of categories; $T_{\mathrm{infer}}$ denotes the inference time per image; $L$ denotes the number of network layers; $C_l$, $H_l$, and $W_l$ denote the channel number and feature map dimensions of the $l$-th layer; and $K_l$ denotes the convolution kernel size. In a comprehensive sense, detection metrics mainly measured the combined performance of localization and classification, classification metrics reflected the accuracy of disease symptom recognition, small-object metrics emphasized the detection capability of fine-grained pest targets, speed-related metrics reflected real-time capability under resource-constrained environments, and deployment performance further demonstrated practical value in real agricultural edge scenarios, ensuring that the model was required not only to be accurate, but also to be fast and efficient.
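The classification metrics and per-category average precision can be computed in a few lines of NumPy (an illustrative sketch; the all-point interpolation used for AP below is a common convention, and the paper does not state which interpolation scheme it adopts):

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from raw detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def average_precision(recalls, precisions):
    """Area under the P(R) curve with all-point interpolation:
    precision is made monotonically non-increasing from right to left,
    then the curve is integrated over the distinct recall steps."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP is then the mean of `average_precision` over the N categories, evaluated at one IoU threshold (mAP@50) or averaged over thresholds 0.50–0.95 (mAP@50:95).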
4.2. Overall Performance Comparison
The objective of this experiment was to systematically evaluate the comprehensive perception capability of different lightweight visual models under complex greenhouse fruit and vegetable disease and pest scenarios. Through unified data partitioning and inference settings, the balance between detection accuracy, classification performance, and inference efficiency was compared, thereby validating the practicality and stability of the proposed model in resource-constrained environments.
From the overall results, as shown in Table 2 and Figure 7, the highest values of mAP@50, mAP@50:95, accuracy, and F1-score were achieved by Light-HortiNet, while a relatively high inference frame rate was maintained under a controlled parameter scale, indicating that a more favorable balance between accuracy and efficiency was realized. YOLOv8n and YOLOv11n, as representative one-stage detection frameworks, exhibited stable accuracy due to strong local feature modeling capability; however, fixed-scale feature fusion mechanisms still resulted in information loss when facing small targets and complex backgrounds in greenhouse environments. YOLOX-Tiny and MobileNetV3 adopted extremely lightweight designs, and computational complexity was reduced through channel pruning and depthwise separable convolutions, resulting in higher FPS performance, but feature representation dimensionality was constrained, leading to a pronounced decline in accuracy for fine-grained lesion and pest recognition. MobileViT enhanced global modeling capacity by introducing a lightweight transformer structure, yet scale coupling issues remained in attention computation, which limited performance in high-density small-object scenarios. EfficientDet-D0 relied on bidirectional feature pyramid networks to enhance multi-scale fusion, but under extreme lightweight configurations, channel capacity was restricted, leading to insufficient feature reuse efficiency. Tiny-DETR was designed based on a set-matching detection paradigm, but its global attention structure was unable to sufficiently capture local high-frequency textures under low-resolution embedding conditions, resulting in weaker performance in both accuracy and speed.
As shown in Figure 8, the significant advantage of Light-HortiNet was attributed to the introduction of multi-level low-rank constraints and cross-scale sparse modeling mechanisms in the feature mapping space, which caused the feature distributions to become more concentrated and the decision boundaries to become clearer. Traditional YOLO series models relied on convolutional kernels with fixed receptive fields, and their feature responses were mainly formed by the superposition of local linear transformations, which ensured translational invariance but led to feature aliasing under complex illumination and occlusion conditions. MobileNet series models reduced parameter scale through depthwise separable convolutions, which can be interpreted as an implicit rank-constrained approximation of convolution operators, improving speed but weakening the expressive freedom of the high-dimensional feature space. Transformer-based models relied on global correlation matrix modeling to capture long-range dependencies, but under lightweight constraints, the compression of attention matrix dimensions resulted in reduced discriminative capability. In contrast, Light-HortiNet formed a more stable low-dimensional manifold structure in the feature space through cross-scale lightweight attention and block-level distillation mechanisms, while the small-object enhancement branch strengthened local high-frequency response. Consequently, effective compression of feature information entropy and maximal preservation of discriminative information were achieved at the mathematical level, which constituted the fundamental reason why superior accuracy and efficiency were simultaneously obtained under a compact model scale.
4.3. Comparison of Small-Object Detection Capability
The objective of this experiment was to evaluate the perception and localization capability of different lightweight visual models under extremely small-scale target scenarios, particularly in real greenhouse environments where pest body sizes are extremely small, initial lesion areas are limited, and target-to-background contrast is low. Through unified small-object sample partitioning and consistent inference settings, model stability and recall capability in fine-grained recognition tasks were systematically measured.
As shown in Table 3, Light-HortiNet demonstrated a superior trade-off between detection precision and deployment efficiency. It achieved the highest values in mAP-small and recall-small, while maintaining competitive inference speeds with a latency of 43.8 ms and a controlled memory footprint of 14.5 MB. In terms of energy efficiency, the model operated at approximately 6.2 W, striking a favorable balance suitable for battery-powered edge devices. YOLOv11n exhibited relatively superior performance among the YOLO family, primarily benefiting from its updated feature fusion strategy which enabled stronger contextual modeling; however, this came at the cost of slightly elevated latency and memory usage compared to our method. YOLOv8n maintained stable performance, although its feature activation capability for extremely small targets remained constrained. YOLOX-Tiny and MobileNetV3 were oriented toward extreme lightweight design, achieving the lowest power consumption and latency by reducing channel numbers and network depth, but this optimization strategy resulted in a noticeable decline in detection accuracy. MobileViT enhanced global modeling capability by introducing a lightweight transformer structure, but significant memory overhead and loss of local high-frequency texture preservation were observed. EfficientDet-D0 incorporated bidirectional feature fusion mechanisms, yet semantic transmission capability was limited under extremely small target scales. Tiny-DETR was constrained by the representational bottleneck of the set-matching detection mechanism in low-resolution embedding spaces, leading to the highest latency (56.8 ms) and memory usage (24.1 MB) among the evaluated models, resulting in weaker overall performance in both speed and accuracy.
From the perspective of mathematical characteristics, small-object detection tasks inherently require the preservation of more local high-frequency information in high-dimensional feature spaces and the avoidance of excessive smoothing of feature energy during multi-scale mapping, as shown in Figure 9. Traditional convolution-stacking-based models exhibit exponentially expanding receptive fields as network depth increases, which facilitates semantic modeling but progressively attenuates micro-structural response intensity during downsampling. Attention-based models under lightweight constraints tend to compress the attention computation space, resulting in sparse feature correlation matrices that are insufficient to establish stable global dependency relationships for micro-scale targets. The significant advantage of Light-HortiNet was attributed to the introduction of a dedicated small-object enhancement mechanism in the feature mapping process, enabling the formation of high-resolution feature preservation pathways at shallow stages, while cross-scale attention modulation suppressed the interference of background noise on discriminative boundaries. Consequently, steeper response gradients and clearer inter-class margins were formed in the feature distribution space. This structural advantage at the mathematical level enabled more compact feature clustering for small-object samples, resulting in superior recall and precision performance compared with conventional lightweight networks.
4.4. Cross-Domain Generalization Analysis
To thoroughly evaluate the robustness of the proposed model under varying environmental conditions, we conducted a domain-specific performance analysis. The test set consists of two distinct data sources: the In-situ Field Set, characterized by complex backgrounds, variable illumination, and occlusion typical of real greenhouse production; and the Public Source Set, which generally contains cleaner backgrounds and simpler compositions. Quantifying the performance gap between these two domains is critical for assessing real-world deployment feasibility.
Table 4 presents the detection performance across the two domains. It is observed that all models achieve higher accuracy on the Public Source Set, reflecting the lower difficulty of these samples. However, significant performance degradation occurs when models are applied to the In-situ Field Set. For instance, MobileViT and EfficientDet-D0 experience a sharp decline in mAP@50 (Domain Gap of and , respectively), indicating their limited ability to handle environmental noise such as leaf occlusion and water vapor interference. In contrast, Light-HortiNet demonstrates superior generalization capability. It not only achieves the highest absolute performance on the challenging field data ( ) but also maintains the smallest performance gap ( ) between domains. This robustness can be attributed to the synergistic effect of the CSLA module, which effectively filters background redundancy, and the specific data augmentation strategies (e.g., Mixup and Copy-paste) that simulate complex occlusions during training, thereby forcing the model to learn invariant pest features rather than relying on simple background correlations.
4.5. Ablation Studies
4.5.1. Impact of Proposed Network Modules
The objective of this ablation experiment was to systematically verify the independent contributions and synergistic gain effects of the three proposed core modules on overall network performance. By progressively introducing the CSLA, BLSD, and SOEB modules, the variation trends of detection accuracy, small-object recognition capability, and inference efficiency under different combination strategies were compared, thereby revealing the actual impact of each module on the model representation capability and computational overhead.
As shown in Table 5, when only the backbone network was employed, values of and were achieved for mAP@50 and mAP-small, respectively, indicating that basic target recognition capability was obtained, while perception of small-scale targets remained limited. After the CSLA module was introduced, the overall detection accuracy increased to , and small-object performance was also noticeably improved, demonstrating that the cross-scale attention mechanism effectively enhanced information interaction among multi-scale features. With the introduction of the BLSD module, further improvements were observed in both mAP@50 and small-object metrics, reflecting the positive role of block-level distillation in enhancing semantic abstraction capability. When the SOEB module was introduced independently, the most significant improvement was observed in small-object performance, and the increase in mAP-small exceeded that of the overall mAP, confirming that the high-resolution small-object enhancement branch exerted a stronger amplification effect on micro-structural information. Pairwise combinations of different modules consistently exhibited stable performance superposition characteristics, while the simultaneous activation of all three modules achieved the optimal result, with mAP@50 increasing to and mAP-small reaching , while acceptable inference frame rates were maintained, thereby validating the synergistic design of the overall architecture.
4.5.2. Impact of Data Augmentation Strategies
To further evaluate the necessity of the proposed data processing pipeline, a stepwise ablation study was conducted on the data augmentation strategies. The experiment started with the raw dataset and sequentially incorporated Basic Augmentation (color jitter, random occlusion), Advanced Geometric Augmentation (Mosaic, Mixup), and the Small-object Copy-paste strategy. The complete Light-HortiNet architecture was used for all tests to isolate the contribution of data quality to the final performance.
As presented in Table 6, the model trained on raw data achieved a relatively low mAP-small of , indicating severe overfitting to simple background patterns and insufficient learning of pest features. The introduction of Basic Augmentation improved mAP@50 by , confirming that color jitter and random occlusion effectively simulated the complex illumination and leaf occlusion variance typical of greenhouse environments, thereby forcing the network to learn more robust features. Subsequently, the inclusion of Mosaic and Mixup strategies resulted in a significant performance leap, with mAP@50 rising to . This suggests that enriching scene combinations and smoothing label distributions greatly enhanced the model's spatial generalization capability. Finally, the application of the Small-object Copy-paste strategy yielded the most critical gain for fine-grained detection, boosting mAP-small from to . This substantial increase demonstrates that increasing the occurrence frequency and spatial distribution of micro-scale targets effectively alleviates the class imbalance problem, ensuring that the lightweight model maintains high sensitivity to tiny pests like whiteflies and thrips even in complex scenes.
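The small-object copy-paste idea can be sketched in a few lines (our illustration only; the patch source, paste count, and (x, y, w, h) box format are assumptions, and a production pipeline would additionally blend patch edges and avoid pasting over existing annotated targets):

```python
import numpy as np

def copy_paste_small_objects(image, patches, rng=None, n_paste=4):
    """Paste small object patches (e.g. 5-20 px pest crops) at random
    positions in the image, returning the augmented image and the new
    ground-truth boxes as (x, y, w, h) tuples."""
    rng = rng or np.random.default_rng(0)
    out = image.copy()
    boxes = []
    h, w = out.shape[:2]
    for _ in range(n_paste):
        patch = patches[rng.integers(len(patches))]
        ph, pw = patch.shape[:2]
        x = int(rng.integers(0, w - pw))       # random top-left corner
        y = int(rng.integers(0, h - ph))
        out[y:y + ph, x:x + pw] = patch        # hard paste (no blending)
        boxes.append((x, y, pw, ph))
    return out, boxes
```

Each paste adds one more positive small-object instance to the training signal, which is exactly the frequency-rebalancing effect discussed above.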
4.5.3. Parameter Sensitivity Analysis
To determine the optimal hyperparameter settings for the proposed architecture, we conducted sensitivity analyses on two critical configuration sets: the low-rank compression ratio in the CSLA module and the multi-scale kernel combinations in the SOEB module.
Impact of Low-rank Compression Ratio. The rank parameter r in the Cross-Scale Lightweight Attention (CSLA) module controls the dimensionality of the feature subspace approximation (defined as ). A smaller compression ratio (larger r) preserves more semantic information but increases computational cost, while a larger ratio reduces redundancy but may lead to feature collapse. The theoretical suitability of this low-rank constraint for greenhouse scenarios stems from the intrinsic sparsity of pest features relative to the complex background. In typical greenhouse imagery, visual information is dominated by repetitive background patterns (e.g., leaf textures and soil), while salient targets (pests and lesions) occupy a low-dimensional manifold within the feature space. By enforcing a low-rank approximation, the CSLA module effectively acts as a semantic filter that suppresses high-frequency background noise and forces the attention mechanism to focus on the principal structural components of the targets, thereby enhancing feature purity without requiring dense full-rank computation. As shown in Table 7, we evaluated reduction ratios of and 16. The results indicate that a ratio of 4 achieves the best trade-off, yielding a high mAP@50 of with a moderate FLOPs increase. Although a ratio of 2 slightly improves accuracy ( ), it disproportionately increases the model size and FLOPs, reducing FPS to . Conversely, excessive compression (ratio 16) causes a significant drop in accuracy ( ), failing to capture sufficient fine-grained details.
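The low-rank mechanism can be sketched as a toy single-head attention in NumPy (an illustration, not the actual CSLA implementation; the token count `N`, channel width `C`, and single-head formulation are our assumptions), where the query/key affinity is computed in a rank-`r` subspace with `r = C // ratio`:

```python
import numpy as np

def low_rank_attention(x, Wq, Wk, Wv):
    """Single-head attention whose Q/K projections live in a rank-r
    subspace. x: (N, C) token features; Wq, Wk: (C, r); Wv: (C, C).
    The N x N affinity matrix costs O(N^2 * r) instead of O(N^2 * C)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    r = q.shape[-1]
    logits = q @ k.T / np.sqrt(r)                 # affinity in rank-r subspace
    logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

rng = np.random.default_rng(0)
N, C, ratio = 16, 64, 4
r = C // ratio  # rank of the Q/K subspace for reduction ratio 4
y = low_rank_attention(rng.normal(size=(N, C)),
                       rng.normal(size=(C, r)) * 0.1,
                       rng.normal(size=(C, r)) * 0.1,
                       rng.normal(size=(C, C)) * 0.1)
```

Increasing `ratio` shrinks `r`, trading affinity expressiveness for FLOPs, which is exactly the trade-off swept in Table 7.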
Impact of Multi-scale Kernel Combinations. The Small-object Enhancement Branch (SOEB) relies on depthwise convolutions with varying kernel sizes to capture features at different receptive fields. We compared the default combination against single-scale and other multi-scale variants. As presented in Table 8, using a single kernel limits the receptive field, resulting in poor detection of larger pests or clustered targets (mAP@50: ). The combination of provides the most robust performance (mAP@50: ), effectively covering the 5–20 pixel scale range of typical greenhouse pests. Interestingly, increasing kernel sizes further to did not yield significant accuracy gains ( ) but noticeably increased latency, validating that the combination is the most efficient design for this specific task.
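Why larger kernel sets mainly cost latency can be seen from a back-of-envelope depthwise cost model (a sketch under stated assumptions: the kernel sets {3, 5, 7} and {5, 7, 9}, the channel count, and the feature-map size below are illustrative values chosen by us, since the exact combinations are reported in Table 8):

```python
def depthwise_macs(kernel_sizes, channels, h, w):
    """Multiply-accumulate count of parallel depthwise branches: each
    k x k depthwise convolution over C channels on an H x W feature map
    costs C * H * W * k^2 MACs, and the branches run in parallel."""
    return sum(channels * h * w * k * k for k in kernel_sizes)

# Illustrative comparison: an assumed {3, 5, 7} combination vs. a
# larger-kernel {5, 7, 9} variant on a 64-channel 80x80 map.
cost_357 = depthwise_macs((3, 5, 7), channels=64, h=80, w=80)
cost_579 = depthwise_macs((5, 7, 9), channels=64, h=80, w=80)
```

Because the cost grows with k squared, shifting the set upward by one kernel size inflates the branch cost substantially while the covered receptive field grows only linearly.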
4.6. Discussion
In multi-platform deployment experiments, Light-HortiNet exhibited stable execution efficiency and favorable energy-performance characteristics across heterogeneous embedded computing environments. On the Raspberry Pi 4B platform (ARM Cortex-A72, 4 GB RAM), after INT8 quantization and acceleration using the ONNX Runtime inference engine, the model achieved a stable throughput of – FPS at an input resolution of , with an average per-frame latency of approximately 106–128 ms. Under a typical operating power consumption of – W, the power-normalized throughput reached approximately – FPS/W, satisfying the continuous operation requirements of low-power static monitoring nodes in greenhouse environments.
On the Jetson Nano platform (128-core Maxwell GPU, 4 GB RAM), after TensorRT INT8 acceleration, the throughput increased to – FPS, with per-frame latency reduced to approximately 45–54 ms. The measured power consumption was maintained at – W, yielding an energy efficiency of – FPS/W, which was well suited for deployment in greenhouse mobile inspection robots and rail-based automatic inspection systems. On the Jetson Xavier NX platform (384-core Volta GPU), the throughput was further elevated to – FPS, and the latency was reduced to 27–32 ms, while achieving an energy efficiency of – FPS/W within a power envelope of 9–12 W. Overall, the experimental results demonstrated that the proposed method supported sustained low-power operation on ARM-based platforms and provided reliable real-time video stream processing capabilities on embedded GPU platforms, confirming its practical engineering value for resource-constrained agricultural Internet-of-Things applications.
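Latency, throughput, and power-normalized throughput of the kind reported above can be measured with a harness of roughly this shape (a generic sketch; the warm-up and iteration counts and the function names are our choices, and the power reading would come from an external source such as NVIDIA's `tegrastats` utility on Jetson devices):

```python
import time

def benchmark(fn, warmup=5, iters=50):
    """Mean per-call latency (ms) and throughput (FPS) of fn, after a
    short warm-up to exclude one-time initialization costs."""
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    latency_ms = (time.perf_counter() - t0) / iters * 1e3
    return latency_ms, 1e3 / latency_ms

def fps_per_watt(fps, watts):
    """Power-normalized throughput (FPS/W), as used in the deployment
    comparison above."""
    return fps / watts
```

In practice `fn` would wrap one inference call on the deployed TensorRT or ONNX Runtime engine with a fixed input tensor.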
To provide a comprehensive assessment of practical application costs, it is necessary to discuss implementation complexity, training stability, and potential failure modes. Regarding implementation complexity, while the inference architecture is streamlined, the training pipeline introduces significant engineering overhead due to the Block-level Substitution Distillation (BLSD) framework. This requires maintaining a synchronous teacher–student computational graph and precise feature alignment logic, which increases GPU memory usage during training compared to standard single-model training. In terms of training stability, the proposed dynamic substitution mechanism mitigates gradient vanishing, yet the model exhibits sensitivity to the decay schedule of the substitution probability. An overly aggressive decay rate can lead to feature collapse before the student network fully establishes its representational capacity, necessitating careful hyperparameter tuning. Finally, distinct failure modes were observed during stress testing. The lightweight backbone, constrained by its limited channel capacity, tends to produce merged bounding boxes in scenarios with high-density pest occlusion, resulting in undercounting. Additionally, the model shows reduced robustness against significant spectral shifts, such as those caused by specific narrow-band LED supplemental lighting, indicating a reliance on color consistency within the training domain.
4.7. Limitations and Future Work
Although the proposed Light-HortiNet has achieved satisfactory performance in greenhouse fruit and vegetable pest and disease recognition tasks, certain limitations remain at the practical application level. First, the model has been primarily trained and inferred based on two-dimensional visible-light images, and for some diseases with highly concealed or extremely weak early symptoms, the visual features in the visible spectrum remain insufficiently informative. Under extreme environmental conditions such as strong specular reflection, heavy fog, or severe occlusion, detection stability still has room for improvement. Second, we applied the model to continuous video sequences captured in a greenhouse environment. The model maintained high accuracy on stable frames but became unstable under conditions such as rapid insect movement or strong leaf swaying caused by ventilation, showing bounding-box jitter and intermittent missed detections (flickering). This indicates that frame-by-frame independent inference lacks the temporal consistency required for dynamic monitoring. In addition, although the model has been optimized for edge devices, trade-offs between memory bandwidth and real-time performance may still arise on embedded platforms with lower power budgets. Future work will focus on multimodal information fusion and cross-scenario generalization. On the one hand, infrared, thermal, and hyperspectral sensing data will be introduced and jointly modeled with visible-light images to enhance robustness against concealed diseases and environmental interference. On the other hand, addressing the instability observed in our preliminary video tests, temporal modeling and video-level detection frameworks will be explored.
Specifically, spatiotemporal evolution information of targets will be incorporated into a unified modeling system to mitigate detection jitter and strengthen dynamic perception capabilities for pest spreading trends and disease lesion expansion processes.
5. Conclusions
This study was oriented toward the challenges of early recognition of fruit and vegetable diseases and pests and real-time deployment under resource-constrained greenhouse environments, and a lightweight and high-precision Light-HortiNet recognition framework was constructed, providing a technically feasible pathway for intelligent and fine-grained greenhouse management. By introducing a cross-scale lightweight attention mechanism, a block-level substitution distillation strategy, and a small-object enhancement branch, key bottlenecks in conventional lightweight models, including insufficient perception capability for small-scale targets and constrained feature representation capacity, were effectively addressed. In comprehensive experiments, superior overall performance was achieved in disease and pest detection tasks compared with mainstream lightweight models, with mAP@50 reaching , mAP@50:95 reaching , classification accuracy reaching , and F1-score reaching , while in small-object detection tasks, mAP-small of and recall-small of were obtained. In addition, stable real-time inference capability was maintained on edge device platforms, enabling an effective balance between accuracy and efficiency. The experimental findings demonstrate that the proposed method exhibits innovation at the algorithmic level and significant promotion value at the practical application level, and reliable technical support can be provided for precise greenhouse disease and pest early warning, intelligent pesticide regulation, and digital production management, thereby establishing an important foundation for the intelligent upgrading of facility agriculture.