YOLO-ST-OD: An Enhanced YOLO-Based Architecture for UAV Detection of Sunburned Kiwifruit Under Complex Orchard Conditions

Niu, Zhen; Su, Yunwang; Jin, Ning; Xu, Suguang; Peng, Jiayi; Sigrimis, Nick; Han, Dong; Zhang, Dongyan

doi:10.3390/horticulturae12050630


by Zhen Niu ¹,², Yunwang Su ¹,², Ning Jin ³, Suguang Xu ¹,², Jiayi Peng ¹, Nick Sigrimis ⁴, Dong Han ¹,²,* and Dongyan Zhang ¹,²,*

¹ College of Mechanical and Electronic Engineering, Northwest A&F University, Yangling 712100, China

² Shaanxi Key Laboratory of Agricultural Information Perception and Intelligent Service, Northwest A&F University, Yangling 712100, China

³ Department of Resources and Environmental Engineering, Shanxi Institute of Energy, Jinzhong 030600, China

⁴ Laboratory of Machine Systems, Dept of Natural Resources and Agricultural Engineering, Agricultural University of Athens, Iera Odos 75, 118 55 Athens, Greece

* Authors to whom correspondence should be addressed.

Horticulturae 2026, 12(5), 630; https://doi.org/10.3390/horticulturae12050630

Submission received: 13 April 2026 / Revised: 16 May 2026 / Accepted: 17 May 2026 / Published: 19 May 2026


Abstract

Accurate detection and efficient loss assessment are critical technical foundations for disaster evaluation by agricultural insurance providers. However, existing detection methods often suffer from low accuracy and high missed detection rates for small-scale sunburn lesions in trellis-cultivated kiwifruit, owing to unclear texture features and severe canopy occlusion. This study proposes YOLO-ST-OD, an improved version of YOLOv11s, for detecting small sunburned kiwifruit targets in a range of complex environments. The LSKNet module dynamically adjusts the receptive field to capture key features precisely and suppress background noise; the multi-branch spatial and channel enhancement mechanism of the MCSEAM module compensates for feature loss caused by overlapping leaves or fruits; and the RFAMPS module fuses sub-pixel convolution with dynamic attention for high-fidelity spatial reconstruction. The experimental results show that YOLO-ST-OD reaches a precision (P) of 0.862, a mean average precision (mAP) of 0.837 and a recall (R) of 0.818. Compared with mainstream models such as YOLOv5s, YOLOv7, YOLOv8s, YOLOv9, YOLOv10s and Faster R-CNN, it has better comprehensive performance in terms of precision, mAP and floating-point computation. Compared with the baseline model YOLOv11s, which achieved a precision of 0.813, an mAP of 0.792 and a recall of 0.752, YOLO-ST-OD improved precision, recall and mAP by 6.03%, 8.78% and 5.68% in relative terms, respectively. The experimental results also demonstrated the robust performance of YOLO-ST-OD across varying levels of occlusion, fruit densities and imaging altitudes. This research can provide technical support for the rapid assessment of sunburn damage to kiwifruit, enabling faster post-disaster assessments and reducing costs for insurers.

Keywords:

kiwifruit; UAV remote sensing; occlusion-aware detection; agricultural insurance; sunburn hazard assessment

1. Introduction

Against the backdrop of intensifying climate change, sunburn has become one of the most destructive meteorological disasters in fruit tree cultivation [1]. It poses a serious threat to the kiwifruit industry, affecting both the economic feasibility of production and the sustainability of cultivation [2]. Meanwhile, the global agricultural insurance market has continued to expand. In 2019, global agricultural insurance premiums reached approximately USD 30 billion, and each year more than USD 20 billion of effective insurance policies stabilize agricultural incomes [3]. However, as the agricultural insurance business has grown rapidly, its expense ratio has risen steadily, mainly because of the high cost of underwriting and claims settlement. Traditional underwriting relies primarily on manual counting. For scattered small-scale farmers in particular, investigation and loss assessment are difficult and require large amounts of human and material resources, which increases costs [4]. It is therefore extremely important to develop an efficient, precise and automated method for detecting sunburn damage in kiwifruit.

Target detection, one of the fundamental challenges in computer vision, has been widely used in agricultural intelligent perception driven by deep learning [5,6,7], including pest and disease identification, fruit quality grading, crop growth monitoring and weed identification and localization. However, these models still struggle with the occlusion and small-target scenarios characteristic of sunburn lesions. Existing algorithms fall into two categories according to their feature extraction mechanism: two-stage and single-stage target detection algorithms. Two-stage algorithms first generate candidate boxes and then perform operations such as classification and identification on the target region [8]. Although some studies have improved detection accuracy with two-stage models represented by R-CNN, these models also increase the computational complexity [9,10]. In contrast, single-stage detection algorithms, such as the YOLO series, obtain the target's category and location in a single step by applying convolution directly to the entire input image, which drastically reduces the computational time [11]. For example, Meng et al. proposed a spatiotemporal convolutional neural network model that significantly improved the detection of pineapple fruits by fusing the shifted window Transformer with a region-based convolutional neural network. Experimental results showed that the model achieved 92.54% accuracy in the pineapple fruit detection task with an average inference time of 0.163 s [12]. In addition, YOLO networks have been applied in other related fields, such as the YOLO-CG-HS model for detecting the occurrence of Fusarium head blight in wheat [13], the YOLOv5s-ShuffleNetv2-Ghost model for detecting the flowering rate of apple blossoms in natural environments [14], the YOLOv5s-SBCSM model for counting dragon fruits for yield estimation [15] and the GTCBS-YOLOv5s model for identifying weed species in rice fields [16]. However, the specificity of agricultural scenarios imposes stricter requirements on target detection algorithms. Researchers worldwide have mainly optimized existing algorithms by adopting deeper network structures, optimizing the backbone network, adding feature pyramids and introducing attention mechanisms, so as to improve robustness in complex scenes and achieve better detection results.

At this stage of research, attempts have been made to apply target detection techniques to various segments of the kiwifruit industry chain, but they still face significant challenges in adapting to real-world scenarios. Research at the industrial end has focused on fruit grading, flower and bud thinning, pollination and pests and diseases. Among these studies, a kiwifruit maturity grading system based on an improved Mask R-CNN achieved over 80% accuracy by simulating complex light conditions in the orchard [17]. For kiwifruit pollination, a whole-process detection algorithm based on frequency-domain feature fusion, KIWI-YOLO, has been proposed, which makes full use of frequency- and spatial-domain information to improve the recognition of contour detail features and achieves a detection accuracy of 91.6% [18]. For pest and disease detection, a multimodal fusion network exceeded 89.4% accuracy in the early identification of ulcer disease through the synergistic analysis of visible and near-infrared (NIR) data, but it relies on stationary high-precision spectral equipment, which makes it difficult to adapt to mobile inspection [19]. However, given the episodic nature and uncertainty of kiwifruit sunburn, systematic assessment studies on this problem are relatively scarce. Sunburn detection in kiwifruit orchards suffers from insufficient sensitivity to peel discoloration and from feature confusion caused by complex orchard backgrounds (shading, light). These bottlenecks have limited the practical value of existing detection methods in field scenarios, and there is an urgent need for a dedicated detection framework for sunburned fruit in small-target and complex scenarios.

In recent years, with the deep penetration of UAV technology into agriculture, precision agriculture and smart agriculture have developed rapidly [20]. The technology has been successfully applied to plant protection spraying [21], drone seeding [22], crop phenotype monitoring [23] and agricultural disaster monitoring [24]. Low-altitude UAV imagery provides the spatial resolution required for detecting small sunburn lesions but poses challenges under occlusion. In kiwifruit sunburn disaster monitoring in particular, UAV remote sensing can significantly improve the efficiency of agricultural insurance loss determination and reduce manpower costs through non-contact canopy detection, the core of which lies in accurately identifying the fruit status in complex orchard environments. Driven by both the intensification of global climate change and the surge in demand for the emergency assessment of agricultural disasters, a small-target detection system for kiwifruit sunburn, a typical heat damage disorder, can be built on ultra-low-altitude UAV remote sensing by integrating the high-precision image acquisition system of a multi-rotor UAV with an improved YOLO deep learning algorithm. For the “Hongyang” kiwifruit variety, which has relatively low resistance to heat stress, this study proposes a small-target detection network for sunburned kiwifruit, on the hypothesis that integrating multiple modules can enhance the detection of small-area sunburn under occlusion. The framework aims to break through the bottlenecks in accurately detecting small sunburned fruits in complex orchard environments; it facilitates the assessment of kiwifruit sunburn and could improve the efficiency of agricultural insurance loss determination.

2. Materials and Methods

2.1. Study Area

Qionglai City, Sichuan Province (103°27′ E–103°44′ E, 30°26′ N–30°50′ N), as one of the core production areas of Chinese kiwifruit, exhibits remarkable regional specificity in its ecological conditions. The altitude gradient of the production area ranges from 500 to 1000 m, forming a typical mountainous vertical climate zone. Its microclimate features a daily temperature difference of 8.5–12.3 °C, which significantly promotes the accumulation of soluble solids in the fruits. From 20 August 2024 to 22 August 2024, Qionglai City was under a yellow warning for high temperatures, accompanied by strong direct sunlight with a maximum hourly radiation of 398 W/m², which caused sunburn in kiwifruit (Figure 1).

2.2. Data Acquisition

In this study, images of sunburned “Hongyang” kiwifruit were collected using a DJI Mavic 3M (DJI, Shenzhen, China) UAV with an integrated 20-megapixel visible-light camera. The kiwifruit trellis structure was about 2.0 m high. Because sunburned kiwifruits are small targets, the UAV was flown at low altitude, at relative distances of 1.0 m, 2.0 m and 3.0 m above the top of the kiwifruit trellis, which corresponds to flight heights of about 3 m, 4 m and 5 m above the ground. Data were collected between 21 August 2024 and 23 August 2024, from 10:00 to 12:00, when the sun was directly overhead and the impact of tree canopy shade was minimal. Sunlight was more abundant during this period, so the color difference between the diseased area and the normal area was more noticeable and easier for the camera to capture, and the texture was clearer. When the propeller-induced airflow caused the kiwifruit leaves to flap, the flight posture was adjusted. After screening, 803 images at a 3 m height, 917 images at a 4 m height and 767 images at a 5 m height were retained, totaling 2487 images. The original data were used for the subsequent expansion of the dataset and model training (Figure 2).

2.3. Dataset Production and Data Enhancement

The data in this study were all collected through on-site photography in the Qionglai kiwifruit orchard of Sichuan Province, with the aim of reflecting the real condition of kiwifruit after a sunburn disaster. The original RGB dataset consisted of 2487 images, including kiwifruit sunburn images with various complex backgrounds, different densities, different occlusion scenarios and different heights. The data were expanded sixfold, from 2487 to 14,922 images, by simulating brightness changes, adding noise points and random cropping. Subsequently, the data underwent 4× super-resolution reconstruction using the ESRGAN architecture. To minimize artifacts that might be introduced by the GAN model, the RRDB module without batch normalization was selected (Figure 3). The dataset was divided into training, validation and test sets in a 7:2:1 ratio, comprising 10,445, 2984 and 1493 images, respectively. The sunburned fruits (N) and healthy fruits (Y) in the dataset were labeled using the annotation software Labelme (V3.16.2), and .txt files in YOLO format were generated for subsequent network training.
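The dataset arithmetic above can be checked with a short sketch. The function name `split_counts` and the rounding convention (floor each share, give the remainder to the last subset) are assumptions made for illustration; the text does not state how the split was rounded, but this convention reproduces the reported counts.

```python
def split_counts(total, ratios=(7, 2, 1)):
    """Split `total` items by integer ratios; the last subset takes the remainder.

    The rounding convention is an assumption, not taken from the paper.
    """
    denom = sum(ratios)
    counts = [total * r // denom for r in ratios[:-1]]
    counts.append(total - sum(counts))
    return counts


original_images = 2487
augmented_images = 14922

# Expansion factor implied by the reported image counts.
print(augmented_images // original_images)  # 6

# Training, validation and test set sizes for a 7:2:1 split.
print(split_counts(augmented_images))  # [10445, 2984, 1493]
```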

3. Methodology

3.1. YOLOv11s Network

YOLOv11 is a recent evolution of the You Only Look Once (YOLO) series of target detection models that aims for a better balance between real-time detection performance and lightweight design. Its core architecture rebuilds the backbone network with depthwise separable convolutions, which reduces the number of parameters and the computation and thereby improves the inference efficiency, and it embeds a dynamic cross-stage feature fusion mechanism and a dynamic cross-stage partial network structure to enhance multi-scale feature representation. Compared with the earlier versions (YOLOv5/v8), YOLOv11 optimizes the hierarchical interaction method based on a feature pyramid (FPN + PAN) and adopts adaptive weight assignment to improve the extraction of semantic information from small targets, which can effectively alleviate the problems of occlusion and background interference in complex scenes.

3.2. YOLO-ST-OD Network

In this study, YOLOv11s was used as the baseline model, and, in order to solve the problem of small-target occlusion in sunburned kiwifruit detection, YOLOv11s was improved to develop a model specifically for sunburned fruit detection, called YOLO-ST-OD (Figure 4). The model consists of three main components: the backbone network, the neck network and the head network. The backbone network utilizes cascaded C3k2_LSK modules to achieve rapid contextual aggregation, constructing hierarchical features through a grouped residual structure and a channel recalibration strategy. Its shallow modules capture speckle microtextures, while the deep modules employ channel attention to suppress leaf reflections and noise interference. The neck network employs a dual-path interaction mechanism, fusing deep pathological semantics with mid-level geometric features via upsampling and utilizing a Concat operation to concatenate the channel weights and spatial encodings of the MCSEAM module. Through multi-branch spatial and channel enhancement mechanisms, it effectively compensates for feature loss caused by occlusion, thereby achieving the dynamic calibration of the feature maps. The detection head incorporates improved RFAMPS sub-pixel convolution, fusing dynamic attention for high-fidelity spatial reconstruction. Through adaptive receptive field optimization, it employs convolution kernels of different sizes to model sub-centimeter spots and diffuse lesions. Ultimately, by combining multi-granularity inputs, it forms a highly robust, scale-aware detection system.

3.2.1. C3k2_LSK Module

To address the technical challenges of detecting small sunburn targets on kiwifruit in complex orchard scenarios, this study proposes an enhanced detection framework optimized using the large selective kernel network (LSKNet) architecture (Figure 5). Unlike the traditional fixed-kernel attention mechanism of YOLOv10, the framework integrates the dynamic perception mechanism of LSKNet with the multi-scale feature extraction capabilities of the C3k2 module and combines cross-layer heterogeneous feature coupling, multi-granularity feature decoupling and fusion and dynamic spatial-adaptive regulation strategies to construct a detection system with spatial-channel dual visual attention and adaptive perception capabilities. First, in terms of the feature interaction mechanism, spatial-channel dual visual attention is constructed by integrating the dynamic attention mechanism of LSKNet (Figure 6) with the C3k2 module through the cross-layer heterogeneous feature coupling method [25]. Softmax-normalized attention weights are used to achieve dynamic focusing and contextual feature mapping for sunburned fruit and leaf texture features. Second, at the level of multi-granularity feature representation, a gating-guided feature decoupling and fusion architecture is designed, with parallel feature extraction branches containing regular convolution, depthwise separable convolution and dilated convolution. The dynamic selection and non-linear fusion of multi-scale features are realized through differentiable gating units, and a channel attention-guided kernel selection mechanism is used to establish adaptive mapping relations between multi-scale features. Finally, in the optimization of the dynamic perception mechanism, local details (e.g., edge textures of sunburn spots) and global semantics (e.g., distribution of lesion areas) are extracted by a two-branch convolution kernel (k2), which is combined with the dynamic weighting mechanism of LSK to adaptively enhance the key scale features and suppress irrelevant background noise, so as to realize intelligent switching between global contextual perception and local texture focusing.
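The softmax-weighted selection among parallel branches described above can be illustrated with a minimal NumPy sketch. It is not the C3k2_LSK implementation: the function name `select_branches` is hypothetical, the three branches stand in for the regular, depthwise separable and dilated convolution outputs, and the branch scores (`logits`) are random placeholders, whereas in the actual module they would be produced by learned layers.

```python
import numpy as np


def select_branches(branches, logits):
    """Fuse parallel branch outputs with per-pixel softmax weights.

    branches: array of shape (K, C, H, W), one feature map per branch.
    logits:   array of shape (K, H, W), unnormalized spatial scores per branch
              (placeholder input; learned in a real network).
    """
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)  # softmax over the K branches
    # Each pixel takes a convex combination of the K branch responses.
    fused = (branches * weights[:, None, :, :]).sum(axis=0)
    return fused, weights


rng = np.random.default_rng(0)
branches = rng.normal(size=(3, 8, 4, 4))  # 3 branches, 8 channels, 4 x 4 map
logits = rng.normal(size=(3, 4, 4))
fused, weights = select_branches(branches, logits)
print(fused.shape)  # (8, 4, 4)
```

Because the weights sum to one at every pixel, a location dominated by one branch effectively switches to that branch's receptive field, while equal scores reduce to a plain average of the branches.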

3.2.2. Multi-Convolutional Spatial and Channel Augmentation Module

The spatial and channel augmentation module (SEAM) aims to enhance the response of the non-occluded region to compensate for the feature loss in the occluded region and improve the overall detection performance [26]. However, the SEAM module is deficient in dealing with feature masking and scale sensitivity issues. To address these issues, this study constructs the MCSEAM module based on multi-branch concurrent convolutional enhancement (Figure 7). This module constructs a four-branch heterogeneous convolutional topology based on the classical SEAM. The outputs of each branch are connected to the dynamic dimension calibration unit after channel splicing, and the feature channels are compressed and mapped from 256 dimensions to 64 dimensions using a learnable 1 × 1 convolution kernel, which ensures the sparsity of feature selection and removes redundant information through L1 regularization constraints to improve the feature quality and effectiveness. A two-layer fully connected network is then used to fuse the information between the channels to obtain the weights Y. In this way, the model is able to learn the relationships between occluded and non-occluded targets and compensate for the above loss in occluded scenarios. The dynamic dimensional calibration unit and the cross-scale feature reorganization mechanism further optimize the feature processing and fusion process and improve the detection performance of the model. This process can be expressed using the following formula:

Y = \exp(ε(\mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(X)))))

(1)

where ε(·) is the Sigmoid activation function. The output learned through the fully connected layers is processed by an exponential function (exp), which extends the range of values from [0, 1] to [1, e]. This operation provides a monotonic mapping that makes the model more tolerant to occlusion-induced errors in the position information. Finally, the output Y of the module is expanded to the same shape as the input feature X and multiplied with the original feature to obtain the attention-weighted output W. The formula is expressed as

W = X \cdot Y.\mathrm{expand\_as}(X)

(2)

where the expand_as operation indicates the expansion of Y to the same shape as X. The MCSEAM module improves detection in occluded scenarios by emphasizing critical features and suppressing irrelevant background noise.
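As a minimal sketch of Equations (1) and (2), the following NumPy code computes the channel weights Y and the reweighted feature W for a single feature map. Several details are assumptions rather than statements from the text: the input to the first linear layer is taken to be the global average of each channel, the hidden layer uses a reduction to C/4 units, and the weights are random placeholders. The function name `seam_reweight` is hypothetical.

```python
import numpy as np


def seam_reweight(x, w1, b1, w2, b2):
    """Apply Y = exp(sigmoid(Linear(ReLU(Linear(.))))) and W = X * Y.expand_as(X).

    x: feature map of shape (C, H, W).
    w1, b1, w2, b2: parameters of the two fully connected layers.
    """
    pooled = x.mean(axis=(1, 2))                # (C,), assumed channel descriptor
    hidden = np.maximum(w1 @ pooled + b1, 0.0)  # Linear + ReLU
    z = w2 @ hidden + b2                        # Linear
    y = np.exp(1.0 / (1.0 + np.exp(-z)))        # exp(sigmoid(z)), between 1 and e
    y_expanded = np.broadcast_to(y[:, None, None], x.shape)  # expand_as(X)
    return x * y_expanded, y


rng = np.random.default_rng(0)
channels = 8
x = rng.normal(size=(channels, 4, 4))
w1 = rng.normal(size=(channels // 4, channels))
b1 = np.zeros(channels // 4)
w2 = rng.normal(size=(channels, channels // 4))
b2 = np.zeros(channels)

w, y = seam_reweight(x, w1, b1, w2, b2)
print(w.shape)  # (8, 4, 4)
```

Since every entry of Y is at least 1, the reweighting never attenuates a channel below its original response; it only amplifies channels the network considers informative.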

3.2.3. Sub-Pixel Dynamic Receptive Field Modules

Receptive field attention (RFA) not only focuses on the spatial features of the receptive field but also provides effective attention weights for large convolution kernels [27]. The receptive field attention convolution (RFAConv) derived from RFA is a new approach that replaces the standard convolution operation. Traditional interpolation methods, such as bilinear or nearest-neighbor interpolation, are simple and straightforward but tend to lose image information during upsampling, resulting in reconstructed images with blurred details and poorly defined edges, and they are computationally inefficient when processing high-resolution images.

In this study, we propose a feature reconstruction method that integrates sub-pixel convolution with dynamic receptive field attention (Figure 8), which rewrites the upsampling process of traditional convolutional neural networks by establishing a differentiable geometric adaptive mechanism. The core innovation of the method is a dual-path feature interaction architecture in which a parameterized sub-pixel convolutional layer replaces the traditional bilinear interpolation in the feature resolution enhancement path, and detail-preserving upsampling is achieved through the combined effect of channel expansion and spatial rearrangement. Given an input feature map X \in R^{H \times W \times C}, its mathematical representation is

\phi_{PS}(X) = P(\mathrm{Conv}_{1 \times 1}^{C \to r^{2} C}(X))

(3)

R_{FA} = \mathrm{Softmax}(g^{1 \times 1}(\mathrm{AvgPool}(X))) \times \mathrm{ReLU}(\mathrm{Norm}(g^{k \times k}(X)))

(4)

where P(·) denotes a periodic rearrangement operation based on a checkerboard pattern, r is a spatial scaling factor, g^{i \times i} denotes a grouped convolution of size i \times i, k denotes the size of the convolution kernel and Norm denotes normalization. The process learns the non-linear mapping relations between channels via a trainable 1 × 1 convolution kernel and subsequently reorganizes the expanded r^{2} C channels into the high-resolution spatial domain in a phase-interleaved geometric pattern to form the output feature X \in R^{r H \times r W \times C}.
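The sub-pixel path of Equation (3) can be sketched in NumPy as a 1 × 1 convolution that expands C channels to r²C, followed by the periodic rearrangement P(·). This is an illustrative sketch, not the RFAMPS implementation: the function names are hypothetical, the weights are random, the layout is channel-first (C, H, W) rather than the H × W × C notation of the text, and the channel ordering follows the convention of PyTorch's `PixelShuffle`, which is an assumption about the authors' code.

```python
import numpy as np


def conv1x1(x, weight):
    """Pointwise convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def pixel_shuffle(x, r):
    """Rearrange (C * r * r, H, W) into (C, H * r, W * r)."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)    # split channels into (c, row phase, col phase)
    x = x.transpose(0, 3, 1, 4, 2)  # reorder to (c, h, row phase, w, col phase)
    return x.reshape(c, h * r, w * r)


def phi_ps(x, weight, r):
    """Equation (3): expand channels from C to r^2 * C, then rearrange spatially."""
    return pixel_shuffle(conv1x1(x, weight), r)


rng = np.random.default_rng(0)
C, H, W, r = 2, 3, 4, 2
x = rng.normal(size=(C, H, W))
weight = rng.normal(size=(C * r * r, C))  # placeholder for the trainable 1 x 1 kernel
out = phi_ps(x, weight, r)
print(out.shape)  # (2, 6, 8)
```

Each output pixel at position (h·r + i, w·r + j) is read from channel phase (i, j) of the expanded tensor at location (h, w), so no interpolation is involved and all upsampled values come from learned channel responses.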

3.3. Model Training and Testing

In this study, the hardware comprised 16 GB of memory, an NVIDIA GeForce RTX 3070 Ti GPU and an Intel Core i7-12700H CPU (2.3 GHz), and the operating system was Windows 10. The network environment was built in PyCharm (V2024.2.4) using the deep learning framework PyTorch 2.0.0 and Python 3.8. For training, all images were uniformly resized to 640 × 640. The batch size during training was set to 16, and the number of training epochs was set to 300. The AdamW optimizer was used, with an initial learning rate of lr0 = 0.01 and an intersection over union (IoU) threshold of 0.5. To ensure the objectivity of the comparative experiments, all comparison models, including YOLOv5s, YOLOv7, YOLOv8s and Faster R-CNN, were trained and tested under identical experimental conditions, using the same number of training epochs, input image dimensions and batch size. Each model was initialized using its official pretrained weights.
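The reported training settings can be gathered into a small configuration sketch. The dictionary keys are illustrative names chosen here (only `lr0` appears in the text), and the step count assumes that the last partial batch of each epoch is kept, which the text does not specify.

```python
# Hyperparameters as reported in the text; key names are illustrative.
TRAIN_CONFIG = {
    "image_size": 640,
    "batch_size": 16,
    "epochs": 300,
    "optimizer": "AdamW",
    "lr0": 0.01,
    "iou_threshold": 0.5,
    "pretrained": True,
}


def iterations_per_epoch(num_train_images, batch_size):
    """Number of optimizer steps per epoch, keeping the last partial batch."""
    return -(-num_train_images // batch_size)  # ceiling division


steps = iterations_per_epoch(10445, TRAIN_CONFIG["batch_size"])
print(steps)                           # 653
print(steps * TRAIN_CONFIG["epochs"])  # 195900
```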

In order to evaluate the model objectively, the mean average precision (mAP), precision rate (P), recall rate (R) and floating-point computation (GFLOPs) are used as the evaluation metrics:

\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\%

(5)

\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\%

(6)

AP = \int_{0}^{1} P(R) \, dR

(7)

mAP = \frac{1}{M} \sum_{m=1}^{M} AP_{m}

(8)

where P is the proportion of the model's detections that are correct, i.e., the number of kiwifruit correctly detected by the model as a percentage of the total number of kiwifruit detected. R is the proportion of actual positive samples that are correctly detected, i.e., the number of kiwifruit correctly detected by the model as a percentage of the total number of kiwifruit in the dataset. TP is the number of actual sample objects in the dataset that are correctly detected, FP is the number of sample objects that are incorrectly detected by the model, and FN is the number of samples that are missed by the model. AP is the area under the precision-recall curve P(R) for one class, and M is the number of classes over which AP is averaged.
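Equations (5)–(8) can be illustrated with a minimal pure-Python sketch. The function names and the toy numbers are illustrative, not results from this study. Two simplifications are assumed: values are returned as fractions rather than percentages, and AP is integrated with the plain trapezoidal rule, whereas common detection toolkits usually apply an interpolated precision envelope first. The `box_iou` helper shows how a 0.5 IoU threshold would decide whether a predicted box counts as a TP.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Equations (5) and (6), returned as fractions."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Equation (7): trapezoidal area under P(R); recalls sorted ascending."""
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


def mean_average_precision(aps):
    """Equation (8): mean of the per-class AP values."""
    return sum(aps) / len(aps)


# A prediction matches a ground-truth box only if IoU reaches the threshold.
is_true_positive = box_iou((0, 0, 2, 2), (1, 1, 3, 3)) >= 0.5
print(is_true_positive)                # False
print(precision_recall(80, 20, 20))    # (0.8, 0.8)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])
print(round(ap, 3))                    # 0.8
```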

4. Results and Analyses

4.1. Comparison of Different Backbone Networks

The edges of sunburn lesions on kiwifruit are blurred and lack contrast against sunburned leaves; under complex lighting conditions, this can easily lead to confusion between features, causing the model to misclassify sunburned leaves as damaged fruit and generate false positives. Furthermore, as lesions formed in the early stages of sunburn are typically small targets measuring less than a centimeter, signal attenuation is severe during feature extraction, which leads to false negatives in the model. This makes accurate inspection strongly dependent on the model’s ability to jointly understand large-scale contextual relationships and local details. In this study, the LSKNet module is encapsulated with C3k2 in the backbone of YOLOv11s into a single module, C3k2_LSK. The spatial selection mechanism in the feature extraction module effectively addresses the differences in background information required for different targets.

In this study, the proposed C3k2_LSK backbone network is compared with three current mainstream backbone networks: Swin Transformer [28], EfficientNet V2 [29] and ConvNeXt V2 [30]. The performance of the different backbone networks is shown in Table 1, from which it can be seen that the Swin Transformer network has the highest accuracy but also the largest number of floating-point operations. C3k2_LSK strikes a balance between accuracy and floating-point computation, with a computational cost of 14.181 GFLOPs. There is little difference in the performance of the other two backbone networks. Overall, the C3k2_LSK module achieves higher scores on the metrics other than mAP, further validating the advantages of its dynamic scaling capabilities and context aggregation techniques, while its spatially selective mechanism blocks the propagation of background noise to deeper layers. The detection results of the different backbone networks are shown in Figure 9, where red indicates missed detection and yellow indicates incorrect detection; the C3k2_LSK backbone network has the lowest rates of errors and misses.

4.2. Comparison of Different Attention Mechanism Modules

In this study, YOLOv11s is used as the baseline network to compare the MCSEAM attention mechanism with the CBAM [31], SE [32] and SEAM [33] attention mechanisms. Table 2 gives the performance comparison of the various attention mechanism modules added to the neck layer of YOLOv11s. The results show that all metrics of the MCSEAM module are significantly improved compared to the CBAM, SE and SEAM attention mechanisms. The mAP of the MCSEAM attention mechanism is improved by 3.55%, 1.62% and 0.87%, respectively. However, it also increases the computational cost of the model: adding the four-branch convolution on top of the SEAM module increases the cost by 4.9 GFLOPs. The computational costs of the CBAM module and SE module are 20.823 GFLOPs and 20.816 GFLOPs, respectively. In terms of recall, MCSEAM achieves the highest value of 0.794, an improvement of 6.15%, 4.06% and 2.85% over CBAM, SE and SEAM, respectively. A graphical comparison of the detection results of the different attention mechanism modules is shown in Figure 10. Red indicates missed detection and yellow indicates misdetection. As can be seen from Figure 10, the MCSEAM model has the lowest misdetection and missed detection rates.

4.3. Comparison of Different Convolutional Mechanism Modules

To address the small targets and complex backgrounds in kiwifruit sunburn detection, this study uses YOLOv11s as the baseline network to compare the RFAMPS small-target detection module with GSConv [34], SPDConv [15] and RFAConv [35]. Table 3 gives the performance comparison of the various small-target feature fusion modules added to the neck layer of YOLOv11s. From Table 3, it can be seen that the RFAMPS feature fusion mechanism outperforms GSConv, SPDConv and RFAConv in terms of precision, recall and mAP. Specifically, the precision, recall and mAP are improved by 2.16%, 2.56% and 1.48%, respectively, compared to the RFAConv fusion mechanism, which suggests that sub-pixel convolution can enhance the feature expression ability of the RFA module. Compared with GSConv and SPDConv, the improvement in mAP is 3.78% and 2.36%, respectively. A graphical comparison of the detection results of the different convolutional mechanism modules is shown in Figure 11, where red indicates missed detection and yellow indicates misdetection; the RFAMPS model shows the lowest missed detection and misdetection rates.

In conclusion, these feature fusion mechanisms are effective in small-target detection; however, they do not take into account the fine features of sunburned kiwifruit. Although the introduction of sub-pixel convolution increases the computational load of the RFAMPS module, when processing images containing complex textures and fine structures, the method adaptively adjusts the distribution of convolution kernels according to the irregularity of lesion edges. By utilizing geometric phase correlations between channels to reconstruct feature maps, it accurately reconstructs the topological details of lesions in the high-resolution spatial domain, thereby enabling subsequent analysis and processing based on these feature maps to yield more accurate results.

4.4. Ablation Experiment

To verify the effectiveness of the proposed improvements to YOLOv11s, as well as the actual effect of the combined scheme in detecting sunburn in kiwifruit, an ablation experiment was conducted on the YOLO-ST-OD model. The experiment sequentially added the different improvements to the model to quantify the contribution of each module in terms of detection accuracy and model computation. The results of the ablation experiment are shown in Table 4. After replacing the original C3k2 module with C3k2_LSK, the mAP is reduced from 0.792 to 0.790; the precision and recall are improved by 0.98% and 0.73%, respectively; and the computational cost is reduced by 7.44 GFLOPs. This shows that the LSK module, in addition to achieving intelligent switching between global context awareness and local texture focusing, can also reduce the model computation. When the MCSEAM module is introduced into the original model, the mAP, precision and recall are improved to 0.816, 0.846 and 0.794, respectively, indicating that the MCSEAM module, with its four-branch heterogeneous convolutional topology, effectively addresses the feature masking and scale sensitivity problems. When only RFAMPS is substituted for the Conv convolution, the mAP reaches 0.823, demonstrating the significant effect of sub-pixel convolution combined with the RFA strategy in dealing with small targets and complex background interference. MCSEAM and RFAMPS significantly improve the model performance but both increase the computational complexity, whereas C3k2_LSK substantially reduces it. Weighed against the overall performance gains, the increase remains within acceptable limits. When all three modules are integrated, the mAP improves to 0.837, and the precision and recall are 0.862 and 0.818, respectively. 
C3k2_LSK is responsible for locating potential small targets within the global information; MCSEAM addresses the issue of feature sparsity in high-density scenes by enhancing the responses of non-occluded regions within the local area identified by C3k2_LSK; and RFAMPS utilizes the fine-grained features extracted by the first two modules to achieve high-fidelity reconstruction. Although the ablation experiment validated the effectiveness of each module, certain performance limitations were still observed under specific extreme scenarios. During the validation of the MCSEAM module, it was found that, when two fruits overlapped significantly (without leaf occlusion), the multi-branch enhancement mechanism sometimes struggled to accurately identify the boundary line due to the extremely high semantic similarity between the occluding object and the target, resulting in a decline in local localization accuracy.
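The relative gains implied by Table 4 can be recomputed with a few lines of Python. The sketch below is a reader-side check rather than part of the authors' pipeline; the dictionary layout and the helper name `relative_change` are assumptions, while the numbers are copied from Table 4.

```python
# Rows of Table 4: modules added to YOLOv11s -> (P, R, mAP, GFLOPs).
ABLATION = {
    (): (0.813, 0.752, 0.792, 21.621),
    ("LSK",): (0.821, 0.761, 0.790, 14.181),
    ("MCSEAM",): (0.846, 0.794, 0.816, 26.452),
    ("RFAMPS",): (0.853, 0.803, 0.823, 23.773),
    ("LSK", "MCSEAM"): (0.857, 0.811, 0.826, 22.615),
    ("LSK", "RFAMPS"): (0.837, 0.785, 0.812, 21.854),
    ("MCSEAM", "RFAMPS"): (0.859, 0.815, 0.831, 28.387),
    ("LSK", "MCSEAM", "RFAMPS"): (0.862, 0.818, 0.837, 24.714),
}
METRICS = ("P", "R", "mAP", "GFLOPs")


def relative_change(config):
    """Percentage change of each metric relative to the YOLOv11s baseline."""
    baseline = ABLATION[()]
    row = ABLATION[config]
    return {
        name: round((new - old) / old * 100, 2)
        for name, new, old in zip(METRICS, row, baseline)
    }


full = relative_change(("LSK", "MCSEAM", "RFAMPS"))
for name in ("P", "R", "mAP"):
    print(f"{name}: +{full[name]}%")
# P: +6.03%
# R: +8.78%
# mAP: +5.68%
```

The three printed values agree with the improvements of 6.03%, 8.78% and 5.68% quoted in the abstract, which are therefore relative (not absolute) gains in precision, recall and mAP over the baseline row of Table 4.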

4.5. Comparison with Other Detection Models

To verify the performance advantages of the proposed model for small targets and occlusion, the improved model is compared with YOLOv5s, YOLOv7, YOLOv8s, YOLOv9, YOLOv10s and Faster R-CNN (Figure 12). The results show that the YOLO-ST-OD model outperforms the other models in all metrics except the number of floating-point operations. Among them, the Faster R-CNN model performs the worst, with mAP, precision and recall of 0.637, 0.681 and 0.615, respectively, and has the largest number of floating-point operations. The mAP values for YOLOv5s, YOLOv7, YOLOv8s, YOLOv9 and YOLOv10s are 0.787, 0.763, 0.783, 0.801 and 0.812, respectively. In conclusion, the YOLO-ST-OD model is well suited for kiwifruit sunburn monitoring.
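To make the size of the mAP gap concrete, the values quoted above can be ranked and compared against YOLO-ST-OD. This is a small illustrative sketch; the variable names are assumptions, and the YOLO-ST-OD value of 0.837 is taken from Table 4.

```python
# mAP values as reported in the text and in Table 4.
MAP_SCORES = {
    "Faster R-CNN": 0.637,
    "YOLOv5s": 0.787,
    "YOLOv7": 0.763,
    "YOLOv8s": 0.783,
    "YOLOv9": 0.801,
    "YOLOv10s": 0.812,
    "YOLO-ST-OD": 0.837,
}

# Sort from best to worst mAP.
ranking = sorted(MAP_SCORES, key=MAP_SCORES.get, reverse=True)

# Absolute mAP margin of YOLO-ST-OD over every other model.
margins = {
    name: round(MAP_SCORES["YOLO-ST-OD"] - score, 3)
    for name, score in MAP_SCORES.items()
    if name != "YOLO-ST-OD"
}

for name in ranking[1:]:
    print(f"{name}: YOLO-ST-OD leads by {margins[name]:.3f} mAP")
```

The margin ranges from 0.025 over the strongest competitor (YOLOv10s) to 0.200 over Faster R-CNN.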

To demonstrate more intuitively the advantages of the proposed model in network structure and feature capture, the output feature maps of each model were visualized as heatmaps using the Grad-CAM method; the results are shown in Figure 13. In scene 1, where the fruits are numerous but not concentrated, YOLO-ST-OD accurately identifies kiwifruits, whereas the other models show limited performance, especially Faster R-CNN and YOLOv7. In scene 2, where the fruits are few and not concentrated, the YOLO-ST-OD model not only classifies the targets accurately but also identifies occluded targets better; in comparison, the other models identify occluded targets poorly. In scene 3, where the fruits are numerous and concentrated, all models except Faster R-CNN recognize the fruits well, but the YOLO-ST-OD model separates individual fruits more clearly. In scene 4, where the fruits are few and concentrated, YOLOv5s, YOLOv8s, YOLOv9, YOLOv10s and YOLO-ST-OD all recognize them well, while the other models miss detections. In scene 5, where the number of fruits is particularly low, the YOLO-ST-OD model also shows good recognition accuracy.

Scenes 1 to 5 reflected the various morphologies of kiwifruit present in the study area, and the above results show that the YOLO-ST-OD model achieves excellent performance in kiwifruit recognition. However, it can also be noted that its region of attention tends to incorporate a small portion of the surrounding non-sample environment. This phenomenon is not a sign of overfitting but rather an indication that the model is learning to utilize local contextual features to aid in the identification of small-target spots. This strategy helps to achieve more robust feature decoupling in the highly noisy environment of an orchard.
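For readers unfamiliar with how the heatmaps discussed above are obtained, the core Grad-CAM computation can be sketched in pure Python. The sketch assumes that the activations of one convolutional layer and the gradients of the target score with respect to them are already available as nested lists; the function name `grad_cam` and the toy values are hypothetical and not taken from the authors' code.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from (K, H, W) activations and matching gradients.

    Each channel is weighted by the spatial mean of its gradient, the
    weighted maps are summed, negative values are clipped (ReLU) and the
    result is scaled to [0, 1].
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of the gradients over each map.
    weights = [sum(sum(row) for row in grad) / (height * width)
               for grad in gradients]
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for h in range(height):
            for w in range(width):
                heatmap[h][w] += weight * fmap[h][w]
    # ReLU keeps only regions that raise the target score.
    heatmap = [[max(v, 0.0) for v in row] for row in heatmap]
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap


# Hypothetical toy layer with two 2x2 feature maps.
acts = [[[1, 0], [0, 2]], [[0, 3], [1, 0]]]
grads = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
print(grad_cam(acts, grads))  # [[0.5, 0.0], [0.0, 1.0]]
```

Because the heatmap is a weighted sum of whole feature maps at the resolution of the chosen layer, highlighted regions naturally extend slightly beyond the object itself, which is consistent with the observation that the attention region includes some surrounding context.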

5. Discussion

The results of this study show that the YOLO-ST-OD model addresses the two core challenges in kiwifruit sunscald detection, namely the precise identification of small targets and robustness against complex occlusions, by combining three mechanisms: spatially dynamic focusing and denoising through the large selective kernel of LSKNet, multi-branch feature compensation through MCSEAM, and sub-pixel detail reconstruction through RFAMPS. The model thereby addresses the insufficient perception capability of existing general-purpose models in trellis cultivation environments. It suppresses feature confusion in complex backgrounds through dynamic receptive field selection; it mitigates occlusion caused by dense canopies by leveraging multi-branch heterogeneous convolutions to exploit channel redundancy; and, through sub-pixel reconstruction, it preserves edge sharpness for sub-centimeter-scale lesions even under multi-scale attenuation. The robustness of the model is therefore analyzed and discussed below across scenarios involving sunburned kiwifruit with varying occlusion, fruit density and flight height.

5.1. Detection of Sunburned Fruits Under Different Shading Backgrounds

When sunscald occurs, leaf temperature rises because of prolonged exposure to intense sunlight, and brown or black spots may appear on the leaf surface; in severe cases, leaf tissue becomes necrotic, wilts and falls off [36]. Inconsistent exposure to sunlight and differences in leaf abscission result in varying degrees of shading, so the accurate detection of sunburned fruit under different shading levels is a major challenge, which this study addresses. To evaluate the ability of the YOLO-ST-OD model to detect fruit under different levels of occlusion, a separate test set, distinct from the training data, was used.

The visualization results are shown in Figure 14. Figure 14a shows that both models detect sunburned and healthy kiwifruits accurately under low shading at the late stage of a sunburn disaster. Figure 14b shows that the YOLO-ST-OD model detects kiwifruits accurately under moderate shading by new branches and leaf cover. Figure 14c shows the superior performance of the YOLO-ST-OD model in the highly occluded case; its advantage is smaller under moderate and light occlusion. Integrating the LSKNet module allows the network to adjust its receptive field according to the target class and to supply the long-range context that different types of targets require, which enhances its understanding of the global information in the whole kiwifruit image. The MCSEAM module uses a four-branch heterogeneous convolutional topology to capture residual signals from occluded targets simultaneously across multiple receptive field scales. When leaves obscure the main body of the fruit, the module extracts topological features from the non-occluded edges via its parallel branches, thereby amplifying these faint yet critical pathological clues.

5.2. Detection of Sunburned Fruit at Different Densities

In real orchard scenarios, in addition to intergroup shading by leaves, there is intragroup shading between fruits, which depends on fruit density. Losses caused by sunburn disasters in kiwifruit are usually related to the fruit density at the sun-exposed site, and a kiwifruit detection model can help insurance companies to assess losses by counting the sunburned fruit at the time of the disaster. A model that detects sunburned fruit at different densities and counts them accurately can therefore help to establish the extent of the disaster in the orchard quickly. In this study, datasets with different densities of sunburned fruit were collected to evaluate the ability of the YOLO-ST-OD model to detect sunburned fruit at different densities.

The visualization results are shown in Figure 15, which compares detection before and after improving the YOLOv11s network. Figure 15a represents the case of sparse sunburned fruits, where the original model handles within-group shading less effectively and missed detections occur. Figure 15b shows that, as the density increases, the number of missed detections by the original model increases. In Figure 15c, YOLO-ST-OD also begins to miss detections, indicating that it too has certain deficiencies when handling intergroup occlusion in the high-density scenario. However, it still improves considerably on the original model, indicating that the feature loss in the occluded region is compensated for by enhancing the response of the non-occluded region, mainly in the intergroup occlusion scenario. Unlike leaf occlusion, the overlapping of fruits involves objects with a very high degree of semantic similarity. Because the occluding objects and the targets themselves are highly similar in color, texture and geometric contour, it is difficult for neural networks to decouple boundaries through spatial contrast during the feature extraction stage.

5.3. Sunburned Fruit Detection Under Different Shooting Altitudes

The choice of flight altitude has a significant impact on the recognition accuracy for sunburn features, the detection efficiency and the operational cost. Different shooting altitudes correspond to different ground resolutions, and the color distribution in the image changes significantly. Low-altitude flights obtain high-resolution images in which sunburn symptoms are easier to identify, but the coverage area is small, the data volume is large, the processing time is long, and the images may be affected by shading from the fruit trees. High-altitude flights have the opposite characteristics: wide coverage but low resolution, with the risk of missing fine features. In this study, images of sunburned fruit were collected at different flight heights to evaluate the ability of the YOLO-ST-OD model to detect sunburned fruit at each height and to determine the optimal flight altitude.

The visualization results are shown in Figure 16. Figure 16a shows detection on images taken at a height of 3 m, where the YOLO-ST-OD model shows no notable advantage. For images taken at 4 m, the original model produces a large number of misdetections (marked in yellow). With the addition of the RFAMPS module, the improved model YOLO-ST-OD not only performs well under occlusion but also has a lower misdetection rate, indicating that the dynamic receptive field of the RFAMPS module handles small targets in the image more effectively. At 5 m, detection is poorer because occlusion within the view is more severe. The structure of the kiwifruit trellis means that shading changes with the viewing angle. At 3 m, the viewing angle is too close to vertical, resulting in significant mutual shading between the fruits, whereas at 4 m the moderate depth of field improves the geometric relationships in the scene, expanding the effective field of view and giving the MCSEAM module the physical space it needs to extract redundant features from the lateral edges of the fruit, thereby compensating for the loss of vertical signals. In summary, a 4 m flight altitude gives the best detection results, at a cost much lower than that of manual inspection.
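The trade-off between resolution and coverage at 3 m, 4 m and 5 m can be made concrete with the standard ground sampling distance (GSD) relation for a nadir-pointing camera. The camera parameters below (13.2 mm sensor width, 8.8 mm focal length, 5472 px image width) are hypothetical values chosen only for illustration, since the camera specification is not given in this section; the sketch also assumes that the stated height is the distance from the camera to the imaged surface.

```python
def ground_sampling_distance_cm(altitude_m, sensor_width_mm,
                                focal_length_mm, image_width_px):
    """Ground distance covered by one pixel, in centimeters per pixel."""
    return (sensor_width_mm * altitude_m * 100) / (focal_length_mm * image_width_px)


def footprint_width_m(altitude_m, sensor_width_mm, focal_length_mm):
    """Width of the ground strip covered by a single image, in meters."""
    return sensor_width_mm * altitude_m / focal_length_mm


# Hypothetical camera parameters (not taken from the paper).
SENSOR_WIDTH_MM = 13.2
FOCAL_LENGTH_MM = 8.8
IMAGE_WIDTH_PX = 5472

for altitude in (3, 4, 5):
    gsd = ground_sampling_distance_cm(
        altitude, SENSOR_WIDTH_MM, FOCAL_LENGTH_MM, IMAGE_WIDTH_PX)
    width = footprint_width_m(altitude, SENSOR_WIDTH_MM, FOCAL_LENGTH_MM)
    print(f"{altitude} m: {gsd:.3f} cm/px, footprint {width:.1f} m wide")
```

Both quantities grow linearly with altitude, so moving from 3 m to 5 m widens the footprint by two thirds while each lesion is sampled by proportionally fewer pixels, which is the tension the altitude comparison has to resolve.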

5.4. Future Work

Deploying high-performance, low-latency object detectors on edge devices is receiving increasing attention [37]. Despite the rapid development of intelligent aerial vehicles, there is a lack of lightweight detection models suitable for embedded deployment, as well as of real-time detection and analysis capabilities. In the next stage, we will investigate how to build a high-precision sunburned kiwifruit detection model under limited computing power. In addition, owing to equipment limitations, this study acquired data only with visible-light imaging devices; in the future, we intend to add thermal infrared data to enhance the detection performance of the model.

6. Conclusions

This study addressed the technical gap in accurately assessing sunscald damage in kiwifruit orchards within complex environments by developing and validating the YOLO-ST-OD model. The mechanism analysis indicates that the synergistic interaction between LSKNet’s dynamic selective filtering of background noise and RFAMPS’s sub-pixel reconstruction mechanism effectively prevents the smoothing and blurring of high-frequency spatial features of sub-centimeter-scale lesions during propagation through the deep neural network, thereby improving the precision for small targets to 0.862. Concurrently, the multi-branch heterogeneous feature enhancement behavior of MCSEAM confirms the feasibility of utilizing channel redundancy to compensate for missing physical signals. Although the model demonstrates strong robustness under most operating conditions, it must be acknowledged that, within ultra-high-density fruit clusters, the high similarity of semantic features between targets and the lack of edge space still cause certain limitations, such as boundary confusion and excessive focus on non-target areas. The mAP of 0.837, achieved at a flight altitude of 4 m with a low computational cost (24.714 GFLOPs), directly supports the operational objective of automated damage assessment for agricultural insurance, providing empirical support for the large-scale, low-cost intelligent assessment of disaster-affected areas.

Author Contributions

Z.N.: conceptualization, data curation, formal analysis, investigation, methodology, validation, visualization, writing—original draft, writing—review and editing. Y.S.: conceptualization, project administration, supervision, writing—review and editing. N.J.: resources, validation. S.X.: writing—review and editing. J.P.: writing—review and editing. N.S.: writing—review and editing. D.H.: funding acquisition, writing—original draft, writing—review and editing. D.Z.: funding acquisition, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the Key Research and Development Projects of Shaanxi Province (Grant No. 2024NC-ZDCYL-05-03), the Fundamental Research Program of Shanxi Province (Grant No. 202203021221231) and the Science and Technology Projects of Yangling Demonstration Zone (Grant No. 2025LHT-05).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Gambetta, J.M.; Holzapfel, B.P.; Stoll, M.; Friedel, M. Sunburn in Grapes: A Review. Front. Plant Sci. 2021, 11, 604691.
2. Wang, F.; Lv, C.; Dong, L.; Li, X.; Guo, P.; Zhao, B. Development of Effective Model for Non-Destructive Detection of Defective Kiwifruit Based on Graded Lines. Front. Plant Sci. 2023, 14, 1170221.
3. Vyas, S.; Dalhaus, T.; Kropff, M.; Aggarwal, P.; Meuwissen, M.P.M. Mapping Global Research on Agricultural Insurance. Environ. Res. Lett. 2021, 16, 103003.
4. Benami, E.; Jin, Z.; Carter, M.R.; Ghosh, A.; Hijmans, R.J.; Hobbs, A.; Kenduiywo, B.; Lobell, D.B. Uniting Remote Sensing, Crop Modelling and Economics for Agricultural Risk Management. Nat. Rev. Earth Environ. 2021, 2, 140–159.
5. Bao, W.; Zhu, Z.; Hu, G.; Zhou, X.; Zhang, D.; Yang, X. UAV Remote Sensing Detection of Tea Leaf Blight Based on DDMA-YOLO. Comput. Electron. Agric. 2023, 205, 107637.
6. Chen, H.; Chen, A.; Xu, L.; Xie, H.; Qiao, H.; Lin, Q.; Cai, K. A Deep Learning CNN Architecture Applied in Smart Near-Infrared Analysis of Water Pollution for Agricultural Irrigation Resources. Agric. Water Manag. 2020, 240, 106303.
7. Zeng, T.; Li, S.; Song, Q.; Zhong, F.; Wei, X. Lightweight Tomato Real-Time Detection Method Based on Improved YOLO and Mobile Deployment. Comput. Electron. Agric. 2023, 205, 107625.
8. Liu, Y.; Ren, H.; Zhang, Z.; Men, F.; Zhang, P.; Wu, D.; Feng, R. Research on Multi-Cluster Green Persimmon Detection Method Based on Improved Faster RCNN. Front. Plant Sci. 2023, 14, 1177114.
9. Wang, T.; Zhao, L.; Li, B.; Liu, X.; Xu, W.; Li, J. Recognition and Counting of Typical Apple Pests Based on Deep Learning. Ecol. Inform. 2022, 68, 101556.
10. Wang, D.; He, D. Fusion of Mask RCNN and Attention Mechanism for Instance Segmentation of Apples under Complex Background. Comput. Electron. Agric. 2022, 196, 106864.
11. Khan, S.; Tufail, M.; Khan, M.T.; Khan, Z.A.; Anwar, S. Deep Learning-Based Identification System of Weeds and Crops in Strawberry and Pea Fields for a Precision Agriculture Sprayer. Precis. Agric. 2021, 22, 1711–1727.
12. Meng, F.; Li, J.; Zhang, Y.; Qi, S.; Tang, Y. Transforming Unmanned Pineapple Picking with Spatio-Temporal Convolutional Neural Networks. Comput. Electron. Agric. 2023, 214, 108298.
13. Cheng, T.; Zhang, D.; Gu, C.; Zhou, X.-G.; Qiao, H.; Guo, W.; Niu, Z.; Xie, J.; Yang, X. YOLO-CG-HS: A Lightweight Spore Detection Method for Wheat Airborne Fungal Pathogens. Comput. Electron. Agric. 2024, 227, 109544.
14. Shang, Y.; Xu, X.; Jiao, Y.; Wang, Z.; Hua, Z.; Song, H. Using Lightweight Deep Learning Algorithm for Real-Time Detection of Apple Flowers in Natural Environments. Comput. Electron. Agric. 2023, 207, 107765.
15. Li, H.; Huang, J.; Gu, Z.; He, D.; Huang, J.; Wang, C. Positioning of Mango Picking Point Using an Improved YOLOv8 Architecture with Object Detection and Instance Segmentation. Biosyst. Eng. 2024, 247, 202–220.
16. Chen, Y.; Liu, Z.; Zhang, Y.; Wu, Y.; Chen, X.; Zhao, L. Deep Reinforcement Learning-Based Dynamic Resource Management for Mobile Edge Computing in Industrial Internet of Things. IEEE Trans. Ind. Inform. 2021, 17, 4925–4934.
17. Yang, B.; Guo, W.; Huang, X.; Du, R.; Liu, Z. A Portable, Low-Cost and Sensor-Based Detector on Sweetness and Firmness Grades of Kiwifruit. Comput. Electron. Agric. 2020, 179, 105831.
18. Pan, F.; Hu, M.; Duan, X.; Zhang, B.; Xiang, P.; Jia, L.; Zhao, X.; He, D. Enhancing Kiwifruit Flower Pollination Detection through Frequency Domain Feature Fusion: A Novel Approach to Agricultural Monitoring. Front. Plant Sci. 2024, 15, 1415884.
19. Yao, J.; Wang, Y.; Xiang, Y.; Yang, J.; Zhu, Y.; Li, X.; Li, S.; Zhang, J.; Gong, G. Two-Stage Detection Algorithm for Kiwifruit Leaf Diseases Based on Deep Learning. Plants 2022, 11, 768.
20. Zhang, H.; Wang, L.; Tian, T.; Yin, J. A Review of Unmanned Aerial Vehicle Low-Altitude Remote Sensing (UAV-LARS) Use in Agricultural Monitoring in China. Remote Sens. 2021, 13, 1221.
21. Boursianis, A.D.; Papadopoulou, M.S.; Diamantoulakis, P.; Liopa-Tsakalidi, A.; Barouchas, P.; Salahas, G.; Karagiannidis, G.; Wan, S.; Goudos, S.K. Internet of Things (IoT) and Agricultural Unmanned Aerial Vehicles (UAVs) in Smart Farming: A Comprehensive Review. Internet Things 2022, 18, 100187.
22. Liu, W.; Zhou, Z.; Xu, X.; Gu, Q.; Zou, S.; He, W.; Luo, X.; Huang, J.; Lin, J.; Jiang, R. Evaluation Method of Rowing Performance and Its Optimization for UAV-Based Shot Seeding Device on Rice Sowing. Comput. Electron. Agric. 2023, 207, 107718.
23. Sahoo, M.M.; Tarshish, R.; Tubul, Y.; Sabag, I.; Gadri, Y.; Morota, G.; Peleg, Z.; Alchanatis, V.; Herrmann, I. Multimodal Ensemble of UAV-Borne Hyperspectral, Thermal, and RGB Imagery to Identify Combined Nitrogen and Water Deficiencies in Field-Grown Sesame. ISPRS J. Photogramm. Remote Sens. 2025, 222, 33–53.
24. Liu, T.; Qi, Y.; Yang, F.; Yi, X.; Guo, S.; Wu, P.; Yuan, Q.; Xu, T. Early Detection of Rice Blast Using UAV Hyperspectral Imagery and Multi-Scale Integrator Selection Attention Transformer Network (MS-STNet). Comput. Electron. Agric. 2025, 231, 110007.
25. Li, Y.; Li, X.; Dai, Y.; Hou, Q.; Liu, L.; Liu, Y.; Cheng, M.-M.; Yang, J. LSKNet: A Foundation Lightweight Backbone for Remote Sensing. Int. J. Comput. Vis. 2025, 133, 1410–1431.
26. Yu, Z.; Huang, H.; Chen, W.; Su, Y.; Liu, Y.; Wang, X. YOLO-FaceV2: A Scale and Occlusion Aware Face Detector. Pattern Recognit. 2024, 155, 110714.
27. Zhang, X.; Liu, C.; Song, T.; Yang, D.; Ye, Y.; Li, K.; Song, Y. RFAConv: Receptive-Field Attention Convolution for Improving Convolutional Neural Networks. Pattern Recognit. 2026, 176, 113208.
28. Yang, S.; Wang, W.; Gao, S.; Deng, Z. Strawberry Ripeness Detection Based on YOLOv8 Algorithm Fused with LW-Swin Transformer. Comput. Electron. Agric. 2023, 215, 108360.
29. Javanmardi, S.; Ashtiani, S.-H.M. AI-Driven Deep Learning Framework for Shelf Life Prediction of Edible Mushrooms. Postharvest Biol. Technol. 2025, 222, 113396.
30. Chen, J.; Ji, C.; Zhang, J.; Feng, Q.; Li, Y.; Ma, B. A Method for Multi-Target Segmentation of Bud-Stage Apple Trees Based on Improved YOLOv8. Comput. Electron. Agric. 2024, 220, 108876.
31. Zhang, B.; Wang, Z.; Ye, C.; Zhang, H.; Lou, K.; Fu, W. Classification of Infection Grade for Anthracnose in Mango Leaves under Complex Background Based on CBAM-DBIRNet. Expert Syst. Appl. 2025, 260, 125343.
32. Ma, B.; Hua, Z.; Wen, Y.; Deng, H.; Zhao, Y.; Pu, L.; Song, H. Using an Improved Lightweight YOLOv8 Model for Real-Time Detection of Multi-Stage Apple Fruit in Complex Orchard Environments. Artif. Intell. Agric. 2024, 11, 70–82.
33. Yi, W.; Xia, S.; Kuzmin, S.; Gerasimov, I.; Cheng, X. YOLOv7-KDT: An Ensemble Model for Pomelo Counting in Complex Environment. Comput. Electron. Agric. 2024, 227, 109469.
34. Wang, J.; Qin, C.; Hou, B.; Yuan, Y.; Zhang, Y.; Feng, W. LCGSC-YOLO: A Lightweight Apple Leaf Diseases Detection Method Based on LCNet and GSConv Module under YOLO Framework. Front. Plant Sci. 2024, 15, 1398277.
35. Nguyen, D.-L.; Vo, X.-T.; Priadana, A.; Choi, J.; Jo, K.-H. An Efficient Detector for Automatic Tomato Classification Systems. IEEE Access 2025, 13, 14073–14082.
36. Spera, N.; Vita, L.I.; Civello, P.M.; Colavita, G.M. Antioxidant Response and Quality of Sunburn Beurré D’Anjou Pears (Pyrus communis L.). Plant Physiol. Biochem. 2023, 198, 107703.
37. Ju, Y.; Chen, Y.; Cao, Z.; Liu, L.; Pei, Q.; Xiao, M.; Ota, K.; Dong, M.; Leung, V.C.M. Joint Secure Offloading and Resource Allocation for Vehicular Edge Computing Network: A Multi-Agent Deep Reinforcement Learning Approach. IEEE Trans. Intell. Transp. Syst. 2023, 24, 5555–5569.

Figure 1. Study area and weather conditions.

Figure 2. Drone acquiring data at different altitudes.

Figure 3. ESRGAN Arch network diagram.

Figure 4. YOLO-ST-OD network framework.

Figure 5. C3k2_LSK module.

Figure 6. Large selective kernel network module.

Figure 7. Multiple-convolutional space and channel addition module.

Figure 8. Sub-pixel dynamic receptive field module.

Figure 9. Detection results for different backbone networks. The solid red circle represents a missed detection, while the solid yellow circle represents a misdetection.

Figure 10. Detection results for different attention mechanisms. The solid red circle represents a missed detection, while the solid yellow circle represents a misdetection.

Figure 11. Detection results for different convolution mechanisms. The solid red circle represents a missed detection, while the solid yellow circle represents a misdetection.

Figure 12. Comparison results of convolutional neural networks of different depths.

Figure 13. Schematic diagram of qualitative comparison of different network feature extraction models.

Figure 14. Model detection results for different occlusion cases. The solid red circle represents a missed detection, while the dotted circle represents a comparison.

Figure 15. Model detection results under different densities. The solid red circle represents a missed detection, the solid yellow circle represents a misdetection, and the dotted circle represents a comparison.

Figure 16. Model detection results for different height cases. The solid red circle represents a missed detection, the solid yellow circle represents a misdetection, and the dotted circle represents a comparison.

Table 1. Performance comparison of different backbone networks.

Backbone	Precision	Recall	mAP	GFLOPs
Swin Transformer	0.834	0.782	0.803	20.016
EfficientNet V2	0.813	0.751	0.781	19.313
ConvNeXt V2	0.816	0.757	0.786	15.842
C3k2_LSK	0.836	0.761	0.797	14.181

Table 2. Comparison results of different attention mechanisms.

Algorithm	Precision	Recall	mAP	GFLOPs
SE	0.829	0.763	0.803	20.816
CBAM	0.811	0.748	0.788	20.823
SEAM	0.831	0.772	0.809	21.537
MCSEAM	0.846	0.794	0.816	26.452

Table 3. Comparison results of different convolution mechanisms.

Algorithm	Precision	Recall	mAP	GFLOPs
SPDConv	0.830	0.766	0.804	21.076
GSConv	0.814	0.754	0.793	20.912
RFAConv	0.835	0.783	0.811	21.641
RFAMPS	0.853	0.803	0.823	23.773

Table 4. Ablation test results.

YOLOv11s	LSKNet	MCSEAM	RFAMPS	P	R	mAP	GFLOPs
✓	×	×	×	0.813	0.752	0.792	21.621
✓	✓	×	×	0.821	0.761	0.790	14.181
✓	×	✓	×	0.846	0.794	0.816	26.452
✓	×	×	✓	0.853	0.803	0.823	23.773
✓	✓	✓	×	0.857	0.811	0.826	22.615
✓	✓	×	✓	0.837	0.785	0.812	21.854
✓	×	✓	✓	0.859	0.815	0.831	28.387
✓	✓	✓	✓	0.862	0.818	0.837	24.714

