Article

A Computer Vision Model for Accurate Detection of Fresh Jujube Fruits and General Small Targets in Complex Agricultural Environments

1 College of Agricultural Engineering, Shanxi Agricultural University, Jinzhong 030801, China
2 School of Politics and Public Administration, Qufu Normal University, Rizhao 276826, China
* Author to whom correspondence should be addressed.
Horticulturae 2025, 11(11), 1380; https://doi.org/10.3390/horticulturae11111380
Submission received: 11 October 2025 / Revised: 8 November 2025 / Accepted: 13 November 2025 / Published: 16 November 2025
(This article belongs to the Section Fruit Production Systems)

Abstract

Accurate detection of fresh jujube fruits plays a vital role in precision agriculture, enabling reliable yield estimation and supporting automation tasks such as robotic harvesting. To address the challenges of detecting such small targets (≤32 × 32 pixels) in complex orchard environments, this study proposes JFST-DETR, an efficient and robust detection model based on the Real-Time DEtection TRansformer (RT-DETR). First, to address the insufficient feature representation of small jujube fruit targets, a novel module called the Global Awareness Adaptive Module (GAAM) is designed. Building on GAAM and the innovative Spatial Coding Module (SCM), a new Spatial Enhancement Pyramid Network (SEPN) is proposed. Through the spatial-depth transformation domain and global awareness adaptive processing units, SEPN captures fine-grained features of small targets, enhancing detection accuracy for small objects. Second, a Dynamic Sampling (DySample) operator is adopted, which optimizes feature-space details via dynamic offset calculation and a lightweight design, improving detection accuracy while reducing computational costs. Finally, to mitigate complex background interference caused by foliage occlusion and illumination variations, Pinwheel-Shaped Convolution (PSConv) is introduced. By using asymmetric padding and multi-directional convolution, PSConv enhances the robustness of feature extraction, ensuring reliable recognition in complex agricultural environments. Experimental results show that JFST-DETR achieves precision, recall, F1, mAP@50, and mAP@50:95 of 93.0%, 86.8%, 89.8%, 94.3%, and 75.2%, respectively. Compared with the baseline model, these metrics improve by 0.8%, 3.7%, 2.4%, 2.6%, and 3.1%, respectively. Cross-dataset evaluations further confirm the model's strong generalizability, demonstrating its potential as a practical solution for small-target detection in intelligent horticulture.

1. Introduction

Jujube (Ziziphus jujuba Mill.) is a nutrient-dense fruit rich in vitamin C, minerals, and antioxidants, which help enhance immunity and promote digestion. It also plays an important role in ecological restoration and sustainable agricultural development, making it worthy of further research and promotion [1]. However, in complex agricultural environments, the harvesting of fresh jujube fruits faces challenges in accurately identifying small-target jujube fruits, primarily due to dense foliage, fluctuating lighting conditions, and overlapping fruits [2]. Traditional manual detection is inefficient, costly, and prone to misjudging the optimal harvest timing, thereby affecting fruit quality and yield. Therefore, developing efficient and robust detection technologies for fresh jujube fruits is crucial not only for harvesting management and resource optimization but also for providing references for other agricultural small-target detection tasks.
With the development of artificial intelligence and computer vision technologies, deep learning has been widely researched and applied in the agricultural sector. Among these techniques, Convolutional Neural Networks (CNNs) have become a primary choice for agricultural detection owing to their high precision and fast inference [3]. For instance, Chuang et al. [4] proposed a YOLOv5n-based method for asparagus length classification and bruise detection using two cascaded lightweight models (YOLOv5n-SGS and YOLOv5n-GES). Fan et al. [5] developed a lightweight YOLO-WDNet for weed detection in complex farmlands, balancing precision and efficiency via ShuffleNet v2, BiFPN, and optimized loss functions. Bai et al. [6] enhanced YOLOv7 with a Swin Transformer prediction head and a GS-ELAN module to boost spatial feature extraction, while Lv et al. [7] improved YOLOv5 (YOLO-N) with Coordinate Attention, a P2 small-target layer, and rotated bounding boxes for similar-color backgrounds. Notably, these CNN-based advancements often leverage RGB images, which have proven highly effective in agricultural organ characterization: RGB images combined with CNNs enable accurate detection of fruit shape, size, and maturity, and reliable deduction of crop organ greenness and structural traits [8,9,10,11,12].
However, traditional YOLO-series algorithms rely on predefined anchor boxes and suffer from low feature fusion efficiency, leading to missed detections of dense targets, insufficient multi-scale feature utilization, and time-consuming post-processing in complex agricultural scenarios. In contrast, the Transformer-based Real-Time DEtection TRansformer (RT-DETR) has emerged in agriculture for its global modeling capabilities and efficient real-time performance: Zhao et al. [13] proposed an improved RT-DETR-Tomato model with a Swin Transformer backbone and BiFormer blocks for efficient tomato detection; Gu et al. [14] developed a modified RT-DETR with a CASA structure to overcome tomato occlusion; Li et al. [15] introduced IMLL-DETR (improved RT-DETR) with an MDGA module and P2 head to tackle litchi leaf pest detection issues. These studies demonstrate RT-DETR’s significant potential in agricultural detection [16,17,18]. Nevertheless, Transformer-based detection models have received limited attention in orchard fruit detection, particularly for challenging small-target fruits. Fresh jujube exemplifies this scenario, exhibiting extremely small size (often ≤32 × 32 pixels), high cluster density, and severe occlusion under real orchard conditions. These characteristics pose significant challenges for current detection frameworks. In particular, the original architectures without specialized enhancements are often insufficient to retain discriminative details of ultra-small fruits, leading to severe feature degradation that compromises detection reliability. This limitation underscores the need for tailored architectural adaptations, such as enhanced feature representation and refined sampling strategies, to effectively capture and preserve small-object cues in complex agricultural scenes.
To date, few studies have explored the application of RT-DETR or similar Transformer-based frameworks to such fruits. Therefore, how to achieve accurate detection with minimal missed detections and false positives for densely occluded small fruits under complex field conditions remains an open research question. To address this gap, this study first introduces the real-time detection framework RT-DETR into fresh jujube fruit detection. (1) While most studies enhance small-target detection by adding a P2 detection layer [7,15], this approach increases computational complexity and parameter count, slowing inference. Additionally, the P2 layer may introduce noise or redundant features, affecting small-target localization and classification accuracy and exacerbating overfitting risks. To solve this, we design a Spatial Enhancement Pyramid Network (SEPN), which uses a multi-branch learning architecture to capture global-to-local feature representations, improving small-target detection performance. (2) Considering the complex agricultural environment of real jujube orchards, our self-built dataset includes environmental details such as high fruit density, variable lighting conditions and shooting angles, and severe occlusion by fruits, branches, and leaves. To verify the practical effectiveness of the proposed modules and the improved model for other small-target detection tasks, we introduce three benchmark datasets: CherryBBCH72, Olive Fruit Object Detection, and Pests and Diseases Tree.

2. Materials and Methods

2.1. Data Acquisition and Dataset Construction

2.1.1. Self-Built Dataset

The image data in this study were mainly collected from a jujube orchard in Dayetou Village, Lugezhuang Town, Yantai City, Shandong Province, China (120°39′ E, 36°50′ N). The jujube trees are 6-year-old ‘Dongzao’ cultivars, and the data collection targets fruits at the edible maturity stage (not overripe). The data collection was conducted on 25 September 2024, from 9 a.m. to 12 noon and 2 p.m. to 5 p.m. All data were acquired under natural light, covering different lighting conditions (e.g., backlight, side light, front light, and sunset light, as shown in Figure 1a–d), angles, and occlusion scenarios involving branches, leaves, and fruits (e.g., dense fruits and serious occlusion, as shown in Figure 1e,f). To ensure reproducibility, images were captured from multiple trees and different canopy positions (upper, middle, and lower layers). Two collection devices were used: (1) an Apple iPhone 13 Pro Max smartphone (Apple Inc., Cupertino, CA, USA) equipped with a Sony IMX703 sensor (Sony Corporation, Tokyo, Japan), a 26 mm focal length, and an f/1.5 aperture; and (2) a Huawei nova 11 smartphone (Huawei Technologies Co., Ltd., Shenzhen, China) with a Sony IMX700 sensor (Sony Corporation, Tokyo, Japan), a 27 mm focal length, and an f/1.9 aperture. To address potential sensor-induced variations (e.g., in color rendering, spatial scaling, and brightness response) arising from hardware and software discrepancies between the two devices, standardized shooting parameters were adopted: a unified square resolution of 2736 × 2736 (1:1 aspect ratio) to reduce spatial inconsistency, a fixed ISO range (50–200), and standardized image preprocessing. Images were captured hand-held at distances ranging from 15 cm to 50 cm. To reduce the risk of overfitting caused by limited diversity of training samples, images were taken from left, right, and front angles. A total of 617 fresh jujube fruit images were collected and saved in JPG format, then labeled using LabelImg software (version 1.8.6). As shown in Figure 1, the dataset features high fruit density, variable lighting conditions, severe fruit occlusion, and multi-angle acquisition, which improves the model’s detection performance in complex environments.
To simulate changeable weather in natural environments and further improve the algorithm’s robustness and the detection performance of the model, offline data augmentation is performed on the acquired image data, with different augmentation methods randomly combined [19]. As a result, 585 new training samples are generated, increasing the total number of samples in the fresh jujube fruit dataset to 1202. Figure 2 shows partial image examples obtained after applying data augmentation techniques to the fresh jujube fruit dataset.
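As an illustration of such an offline pipeline, the sketch below randomly combines a few simple photometric and geometric transforms. The paper does not enumerate its exact augmentation set, so the specific operations and parameter ranges here are assumptions; note that geometric transforms would also require the bounding-box labels to be updated accordingly.

```python
import random
from PIL import Image, ImageEnhance

def augment(img: Image.Image) -> Image.Image:
    """Randomly combine 1-3 transforms to simulate changeable field conditions."""
    ops = [
        lambda im: im.transpose(Image.Transpose.FLIP_LEFT_RIGHT),          # mirror
        lambda im: im.rotate(random.uniform(-15, 15)),                     # small rotation
        lambda im: ImageEnhance.Brightness(im).enhance(random.uniform(0.6, 1.4)),
        lambda im: ImageEnhance.Contrast(im).enhance(random.uniform(0.7, 1.3)),
    ]
    for op in random.sample(ops, k=random.randint(1, 3)):  # random combination
        img = op(img)
    return img
```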
Since fresh jujube fruits occupy a very small pixel ratio in high-resolution images (pixel area < 32 × 32 pixels) [20], this dataset belongs to the small-object detection task. As shown in Table 1, to facilitate model training and evaluation, the final 1202 images are randomly divided into training, testing, and validation datasets at a ratio of 7:2:1, with 841 images for training, 241 for testing, and 120 for validation.
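A minimal sketch of such a random 7:2:1 split is shown below; the file names are hypothetical. Plain truncation of the shuffled list yields 841/240/121 images, so the reported 841/241/120 split evidently rounds the test and validation counts slightly differently.

```python
import random

random.seed(0)  # fixed seed for a reproducible split
files = [f"jujube_{i:04d}.jpg" for i in range(1202)]  # hypothetical file list
random.shuffle(files)
n_train, n_test = int(0.7 * len(files)), int(0.2 * len(files))
train = files[:n_train]                     # 841 images
test = files[n_train:n_train + n_test]      # 240 images
val = files[n_train + n_test:]              # 121 images
print(len(train), len(test), len(val))
```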

2.1.2. Olive Fruit Object Detection Dataset

The experiment also uses the Olive Fruit Object Detection dataset to evaluate the performance of the proposed Spatial Enhancement Pyramid Network (SEPN) on other small-object detection tasks, demonstrating its generalizability [21]. As shown in Table 2, the dataset contains a total of 972 valid images, which are randomly divided into 680 for training, 195 for testing, and 97 for validation. Figure 3 displays selected example samples from the Olive Fruit Object Detection dataset.

2.1.3. CherryBBCH72 Dataset

Given the relatively limited sample size of the Olive Fruit Object Detection dataset, this study introduces the CherryBBCH72 dataset to more comprehensively validate the generalizability of SEPN in small-object detection tasks [22]. As shown in Table 3, after removing missing images and labels, it contains a total of 2458 valid images, with 1720 for training, 492 for testing, and 246 for validation. Figure 4 displays some example images selected from the CherryBBCH72 dataset.

2.1.4. Pests and Diseases Tree Dataset

Olive Fruit Object Detection, CherryBBCH72, and the Self-built Dataset are all real orchard-environment datasets that share similar agricultural characteristics. To further explore the effectiveness of the key improvements in the JFST-DETR model for small-object detection tasks in other domains, this study introduces the Pests and Diseases Tree (PDT) dataset [23]. As shown in Table 4, in terms of data structure, PDT exhibits an extremely high proportion of small objects, containing as many as 89,293 small targets. This dataset is therefore used to further validate the improved model in other domains. Some example images selected from the PDT dataset are shown in Figure 5.

2.2. RT-DETR Network

RT-DETR (Real-Time DEtection TRansformer) is a real-time end-to-end object detector proposed by the Baidu PaddlePaddle team in 2023 [24]. Specifically, the RT-DETR network is composed of a backbone network, a hybrid encoder, and a transformer decoder. The RT-DETR backbone adopts the classic ResNet network to extract feature representations of images; the hybrid encoder uses the Attention-based Intra-scale Feature Interaction (AIFI) module and the CNN-based Cross-scale Feature Fusion (CCFM) module, decoupling multi-scale feature interaction into two steps (intra-scale interaction and cross-scale fusion) to gradually improve model accuracy while significantly reducing computational costs; the transformer decoder consists of an IoU-aware Query Selection mechanism and a detection head, avoiding the non-maximum suppression (NMS) process and accelerating the model’s real-time detection speed. By integrating the local perception capability of CNNs and the global modeling capability of Transformers, RT-DETR enhances efficiency while ensuring accuracy, making it the primary choice as the basic small-object detection network for fresh jujube fruits in this study [25].
Furthermore, ResNet-18 is chosen as the backbone for its structural compatibility with the proposed improvement modules and balanced computational efficiency [26]. It provides stable multi-scale feature outputs (1/4, 1/8, 1/16, 1/32 resolutions) and a standard convolutional structure, allowing direct integration with the enhancement designs of this study without additional adaptation layers. In contrast, lightweight architectures such as MobileNetV3 prioritize parameter reduction through depth-wise separable convolutions, which excessively compress feature channels and compromise structural adaptability [27]. These architectures require complex adjustments to fit our multi-branch feature learning framework. Thus, ResNet-18 achieves a better balance between model adaptability and efficiency than overly lightweight models that sacrifice compatibility for parameter reduction.
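As a brief illustration of the multi-scale outputs referred to above, the sketch below taps the four residual stages of a torchvision ResNet-18; this is a generic feature-extraction recipe, not the exact JFST-DETR integration.

```python
import torch
from torchvision.models import resnet18
from torchvision.models.feature_extraction import create_feature_extractor

# layer1-layer4 of ResNet-18 emit 1/4, 1/8, 1/16, and 1/32 resolution maps
backbone = create_feature_extractor(
    resnet18(weights=None),  # trained from scratch, matching Section 3.1.1
    return_nodes={"layer1": "p2", "layer2": "p3", "layer3": "p4", "layer4": "p5"},
)
feats = backbone(torch.randn(1, 3, 640, 640))
for name, f in feats.items():
    print(name, tuple(f.shape))  # p2: 160x160, p3: 80x80, p4: 40x40, p5: 20x20
```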

2.3. Improved JFST-DETR Model

To address the challenges faced by RT-DETR in detecting small fresh jujube fruit targets in real agricultural environments, this study proposes a small-object detection Transformer called JFST-DETR. The network architecture is shown in Figure 6. All schematic diagrams of core modules in this study are independently redrawn by the authors based on relevant literature and common schematic conventions in computer vision research, ensuring the accuracy and originality of the schematics. First, JFST-DETR constructs a novel object detection architecture named Spatial Enhancement Pyramid Network (SEPN) from the new Global Awareness Adaptive Module (GAAM) and Spatial Coding Module (SCM). Through spatial-depth transformation, the network captures finer features; leveraging a multi-branch tunable dynamic convolution kernel architecture, it effectively learns global-to-local feature representations, enhancing the model’s ability to extract features from multi-scale targets and its global perception. Second, the Dynamic Sampling (DySample) operator replaces the traditional upsampling in the baseline model, increasing the spatial information depth of small targets and reducing computational costs through a point sampling design. Finally, Pinwheel-Shaped Convolution (PSConv), with its distinctive pinwheel-shaped receptive field design, is used to alleviate the loss of small-target details caused by multi-layer downsampling, improving the model’s stability and reliability under the interference of complex agricultural environmental conditions. The main improvements are described in detail below.

2.3.1. Spatial Enhancement Pyramid Network

In real jujube orchard scenes, the surface of fresh jujube fruits often exhibits color-gradient features in which red and green coexist. For edge features with color transitions, fixed kernel types fail to dynamically adjust the receptive field to capture the complete red-green gradient, often leading to two issues: misclassification of green boundary pixels on jujube fruit surfaces as leaf background, and high-density overlap with branches and leaves. These challenges impose special requirements on the scale sensitivity and detail-retention capability of feature extraction. Furthermore, after feature extraction through multiple convolutional layers, the features of small targets almost vanish in deep feature maps. Taking ResNet-18 as the backbone for feature extraction as an example, the feature map output by its last layer is 1/32 of the original image size, meaning that a 32 × 32-pixel small target corresponds to only a single pixel in the deep feature map. This causes significant degradation of detail features such as color gradients and texture directions of small targets. To address the limited local-detail capture and feature-information degradation for fresh jujube fruits in the network, this study proposes an improved SEPN, which includes two sub-modules, SCM and GAAM, as shown in Figure 7.
First, small targets have limitations in feature representation on traditional P3, P4, and P5 detection layers [28]. Most researchers adopt the strategy of introducing low-level high-resolution features by adding a P2 detection layer to enhance small-target detection [29]. However, the additional convolutional calculations and multi-scale detection head post-processing flow lead to high computational costs and time consumption. Therefore, instead of adding a P2 detection layer, this study performs spatial-to-depth conversion on P2 high-resolution features through the SCM, fully capturing and retaining local detail information such as edges and textures of small targets. Subsequently, cross-scale feature integration is achieved through the GAAM for the P2 features, the P4 feature layer containing coarse-grained semantic information, and the intermediate P3 feature layer. This module constructs a comprehensive feature extraction framework via three independent parallel branches, learning global-to-local feature representations and maintaining sensitivity to small-target feature information, thus improving small-target detection performance. The structure of SCM is shown in Figure 8.
Traditional Convolutional Neural Networks (CNNs) primarily rely on strided convolution and pooling operations to achieve feature map downsampling. However, in small-object detection tasks, this process may lead to the loss of critical information, which significantly weakens the feature representation capability and degrades model performance. In contrast, Spatial Coding Module (SCM) is a spatial coding technology that transforms spatial information of images into depth information. It consists of a Spatial-to-Depth (SPD) layer, a non-strided convolution (Conv) layer, and a 3 × 3 convolution kernel [30]. The SPD layer converts spatial dimension information of the input feature map into the channel dimension while preserving intra-channel information intact. The non-strided Conv layer, following the SPD layer, uses standard convolution operations to perform convolution on each pixel or feature map individually, fully retaining fine-grained information.
Specifically, as shown in Figure 8a, assuming the input feature map size is S × S × C1, the SPD layer slices the feature map. As shown in Equation (1), this process slices the input feature map of size S × S × C1 into four sub-feature maps according to a specified depth factor (scale = 2), each with dimensions (S/2) × (S/2) × C1.
$$
\begin{aligned}
&f_{0,0} = X[0{:}S{:}scale,\ 0{:}S{:}scale],\quad f_{1,0} = X[1{:}S{:}scale,\ 0{:}S{:}scale],\ \ldots,\ f_{scale-1,0} = X[scale-1{:}S{:}scale,\ 0{:}S{:}scale];\\
&f_{0,1} = X[0{:}S{:}scale,\ 1{:}S{:}scale],\quad f_{1,1},\ \ldots,\ f_{scale-1,1} = X[scale-1{:}S{:}scale,\ 1{:}S{:}scale];\\
&\qquad\vdots\\
&f_{0,scale-1} = X[0{:}S{:}scale,\ scale-1{:}S{:}scale],\quad f_{1,scale-1},\ \ldots,\ f_{scale-1,scale-1} = X[scale-1{:}S{:}scale,\ scale-1{:}S{:}scale]
\end{aligned}
\tag{1}
$$
Here, $scale$ denotes the depth factor used to downsample the feature map $X$, and $f_{0,0}$, $f_{1,0}$, $f_{0,1}$, $f_{1,1}$ represent the four sub-map slices obtained when $scale = 2$.
Then, the module connects the four sub-feature maps along the channel dimension and arranges them as the depth dimension of a new tensor, forming a final output feature map with dimensions of (S/2) × (S/2) × 4C1. As shown in Figure 8b, after processing by the Spatial-to-Depth layer, the feature map is further processed by the non-strided Convolution layer to generate a feature map of size (S/2) × (S/2) × C2. Finally, a 3 × 3 convolution block is used to regulate information redundancy caused by channel changes and adjust the feature dimensions to match the scales of P3 and P4 layers. The transformed P2 features are integrated with P3 and P4 layers through the GAAM. The structure of GAAM is shown in Figure 9.
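Before turning to GAAM, the SCM pipeline just described can be summarized in a minimal PyTorch sketch; the channel widths and the 1 × 1 kernel chosen for the non-strided convolution are illustrative assumptions rather than the exact JFST-DETR configuration.

```python
import torch
import torch.nn as nn

class SCMSketch(nn.Module):
    """Space-to-depth slicing (Eq. (1)) + non-strided conv + 3x3 adjustment."""
    def __init__(self, c1, c2, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(c1 * scale ** 2, c2, 1)   # non-strided convolution
        self.adjust = nn.Conv2d(c2, c2, 3, padding=1)   # 3x3 block regulating redundancy

    def forward(self, x):
        s = self.scale
        # f_{i,j} = X[..., i::s, j::s]: the four sub-feature maps when scale = 2
        subs = [x[..., i::s, j::s] for j in range(s) for i in range(s)]
        x = torch.cat(subs, dim=1)                      # (B, 4*C1, S/2, S/2)
        return self.adjust(self.conv(x))

# Example: a 160x160 P2 map becomes an 80x80 map with no pixels discarded
print(SCMSketch(64, 256)(torch.randn(1, 64, 160, 160)).shape)  # [1, 256, 80, 80]
```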
To enhance the model’s perception of multi-scale features and global semantics and address the information degradation problem in small-object detection, this study utilizes the Omni-Kernel Module (OKM) as the baseline module [31]. However, its independent application increases computational costs and reduces processing efficiency, making it difficult to handle real-time small-object detection tasks in diverse environments. Thus, the GAAM is designed by embedding OKM into the Cross Stage Partial (CSP) architecture. It achieves reasonable allocation of input features through lightweight feature segmentation, effectively reducing computational redundancy [32]. Meanwhile, the cross-stage fusion mechanism is used to integrate feature information at different levels, significantly reducing computational costs while ensuring feature representation capability.
Specifically, for input features P2, P3, and P4, scale dimension adaptation is first performed via 1 × 1 convolution, followed by weight allocation of feature information controlled by a parameter factor (e = 0.25). One-fourth of the feature information enters the global awareness processing unit composed of three parallel branches: Large, Global, and Local.
The Large Branch employs a depthwise separable hybrid kernel strategy, constructing multi-scale receptive fields through square depthwise convolution (k × k), horizontal strip depthwise convolution (1 × k), and vertical strip depthwise convolution (k × 1) to capture multi-granularity contextual information. The Global Branch models global dependencies via the Dual-Domain Channel Attention Module (DCAM) and the Frequency-based Spatial Attention Module (FSAM) to enhance global perception. Given the feature $X_{Global} \in \mathbb{R}^{C \times H \times W}$, DCAM is defined as in Equation (2).
$$
\begin{aligned}
X_{FCA} &= \mathrm{IFFT}\!\left(\mathrm{FFT}(X_{Global}) \otimes W_{1\times 1}^{FCA}\!\left(\mathrm{GAP}(X_{Global})\right)\right)\\
X_{DCAM} &= X_{FCA} \otimes W_{1\times 1}^{SCA}\!\left(\mathrm{GAP}(X_{FCA})\right)
\end{aligned}
\tag{2}
$$
Here, $\mathrm{IFFT}$ and $\mathrm{FFT}$ denote the Inverse Fast Fourier Transform and the Fast Fourier Transform, respectively; $X_{FCA}$ represents the output of FCA; $W_{1 \times 1}$ denotes a 1 × 1 convolution layer; $\mathrm{GAP}$ stands for Global Average Pooling; and $\otimes$ denotes element-wise multiplication. Subsequently, the features enter the SCA module, where $X_{DCAM}$ represents the output of DCAM. FSAM is defined as in Equation (3).
$$
X_{FSAM} = \mathrm{IFFT}\!\left(\mathrm{FFT}\!\left(W_{1\times 1}^{1}(X_{DCAM})\right) \otimes W_{1\times 1}^{2}(X_{DCAM})\right)
\tag{3}
$$
Here, $X_{FSAM}$ denotes the output of FSAM. The Local Branch employs small-kernel depthwise convolutions to capture local detail information, forming scale complementarity with the large-kernel branch. GAAM integrates global semantics and large-scale structural information while retaining local details through a global awareness processing system constructed from multi-branch dynamic convolution kernels, enhancing multi-scale feature representation capability.
Additionally, three-fourths of the feature information enters the direct path to preserve critical details of the original features. Subsequently, after the branch outputs are fused, feature refinement is performed via 1 × 1 convolution to effectively integrate feature information, improving the model’s perception of global features while maintaining sensitivity to small targets.
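To make this split-branch-fuse structure concrete, the sketch below renders it in simplified form; the Global Branch is stood in for by plain GAP-driven channel attention rather than the full DCAM/FSAM pair, and the kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class LargeBranch(nn.Module):
    """Depthwise hybrid kernels: square k x k plus horizontal/vertical strips."""
    def __init__(self, c, k=7):
        super().__init__()
        self.square = nn.Conv2d(c, c, k, padding=k // 2, groups=c)
        self.hstrip = nn.Conv2d(c, c, (1, k), padding=(0, k // 2), groups=c)
        self.vstrip = nn.Conv2d(c, c, (k, 1), padding=(k // 2, 0), groups=c)

    def forward(self, x):
        return self.square(x) + self.hstrip(x) + self.vstrip(x)

class GAAMSketch(nn.Module):
    """CSP-style split (e = 0.25): 1/4 of channels through the parallel
    branches, 3/4 through the direct path, then 1x1 cross-stage fusion."""
    def __init__(self, c_in, c_out, e=0.25):
        super().__init__()
        self.cb = int(c_out * e)
        self.pre = nn.Conv2d(c_in, c_out, 1)            # scale-dimension adaptation
        self.large = LargeBranch(self.cb)
        self.local = nn.Conv2d(self.cb, self.cb, 3, padding=1, groups=self.cb)
        self.gap = nn.AdaptiveAvgPool2d(1)              # simplified global branch
        self.gfc = nn.Conv2d(self.cb, self.cb, 1)
        self.fuse = nn.Conv2d(c_out, c_out, 1)          # 1x1 feature refinement

    def forward(self, x):
        x = self.pre(x)
        xa, xb = x.split([self.cb, x.shape[1] - self.cb], dim=1)
        g = torch.sigmoid(self.gfc(self.gap(xa))) * xa  # global attention (simplified)
        xa = self.large(xa) + self.local(xa) + g        # fuse the three branches
        return self.fuse(torch.cat([xa, xb], dim=1))    # rejoin the direct path
```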
The SEPN effectively addresses the limitations of the baseline model in capturing local details of small targets and feature information degradation through a synergistic architecture composed of the SCM and GAAM. Meanwhile, it avoids the computational explosion and excessive processing time issues associated with the method of adding a P2 detection layer.

2.3.2. DySample Module

RT-DETR adopts nearest neighbor interpolation for upsampling, but this method only copies adjacent pixel values without generating new ones, leading to jagged edges and pixelation in small-target features. This affects the model’s ability to capture details. Moreover, relying on pixel spatial positions while ignoring content information, the method exacerbates pixel distortion and information loss when processing small targets, degrading detection accuracy. To address this, this study introduces the DySample operator, which adaptively matches the geometric structures of small targets by dynamically adjusting sampling point offsets, enhancing the capability to capture edge and detail information [33]. This effectively mitigates detail loss after multi-layer sampling. Additionally, DySample dynamically selects sampling points directly on feature maps, avoiding the computational overhead of generating dynamic convolution kernels, thus improving small-target detection performance while reducing model complexity. The structure of DySample is shown in Figure 10.
In the DySample operator, given an input feature $X \in \mathbb{R}^{C \times H \times W}$, the feature map $X$ is processed through two branches in the sampling point generator: a sigmoid-activated linear layer branch and a basic linear layer branch. The outputs of these branches are combined via element-wise multiplication, followed by a pixel shuffle operation to generate the dynamic offset $O$. Finally, the sampling set $S$ is obtained by combining this offset with the original sampling grid $G$, as shown in Equations (4) and (5).
$$
O = 0.5 \times \sigma\!\left(\mathrm{linear}_1(X)\right) \otimes \mathrm{linear}_2(X)
\tag{4}
$$
$$
S = G + O
\tag{5}
$$
Here, $\sigma$ denotes the sigmoid function with a modulation parameter of 0.5, $O$ represents the dynamic offset, $G$ is the original sampling grid, $\otimes$ signifies element-wise multiplication, and $S$ is the sampling set. Finally, the grid sampling function is used to obtain the final sampled map $X'$, as defined in Equation (6).
$$
X' = \mathrm{grid\_sample}(X, S)
\tag{6}
$$
Here, $\mathrm{grid\_sample}$ denotes the grid sampling function. The DySample operator leverages a dynamic sampling design strategy to optimize the spatial information representation of small targets, enhancing the model’s ability to acquire detailed information while reducing computational burden and improving processing speed.
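A self-contained sketch of this point-sampling design for a scale factor of 2 is given below; the offset normalization is approximate, and the channel layout of the two linear branches is an assumption consistent with Equations (4)–(6).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Dynamic 2x upsampling by point sampling, following Eqs. (4)-(6)."""
    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        # two linear branches: one sigmoid-modulated, one plain (Eq. (4))
        self.linear1 = nn.Conv2d(channels, 2 * scale ** 2, 1)
        self.linear2 = nn.Conv2d(channels, 2 * scale ** 2, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.scale
        # Eq. (4): O = 0.5 * sigmoid(linear1(X)) ⊗ linear2(X), then pixel shuffle
        o = 0.5 * torch.sigmoid(self.linear1(x)) * self.linear2(x)
        o = F.pixel_shuffle(o, s)                            # (b, 2, s*h, s*w)
        # original sampling grid G in grid_sample's normalized [-1, 1] space
        ys = torch.linspace(-1, 1, s * h, device=x.device)
        xs = torch.linspace(-1, 1, s * w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        g = torch.stack((gx, gy)).unsqueeze(0)               # (1, 2, s*h, s*w)
        # Eq. (5): S = G + O, offsets rescaled (approximately) to grid units
        norm = torch.tensor([w, h], device=x.device).view(1, 2, 1, 1)
        grid = (g + o / norm).permute(0, 2, 3, 1)            # (b, s*h, s*w, 2)
        # Eq. (6): X' = grid_sample(X, S)
        return F.grid_sample(x, grid, align_corners=True)
```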

2.3.3. Pinwheel-Shaped Convolution

Due to the high computational complexity of the RT-DETR architecture, the detailed features of small targets are gradually lost after multi-layer convolutional downsampling during feature extraction, primarily because of resolution degradation and diluted semantic information, which further complicates detection. Thus, this study replaces the traditional downsampling module with PSConv, which creates horizontal and vertical convolution kernels through asymmetric padding to expand the capture field for small-target features [34]. With its pinwheel-shaped receptive field structure, PSConv enhances the model’s focus on local subtle features of small targets and suppresses interference from background clutter and irrelevant semantic information, thereby more accurately preserving the spatial details and discriminative information of small targets during feature downsampling. The structure of PSConv is shown in Figure 11.
As shown in Figure 11a, the PSConv module first performs four-directional parallel convolutions on the input feature map $X^{(h_1, w_1, c_1)}$, generating horizontal and vertical convolution kernels through asymmetric padding to process different regions of the image. Each convolution branch uses specific padding parameters. To enhance training stability and speed, Batch Normalization (BN) and the Sigmoid Linear Unit (SiLU) activation are applied after each convolution, as defined in Equation (7).
$$
\begin{aligned}
X_1^{(h,w,c)} &= \mathrm{SiLU}\!\left(\mathrm{BN}\!\left(X_{P(1,0,0,3)}^{(h_1,w_1,c_1)} \circledast W_1^{(1,3,c)}\right)\right),\\
X_2^{(h,w,c)} &= \mathrm{SiLU}\!\left(\mathrm{BN}\!\left(X_{P(0,3,0,1)}^{(h_1,w_1,c_1)} \circledast W_2^{(3,1,c)}\right)\right),\\
X_3^{(h,w,c)} &= \mathrm{SiLU}\!\left(\mathrm{BN}\!\left(X_{P(0,1,3,0)}^{(h_1,w_1,c_1)} \circledast W_3^{(1,3,c)}\right)\right),\\
X_4^{(h,w,c)} &= \mathrm{SiLU}\!\left(\mathrm{BN}\!\left(X_{P(3,0,1,0)}^{(h_1,w_1,c_1)} \circledast W_4^{(3,1,c)}\right)\right).
\end{aligned}
\tag{7}
$$
Here, $\circledast$ denotes the convolution operator, and $W_1^{(1,3,c)}$ is a 1 × 3 convolution kernel with an output channel number of $c$. The padding parameters $P(1,0,0,3)$ specify the number of padding pixels in the left, right, top, and bottom directions, respectively. Subsequently, as shown in Figure 11b, the output feature maps of the four directional convolutions are concatenated along the channel dimension to obtain a feature map with $4c$ channels, as defined in Equation (8).
$$
h = \frac{h_1}{s} + 1,\quad w = \frac{w_1}{s} + 1,\quad c = \frac{c_2}{4},\quad X^{(h,w,4c)} = \mathrm{Cat}\left(X_1, X_2, X_3, X_4\right).
\tag{8}
$$
Here, $h$, $w$, and $c$ denote the height, width, and number of channels, respectively; $c_2$ is the number of channels in the final output feature map of the PSConv module; and $s$ is the convolution stride. Finally, as shown in Figure 11c, the concatenated feature map is normalized using a 2 × 2 convolution kernel to adjust the number of channels in the output feature map.
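The following sketch reproduces this four-branch layout. Note one reading assumption: the paper's $W(1,3,c)$/$W(3,1,c)$ kernels are interpreted here as (width, height), i.e., PyTorch kernels (3, 1) and (1, 3), since that orientation makes the four branch outputs spatially consistent with Equation (8).

```python
import torch
import torch.nn as nn

def conv_bn_silu(c_in, c_out, k, s):
    """Conv -> BatchNorm -> SiLU, as applied to each directional branch (Eq. (7))."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=s, padding=0, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class PSConvSketch(nn.Module):
    """Four asymmetric-padded directional convolutions, concatenated and fused
    by a 2x2 convolution (Figure 11, Eqs. (7)-(8))."""
    def __init__(self, c1, c2, s=2):
        super().__init__()
        c = c2 // 4
        pads = [(1, 0, 0, 3), (0, 3, 0, 1), (0, 1, 3, 0), (3, 0, 1, 0)]  # (l, r, t, b)
        kernels = [(3, 1), (1, 3), (3, 1), (1, 3)]   # (height, width) in PyTorch
        self.pads = nn.ModuleList(nn.ZeroPad2d(p) for p in pads)
        self.convs = nn.ModuleList(conv_bn_silu(c1, c, k, s) for k in kernels)
        self.fuse = conv_bn_silu(4 * c, c2, 2, 1)    # 2x2 kernel adjusts channels

    def forward(self, x):
        y = [conv(pad(x)) for pad, conv in zip(self.pads, self.convs)]
        return self.fuse(torch.cat(y, dim=1))

# Example: each branch yields h1/s + 1 = 321, and the 2x2 fusion returns 320
print(PSConvSketch(64, 256)(torch.randn(1, 64, 640, 640)).shape)  # [1, 256, 320, 320]
```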

3. Experiments and Discussions

3.1. Experimental Environment and Evaluation Indicators

3.1.1. Experimental Environment

The JFST-DETR model is trained and tested under the Windows 11 operating system. The hardware configuration includes an Intel Core i7-14650HX CPU (Intel Corporation, Santa Clara, CA, USA), an NVIDIA GeForce RTX 4060 Laptop GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 8 GB memory, 32 GB of RAM, and a 1 TB hard disk. The software environment primarily consists of PyCharm 2023, Python 3.11.9, PyTorch 2.2.2, and CUDA 12.2. For the training parameters, the number of epochs is set to 100, the batch size is 4, the image size is 640 × 640, the number of data loading workers is 4, and the AdamW optimizer is employed with an initial learning rate of 0.0001. Pre-trained weights are not used, with other parameters set to default. Under these settings, each epoch took approximately 1.5 min, resulting in a total training time of 150 min (2.5 h) for 100 epochs, and the model converged at around 90 epochs.
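For reference, an equivalent run can be sketched with the Ultralytics training API as below; the YAML file names are hypothetical placeholders for the JFST-DETR architecture definition and the dataset configuration.

```python
from ultralytics import RTDETR

model = RTDETR("jfst-detr.yaml")   # hypothetical custom architecture file
model.train(
    data="jujube.yaml",            # hypothetical dataset config (paths, classes)
    epochs=100,
    batch=4,
    imgsz=640,
    workers=4,
    optimizer="AdamW",
    lr0=0.0001,
    pretrained=False,              # no pre-trained weights, as stated above
)
```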

3.1.2. Evaluation Indicators

This study uses Precision (P), Recall (R), F1-Score (F1), and Mean Average Precision (mAP) to evaluate the model’s detection performance. Precision, Recall, and F1-Score are defined in Equations (9)–(11):
$$
\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\%
\tag{9}
$$

$$
\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\%
\tag{10}
$$

$$
F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\tag{11}
$$
Here, $FP$ denotes false positives, $FN$ denotes false negatives, and $TP$ denotes true positives. mAP is defined in Equation (12):
$$
\mathrm{mAP} = \frac{\sum P_A}{N_C}
\tag{12}
$$
Here, $N_C$ denotes the number of classes, and $P_A$ denotes the average precision of each class. Two main evaluation metrics are employed in the experiments: mAP@50 and mAP@50:95. mAP@50 represents the mean average precision across all classes when the intersection-over-union (IoU) threshold is set to 0.5. mAP@50:95 is a more comprehensive metric, calculated by averaging the mAP values at IoU thresholds from 0.5 to 0.95 (in steps of 0.05), thus evaluating the model across a range of localization strictness.
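These definitions are straightforward to compute from matched detections. As a quick sanity check, the illustrative counts below (hypothetical, chosen only so the arithmetic is visible) reproduce numbers of the same order as those reported later:

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, Recall, and F1 from detection counts (Eqs. (9)-(11))."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: 868 correct detections, 65 false alarms, 132 missed fruits
print(detection_metrics(868, 65, 132))
# {'precision': 0.930..., 'recall': 0.868, 'f1': 0.898...}
```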
Besides accuracy metrics, the study also focuses on the model’s efficiency performance. Detection speed is measured by Frames Per Second (FPS), where a higher FPS indicates faster processing speed and better real-time performance. Model size is evaluated using two key indicators: computational complexity and the number of parameters. Computational complexity is reflected by the number of Giga Floating-point Operations Per Second (GFLOPs), while the number of parameters refers to the total number of model parameters, measured in millions (M). These two metrics are crucial for assessing the model’s deployability and processing efficiency across different devices.
Statistical analysis. Results in Table 5, Table 6, Table 7 and Table 8 are reported as mean ± standard deviation (SD) over n = 5 independent training runs for each model and dataset, covering Precision, Recall, F1-score, mAP@50, mAP@50:95, and FPS; Parameters and GFLOPs are deterministic and therefore reported as single values. To enhance the objectivity and credibility of the study, significance analysis is used to provide a quantitative basis for model comparisons. In the tables, different lowercase superscript letters indicate significant differences (p < 0.05), whereas the same letter indicates no significant difference.
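As an illustration of how such letter groupings can be derived from the five runs, the sketch below applies Welch's t-test to two hypothetical sets of mAP@50 results; the paper does not state which specific test underlies its letters, so this is one plausible choice rather than the authors' exact procedure.

```python
from scipy import stats

map50_jfst = [94.1, 94.4, 94.3, 94.5, 94.2]   # hypothetical per-run results
map50_base = [91.5, 91.9, 91.7, 91.6, 91.8]
t, p = stats.ttest_ind(map50_jfst, map50_base, equal_var=False)  # Welch's t-test
label = "different letters (significant)" if p < 0.05 else "same letter"
print(f"p = {p:.4f} -> {label}")
```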

3.2. Experiment on the Effectiveness Evaluation of SEPN

To evaluate the effectiveness and generality of the SEPN architecture in other small-object detection tasks, validation is conducted on a Self-built Dataset and two public datasets with real-world complex agricultural environments: Olive Fruit Object Detection and CherryBBCH72. The experiments use RT-DETR as the baseline model and separately integrate the SEPN architecture for testing. Since adding a P2 detection layer to enhance small-object feature representation in small-object detection tasks can lead to a surge in computational complexity and a decline in inference efficiency, this experiment includes the P2 detection layer as a comparative scheme. This is to validate the optimization effect of the SEPN detection architecture on model computational complexity and inference efficiency while maintaining effective small-object detection. The experimental results are shown in Table 5.
As shown in Table 5, on the Self-built Dataset, SEPN outperforms the baseline in mAP@50 and mAP@50:95 by 0.8% and 0.7%, respectively, and surpasses P2 by 0.6% and 5.6% in the same metrics. On the Olive Fruit Object Detection dataset, SEPN achieves 1.3% and 1.6% improvements in mAP@50 and mAP@50:95 over the baseline; its mAP@50 exceeds P2 by 0.9%, while mAP@50:95 is nearly identical to P2. For the CherryBBCH72 dataset, SEPN demonstrates 1.7% and 0.3% increases in mAP@50 and mAP@50:95 relative to the baseline, respectively; compared to P2, it improves mAP@50 by 1.9% with a negligible gap in mAP@50:95. In terms of Precision, Recall, and F1-score, SEPN shows varying degrees of improvement. These results confirm that SEPN performs more stably and accurately in different small-object detection tasks, significantly outperforming both the baseline model and the P2 method.
Regarding efficiency metrics, SEPN has a parameter count of 20 M, nearly identical to the baseline and P2. While the multi-branch architecture introduces a slight increase in computational complexity over the baseline, SEPN remains substantially less complex than P2. SEPN achieves FPS values of 72.4, 69.2, and 71.7 across the three datasets, slightly lower than the baseline but markedly faster than the P2 variant.

3.3. Generalization Evaluation Experiment of JFST-DETR

To validate the improved model’s adaptability to fresh samples, experiments are conducted on Self-built Dataset, Olive Fruit Object Detection, and Pests and Diseases Tree. These datasets represent different crop detection scenarios in agriculture and unmanned aerial vehicle (UAV) remote sensing detection scenarios, both of which involve challenges of small-object detection and complex backgrounds. Experiments on these datasets aim to test the effectiveness and robustness of JFST-DETR in detecting small objects across various agricultural scenarios. The experimental results are shown in Table 6.
On the Self-built Dataset, JFST-DETR achieves an mAP@50 of 94.3%, a 2.6% improvement over RT-DETR, demonstrating the improved model’s enhanced detection capability for small fresh jujube fruits in orchard environments. In terms of Precision, Recall, and F1-score, JFST-DETR outperforms RT-DETR by 0.8%, 3.7%, and 2.4%, respectively, indicating that the model effectively reduces missed detections in dense scenarios and mitigates false-positive risks, showcasing a superior performance balance. On the Olive Fruit Object Detection dataset, JFST-DETR achieves improvements of 2.2%, 3.0%, 2.7%, and 3.3% in Precision, Recall, F1-score, and mAP@50 over RT-DETR, realizing multi-dimensional performance gains and improving sensitivity to small objects and detail-feature extraction under complex environmental interference. Finally, on the Pests and Diseases Tree dataset, JFST-DETR surpasses RT-DETR in all metrics, with Precision, Recall, F1-score, and mAP@50 reaching 87.9%, 86.1%, 86.9%, and 92.9%, respectively, highlighting the model’s outstanding stability in UAV-based detection tasks.
Additionally, SEPN significantly enhances model detection accuracy, though the improvement in recall is less pronounced. However, with the integration of the DySample and PSConv modules, the feature acquisition domain is expanded, preserving spatial details and discriminative information of small objects to reduce missed detections, ultimately improving the comprehensive detection performance of JFST-DETR.

3.4. Ablation Experiment

In deep neural networks, ablation studies serve as an important method to analyze the contribution of each model component to overall model performance. To deeply explore the specific contributions of each module in the JFST-DETR model for small-object detection tasks, ablation experiments are conducted on the Self-built Dataset. The experimental results are shown in Table 7.
By integrating SEPN alone into the baseline model through the synergistic mechanism of its spatial depth transformation technology and global perception adaptive processing unit, the overall detection accuracy of the baseline model is comprehensively improved, with Precision, Recall, F1-score, mAP@50, and mAP@50:95 improving by 1.8%, 2.1%, 2%, 0.8%, and 0.7%, respectively. This indicates a positive contribution to model performance. DySample accurately captures spatial details of small objects through dynamic offset sampling, enhancing the model’s recall capability and improving Recall by 2.6%. Its lightweight design achieves an FPS of 92.8, realizing dual optimization of detection efficiency and accuracy. When PSConv is added alone, experimental results show limited improvement in the baseline model’s detection accuracy because, without global information integration and detailed feature support, the multi-directional convolution design fails to fully function, leading to limited performance gains. Notably, PSConv shows limited standalone performance but demonstrates greater effectiveness when combined with other modules. Paired with SEPN, PSConv leverages SEPN’s global perception capability to achieve significant accuracy improvements, with mAP@50 and mAP@50:95 improving by 1.5% and 2.7%, respectively. When used with DySample, dynamic sampling optimizes feature input, achieving Precision of 94.8% and FPS of 80.4 to balance speed and accuracy. The collaboration between DySample and SEPN achieves a mAP@50 of 93.4%, representing a significant breakthrough in small-object detection accuracy. Finally, the combination of all three modules improves Precision, Recall, F1-score, mAP@50, and mAP@50:95 by 0.8%, 3.7%, 2.4%, 2.6%, and 3.1%, respectively, with mAP@50 and mAP@50:95 reaching peak values without significantly increasing computational complexity or parameter count, while maintaining fast processing speed.
To further validate the above conclusions, key improvements are sequentially integrated, and the feature maps of each detection output are visualized, as shown in Figure 12. Brighter colors indicate regions that the model pays more attention to.
In complex environments with severe occlusion and color confusion between objects and backgrounds, as improved modules are progressively embedded, the model demonstrates a stepwise improvement in small-object localization accuracy and detail feature capture capability [35]. This enhances the model’s feature representation for small objects, suppresses background interference, and improves the ability to recognize distant and blurred objects.
Ablation experiments and feature map visualization validate the contribution of each module to the overall model performance, finally enhancing the model’s detection robustness in complex environments significantly. Moreover, there is no conflict between the modules.

3.5. Comparison Experiments of Different Models

The experimental results of JFST-DETR on the Self-built Dataset are compared with those of other object detection models (YOLOv5m, YOLOv8m, YOLOv8-P2, YOLOv9m, YOLOv10m, YOLOv11m, YOLOv12m, YOLOv12s, RT-DETR-P2, RT-DETR) [36]. To ensure a fair comparison, the YOLO series adopt the default parameters from Ultralytics, the RT-DETR series use the same parameters as the algorithm in this study, and none of the models uses pretrained weights. The experimental results are shown in Table 8.
In terms of the core detection performance metrics F1 and mAP@50, JFST-DETR outperforms the other detectors by 0.4–4.5% and 1–5.4%, respectively, achieving the highest values of 89.8% and 94.3%. Meanwhile, its mAP@50:95 of 75.2% differs only marginally from those of YOLOv8m and YOLOv9m. This indicates that JFST-DETR detects objects more accurately and reduces missed detections in small-object tasks, achieving the best overall balance of stability and accuracy among all models. Compared with models of comparable detection accuracy such as YOLOv5m (mAP@50: 93.2%), YOLOv8m (93%), YOLOv9m (93.3%), YOLOv11m (92.9%), and YOLOv12m (92.7%), JFST-DETR has the same or a lower parameter count and the lowest computational complexity of the group except for YOLOv5m, which it exceeds by only 0.7 GFLOPs, while maintaining fast processing speed. In contrast to models that add P2 detection layers for small-object accuracy (YOLOv8-P2, RT-DETR-P2), JFST-DETR improves mAP@50 by 5.4% and 2.4%, respectively, outperforming the P2-layer approach. To further evaluate the comprehensive performance of each model, the experimental results are visualized as a multi-pane stacked chart to observe model performance under each evaluation metric, as shown in Figure 13.
As shown in Figure 13, JFST-DETR achieves balanced performance across all metrics. It not only reaches the highest detection accuracy but also demonstrates significant advantages in parameter count and computational complexity, with comprehensive performance superior to other models.

3.6. Comparison of Specific Test Examples

To verify the practical performance of JFST-DETR in different scenarios, specific test examples are conducted, as shown in Figure 14.
Results of specific test examples show that compared with other models, the improved model demonstrates higher attention to small objects and stronger anti-interference capability in real-world scenarios, exhibiting reliable performance.

4. Discussion

This study enhances fresh-jujube detection and verifies, via extensive experiments, the proposed method’s feasibility for diverse agricultural small-object detection. Performance gains stem from three key architectural innovations. First, the Spatial Enhancement Pyramid Network (SEPN) uses spatial-to-depth encoding to preserve high-resolution edges and textures, and aggregates multi-scale features via multi-branch pathways (large and small receptive fields) enhanced by channel- and frequency-domain attention. On the Self-built Dataset, SEPN alone increases mAP@50 to 92.5% compared to the baseline, with improved Precision and F1-score, thereby boosting small-target detection accuracy and reducing misdetection. When integrated into the CherryBBCH72 dataset baseline, it elevates mAP@50 by 1.7%, highlighting its generality across other orchard crops. Second, the DySample operator dynamically adjusts upsampling offsets to align fruit boundaries and fine details, raising Self-built Dataset Recall to 85.7% and mAP@50 to 92.4% with consistent cross-dataset improvements, effectively reducing missed detection of small targets. Third, Pinwheel-Shaped Convolution expands receptive fields via asymmetric padding, emphasizes directional structures, suppresses background clutter, and enhances high-accuracy robustness, lifting Self-built Dataset mAP@50 to 92.5% and reducing false detection induced by background interference.
From a feature-learning perspective, these modules complement tightly: SEPN builds a multi-scale feature foundation by retaining fine details and integrating context, providing high-quality input. DySample refines these features via dynamic alignment, preserving the structural integrity of SEPN-captured details during upsampling. PSConv enhances robustness by mitigating background interference and emphasizing directional patterns, sharpening critical discriminative features to complement both. On the Self-Built Dataset, JFST-DETR achieves Precision of 93.0%, Recall of 86.8%, F1-score of 89.8%, and mAP@50 of 94.3%. Additionally, the model exhibits strong scalability for other orchard crops and cross-domain target detection: it reaches an mAP@50 of 79.4% on the Olive Fruit Object Detection Dataset, a 3.3% improvement over the baseline model; on the Pests and Diseases Tree (PDT) Dataset, its mAP@50 further hits a high value of 92.9%.
For horticultural systems, accurate detection of small fruits holds significant agronomic and commercial value: it enables precise yield forecasting and automated harvesting decisions, reduces labor costs for field scouting, and optimizes harvest scheduling, thereby directly enhancing orchard operational efficiency and economic returns. In this study, the proposed JFST-DETR model demonstrates robust performance under challenging orchard conditions, including dense canopies, mutual fruit occlusion, and the characteristic bicolored (partially red, partially green) phenotype of maturing jujubes commonly observed in Yantai winter jujube. This indicates its potential for transfer to other fruit species with distinct canopy architectures or low fruit-foliage color contrast, such as apple (where fruits show minimal color differentiation from leaves) and peach (which features a more open, sparse canopy structure).
Nevertheless, this study has several limitations. First, the self-built dataset lacks diversity, as it was collected from a single orchard during one harvest season and includes only one fresh jujube cultivar, with no representation of other varieties, geographic regions, or phenological stages. Second, the model has not been validated under real-world field conditions; its robustness to dynamic factors—such as motion, lighting changes, and platform vibration—and its inference efficiency on edge devices remain untested in operational settings. Third, despite outperforming existing methods, detection accuracy for heavily occluded or immature fruits still requires improvement, a challenge inherent to purely image-based approaches that are sensitive to occlusion and lighting variability.
Future work will address these gaps by: (1) expanding the dataset to include multiple jujube cultivars, diverse growing regions, and key phenological stages; (2) conducting real-time field trials using UAVs or orchard robots to evaluate system latency, robustness, and deployment feasibility; and (3) integrating multimodal data (e.g., spectral and color features) with model optimization techniques—such as channel pruning and quantization—to enhance detection accuracy while enabling efficient inference on resource-constrained hardware [37]. These steps aim to bridge the gap between algorithmic advances and practical application in intelligent horticulture.

5. Conclusions

This study proposes JFST-DETR, a novel small-object detection framework featuring three key innovations: (1) a Spatial Enhancement Pyramid Network (SEPN) that preserves high-resolution details while enhancing multi-scale feature integration; (2) a Dynamic Sampling (DySample) operator that adaptively refines feature alignment for improved recall; and (3) a Pinwheel-shaped Convolution that expands receptive fields and suppresses background interference to reduce false detections. Through these key innovations, JFST-DETR achieves stable and consistently improved detection performance across all key metrics, including precision, recall, and mAP@50. Its robustness and generalization capability are further confirmed on multiple agricultural datasets. In contrast to prevailing small-object detection approaches that incorporate an additional high-resolution feature map, such as the P2-level output from the feature pyramid, JFST-DETR attains superior accuracy without expanding the feature hierarchy or increasing computational demand. The NMS-free, end-to-end design reduces computational overhead and simplifies the inference pipeline, offering strong potential for deployment on resource-constrained agricultural platforms. Such efficient and accurate detection frameworks hold strong potential for real-world agricultural deployment, for instance on edge devices, unmanned aerial vehicles (UAVs), or mobile applications, enabling real-time field monitoring, yield estimation, and support for robotic harvesting systems.

Author Contributions

Conceptualization, T.L.; methodology, T.L. and J.X.; software, T.L.; validation, T.L. and Z.Z.; formal analysis, T.L. and Z.Z.; investigation, M.W., X.Y. and X.W.; resources, J.X.; data curation, T.L.; writing—original draft preparation, T.L.; writing—review and editing, J.X., M.W., X.Y. and X.W.; visualization, Z.Z.; supervision, J.X.; project administration, J.X.; funding acquisition, J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shanxi Provincial Natural Science Foundation (grant number 202403021221085).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to ongoing experimental work that relies on this dataset in our laboratory.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, M.; Wang, J.; Wang, L.; Liu, P.; Zhao, J.; Zhao, Z.; Yao, S.; Stănică, F.; Liu, Z.; Wang, L.; et al. The Historical and Current Research Progress on Jujube—A Superfruit for the Future. Hortic. Res. 2020, 7, 119. [Google Scholar] [CrossRef] [PubMed]
  2. Li, X.; Wu, J.; Bai, T.; Wu, C.; He, Y.; Huang, J.; Li, X.; Shi, Z.; Hou, K. Variety Classification and Identification of Jujube Based on Near-Infrared Spectroscopy and 1D-CNN. Comput. Electron. Agric. 2024, 223, 109122. [Google Scholar] [CrossRef]
  3. Islam, M.D.; Liu, W.; Izere, P.; Singh, P.; Yu, C.; Riggan, B.; Zhang, K.; Jhala, A.J.; Knezevic, S.; Ge, Y.; et al. Towards Real-Time Weed Detection and Segmentation with Lightweight CNN Models on Edge Devices. Comput. Electron. Agric. 2025, 237, 110600. [Google Scholar] [CrossRef]
  4. Chuang, X.; Qiang, C.; Yinyan, S.; Xiaochan, W.; Xiaolei, Z.; Yao, W.; Yiran, W. Improved Lightweight YOLOv5n-Based Network for Bruise Detection and Length Classification of Asparagus. Comput. Electron. Agric. 2025, 233, 110194. [Google Scholar] [CrossRef]
  5. Fan, X.; Sun, T.; Chai, X.; Zhou, J. YOLO-WDNet: A Lightweight and Accurate Model for Weeds Detection in Cotton Field. Comput. Electron. Agric. 2024, 225, 109317. [Google Scholar] [CrossRef]
  6. Bai, Y.; Yu, J.; Yang, S.; Ning, J. An Improved YOLO Algorithm for Detecting Flowers and Fruits on Strawberry Seedlings. Biosyst. Eng. 2024, 237, 1–12. [Google Scholar] [CrossRef]
  7. Lv, J.; Niu, L.; Xu, L.; Sun, X.; Wang, L.; Rong, H.; Zou, L. A Visual Identification Method of the Growth Posture of Young Peach Fruits in Orchards. Sci. Hortic. 2024, 335, 113355. [Google Scholar] [CrossRef]
  8. Sun, H.; Wang, R.-F. BMDNet-YOLO: A Lightweight and Robust Model for High-Precision Real-Time Recognition of Blueberry Maturity. Horticulturae 2025, 11, 1202. [Google Scholar] [CrossRef]
  9. Ji, P.; Yang, N.; Lin, S.; Xiong, Y. EDI-YOLO: An Instance Segmentation Network for Tomato Main Stems and Lateral Branches in Greenhouse Environments. Horticulturae 2025, 11, 1260. [Google Scholar] [CrossRef]
  10. Esaki, I.; Noma, S.; Ban, T.; Sultana, R.; Shimizu, I. Maturity Classification of Blueberry Fruit Using YOLO and Vision Transformer for Agricultural Assistance. Horticulturae 2025, 11, 1272. [Google Scholar] [CrossRef]
  11. Ma, B.; Xu, J.; Liu, R.; Mu, J.; Li, B.; Xie, R.; Liu, S.; Hu, X.; Zheng, Y.; Zhang, H.; et al. MDAS-YOLO: A Lightweight Adaptive Framework for Multi-Scale and Dense Pest Detection in Apple Orchards. Horticulturae 2025, 11, 1273. [Google Scholar] [CrossRef]
  12. Tsaniklidis, G.; Makraki, T.; Papadimitriou, D.; Nikoloudakis, N.; Taheri-Garavand, A.; Fanourakis, D. Non-Destructive Estimation of Area and Greenness in Leaf and Seedling Scales: A Case Study in Cucumber. Agronomy 2025, 15, 2294. [Google Scholar] [CrossRef]
  13. Zhao, Z.; Chen, S.; Ge, Y.; Yang, P.; Wang, Y.; Song, Y. RT-DETR-Tomato: Tomato Target Detection Algorithm Based on Improved RT-DETR for Agricultural Safety Production. Appl. Sci. 2024, 14, 6287. [Google Scholar] [CrossRef]
  14. Gu, Z.; Ma, X.; Guan, H.; Jiang, Q.; Deng, H.; Wen, B.; Zhu, T.; Wu, X. Tomato Fruit Detection and Phenotype Calculation Method Based on the Improved RTDETR Model. Comput. Electron. Agric. 2024, 227, 109524. [Google Scholar] [CrossRef]
  15. Li, Z.; Shen, Y.; Tang, J.; Zhao, J.; Chen, Q.; Zou, H.; Kuang, Y. IMLL-DETR: An Intelligent Model for Detecting Multi-Scale Litchi Leaf Diseases and Pests in Complex Agricultural Environments. Expert Syst. Appl. 2025, 273, 126816. [Google Scholar] [CrossRef]
  16. Chakrabarty, S.; Deb, C.K.; Marwaha, S.; Haque, M.A.; Kamil, D.; Bheemanahalli, R.; Dhillon, M.K.; Shashank, P.R. Application of Artificial Intelligence in Insect Pest Identification—A Review. Artif. Intell. Agric. 2025, 16, 44–61. [Google Scholar] [CrossRef]
  17. Xie, Z.; Liu, W.; Li, Y.; Du, J.; Long, T.; Xu, H.; Long, Y.; Zhao, J. Enhanced Litchi Fruit Detection and Segmentation Method Integrating Hyperspectral Reconstruction and YOLOv8. Comput. Electron. Agric. 2025, 237, 110659. [Google Scholar] [CrossRef]
  18. Yang, J.; Xu, B.; Wu, B.; Zhao, R.; Liu, L.; Li, F.; Ai, X.; Fan, L.; Yang, Z. Chlorophyll Dynamic Fusion Based on High-Throughput Remote Sensing and Machine Learning Algorithms for Cotton Yield Prediction. Field Crops Res. 2025, 333, 110057. [Google Scholar] [CrossRef]
  19. Wang, Y.; Zhao, C.; Tian, H.; Xing, Z.; Yue, X.; Liu, S.; He, Y.; Bai, J.; Hao, L.; Zhu, M.; et al. NIRS-Based Detection Advances in Agriculture: Data Enhancement, Characteristic Wavelength Selection and Modelling Techniques. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2025, 343, 126611. [Google Scholar] [CrossRef]
  20. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision–ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  21. Zhu, X.; Chen, F.; Zhang, X.; Zheng, Y.; Peng, X.; Chen, C. Detection the Maturity of Multi-Cultivar Olive Fruit in Orchard Environments Based on Olive-EfficientDet. Sci. Hortic. 2024, 324, 112607. [Google Scholar] [CrossRef]
  22. Kodors, S.; Zarembo, I.; Lācis, G.; Litavniece, L.; Apeināns, I.; Sondors, M.; Pacejs, A. Autonomous Yield Estimation System for Small Commercial Orchards Using UAV and AI. Drones 2024, 8, 734. [Google Scholar] [CrossRef]
  23. Zhou, M.; Xing, R.; Han, D.; Qi, Z.; Li, G. PDT: UAV Target Detection Dataset for Pests and Diseases Tree. In Computer Vision–ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2025; pp. 56–72. [Google Scholar]
  24. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  25. Xu, X.; Mao, Z.; Wang, X.; Tu, Q.; Shen, J. Dynamic Anchor: Density Map Guided Small Object Detector for Tiny Persons. Comput. Vis. Image Underst. 2025, 255, 104325. [Google Scholar] [CrossRef]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  27. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  28. Li, R.; Shen, Y. YOLOSR-IST: A Deep Learning Method for Small Target Detection in Infrared Remote Sensing Images Based on Super-Resolution and YOLO. Signal Process. 2023, 208, 108962. [Google Scholar] [CrossRef]
  29. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  30. Sunkara, R.; Luo, T. No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects. In Machine Learning and Knowledge Discovery in Databases; Amini, M.-R., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2023; pp. 443–459. [Google Scholar]
  31. Cui, Y.; Ren, W.; Knoll, A. Omni-Kernel Network for Image Restoration. Proc. AAAI Conf. Artif. Intell. 2024, 38, 1426–1434. [Google Scholar] [CrossRef]
  32. Li, J.; Li, C.; Zeng, S.; Luo, X.; Chen, C.L.P.; Yang, C. A Lightweight Pineapple Detection Network Based on YOLOv7-Tiny for Agricultural Robot System. Comput. Electron. Agric. 2025, 231, 109944. [Google Scholar] [CrossRef]
  33. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6027–6037. [Google Scholar]
  34. Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-Shaped Convolution and Scale-Based Dynamic Loss for Infrared Small Target Detection. Proc. AAAI Conf. Artif. Intell. 2025, 39, 9202–9210. [Google Scholar] [CrossRef]
  35. Huang, D.; Zhang, G.; Li, Z.; Liu, K.; Luo, W. Light-YOLO: A Lightweight and High-Performance Network for Detecting Small Obstacles on Roads at Night. Comput. Vis. Image Underst. 2025, 259, 104428. [Google Scholar] [CrossRef]
  36. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo Algorithm Developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
  37. Arya, R.K.; Peddi, R.; Srivastava, R. Hyperspectral Image Classification Using Hybrid Convolutional-Based Cross-Patch Retentive Network. Comput. Vis. Image Underst. 2025, 257, 104382. [Google Scholar] [CrossRef]
Figure 1. Sample images of Self-built Dataset.
Figure 2. Sample images after data augmentation. (a) RandomBrightnessContrast and RandomRain; (b) RandomShadow; (c) RandomSnow and ISONoise; (d) RandomSunFlare; (e) ImageCompression and RandomFog; (f) ImageCompression, RandomShadow and RandomSunFlare; (g) RandomRain and RandomSnow; (h) RandomBrightnessContrast, RandomRain and RandomShadow; (i) ImageCompression, RandomFog and RandomShadow; (j) GaussNoise and RandomShadow; (k) RandomBrightnessContrast, RandomSnow and RandomShadow; (l) RandomBrightnessContrast, RandomRain and RandomSunFlare.
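The augmentation names in the Figure 2 caption correspond to transforms in the Albumentations library. The following is a minimal sketch of how such a pipeline can be composed; all probabilities and parameter values here are illustrative assumptions, not the settings used in this study.

```python
# Illustrative augmentation pipeline built from the transforms named in
# Figure 2. Probabilities below are assumptions, not the authors' settings.
import albumentations as A
import cv2

augment = A.Compose(
    [
        A.RandomBrightnessContrast(p=0.5),
        A.RandomRain(p=0.3),
        A.RandomShadow(p=0.3),
        A.RandomSnow(p=0.2),
        A.ISONoise(p=0.2),
        A.RandomSunFlare(p=0.1),
        A.RandomFog(p=0.2),
        A.GaussNoise(p=0.2),
        A.ImageCompression(p=0.3),
    ],
    # Keep YOLO-format bounding boxes consistent with the augmented image.
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = cv2.imread("jujube.jpg")  # hypothetical input image
out = augment(image=image, bboxes=[[0.5, 0.5, 0.1, 0.1]], class_labels=[0])
aug_image, aug_boxes = out["image"], out["bboxes"]
```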
Figure 3. Sample images from the Olive Fruit Object Detection dataset.
Figure 4. Sample images from the CherryBBCH72 dataset.
Figure 5. Sample images from the Pests and Diseases Tree dataset.
Figure 6. The structure of the JFST-DETR model. This figure is independently redrawn by the authors based on common schematic conventions widely employed in computer vision research.
Figure 7. The structure of the SEPN.
Figure 8. The structure of the SCM. (a) Space-to-depth layers; (b) non-strided convolutional layers; (c) convolutional blocks used to adjust the channel dimension of the feature maps. Inspired by the architectural design in [30], this figure is independently redrawn by the authors in accordance with common schematic conventions.
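For readers unfamiliar with the space-to-depth pattern that panels (a) and (b) reference, the sketch below illustrates the general SPD-Conv-style idea [30] that the SCM builds on: spatial blocks are folded into the channel dimension, and a non-strided convolution then adjusts the channels, so downsampling discards no pixel information. Module and parameter names are illustrative, not the authors' implementation.

```python
# A minimal space-to-depth + non-strided convolution sketch (after SPD-Conv [30]).
import torch
import torch.nn as nn

class SpaceToDepthConv(nn.Module):
    """Downsample by rearranging 2x2 spatial blocks into channels, then apply
    a stride-1 convolution, so no pixels are discarded during downsampling."""
    def __init__(self, in_ch: int, out_ch: int, block: int = 2):
        super().__init__()
        self.block = block
        # Non-strided (stride 1) convolution adjusts the channel dimension.
        self.conv = nn.Conv2d(in_ch * block * block, out_ch, kernel_size=3,
                              stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, C*block^2, H/block, W/block)
        x = nn.functional.pixel_unshuffle(x, self.block)
        return self.act(self.bn(self.conv(x)))

feat = torch.randn(1, 64, 80, 80)
print(SpaceToDepthConv(64, 128)(feat).shape)  # torch.Size([1, 128, 40, 40])
```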
Figure 9. The structure of the GAAM.
Figure 10. The structure of DySample. Based on the core design ideas in [33], this figure is independently redrawn, with optimized details, by the authors following common schematic conventions.
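The core idea behind DySample [33], as depicted in Figure 10, is content-aware upsampling: a lightweight layer predicts per-position sampling offsets, and the feature map is resampled at the shifted positions. The following sketch illustrates this pattern under simplifying assumptions; the offset scaling and module names are illustrative, not the published implementation.

```python
# Simplified dynamic-upsampling sketch in the spirit of DySample [33]:
# a 1x1 conv predicts sampling offsets, and grid_sample resamples the map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicUpsample(nn.Module):
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Predict (x, y) offsets for each upsampled position.
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        offsets = self.offset(x) * 0.25                 # small learned shifts (assumed scale)
        offsets = F.pixel_shuffle(offsets, self.scale)  # (B, 2, H*s, W*s)
        # Base sampling grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, h * self.scale, device=x.device)
        xs = torch.linspace(-1, 1, w * self.scale, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        grid = grid + offsets.permute(0, 2, 3, 1)       # shift each sample point
        return F.grid_sample(x, grid, align_corners=False)

print(DynamicUpsample(64)(torch.randn(1, 64, 40, 40)).shape)  # (1, 64, 80, 80)
```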
Figure 11. The structure of PSConv. (a) Parallel convolution operations applied to the feature map in four directions; (b) output feature maps concatenated along the channel dimension; (c) normalization and adjustment of the number of channels in the output feature map. Inspired by the architecture in [34], this figure is independently redrawn by the authors using common schematic conventions.
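Following the description in the Figure 11 caption, a pinwheel-shaped convolution can be sketched as four parallel branches with asymmetric padding whose outputs are concatenated along channels and then fused back to the target channel count. Kernel sizes, padding values, and module names below are assumptions for illustration, not the exact configuration of [34].

```python
# Minimal pinwheel-shaped convolution sketch: four directional branches with
# asymmetric zero-padding, channel concatenation, and a 1x1 fusion conv.
import torch
import torch.nn as nn

class PinwheelConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        branch = out_ch // 4
        # Asymmetric padding (left, right, top, bottom), one direction per branch.
        pads = [(3, 0, 0, 0), (0, 3, 0, 0), (0, 0, 3, 0), (0, 0, 0, 3)]
        # Horizontal kernels for left/right branches, vertical for top/bottom.
        kernels = [(1, 4), (1, 4), (4, 1), (4, 1)]
        self.branches = nn.ModuleList(
            nn.Sequential(nn.ZeroPad2d(p), nn.Conv2d(in_ch, branch, k, bias=False))
            for p, k in zip(pads, kernels)
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(branch * 4, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),   # normalization, as in panel (c)
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

print(PinwheelConv(64, 128)(torch.randn(1, 64, 40, 40)).shape)  # (1, 128, 40, 40)
```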
Figure 12. Impact of key improvements on feature extraction. Red boxes denote the model's detection results; brighter colors indicate regions receiving greater attention from the model. Input image resolution: 2736 × 2736 pixels.
Figure 13. Visualization of model performance metrics. Different lowercase letters indicate statistically significant differences (p < 0.05) between models for the same metric; shared letters indicate no significant difference, and letters earlier in the alphabet correspond to higher values.
Figure 14. Detection results in different scenarios. Input image resolution: 2736 × 2736 pixels.
Table 1. Self-built Dataset Details.

| After Data Augmentation | Training Set | Test Set | Validation Set | Total |
|---|---|---|---|---|
| Self-built Dataset (images) | 841 | 241 | 120 | 1202 |
| Number of targets | 4338 | 1243 | 649 | 6230 |
Table 2. Olive Fruit Object Detection Dataset Details.

| | Training Set | Test Set | Validation Set | Total |
|---|---|---|---|---|
| Olive Fruit Object Detection (images) | 680 | 195 | 97 | 972 |
| Number of targets | 11,074 | 2893 | 1641 | 15,608 |
Table 3. CherryBBCH72 Dataset Details.

| | Training Set | Test Set | Validation Set | Total |
|---|---|---|---|---|
| CherryBBCH72 (images) | 1720 | 492 | 246 | 2458 |
| Number of targets | 13,793 | 3732 | 2068 | 19,593 |
Table 4. Pests and Diseases Tree Dataset Details.

| | Training Set | Test Set | Validation Set | Total |
|---|---|---|---|---|
| Pests and Diseases Tree (images) | 4536 | 567 | 567 | 5670 |
| Number of targets | 91,126 | 11,599 | 12,655 | 115,380 |
Table 5. Performance of the SEPN on the Self-built Dataset, Olive Fruit Object Detection, and CherryBBCH72.

| Dataset | Model | P | R | F1 | mAP@50 | mAP@50:95 | GFLOPs | Parameters (M) | FPS |
|---|---|---|---|---|---|---|---|---|---|
| Self-built Dataset | Baseline | 92.2 ± 0.25 b | 83.1 ± 0.25 b | 87.4 ± 0.24 b | 91.7 ± 0.20 b | 72.1 ± 0.23 b | 56.9 | 20 | 88.7 ± 1.54 a |
| | Baseline + P2 | 92.4 ± 0.28 b | 82.3 ± 0.21 c | 87.0 ± 0.13 c | 91.9 ± 0.14 b | 67.2 ± 0.01 c | 78.1 | 19 | 60.6 ± 0.86 c |
| | Baseline + SEPN | 94.0 ± 0.19 a | 85.2 ± 0.25 a | 89.4 ± 0.14 a | 92.5 ± 0.19 a | 72.8 ± 0.21 a | 65.2 | 20 | 72.4 ± 0.96 b |
| Olive Fruit Object Detection | Baseline | 75.6 ± 0.13 c | 70.0 ± 0.15 a | 72.7 ± 0.21 c | 76.1 ± 0.18 c | 37.4 ± 0.17 b | 56.9 | 20 | 84.5 ± 0.83 a |
| | Baseline + P2 | 76.8 ± 0.17 b | 69.7 ± 0.26 b | 73.1 ± 0.18 b | 76.5 ± 0.21 b | 39.1 ± 0.27 a | 78.1 | 19 | 60.4 ± 0.99 c |
| | Baseline + SEPN | 78.4 ± 0.36 a | 69.5 ± 0.20 b | 73.7 ± 0.14 a | 77.4 ± 0.12 a | 39.0 ± 0.13 a | 65.2 | 20 | 69.2 ± 0.57 b |
| CherryBBCH72 | Baseline | 79.1 ± 0.30 c | 76.0 ± 0.33 a | 77.5 ± 0.16 b | 80.6 ± 0.15 b | 31.0 ± 0.25 b | 56.9 | 20 | 85.0 ± 0.71 a |
| | Baseline + P2 | 79.6 ± 0.29 b | 74.0 ± 0.17 b | 76.7 ± 0.23 c | 80.9 ± 0.17 b | 31.4 ± 0.25 a | 78.1 | 19 | 63.0 ± 1.09 c |
| | Baseline + SEPN | 82.0 ± 0.17 a | 76.3 ± 0.21 a | 79.1 ± 0.17 a | 82.3 ± 0.37 a | 31.3 ± 0.22 ab | 65.2 | 20 | 71.7 ± 0.36 b |
Note: P, precision; R, recall; F1, F1-score; mAP@50, mean average precision at an IoU threshold of 0.5; mAP@50:95, mean average precision averaged over IoU thresholds from 0.5 to 0.95 (step 0.05); GFLOPs, giga floating-point operations; Parameters, number of model parameters in millions; FPS, frames per second; P2, detection head added at the P2 feature level; SEPN, Spatial Enhancement Pyramid Network. Values are mean ± SD over n = 5 runs. Different superscript letters indicate significant differences (p < 0.05).
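As a quick consistency check, the F1 values reported in these tables follow the standard harmonic mean of precision and recall; for instance, the Self-built Dataset baseline row:

```python
# F1 is the harmonic mean of precision (P) and recall (R).
def f1_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

print(round(f1_score(92.2, 83.1), 1))  # 87.4, matching the baseline row of Table 5
```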
Table 6. Generalization Performance of JFST-DETR on Self-built Dataset, Olive Fruit Object Detection, and Pests and Diseases Tree.

| Dataset | Model | P | R | F1 | mAP@50 |
|---|---|---|---|---|---|
| Self-built Dataset | RT-DETR | 92.2 ± 0.17 c | 83.1 ± 0.28 c | 87.4 ± 0.30 c | 91.7 ± 0.20 c |
| | RT-DETR + SEPN | 94.0 ± 0.26 a | 85.2 ± 0.15 b | 89.4 ± 0.13 b | 92.5 ± 0.16 b |
| | JFST-DETR | 93.0 ± 0.27 b | 86.8 ± 0.22 a | 89.8 ± 0.17 a | 94.3 ± 0.12 a |
| Olive Fruit Object Detection | RT-DETR | 75.6 ± 0.15 c | 70.0 ± 0.31 b | 72.7 ± 0.15 c | 76.1 ± 0.23 c |
| | RT-DETR + SEPN | 78.4 ± 0.15 a | 69.5 ± 0.35 c | 73.7 ± 0.07 b | 77.4 ± 0.15 b |
| | JFST-DETR | 77.8 ± 0.31 b | 73.0 ± 0.23 a | 75.4 ± 0.15 a | 79.4 ± 0.23 a |
| Pests and Diseases Tree | RT-DETR | 87.4 ± 0.10 b | 85.6 ± 0.23 b | 86.5 ± 0.23 b | 92.2 ± 0.23 b |
| | RT-DETR + SEPN | 87.7 ± 0.23 a | 85.7 ± 0.18 b | 86.7 ± 0.26 ab | 92.7 ± 0.13 a |
| | JFST-DETR | 87.9 ± 0.29 a | 86.1 ± 0.29 a | 86.9 ± 0.14 a | 92.9 ± 0.25 a |
Note: P, precision; R, recall; F1, F1-score; mAP@50, mean average precision at an intersection-over-union (IoU) threshold of 0.5; SEPN, Spatial Enhancement Pyramid Network. Different superscript letters indicate significant differences (p < 0.05).
Table 7. Comparison of ablation test results for JFST-DETR.

| Model | S | D | P | P | R | F1 | mAP@50 | mAP@50:95 | Parameters (M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RT-DETR | ─ | ─ | ─ | 92.2 ± 0.27 d | 83.1 ± 0.15 g | 87.4 ± 0.25 e | 91.7 ± 0.29 e | 72.1 ± 0.34 e | 20 | 56.9 | 88.7 ± 1.24 b |
| | ✓ | ─ | ─ | 94.0 ± 0.14 b | 85.2 ± 0.25 e | 89.4 ± 0.14 b | 92.5 ± 0.21 d | 72.8 ± 0.35 d | 20 | 65.2 | 72.4 ± 0.57 d |
| | ─ | ✓ | ─ | 91.7 ± 0.07 e | 85.7 ± 0.17 d | 88.6 ± 0.08 c | 92.4 ± 0.14 d | 72.1 ± 0.23 e | 20 | 57 | 92.8 ± 0.35 a |
| | ─ | ─ | ✓ | 89.4 ± 0.21 h | 84.9 ± 0.23 ef | 87.1 ± 0.19 f | 92.5 ± 0.31 d | 72.0 ± 0.37 e | 19 | 56.5 | 79.5 ± 0.40 c |
| | ✓ | ✓ | ─ | 90.1 ± 0.29 g | 86.5 ± 0.23 b | 88.3 ± 0.15 d | 93.4 ± 0.20 b | 73.5 ± 0.07 c | 20 | 65.2 | 65.6 ± 1.07 e |
| | ─ | ✓ | ✓ | 94.8 ± 0.29 a | 84.8 ± 0.31 f | 89.6 ± 0.31 ab | 92.8 ± 0.15 c | 73.1 ± 0.12 d | 19 | 56.5 | 80.4 ± 0.86 c |
| | ✓ | ─ | ✓ | 91.3 ± 0.33 f | 86.0 ± 0.18 c | 88.6 ± 0.21 c | 93.2 ± 0.28 b | 74.8 ± 0.15 b | 20 | 64.7 | 63.9 ± 1.24 f |
| JFST-DETR | ✓ | ✓ | ✓ | 93.0 ± 0.36 c | 86.8 ± 0.20 a | 89.8 ± 0.16 a | 94.3 ± 0.14 a | 75.2 ± 0.13 a | 20 | 64.7 | 65.5 ± 0.65 e |
Note: P, precision; R, recall; F1, F1-score; mAP@50, mean average precision at an intersection-over-union (IoU) threshold of 0.5; mAP@50:95, mean average precision averaged over IoU thresholds from 0.5 to 0.95 with step 0.05; Parameters, number of model parameters in millions; GFLOPs, giga floating-point operations; FPS, frames per second. In the module columns: S, SEPN (Spatial Enhancement Pyramid Network); D, DySample (dynamic upsampling module); P, PSConv (Pinwheel-Shaped Convolution). "✓" indicates the module is enabled; "─" indicates it is disabled. Different superscript letters indicate significant differences (p < 0.05).
Table 8. Comparison of model performance.

| Model | P | R | F1 | mAP@50 | mAP@50:95 | Parameters (M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|---|
| YOLOv5m | 92.7 ± 0.29 bc | 85.4 ± 0.13 c | 88.9 ± 0.07 c | 93.2 ± 0.15 bc | 74.2 ± 0.16 d | 25 | 64 | 106.1 ± 0.35 c |
| YOLOv8m | 93.9 ± 0.32 a | 85.3 ± 0.22 c | 89.4 ± 0.12 b | 93.0 ± 0.08 cd | 75.9 ± 0.14 a | 26 | 78.7 | 72.6 ± 1.04 h |
| YOLOv8-P2 | 90.5 ± 0.25 f | 80.6 ± 0.16 i | 85.3 ± 0.14 i | 88.9 ± 0.10 i | 68.6 ± 0.14 h | 3 | 12.2 | 163.1 ± 0.44 a |
| YOLOv9m | 92.2 ± 0.19 d | 87.5 ± 0.16 a | 89.8 ± 0.16 a | 93.3 ± 0.22 b | 75.4 ± 0.14 b | 20 | 76.5 | 74.4 ± 0.82 g |
| YOLOv10m | 92.3 ± 0.21 d | 84.8 ± 0.27 d | 88.4 ± 0.17 d | 91.3 ± 0.30 g | 74.0 ± 0.21 d | 16 | 63.4 | 101.6 ± 0.66 d |
| YOLOv11m | 91.5 ± 0.15 e | 87.1 ± 0.31 b | 89.3 ± 0.26 b | 92.9 ± 0.14 de | 75.1 ± 0.12 c | 20 | 67.6 | 88.8 ± 0.67 e |
| YOLOv12m | 92.2 ± 0.15 d | 84.2 ± 0.35 e | 88.0 ± 0.24 e | 92.7 ± 0.10 e | 73.7 ± 0.25 e | 20 | 67.1 | 80.6 ± 0.47 f |
| YOLOv12s | 90.4 ± 0.22 f | 81.6 ± 0.19 h | 85.8 ± 0.13 h | 90.2 ± 0.29 h | 69.8 ± 0.20 g | 9 | 21.2 | 117.0 ± 0.74 b |
| RT-DETR-P2 | 92.4 ± 0.37 cd | 82.3 ± 0.51 g | 87.0 ± 0.19 g | 91.9 ± 0.14 f | 67.2 ± 0.08 i | 19 | 78.1 | 60.6 ± 0.42 j |
| RT-DETR | 92.2 ± 0.29 d | 83.1 ± 0.24 f | 87.4 ± 0.21 f | 91.7 ± 0.07 f | 72.1 ± 0.32 f | 20 | 56.9 | 88.7 ± 0.88 e |
| JFST-DETR | 93.0 ± 0.30 b | 86.8 ± 0.28 b | 89.8 ± 0.13 a | 94.3 ± 0.15 a | 75.2 ± 0.14 bc | 20 | 64.7 | 65.5 ± 1.95 i |
Note: P, precision; R, recall; F1, F1-score; mAP@50, mean average precision at an intersection-over-union (IoU) threshold of 0.5; mAP@50:95, mean average precision averaged over IoU thresholds from 0.5 to 0.95 with step 0.05; Parameters, number of model parameters in millions; GFLOPs, giga floating-point operations; FPS, frames per second. Different superscript letters indicate significant differences (p < 0.05).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
