1. Introduction
Cucumber (Cucumis sativus L.) is an important vegetable crop in protected agriculture and is cultivated worldwide. China has long held a leading position in cucumber planting area and total production [1], and greenhouse cucumber production in China has high yield potential and substantial economic value [2]. Cucumber harvesting, however, still depends heavily on manual labor. Rising labor costs and a shrinking agricultural labor pool are accelerating the demand for robotic greenhouse harvesting [3]. In selective harvesting, target recognition must be completed before grasping and cutting, so stable visual perception directly affects harvesting safety and operational efficiency [4]. Object detection and localization are the key links between visual sensing and robotic execution [5]. In real greenhouse scenes, visual perception is degraded by canopy clutter, variable lighting, and the visual similarity between fruits and surrounding plant organs. Robust detection under practical harvesting conditions therefore remains difficult [6].
Traditional cucumber recognition methods mainly relied on image processing and shallow learning. Sun et al. [7] enhanced greenhouse canopy images with multi-scale Retinex and color restoration, and then segmented cucumber targets with vegetation indices under variable natural lighting. Li et al. [8] combined color analysis, texture analysis, support vector machine classification, and SIFT keypoint filtering for greenhouse cucumber detection. Bao et al. [9] developed a multi-template matching method for cucumber recognition in natural environments and improved recognition accuracy by constructing a scaled and rotated template library. Mao et al. [10] introduced a multi-path convolutional feature extractor and combined color component selection with an SVM classifier for cucumber recognition in natural environments. These methods provided useful early solutions, but they depend on hand-crafted color, texture, or template features and on staged decision rules. Their performance is sensitive to illumination, viewpoint, and occlusion changes, and their ability to adapt to complex greenhouse canopies remains limited.
Deep learning detectors reduced manual feature engineering and pushed agricultural target recognition toward end-to-end learning. Ren et al. [11] established the Faster R-CNN framework with region proposal networks and shared convolutional features. Redmon et al. [12] proposed YOLO and made single-stage real-time detection practical through direct regression. For greenhouse cucumber recognition, Wang et al. [13] enhanced YOLOv5 with Cr-channel pretraining to suppress near-color interference from leaves and stems. Su et al. [14] further improved YOLOv5 for cucumber picking in near-color greenhouse backgrounds and explicitly addressed occlusion and scale variation. Liu et al. [15] applied instance segmentation to greenhouse cucumber detection and improved pixel-level localization under complex occlusion. Bai et al. [16] combined transfer learning with U-Net and one-stage detectors to improve segmentation and recognition robustness in complex natural environments. Kim et al. [17] introduced amodal segmentation to recover occluded cucumber regions and improve perception under partial visibility. Even so, CNN-based detectors still rely mainly on local aggregation. Luo et al. [18] showed that the effective receptive field of deep CNNs occupies only a fraction of the theoretical receptive field, which limits long-range relation modeling. For greenhouse cucumber detection, this limitation matters because reliable recognition often requires connecting discontinuous target cues across the canopy.
Transformer detectors provide a different route for object detection. Carion et al. [19] reformulated detection as set prediction and removed anchor design and non-maximum suppression (NMS) from the main pipeline. Zhu et al. [20] improved DETR with deformable attention and strengthened multi-scale modeling for small objects. Zhao et al. [21] then proposed RT-DETR and showed that real-time end-to-end detection can surpass YOLO-style detectors of comparable scale while eliminating NMS. In agricultural vision, Jin et al. [22] proposed a lightweight RT-DETR variant by optimizing the backbone and attention mechanism, which reduced computational complexity by 59.3% for upland rice weed detection. Li et al. [23] proposed GP-DETR based on RT-DETR. They adopted GELAN as the backbone and introduced enhanced intra-scale interaction and feature fusion, achieving 96.4% mAP@0.5 at 100 FPS. Song et al. [24] proposed an occlusion-aware greenhouse harvesting framework and improved harvestability decision reliability under severe occlusion through risk-gated reasoning. Huang et al. [25] proposed a lightweight RT-DETR-based pear detector by replacing the backbone and introducing HiLo attention, which improved perception under illumination variation. Wang et al. [26] developed HESP-DETR through enhanced multi-scale modeling and feature fusion, which improved mAP@0.5 by 1.0% while reducing model complexity and inference cost. Wu et al. [27] further improved RT-DETR for fruit ripeness detection by refining the backbone and attention modules, which improved mAP@0.5 by 2.9% while reducing model size and computational complexity. These studies confirm the potential of Transformer-based detectors in agricultural vision. They also reveal a remaining gap for greenhouse cucumber selective harvesting. Existing RT-DETR variants usually improve the backbone, attention module, or feature fusion for a specific crop task, but they seldom address shallow contour preservation, long-range contextual association, efficient multi-scale fusion, and edge-side deployment within one compact detector.
Greenhouse cucumber selective harvesting presents a morphology-specific detection problem. Mature fruits are long and slender, the peduncle occupies a small image region, and the fruit surface is close in color to stems and leaves. Dense foliage can separate one cucumber into discontinuous visible parts. A suitable detector therefore needs to retain shallow contour cues, model long-range relations along the fruit axis, and fuse multi-scale features without excessive computation.
To address this task, this study proposes MCS-DETR, an efficient multi-scale context-aware detector for selective harvesting of greenhouse cucumbers. Built on RT-DETR, MCS-DETR redesigns three parts of the detection pipeline. The backbone strengthens shallow feature extraction, the top encoder layer improves contextual interaction with limited computation, and the neck reduces redundancy during cross-scale feature aggregation. The main contributions are listed below.
A Gated Large Kernel Aggregation Network (GLKAN) is designed as the backbone to improve shallow feature representation for cucumber contours and peduncle regions.
A Single-Head Attention-based Intra-scale Feature Interaction (SH-AIFI) module is introduced at the top of the encoder to strengthen high-level contextual interaction at low computational cost.
A Weighted Reparameterized Partial Aggregation Fusion (WRPAFusion) module is developed in the neck to improve cross-scale feature aggregation and reduce fusion redundancy.
2. Materials and Methods
2.1. Data Acquisition
The cucumber fruit images used in this study were collected from a vegetable planting base in Gucheng Street, Shouguang City, Shandong Province, China. The crops were Chinese northern prickly cucumbers grown in a greenhouse environment. To cover the growth state of the fruit at different times and under different lighting conditions, images were captured throughout the day, from 7:00 a.m. to 6:00 p.m. The collection included both frontlight and backlight scenes so that the model could be trained to adapt to changing lighting conditions. The images were captured using an OPPO Reno 10 smartphone (Guangdong OPPO Mobile Telecommunications Corp., Ltd., No. 18 Haibin Road, Wusha, Chang’an Town, Dongguan, Guangdong, China). After screening, 3750 valid original images were retained, covering frontlight and backlight conditions, different viewing distances, single and multiple fruit targets, fruit overlap, and leaf and stem occlusion. Representative samples are shown in Figure 1. Because all images were collected from one greenhouse site and one cucumber cultivar, the dataset mainly reflects this specific protected-cultivation condition and does not fully cover variability across different cultivars or greenhouse environments. The generalization issue is further discussed in Section 4.
2.2. Dataset Creation
To enable harvestability recognition, cucumber fruits were annotated according to two criteria, namely fruit maturity and current operability. Fruit maturity was judged by uniform thickness, deep green coloration, and an overall plump shape. Given that the harvesting robot changes its viewing perspective during operation, the operability of each fruit differs across angles. Mature fruits with clearly visible peduncles from the current viewing angle were therefore labeled as harvestable cucumbers, while the remaining fruits were classified as non-harvestable cucumbers. All annotations were performed with LabelImg 1.8.6. Each target was marked by a rectangular bounding box, and the resulting annotation files recorded category labels along with bounding box coordinates.
The original dataset was split into training, validation, and test sets at a ratio of 8:1:1, containing 3000, 375, and 375 images, respectively. To preserve evaluation objectivity, the validation and test sets retained their original data distributions, and offline augmentation was applied exclusively to the training set. Because the visual appearance of cucumber fruits varies with viewing angle, illumination intensity, and occlusion severity, the original images alone cannot cover all potential field conditions. This study applied random scaling from −10% to 10%, horizontal flipping, random rotation from −20° to 20°, and HSV color-space augmentation, with saturation and brightness perturbations both ranging from −25% to 25%. After augmentation, the number of training images increased from 3000 to 5796.
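The split and augmentation settings above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names `split_counts` and `sample_augmentation`, the 0.5 flip probability, and the choice to express each perturbation as a multiplicative gain are assumptions made for the example.

```python
import random


def split_counts(total, ratios=(8, 1, 1)):
    """Return integer subset sizes for a ratio split.

    Any rounding remainder is assigned to the first (training) subset.
    """
    unit = sum(ratios)
    counts = [total * r // unit for r in ratios]
    counts[0] += total - sum(counts)
    return counts


def sample_augmentation(rng):
    """Draw one set of augmentation parameters within the ranges given in the text.

    The 0.5 flip probability is an assumption; the text only names the operation.
    """
    return {
        "scale": 1.0 + rng.uniform(-0.10, 0.10),      # random scaling, -10% to 10%
        "hflip": rng.random() < 0.5,                  # horizontal flipping
        "rotate_deg": rng.uniform(-20.0, 20.0),       # random rotation, -20 to 20 degrees
        "sat_gain": 1.0 + rng.uniform(-0.25, 0.25),   # HSV saturation perturbation
        "val_gain": 1.0 + rng.uniform(-0.25, 0.25),   # HSV brightness perturbation
    }


if __name__ == "__main__":
    print(split_counts(3750))
    print(sample_augmentation(random.Random(0)))
```

For the 3750 retained images, `split_counts(3750)` gives the 3000, 375, and 375 images reported for the training, validation, and test sets.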
2.3. The Improved Model MCS-DETR
The proposed MCS-DETR is an end-to-end detector for harvestability recognition of greenhouse cucumbers. Figure 2 shows its overall architecture. Using RT-DETR as the base detector, MCS-DETR revises three key parts of the detection pipeline, namely the backbone, the top encoder layer, and the cross-scale fusion pathway. The backbone is redesigned to improve shallow feature extraction, the top encoder layer is adjusted to strengthen high-level contextual interaction, and the cross-scale fusion pathway is rebuilt to aggregate multi-scale features more efficiently.
The original backbone does not provide sufficient spatial coverage to perceive slender fruits or distinguish peduncles from adjacent vines. To improve shallow feature extraction, the backbone is replaced with a Gated Large Kernel Aggregation Network (GLKAN), which is composed of Gated Large Kernel Blocks (GLK Blocks). The gated mechanism suppresses background responses from stems and leaves, while large-kernel convolutions expand spatial coverage so that fruit contours and peduncle regions can be captured within a wider local window. In the baseline encoder, standard Multi-Head Self-Attention (MHSA) performs intra-scale interaction on the top-level features. When all channels participate in attention computation, the computational cost and memory overhead become considerable. Full-channel modeling may also introduce irrelevant background details into global associations and weaken the discrimination between harvestable and non-harvestable fruits. Single-Head Attention-based Intra-scale Feature Interaction (SH-AIFI) is therefore introduced at the top layer of the encoder. This module uses Single-Head Self-Attention (SHSA) to apply attention to only a subset of channels, reducing computational cost while strengthening long-range dependencies among high-level features. In the cross-scale fusion pathway, concatenation-based fusion expands the channel dimension at each fusion node and increases the cost of subsequent convolutions. Weighted Reparameterized Partial Aggregation Fusion (WRPAFusion) replaces the original fusion modules to reduce this burden. Fast Normalized Fusion (FNF) fuses multi-scale branches through learnable weights and avoids the channel expansion caused by direct concatenation. Reparameterized Partial Convolution Blocks (RepPConv Blocks) are used inside the Reparameterized Partial Aggregation Block (RPA Block), where the fused features are further refined through partial aggregation and reparameterized convolutions.
2.3.1. Gated Large Kernel Aggregation Network (GLKAN)
The backbone of RT-DETR constructs the feature pyramid by stacking 3 × 3 convolutions stage by stage. This small-kernel stacking structure suffers from low receptive field efficiency and has limited ability to perceive the overall morphology of slender targets. The spatial span of a cucumber fruit from its peduncle junction to the tip requires the backbone to achieve sufficient spatial coverage at relatively shallow layers. Meanwhile, the peduncle region is small and shares a nearly identical color with the surrounding vines, making it difficult for small kernels to capture the difference between peduncles and vines within a single convolution window. To overcome these limitations, the improved backbone alternates Conv downsampling layers with multi-stage GLKAN modules. The backbone outputs four levels of features, P2, P3, P4, and P5, with channel numbers of 128, 256, 384, and 384, respectively. Compared with the baseline, which only outputs P3 to P5, the improved backbone additionally retains the shallow-level P2 features to provide finer-grained edge information for the subsequent fusion stage.
GLKAN is inspired by the cross-stage partial connection strategy of CSPNet [28]. The input features are split along the channel dimension into a shortcut branch and a transform branch. The shortcut branch preserves shallow-level information, while the transform branch passes through several GLK Blocks before being concatenated with the shortcut branch. The overall structure of GLKAN is shown in Figure 3a. Let the input feature be $X \in \mathbb{R}^{C \times H \times W}$. After a 1 × 1 convolution, the input feature is evenly split into a shortcut branch $X_s$ and a transform branch $X_0$. The transform branch is processed by $n$ successive GLK Blocks. The shortcut branch, the transform-branch input, and the outputs of all GLK Blocks are then concatenated and fused through a 1 × 1 convolution, which is expressed as follows:
$$[X_s, X_0] = \mathrm{Split}\big(\mathrm{Conv}_{1 \times 1}(X)\big), \qquad X_i = \mathcal{G}_i(X_{i-1}), \quad i = 1, \dots, n,$$
$$Y = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(X_s, X_0, X_1, \dots, X_n)\big),$$
where $Y$ is the output feature, $\mathcal{G}_i$ denotes the $i$-th GLK Block, and $n$ is the number of stacked GLK Blocks. This concatenation strategy allows shallow-level edge information and multi-stage transformed semantic features to participate jointly in the output representation.
GLK Block retains the information flow organization of the gated convolution block in MambaOut [29] and replaces the spatial mixing branch with a large-kernel reparameterized convolution. The structure is shown in Figure 3b. The input feature is first normalized and expanded by a 1 × 1 convolution, then split along the channel dimension into a gate branch
, an identity branch
, and a convolution branch
. The three branches work together to complete the feature transformation. The core computation of the GLK Block is expressed as:
where
,
, and
are the three branches obtained by evenly splitting the expanded feature along the channel dimension,
is the activation function,
denotes element-wise multiplication, and
denotes the large-kernel spatial mixing operator. The gate branch performs selective channel-wise modulation, enhancing informative responses while suppressing redundant background features.
The convolution branch adopts the Large Kernel Block (LarK Block) from UniRepLKNet [30]. Its structure is shown in Figure 4a. Large-kernel convolutions achieve a wider spatial coverage with fewer layers, enabling the principal axis of the fruit body, the peduncle position, and the surrounding vine arrangement to be captured within a single convolution window. The convolution branch consists of a Dilated Re-param Block (DRB), batch normalization, a channel attention module, and a feed-forward network, which is expressed as:
$$Y_c = \mathrm{FFN}\big(\mathrm{CA}\big(\mathrm{BN}\big(\mathrm{DRB}(X_c)\big)\big)\big),$$
where $X_c$ is the input feature of the convolution branch, $\mathrm{DRB}$ is the Dilated Re-param Block, $\mathrm{BN}$ is batch normalization, $\mathrm{CA}$ is the channel attention module, and $\mathrm{FFN}$ is the feed-forward network.
During training, the Dilated Re-param Block consists of a main 7 × 7 depthwise convolution and several parallel small convolutions with different dilation rates. The structure is shown in Figure 4. The outputs of all branches are summed before being fed into the subsequent layers, which is expressed as:
$$Y = \sum_{k=1}^{K} \mathrm{BN}_k\big(\mathrm{DWConv}_k(X)\big),$$
where $X$ and $Y$ are the input and output of the block, $K$ is the number of parallel branches, $\mathrm{DWConv}_k$ denotes the $k$-th depthwise convolution branch with a specific dilation rate, and $\mathrm{BN}_k$ is the corresponding batch normalization layer. During inference, owing to the additivity of linear operations, the convolution weights and batch normalization parameters of all branches can be fused into a single 7 × 7 depthwise convolution, which is expressed as:
$$Y = W_{\mathrm{eq}} * X + b_{\mathrm{eq}},$$
where $*$ denotes the convolution operation, and $W_{\mathrm{eq}}$ and $b_{\mathrm{eq}}$ are the equivalent convolution kernel and bias after branch fusion. The multi-branch parallel structure during training enhances the feature representation capability, while the branch fusion during inference eliminates the extra computational overhead. In addition, the dilated convolutions sample at spaced positions, so they can reach across regions occluded by leaves and help connect separated fruit segments, which is beneficial for detecting partially occluded cucumbers.
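The branch fusion relies on the fact that a dilated small kernel is equivalent to a sparse larger kernel with zeros inserted between its taps. The sketch below checks this in one dimension with plain Python lists. It is an illustration under simplifying assumptions, not the block used in the paper: batch normalization is taken as already folded into each kernel, the signal has a single channel, and the names `conv1d_same`, `to_dense`, and `merge_branches` are hypothetical.

```python
def conv1d_same(x, kernel, dilation=1):
    """Zero-padded 'same' 1D correlation with an odd-length kernel."""
    half = (len(kernel) // 2) * dilation
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i - half + j * dilation
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def to_dense(kernel, dilation, size):
    """Embed a dilated kernel into a dense kernel of odd length `size`."""
    span = (len(kernel) - 1) * dilation + 1
    assert span <= size and (size - span) % 2 == 0
    dense = [0.0] * size
    start = (size - span) // 2
    for j, w in enumerate(kernel):
        dense[start + j * dilation] = w
    return dense


def merge_branches(main, branches):
    """Add every (kernel, dilation) branch into the main dense kernel."""
    merged = list(main)
    for kernel, dilation in branches:
        for i, w in enumerate(to_dense(kernel, dilation, len(main))):
            merged[i] += w
    return merged


if __name__ == "__main__":
    x = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0, 2.0, 0.0, 1.5]
    main = [0.1, -0.2, 0.3, 0.5, 0.3, -0.2, 0.1]            # 7-tap main branch
    branches = [([0.2, 0.4, 0.2], 1), ([0.5, -0.1, 0.5], 2), ([0.3, 0.0, -0.3], 3)]

    # Training-time view: run every branch and sum the outputs.
    multi = conv1d_same(x, main)
    for kernel, dilation in branches:
        multi = [a + b for a, b in zip(multi, conv1d_same(x, kernel, dilation))]

    # Inference-time view: one merged 7-tap kernel.
    fused = conv1d_same(x, merge_branches(main, branches))
    print(max(abs(a - b) for a, b in zip(multi, fused)))
```

The printed difference is at the level of floating-point rounding, which is the sense in which the merged kernel adds no inference cost over a single 7-tap convolution.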
2.3.2. Single-Head Attention-Based Intra-Scale Feature Interaction (SH-AIFI)
The AIFI module applies standard Multi-Head Self-Attention (MHSA) to the top-level features for intra-scale interaction. MHSA is capable of establishing global spatial associations. However, all channels participate in the attention computation simultaneously, resulting in considerable computational cost and memory consumption, which is unfavorable for subsequent edge-side deployment. In addition, full-channel modeling introduces irrelevant background details such as leaf vein textures into the global associations, interfering with the discrimination between harvestable and non-harvestable fruits.
SH-AIFI replaces the original intra-scale interaction module at the top layer of the encoder. The core modification is to replace MHSA with the Single-Head Self-Attention (SHSA) proposed by Yun et al. [31]. Instead of computing attention across all channels, SHSA splits the $C$ input channels into two branches: an active branch containing $C_a$ channels, which performs global spatial association modeling, and a passive branch containing the remaining $C - C_a$ channels, which skips the attention computation and preserves local spatial information. The two branches are concatenated at the end and fused through a lightweight projection layer. The module structure is shown in Figure 5.
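Let the input feature be $X \in \mathbb{R}^{C \times H \times W}$. It is split along the channel dimension into an active branch $X_a \in \mathbb{R}^{C_a \times H \times W}$ and a passive branch $X_p \in \mathbb{R}^{(C - C_a) \times H \times W}$, where $C_a = rC$ and $r$ is the channel split ratio, a fixed hyperparameter. The active branch is first processed by group normalization, and then three groups of 1 × 1 convolutions generate the query $Q$, key $K$, and value $V$, respectively. The spatial dimensions are flattened, and $N = H \times W$ denotes the number of spatial positions. Let $d_{qk}$ denote the channel dimension of the query and key, which is also a fixed hyperparameter. The single-head attention is computed as:
$$A = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{qk}}}\right), \qquad Z = A V,$$
where $Q$, $K$, and $V$ are the flattened query, key, and value tensors, respectively, $A \in \mathbb{R}^{N \times N}$ is the attention matrix, and $1/\sqrt{d_{qk}}$ is the scaling factor.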
The aggregated result $Z$ of the active branch is reshaped back to the two-dimensional spatial form, concatenated with the passive branch $X_p$ along the channel dimension, and projected to produce the output:
$$Y = \mathrm{Proj}\big(\mathrm{Concat}(Z, X_p)\big),$$
where $\mathrm{Proj}$ consists of a 1 × 1 convolution, batch normalization, and SiLU activation. The batch normalization weights in the projection layer are initialized to zero, so that the module output approximates an identity mapping at the beginning of training, which facilitates stable convergence.
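The active/passive split can be illustrated with a small pure-Python sketch. It is a simplified reading of the description above rather than the module itself: group normalization and the final projection are omitted, each 1 × 1 convolution is modeled as a per-position linear map, and the names `linear`, `shsa`, `c_a`, `wq`, `wk`, and `wv` are assumptions introduced for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def linear(tokens, weight):
    """Per-position linear map, standing in for a 1x1 convolution."""
    cols = len(weight[0])
    return [[sum(t[i] * weight[i][j] for i in range(len(t))) for j in range(cols)]
            for t in tokens]


def shsa(tokens, c_a, wq, wk, wv):
    """Single-head attention on the first c_a channels; the rest pass through.

    tokens: N flattened spatial positions, each a list of C channel values.
    """
    active = [t[:c_a] for t in tokens]
    passive = [t[c_a:] for t in tokens]
    q, k, v = linear(active, wq), linear(active, wk), linear(active, wv)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_i, p_i in zip(q, passive):
        scores = [scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]
        attn = softmax(scores)
        z_i = [sum(w * v_j[c] for w, v_j in zip(attn, v)) for c in range(len(v[0]))]
        out.append(z_i + p_i)  # concatenate attended and untouched channels
    return out


if __name__ == "__main__":
    tokens = [[1.0, 2.0, 9.0, 8.0], [3.0, 4.0, 7.0, 6.0]]
    zero_qk = [[0.0], [0.0]]             # zero scores give uniform attention
    identity_v = [[1.0, 0.0], [0.0, 1.0]]
    print(shsa(tokens, 2, zero_qk, zero_qk, identity_v))
```

With zero query and key weights the attention is uniform, so the first two channels of every position become the mean over positions while the last two channels are returned unchanged, which makes the role of the passive branch easy to see.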
The attention computation is applied to only
channels, and the resulting matrix size is much smaller than that of full-channel multi-head attention. Substituting
and
, the core complexity of the SHSA active branch is
, which is lower than
of MHSA. The single-head structure also eliminates the overhead of multi-head splitting and merging [32].
SH-AIFI is applied only to the highest-level features, where the spatial resolution is the lowest and the computational cost of global attention remains manageable. The active branch establishes spatial associations on compact high-semantic channels, while the passive branch preserves edge and texture information from being diluted. The output features are then fed into the subsequent cross-scale fusion module.
2.3.3. Weighted Reparameterized Partial Aggregation Fusion (WRPAFusion)
In the baseline neck, multi-scale features are fused by channel concatenation, and the fused result is then processed by convolution blocks for feature integration. Concatenation enlarges the channel count at each fusion node, and the subsequent convolutions must compress the expanded channels, which leads to a high computational cost at each node. To reduce the computational burden of the fusion stage, the original fusion modules at all nodes in the neck are replaced with WRPAFusion. The structure is shown in Figure 6. WRPAFusion consists of two parts: a Fast Normalized Fusion (FNF) operator, which performs a weighted summation of the input branches, and three cascaded RPA Blocks, which further transform the fused features.
FNF is derived from the fast normalized fusion method proposed by Tan et al. [33]. This method assigns a learnable scalar weight to each input branch, normalizes the weights, and then performs element-wise weighted summation. Compared with softmax-based fusion, FNF achieves comparable accuracy but faster inference, with approximately 30% acceleration on GPUs. The computation process of FNF is illustrated in Figure 7. The learnable fusion weight $w_i$ corresponding to each input branch is first constrained to be non-negative and then normalized, yielding a branch-level scalar weight coefficient $\hat{w}_i$. Each feature map is scaled by its corresponding scalar coefficient $\hat{w}_i$, and the weighted results from all branches are summed to produce the fusion output.
Let the input features be $X_i$, $i = 1, \dots, N$, where $N$ is the number of branches participating in the fusion. The fusion weights and output of FNF are expressed as:
$$\hat{w}_i = \frac{\mathrm{ReLU}(w_i)}{\epsilon + \sum_{j=1}^{N} \mathrm{ReLU}(w_j)}, \qquad Y = \sum_{i=1}^{N} \hat{w}_i X_i,$$
where $w_i$ is the learnable weight of the $i$-th input branch, $\epsilon$ is a stability constant, $\hat{w}_i$ is the normalized branch-level scalar fusion coefficient, and $Y$ is the fusion output. The same formulation is applied to both two-input nodes ($N = 2$) and three-input nodes ($N = 3$), differing only in the number of input branches. Compared with channel concatenation, the contribution of features at different scales is allocated at the branch level by FNF. The channel dimension is not altered by the fusion operation itself, so no channel expansion is introduced as in concatenation-based fusion.
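The weighting step can be written out in a few lines. The sketch below follows the description above with feature maps flattened to plain lists; the clamp to non-negative values is implemented as a ReLU and the stability constant is set to 1e-4, both of which are assumptions of this example, as are the function names `fnf_weights` and `fnf_fuse`.

```python
def fnf_weights(raw_weights, eps=1e-4):
    """Clamp the learnable weights to be non-negative, then normalize them."""
    positive = [max(w, 0.0) for w in raw_weights]
    denom = eps + sum(positive)
    return [p / denom for p in positive]


def fnf_fuse(features, raw_weights, eps=1e-4):
    """Weighted sum of equally sized feature maps (flattened to 1D lists)."""
    size = len(features[0])
    assert all(len(f) == size for f in features), "branches must share one shape"
    coeffs = fnf_weights(raw_weights, eps)
    return [sum(c * f[i] for c, f in zip(coeffs, features)) for i in range(size)]


if __name__ == "__main__":
    p_low = [1.0, 2.0, 3.0, 4.0]
    p_high = [4.0, 3.0, 2.0, 1.0]
    print(fnf_weights([1.0, 1.0]))              # two-input node
    print(fnf_weights([0.5, 2.0, -1.0]))        # three-input node, one weight clamped
    print(fnf_fuse([p_low, p_high], [1.0, 3.0]))
```

The fused output has the same length as each input, whereas concatenating the two branches would double it, which is the channel expansion that the weighted sum avoids.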
The features output by FNF are fed into the RPA Block for further transformation. The RPA Block adopts a partial aggregation architecture: the input features are processed by two 1 × 1 convolutions to generate a main branch $X_m$ and a bypass branch $X_b$. The main branch undergoes deep transformation, while the bypass branch is directly preserved. The two branches are concatenated at the end and mapped to the output channels through a 1 × 1 convolution. This strategy of splitting the input into a transformed part and a preserved part before aggregation reduces redundant computation and improves inference efficiency. The structures of the RPA Block and its internal RepPConv Block are shown in Figure 8, where Figure 8a illustrates the main branch and bypass branch architecture, and Figure 8b shows the training and inference configurations of RepPConv.
The main branch cascades three RepPConv blocks. The output is concatenated with the bypass branch and then projected through a 1 × 1 convolution, which is expressed as:
$$Y = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(\mathcal{F}(X_m), X_b)\big),$$
where $\mathrm{Conv}_{1 \times 1}$ is the 1 × 1 convolution mapping, and $\mathcal{F}$ denotes the mapping obtained by sequentially stacking three RepPConv blocks.
The spatial mixing in RepPConv combines partial convolution with structural reparameterization. The partial convolution is inspired by FasterNet [34]. Only a fraction of the input channels undergo standard convolution, while the remaining channels are directly passed through, thereby reducing both the computation and memory access overhead. Let the input of RepPConv be $X$. It is split along the channel dimension into a convolution branch $X_c$ and a bypass branch $X_b$. The convolution branch occupies a fixed fraction of the channels and internally adopts the RepVGG structural reparameterization [35]. During training, three parallel branches are used: a 3 × 3 convolution, a 1 × 1 convolution, and an identity mapping:
$$Y_c = \mathrm{BN}_3(W_3 * X_c) + \mathrm{BN}_1(W_1 * X_c) + \mathrm{BN}_0(X_c),$$
where $W_3$ and $W_1$ are the 3 × 3 and 1 × 1 convolution kernels, respectively, and $\mathrm{BN}_3$, $\mathrm{BN}_1$, and $\mathrm{BN}_0$ are the batch normalization layers on each branch. During inference, the above multi-branch structure is folded into an equivalent single 3 × 3 convolution:
$$W_{\mathrm{eq}} = \hat{W}_3 + \mathrm{Pad}(\hat{W}_1) + \hat{W}_0, \qquad b_{\mathrm{eq}} = \hat{b}_3 + \hat{b}_1 + \hat{b}_0, \qquad Y_c = W_{\mathrm{eq}} * X_c + b_{\mathrm{eq}},$$
where $\hat{W}_3$, $\hat{W}_1$, and $\hat{W}_0$ are the convolution kernels of each branch after fusion with batch normalization, with $\hat{b}_3$, $\hat{b}_1$, and $\hat{b}_0$ the corresponding biases, $\mathrm{Pad}$ zero-pads the 1 × 1 kernel to 3 × 3, and $W_{\mathrm{eq}}$ and $b_{\mathrm{eq}}$ are the equivalent parameters used during deployment.
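The folding step can be checked numerically on a single-channel image. The following sketch is an illustration under stated assumptions, not the deployed module: it uses one channel, zero padding, bias-free convolutions, and batch normalization in inference mode with fixed statistics. The helper names (`fold_bn`, `pad_to_3x3`, `merge_rep_branches`, `train_forward`) and all numeric values are hypothetical.

```python
def batch_norm(v, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch normalization of one scalar activation."""
    return gamma * (v - mean) / (var + eps) ** 0.5 + beta


def conv3x3(image, kernel, bias=0.0):
    """Zero-padded 3x3 correlation on a single-channel image."""
    h, w = len(image), len(image[0])
    out = [[bias] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in range(3):
                for dx in range(3):
                    yy, xx = y + dy - 1, x + dx - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        out[y][x] += kernel[dy][dx] * image[yy][xx]
    return out


def fold_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Fold batch normalization into a bias-free kernel; returns (kernel, bias)."""
    scale = gamma / (var + eps) ** 0.5
    return [[wt * scale for wt in row] for row in kernel], beta - mean * scale


def pad_to_3x3(center):
    """Place a 1x1 kernel value at the center of an otherwise zero 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, center, 0.0], [0.0, 0.0, 0.0]]


def merge_rep_branches(w3, bn3, w1, bn1, bn0):
    """Merge the 3x3, 1x1, and identity branches into one 3x3 kernel and bias."""
    k3, b3 = fold_bn(w3, *bn3)
    k1, b1 = fold_bn(pad_to_3x3(w1), *bn1)
    k0, b0 = fold_bn(pad_to_3x3(1.0), *bn0)  # identity as a centered unit kernel
    kernel = [[k3[i][j] + k1[i][j] + k0[i][j] for j in range(3)] for i in range(3)]
    return kernel, b3 + b1 + b0


def train_forward(image, w3, bn3, w1, bn1, bn0):
    """Three-branch training-time output, computed branch by branch."""
    c3, c1 = conv3x3(image, w3), conv3x3(image, pad_to_3x3(w1))
    return [[batch_norm(c3[y][x], *bn3) + batch_norm(c1[y][x], *bn1)
             + batch_norm(image[y][x], *bn0)
             for x in range(len(image[0]))] for y in range(len(image))]


if __name__ == "__main__":
    image = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0], [2.0, 1.0, -0.5]]
    w3 = [[0.1, 0.2, -0.1], [0.0, 0.5, 0.3], [-0.2, 0.1, 0.4]]
    bn3, bn1, bn0 = (1.2, 0.1, 0.3, 0.9), (0.8, -0.2, 0.1, 1.5), (1.0, 0.05, 0.2, 0.7)
    kernel, bias = merge_rep_branches(w3, bn3, 0.7, bn1, bn0)
    a, b = train_forward(image, w3, bn3, 0.7, bn1, bn0), conv3x3(image, kernel, bias)
    print(max(abs(a[y][x] - b[y][x]) for y in range(3) for x in range(3)))
```

Each `bn` tuple is ordered as (gamma, beta, running mean, running variance). The printed difference between the three-branch output and the merged convolution is at the level of floating-point rounding.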
2.4. Evaluation Metrics and Experimental Setup
2.4.1. Evaluation Metrics
Common evaluation metrics for object detection were used to assess model performance. Detection accuracy was measured by Precision (P), Recall (R), mAP@0.5, and mAP@0.5:0.95. Model complexity and deployment cost were characterized by the number of parameters (Params), floating-point operations (FLOPs), and weight size. Params denotes the total number of learnable parameters in the model. FLOPs denotes the number of floating-point operations required for a single forward pass under a fixed input size. Weight size denotes the size of the exported model weight file.
The overlap between a predicted box and a ground-truth box was measured by the Intersection over Union (IoU). Let $B_p$ denote the predicted box and $B_g$ denote the ground-truth box. IoU is defined as:
$$\mathrm{IoU} = \frac{|B_p \cap B_g|}{|B_p \cup B_g|}$$
A prediction was counted as a true positive (TP) when the predicted class matched the ground-truth class and $\mathrm{IoU} \geq \tau$ for the IoU threshold $\tau$. Predictions that did not satisfy this condition were counted as false positives (FP). Ground-truth objects that were not matched by any predicted box were counted as false negatives (FN). Precision and Recall are defined as
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
When the confidence threshold changes, a precision–recall curve (P-R curve) can be obtained. For a single class, the Average Precision (AP) is defined as the area under this curve:
$$AP = \int_{0}^{1} P(R)\, dR$$
At a given IoU threshold $\tau$, the mean Average Precision (mAP) is calculated as the average AP over $N_c$ classes:
$$\mathrm{mAP}_{\tau} = \frac{1}{N_c} \sum_{c=1}^{N_c} AP_c(\tau)$$
Here, mAP@0.5 corresponds to $\tau = 0.5$. mAP@0.5:0.95 is the average mAP over IoU thresholds from 0.50 to 0.95 with a step size of 0.05, and is computed as:
$$\mathrm{mAP@0.5{:}0.95} = \frac{1}{10} \sum_{\tau \in \{0.50, 0.55, \dots, 0.95\}} \mathrm{mAP}_{\tau}$$
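As a concrete reading of these definitions, the sketch below computes IoU and a single-class AP for one image in plain Python. It is a simplified illustration: predictions are matched greedily in order of confidence, AP is taken as the area under the monotone precision envelope, and the box format `(x1, y1, x2, y2)` and the function names are assumptions of the example. Evaluation toolkits differ in interpolation details, so the numbers need not match a specific framework.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, thr=0.5):
    """Single-class AP for one image.

    preds: list of (confidence, box); gts: list of ground-truth boxes.
    """
    if not gts:
        return 0.0
    matched, tp, fp, points = set(), 0, 0, []
    for _, box in sorted(preds, key=lambda p: -p[0]):
        best, best_j = 0.0, None
        for j, gt in enumerate(gts):
            overlap = iou(box, gt)
            if j not in matched and overlap > best:
                best, best_j = overlap, j
        if best_j is not None and best >= thr:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
        points.append((tp / len(gts), tp / (tp + fp)))  # (recall, precision)
    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        envelope = max(p for _, p in points[i:])  # best precision at >= this recall
        ap += (recall - prev_recall) * envelope
        prev_recall = recall
    return ap


if __name__ == "__main__":
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 30))]
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(average_precision(preds, gts, thr=0.5))
```

Averaging `average_precision` over classes gives mAP at one threshold, and repeating the computation for thresholds 0.50, 0.55, up to 0.95 and averaging gives mAP@0.5:0.95.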
2.4.2. Experimental Setup
All experiments were conducted on Ubuntu 22.04 using an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA) with 24 GB of memory. The implementation was based on Python 3.10.19, PyTorch 2.1.2, and CUDA 11.8. The model was trained for 200 epochs with a batch size of 16. AdamW was used as the optimizer, with a learning rate of 0.0001, a momentum of 0.9, and a weight decay of 0.0001. The input image size was set to 640 × 640 for training and evaluation. The hardware configuration, software environment, and main parameters used for training and analysis are summarized in Table 1.
3. Experimental Results and Analysis
3.1. Baseline Model Selection
Before developing the improved detector, we first selected a suitable RT-DETR baseline for greenhouse cucumber detection. Four RT-DETR variants were compared, as shown in Table 2. Increasing model scale did not lead to a clear accuracy gain on this dataset. RT-DETR-R50 improved mAP@0.5 by only 0.1 percentage points over RT-DETR-R18 and showed the same mAP@0.5:0.95, but it required substantially more parameters, more FLOPs, and a larger weight file. RT-DETR-R34 and RT-DETR-L also increased model complexity without improving the overall accuracy profile. Considering detection accuracy, model complexity, and later edge deployment, RT-DETR-R18 provided the most suitable baseline. It maintained competitive detection performance while requiring the smallest parameter count, the lowest computational cost, and the lightest weight file among the four candidates. Therefore, RT-DETR-R18 was selected for subsequent improvement.
3.2. Comparative Experiments on Attention Mechanisms
High-level contextual interaction is important when visible fruit cues are spatially separated. To compare different attention strategies, we replaced the original intra-scale interaction module in RT-DETR-R18 with HiLo, MSLA, CGA, Efficient Additive, and SH-AIFI, and evaluated them under identical training and testing settings.
As shown in Table 3, the competing attention modules changed the model size only slightly, but their performance gains were limited or inconsistent. Compared with RT-DETR-R18, HiLo, MSLA, and CGA increased mAP@0.5 by 0.4, 0.5, and 0.6 percentage points, respectively, whereas their mAP@0.5:0.95 decreased by 0.3, 0.3, and 0.7 percentage points. Efficient Additive did not improve the baseline and reduced mAP@0.5 and mAP@0.5:0.95 by 0.2 and 0.7 percentage points, respectively. In contrast, SH-AIFI showed the clearest improvement among the tested attention modules. It reached 91.5% mAP@0.5 and 75.3% mAP@0.5:0.95, gains of 1.7 and 1.0 percentage points over RT-DETR-R18, while also giving the highest precision and recall. At the same time, the parameter count decreased from 19.9 M to 19.7 M, FLOPs dropped from 58.6 G to 57.0 G, and the weight size was reduced from 80.7 MB to 80.0 MB. These results suggest that SH-AIFI strengthens contextual association among high-level features without increasing model complexity. Based on this comparison, SH-AIFI was used in the subsequent experiments.
3.3. Comparison of Backbone Networks
The backbone network determines the fundamental feature extraction capability of the detector and directly governs the trade-off between detection accuracy and computational cost. To evaluate the effectiveness of the proposed GLKAN backbone, we replaced the default ResNet-18 in RT-DETR with nine representative lightweight architectures. The results are summarized in Table 4.
GLKAN gave the strongest overall accuracy among the tested backbones. It reached 92.0% mAP@0.5 and 75.4% mAP@0.5:0.95, increasing the two mAP metrics by 2.2 and 1.1 percentage points over the RT-DETR-R18 baseline. Its precision and recall were also the highest in Table 4. This accuracy gain was obtained with a smaller model. The parameter count decreased from 19.9 M to 15.2 M, a reduction of 23.6%, and FLOPs fell from 58.6 G to 51.6 G. The weight file was reduced to 62.4 MB, which was 22.7% smaller than the baseline. These results suggest that GLKAN improves shallow feature representation without relying on a larger backbone. Among the CNN-based alternatives, UniRepLKNet achieved 90.9% mAP@0.5 and 74.5% mAP@0.5:0.95 with lower computation, but it still lagged behind GLKAN in both mAP metrics. GhostNetv2 matched the baseline mAP@0.5 at a much lower computational cost of 27.8 G FLOPs, although its recall remained lower than that of GLKAN. MobileNetv4 compressed the model to 11.3 M parameters and 46.5 MB, but the accuracy loss was clear, with mAP@0.5 falling to 88.4% and mAP@0.5:0.95 to 71.2%. ConvNeXtv2 and EfficientNetv2 did not provide a better accuracy–efficiency balance. ConvNeXtv2 remained below the baseline in mAP@0.5, while EfficientNetv2 reached only 89.6% mAP@0.5 despite using 21.0 M parameters and an 85.7 MB weight file. StarNet showed the weakest accuracy, with 85.6% mAP@0.5 and 68.6% mAP@0.5:0.95.
The two Transformer-based backbones also failed to offer a more suitable trade-off. EfficientViT was the most compact network, with 10.7 M parameters and 27.2 G FLOPs, but its mAP@0.5 dropped to 88.8% and its recall fell to 86.3%. EfficientFormerv2 produced a small gain over the baseline, reaching 90.2% mAP@0.5 and 74.4% mAP@0.5:0.95. This improvement, however, came with a 99.9 MB weight file, about 1.6 times the size of the 62.4 MB GLKAN weight file. Considering accuracy, model size, and computational cost together, GLKAN provided the most favorable backbone choice for the subsequent experiments.
3.4. Ablation Experiment
To quantify the individual and combined contributions of the three proposed modules, ablation experiments were conducted on RT-DETR-R18. The results are presented in
Table 5 and visualized in
Figure 9.
Each module, when introduced alone, improved the baseline detector in a different way. With GLKAN, mAP@0.5 increased from 89.8% to 92.0%, and mAP@0.5:0.95 rose to 75.4% (+1.1 percentage points). The parameter count simultaneously dropped from 19.9 M to 15.2 M, and the weight file size was reduced to 62.4 MB, corresponding to decreases of 23.6% and 22.7%. SH-AIFI mainly improved the balance between classification confidence and target retrieval. It reached 91.5% mAP@0.5 (+1.7 percentage points) and raised recall to 91.0%, while the parameter count remained almost unchanged at 19.7 M. This suggests that the enhanced encoder interaction strengthened high-level feature association without adding model burden. WRPAFusion gave the largest single-module gain on the stricter localization metric. Its mAP@0.5:0.95 reached 75.9%, 1.6 percentage points higher than the baseline, while FLOPs decreased from 58.6 G to 48.3 G. The two-module settings further show that the three components are complementary. Combining GLKAN with SH-AIFI produced the highest recall among the two-module variants, reaching 91.5%, and increased mAP@0.5 to 92.3%. The GLKAN and WRPAFusion combination gave a stronger gain in localization quality, with mAP@0.5:0.95 increasing from 74.3% to 76.7%. This setting also compressed the model substantially, reducing the weight file from 80.7 MB to 52.0 MB. The SH-AIFI and WRPAFusion setting reached 93.1% mAP@0.5 and 76.4% mAP@0.5:0.95, confirming that the attention and fusion modules can work together effectively even without the redesigned backbone.
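The difference between mAP@0.5 and the stricter mAP@0.5:0.95 comes down to how much box overlap counts as a correct match. The following is a minimal sketch, not the authors' evaluation code. It computes intersection over union (IoU) for two axis-aligned boxes and counts how many of the ten COCO-style thresholds (0.50 to 0.95 in steps of 0.05) a prediction passes. The function names and box coordinates are hypothetical and chosen only to resemble an elongated fruit.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten IoU thresholds: 0.50, 0.55, ..., 0.95 (built from integers to avoid drift).
THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Number of IoU thresholds at which the prediction counts as a match."""
    value = iou(pred_box, gt_box)
    return sum(value >= t for t in THRESHOLDS)


# Hypothetical ground-truth box (40 x 200 px) and a slightly shifted prediction.
gt = (100, 50, 140, 250)
pred = (105, 60, 145, 250)
print(round(iou(pred, gt), 3), thresholds_passed(pred, gt))
```

In this illustration the prediction counts as a hit at IoU 0.5 but fails the thresholds from 0.75 upward, which is why a small offset on a slender target affects mAP@0.5:0.95 much more than mAP@0.5.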
The full MCS-DETR configuration produced the strongest result in the ablation study. Compared with the baseline, mAP@0.5 increased from 89.8% to 93.4%, and mAP@0.5:0.95 rose from 74.3% to 76.8%. Precision and recall both remained above 91%. The accuracy gains were accompanied by a clear reduction in model cost. Compared with RT-DETR-R18, the final model used 7.3 M fewer parameters, 14.6 G fewer FLOPs, and a 29.4 MB smaller weight file. These consistent changes show that the three modules improve different parts of the detection pipeline and support each other rather than adding redundant structure.
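The absolute savings quoted above can be converted into the final model cost and into relative reductions with simple arithmetic. The sketch below is illustrative only. It uses the baseline figures and savings stated in the text, while the dictionary keys and the function name are assumptions introduced for the example.

```python
# RT-DETR-R18 baseline cost and the absolute savings reported for MCS-DETR.
BASELINE = {"params_M": 19.9, "flops_G": 58.6, "weight_MB": 80.7}
SAVINGS = {"params_M": 7.3, "flops_G": 14.6, "weight_MB": 29.4}


def derive_costs(baseline, savings):
    """Return {metric: (remaining_cost, relative_reduction_percent)}."""
    derived = {}
    for key, base in baseline.items():
        remaining = round(base - savings[key], 1)
        reduction_pct = round(100.0 * savings[key] / base, 1)
        derived[key] = (remaining, reduction_pct)
    return derived


for metric, (remaining, pct) in derive_costs(BASELINE, SAVINGS).items():
    print(f"{metric}: {remaining} ({pct}% lower than baseline)")
```

Under these inputs the derived model cost is about 12.6 M parameters, 44.0 G FLOPs, and 51.3 MB, with relative reductions of 36.7%, 24.9%, and 36.4%.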
After confirming the contribution of the complete model, we further checked the repeatability of the key comparison between RT-DETR-R18 and MCS-DETR. Both models were trained three times under the same data split, augmentation settings, optimizer configuration, and hyperparameter settings. The random seed was fixed at 0 in all three training runs. Across the repeated runs, the maximum run-to-run range did not exceed 0.10 percentage points for mAP@0.5 and 0.20 percentage points for mAP@0.5:0.95. MCS-DETR outperformed RT-DETR-R18 in every run, with average gains of 3.60 and 2.47 percentage points in mAP@0.5 and mAP@0.5:0.95, respectively. Because the same random seed was used in all repeated trainings, this analysis is reported as a repeatability check under a fixed random seed rather than as a significance test based on independent random seeds.
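The two statistics used in this repeatability check, the run-to-run range and the average gain, can be computed as sketched below. The per-run values in the example are placeholders invented for illustration, because the text reports only the summary statistics. The function names are likewise assumptions.

```python
from statistics import mean


def run_to_run_range(values):
    """Spread between the best and worst run, in percentage points."""
    return max(values) - min(values)


def mean_gain(baseline_runs, model_runs):
    """Average per-run improvement of the model over the baseline."""
    return mean(m - b for b, m in zip(baseline_runs, model_runs))


# PLACEHOLDER mAP@0.5 values for three runs (not the paper's per-run results).
baseline_map50 = [89.7, 89.8, 89.8]
model_map50 = [93.4, 93.4, 93.3]

print(round(run_to_run_range(baseline_map50), 2))
print(round(run_to_run_range(model_map50), 2))
print(round(mean_gain(baseline_map50, model_map50), 2))
```

Pairing the runs by index, as `zip` does here, matches a design in which both models are trained under identical settings. With independent seeds the same two functions would feed a significance test rather than a repeatability check.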
3.5. Comparative Experiments of Different Models
To further validate the proposed MCS-DETR, we compared it against a broad range of mainstream detectors, including both CNN-based and DETR-based architectures. The results are reported in
Table 6, and the detection performance and model complexity are visualized in
Figure 10a and
Figure 10b, respectively.
Among the CNN-based detectors, earlier two-stage and anchor-based models showed limited accuracy or high computational cost on this task. Faster R-CNN, SSD, and RetinaNet remained below 88% mAP@0.5, and SSD had the lowest mAP@0.5:0.95 among all compared models. Cascade R-CNN improved the two mAP metrics to 88.4% and 71.7%, but this came with the largest model burden in
Table 6, including 69.2 M parameters and 205 G FLOPs. TOOD reached 88.9% mAP@0.5, yet its localization accuracy was still lower than that of the stronger one-stage YOLO variants. YOLOv8m and YOLOv12m provided a more competitive efficiency profile. YOLOv12m reached 89.8% mAP@0.5 with 20.1 M parameters and 67.1 G FLOPs, while YOLOv8m obtained a slightly higher mAP@0.5:0.95 of 74.0%. MCS-DETR outperformed both YOLO variants. Compared with YOLOv12m, it improved mAP@0.5 by 3.6 percentage points and mAP@0.5:0.95 by 3.1 percentage points, with 37.3% fewer parameters and 34.4% lower FLOPs. Compared with YOLOv8m, MCS-DETR gained 4.0 and 2.8 percentage points in the two mAP metrics and reduced the parameter count by more than half.
Within the DETR family, several models achieved higher accuracy than the early CNN detectors, but most required larger model sizes or higher computation. Deformable DETR remained below RT-DETR-R18 in both mAP metrics despite using 40.1 M parameters and 165 G FLOPs. DDQ-DETR produced competitive accuracy, with 90.4% mAP@0.5 and 74.5% mAP@0.5:0.95, but its computational cost increased to 471 G FLOPs. Conditional DETR and DAB-DETR stayed close to 90% mAP@0.5, although both required more than twice the parameters of RT-DETR-R18. DINO was the strongest competing DETR-based detector, reaching 90.7% mAP@0.5 and 74.7% mAP@0.5:0.95. MCS-DETR exceeded DINO by 2.7 and 2.1 percentage points in the two mAP metrics while using only 26.5% of its parameters and 18.7% of its FLOPs. Compared with the RT-DETR-R18 baseline, MCS-DETR increased mAP@0.5 and mAP@0.5:0.95 by 3.6 and 2.5 percentage points, respectively. This accuracy gain was not obtained by using a larger detector. In terms of model size and computational cost, MCS-DETR remained smaller than the compared CNN-based and DETR-based models, including YOLOv12m and DINO. These comparisons show that MCS-DETR achieved the most favorable accuracy–efficiency trade-off among all tested CNN-based and DETR-based detectors.
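As a consistency check on the cross-model comparison, the relative savings against YOLOv12m can be recomputed from figures given in the text. In this sketch the MCS-DETR cost is derived from the RT-DETR-R18 baseline minus the reported savings (7.3 M parameters and 14.6 G FLOPs) rather than read from Table 6, so it should be treated as an assumption. The function name is also hypothetical.

```python
def relative_saving(ours, theirs):
    """Percentage by which `ours` is lower than `theirs`."""
    return 100.0 * (theirs - ours) / theirs


# Derived MCS-DETR cost: baseline (19.9 M, 58.6 G) minus the reported savings.
mcs_detr = {"params_M": round(19.9 - 7.3, 1), "flops_G": round(58.6 - 14.6, 1)}
# YOLOv12m cost as stated in the text.
yolov12m = {"params_M": 20.1, "flops_G": 67.1}

for key in ("params_M", "flops_G"):
    saving = relative_saving(mcs_detr[key], yolov12m[key])
    print(f"{key}: {saving:.1f}% lower than YOLOv12m")
```

The recomputed values, about 37.3% fewer parameters and 34.4% fewer FLOPs, agree with the percentages quoted in the comparison.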
Figure 10a presents a radar chart comparing the detection accuracy of all models across precision, recall, mAP@0.5, and mAP@0.5:0.95. The contour of MCS-DETR consistently occupies the outermost region across all four metrics, indicating that its overall detection capability exceeds that of all other compared models.
Figure 10b illustrates the relationship between model complexity and weight file size. Each bubble corresponds to a model, with its position determined by parameter count and FLOPs and its diameter proportional to the weight file size. MCS-DETR is located in the lower-left corner with the smallest bubble, confirming that it simultaneously achieves the lowest parameter count, the fewest FLOPs, and the smallest weight file size. Overall, MCS-DETR achieved the highest detection accuracy among all compared models while maintaining the most compact model size and the lowest computational cost, demonstrating its suitability for greenhouse cucumber selective harvesting where both high accuracy and deployment efficiency are essential.
3.6. Visual Demonstration
3.6.1. Visualization of Heatmap
To further examine how different detectors respond to cucumber targets in cluttered greenhouse environments, Grad-CAM++ heatmaps were generated for RT-DETR, YOLOv8m, and the proposed MCS-DETR on the same representative scenes, as shown in
Figure 11 [
36]. Grad-CAM++ back-propagates the gradients of the target class into the last convolutional layer and produces a spatial activation map that highlights image regions contributing most to the final prediction. Warm colors denote strong activation while cold colors indicate weak activation. All four samples were drawn from the same greenhouse scene but differ in leaf occlusion severity, background complexity, and the degree to which the fruit is exposed.
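The weighting step behind such a heatmap can be sketched with NumPy. This is an illustrative implementation of the commonly used closed-form Grad-CAM++ weights, which assume an exponential output score so that higher-order derivatives reduce to powers of the first-order gradient. It is not the authors' visualization code. The inputs `activations` and `grads` are assumed to be the feature maps of the chosen convolutional layer and the gradients of the target score with respect to them, both obtained elsewhere, for example through framework hooks.

```python
import numpy as np


def grad_cam_pp(activations, grads, eps=1e-6):
    """Grad-CAM++ map from feature maps and gradients of shape (K, H, W)."""
    g2 = grads ** 2
    g3 = g2 * grads
    # Sum of each feature map over its spatial positions, kept broadcastable.
    act_sum = activations.sum(axis=(1, 2), keepdims=True)
    # Pixel-wise weighting coefficients (closed form under the exp assumption).
    alpha = g2 / (2.0 * g2 + act_sum * g3 + eps)
    alpha = np.where(grads != 0, alpha, 0.0)
    # Channel weights: alpha-weighted sum of the positive gradients.
    weights = (alpha * np.maximum(grads, 0.0)).sum(axis=(1, 2))
    # Weighted combination of feature maps, followed by ReLU.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # Min-max normalize to [0, 1] so the map can be rendered as a heatmap.
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Toy input: channel 0 responds strongly in a 2 x 2 central region.
activations = np.ones((2, 4, 4))
activations[0, 1:3, 1:3] = 3.0
grads = np.zeros((2, 4, 4))
grads[0, 1:3, 1:3] = 0.5
heatmap = grad_cam_pp(activations, grads)
print(heatmap.round(2))
```

In the toy input only the central region of one channel receives gradient, so the normalized map is high there and zero elsewhere, which corresponds to the warm and cold colors in the rendered heatmaps.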
RT-DETR covers the main fruit body in most cases, though its high-response area tends to spill onto adjacent leaves and tendrils, suggesting that the encoder cannot fully separate foreground texture from the visually similar canopy. YOLOv8m yields a tighter response along the fruit, but the activation becomes fragmented when the target is partially hidden behind stems of similar color and texture. Neither model maintains a spatially coherent heat signature under varying occlusion conditions. MCS-DETR mitigates these issues to a noticeable extent. Its high-activation zone tracks the elongated fruit contour more faithfully, and the transition from target to background is sharper than what the two baselines achieve. The improvement is particularly visible in samples with heavy foliage overlap, where RT-DETR scatters attention across the canopy and YOLOv8m fragments its response, while MCS-DETR retains a continuous, well-bounded heat region along the fruit body. Residual background activation remains in all three models, which is an expected artifact in dense greenhouse imagery where leaves, stems, and fruit share overlapping spectral and textural properties. These observations indicate that the proposed model produces more spatially focused and discriminative feature representations under complex greenhouse conditions.
3.6.2. Detection Result Visualization
Figure 12 presents the qualitative detection results of RT-DETR, YOLOv8m, and the proposed MCS-DETR under four representative greenhouse scenarios: frontlight, backlight, distant view, and fruit overlap. These samples were selected for the variety of interference they present, including strong illumination changes, dense foliage, partial occlusion, scale variation, and close spacing between adjacent fruits. In the relatively unobstructed regions of each image, all three detectors successfully localize the dominant cucumber targets, confirming that each model possesses a baseline detection capability in real greenhouse conditions. The gaps between them surface once the targets become slender, distant, leaf-occluded, or visually entangled with surrounding stems and textures.
Under frontlight, RT-DETR and YOLOv8m detect the principal fruits but exhibit varying degrees of missed detection on smaller cucumbers near the canopy edge. The backlight scene intensifies this problem. Strong backlighting suppresses local contrast, causing RT-DETR to return noticeably lower confidence scores on targets that fall within shadow regions; YOLOv8m misses some of these targets altogether. MCS-DETR still registers them. In the distant-view sample, fruit size shrinks and the background occupies a larger proportion of the frame. Both RT-DETR and YOLOv8m produce false detections or miss several remote targets obscured by foliage, whereas MCS-DETR correctly identifies all ground-truth instances. The fruit-overlap scenario tells a similar story. MCS-DETR achieves clearly better detection performance in this setting, with no false or missed detections observed. Overall, the visual evidence indicates that MCS-DETR delivers stronger detection performance when confronted with illumination disturbance, occlusion, and scale change.
3.7. Model Deployment
To evaluate the edge-deployment performance of MCS-DETR, inference experiments were carried out on an NVIDIA Jetson Orin NX Super, as shown in
Figure 13. This platform offers a favorable trade-off between power budget and on-device computing power, making it a practical choice for real-time visual perception in greenhouse scenarios. All test images were greenhouse cucumber frames resized to 640 × 640. Three configurations were benchmarked: RT-DETR-R18 under the PyTorch backend, MCS-DETR under the same backend, and MCS-DETR accelerated with TensorRT. Results are reported in
Table 7.
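Inference cost was quantified by the average end-to-end latency per image. Let $T_{\mathrm{pre}}$, $T_{\mathrm{inf}}$, and $T_{\mathrm{post}}$ denote the preprocessing, network forward, and postprocessing times, respectively. The total inference latency is then defined as:
$$T_{\mathrm{total}} = T_{\mathrm{pre}} + T_{\mathrm{inf}} + T_{\mathrm{post}}$$
and the corresponding frame rate is given by:
$$\mathrm{FPS} = \frac{1000}{T_{\mathrm{total}}}$$
where $T_{\mathrm{total}}$ is measured in milliseconds. As shown in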
Table 7, MCS-DETR running under PyTorch reached 16.9 FPS, compared with 13.3 FPS for the baseline RT-DETR-R18. With TensorRT optimization enabled, the frame rate of MCS-DETR further increased to 26.3 FPS. These results indicate that the proposed MCS-DETR maintains favorable deployability on resource-constrained embedded hardware while satisfying the real-time response and detection accuracy demands of selective cucumber harvesting in greenhouses.
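The relation between per-image latency and frame rate described above can be written as a short helper. The sketch below is illustrative. The three-way split of the latency in the example is hypothetical, since the text reports only frame rates, and the function names are assumptions.

```python
def total_latency_ms(pre_ms, forward_ms, post_ms):
    """End-to-end latency per image: preprocessing + forward + postprocessing."""
    return pre_ms + forward_ms + post_ms


def fps_from_latency(latency_ms):
    """Frames per second for a latency given in milliseconds."""
    return 1000.0 / latency_ms


def latency_from_fps(fps):
    """Average per-image latency in milliseconds implied by a frame rate."""
    return 1000.0 / fps


# Latencies implied by the reported frame rates.
for name, fps in [("RT-DETR-R18 (PyTorch)", 13.3),
                  ("MCS-DETR (PyTorch)", 16.9),
                  ("MCS-DETR (TensorRT)", 26.3)]:
    print(f"{name}: {latency_from_fps(fps):.1f} ms per image")

# Hypothetical split of a 38 ms budget into the three stages.
example_total = total_latency_ms(pre_ms=3.0, forward_ms=32.0, post_ms=3.0)
print(round(fps_from_latency(example_total), 1))
```

Read this way, the reported frame rates correspond to roughly 75.2 ms, 59.2 ms, and 38.0 ms per image, so the TensorRT configuration nearly halves the end-to-end latency of the PyTorch baseline.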
4. Discussion
Recent agricultural detection studies have shown that RT-DETR and other Transformer-based detectors can be adapted to crop detection tasks through task-specific structural redesign. Previous studies have reduced model complexity in complex field environments [
22], achieved high mAP and inference speed with GP-DETR [
23], and reduced model complexity or model size in improved RT-DETR variants [
26,
27]. These findings indicate that crop-specific detector redesign is often more valuable than simply increasing model scale. MCS-DETR follows this direction but addresses a different application condition. It was developed for greenhouse cucumber harvesting, where the detector needs to preserve shallow contour cues, capture long-range visual associations, and maintain efficient multi-scale fusion within a compact model. The main contribution of MCS-DETR is therefore to combine accuracy improvement with reduced model cost for an on-device harvesting scenario.
The experimental results further show that different components contributed to detection accuracy or model efficiency in different ways, and the complete model achieved the strongest overall performance. This pattern is consistent with previous agricultural detection studies that used backbone redesign, attention refinement, or feature fusion to address crop-specific visual difficulty. The difference is that the present work integrates these design directions into one compact detector. For greenhouse cucumber detection, this integration is important because reliable recognition under partial visibility requires the joint use of boundary information, contextual association, and multi-scale representation.
A detector used in a harvesting robot must provide target information quickly enough for downstream localization, motion planning, and control. Missed detections may leave mature fruits unpicked, while false detections may introduce invalid grasping targets. In this sense, model efficiency is not only a computational metric but also a requirement for field operation. MCS-DETR reached 26.3 FPS on the NVIDIA Jetson Orin NX Super after TensorRT acceleration, suggesting that the proposed detector can serve as an on-device perception module for real-time greenhouse cucumber detection. This result should be interpreted as evidence of deployment feasibility at the perception level. It does not yet demonstrate the performance of a complete harvesting robot, because closed-loop picking also depends on depth sensing, spatial localization, grasp planning, end-effector control, and system-level coordination.
In terms of dataset construction, the current findings still have certain limitations. Although the image set covers common disturbance conditions, such as lighting variation, different viewing distances, fruit overlap, and leaf and stem occlusion, it was collected from a single-site and single-cultivar greenhouse scenario. This setting is useful for evaluating the proposed detector under a representative protected-cultivation condition, but it cannot fully describe the variation caused by different cultivars, greenhouse structures, seasons, or cultivation practices. The detector may also be transferable to other crops with elongated fruit morphology, but such transfer should be validated separately because canopy density, fruit color, fruit surface texture, and training systems may change the visual cues used by the detector. Future work should therefore extend the dataset across cultivars and greenhouse environments and evaluate the detector in closed-loop robotic harvesting trials.
5. Conclusions
This study proposed MCS-DETR for harvestability recognition of greenhouse cucumbers. The model achieved 93.4% mAP@0.5 and 76.8% mAP@0.5:0.95. Compared with RT-DETR-R18, the two mAP metrics increased by 3.6 and 2.5 percentage points, while the number of parameters, FLOPs, and weight size decreased by 36.7%, 24.9%, and 36.4%, respectively. After TensorRT acceleration, MCS-DETR reached 26.3 FPS on the NVIDIA Jetson Orin NX Super. These results indicate that the model can provide an efficient visual perception module for real-time greenhouse cucumber detection. This work may also serve as a reference for agricultural vision tasks involving complex backgrounds, frequent occlusion, and elongated targets.
This study still has several limitations. The current dataset is not sufficient to support a broad claim of cross-cultivar or cross-greenhouse generalization, so further validation is needed under different cultivar and greenhouse conditions. The present work focuses on image-based detection, while selective harvesting also depends on depth sensing, spatial localization, grasp planning, and end-effector control. The deployment experiment verified inference speed, but it did not evaluate picking success rate, cycle time, or long-term operational stability in closed-loop operation. Future work will address these limitations in three directions. The model will first be tested under more diverse greenhouse environments, cultivar conditions, and imaging settings to improve cross-scene generalization. The current detection framework will then be extended toward joint perception of target detection, harvestability assessment, and spatial localization. The detector will finally be deployed in a robotic harvesting system and evaluated through closed-loop picking operations.