MCS-DETR: An Efficient Multi-Scale Context-Aware Detection Model for the Selective Harvesting of Greenhouse Cucumbers

Rong, Lihong; Zhang, Weilong; Sun, Fang; Liu, Huimin; Cai, Changqing; Ding, Fuzhu; Tong, Zhimin

doi:10.3390/app16115530


by Lihong Rong ¹, Weilong Zhang ¹, Fang Sun ¹, Huimin Liu ¹, Changqing Cai ², Fuzhu Ding ¹ and Zhimin Tong ¹,*

¹ College of Mechanical and Electrical Engineering, Qingdao Agricultural University, Qingdao 266109, China

² College of Electrical and Information Engineering, Changchun Institute of Technology, Changchun 130012, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5530; https://doi.org/10.3390/app16115530

Submission received: 19 March 2026 / Revised: 8 May 2026 / Accepted: 28 May 2026 / Published: 2 June 2026

(This article belongs to the Section Agricultural Science and Technology)


Abstract

Selective harvesting of greenhouse cucumbers requires accurate detection with low inference latency. In greenhouse canopies, mature cucumbers are often partly occluded and visually similar to surrounding stems and leaves, which makes harvestability recognition difficult. Existing real-time detectors still struggle to preserve fine boundary cues, capture long-range context, and remain compact enough for on-device inference under these conditions. This study proposes MCS-DETR, an efficient multi-scale context-aware detector built on RT-DETR. Instead of increasing model scale, MCS-DETR redesigns shallow feature extraction, high-level contextual interaction, and cross-scale feature aggregation within a compact framework. A shallow feature level is also retained to preserve fine contour information. On the greenhouse cucumber dataset, MCS-DETR achieved 93.4% mAP@0.5 and 76.8% mAP@0.5:0.95, outperforming RT-DETR-R18 while requiring fewer parameters and less computation. On an NVIDIA Jetson Orin NX Super (Hunan ChuangLebo Intelligent Technology Co., Ltd., Room 2003, Building C, Xinchanghai Digital Center, Changsha Economic and Technological Development Zone, Changsha, Hunan, China) platform, it reached 26.3 FPS after TensorRT acceleration. These results indicate that MCS-DETR can provide an efficient on-device perception module for real-time greenhouse cucumber detection.

Keywords: object detection; RT-DETR; multi-scale feature fusion; harvestability recognition; edge deployment

1. Introduction

Cucumber (Cucumis sativus L.) is an important vegetable crop in protected agriculture. It is widely cultivated worldwide. China has long held a leading position in cucumber planting area and total production [1]. Meanwhile, greenhouse cucumber production in China has high yield potential and substantial economic value [2]. Cucumber harvesting still depends heavily on manual labor. Rising labor costs and a shrinking agricultural labor pool are accelerating the demand for robotic greenhouse harvesting [3]. In selective harvesting, target recognition must be completed before grasping and cutting. Stable visual perception therefore directly affects harvesting safety and operational efficiency [4]. Object detection and localization are key links between visual sensing and robotic execution [5]. In real greenhouse scenes, however, visual perception is affected by canopy clutter, variable lighting, and the visual similarity between fruits and surrounding plant organs. Robust detection under practical harvesting conditions therefore remains difficult [6].

Traditional cucumber recognition methods mainly relied on image processing and shallow learning. Sun et al. [7] enhanced greenhouse canopy images with multi-scale Retinex and color restoration, and then segmented cucumber targets with vegetation indices under variable natural lighting. Li et al. [8] combined color analysis, texture analysis, support vector machine classification, and SIFT keypoint filtering for greenhouse cucumber detection. Bao et al. [9] developed a multi-template matching method for cucumber recognition in natural environments and improved recognition accuracy by constructing a scaled and rotated template library. Mao et al. [10] introduced a multi-path convolutional feature extractor and combined color component selection with an SVM classifier for cucumber recognition in natural environments. These methods provided useful early solutions, but they depend on hand-crafted color, texture, or template features and on staged decision rules. Their performance is sensitive to illumination, viewpoint, and occlusion changes, and their ability to adapt to complex greenhouse canopies remains limited.

Deep learning detectors reduced manual feature engineering and pushed agricultural target recognition toward end-to-end learning. Ren et al. [11] established the Faster R-CNN framework with region proposal networks and shared convolutional features. Redmon et al. [12] proposed YOLO and made single-stage real-time detection practical through direct regression. For greenhouse cucumber recognition, Wang et al. [13] enhanced YOLOv5 with Cr-channel pretraining to suppress near-color interference from leaves and stems. Su et al. [14] further improved YOLOv5 for cucumber picking in near-color greenhouse backgrounds and explicitly addressed occlusion and scale variation. Liu et al. [15] applied instance segmentation to greenhouse cucumber detection and improved pixel-level localization under complex occlusion. Bai et al. [16] combined transfer learning with U-Net and one-stage detectors to improve segmentation and recognition robustness in complex natural environments. Kim et al. [17] introduced amodal segmentation to recover occluded cucumber regions and improve perception under partial visibility. Even so, CNN-based detectors still rely mainly on local aggregation. Luo et al. [18] showed that the effective receptive field of deep CNNs occupies only a fraction of the theoretical receptive field, which limits long-range relation modeling. For greenhouse cucumber detection, this limitation matters because reliable recognition often requires connecting discontinuous target cues across the canopy.

Transformer detectors provide a different route for object detection. Carion et al. [19] reformulated detection as set prediction and removed anchor design and non-maximum suppression from the main pipeline. Zhu et al. [20] improved DETR with deformable attention and strengthened multi-scale modeling for small objects. Zhao et al. [21] then proposed RT-DETR and showed that real-time end-to-end detection can surpass YOLO-style detectors of comparable scale while eliminating NMS. In agricultural vision, Jin et al. [22] proposed a lightweight RT-DETR variant by optimizing the backbone and attention mechanism, which reduced computational complexity by 59.3% for upland rice weed detection. Li et al. [23] proposed GP-DETR based on RT-DETR. They adopted GELAN as the backbone and introduced enhanced intra-scale interaction and feature fusion, achieving 96.4% mAP@0.5 at 100 FPS. Song et al. [24] proposed an occlusion-aware greenhouse harvesting framework and improved harvestability decision reliability under severe occlusion through risk-gated reasoning. Huang et al. [25] proposed a lightweight RT-DETR-based pear detector by replacing the backbone and introducing HiLo attention, which improved perception under illumination variation. Wang et al. [26] developed HESP-DETR through enhanced multi-scale modeling and feature fusion, which improved mAP@0.5 by 1.0% while reducing model complexity and inference cost. Wu et al. [27] further improved RT-DETR for fruit ripeness detection by refining the backbone and attention modules, which improved mAP@0.5 by 2.9% while reducing model size and computational complexity. These studies confirm the potential of Transformer-based detectors in agricultural vision. They also reveal a remaining gap for greenhouse cucumber selective harvesting. 
Existing RT-DETR variants usually improve the backbone, attention module, or feature fusion for a specific crop task, but they seldom address shallow contour preservation, long-range contextual association, efficient multi-scale fusion, and edge-side deployment within one compact detector.

Greenhouse cucumber selective harvesting presents a morphology-specific detection problem. Mature fruits are long and slender, the peduncle occupies a small image region, and the fruit surface is close in color to stems and leaves. Dense foliage can separate one cucumber into discontinuous visible parts. A suitable detector therefore needs to retain shallow contour cues, model long-range relations along the fruit axis, and fuse multi-scale features without excessive computation.

To address this task, this study proposes MCS-DETR, an efficient multi-scale context-aware detector for selective harvesting of greenhouse cucumbers. Built on RT-DETR, MCS-DETR redesigns three parts of the detection pipeline. The backbone strengthens shallow feature extraction, the top encoder layer improves contextual interaction with limited computation, and the neck reduces redundancy during cross-scale feature aggregation. The main contributions are listed below.

A Gated Large Kernel Aggregation Network (GLKAN) is designed as the backbone to improve shallow feature representation for cucumber contours and peduncle regions.
A Single-Head Attention-based Intra-scale Feature Interaction (SH-AIFI) module is introduced at the top of the encoder to strengthen high-level contextual interaction at low computational cost.
A Weighted Reparameterized Partial Aggregation Fusion (WRPAFusion) module is developed in the neck to improve cross-scale feature aggregation and reduce fusion redundancy.

2. Materials and Methods

2.1. Data Acquisition

The cucumber fruit image data used in this study was collected from a vegetable planting base in Gucheng Street, Shouguang City, Shandong Province, China. The crops were grown in a greenhouse environment and were Chinese northern prickly cucumbers. To fully cover the growth state of the fruit at different times and under different lighting conditions, the data collection used an all-day shooting strategy, starting at 7:00 a.m. and ending at 6:00 p.m. It included different lighting scenes such as frontlight and backlight, which improved the model’s ability to adapt to changing lighting conditions. The images were captured using an OPPO Reno 10 smartphone (Guangdong OPPO Mobile Telecommunications Corp., Ltd., No. 18 Haibin Road, Wusha, Chang’an Town, Dongguan, Guangdong, China). After screening, 3750 valid original images were retained, covering frontlight and backlight conditions, different viewing distances, single and multiple fruit targets, fruit overlap, and leaf and stem occlusion. Representative samples are shown in Figure 1. Because all images were collected from one greenhouse site and one cucumber cultivar, the dataset mainly reflects this specific protected-cultivation condition and does not fully cover variability across different cultivars or greenhouse environments. The generalization issue is further discussed in Section 4.

2.2. Dataset Creation

To enable harvestability recognition, cucumber fruits were annotated according to two criteria, namely fruit maturity and current operability. Fruit maturity was judged by uniform thickness, deep green coloration, and an overall plump shape. Given that the harvesting robot changes its viewing perspective during operation, the operability of each fruit differs across angles. Mature fruits with clearly visible peduncles from the current viewing angle were therefore labeled as harvestable cucumbers, while the remaining fruits were classified as non-harvestable cucumbers. All annotations were performed with LabelImg 1.8.6. Each target was marked by a rectangular bounding box, and the resulting annotation files recorded category labels along with bounding box coordinates.

The original dataset was split into training, validation, and test sets at a ratio of 8:1:1, containing 3000, 375, and 375 images, respectively. To preserve evaluation objectivity, the validation and test sets retained their original data distributions, and offline augmentation was applied exclusively to the training set. Because the visual appearance of cucumber fruits varies with viewing angle, illumination intensity, and occlusion severity, the original images alone cannot cover all potential field conditions. This study applied random scaling from −10% to 10%, horizontal flipping, random rotation from −20° to 20°, and HSV color-space augmentation, with saturation and brightness perturbations both ranging from −25% to 25%. After augmentation, the number of training images increased from 3000 to 5796.
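The 8:1:1 partition described above can be sketched in a few lines. This is only an illustration: the file names, the random shuffle, and the seed are assumptions, since the text does not state how images were assigned to each subset.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them by integer ratios (train, val, test)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # assumed: random assignment
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the 3750 screened images.
images = [f"img_{i:04d}.jpg" for i in range(3750)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 3000 375 375
```

In this sketch, offline augmentation would then be applied to `train` only, which matches the statement that the validation and test sets keep their original distributions.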

2.3. The Improved Model MCS-DETR

The proposed MCS-DETR is an end-to-end detector for harvestability recognition of greenhouse cucumbers. Figure 2 shows its overall architecture. Using RT-DETR as the base detector, MCS-DETR revises three key parts of the detection pipeline, namely the backbone, the top encoder layer, and the cross-scale fusion pathway. The backbone is redesigned to improve shallow feature extraction, the top encoder layer is adjusted to strengthen high-level contextual interaction, and the cross-scale fusion pathway is rebuilt to aggregate multi-scale features more efficiently.

The original backbone does not provide sufficient spatial coverage to perceive slender fruits or distinguish peduncles from adjacent vines. To improve shallow feature extraction, the backbone is replaced with a Gated Large Kernel Aggregation Network (GLKAN), which is composed of Gated Large Kernel Blocks (GLK Blocks). The gated mechanism suppresses background responses from stems and leaves, while large-kernel convolutions expand spatial coverage so that fruit contours and peduncle regions can be captured within a wider local window. In the baseline encoder, standard Multi-Head Self-Attention (MHSA) performs intra-scale interaction on the top-level features. When all channels participate in attention computation, the computational cost and memory overhead become considerable. Full-channel modeling may also introduce irrelevant background details into global associations and weaken the discrimination between harvestable and non-harvestable fruits. Single-Head Attention-based Intra-scale Feature Interaction (SH-AIFI) is therefore introduced at the top layer of the encoder. This module uses Single-Head Self-Attention (SHSA) to apply attention to only a subset of channels, reducing computational cost while strengthening long-range dependencies among high-level features. In the cross-scale fusion pathway, concatenation-based fusion expands the channel dimension at each fusion node and increases the cost of subsequent convolutions. Weighted Reparameterized Partial Aggregation Fusion (WRPAFusion) replaces the original fusion modules to reduce this burden. Fast Normalized Fusion (FNF) fuses multi-scale branches through learnable weights and avoids the channel expansion caused by direct concatenation. Reparameterized Partial Convolution Blocks (RepPConv Blocks) are used inside the Reparameterized Partial Aggregation Block (RPA Block), where the fused features are further refined through partial aggregation and reparameterized convolutions.

2.3.1. Gated Large Kernel Aggregation Network (GLKAN)

The backbone of RT-DETR constructs the feature pyramid by stacking 3 × 3 convolutions stage by stage. This small-kernel stacking structure suffers from low receptive field efficiency and has limited ability to perceive the overall morphology of slender targets. The spatial span of a cucumber fruit from its peduncle junction to the tip requires the backbone to achieve sufficient spatial coverage at relatively shallow layers. Meanwhile, the peduncle region is small and shares a nearly identical color with the surrounding vines, making it difficult for small kernels to capture the difference between peduncles and vines within a single convolution window. To overcome these limitations, the improved backbone alternates Conv downsampling layers with multi-stage GLKAN modules. The backbone outputs four levels of features, P2, P3, P4, and P5, with channel numbers of 128, 256, 384, and 384, respectively. Compared with the baseline, which only outputs P3 to P5, the improved backbone additionally retains the shallow-level P2 features to provide finer-grained edge information for the subsequent fusion stage.

GLKAN is inspired by the cross-stage partial connection strategy of CSPNet [28]. The input features are split along the channel dimension into a shortcut branch and a transform branch. The shortcut branch preserves shallow-level information, while the transform branch passes through several GLK Blocks before being concatenated with the shortcut branch. The overall structure of GLKAN is shown in Figure 3a. Let the input feature be $X \in \mathbb{R}^{C \times H \times W}$. After a 1 × 1 convolution, the input feature is evenly split into a shortcut branch $X_{0}$ and a transform branch $X_{1}$. The transform branch is processed by $N$ successive GLK Blocks. The shortcut branch, the transform-branch input, and the outputs of all GLK Blocks are then concatenated and fused through a 1 × 1 convolution, which is expressed as follows:

$$Y = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(X_{0}, X_{1}, G_{1}(X_{1}), G_{2}(G_{1}(X_{1})), \dots, G_{N}(\dots G_{1}(X_{1}) \dots))\big) \tag{1}$$

where $Y$ is the output feature, $G_{k}(\cdot)$ denotes the $k$-th GLK Block, and $N$ is the number of stacked GLK Blocks. This concatenation strategy allows shallow-level edge information and multi-stage transformed semantic features to participate jointly in the output representation.
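To make the channel bookkeeping of Equation (1) concrete, the following sketch counts the channels that reach the final 1 × 1 convolution. It assumes that the first 1 × 1 convolution keeps $C$ channels and that every GLK Block preserves its channel count; neither is stated explicitly in the text, and the values 128 and 2 are purely illustrative.

```python
def glkan_concat_channels(c_in, n_blocks):
    """Channel count entering the final 1x1 convolution of Eq. (1)."""
    if c_in % 2:
        raise ValueError("an even split needs an even channel count")
    half = c_in // 2  # channels in X0 and in X1 after the even split
    # Concat gathers X0, X1, and one output per GLK Block,
    # each assumed to carry `half` channels.
    return half * (2 + n_blocks)


print(glkan_concat_channels(128, 2))  # 256
```

Under these assumptions the concatenated width grows linearly with the number of stacked blocks, so the final 1 × 1 convolution is what brings the representation back to the stage's output width.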

GLK Block retains the information flow organization of the gated convolution block in MambaOut [29] and replaces the spatial mixing branch with a large-kernel reparameterized convolution. The structure is shown in Figure 3b. The input feature is first normalized and expanded by a 1 × 1 convolution, then split along the channel dimension into a gate branch $G$, an identity branch $I$, and a convolution branch $C$. The three branches work together to complete the feature transformation. The core computation of the GLK Block is expressed as:

$$Y = X + \mathrm{DropPath}\big(\mathrm{Conv}_{1 \times 1}(\mathrm{GELU}(G) \odot \mathrm{Concat}(I, L(C)))\big) \tag{2}$$

where $G$, $I$, and $C$ are the three branches obtained by evenly splitting the expanded feature along the channel dimension, $\mathrm{GELU}(\cdot)$ is the activation function, $\odot$ denotes element-wise multiplication, and $L(\cdot)$ denotes the large-kernel spatial mixing operator. The gate branch performs selective channel-wise modulation, enhancing informative responses while suppressing redundant background features.

The convolution branch adopts the Large Kernel Block (LarK Block) from UniRepLKNet [30]. Its structure is shown in Figure 4a. Large-kernel convolutions achieve a wider spatial coverage with fewer layers, enabling the principal axis of the fruit body, the peduncle position, and the surrounding vine arrangement to be captured within a single convolution window. The internal components of the convolution branch consist of a Dilated Re-param Block (DRB), batch normalization, a channel attention module, and a feed-forward network, which is expressed as:

$$L(X) = X + \mathrm{DropPath}\big(\mathrm{FFN}(\mathrm{SE}(\mathrm{BN}(\mathrm{DRB}(X))))\big) \tag{3}$$

where $X$ is the input feature of the convolution branch, $\mathrm{DRB}(\cdot)$ is the Dilated Re-param Block, $\mathrm{BN}(\cdot)$ is batch normalization, $\mathrm{SE}(\cdot)$ is the channel attention module, and $\mathrm{FFN}(\cdot)$ is the feed-forward network.

During training, the Dilated Re-param Block consists of a main 7 × 7 depthwise convolution and several parallel small convolutions with different dilation rates. The structure is shown in Figure 4. The outputs of all branches are summed before being fed into the subsequent layers, which is expressed as:

$$\mathrm{DRB}(X) = \sum_{m = 1}^{M} \mathrm{BN}_{m}(\mathrm{DWConv}_{m}(X)) \tag{4}$$

where $M$ is the number of parallel branches, $\mathrm{DWConv}_{m}(\cdot)$ denotes the $m$-th depthwise convolution branch with a specific dilation rate, and $\mathrm{BN}_{m}(\cdot)$ is the corresponding batch normalization layer. During inference, owing to the additivity of linear operations, the convolution weights and batch normalization parameters of all branches can be fused into a single 7 × 7 depthwise convolution, which is expressed as:

$$\mathrm{DRB}(X) \equiv \hat{W} * X + \hat{b} \tag{5}$$

where $*$ denotes the convolution operation, and $\hat{W} \in \mathbb{R}^{C \times 1 \times 7 \times 7}$ and $\hat{b} \in \mathbb{R}^{C}$ are the equivalent convolution kernel and bias after branch fusion. The multi-branch parallel structure during training enhances the feature representation capability, while the branch fusion during inference eliminates extra computational overhead. In addition, the dilated convolutions are able to skip occluded regions occupied by leaves and establish spatial connections between separated fruit segments, which is beneficial for detecting partially occluded cucumbers.
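The equivalence between Equations (4) and (5) can be checked numerically on a single channel. The sketch below is a toy illustration rather than the authors' implementation: the kernel values, the dilation rates (2 and 3), and the batch normalization statistics are all made up, and batch normalization is taken in inference mode so that it acts as a fixed affine map.

```python
EPS = 1e-5  # numerical-stability constant inside batch normalization


def conv2d_same(x, kernel, dilation=1):
    """Single-channel 2-D cross-correlation with zero padding."""
    k, h, w = len(kernel), len(x), len(x[0])
    half = (k // 2) * dilation
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for a in range(k):
                for b in range(k):
                    y, z = i + a * dilation - half, j + b * dilation - half
                    if 0 <= y < h and 0 <= z < w:
                        out[i][j] += kernel[a][b] * x[y][z]
    return out


def bn_affine(bn):
    """Inference-mode BN written as y -> scale * y + shift."""
    scale = bn["gamma"] / (bn["var"] + EPS) ** 0.5
    return scale, bn["beta"] - bn["mean"] * scale


def drb_train(x, branches):
    """Eq. (4): sum over branches of BN_m(DWConv_m(x))."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for kernel, dilation, bn in branches:
        scale, shift = bn_affine(bn)
        y = conv2d_same(x, kernel, dilation)
        for i in range(h):
            for j in range(w):
                out[i][j] += scale * y[i][j] + shift
    return out


def drb_merge(branches, size=7):
    """Eq. (5): fold every branch into one dense kernel and one bias."""
    w_hat = [[0.0] * size for _ in range(size)]
    b_hat = 0.0
    for kernel, dilation, bn in branches:
        scale, shift = bn_affine(bn)
        k = len(kernel)
        offset = (size - ((k - 1) * dilation + 1)) // 2  # centre the footprint
        for a in range(k):
            for b in range(k):
                w_hat[offset + a * dilation][offset + b * dilation] += scale * kernel[a][b]
        b_hat += shift
    return w_hat, b_hat


# Toy data: hypothetical kernels, dilation rates, and BN statistics.
x = [[float((3 * i + 5 * j) % 7 - 3) for j in range(8)] for i in range(8)]
bn = lambda g, b, m, v: {"gamma": g, "beta": b, "mean": m, "var": v}
branches = [
    ([[0.01 * (a - b) for b in range(7)] for a in range(7)], 1, bn(1.2, 0.1, 0.3, 0.9)),
    ([[0.1 * (a + b) for b in range(3)] for a in range(3)], 2, bn(0.8, -0.2, 0.1, 1.5)),
    ([[0.05 * (a * b + 1) for b in range(3)] for a in range(3)], 3, bn(1.0, 0.0, -0.4, 0.7)),
]

w_hat, b_hat = drb_merge(branches)
y_train = drb_train(x, branches)
y_infer = [[v + b_hat for v in row] for row in conv2d_same(x, w_hat)]
max_err = max(abs(p - q) for rp, rq in zip(y_train, y_infer) for p, q in zip(rp, rq))
print(max_err < 1e-9)  # True
```

A dilated 3 × 3 kernel is equivalent to a sparse dense kernel whose taps sit `dilation` pixels apart, which is why each small branch can be written into the 7 × 7 grid before the kernels are summed.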

2.3.2. Single-Head Attention-Based Intra-Scale Feature Interaction (SH-AIFI)

The AIFI module applies standard Multi-Head Self-Attention (MHSA) to the top-level features for intra-scale interaction. MHSA is capable of establishing global spatial associations. However, all channels participate in the attention computation simultaneously, resulting in considerable computational cost and memory consumption, which is unfavorable for subsequent edge-side deployment. In addition, full-channel modeling introduces irrelevant background details such as leaf vein textures into the global associations, interfering with the discrimination between harvestable and non-harvestable fruits.

SH-AIFI replaces the original intra-scale interaction module at the top layer of the encoder. The core modification is to replace MHSA with Single-Head Self-Attention (SHSA) proposed by Yun et al. [31]. Instead of computing attention across all channels, SHSA splits the input features along the channel dimension into two branches: an active branch containing $rC$ channels, which performs global spatial association modeling, and a passive branch containing $(1 - r)C$ channels, which skips the attention computation and preserves local spatial information. The two branches are concatenated at the end and fused through a lightweight projection layer. The module structure is shown in Figure 5.

Let the input feature be $X \in \mathbb{R}^{C \times H \times W}$. It is split along the channel dimension into an active branch $X_{a} \in \mathbb{R}^{rC \times H \times W}$ and a passive branch $X_{b} \in \mathbb{R}^{(1 - r)C \times H \times W}$, where $r$ is the channel split ratio and is set to $r = 0.25$. The active branch is first processed by group normalization, and then three groups of 1 × 1 convolutions generate the query $Q$, key $K$, and value $V$, respectively. The spatial dimensions are flattened, and $N = H \times W$ denotes the number of spatial positions. Let $d$ denote the channel dimension of the query and key, which is set to $d = 0.5C$. The single-head attention is computed as:

$$A = \mathrm{softmax}\left(\frac{\tilde{Q}^{T} \tilde{K}}{\sqrt{d}}\right), \quad \tilde{Y}_{a} = \tilde{V} A^{T} \tag{6}$$

where $\tilde{Q} \in \mathbb{R}^{d \times N}$, $\tilde{K} \in \mathbb{R}^{d \times N}$, and $\tilde{V} \in \mathbb{R}^{rC \times N}$ are the flattened query, key, and value tensors, respectively, $A \in \mathbb{R}^{N \times N}$ is the attention matrix, $\tilde{Y}_{a} \in \mathbb{R}^{rC \times N}$ is the aggregated result, and $\sqrt{d}$ is the scaling factor.

The aggregated result $\tilde{Y}_{a}$ is reshaped back to the two-dimensional spatial form, concatenated with the passive branch along the channel dimension, and projected to produce the output:

$$X_{out} = \mathrm{Proj}\big(\mathrm{Concat}(\mathrm{Reshape}(\tilde{Y}_{a}), X_{b})\big) \tag{7}$$

where $\mathrm{Proj}(\cdot)$ consists of a 1 × 1 convolution, batch normalization, and SiLU activation. The batch normalization weights in the projection layer are initialized to zero, so that the module output approximates an identity mapping at the beginning of training, which facilitates stable convergence.
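The attention step of Equation (6) and the channel concatenation inside Equation (7) can be sketched on flattened features as follows. This is a simplified illustration under stated assumptions: group normalization and the final projection are omitted, the softmax is taken over the key positions, the active branch is assumed to be the first $rC$ channels, and `make_qkv` is a hypothetical stand-in for the three 1 × 1 convolutions.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def single_head_attention(q, k, v):
    """q, k: d x N and v: c x N, as lists of rows. Returns c x N."""
    d, n = len(q), len(q[0])
    # A[i][j] = softmax over j of (q[:, i] . k[:, j]) / sqrt(d)
    attn = []
    for i in range(n):
        scores = [sum(q[t][i] * k[t][j] for t in range(d)) / math.sqrt(d)
                  for j in range(n)]
        attn.append(softmax(scores))
    # Y = V A^T, i.e. Y[c][i] = sum over j of v[c][j] * A[i][j]
    return [[sum(row[j] * attn[i][j] for j in range(n)) for i in range(n)]
            for row in v]


def shsa_mix(x, r, make_qkv):
    """x: C x N. Attend over the first r*C channels, pass the rest through."""
    c_active = int(len(x) * r)
    x_a, x_b = x[:c_active], x[c_active:]
    q, k, v = make_qkv(x_a)  # hypothetical replacement for the 1x1 convs
    return single_head_attention(q, k, v) + x_b  # channel concatenation


# Toy feature with C = 4 channels and N = 3 positions.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
out = shsa_mix(x, 0.25, lambda xa: (xa, xa, xa))
print(len(out), out[1:] == x[1:])  # 4 True
```

The passive rows come back unchanged, which is the property the text relies on when it says the passive branch preserves local spatial information.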

The attention computation is applied to only $rC$ channels, and the resulting matrix size is much smaller than that of full-channel multi-head attention. Substituting $d = 0.5C$ and $rC = 0.25C$, the core complexity of the SHSA active branch is $O(0.75 N^{2} C)$, which is lower than the $O(N^{2} C)$ of MHSA. The single-head structure also eliminates the overhead of multi-head splitting and merging [32].
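The 0.75 coefficient follows from adding the two matrix products in Equation (6). The short calculation below counts multiply-accumulate operations for both products; the values $N = 400$ (a 20 × 20 map, assuming a stride of 32 on a 640 × 640 input) and $C = 256$ are illustrative assumptions, not figures from the paper.

```python
def shsa_core_macs(n, c, r=0.25, d_ratio=0.5):
    """Multiply-accumulates of the two matrix products in Eq. (6)."""
    qk = n * n * (d_ratio * c)  # Q^T K: N x N scores, each a length-d dot product
    av = n * n * (r * c)        # V A^T: rC x N outputs, each a length-N sum
    return qk + av


n, c = 400, 256  # assumed example sizes
ratio = shsa_core_macs(n, c) / (n * n * c)
print(ratio)  # 0.75
```

The ratio is independent of $N$ and $C$, so the saving relative to the $N^{2} C$ reference quoted in the text holds at any resolution.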

SH-AIFI is applied only to the highest-level features, where the spatial resolution is the lowest and the computational cost of global attention remains manageable. The active branch establishes spatial associations on compact high-semantic channels, while the passive branch preserves edge and texture information from being diluted. The output features are then fed into the subsequent cross-scale fusion module.

2.3.3. Weighted Reparameterized Partial Aggregation Fusion (WRPAFusion)

In the baseline neck, multi-scale features are fused by channel concatenation, and the fused result is then processed by convolution blocks for feature integration. The concatenation operation causes the channel count at each fusion node to increase instantaneously, and the subsequent convolutions are required to compress the expanded channels, leading to a high computational cost at each node. To reduce the computational burden of the fusion stage, the original fusion modules at all nodes in the neck are replaced with WRPAFusion. The structure is shown in Figure 6. WRPAFusion consists of two parts: the first part is a Fast Normalized Fusion (FNF) operator, which performs weighted summation of different input branches; the second part is three cascaded RPA Blocks, which further transform the fused features. All fusion nodes in the neck adopt this module.

FNF is derived from the fast normalized fusion method proposed by Tan et al. [33]. This method assigns a learnable scalar weight to each input branch, normalizes the weights, and then performs element-wise weighted summation. Compared with softmax-based fusion, FNF achieves comparable accuracy but faster inference, with approximately 30% acceleration on GPUs. The computation process of FNF is illustrated in Figure 7. The learnable fusion weight $w_{i}$ corresponding to each input branch is first constrained to be non-negative and then normalized, yielding a branch-level scalar weight coefficient $\alpha_{i}$. Each feature map is scaled by its corresponding scalar coefficient $\alpha_{i}$, and the weighted results from all branches are summed to produce the fusion output.

Let the input features be $\{X_{i}\}_{i = 1}^{N}$, $X_{i} \in \mathbb{R}^{C \times H \times W}$, where $N$ is the number of branches participating in the fusion. The fusion weights and output of FNF are expressed as:

$$\alpha_{i} = \frac{\mathrm{ReLU}(w_{i})}{\varepsilon + \sum_{j = 1}^{N} \mathrm{ReLU}(w_{j})}, \quad F = \sum_{i = 1}^{N} \alpha_{i} X_{i} \tag{8}$$

where $w_{i}$ is the learnable weight of the $i$-th input branch, $\varepsilon = 10^{-4}$ is a stability constant, $\alpha_{i}$ is the normalized branch-level scalar fusion coefficient, and $F$ is the fusion output. The same formulation is applied to both two-input nodes ($N = 2$) and three-input nodes ($N = 3$), differing only in the number of input branches. Compared with channel concatenation, the contribution of features at different scales is allocated at the branch level by FNF. The channel dimension is not altered by the fusion operation itself, so no channel expansion is introduced as in concatenation-based fusion.
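Equation (8) translates almost directly into code. The sketch below works on flattened feature maps stored as plain lists; the weight values and feature values are arbitrary examples, and in a trained model the weights would be learned parameters.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Eq. (8): ReLU-normalised scalar weights, then a weighted sum.

    features: list of N equally sized flat feature lists.
    weights:  list of N raw scalar weights (learnable in a real model).
    """
    relu = [max(0.0, w) for w in weights]
    denom = eps + sum(relu)
    alphas = [w / denom for w in relu]
    fused = [sum(a * f[i] for a, f in zip(alphas, features))
             for i in range(len(features[0]))]
    return alphas, fused


# Three-input node with illustrative weights; the negative one is clipped to 0.
feats = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
alphas, fused = fast_normalized_fusion(feats, [1.0, -2.0, 3.0])
print(alphas[1], len(fused))  # 0.0 2
```

The output has the same length as each input, which reflects the point made in the text: the weighted sum leaves the channel dimension unchanged, whereas concatenation would multiply it by the number of branches.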

The features output by FNF are fed into the RPA Block for further transformation. The RPA Block adopts a partial aggregation architecture: the input features are processed by two 1 × 1 convolutions to generate a main branch $F_{m}$ and a bypass branch $F_{s}$. The main branch undergoes deep transformation, while the bypass branch is directly preserved. The two branches are concatenated at the end and mapped to the output channels through a 1 × 1 convolution. This strategy of splitting the input into a transformed part and a preserved part before aggregation reduces redundant computation and improves inference efficiency. The structures of the RPA Block and its internal RepPConv Block are shown in Figure 8, where Figure 8a illustrates the main branch and bypass branch architecture, and Figure 8b shows the training and inference configurations of RepPConv.

The main branch cascades three RepPConv blocks. The output is concatenated with the bypass branch and then projected through a 1 × 1 convolution, which is expressed as:

$$Y = W_{o} * \mathrm{Concat}(B^{(3)}(F_{m}), F_{s}) \tag{9}$$

where $W_{o}$ is the 1 × 1 convolution mapping, and $B^{(3)}(\cdot)$ denotes the mapping obtained by sequentially stacking three RepPConv blocks.

The spatial mixing in RepPConv combines partial convolution with structural reparameterization. The partial convolution is inspired by FasterNet [34]. Only a fraction of the input channels undergo standard convolution, while the remaining channels are directly passed through, thereby reducing both the computation and memory access overhead. Let the input of RepPConv be $U \in \mathbb{R}^{C_{h} \times H \times W}$. It is split along the channel dimension into a convolution branch $U_{c} \in \mathbb{R}^{\frac{C_{h}}{4} \times H \times W}$ and a bypass branch $U_{i} \in \mathbb{R}^{\frac{3 C_{h}}{4} \times H \times W}$. The convolution branch occupies $\frac{1}{4}$ of the channels and internally adopts the RepVGG structural reparameterization [35]. During training, three parallel branches are used: a 3 × 3 convolution, a 1 × 1 convolution, and an identity mapping:

$$R_{train}(U_{c}) = \mathrm{BN}_{3}(K_{3} * U_{c}) + \mathrm{BN}_{1}(K_{1} * U_{c}) + \mathrm{BN}_{id}(U_{c}) \tag{10}$$

where $K_{3}$ and $K_{1}$ are the 3 × 3 and 1 × 1 convolution kernels, respectively, and $\mathrm{BN}_{3}(\cdot)$, $\mathrm{BN}_{1}(\cdot)$, and $\mathrm{BN}_{id}(\cdot)$ are the batch normalization layers on each branch. During inference, the above multi-branch structure is folded into an equivalent single 3 × 3 convolution:

$$R_{infer}(U_{c}) = \hat{K} * U_{c} + \hat{b} \tag{11}$$

$$\hat{K} = K_{3}^{'} + \mathrm{Pad}(K_{1}^{'}) + K_{id}^{'}, \quad \hat{b} = b_{3}^{'} + b_{1}^{'} + b_{id}^{'} \tag{12}$$

where $K_{3}^{'}$, $K_{1}^{'}$, and $K_{id}^{'}$ are the convolution kernels of each branch after fusion with batch normalization, $b_{3}^{'}$, $b_{1}^{'}$, and $b_{id}^{'}$ are the corresponding fused biases, $\mathrm{Pad}(\cdot)$ zero-pads the 1 × 1 kernel to 3 × 3, and $\hat{K}$ and $\hat{b}$ are the equivalent parameters used during deployment.
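The folding in Equations (10) to (12) can be verified on a single channel. The sketch below is a simplified illustration, not the authors' code: it uses one channel so that the identity branch reduces to a 3 × 3 kernel with a 1 at its centre, batch normalization is taken in inference mode, and all kernel values and statistics are made up.

```python
EPS = 1e-5  # numerical-stability constant inside batch normalization


def conv3x3_same(x, kernel):
    """Single-channel 3x3 cross-correlation with zero padding."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for a in range(3):
                for b in range(3):
                    y, z = i + a - 1, j + b - 1
                    if 0 <= y < h and 0 <= z < w:
                        out[i][j] += kernel[a][b] * x[y][z]
    return out


def bn_affine(bn):
    """Inference-mode BN written as y -> scale * y + shift."""
    scale = bn["gamma"] / (bn["var"] + EPS) ** 0.5
    return scale, bn["beta"] - bn["mean"] * scale


def pad_1x1(value):
    """Pad(.) in Eq. (12): put a 1x1 weight at the centre of a 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, value, 0.0], [0.0, 0.0, 0.0]]


def rep_train(x, k3, k1, bn3, bn1, bn_id):
    """Eq. (10): three parallel branches, each followed by its own BN."""
    (s3, t3), (s1, t1), (si, ti) = bn_affine(bn3), bn_affine(bn1), bn_affine(bn_id)
    y3 = conv3x3_same(x, k3)
    return [[s3 * y3[i][j] + t3 + s1 * k1 * x[i][j] + t1 + si * x[i][j] + ti
             for j in range(len(x[0]))] for i in range(len(x))]


def rep_fuse(k3, k1, bn3, bn1, bn_id):
    """Eq. (12): BN-scaled kernels are summed, BN shifts become the bias."""
    (s3, t3), (s1, t1), (si, ti) = bn_affine(bn3), bn_affine(bn1), bn_affine(bn_id)
    parts = [(k3, s3), (pad_1x1(k1), s1), (pad_1x1(1.0), si)]
    k_hat = [[sum(s * k[a][b] for k, s in parts) for b in range(3)] for a in range(3)]
    return k_hat, t3 + t1 + ti


# Toy input, kernels, and BN statistics (all hypothetical).
x = [[float((2 * i + 3 * j) % 5 - 2) for j in range(6)] for i in range(6)]
k3 = [[0.1 * (a - b) + 0.05 for b in range(3)] for a in range(3)]
k1 = 0.7
bn3 = {"gamma": 1.1, "beta": 0.2, "mean": 0.1, "var": 0.8}
bn1 = {"gamma": 0.9, "beta": -0.1, "mean": -0.3, "var": 1.2}
bn_id = {"gamma": 1.0, "beta": 0.05, "mean": 0.0, "var": 1.0}

k_hat, b_hat = rep_fuse(k3, k1, bn3, bn1, bn_id)
y_train = rep_train(x, k3, k1, bn3, bn1, bn_id)
y_infer = [[v + b_hat for v in row] for row in conv3x3_same(x, k_hat)]
max_err = max(abs(p - q) for rp, rq in zip(y_train, y_infer) for p, q in zip(rp, rq))
print(max_err < 1e-9)  # True
```

In the multi-channel case the same idea applies per output channel. Because RepPConv convolves only $C_{h}/4$ channels, the fused 3 × 3 convolution needs roughly $(1/4)^{2} = 1/16$ of the multiply-accumulates of a full-width 3 × 3 convolution on $C_{h}$ channels.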

2.4. Evaluation Metrics and Experimental Setup

2.4.1. Evaluation Metrics

Common evaluation metrics for object detection were used to assess model performance. Detection accuracy was measured by Precision (P), Recall (R), mAP@0.5, and mAP@0.5:0.95. Model complexity and deployment cost were characterized by the number of parameters (Params), floating-point operations (FLOPs), and weight size. Params denotes the total number of learnable parameters in the model. FLOPs denotes the number of floating-point operations required for a single forward pass under a fixed input size. Weight size denotes the size of the exported model weight file.

The overlap between a predicted box and a ground-truth box was measured by the Intersection over Union (IoU). Let

b_{p}

denote the predicted box and

b_{g}

denote the ground-truth box. IoU is defined as:

IoU (b_{p}, b_{g}) = \frac{| b_{p} \cap b_{g} |}{| b_{p} \cup b_{g} |}

(13)

A prediction was counted as a true positive (TP) when the predicted class matched the ground-truth class and

IoU (b_{p}, b_{g}) \geq τ

. Predictions that did not satisfy this condition were counted as false positives (FP). Ground-truth objects that were not matched by any predicted box were counted as false negatives (FN). Precision and Recall are defined as

P = \frac{T P}{T P + F P}, R = \frac{T P}{T P + F N}

(14)

When the confidence threshold changes, a precision–recall curve (P-R curve) can be obtained. For a single class, the Average Precision (AP) is defined as the area under this curve:

$$AP = \int_{0}^{1} P(R) \, dR \tag{15}$$
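In practice the integral in Equation (15) is evaluated on a finite set of ranked detections. The sketch below uses all-point interpolation with a monotone precision envelope; this is one widely used convention and is an assumption here, since the text does not state which numerical rule was applied. The function name and the sample data are illustrative.

```python
def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class.

    scores: confidence of each detection
    is_tp:  True where the detection was counted as a true positive
    num_gt: number of ground-truth objects of this class
    """
    if num_gt == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    recalls, precisions = [], []
    tp = fp = 0
    for i in order:  # lowering the confidence threshold one detection at a time
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Monotone envelope: precision at recall r is the best precision at any recall >= r
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas between consecutive recall values
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Four hypothetical detections against four ground-truth objects
ap_example = average_precision(
    scores=[0.9, 0.8, 0.7, 0.6],
    is_tp=[True, False, True, True],
    num_gt=4,
)
```

Because one of the four ground-truth objects is never detected, recall stops at 0.75 and the remaining quarter of the recall axis contributes nothing to the area.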

At a given IoU threshold $\tau$, the mean Average Precision (mAP) is calculated as the average AP over $C$ classes:

$$\mathrm{mAP}(\tau) = \frac{1}{C} \sum_{c=1}^{C} AP_{c}(\tau) \tag{16}$$

Here, mAP@0.5 corresponds to $\tau = 0.50$. mAP@0.5:0.95 is the average mAP over IoU thresholds from 0.50 to 0.95 with a step size of 0.05, and is computed as:

$$\mathrm{mAP@0.5{:}0.95} = \frac{1}{10} \sum_{k=0}^{9} \mathrm{mAP}(\tau_{k}), \quad \tau_{k} = 0.50 + 0.05k \tag{17}$$
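Equations (16) and (17) are two nested averages: first over classes at a fixed threshold, then over the ten thresholds. A minimal sketch is given below; the per-class AP values are synthetic placeholders chosen only to make the arithmetic easy to follow, not results from this study.

```python
def iou_thresholds():
    """tau_k = 0.50 + 0.05 k for k = 0..9, rounded to avoid float drift."""
    return [round(0.50 + 0.05 * k, 2) for k in range(10)]

def mean_ap(ap_per_class):
    """Eq. (16): average AP over the C classes at one IoU threshold."""
    return sum(ap_per_class) / len(ap_per_class)

def map_50_95(ap_table):
    """Eq. (17): average mAP over the ten thresholds.

    ap_table maps each threshold tau to the list of per-class AP values at tau.
    """
    taus = iou_thresholds()
    return sum(mean_ap(ap_table[t]) for t in taus) / len(taus)

taus = iou_thresholds()
# Synthetic two-class example: AP falls linearly as the threshold tightens
ap_table = {t: [1.00 - (t - 0.50), 0.90 - (t - 0.50)] for t in taus}

map_at_50 = mean_ap(ap_table[0.5])
map_at_50_95 = map_50_95(ap_table)
```

Because AP usually falls as the threshold tightens, mAP@0.5:0.95 is lower than mAP@0.5 and is more sensitive to localization quality.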

2.4.2. Experimental Setup

All experiments were conducted on Ubuntu 22.04 using an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA) with 24 GB of memory. The implementation was based on Python 3.10.19, PyTorch 2.1.2, and CUDA 11.8. The model was trained for 200 epochs with a batch size of 16. AdamW was used as the optimizer, with a learning rate of 0.0001, a momentum of 0.9, and a weight decay of 0.0001. The input image size was set to 640 × 640 for training and evaluation. The hardware configuration, software environment, and main parameters used for training and analysis are summarized in Table 1.

3. Experimental Results and Analysis

3.1. Baseline Model Selection

Before developing the improved detector, we first selected a suitable RT-DETR baseline for greenhouse cucumber detection. Four RT-DETR variants were compared, as shown in Table 2. Increasing model scale did not lead to a clear accuracy gain on this dataset. RT-DETR-R50 improved mAP@0.5 by only 0.1 percentage points over RT-DETR-R18 and showed the same mAP@0.5:0.95, but it required substantially more parameters, more FLOPs, and a larger weight file. RT-DETR-R34 and RT-DETR-L also increased model complexity without improving the overall accuracy profile. Considering detection accuracy, model complexity, and later edge deployment, RT-DETR-R18 provided the most suitable baseline. It maintained competitive detection performance while requiring the smallest parameter count, the lowest computational cost, and the lightest weight file among the four candidates. Therefore, RT-DETR-R18 was selected for subsequent improvement.

3.2. Comparative Experiments on Attention Mechanisms

High-level contextual interaction is important when visible fruit cues are spatially separated. To compare different attention strategies, we replaced the original intra-scale interaction module in RT-DETR-R18 with HiLo, MSLA, CGA, Efficient Additive, and SH-AIFI, and evaluated them under identical training and testing settings.

As shown in Table 3, the competing attention modules changed the model size only slightly, but their performance gains were limited or inconsistent. Compared with RT-DETR-R18, HiLo, MSLA, and CGA increased mAP@0.5 by 0.4, 0.5, and 0.6 percentage points, respectively, whereas their mAP@0.5:0.95 decreased by 0.3, 0.3, and 0.7 percentage points. Efficient Additive did not improve the baseline and reduced mAP@0.5 and mAP@0.5:0.95 by 0.2 and 0.7 percentage points, respectively. In contrast, SH-AIFI showed the clearest improvement among the tested attention modules. It reached 91.5% mAP@0.5 (+1.7 percentage points) and 75.3% mAP@0.5:0.95 (+1.0 percentage points) relative to RT-DETR-R18, while also giving the highest precision and recall. At the same time, the parameter count decreased from 19.9 M to 19.7 M, FLOPs dropped from 58.6 G to 57.0 G, and the weight size was reduced from 80.7 MB to 80.0 MB. These results suggest that SH-AIFI strengthens contextual association among high-level features without increasing model complexity. Based on this comparison, SH-AIFI was used in the subsequent experiments.

3.3. Comparison of Backbone Networks

The backbone network determines the fundamental feature extraction capability of the detector and directly governs the trade-off between detection accuracy and computational cost. To evaluate the effectiveness of the proposed GLKAN backbone, we replaced the default ResNet-18 in RT-DETR with nine representative lightweight architectures. The results are summarized in Table 4.

GLKAN gave the strongest overall accuracy among the tested backbones. It reached 92.0% mAP@0.5 and 75.4% mAP@0.5:0.95, increasing the two mAP metrics by 2.2 and 1.1 percentage points over the RT-DETR-R18 baseline. Its precision and recall were also the highest in Table 4. This accuracy gain was obtained with a smaller model. The parameter count decreased from 19.9 M to 15.2 M, a reduction of 23.6%, and FLOPs fell from 58.6 G to 51.6 G. The weight file was reduced to 62.4 MB, which was 22.7% smaller than the baseline. These results suggest that GLKAN improves shallow feature representation without relying on a larger backbone. Among the CNN-based alternatives, UniRepLKNet achieved 90.9% mAP@0.5 and 74.5% mAP@0.5:0.95 with lower computation, but it still lagged behind GLKAN in both mAP metrics. GhostNetv2 matched the baseline mAP@0.5 at a much lower computational cost of 27.8 G FLOPs, although its recall remained lower than that of GLKAN. MobileNetv4 compressed the model to 11.3 M parameters and 46.5 MB, but the accuracy loss was clear, with mAP@0.5 falling to 88.4% and mAP@0.5:0.95 to 71.2%. ConvNeXtv2 and EfficientNetv2 did not provide a better accuracy–efficiency balance. ConvNeXtv2 remained below the baseline in mAP@0.5, while EfficientNetv2 reached only 89.6% mAP@0.5 despite using 21.0 M parameters and an 85.7 MB weight file. StarNet showed the weakest accuracy, with 85.6% mAP@0.5 and 68.6% mAP@0.5:0.95.

The two Transformer-based backbones also failed to offer a more suitable trade-off. EfficientViT was the most compact network, with 10.7 M parameters and 27.2 G FLOPs, but its mAP@0.5 dropped to 88.8% and its recall fell to 86.3%. EfficientFormerv2 produced a small gain over the baseline, reaching 90.2% mAP@0.5 and 74.4% mAP@0.5:0.95. This improvement, however, came with a 99.9 MB weight file, about 1.6 times the size of the 62.4 MB GLKAN weight file. Considering accuracy, model size, and computational cost together, GLKAN provided the most favorable backbone choice for the subsequent experiments.

3.4. Ablation Experiment

To quantify the individual and combined contributions of the three proposed modules, ablation experiments were conducted on RT-DETR-R18. The results are presented in Table 5 and visualized in Figure 9.

Each module, when introduced alone, improved the baseline detector in a different way. With GLKAN, mAP@0.5 increased from 89.8% to 92.0%, and mAP@0.5:0.95 rose to 75.4% (+1.1 percentage points). The parameter count simultaneously dropped from 19.9 M to 15.2 M, and the weight file size was reduced to 62.4 MB, corresponding to decreases of 23.6% and 22.7%, respectively. SH-AIFI mainly improved the balance between classification confidence and target retrieval. It reached 91.5% mAP@0.5 (+1.7 percentage points) and raised recall to 91.0%, while the parameter count remained almost unchanged at 19.7 M. This suggests that the enhanced encoder interaction strengthened high-level feature association without adding model burden. WRPAFusion gave the largest single-module gain on the stricter localization metric. Its mAP@0.5:0.95 reached 75.9%, 1.6 percentage points higher than the baseline, while FLOPs decreased from 58.6 G to 48.3 G.

The two-module settings further show that the three components are complementary. Combining GLKAN with SH-AIFI produced the highest recall among the two-module variants, reaching 91.5%, and increased mAP@0.5 to 92.3%. The GLKAN and WRPAFusion combination gave a stronger gain in localization quality, with mAP@0.5:0.95 increasing from 74.3% to 76.7%. This setting also compressed the model substantially, reducing the weight file from 80.7 MB to 52.0 MB. The SH-AIFI and WRPAFusion setting reached 93.1% mAP@0.5 and 76.4% mAP@0.5:0.95, indicating that the attention and fusion modules can work together effectively even without the redesigned backbone.

The full MCS-DETR configuration produced the strongest result in the ablation study. Compared with the baseline, mAP@0.5 increased from 89.8% to 93.4%, and mAP@0.5:0.95 rose from 74.3% to 76.8%. Precision and recall both remained above 91%. The accuracy gains were accompanied by a clear reduction in model cost. Compared with RT-DETR-R18, the final model used 7.3 M fewer parameters, 14.6 G fewer FLOPs, and a 29.4 MB smaller weight file. These consistent changes show that the three modules improve different parts of the detection pipeline and support each other rather than adding redundant structure.
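The absolute savings quoted above can be converted into relative reductions with a short calculation. The snippet below uses only the baseline figures (19.9 M parameters, 58.6 G FLOPs, 80.7 MB) and the reported differences; the derived totals for MCS-DETR are computed from these rounded values and may differ from Table 5 in the last digit.

```python
# RT-DETR-R18 baseline cost, as reported in the text
baseline = {"params_M": 19.9, "flops_G": 58.6, "weight_MB": 80.7}
# Absolute savings of the full MCS-DETR configuration
saved = {"params_M": 7.3, "flops_G": 14.6, "weight_MB": 29.4}

# Implied cost of MCS-DETR (derived here from rounded inputs)
final = {k: round(baseline[k] - saved[k], 1) for k in baseline}

# Relative reduction in percent: 100 * saved / baseline
reduction_pct = {k: round(100 * saved[k] / baseline[k], 1) for k in baseline}

for key in baseline:
    print(f"{key}: {baseline[key]} -> {final[key]}  ({reduction_pct[key]}% lower)")
```

The resulting reductions of 36.7%, 24.9%, and 36.4% match the relative figures given in the Conclusions.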

After confirming the contribution of the complete model, we further checked the repeatability of the key comparison between RT-DETR-R18 and MCS-DETR. Both models were trained three times under the same data split, augmentation settings, optimizer configuration, and hyperparameter settings. The random seed was fixed at 0 in all three training runs. Across the repeated runs, the maximum run-to-run range did not exceed 0.10 percentage points for mAP@0.5 and 0.20 percentage points for mAP@0.5:0.95. MCS-DETR outperformed RT-DETR-R18 in every run, with average gains of 3.60 and 2.47 percentage points in mAP@0.5 and mAP@0.5:0.95, respectively. Because the same random seed was used in all repeated trainings, this analysis is reported as a repeatability check under a fixed random seed rather than as a significance test based on independent random seeds.

3.5. Comparative Experiments of Different Models

To further validate the proposed MCS-DETR, we compared it against a broad range of mainstream detectors, including both CNN-based and DETR-based architectures. The results are reported in Table 6, and the detection performance and model complexity are visualized in Figure 10a and Figure 10b, respectively.

Among the CNN-based detectors, earlier two-stage and anchor-based models showed limited accuracy or high computational cost on this task. Faster R-CNN, SSD, and RetinaNet remained below 88% mAP@0.5, and SSD had the lowest mAP@0.5:0.95 among all compared models. Cascade R-CNN improved the two mAP metrics to 88.4% and 71.7%, but this came with the largest model burden in Table 6, including 69.2 M parameters and 205 G FLOPs. TOOD reached 88.9% mAP@0.5, yet its localization accuracy was still lower than that of the stronger one-stage YOLO variants. YOLOv8m and YOLOv12m provided a more competitive efficiency profile. YOLOv12m reached 89.8% mAP@0.5 with 20.1 M parameters and 67.1 G FLOPs, while YOLOv8m obtained a slightly higher mAP@0.5:0.95 of 74.0%. MCS-DETR outperformed both YOLO variants. Compared with YOLOv12m, it improved mAP@0.5 by 3.6 percentage points and mAP@0.5:0.95 by 3.1 percentage points, with 37.3% fewer parameters and 34.4% lower FLOPs. Compared with YOLOv8m, MCS-DETR gained 4.0 and 2.8 percentage points in the two mAP metrics and reduced the parameter count by more than half.

Within the DETR family, several models achieved higher accuracy than the early CNN detectors, but most required larger model sizes or higher computation. Deformable DETR remained below RT-DETR-R18 in both mAP metrics despite using 40.1 M parameters and 165 G FLOPs. DDQ-DETR produced competitive accuracy, with 90.4% mAP@0.5 and 74.5% mAP@0.5:0.95, but its computational cost increased to 471 G FLOPs. Conditional DETR and DAB-DETR stayed close to 90% mAP@0.5, although both required more than twice the parameters of RT-DETR-R18. DINO was the strongest competing DETR-based detector, reaching 90.7% mAP@0.5 and 74.7% mAP@0.5:0.95. MCS-DETR exceeded DINO by 2.7 and 2.1 percentage points in the two mAP metrics while using only 26.5% of its parameters and 18.7% of its FLOPs. Compared with the RT-DETR-R18 baseline, MCS-DETR increased mAP@0.5 and mAP@0.5:0.95 by 3.6 and 2.5 percentage points, respectively. This accuracy gain was not obtained by using a larger detector. In terms of model size and computational cost, MCS-DETR remained smaller than the compared CNN-based and DETR-based models, including YOLOv12m and DINO. These comparisons show that MCS-DETR achieved the most favorable accuracy–efficiency trade-off among all tested CNN-based and DETR-based detectors.

Figure 10a presents a radar chart comparing the detection accuracy of all models across precision, recall, mAP@0.5, and mAP@0.5:0.95. The contour of MCS-DETR consistently occupies the outermost region across all four metrics, indicating that its overall detection capability exceeds that of all other compared models. Figure 10b illustrates the relationship between model complexity and weight file size. Each bubble corresponds to a model, with its position determined by parameter count and FLOPs and its diameter proportional to the weight file size. MCS-DETR is located in the lower-left corner with the smallest bubble, confirming that it simultaneously achieves the lowest parameter count, the fewest FLOPs, and the smallest weight file size. Overall, MCS-DETR achieved the highest detection accuracy among all compared models while maintaining the most compact model size and the lowest computational cost, demonstrating its suitability for greenhouse cucumber selective harvesting where both high accuracy and deployment efficiency are essential.

3.6. Visual Demonstration

3.6.1. Visualization of Heatmap

To further examine how different detectors respond to cucumber targets in cluttered greenhouse environments, Grad-CAM++ [36] heatmaps were generated for RT-DETR, YOLOv8m, and the proposed MCS-DETR on the same representative scenes, as shown in Figure 11. Grad-CAM++ back-propagates the gradients of the target class into the last convolutional layer and produces a spatial activation map that highlights the image regions contributing most to the final prediction. Warm colors denote strong activation, while cold colors indicate weak activation. All four samples were drawn from the same greenhouse scene but differ in leaf occlusion severity, background complexity, and the degree to which the fruit is exposed.

RT-DETR covers the main fruit body in most cases, though its high-response area tends to spill onto adjacent leaves and tendrils, suggesting that the encoder cannot fully separate foreground texture from the visually similar canopy. YOLOv8m yields a tighter response along the fruit, but the activation becomes fragmented when the target is partially hidden behind stems of similar color and texture. Neither model maintains a spatially coherent heat signature under varying occlusion conditions. MCS-DETR mitigates these issues to a noticeable extent. Its high-activation zone tracks the elongated fruit contour more faithfully, and the transition from target to background is sharper than what the two baselines achieve. The improvement is particularly visible in samples with heavy foliage overlap, where RT-DETR scatters attention across the canopy and YOLOv8m fragments its response, while MCS-DETR retains a continuous, well-bounded heat region along the fruit body. Residual background activation remains in all three models, which is an expected artifact in dense greenhouse imagery where leaves, stems, and fruit share overlapping spectral and textural properties. These observations indicate that the proposed model produces more spatially focused and discriminative feature representations under complex greenhouse conditions.

3.6.2. Detection Result Visualization

Figure 12 presents the qualitative detection results of RT-DETR, YOLOv8m, and the proposed MCS-DETR under four representative greenhouse scenarios: frontlight, backlight, distant view, and fruit overlap. These samples were selected for the variety of interference they present, including strong illumination changes, dense foliage, partial occlusion, scale variation, and close spacing between adjacent fruits. In the relatively unobstructed regions of each image, all three detectors successfully localize the dominant cucumber targets, confirming that each model possesses a baseline detection capability in real greenhouse conditions. The gaps between them surface once the targets become slender, distant, leaf-occluded, or visually entangled with surrounding stems and textures.

Under frontlight, RT-DETR and YOLOv8m detect the principal fruits but exhibit varying degrees of missed detection on smaller cucumbers near the canopy edge. The backlight scene intensifies this problem. Strong backlighting suppresses local contrast, causing RT-DETR to return noticeably lower confidence scores on targets that fall within shadow regions; YOLOv8m misses some of these targets altogether. MCS-DETR still registers them. In the distant-view sample, fruit size shrinks and the background occupies a larger proportion of the frame. Both RT-DETR and YOLOv8m produce false detections or miss several remote targets obscured by foliage, whereas MCS-DETR correctly identifies all ground-truth instances. The fruit-overlap scenario tells a similar story. MCS-DETR achieves clearly better detection performance in this setting, with no false or missed detections observed. Overall, the visual evidence indicates that MCS-DETR delivers stronger detection performance when confronted with illumination disturbance, occlusion, and scale change.

3.7. Model Deployment

To evaluate the edge-deployment performance of MCS-DETR, inference experiments were carried out on an NVIDIA Jetson Orin NX Super, as shown in Figure 13. This platform offers a favorable trade-off between power budget and on-device computing power, making it a practical choice for real-time visual perception in greenhouse scenarios. All test images were greenhouse cucumber frames resized to 640 × 640. Three configurations were benchmarked: RT-DETR-R18 under the PyTorch backend, MCS-DETR under the same backend, and MCS-DETR accelerated with TensorRT. Results are reported in Table 7.

Inference cost was quantified by the average end-to-end latency per image. Let $t_{\mathrm{pre}}$, $t_{\mathrm{inf}}$, and $t_{\mathrm{post}}$ denote the preprocessing, network forward, and postprocessing times, respectively. The total inference latency is then defined as:

$$T = t_{\mathrm{pre}} + t_{\mathrm{inf}} + t_{\mathrm{post}} \tag{18}$$

and the corresponding frame rate is given by:

$$\mathrm{FPS} = \frac{1000}{T} \tag{19}$$

where $T$ is measured in milliseconds. As shown in Table 7, MCS-DETR running under PyTorch reached 16.9 FPS, compared with 13.3 FPS for the baseline RT-DETR-R18. With TensorRT optimization enabled, the frame rate of MCS-DETR further increased to 26.3 FPS. These results indicate that the proposed MCS-DETR maintains favorable deployability on resource-constrained embedded hardware while satisfying the real-time response and detection accuracy demands of selective cucumber harvesting in greenhouses.
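Equations (18) and (19) can be inverted to express each reported frame rate as a per-image latency budget. The helper names below are illustrative, and the per-stage split in the first example is a made-up placeholder, since only the resulting FPS values are given in the text.

```python
def fps_from_latency_ms(t_pre, t_inf, t_post):
    """Eq. (18)-(19): frame rate from per-stage latencies in milliseconds."""
    total_ms = t_pre + t_inf + t_post
    return 1000.0 / total_ms

def latency_ms_from_fps(fps):
    """Inverse of Eq. (19): end-to-end latency per image in milliseconds."""
    return 1000.0 / fps

# Hypothetical stage split summing to 40 ms, for illustration only
example_fps = fps_from_latency_ms(t_pre=4.0, t_inf=32.0, t_post=4.0)

# Frame rates reported for the Jetson Orin NX Super
reported_fps = {
    "RT-DETR-R18 (PyTorch)": 13.3,
    "MCS-DETR (PyTorch)": 16.9,
    "MCS-DETR (TensorRT)": 26.3,
}
latency_budget_ms = {name: round(latency_ms_from_fps(f), 1) for name, f in reported_fps.items()}

for name, ms in latency_budget_ms.items():
    print(f"{name}: {ms} ms per image")
```

Read this way, TensorRT acceleration shortens the end-to-end time per image from roughly 59 ms to roughly 38 ms for MCS-DETR.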

4. Discussion

Recent agricultural detection studies have shown that RT-DETR and other Transformer-based detectors can be adapted to crop detection tasks through task-specific structural redesign. Previous studies have reduced model complexity in complex field environments [22], achieved high mAP and inference speed with GP-DETR [23], and reduced model complexity or model size in improved RT-DETR variants [26,27]. These findings indicate that crop-specific detector redesign is often more valuable than simply increasing model scale. MCS-DETR follows this direction but addresses a different application condition. It was developed for greenhouse cucumber harvesting, where the detector needs to preserve shallow contour cues, capture long-range visual associations, and maintain efficient multi-scale fusion within a compact model. The main contribution of MCS-DETR is therefore to combine accuracy improvement with reduced model cost for an on-device harvesting scenario.

The experimental results further show that different components contributed to detection accuracy or model efficiency in different ways, and the complete model achieved the strongest overall performance. This pattern is consistent with previous agricultural detection studies that used backbone redesign, attention refinement, or feature fusion to address crop-specific visual difficulty. The difference is that the present work integrates these design directions into one compact detector. For greenhouse cucumber detection, this integration is important because reliable recognition under partial visibility requires the joint use of boundary information, contextual association, and multi-scale representation.

A detector used in a harvesting robot must provide target information quickly enough for downstream localization, motion planning, and control. Missed detections may leave mature fruits unpicked, while false detections may introduce invalid grasping targets. In this sense, model efficiency is not only a computational metric but also a requirement for field operation. MCS-DETR reached 26.3 FPS on the NVIDIA Jetson Orin NX Super after TensorRT acceleration, suggesting that the proposed detector can serve as an on-device perception module for real-time greenhouse cucumber detection. This result should be interpreted as evidence of deployment feasibility at the perception level. It does not yet demonstrate the performance of a complete harvesting robot, because closed-loop picking also depends on depth sensing, spatial localization, grasp planning, end-effector control, and system-level coordination.

The dataset also limits how far the current findings can be generalized. Although the image set covers common disturbance conditions, such as lighting variation, different viewing distances, fruit overlap, and leaf and stem occlusion, it was collected from a single-site and single-cultivar greenhouse scenario. This setting is useful for evaluating the proposed detector under a representative protected-cultivation condition, but it cannot fully describe the variation caused by different cultivars, greenhouse structures, seasons, or cultivation practices. The detector may also be transferable to other crops with elongated fruit morphology, but such transfer should be validated separately because canopy density, fruit color, fruit surface texture, and training systems may change the visual cues used by the detector. Future work should therefore extend the dataset across cultivars and greenhouse environments and evaluate the detector in closed-loop robotic harvesting trials.

5. Conclusions

This study proposed MCS-DETR for harvestability recognition of greenhouse cucumbers. The model achieved 93.4% mAP@0.5 and 76.8% mAP@0.5:0.95. Compared with RT-DETR-R18, the two mAP metrics increased by 3.6 and 2.5 percentage points, while the number of parameters, FLOPs, and weight size decreased by 36.7%, 24.9%, and 36.4%, respectively. After TensorRT acceleration, MCS-DETR reached 26.3 FPS on the NVIDIA Jetson Orin NX Super. These results indicate that the model can provide an efficient visual perception module for real-time greenhouse cucumber detection. This work may also serve as a reference for agricultural vision tasks involving complex backgrounds, frequent occlusion, and elongated targets.

This study still has several limitations. The current dataset is not sufficient to support a broad claim of cross-cultivar or cross-greenhouse generalization, so further validation is needed under different cultivar and greenhouse conditions. The present work focuses on image-based detection, while selective harvesting also depends on depth sensing, spatial localization, grasp planning, and end-effector control. The deployment experiment verified inference speed, but it did not evaluate picking success rate, cycle time, or long-term operational stability in closed-loop operation. Future work will address these limitations in three directions. The model will first be tested under more diverse greenhouse environments, cultivar conditions, and imaging settings to improve cross-scene generalization. The current detection framework will then be extended toward joint perception of target detection, harvestability assessment, and spatial localization. The detector will finally be deployed in a robotic harvesting system and evaluated through closed-loop picking operations.

Author Contributions

Conceptualization, L.R. and Z.T.; methodology, W.Z.; software, F.S.; validation, L.R.; formal analysis, H.L.; investigation, F.S.; resources, Z.T.; data curation, W.Z.; writing—original draft preparation, W.Z.; writing—review and editing, F.D.; visualization, W.Z.; supervision, Z.T.; project administration, L.R.; funding acquisition, C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Development Program of Jilin Province (Grant No. 20230202077NC). The APC was funded by Zhimin Tong.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code and dataset supporting the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

Abbreviation	Full term
AP	Average Precision
BN	Batch Normalization
CCFM	Cross-scale Feature Fusion Module
DETR	Detection Transformer
DRB	Dilated Re-param Block
DWConv	Depthwise Convolution
FFN	Feed-Forward Network
FNF	Fast Normalized Fusion
FLOPs	Floating Point Operations
FPS	Frames Per Second
GLK Block	Gated Large Kernel Block
GLKAN	Gated Large Kernel Aggregation Network
HSV	Hue, Saturation, Value
IoU	Intersection over Union
LarK Block	Large Kernel Block
mAP	mean Average Precision
MHSA	Multi-Head Self-Attention
RPA Block	Reparameterized Partial Aggregation Block
RepPConv Block	Reparameterized Partial Convolution Block
RT-DETR	Real-Time Detection Transformer
SH-AIFI	Single-Head Attention-based Intra-scale Feature Interaction
SHSA	Single-Head Self-Attention
WRPAFusion	Weighted Reparameterized Partial Aggregation Fusion

References

1. Liu, W.; Sun, H.; Xia, Y.; Kang, J. Real-Time Cucumber Target Recognition in Greenhouse Environments Using Color Segmentation and Shape Matching. Appl. Sci. 2024, 14, 1884.
2. Liu, H.; Yin, C.; Gao, Z.; Hou, L. Evaluation of cucumber yield, economic benefit and water productivity under different soil matric potentials in solar greenhouses in North China. Agric. Water Manag. 2021, 243, 106442.
3. Ahmed, A.; Zhang, Z.; Manzoor, S.H.; Abdelhamid, M.A.; Gul, N.; Ahmed, R.; Mhamed, M.; Sun, R.; Hao, C.; Huo, W.; et al. Cucumber picking robots: Technological progress, challenges, and future directions. Smart Agric. Technol. 2026, 13, 101813.
4. Park, Y.; Seol, J.; Pak, J.; Jo, Y.; Kim, C.; Son, H.I. Human-centered approach for an efficient cucumber harvesting robot system: Harvest ordering, visual servoing, and end-effector. Comput. Electron. Agric. 2023, 212, 108116.
5. Huang, Y.; Xu, S.; Chen, H.; Li, G.; Dong, H.; Yu, J.; Zhang, X.; Chen, R. A review of visual perception technology for intelligent fruit harvesting robots. Front. Plant Sci. 2025, 16, 1646871.
6. Fernández, R.; Montes, H.; Surdilovic, J.; Surdilovic, D.; Gonzalez-De-Santos, P.; Armada, M. Automatic Detection of Field-Grown Cucumbers for Robotic Harvesting. IEEE Access 2018, 6, 35512–35527.
7. Sun, G.; Li, Y.; Wang, X.; Hu, G.; Wang, X.; Zhang, Y. Image segmentation algorithm for greenhouse cucumber canopy under various natural lighting conditions. Int. J. Agric. Biol. Eng. 2016, 9, 130–138.
8. Li, D.; Zhao, H.; Zhao, X.; Gao, Q.; Xu, L. Cucumber Detection Based on Texture and Color in Greenhouse. Int. J. Pattern Recognit. Artif. Intell. 2017, 31, 1754016.
9. Bao, G.; Cai, S.; Qi, L.; Xun, Y.; Zhang, L.; Yang, Q. Multi-template matching algorithm for cucumber recognition in natural environment. Comput. Electron. Agric. 2016, 127, 754–762.
10. Mao, S.; Li, Y.; Ma, Y.; Zhang, B.; Zhou, J.; Kai, W. Automatic cucumber recognition algorithm for harvesting robots in the natural environment using deep learning and multi-feature fusion. Comput. Electron. Agric. 2020, 170, 105254.
11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788.
13. Wang, N.; Qian, T.; Yang, J.; Li, L.; Zhang, Y.; Zheng, X.; Xu, Y.; Zhao, H.; Zhao, J. An Enhanced YOLOv5 Model for Greenhouse Cucumber Fruit Recognition Based on Color Space Features. Agriculture 2022, 12, 1556.
14. Su, L.; Sun, H.; Zhang, S.; Lu, X.; Wang, R.; Wang, L.; Wang, N. Cucumber Picking Recognition in Near-Color Background Based on Improved YOLOv5. Agronomy 2023, 13, 2062.
15. Liu, X.; Zhao, D.; Jia, W.; Ji, W.; Ruan, C.; Sun, Y. Cucumber Fruits Detection in Greenhouses Based on Instance Segmentation. IEEE Access 2019, 7, 139635–139642.
16. Bai, Y.; Guo, Y.; Zhang, Q.; Cao, B.; Zhang, B. Multi-network fusion algorithm with transfer learning for green cucumber segmentation and recognition under complex natural environment. Comput. Electron. Agric. 2022, 194, 106789.
17. Kim, S.; Hong, S.-J.; Ryu, J.; Kim, E.; Lee, C.-H.; Kim, G. Application of amodal segmentation on cucumber segmentation and occlusion recovery. Comput. Electron. Agric. 2023, 210, 107847.
18. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 29, Barcelona, Spain, 5–10 December 2016.
19. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer Nature: Cham, Switzerland, 2020; pp. 213–229.
20. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
21. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: New York, NY, USA, 2024; pp. 16965–16974.
22. Jin, X.; Zhang, J.; Wang, F.; Zhao, M.; Wang, Y.; Yang, J.; Wu, J.; Zhou, B. PHRF-RTDETR: A lightweight weed detection method for upland rice based on RT-DETR. Front. Plant Sci. 2025, 16, 1556275.
23. Li, T.; Xue, J.; Wei, M.; Yuan, X.; Wang, X.; Zhang, Z.; Wu, Z.; Sun, Y.; Zhang, T.; Cheng, K. GP-DETR: A lightweight real-time intelligent model for near-background color pepper detection in complex agricultural environments. Smart Agric. Technol. 2025, 12, 101219.
24. Song, S.; Qu, H.; Wang, S.; Yang, H.; Hao, Y.; Zhang, G. RGHD: A Risk-Gated Harvestability Decision Framework for Occlusion-Aware Greenhouse Melon Harvesting. Agriculture 2026, 16, 589.
25. Huang, Z.; Zhang, X.; Wang, H.; Wei, H.; Zhang, Y.; Zhou, G. Pear Fruit Detection Model in Natural Environment Based on Lightweight Transformer Architecture. Agriculture 2025, 15, 24.
26. Wang, S.; Dong, S.; Chen, S.; Liu, M. An efficient scale-aware model based on the improved RT-DETR for pomegranate growth stage detection. Neurocomputing 2025, 647, 130462.
27. Wu, M.; Qiu, Y.; Wang, W.; Su, X.; Cao, Y.; Bai, Y. Improved RT-DETR and its application to fruit ripeness detection. Front. Plant Sci. 2025, 16, 1423682.
28. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 1571–1580.
29. Yu, W.; Wang, X. MambaOut: Do We Really Need Mamba for Vision? In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; IEEE: New York, NY, USA, 2025; pp. 4484–4496.
30. Ding, X.; Zhang, Y.; Ge, Y.; Zhao, S.; Song, L.; Yue, X.; Shan, Y. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: New York, NY, USA, 2024; pp. 5513–5524.
31. Yun, S.; Ro, Y. SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: New York, NY, USA, 2024; pp. 5756–5767.
32. Michel, P.; Levy, O.; Neubig, G. Are sixteen heads really better than one? In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–14 December 2019.
33. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 10778–10787.
34. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 12021–12031.
35. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets Great Again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 13728–13737.
Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: New York, NY, USA, 2018; pp. 839–847. [Google Scholar]

Figure 1. Examples of dataset images under different conditions: (a) Frontlight; (b) Backlight; (c) Distant view; (d) Multiple targets; (e) Fruit overlap; (f) Leaf and stem occlusion.

Figure 2. The overall architecture of MCS-DETR. Red and yellow bounding boxes in the inset image denote harvestable and non-harvestable cucumbers, respectively.

Figure 3. Structure of the Gated Large Kernel Aggregation Network (GLKAN): (a) GLKAN; (b) GLK Block, Gated Large Kernel Block.

Figure 4. Structure of the Large Kernel Block (LarK Block) used in the convolution branch: (a) LarK Block; (b) Dilated Re-param Block (DRB).

Figure 5. Structure of the Single-Head Attention-based Intra-scale Feature Interaction (SH-AIFI).

Figure 6. Overall structure of the Weighted Reparameterized Partial Aggregation Fusion (WRPAFusion).

Figure 7. Computation flow of the Fast Normalized Fusion (FNF) operator.
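
Fast normalized fusion was introduced with BiFPN in EfficientDet (Tan et al.), where each input feature receives a learnable scalar weight that is clipped to be non-negative and then divided by the sum of all weights, avoiding a softmax. The following is a minimal sketch of that published definition on plain Python lists. The function name `fast_normalized_fusion`, the flat-list feature representation, and the default `eps` are illustrative assumptions, and the way the operator is wired inside WRPAFusion may differ.

```python
from typing import List, Sequence


def fast_normalized_fusion(
    features: Sequence[Sequence[float]],
    raw_weights: Sequence[float],
    eps: float = 1e-4,  # assumption: small constant for numerical stability
) -> List[float]:
    """Fuse equally sized feature vectors: out = sum_i(w_i * x_i) / (eps + sum_j w_j)."""
    if len(features) != len(raw_weights):
        raise ValueError("one weight per input feature is required")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all input features must have the same length")

    # ReLU keeps every fusion weight non-negative.
    weights = [max(0.0, w) for w in raw_weights]
    denom = eps + sum(weights)

    # Weighted element-wise sum, normalized by the total weight.
    return [
        sum(w * f[i] for w, f in zip(weights, features)) / denom
        for i in range(length)
    ]


# Two toy "feature maps", flattened to vectors, fused with equal weights.
fused = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], eps=0.0)
print(fused)
```

With equal weights and `eps=0.0` the result is the element-wise mean of the inputs; a negative raw weight is clipped to zero, so that input is ignored.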

Figure 8. Structure of the Reparameterized Partial Aggregation Block (RPA Block): (a) RPA Block; (b) RepPConv Block, Reparameterized Partial Convolution Block.

Figure 9. Performance and complexity comparison under different ablation settings.

Figure 10. Comparison of detection performance and model complexity among different models: (a) Comparison of Precision, Recall, mAP@0.5, and mAP@0.5:0.95 for different models; (b) Comparison of model complexity in terms of parameters, FLOPs, and weight size for different models.

Figure 11. Heatmap comparison among different models.

Figure 12. Comparison of detection results of different models under different scenarios: (a) Frontlight; (b) Backlight; (c) Distant view; (d) Fruit overlap.

Figure 13. Edge deployment of MCS-DETR on the NVIDIA Jetson Orin NX Super platform.

Table 1. Experimental setup and training configuration.

Category	Item	Value
Hardware	GPU	NVIDIA GeForce RTX 4090 (24 GB)
Hardware	CPU	16 vCPU Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60 GHz (Intel Corporation, 2200 Mission College Blvd., Santa Clara, CA 95054, USA)
Software	Operating system	Ubuntu 22.04
Software	Programming language	Python 3.10.19
Software	Deep learning framework	PyTorch 2.1.2
Software	CUDA version	CUDA 11.8
Hyperparameters	Optimizer	AdamW
Hyperparameters	Learning rate	0.0001
Hyperparameters	Weight decay	0.0001
Hyperparameters	Batch size	16
Hyperparameters	Momentum	0.9
Hyperparameters	Training epochs	200
Hyperparameters	Input size	640 × 640
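
The hyperparameters in Table 1 can be gathered into a single configuration object, which makes a training run easier to reproduce. The sketch below is illustrative only: the class name `TrainConfig`, its field names, and the helper `adamw_kwargs` are hypothetical. Table 1 lists a momentum of 0.9 without saying how it is applied to AdamW, so mapping it to the first-moment coefficient, and pairing it with the PyTorch default of 0.999 for the second coefficient, are both assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Training settings transcribed from Table 1 (field names are hypothetical)."""
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4
    weight_decay: float = 1e-4
    batch_size: int = 16
    momentum: float = 0.9
    epochs: int = 200
    input_size: Tuple[int, int] = (640, 640)

    def adamw_kwargs(self) -> Dict[str, Any]:
        # Assumption: "Momentum 0.9" is used as beta1; beta2 = 0.999 is the
        # PyTorch default and is not stated in the table.
        return {
            "lr": self.learning_rate,
            "weight_decay": self.weight_decay,
            "betas": (self.momentum, 0.999),
        }


cfg = TrainConfig()
# These keyword arguments match torch.optim.AdamW(params, lr=..., betas=..., weight_decay=...).
print(cfg.adamw_kwargs())
```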

Table 2. Comparison of different RT-DETR models.

Model	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Params (M)	FLOPs (G)	Weight Size (MB)
RT-DETR-R18	90.7	89.9	89.8	74.3	19.9	58.6	80.7
RT-DETR-R34	87.9	90.9	89.6	74.1	31	88.8	125.7
RT-DETR-R50	89.3	89.5	89.9	74.3	42	129.5	171.8
RT-DETR-L	89	89.5	89.3	73.7	32	103.4	131.9

Table 3. Comparison of different attention modules.

Attention module	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Params (M)	FLOPs (G)	Weight Size (MB)
RT-DETR-R18	90.7	89.9	89.8	74.3	19.9	58.6	80.7
HiLo	90.6	88.7	90.2	74	19.8	57.1	80.6
MSLA	90.1	90.1	90.3	74	19.7	57.1	80
CGA	90.9	90.8	90.4	73.6	19.7	57	80.1
Efficient Additive	90.7	89.9	89.6	73.6	19.8	57.2	80.7
SH-AIFI (Ours)	91.9	91	91.5	75.3	19.7	57	80

Table 4. Comparison of different backbone networks.

Backbone	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Params (M)	FLOPs (G)	Weight Size (MB)
RT-DETR-R18	90.7	89.9	89.8	74.3	19.9	58.6	80.7
MobileNetv4	87.8	88	88.4	71.2	11.3	39.5	46.5
UniRepLKNet	91.1	89.7	90.9	74.5	12.7	33.4	53.1
EfficientViT	88.4	86.3	88.8	71.9	10.7	27.2	44.6
GhostNetv2	89.6	87.4	89.8	72.4	11.7	27.8	48.1
EfficientFormerv2	89.5	90	90.2	74.4	11.8	29.4	99.9
StarNet	86.5	86.2	85.6	68.6	12	31.8	49.2
EfficientNetv2	89.3	88.7	89.6	72.7	21	53.8	85.7
ConvNeXtv2	90.3	86.6	89.1	72.7	12.3	31.9	50.3
GLKAN (Ours)	91.2	90.7	92	75.4	15.2	51.6	62.4

Table 5. Ablation results for different module settings. A denotes GLKAN, B denotes SH-AIFI, and C denotes WRPAFusion. The symbol √ indicates that the corresponding module is used.

Setting	A	B	C	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Params (M)	FLOPs (G)	Weight Size (MB)
1				90.7	89.9	89.8	74.3	19.9	58.6	80.7
2	√			91.2	90.7	92	75.4	15.2	51.6	62.4
3		√		91.9	91	91.5	75.3	19.7	57	80
4			√	92.3	90.1	92	75.9	17.4	48.3	70
5	√	√		90.9	91.5	92.3	75.5	15.0	51.7	61.7
6	√		√	91.4	90.9	93	76.7	12.8	43.9	52
7		√	√	91.6	91.3	93.1	76.4	17.2	48.4	69.3
8	√	√	√	92.6	91.1	93.4	76.8	12.6	44	51.3
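
The net effect of combining all three modules can be read off by comparing Setting 1 (the RT-DETR-R18 baseline) with Setting 8 in Table 5. The short script below works through that arithmetic; the dictionary keys and function names are illustrative, and the numbers are copied from the table.

```python
# Setting 1 (baseline) and Setting 8 (GLKAN + SH-AIFI + WRPAFusion) from Table 5.
baseline = {"map50": 89.8, "map50_95": 74.3, "params_m": 19.9, "flops_g": 58.6, "weight_mb": 80.7}
full = {"map50": 93.4, "map50_95": 76.8, "params_m": 12.6, "flops_g": 44.0, "weight_mb": 51.3}


def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


def summarize(base: dict, new: dict) -> dict:
    """Accuracy gains in percentage points, cost reductions in percent."""
    return {
        "map50_gain_pp": round(new["map50"] - base["map50"], 1),
        "map50_95_gain_pp": round(new["map50_95"] - base["map50_95"], 1),
        "params_reduction_pct": round(relative_reduction(base["params_m"], new["params_m"]), 1),
        "flops_reduction_pct": round(relative_reduction(base["flops_g"], new["flops_g"]), 1),
        "weight_reduction_pct": round(relative_reduction(base["weight_mb"], new["weight_mb"]), 1),
    }


summary = summarize(baseline, full)
for name, value in summary.items():
    print(f"{name}: {value}")
```

On these figures, mAP@0.5 rises by 3.6 points and mAP@0.5:0.95 by 2.5 points, while parameters fall by about 36.7%, FLOPs by about 24.9%, and weight size by about 36.4%.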

Table 6. Performance comparison of different models.

Model	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Params (M)	FLOPs (G)	Weight Size (MB)
CNN-based models
Faster R-CNN	83.8	84.6	85.3	63.6	41.2	178	158
SSD	84.4	82.1	85.5	62.6	24.5	87.7	93.6
RetinaNet	83.3	86	87.1	66.3	36.4	174	138.9
Cascade R-CNN	87.8	86.1	88.4	71.7	69.2	205	264.1
TOOD	88.5	83.7	88.9	72.5	32	168	122.4
YOLOv8m	89.1	84.5	89.4	74	25.8	78.7	99.1
YOLOv12m	88.2	85.3	89.8	73.7	20.1	67.1	77.5
DETR-based models
Deformable DETR	88.8	84.7	88.2	70.8	40.1	165	155.8
DDQ-DETR	89.2	90.1	90.4	74.5	48.3	471	187.7
Conditional DETR	90.4	88.9	89.7	74	43.4	85.6	166.1
DAB-DETR	89	87.9	89.9	74.1	43.7	86.9	167.1
DINO	90	89.1	90.7	74.7	47.5	235	181.7
RT-DETR-R18	90.7	89.9	89.8	74.3	19.9	58.6	80.7
MCS-DETR (Ours)	92.6	91.1	93.4	76.8	12.6	44	51.3
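
One way to read Table 6 is as a dominance check: a model dominates another if it is at least as good on every criterion and strictly better on at least one. The sketch below applies that check to four of the table's columns (mAP@0.5, mAP@0.5:0.95, parameters, FLOPs). The function name `dominates` and the tuple layout are illustrative choices, and the values are copied from the table.

```python
from typing import List, Tuple

# (name, mAP@0.5, mAP@0.5:0.95, Params (M), FLOPs (G)), transcribed from Table 6.
Row = Tuple[str, float, float, float, float]
MODELS: List[Row] = [
    ("Faster R-CNN", 85.3, 63.6, 41.2, 178.0),
    ("SSD", 85.5, 62.6, 24.5, 87.7),
    ("RetinaNet", 87.1, 66.3, 36.4, 174.0),
    ("Cascade R-CNN", 88.4, 71.7, 69.2, 205.0),
    ("TOOD", 88.9, 72.5, 32.0, 168.0),
    ("YOLOv8m", 89.4, 74.0, 25.8, 78.7),
    ("YOLOv12m", 89.8, 73.7, 20.1, 67.1),
    ("Deformable DETR", 88.2, 70.8, 40.1, 165.0),
    ("DDQ-DETR", 90.4, 74.5, 48.3, 471.0),
    ("Conditional DETR", 89.7, 74.0, 43.4, 85.6),
    ("DAB-DETR", 89.9, 74.1, 43.7, 86.9),
    ("DINO", 90.7, 74.7, 47.5, 235.0),
    ("RT-DETR-R18", 89.8, 74.3, 19.9, 58.6),
    ("MCS-DETR", 93.4, 76.8, 12.6, 44.0),
]


def dominates(a: Row, b: Row) -> bool:
    """True if `a` is no worse than `b` everywhere and strictly better somewhere."""
    # Higher is better for the two mAP columns, lower is better for cost columns.
    no_worse = a[1] >= b[1] and a[2] >= b[2] and a[3] <= b[3] and a[4] <= b[4]
    strictly_better = a[1] > b[1] or a[2] > b[2] or a[3] < b[3] or a[4] < b[4]
    return no_worse and strictly_better


# Models that are not dominated by any other model form the Pareto front.
pareto_front = [
    m[0] for m in MODELS
    if not any(dominates(other, m) for other in MODELS if other is not m)
]
print(pareto_front)
```

On these four criteria, MCS-DETR is the only non-dominated entry, since it has the highest accuracy and the lowest cost in every listed column.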

Table 7. Edge deployment inference performance of RT-DETR-R18 and MCS-DETR for greenhouse cucumber detection on NVIDIA Jetson Orin NX Super.

Model	Inference Backend	Input Size	FPS
RT-DETR-R18	PyTorch	640 × 640	13.3
MCS-DETR	PyTorch	640 × 640	16.9
MCS-DETR	TensorRT	640 × 640	26.3
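
The frame rates in Table 7 can be converted to per-frame latency and relative speedups. The sketch below assumes that FPS is the reciprocal of the mean end-to-end time per frame, which the table does not state explicitly; the function and variable names are illustrative.

```python
# FPS values on the Jetson Orin NX Super, copied from Table 7 (640 x 640 input).
fps = {
    ("RT-DETR-R18", "PyTorch"): 13.3,
    ("MCS-DETR", "PyTorch"): 16.9,
    ("MCS-DETR", "TensorRT"): 26.3,
}


def latency_ms(frames_per_second: float) -> float:
    """Mean time per frame in milliseconds, assuming latency = 1 / FPS."""
    return 1000.0 / frames_per_second


def speedup(new_fps: float, old_fps: float) -> float:
    """How many times faster `new_fps` is than `old_fps`."""
    return new_fps / old_fps


for (model, backend), value in fps.items():
    print(f"{model} ({backend}): {value} FPS, {latency_ms(value):.1f} ms/frame")

# Gain from the architecture alone (same PyTorch backend).
arch_speedup = speedup(fps[("MCS-DETR", "PyTorch")], fps[("RT-DETR-R18", "PyTorch")])
# Gain from TensorRT acceleration of the same model.
trt_speedup = speedup(fps[("MCS-DETR", "TensorRT")], fps[("MCS-DETR", "PyTorch")])
print(f"architecture: {arch_speedup:.2f}x, TensorRT: {trt_speedup:.2f}x")
```

Under that assumption, the per-frame time drops from roughly 75.2 ms (RT-DETR-R18, PyTorch) to 59.2 ms (MCS-DETR, PyTorch) and 38.0 ms (MCS-DETR, TensorRT), corresponding to speedups of about 1.27x from the architecture and 1.56x from TensorRT.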

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Rong, L.; Zhang, W.; Sun, F.; Liu, H.; Cai, C.; Ding, F.; Tong, Z. MCS-DETR: An Efficient Multi-Scale Context-Aware Detection Model for the Selective Harvesting of Greenhouse Cucumbers. Appl. Sci. 2026, 16, 5530. https://doi.org/10.3390/app16115530

