1. Introduction
Intelligent sensing technology is becoming a key support in smart agriculture, with wide applications in crop growth monitoring, fruit ripeness detection, and automated harvesting. Tomatoes are among the most important and widely cultivated vegetables worldwide, prized for their rich nutrients and flavor. The ripeness of tomatoes not only directly affects post-harvest quality but also plays a crucial role in their market value. Traditional manual harvesting methods are highly subjective and inefficient. These methods not only result in significant labor costs but also lack scalability, making it difficult to meet the demands of large-scale tomato production [1]. This inefficiency hinders accurate assessment of ripeness, ultimately compromising the quality and economic value of the harvested tomatoes. As a result, there is a pressing need to develop precise and automated tomato ripeness detection algorithms that can improve harvesting efficiency while reducing operational costs.
In practice, however, tomato ripeness detection in open-field cultivation remains a challenging task: high planting density, severe leaf occlusion, and the clustered distribution of fruits often degrade detection accuracy. Moreover, lighting conditions in field environments are complex and highly variable across different times of day; as a result, sensors often struggle to capture images with consistent illumination. Therefore, developing a high-precision intelligent sensing model that can adapt to complex lighting conditions and perform effectively under occlusion is of great significance for promoting the intelligent and precise harvesting of tomatoes.
To tackle these challenges, researchers have explored a range of tomato detection models. In [2], the authors introduced a multi-head self-attention mechanism to enhance YOLOv8’s feature extraction under complex lighting, thereby improving its robustness to illumination changes. In [3], YOLOv4 was combined with Hue, Saturation, and Value (HSV) color space features to effectively detect ripe tomatoes in natural environments. In [4], the authors proposed a foreground–foreground class balance method and an improved YOLOv8s network, enhancing the detection performance of tomato fruits in complex environments. While these approaches show promising results under controlled conditions, they encounter significant limitations in real-world agricultural settings, where complex occlusions and lighting variations often lead to inconsistent image quality. Consequently, conventional CNN-based models still struggle with detection stability, frequently producing missed or false detections.
Given the limitations of traditional convolutional architectures, transformer-based models have recently gained traction in object detection research. DEtection TRansformer (DETR) [5] pioneered the introduction of the transformer architecture [6] into object detection, breaking away from the limitations of traditional hand-designed components. Compared to convolutional neural networks with limited receptive fields, DETR shows stronger feature extraction and global perception capabilities when dealing with field images captured by sensors under varying lighting conditions and complex backgrounds. Building on [5], several improved versions have been proposed to further enhance performance. For example, Deformable-DETR [7] introduces multi-scale deformable attention to better focus on key regions, DAB-DETR [8] improves spatial localization through query initialization based on bounding box coordinates, and DN-DETR [9] applies a denoising training strategy to improve the robustness of matching. These enhancements have significantly improved the training efficiency and detection accuracy of DETR-based models. However, the limited fine-grained perception of object boundaries in DETR must still be addressed, especially in images with heavy occlusion from leaves and densely overlapping fruits in real field conditions.
In practical field environments, tomatoes differ from the surrounding background mainly in their color and contour; however, only tomatoes that are near full ripeness exhibit strong color contrast, whereas their contours remain consistently distinguishable across all stages of ripeness. This makes edge information a more stable and reliable visual cue under diverse conditions. Thus, leveraging such contour features can provide valuable guidance for object detection.
Motivated by this observation, we turn to recent advances in deep learning-based edge detection, which have shown promising results across diverse domains. Models such as HED [10] and RCF [11] utilize multi-scale feature fusion to significantly improve edge representation. Building upon this, DexiNed [12] and TEED [29] further promote cross-domain generalization and lightweight design. Owing to their strong generalization capabilities, these models can generate precise structural edge maps without requiring fine-tuning on agricultural datasets, making them suitable as auxiliary modules for agricultural object detection tasks. Incorporating such edge information into detection frameworks enhances the model’s perception of object boundaries, leading to improved performance in challenging scenarios involving occlusion, clutter, or lighting variation.
Inspired by the above research, this paper proposes a novel tomato ripeness detection method called Edge-Guided DETR (EG-DETR). To address the negative impact of complex foliage occlusion on detection performance, EG-DETR incorporates multi-scale edge features generated by an edge detection network into the transformer decoder. This guides the query process to focus more on foreground regions, thereby enhancing detection accuracy under occluded conditions. In addition, we alleviate redundant predictions caused by clustered fruit growth by designing a redundant box suppression strategy that filters highly similar queries, resulting in improved training efficiency and detection accuracy. EG-DETR also handles illumination variations effectively, demonstrating strong adaptability and significantly improved detection performance on sensor-captured images of complex field environments.
To evaluate the practical applicability of our method in real-world agricultural scenarios, we conducted experiments on a multimodal tomato dataset collected by intelligent sensors deployed in open-field settings [13]. The images in this dataset cover a wide range of typical lighting conditions, including natural light, artificial light, low light, and sodium yellow light, closely simulating the visual challenges caused by changing illumination in actual farming environments. Our experimental results show that EG-DETR achieves excellent ripeness detection performance under various lighting and occlusion conditions, demonstrating its practicality and reliability in intelligent sensing systems for complex agricultural environments.
The main contributions of this paper are as follows:
We propose EG-DETR, a tomato ripeness detection method designed for complex environments that achieves high-precision detection even under severe fruit overlap and leaf occlusion conditions. EG-DETR also demonstrates outstanding performance under various lighting conditions, effectively addressing detection challenges caused by illumination variations in sensor-captured images.
We introduce a novel approach that uses edge information to provide guidance to the queries in the DETR decoder, which effectively enhances its foreground focus capability during feature fusion while reducing background interference. Additionally, a redundant box suppression strategy alleviates query redundancy caused by overlapping fruits, further improving detection performance.
2. Related Work
Object detection technology has undergone significant evolution, transitioning from traditional two-stage detectors to single-stage methods and, more recently, to transformer-based detection frameworks.
CNN-based object detection has gradually evolved from two-stage to one-stage architectures. Faster R-CNN [14] is a representative two-stage detection model. It first uses a convolutional neural network to extract global feature maps from the entire image, then applies ROI pooling to classify and regress candidate regions. This approach strikes a good balance between speed and accuracy and performs well across various standard detection tasks. However, Faster R-CNN shows limitations in multi-scale feature representation and sensitivity to overlapping objects. Its detection performance declines significantly in complex scenarios with severe occlusion and densely packed objects [15].
In contrast, the You Only Look Once (YOLO) series [16,17,18,19] exemplifies the paradigm of one-stage object detectors. YOLO formulates object detection as a regression problem, directly predicting bounding box locations and class probabilities in a single forward pass through the image. This greatly enhances inference speed and makes YOLO highly suitable for real-time applications. Nevertheless, YOLO is more prone to missing small and overlapping objects in dense scenes [20].
Despite the progress made by both paradigms, these CNN-based detectors still rely heavily on hand-designed components such as anchor boxes and Non-Maximum Suppression (NMS), which limits their flexibility and adaptability.
To address these limitations, DEtection TRansformer (DETR) [5] was introduced as a novel end-to-end object detection framework built on the transformer architecture [6]. By leveraging attention mechanisms, DETR directly learns global context and object relationships, eliminating the need for hand-crafted components. Following its introduction, a significant amount of research has focused on improving the performance of DETR-based methods from various perspectives. These approaches have optimized the DETR architecture in different ways, such as enhancing the execution process [21,22], redesigning query representations [8,23], reformulating the model [24,25], or incorporating prior knowledge [23]. Deformable-DETR [7] introduces a multi-scale deformable attention mechanism that selectively focuses on a few key sampled points within the reference bounding box, leading to improved computational efficiency. DN-DETR [9] employs denoising training to reduce the difficulty of bipartite graph matching. DINO [26] applies contrastive denoising training and uses a mixed query selection method for anchor initialization.
Building on the above methods, researchers have further focused on improving DETR-based models for small object detection and complex scenes. DQ-DETR [22] dynamically adjusts the number of queries to address the imbalance between images and instances, thereby improving the detection accuracy of small objects. Salience-DETR [27] introduces a hierarchical salience filtering mechanism and a novel scale-invariant salience supervision strategy, which effectively alleviates scale bias and enhances the model’s ability to focus on targets. Relation-DETR [28] incorporates a positional relation embedding module that encodes the spatial layout between objects, gradually guiding the decoder to learn more accurate positional relationships.
While detectors based on CNNs and transformers have advanced the field of agricultural object detection, they still face considerable limitations in complex real-world scenarios. CNN-based models such as YOLO variants exhibit high inference efficiency but often fail to capture subtle object boundaries, particularly in cases involving small, overlapping, or heavily occluded fruits. Transformer-based methods such as DETR currently exhibit insufficient fine-grained perception, making it difficult to distinguish densely packed tomatoes under varying lighting conditions and foliage interference. Moreover, most existing models rely primarily on color and texture features, which are insufficient in field conditions where illumination is inconsistent and tomato ripeness stages are visually similar. Despite their potential to provide localization cues under occlusion and clutter, the stable structural contours of tomatoes across ripeness stages are often overlooked.
By contrast, we propose EG-DETR, a novel transformer-based framework that integrates multi-scale edge information into the decoder to enhance boundary perception and foreground focus. Additionally, we design a redundant box suppression strategy to reduce overlapping queries in dense fruit regions. These contributions aim to improve detection accuracy under challenging real-world agricultural conditions.
3. EG-DETR
3.1. Overview of EG-DETR
Figure 1 presents the overall architecture of the proposed EG-DETR framework, which extends the DETR structure with targeted enhancements for challenging agricultural environments. The model consists of four main components: a visual backbone, a transformer encoder–decoder, an edge feature extraction module, and a redundant box suppression module.
The visual backbone leverages a ResNet network to extract multi-scale semantic features from input images. These features are passed into a transformer encoder to capture global context. Following this, a fixed set of object queries is generated from learnable parameters. Object queries are then refined through a transformer decoder, which applies self-attention and cross-attention mechanisms to integrate image features and progressively improve detection precision.
To enhance the model’s perception of object boundaries and foreground focus, we integrate an edge feature extraction module, which uses a pretrained lightweight edge detection network to extract edge features. These features are introduced into the decoder as edge prior information, guiding the queries toward object contours and foreground. This design improves boundary localization, especially under occlusion from dense foliage.
Lastly, to address the issue of redundant predictions in clustered scenes, a redundant box suppression module is applied during the query selection stage. This module filters out highly overlapping queries, reducing duplication and enhancing overall detection efficiency.
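To make the interaction between these components concrete, the following minimal PyTorch-style sketch outlines the data flow described above. All module names, constructor arguments, and call signatures are illustrative assumptions, not our exact implementation:

```python
import torch
from torch import nn

class EGDETR(nn.Module):
    """Illustrative skeleton of the EG-DETR pipeline; only the data flow
    follows the description above, all submodules are assumed given."""
    def __init__(self, backbone, encoder, decoder, edge_net,
                 num_queries=300, d_model=256):
        super().__init__()
        self.backbone = backbone    # ResNet multi-scale feature extractor
        self.encoder = encoder      # transformer encoder
        self.decoder = decoder      # edge-guided decoder with box suppression
        self.edge_net = edge_net    # frozen pretrained edge detector (e.g., TEED)
        self.query_embed = nn.Embedding(num_queries, d_model)  # learnable queries

    def forward(self, images):
        feats = self.backbone(images)      # multi-scale image features
        memory = self.encoder(feats)       # global context
        with torch.no_grad():              # edge prior needs no fine-tuning
            edges = self.edge_net(images)
        queries = self.query_embed.weight  # fixed set of object queries
        # decoder fuses image memory with edge features via cross-attention
        boxes, logits = self.decoder(queries, memory, edges)
        return boxes, logits
```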
3.2. Extraction of Edge Features
For edge information extraction, we use the pretrained Tiny and Efficient Edge Detector (TEED) [29], which is specifically designed to offer high accuracy while maintaining low computational complexity. With only 58K parameters, TEED is lightweight and efficient, significantly reducing the demand for computational resources.
Figure 2 presents the architecture of TEED, which consists of three core components designed to effectively extract and integrate edge features: edge feature extraction blocks, USNet, and the Dfuse fusion module. The edge feature extraction blocks comprise stacked convolutional layers with nonlinear activations and are enhanced by skip connections to improve feature propagation and preserve spatial structure. Building on these features, USNet employs a convolutional layer followed by activation and a deconvolutional layer to generate edge maps that maintain high spatial resolution. Finally, the Dfuse fusion module processes the edge predictions generated by USNet through two Depth-Wise Convolution (DWConv) layers, producing the final edge output.
Compared to traditional edge extraction methods, pretrained models offer stronger generalization capabilities, allowing them to adapt effectively to agricultural images and achieve high-quality edge extraction without requiring additional training or fine-tuning on specific datasets. The edge information extracted by the model effectively suppresses background noise and highlights object contours, thereby providing more discriminative cues for the subsequent query update process.
For an input image $I$, we first feed it into the edge extraction backbone to obtain the edge map $E \in \mathbb{R}^{1 \times H \times W}$. Considering the input dimension requirements of the backbone network in the DETR method, we expand the number of channels of $E$ from 1 to 3 through a convolution layer to obtain $E'$, matching the input shape of the feature extraction backbone used by DETR. Subsequently, $E'$ is fed into the backbone network to extract multi-scale edge features $\{F_E^l\}$. The original image is likewise passed through the backbone to obtain the image feature maps $\{F_I^l\}$, with $F_E^l$ and $F_I^l$ having the same shape at each scale $l$.
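The sketch below illustrates this pathway, assuming `edge_net` and `backbone` are callable PyTorch modules; the 1×1 convolution used for channel expansion is one plausible realization of the unspecified convolution layer:

```python
import torch
from torch import nn

# 1x1 conv expands the single-channel edge map to 3 channels so the same
# ResNet backbone used by DETR can consume it (hypothetical layer choice).
expand = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=1)

def extract_edge_features(image, edge_net, backbone):
    e = edge_net(image)    # E:  (B, 1, H, W) edge map from pretrained TEED
    e3 = expand(e)         # E': (B, 3, H, W), matches backbone input shape
    f_edge = backbone(e3)  # multi-scale edge features {F_E^l}
    f_img = backbone(image)  # multi-scale image features {F_I^l}
    return f_edge, f_img   # same shape at each scale l
```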
3.3. Redundant Box Suppression
In the two-stage DETR framework, the model generates a fixed number of object queries based on learnable parameters. These queries are then iteratively updated in the decoder, focusing on specific regions of interest in the image. However, in scenarios involving complex backgrounds or small objects, noisy or ambiguous feature representations may interfere with the attention learning process. The resulting redundant queries negatively affect the accuracy of subsequent target localization and classification, ultimately reducing detection performance. To address this issue, we adopt a redundant box suppression strategy inspired by the approach in Salience-DETR [27]. This strategy first performs an initial selection of queries, then applies NMS [14] to remove redundant queries with high overlap. Each selected query located at position $(x_i, y_i)$ on the feature map is assigned a bounding box with a fixed center at $(x_i, y_i)$ and a width and height of 2, as shown in Equation (1):

$$b_i = (x_i,\; y_i,\; 2,\; 2). \tag{1}$$
NMS is then applied to these bounding boxes at both the image level and hierarchical levels. To further optimize this process, we incorporate a dynamic IoU thresholding strategy that is scale-aware. Rather than using a fixed IoU threshold across all levels, we define thresholds based on the area of each feature map.
Specifically, feature maps with an area smaller than 512 are classified as small-scale, those with areas between 512 and 2048 are considered medium-scale, and those exceeding 2048 are treated as large-scale. Correspondingly, the IoU thresholds are set to 0.20, 0.30, and 0.35 for the feature maps with small, medium, and large scales, respectively.
This division is motivated by the dense and heavily overlapping nature of tomato fruits in real-world cultivation scenarios. When dealing with lower-resolution feature maps, a single region may attract multiple queries, resulting in a large number of overlapping candidate boxes. Applying a looser IoU threshold in such cases facilitates more effective suppression of redundant boxes. In contrast, a stricter threshold helps preserve valid predictions for higher-resolution feature maps where objects are more spatially separated.
Although the thresholds are manually defined, this scale-aware strategy offers practical effectiveness in dense object detection tasks without adding computational overhead.
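A minimal sketch of this scale-aware suppression is given below, assuming the feature map area is measured as the number of spatial positions ($H \times W$) and using the standard NMS from `torchvision`:

```python
import torch
from torchvision.ops import nms

def scale_aware_iou_threshold(area: int) -> float:
    # Thresholds from Section 3.3: looser on low-resolution feature maps,
    # stricter where objects are more spatially separated.
    if area < 512:
        return 0.20        # small-scale feature map
    elif area <= 2048:
        return 0.30        # medium-scale feature map
    return 0.35            # large-scale feature map

def suppress_redundant_queries(boxes, scores, feat_h, feat_w):
    """boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,).
    Returns indices of the queries kept after level-wise NMS."""
    thr = scale_aware_iou_threshold(feat_h * feat_w)
    return nms(boxes, scores, iou_threshold=thr)
```

For example, a level whose feature map is 32 × 24 (area 768) would be treated as medium-scale and filtered with the 0.30 threshold.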
3.4. Application of Edge Features in Decoder
In the decoder, we follow the approach of Relation-DETR [28] by incorporating a position relation encoder to help the queries learn the spatial relationships between objects. Specifically, Relation-DETR encodes normalized relative geometry features between predicted boxes, represented as $r_{ij}$. Given a set of predicted boxes $\{b_i = (x_i, y_i, w_i, h_i)\}$, the relation matrix $R$ is formed by computing $r_{ij}$ for each box pair, as described in Equation (2):

$$r_{ij} = \left( \log\!\left(\frac{|x_i - x_j|}{w_i} + 1\right),\; \log\!\left(\frac{|y_i - y_j|}{h_i} + 1\right),\; \log\frac{w_j}{w_i},\; \log\frac{h_j}{h_i} \right). \tag{2}$$

According to Equation (3), these four-dimensional features are transformed into high-dimensional embeddings of shape $N \times N \times 4d$ using a sine–cosine positional encoding, where $s$, $T$, and $d$ are parameters denoting the scaling factor, the temperature, and the embedding dimension, respectively:

$$\mathrm{PE}(r_{ij})_{2t} = \sin\!\left(\frac{s\, r_{ij}}{T^{2t/d}}\right), \qquad \mathrm{PE}(r_{ij})_{2t+1} = \cos\!\left(\frac{s\, r_{ij}}{T^{2t/d}}\right), \quad t = 0, \ldots, \tfrac{d}{2}-1. \tag{3}$$

The obtained embeddings are then processed through a linear transformation to obtain the relation embedding $R^l$. The relational embedding $R^l$, which captures the relationships between bounding boxes from the $(l-1)$-th and $l$-th decoder layers, is integrated into the self-attention mechanism based on Equation (4) to refine the query representations:

$$\mathrm{SelfAttn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + R^l\right) V. \tag{4}$$
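The following sketch implements the pairwise geometry features and their sine–cosine embedding as reconstructed in Equations (2) and (3); the exact feature definitions in Relation-DETR’s reference implementation may differ slightly:

```python
import torch

def box_relation_features(boxes):
    """Four-dimensional relative geometry features r_ij for all box pairs
    (Equation (2)); boxes are (N, 4) in (cx, cy, w, h) format."""
    cx, cy, w, h = boxes.unbind(-1)
    dx = torch.log(torch.abs(cx[:, None] - cx[None, :]) / w[:, None] + 1.0)
    dy = torch.log(torch.abs(cy[:, None] - cy[None, :]) / h[:, None] + 1.0)
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)           # (N, N, 4)

def sine_cosine_embed(r, d=16, s=100.0, T=10000.0):
    """Sine-cosine encoding of Equation (3): s is the scaling factor,
    T the temperature, d the per-feature embedding dimension."""
    t = torch.arange(d // 2, dtype=torch.float32)
    freq = s / (T ** (2 * t / d))                          # (d/2,)
    angles = r[..., None] * freq                           # (N, N, 4, d/2)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (N, N, 4, d)
    return emb.flatten(-2)                                 # (N, N, 4*d)
```

A linear layer then maps each pair’s $4d$-dimensional embedding to a per-head scalar bias that is added to the self-attention logits, as in Equation (4).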
To improve the focus of queries and enhance detection performance, we incorporate the additional edge information $F_E$ into the decoder. This supplementary edge context allows the decoder to better focus on the foreground regions of interest in the image. Specifically, as illustrated in Figure 3, $F_E$ provides additional information during the cross-attention mechanism of the decoder, which facilitates more accurate localization of objects.
To begin with, the edge feature map $F_E$ undergoes further feature extraction through a Feed-Forward Network (FFN) layer. This FFN layer consists of two linear layers and an activation function, which helps refine the edge features for subsequent processing. The enhanced edge feature map is denoted as $\hat{F}_E$ and is defined in Equation (5):

$$\hat{F}_E = \mathrm{FFN}(F_E) = W_2\, \sigma(W_1 F_E + b_1) + b_2. \tag{5}$$
The $l$-th layer of the decoder performs synchronous coordinate sampling on $\hat{F}_E$ based on the sampling position set $P^l$ learned from the image feature map $F_I$. This ensures spatial consistency between the image features and the edge features. The decoder then applies cross-attention to both the image feature map $F_I$ and the edge-enhanced feature map $\hat{F}_E$ using the deformable attention mechanism. The cross-attention mechanisms are defined in Equation (6):

$$A_{\mathrm{img}} = \mathrm{DeformAttn}(q,\, P^l,\, F_I), \qquad A_{\mathrm{edge}} = \mathrm{DeformAttn}(q,\, P^l,\, \hat{F}_E), \tag{6}$$

where $A_{\mathrm{img}}$ is the cross-attention between the query and the image feature map $F_I$, and $A_{\mathrm{edge}}$ represents the cross-attention between the query and the edge-enhanced feature map $\hat{F}_E$.
To effectively combine the information from the image and edge features, we introduce a learnable parameter $\alpha$ to control the fusion process. The final updated query $q'$ is obtained by a weighted fusion of the two cross-attention outputs, followed by Layer Normalization (LayerNorm) and another FFN. During training, $\alpha$ is adaptively optimized to balance the contributions of the two modalities. This fusion mechanism is formulated in Equation (7):

$$q' = \mathrm{FFN}\big(\mathrm{LayerNorm}\big(\alpha\, A_{\mathrm{edge}} + (1 - \alpha)\, A_{\mathrm{img}}\big)\big). \tag{7}$$
This hybrid query update process ensures that information from both the image and its edges is utilized, allowing the model to achieve better object detection performance, especially under challenging scene conditions such as occlusion or clutter.
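The sketch below summarizes the edge-guided query update of Equations (5)–(7), assuming `deform_attn` is a deformable attention module with the call signature shown; layer sizes are illustrative:

```python
import torch
from torch import nn

class EdgeGuidedFusion(nn.Module):
    """Sketch of the query update in Equations (5)-(7): edge features are
    refined by an FFN, both modalities are attended to with shared sampling
    positions, and a learnable weight alpha balances the two outputs."""
    def __init__(self, d_model=256, d_ffn=1024):
        super().__init__()
        self.edge_ffn = nn.Sequential(                 # Equation (5)
            nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model))
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learnable fusion weight
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model))

    def forward(self, q, ref_points, f_img, f_edge, deform_attn):
        f_edge = self.edge_ffn(f_edge)                  # refined edge features
        a_img = deform_attn(q, ref_points, f_img)       # Equation (6)
        a_edge = deform_attn(q, ref_points, f_edge)     # shared sampling positions
        fused = self.alpha * a_edge + (1 - self.alpha) * a_img
        return self.ffn(self.norm(fused))               # Equation (7)
```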
5. Discussion
The proposed EG-DETR model effectively improves detection performance under complex conditions by incorporating edge information, which is particularly helpful in cases of occlusion and fruit clustering. Additionally, our proposed redundant box suppression strategy reduces query overlaps during detection, further enhancing training efficiency and stability. Experimental results on a multimodal tomato dataset and several public benchmarks validate the model’s generalization and adaptability.
While these results confirm the effectiveness of our method, it is equally important to consider the practical constraints of real-world agricultural applications. In recent years, achieving high detection accuracy under limited computational resources has become an increasingly important research focus in agricultural intelligent sensing. This trend is largely driven by the demand for models that can be deployed on edge computing devices in resource-constrained environments. Consequently, many recent studies have proposed lightweight architectures aimed at reducing model complexity while maintaining competitive detection performance [38,39,40].
Despite the promising performance of EG-DETR, certain limitations remain that restrict its immediate application in these scenarios. First, the relatively high computational overhead of EG-DETR restricts its deployment on edge devices with limited resources; thus, it is currently only suitable for agricultural applications with relatively relaxed latency requirements, such as crop monitoring systems. Moreover, the diversity and coverage of the training data are not yet comprehensive enough to guarantee robust generalization across diverse environmental conditions.
To address these challenges, future work will focus on two main directions. First, we intend to apply model compression techniques such as pruning and quantization to reduce computational requirements, thereby enabling efficient deployment on edge devices. Second, we intend to explore multimodal data fusion methods by integrating information from spectral, thermal, or other sensors in order to improve the adaptability of the model across diverse agricultural environments.
6. Conclusions
This paper presents a novel intelligent sensing method named EG-DETR intended for tomato ripeness detection. By introducing edge information into the DETR framework, EG-DETR guides queries in the decoder to focus more effectively on foreground regions, resulting in enhanced detection performance under challenging conditions such as occlusion and fruit clustering. EG-DETR further employs a redundant box suppression strategy to reduce query overlap, improving both training efficiency and detection stability.
We evaluated EG-DETR on a multimodal tomato dataset to assess its effectiveness in real agricultural scenarios. It achieved 83.7% AP, surpassing existing methods and demonstrating strong ripeness recognition capability for automated harvesting. The model also exhibited good generalization ability, with AP scores of 51.8% on COCO2017, 32.3% on VisDrone, and 48.4% on NEU-DET. Ablation studies confirmed the effectiveness of our method in handling occlusion and overlapping fruits.
EG-DETR shows great potential for applications in complex and dynamic agricultural environments. It remains stable across various lighting conditions and is able to adapt to image variations captured by sensors in open-field environments. This adaptability enables reliable detection across different times of day, weather conditions, and sensor modalities, making EG-DETR a promising solution for intelligent sensing in open-field environments.