1. Introduction
The introduction of artificial intelligence in agriculture has revolutionized the way problems are solved. Owing to its ability to learn hierarchical representations of data, deep learning (DL) is the most suitable technique for developing computer vision systems with high performance and reliability, even on data very different from those used in training [1]. Detection and segmentation are fundamental tools in many vision systems [2]. An object detector identifies and categorizes specific regions of pixels within the image, assigning a class label to each object and drawing a box that encloses it. Segmentation classifies each image pixel (considering features such as color, texture, and shape) and groups the pixels according to their class.
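To make the two output formats concrete, the following minimal Python sketch contrasts them (the shapes, values, and class names are our own illustrative assumptions, not from any dataset used in this work):

```python
import numpy as np

# Object detection output: one (class, box) pair per object.
# Boxes are [x_min, y_min, x_max, y_max] in pixel coordinates.
detections = [
    {"class": "leaf", "box": [34, 50, 180, 210], "score": 0.91},
    {"class": "leaf", "box": [200, 40, 330, 175], "score": 0.87},
]

# Semantic segmentation output: a class index for every pixel
# of a 480x640 image (0 = background, 1 = leaf).
seg_mask = np.zeros((480, 640), dtype=np.uint8)
seg_mask[50:210, 34:180] = 1  # pixels belonging to a leaf region

# Instance segmentation additionally separates objects of the same
# class, yielding one binary mask per individual leaf.
instance_masks = [seg_mask == 1]
```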
Detecting and counting leaves in RGB images is essential in precision agriculture (PA) because it indicates the growth rate, health status, and crop yield of plants. Automated detection and counting reduce the need for manual inspections, which are costly and time-consuming in extensive agricultural areas. We aimed to detect tomato (Solanum sp.) plant leaves in RGB images captured in uncontrolled conditions. Tomato plants have compound leaves that vary in shape and size depending on the genetic characteristics of the variety and the growing conditions. Each leaf has seven to nine leaflets, measuring 4 to 60 mm in length and 3 to 40 mm in width. The leaves are generally green, glandular pubescent on the upper surface, and ashen in appearance on the underside [3].
Mexico is the world’s leading supplier of tomatoes, holding a 25.11% share of the international market in terms of the value of global exports. In 2016, Mexican tomatoes accounted for 90.67% of imports to the United States and 65.31% to Canada, contributing 3.46% to the gross domestic product (GDP). By 2030, global demand is estimated to increase from 8.92 to 11.78 million metric tons, representing a cumulative growth of 32.10%. Meanwhile, national tomato production may increase from 3.35 to 7.56 million metric tons, yielding a cumulative increase of 125.80% [4].
Various DL models based on semantic segmentation have been used to determine leaf regions. The images used in these works typically contain a single dominant leaf. A person using a cell phone carefully captures an image of the leaf with a uniform background for easy analysis [5,6,7], or the background may be complex [8]. However, when the camera is handheld or mounted on a robot, semantic segmentation is no longer helpful, and we must resort to instance segmentation, as the images contain different numbers and types of leaves [9,10]. Nevertheless, we do not always require precise leaf segmentation and may only need the leaf’s location in the image. Object detection may be more appropriate for robotic localization and inspection, as well as for monitoring systems and mobile device applications in greenhouse and open-field growing environments.
Object detection is faster than semantic or instance segmentation, which makes it ideal for real-time applications. In addition, image labeling for object detection is more straightforward, as it involves only drawing bounding boxes around objects, with no need to annotate a mask per object. Likewise, object detection could serve as the first stage of a pipeline that segments each leaf quickly and efficiently, avoiding segmentation of the entire image and yielding a more accurate extraction of the leaf [11].
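A minimal sketch of this detect-then-segment pipeline (the callables `detect_leaves` and `segment_crop` are hypothetical placeholders for any detector and any single-leaf segmenter, not components of the method proposed here):

```python
import numpy as np

def detect_then_segment(image, detect_leaves, segment_crop):
    """Run a cheap detector first, then segment only inside each box.

    detect_leaves(image) -> list of integer [x0, y0, x1, y1] boxes.
    segment_crop(crop)   -> boolean mask with the crop's shape.
    """
    masks = []
    for x0, y0, x1, y1 in detect_leaves(image):
        crop = image[y0:y1, x0:x1]
        local_mask = segment_crop(crop)           # segment only the crop
        full_mask = np.zeros(image.shape[:2], dtype=bool)
        full_mask[y0:y1, x0:x1] = local_mask      # paste back at box location
        masks.append(full_mask)
    return masks
```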
Images of leaves captured in real-world situations can be affected by changes in camera pose, variability in illumination, and variations in the size, shape, and texture of the leaves. Images may include several overlapping leaves at different levels of focus, or they may contain other background objects such as weeds, soil, fruits, and branches.
Table 1 lists the challenges of detecting tomato leaves in images [12,13,14,15,16,17]. In each case, we show the regions of interest (red boxes) predicted by an object detector, which could be true positives, false positives, or false negatives.
This paper focuses on detecting tomato plant leaves using RGB images as a general approach, which requires locating each complete leaf with bounding boxes without emphasizing plant pathological diseases or physiological disorders (such as environmental, nutritional, or physical factors). The main contributions of this paper are as follows:
An object detection model based on Yolov11n [18] is presented, incorporating attention and transformer modules to detect leaves of tomato (Solanum sp.) plants in RGB images. We generated a model with fewer trainable parameters and FLOPs (floating-point operations), improving its performance compared with the original Yolov11n and Yolov12n [19].
A dataset of tomato plant images obtained from various sources, with the corresponding ground truth (labels for leaf detection).
A qualitative and quantitative evaluation on leaf images collected from the Internet, featuring various environments, changes in illumination, camera types and poses, and leaf conditions (including spider webs, insects, diseases, and partial occlusions).
The remainder of the paper is organized as follows: Section 2 reviews related work, Section 3 describes the proposed architecture, Section 4 presents the experimental evaluation, Section 5 provides a discussion, and Section 6 presents the conclusions.
2. Related Work
Object detection in plant images using deep learning has been employed for various purposes. For example, Miao et al. [20] detected leaf tips to count the leaves on maize and sorghum plants and to classify them as damaged or undamaged. Hayati et al. [21] detected tip-burn stress in lettuce grown indoors, enabling quick treatment. Cho et al. [22] used a ground robot to detect flowers and branching points by analyzing RGB and depth images to obtain plant growth indicators. Shu et al. [23] detected polygonal regions of corn leaf area infected by pests, with special emphasis on tiny areas. Hirahara et al. [24] estimated the distances between plant shoots in grape crops in images captured both during the day and at night. All these studies demonstrate the importance of object detection in plant development analysis, including the identification of patterns associated with growth and the early detection of diseases. In contrast to these works, our study focuses solely on detecting whole leaves.
Researchers generally categorize deep neural networks for object detection into two-stage and one-stage models. Two-stage models divide the detection task into region proposal and classification, resulting in higher accuracy at the cost of additional processing. In contrast, one-stage models perform both processes in a single step, allowing their use in real time on resource-constrained devices, albeit at the expense of accuracy. Excellent reviews of state-of-the-art object detection for agriculture in general can be found in the literature [25,26,27]. Both types of models have been used to detect leaves in images. The most relevant related papers are outlined below, organized according to these two categories.
2.1. Two-Stage Models
Among two-stage models, Zhang et al. [28] used a Faster R-CNN to detect healthy and diseased tomato leaves. They performed k-means clustering to determine the anchor sizes and tested different residual networks as the backbone. The authors took the images under laboratory conditions, with only one leaf per image. Wang et al. [29] also used a modified Faster R-CNN, with a convolutional block attention mechanism and a region-proposal network module, to detect densely overlapping sweet potato leaves. The attention mechanism enabled the extraction of highly relevant leaf features by merging spatial and cross-channel information, while the region-proposal mechanism reduced false detections, especially in cases with densely distributed leaves. Despite these improvements over the original network, the model continued to produce many false positives because some leaves were severely occluded and because the young leaves of the plant of interest were very similar in size and appearance to weeds.
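As an illustration of the anchor-selection step, a minimal sketch of clustering ground-truth box sizes with k-means (our own example using Euclidean distance on width-height pairs; Zhang et al.’s exact procedure, distance metric, and anchor count may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

# (width, height) of every ground-truth leaf box, in pixels.
# Random stand-in data; in practice these come from the training labels.
rng = np.random.default_rng(0)
box_sizes = rng.uniform(low=20, high=300, size=(500, 2))

# Cluster the box sizes; the k cluster centers become the anchor sizes.
k = 9  # a typical anchor count, assumed here for illustration
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(box_sizes)
anchors = kmeans.cluster_centers_  # shape (k, 2): anchor (width, height)
print(np.round(anchors).astype(int))
```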
2.2. One-Stage Models
Many studies design leaf detection models that extract features at multiple resolutions, since plants can be at different stages of growth. Lu et al. [30] used the CenterNet network structure and adapted CBAM (Convolutional Block Attention Module) and ASPP (Atrous Spatial Pyramid Pooling) modules to detect rosette-shaped plant leaves grown in greenhouses. Although this model can deal with plants of different ages and leaves of various sizes, the plant images lack complex backgrounds, varied camera angles, and variable lighting conditions. Xu et al. [31] proposed a vision system for estimating maize yield by counting leaves in drone images of field-grown plants. They first segmented the plant from the background using a Mask R-CNN, and then passed the segmented image to a one-stage model to detect and count the leaves. They categorized the leaves into two classes, fully unfolded leaves and newly emerged leaves, and reported poor performance on the newly emerged leaves because they occupied small regions of the image. The authors excluded cases of folded corn leaves from their analysis and used only nadir imagery. Using one model for segmentation and another for localization allowed the efficient detection of maize leaves; however, this strategy requires more computational resources in production.
The idea of integrating neural networks from the Yolo family with transformers has gained considerable interest. Li et al. [32] detected sweet potato leaves in field conditions by enhancing the Yolov5 network with two types of transformer blocks, the Vision Transformer and the Swin Transformer, which combine local and global features to improve detection accuracy. The authors captured the plants at different heights but always maintained a zenith camera angle. He et al. [33] detected maize plant leaves to compute the leaf azimuth angle, as a means of improving crop yield, from aerial top-view images. The authors also used the Yolov5 model as a basis, to which they added a Swin Transformer, successfully detecting maize leaves with oriented bounding boxes under field conditions. We selected Yolov11n as our base model because it has demonstrated better performance with fewer parameters than Yolov5. In our case, we used horizontal bounding boxes because tomato leaves do not have an elongated shape.
One-stage models are the ideal choice for applications that require real-time operation. Ning et al. [34] improved the Yolov8 model to detect and count corn leaves. The authors replaced the backbone of Yolov8 with a StarNet network and added a CAFM fusion module to combine local information obtained by convolutions with spatial information generated by global attention mechanisms. A StarBlock module was also incorporated into the neck section, and a lightweight shared convolutional detection head was used in the final section of the network. Despite its inference speed and good performance, the model was tested only with plants in indoor environments.
Currently, there is a trend toward using transformer elements to design architectures for object detection in agriculture, such as detecting buds [35], fruit ripeness [36], or entire weed plants [37]. However, we did not find any articles describing the use of this type of neural network specifically for leaf detection. The most representative transformer-based architectures are RT-DETR [38], Yolo-world [39], and Yolov12. The RT-DETR model uses a CNN as a backbone for feature extraction and a transformer-based encoder-decoder to capture global context for real-time object detection. The smallest version of RT-DETR (R18) has 20 M trainable parameters and requires 60 GFLOPs for inference. Meanwhile, Yolov11n has only 2.6 M parameters and runs at 6.3 GFLOPs, making it roughly ten times lighter than the R18 version. Yolo-world is an open-vocabulary real-time object detector that merges textual descriptions with the visual features extracted by Yolov8; it can also accommodate new classes specified as text without retraining the model. However, Yolo-world’s performance depends heavily on the quality of the prompts. Yolov12 is the latest version of the Yolo family and primarily incorporates an attention mechanism to speed up the detection process, even more so than RT-DETR models.
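The parameter and FLOP figures quoted above can be checked with the ultralytics package (a sketch assuming ultralytics is installed; the weight names below are the ones the package ships, and the smallest RT-DETR it distributes is the L variant rather than R18, so the reported numbers may differ from those cited):

```python
from ultralytics import YOLO, RTDETR

# Loading by weight name downloads the checkpoint if not cached locally.
for weights, cls in [("yolo11n.pt", YOLO), ("rtdetr-l.pt", RTDETR)]:
    model = cls(weights)
    model.info()  # prints layer count, parameters, and GFLOPs
```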
5. Discussion
The study of the state of the art revealed a trend toward improving or proposing new one-stage models over two-stage models, due to their ease of deployment on devices with limited hardware resources, as evidenced by the greater number of publications on this topic. We used Yolov11n as the base architecture and replaced different elements to compress it, in terms of both the number of parameters and the computational cost (FLOPs). The modified version extracts better features by fusing local and global spatial relationships.
The incorporation of CBAM and transformer modules into the Yolov11n architecture enables a better combination of local and global dependencies, which is of great importance for the problem at hand. CBAM gives greater relevance to the image regions belonging to leaves. The integration of modified MobileViT attention modules into the original backbone enabled the refinement of feature maps, and the MV2 block helped reduce the model size, the storage required, and the number of FLOPs. Meanwhile, replacing the original loss function (CIoU) with WIoUv3 changed the way the target location is estimated, based on a strategic selection of anchor box quality, which improves performance in detecting heavily overlapping leaves.
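For reference, the WIoUv3 formulation as we transcribe it from the original Wise-IoU paper (a sketch; consult that paper for the precise definitions). Here $\mathcal{L}_{\mathrm{IoU}} = 1 - \mathrm{IoU}$, $(x, y)$ and $(x_{gt}, y_{gt})$ are the centers of the predicted and ground-truth boxes, $W_g$ and $H_g$ are the dimensions of their smallest enclosing box, $\alpha$ and $\delta$ are focusing hyperparameters, and starred terms are detached from the gradient computation:

```latex
\mathcal{L}_{\mathrm{WIoUv1}} = \mathcal{R}_{\mathrm{WIoU}}\,\mathcal{L}_{\mathrm{IoU}},
\qquad
\mathcal{R}_{\mathrm{WIoU}} = \exp\!\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left(W_g^2 + H_g^2\right)^{*}}\right),

\beta = \frac{\mathcal{L}_{\mathrm{IoU}}^{*}}{\overline{\mathcal{L}_{\mathrm{IoU}}}},
\qquad
r = \frac{\beta}{\delta\,\alpha^{\beta - \delta}},
\qquad
\mathcal{L}_{\mathrm{WIoUv3}} = r\,\mathcal{L}_{\mathrm{WIoUv1}}.
```

The outlier degree $\beta$ compares an anchor box’s loss with its running mean $\overline{\mathcal{L}_{\mathrm{IoU}}}$; boxes of both very low and very high quality receive a reduced gradient gain $r$, which is the “strategic selection of anchor box quality” referred to above.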
The ablation study helped identify the strengths and weaknesses of each of the elements analyzed. For example, the c5 configuration in Table 6 indicates that the WIoUv3 function alone is insufficient to improve model performance. We achieved the best performance by using the three selected modules to extract better features (CBAM, MobileViT, and MV2) alongside WIoUv3. With an effective feature description, WIoUv3 enables the efficient discrimination of highly atypical or low-quality anchor boxes and, consequently, better estimation of leaf locations.
As noted in the qualitative analysis, accurately detecting tomato plant leaves in images, considering complex environments, remains a challenging problem because the physical characteristics of leaves, such as size, color, texture, and shape, vary significantly depending on the environment. Furthermore, a cluster of leaf blades can give rise to countless ambiguities due to partial occlusions and shadows that alter the visual appearance of the leaf.
For images containing leaves with diverse shapes, colors, and textures, Yolov12n has trouble correctly locating the objects of interest, generating redundant bounding boxes on the same leaf, albeit with much lower confidence. This problem occurs to a lesser extent in the proposed model and in Yolov11n.
In the specific case of dense leaf overlap, both Yolov11n and Yolov12n had difficulty correctly detecting the occluded leaves, even though the images were captured at a short distance from the plant. These models generated bounding boxes that included more than two objects, while the proposed model resolved these cases better, achieving a 60.50% mAP@50-95 (mean average precision averaged over IoU thresholds from 0.50 to 0.95). All models failed when leaves were folded near the apex, decreasing the precision metric. The models perform better when the entire leaf blade is visible in the image.
The performance of the proposed model is comparable to that of the Yolov12n, Yolov11n, Yolov10n, and Yolov8n models, but it requires much less storage space and fewer floating-point operations for inference. This allows our model to be implemented more easily on resource-limited devices, such as cell phones or robots, that can be easily transported to agricultural environments. Our model showed increases in recall of 1.2% and 3.4% compared with Yolov8n and Yolov12n, respectively. Producing fewer false negatives reduces the probability of ignoring leaves that are present in the images.
Several issues still need to be addressed to achieve better leaf detection, such as handling distant shots, differentiating leaves of interest from those of other plant species, and discarding non-natural representations of plants. Likewise, our proposed model and the others analyzed still produced high numbers of false positives due to the issues described above. In the future, we plan to modify the input of our architecture to integrate additional information, such as depth maps and synthetic or real near-infrared (NIR) imaging, similar to [16,50]. Adding a fourth channel with NIR information to the RGB images would yield a better representation in the feature maps, as artificial objects, which reflect less NIR radiation, would be immediately excluded. Similarly, data from a depth sensor would help the model learn to differentiate leaves on different planes. We could even explore combining NIR and depth information to determine whether the problem can be modeled better.
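A minimal PyTorch sketch of the fourth-channel idea (our own illustration of a standard technique for extending pretrained inputs, not the planned implementation): the first convolution is widened from 3 to 4 input channels, keeping the pretrained RGB filters and initializing the NIR channel with their mean.

```python
import torch
import torch.nn as nn

def widen_first_conv(conv: nn.Conv2d) -> nn.Conv2d:
    """Return a copy of a 3-channel `conv` that accepts RGB+NIR input."""
    new_conv = nn.Conv2d(
        in_channels=4,
        out_channels=conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight[:, :3] = conv.weight  # reuse pretrained RGB filters
        # Initialize the NIR channel as the mean of the RGB filters,
        # a common heuristic for extending pretrained inputs.
        new_conv.weight[:, 3:] = conv.weight.mean(dim=1, keepdim=True)
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Example: a stand-in 3-channel stem similar to a YOLO first layer.
stem = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
stem4 = widen_first_conv(stem)
rgbn = torch.randn(1, 4, 640, 640)  # RGB + NIR input
print(stem4(rgbn).shape)            # torch.Size([1, 16, 320, 320])
```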
6. Conclusions
Detecting tomato leaves in uncontrolled environments is a challenging task because plants are naturally subject to a variety of biotic and abiotic factors, resulting in leaves of different sizes, shapes, colors, and textures. Additionally, the leaves of plants are arranged in various ways to minimize the shade produced between them, thereby optimizing photosynthesis. This phenomenon is known as phyllotaxy, which causes partial or total occlusion of the leaves, further increasing the difficulty of correctly detecting them.
This paper presents a deep learning architecture based on Yolov11n for tomato leaf detection in images in uncontrolled environments. We propose incorporating an attention module into the original Yolov11n architecture. The proposed architecture was evaluated both qualitatively and quantitatively by creating a dataset comprising a diverse range of tomato leaf images captured under various conditions and cameras. The proposed model performed comparably to Yolov11n and Yolov12n but required fewer trainable parameters and performed fewer floating-point operations. The incorporation of attention into the Yolov11n architecture enabled it to focus on the relevant features and regions of the leaves, leading to better results in challenging and varied scenarios. We identified other challenging cases that need to be addressed to achieve better detection of plant leaves, such as distant views, mixed leaf species, and artificial leaf prints.