1. Introduction
The introduction of electronic systems and computers in agriculture has helped improve crop production and the overall agricultural experience for farmers [1,2]. An increase in crop yield has been observed partly due to the employment of smart farm input management techniques, which rely heavily on modern computational methods [3,4]. These applications range from high-throughput crop phenotyping and remote sensing to precise machinery control and field application [5,6,7]. To advance the technology further, various studies are being conducted to explore the use of machine learning and artificial intelligence in agricultural applications [8]. Artificial intelligence and many machine learning applications take advantage of deep learning methods because of their ability to learn important features in complex datasets. Furthermore, with advancements in imaging and manufacturing technologies, camera sensors have become inexpensive and lightweight in recent years. This has enabled platforms such as unoccupied aerial vehicles (UAVs) and field robots to carry camera sensors for collecting agricultural image data. These images are used in various applications including, but not limited to, plant phenotyping, disease monitoring, drought monitoring, pest control, in-row navigation, weed management, precision planting, fruit picking, crop harvesting, and crop handling [9,10,11,12,13,14,15,16,17].
Using machine learning with images further enables researchers to optimize agricultural systems and increase productivity. A computer can learn important features in an image and present them in the form of feature identification, feature detection, feature characterization, and feature segmentation. Among these applications, deep learning-based object detection methods have gained considerable traction in the last decade and have been tested in many agricultural applications [9,13,15,16,18,19,20,21,22,23,24,25,26,27,28].
It is often observed in typical object detection deep neural networks that the bigger and more complex the model, the better its ability to learn difficult features in the images. In one study in the literature, the authors used a YOLO model to detect lemon fruits in orchards [27]. The image scene was complex in nature, with the fruits occupying a color space similar to the background. The authors stated that, for a complex background environment, a smaller or less complex model performed better than a more complex or bigger model. Furthermore, agricultural datasets are often limited in the number of images or training samples available for deep learning model training. When the dataset is small, image augmentation methods are commonly advised to improve the model's ability to generalize to features that exist in the real world but are not captured by the current image dataset. While developing a new version of YOLO, the authors of [29] also found during testing that not all image augmentations contributed to improving model performance.
In the literature, many researchers have attempted to define real-world agricultural dataset scenes. Terms such as contrast and complex background were repeated multiple times to explain the aesthetic quality of the images in a dataset. However, each definition was different and addressed aspects specific to the research of that study. Hence, one goal of this research study is to develop an all-encompassing definition of real-world agricultural dataset scenes that can be used to describe the nature of images in real-world agricultural deep learning applications. A universal definition will assist in performing more focused studies and in developing easily generalized deep learning models for agricultural applications specific to low-contrast complex backgrounds. The second objective of this study is to evaluate the effects of model size, in both one-stage and two-stage detectors, on model performance for low-contrast complex background applications. The third objective is to gauge the influence of different photo-metric image augmentation methods on model performance for standard one-stage and two-stage detectors.
2. Defining Characteristics of Agricultural Image Datasets
To achieve the research goal, the literature was investigated using the keywords “agriculture”, “background”, and “deep learning”. The Web of Science Core Collection journal database was searched using these keywords on 2 November 2022, and the search was limited to research articles. The database search yielded 358 research articles mentioning “agriculture”, “background”, and “deep learning”. To further filter the pool of papers, only open-access articles were considered, resulting in 210 articles in total.
The selected articles were further screened based on the following inclusion criteria:
Articles published during and after the year 2018.
Articles referring to the subject area of “Plant Science”, “Agricultural Engineering”, “Agronomy”, “Computer Science”, “Electrical Engineering”, “Artificial Intelligence”, or “Remote Sensing”.
Articles published in the English language.
This reduced the number of articles to 126. A careful reading of the abstracts and articles allowed us to eliminate those that did not describe the environment of the image scene, used lab-generated imagery, or were not available within the selected database. Only 63 articles remained after sorting. A detailed reading of these articles allowed us to identify common words and phrases used to describe the scene in agricultural image datasets. Based on the findings, a global definition was developed. Refs. [30,31,32] influenced the method used to search the literature and develop a definition for this topic. A year-wise list of the selected articles is provided in Table 1.
2.1. Image Scene
The features of a scene in an image are described using two levels of descriptors. The first level includes features such as color and texture. The second level contains information regarding the objects present in the scene, such as a car and a house, and the interaction of these objects. For example, the scene in Figure 1 can be described as “A red car in front of a grey house on a cloudy day”. Here, the low-level descriptors are red, grey, and cloudy, conveying information about the colors and textures present in the scene. The high-level descriptors are car, house, and in front of, characterizing the objects in the scene and how they interact.
Similarly, various low-level and high-level descriptors explain the scene in agricultural datasets. Descriptions in the literature commonly used terms such as shadow, sunlight, crop type, background, color, occlusion, and other biological materials to create a textual reference for the different types of scenes present in the images of a particular agricultural dataset. In some cases, images are staged in the lab to mimic a real-world reference and decrease the data collection time (Figure 2) [94,95]. Here, the real-world descriptors are used to mimic the hue, saturation, lighting, etc., in the lab. The environment captured in real-world image data is very complex: scenes are composed of various small and large features, which makes an accurate description very important. The following section discusses the various attempts at describing the scene found in agricultural image data and an effort to find common descriptors to establish a global definition for real-world agricultural datasets.
2.2. Scene Descriptors for Agricultural Imagery
All 63 articles from the literature were analyzed, and common descriptors were selected to characterize the scene in real-world agricultural applications and the challenges the scene poses for image processing. Semantically similar descriptors were grouped into a single category, and the number of instances was counted whenever authors used these descriptors to explain a scene. The low-level descriptors included color/contrast, texture, morphology, and illumination. The high-level descriptors were occlusion/overlap, background description, and complex/complicated. The counts are provided in Figure 3. If an article described a minimal difference in color or the presence of low contrast, a +1 count was added to the Color/Contrast descriptor. Similarly, if a similarity in texture between foreground and background was mentioned or varying lighting conditions were discussed, a +1 count was added to the Texture or Illumination descriptor, respectively. The Morphology descriptor was given a +1 count if the article discussed morphological features (shape, size, etc.). In the case of high-level descriptors, if the article described the interaction of foreground and background as complicated or complex, a +1 count was placed in the Complex/Complicated descriptor. When the article reported the types of objects or noise in the scene's background, a +1 count was given to the Background description category. Finally, if objects in the foreground or background partially or fully blocked the object of interest in the particular study, a +1 count was added to the Occlusion/Overlap descriptor. The Texture descriptor was used 16 times, the Morphology descriptor 39 times, the Color/Contrast descriptor 43 times, the Illumination descriptor 47 times, the Complex/Complicated descriptor 51 times, the Background descriptor 40 times, and the Occlusion/Overlap descriptor 42 times.
Except for the Texture descriptor, each descriptor was used a substantially large number of times to explain scenes present in real-world agricultural image datasets. These common patterns in terminology provide the basis from which a more universal definition for real-world agricultural datasets can be derived.
3. Low-Contrast Complex Background as a Unified Term
Since 2018, there has been a steady increase in the number of studies conducted each year on real-world deep learning agricultural applications. Each study attempted to define the scene in the images of its agricultural dataset. Common phrases and descriptions were observed but were limited to the scene under study. There is a need for a universal definition and unified terminology to describe the scenes in real-world agricultural datasets. A universal definition will assist the scientific community in performing more focused studies of agricultural scenes and in developing deep learning models that can easily be generalized to other agricultural applications specific to low-contrast complex background situations. The increase in the number of publications each year and the common themes within these publications indicate the growing importance of developing deep learning methods for real-world agricultural applications. Deep learning methods for real-world low-contrast complex background agricultural applications should be a sub-discipline within computer vision and machine learning applications in agriculture, and a unified definition will help further solidify this research space.
The Merriam–Webster dictionary defines complex (adjective) as “hard to separate, analyze, or solve” and contrast (noun) as “the difference or degree of difference between things having similar or comparable nature”. Based on the common descriptors used, the English language, and the literature reviewed, “low-contrast complex background” can be used as a unified term and can be defined as follows:
“An image taken for an agricultural application can be said to contain low-contrast complex background if the object of interest has pixel value, saturation, and hue comparable to the entities present in the background, and/or the object of interest is either surrounded or occluded by other objects of interest, debris, biological materials, shadows, soil, and man-made materials of similar shape and sizes”.
4. Materials and Methods
4.1. Object Detection Models
The process of detecting an object using deep learning models involves two steps: the first is to locate the objects of interest in the image and the second is to classify those objects into different classes [96]. Deep convolutional neural networks are known for their ability to extract features from images; hence, the current standard object detection architectures are built upon deep convolutional neural networks [97,98]. Object detection architectures are divided into two categories. The first is two-stage object detectors, in which the tasks of object localization and object classification are divided between two different networks within the architecture. The main advantage of two-stage detectors is high detection accuracy, and their major drawback is slow detection speed. Some two-stage object detectors are RCNN [99], SPPNet [100], Fast RCNN [101], Faster RCNN [102], and Mask RCNN [103]. The other category is one-stage object detectors, which do not separate the tasks of object localization and classification but perform both directly through a single network. The main advantage of one-stage (single-stage) detectors is high detection speed, and their main limitation is accuracy that is comparatively lower than that of two-stage detectors. Some common one-stage detectors are the YOLO series [29,104,105,106], SSD [107], and RetinaNet [108].
4.1.1. Two-Stage Detectors
RCNN [99] used a deep convolutional neural network (DCNN) to extract image features and an SVM to classify regions. SPPNet [100] also used a DCNN to extract image features and mapped region proposals onto the feature maps; in addition, spatial pyramid pooling allowed multi-scale images to be input to the DCNN. Fast-RCNN [101] uses a DCNN to extract image features and maps down-scaled region proposals onto the feature maps using a Region-Of-Interest (ROI) pooling layer. Faster-RCNN [102] introduces a Region Proposal Network (RPN) to generate region proposals, with the RPN sharing feature maps with the backbone network. Mask-RCNN [103] uses an ROI Align layer instead of an ROI pooling layer, improving detection accuracy, and jointly trains object detection and segmentation to further improve detection accuracy. The architectures of the selected networks are shown in Figure 4.
4.1.2. One-Stage Detectors
YOLOv1 [104] is an end-to-end single neural network that performs class probability prediction and bounding-box regression directly from a full image. YOLOv2 [105] introduced a new backbone network (DarkNet19) and used the k-means clustering algorithm to generate anchor boxes. YOLOv3 [106] used multi-level feature fusion to improve the accuracy of multi-scale detections and introduced a new backbone network (DarkNet53). YOLOv4 [29] experimented with different combinations of model features, such as Weighted-Residual-Connections (WRCs), Cross mini-Batch Normalization (CmBN), and Self-Adversarial Training, to achieve state-of-the-art results. It also introduced mosaic data augmentation as a new image augmentation method. SSD [107] presented a multi-layer detection mechanism with multi-scale anchors at different neural layers. RetinaNet [108] used a feature pyramid network to extract features and developed a new focal loss for model training. The architectures of some of these networks are shown in Figure 5 and Figure 6.
4.1.3. Backbone Networks
Backbone networks for object detection are the networks used for feature extraction in state-of-the-art models. Their main objective is to identify important features in images and learn their characteristics, such as color, shape, and texture. As applications grow more complex, model performance must improve, which typically means deeper and more complex network architectures. In other situations, where limited memory and computing power are available, lightweight networks are proposed that simplify the architecture without sacrificing feature extraction capacity. Some of the backbone networks are discussed below.
VGGNet [109] added more layers than AlexNet [97], increasing the network size to 16–19 layers, which improved the feature extraction capabilities of the network. VGG16 and VGG19 are widely known model architectures. ResNet, or Residual Network, [110] addressed the gradient dispersion and gradient explosion problems that arise when a network is deepened by adding more layers. The authors achieved this by adding a residual connection from the input of a layer (or stack of layers) to its output before it is fed to the next layer. ResNet50 through ResNet152 are widely used as backbone networks. DetNet [111] was proposed as a backbone for object detection to address some shortcomings of generic backbone networks. DetNet used dilated convolutions instead of down-sampling in some of the last layers to preserve resolution while enlarging the receptive field, which helps in locating both large and small objects.
The backbones discussed above increase the depth of the network and are therefore complex. However, some methods were also developed to reduce the number of parameters in the network. This was done to address storage space and processing time constraints so that networks can be deployed on mobile devices. InceptionV1/V2/V3 [112,113,114] used kernel decomposition to create lightweight models. Xception improved over InceptionV3 with depth-wise separable convolutions [115]. MobileNet [116] improved on the structure of the depth-wise separable method. MobileNetV2 [117] used short-cut connections like ResNet [110] along with depth-wise separable convolutions to improve performance.
In this study, Faster-RCNN and RetinaNet were used for comparing networks, as both are available in the TensorFlow Object Detection API with pre-trained ImageNet weights and both can use ResNet50, ResNet101, and ResNet152 as backbone networks.
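As a rough illustration of how the backbone can be varied in this setup, the sketch below edits the feature extractor type in a TensorFlow Object Detection API pipeline configuration. It is an assumption about tooling details rather than the study's actual code; the file name pipeline.config, the output directory, and the exact feature extractor names are placeholders based on TF2 model zoo conventions.

```python
# Hedged sketch (not the study's code): varying the ResNet backbone of a
# Faster-RCNN pipeline in the TensorFlow Object Detection API.
from object_detection.utils import config_util

configs = config_util.get_configs_from_pipeline_file("pipeline.config")

# Faster-RCNN backbones are selected by feature extractor name, e.g.
# "faster_rcnn_resnet50_keras", "faster_rcnn_resnet101_keras", or
# "faster_rcnn_resnet152_keras"; RetinaNet uses the analogous
# "ssd_resnet{50,101,152}_v1_fpn_keras" extractors.
configs["model"].faster_rcnn.feature_extractor.type = "faster_rcnn_resnet101_keras"

pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, "configs/faster_rcnn_resnet101/")
```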
4.2. Global Wheat Dataset
The Global Wheat Dataset was used as the reference dataset for this study [118]. The Global Wheat Dataset is composed of more than 6000 images of 1024 × 1024 pixels containing 300k+ unique wheat heads with corresponding bounding boxes. The dataset includes images from 11 different countries and 44 unique measurement sessions. The dataset is a collection of images taken between 2016 and 2019 by nine different institutes and contains genotypes from North America, Asia, Europe, and Australia. The row spacing varied from 12.5 cm to 30.5 cm. The dataset covers varying sowing densities (186 to 450 seeds/m²), soil types (loamy soil, silt-clay, etc.), and stages of growth (flowering to ripening). Cameras with different fields of view (7.1° to 45.5°) and focal lengths (7.7 mm to 60 mm) were employed to collect the imagery, with the ground sampling distance varying from 0.2 mm/pixel to 0.56 mm/pixel. The dataset was chosen because it contains images with different lighting conditions and a range of hues, with a complex background including occlusion of wheat heads. Selected example images from the Global Wheat Dataset are shown in Figure 7.
4.3. System and Training Specifications
In this study, the experiments were conducted using the TensorFlow deep learning framework (version 2.8.0) from Google. Python version 3.8.9 was used to develop the object detection experimental setup. The different neural networks were trained on a machine running the Windows 10 Professional operating system with an Intel i7-11700K 3.6 GHz CPU, 128 GB of RAM, and an NVIDIA RTX 3090 GPU with 24 GB of VRAM.
The Global Wheat Dataset was divided into three parts: 3000+ images in the training set, 1400+ in the validation set, and 1700+ in the testing set. Each model was trained for 80,000 steps.
4.4. Image Augmentations
Image augmentations are used in deep learning models to address the issue of overfitting during training. Furthermore, image augmentations help when large datasets cannot be collected and labeled for training a deep learning model. Image augmentations are also used to resolve problems associated with class imbalance [119]. Image augmentations either warp the data or carry out oversampling to artificially inflate the training dataset. Augmentations dealing with oversampling create synthetic instances and add them to the training set, whereas data warping augmentations transform existing images while preserving their labels [120]. Data warping image augmentations can be further classified as photo-metric augmentations and geometric augmentations. Photo-metric augmentations alter the pixel values of the features in the image, including brightness, contrast, hue, etc., whereas geometric augmentations change the layout of the image, such as by cropping, flipping, rotating, etc.
Some of the image augmentation methods are as follows:
Flipping: In this augmentation, the images are flipped by either the horizontal or vertical axis. It is one of the easiest to implement.
Color Space: An image consists of a 3D matrix with dimensions of height, width, and color channels, where each pixel location holds a value in each of the three color channels. These pixel values or channels can be manipulated to perform image augmentations. One method is the isolation of a single color channel. Randomly changing the brightness, contrast, or saturation is another way to alter the color space.
Cropping: Randomly cropping images can be used as a method of image augmentation. A patch of varying height and width can be cropped from anywhere in the image.
Rotation: In this augmentation, the image is rotated either to the left or right between 1° and 359°. This augmentation helps train the model on objects appearing in different orientations.
Noise Injection: Noise injection involves adding a matrix of random values, drawn from a Gaussian distribution, to the image. Adding noise to images can help neural networks learn more robust features.
Color Space Transformations: One way to perform color space augmentation is to loop through the images and decrease or increase the pixel values by a constant value. Another quick color space manipulation is to splice out individual RGB color matrices. A further transformation consists of restricting pixel values to a certain min or max value. Also, individual color channels can be distorted to create various color gradient images.
Kernel Filters: Sharpening and blurring images are very widely known kernel filter augmentations. These filters work by sliding an n × n matrix across an image with either a Gaussian blur filter, which will result in a blurrier image, or a high-contrast vertical or horizontal edge filter, which will result in a sharper image along the edges.
Mixing Images: Mixing images together by averaging their pixel values is another approach to data augmentation.
Random Erasing: Random erasing augmentation works by randomly selecting an n × m patch of an image and masking it with either minimum, maximum, mean, or random pixel values. This technique was specifically designed to combat image recognition challenges due to occlusion.
The authors of [119] demonstrated that not all of the augmentations used in their study, including both geometric and photo-metric augmentations, contributed equally to improving a convolutional neural network's performance. Ref. [29] reported similar results, focused mostly on using geometric augmentations to improve model efficacy. Therefore, in this study, five photo-metric augmentations that transform the color space were tested to quantify their effects on object detection model performance in low-contrast complex background applications. The augmentations were random brightness, random contrast, random saturation, random distort color, and random Red-Green-Blue (RGB) to gray-scale conversion. Random brightness was applied by randomly selecting a brightness value from a uniform distribution; random contrast was applied by randomly selecting a contrast value from a uniform distribution; random saturation was applied by randomly selecting a saturation value from a uniform distribution; random distort color was applied by randomly selecting a color channel and randomly changing its pixel values; and random RGB to gray-scale was applied by randomly selecting an image and converting it into a gray-scale image. Visual examples of these augmentations on the Global Wheat Dataset are shown in Figure 8.
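For reference, a minimal sketch of these five photo-metric augmentations using tf.image operations is given below. The delta and factor ranges are illustrative placeholders, not the values used in this study, and the hue perturbation and gray-scale conversion probability are assumptions standing in for the distort-color and RGB-to-gray steps.

```python
# Minimal sketch of the five photo-metric augmentations with tf.image;
# parameter ranges are placeholders, not the study's settings.
import tensorflow as tf

def photometric_augment(image: tf.Tensor) -> tf.Tensor:
    """Randomly perturb brightness, contrast, saturation, and color of an RGB image."""
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    image = tf.image.random_hue(image, max_delta=0.05)  # stand-in for "distort color"
    # Random RGB to gray-scale: convert with an assumed 10% probability,
    # keeping three channels so the model input shape is unchanged.
    image = tf.cond(
        tf.random.uniform([]) < 0.1,
        lambda: tf.image.grayscale_to_rgb(tf.image.rgb_to_grayscale(image)),
        lambda: image,
    )
    return image
```

Because these operations only alter pixel values, the bounding box labels remain valid without any coordinate adjustment.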
4.5. Performance Metrics
In supervised machine learning, ground truth labels must be provided to the model to help it learn important features in the data. For object detection, the ground truth is a bounding box that gives the location of the object of interest in the image. It is provided to the model as rectangular coordinates in the image plane, i.e., the x- and y-coordinates of the upper left corner of the bounding box, along with the width and height of the bounding box in pixels. To evaluate the performance of an object detection model, the overlap between the bounding box predicted by the model and the ground truth bounding box is calculated.
The Intersection-Over-Union (IoU) is the metric used to calculate the overlap of the predicted and ground truth bounding boxes (1). The IoU, also known as the Jaccard Index, is the ratio of the area of intersection of the two regions (boxes) in the numerator to the area of union of the two regions (boxes) in the denominator (Figure 9). If there is no overlap but the boxes share an edge, the ratio has a value of 0, as there is no intersection area. If there is a slight overlap, the ratio takes a value close to zero, as the area of intersection is very small relative to the area of union. As the overlap between the boxes increases, the area of intersection increases and the area of union decreases; therefore, the value of the ratio approaches 1. At full overlap, the intersection and union areas are equal and the ratio is 1.
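As a worked example consistent with the definition above and with the (x, y, width, height) ground truth format described earlier, a plain-Python IoU computation might look as follows; the example boxes are arbitrary.

```python
# IoU = area of intersection / area of union for two axis-aligned boxes
# given as (x, y, width, height); example boxes are arbitrary.
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)                      # intersection corner 1
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)  # intersection corner 2
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 100, 100), (50, 50, 100, 100)))  # 2500 / 17500 ≈ 0.143
```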
For a detection to be considered successful, a minimum percentage of overlap must be defined; this threshold varies from project to project. In most cases, an IoU threshold of 0.5 (50% overlap) is considered a successful detection. Other common thresholds are 0.75 and 0.95 (75% and 95% overlap). Increasing the IoU threshold makes the detected bounding boxes more accurate but increases the difficulty and the training time needed to produce successful detections [121].
An object detection model provides multiple predicted bounding boxes for a single image. It is important to quantify how many predicted boxes were correct and how many of the ground truth boxes were detected. For every detection event, there will be boxes that were predicted correctly according to the set threshold (True Positives), boxes that were predicted but do not correctly match any ground truth (False Positives), and ground truth boxes that were never predicted (False Negatives). These terms can be efficiently quantified as Precision and Recall. Precision is the proportion of the predicted positives that were actually correct (2), and Recall is the proportion of the actual positives that were predicted correctly (3).
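Written out explicitly, and consistent with the definitions above (Equations (2) and (3) in the paper's numbering), these quantities are:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
```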
For all the classes present in a dataset, the precision and recall values can be calculated and plotted to create a Precision-Recall curve. The Precision-Recall curve is a good way to evaluate the performance of an object detection model: a model is considered to be performing well if its precision remains high as its recall increases. A model can sometimes achieve perfect precision (no False Positives) or perfect recall (no False Negatives), but rarely both. Precision-Recall curves usually start at a high precision value, and precision decreases as recall increases (Figure 10).
A drawback of Precision-Recall curves is that they are difficult to compare directly between models. To address this, the quantitative value of Average Precision (AP) is used. The AP is the area under the Precision-Recall curve. It is a single numerical value, which makes it easy to compare different models, and can be defined as the average value of precision over all values of recall between 0 and 1 (4) [121]. Furthermore, if the APs of all classes are averaged, the resultant value is called the mean Average Precision (mAP).
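In the continuous form implied by this definition (Equation (4) in the paper's numbering), with p(r) denoting precision as a function of recall and N the number of classes, AP and mAP can be written as:

```latex
\mathrm{AP} = \int_{0}^{1} p(r)\,dr, \qquad
\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_{i}
```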
4.6. Experimental Design
In the first set of experiments, both of the selected object detection models, i.e., RetinaNet and Faster-RCNN, were trained with three different backbones: ResNet50, ResNet101, and ResNet152. The models were compared on the COCO detection evaluation metrics, which include the mAP averaged over IoU = 0.5:0.05:0.95, the mAP at IoU = 0.50, the mAP at IoU = 0.75, the mAP for small objects (area < 32² pixels), the mAP for medium objects (area between 32² and 96² pixels), and the mAP for large objects (area > 96² pixels).
In the second set of experiments, both RetinaNet and Faster-RCNN were trained using different color space augmentations for 80,000 steps. The backbone was fixed to ResNet101 for all the image augmentation experiments. The input image size was kept fixed at 1024 × 1024 pixels and no resizing was performed, to avoid variations due to pixel interpolation. The color space augmentations used were random brightness, random contrast, random saturation, random distort color, and random Red-Green-Blue (RGB) to gray-scale conversion. Again, the models were compared on the COCO detection metrics.
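As a rough sketch of how such augmentations can be switched on in a training pipeline (assuming the TensorFlow Object Detection API; option names follow its preprocessor.proto and default parameters are used, which may differ from the study's settings):

```python
# Hedged sketch: enabling the five photo-metric augmentation options in a
# TF Object Detection API training configuration (default parameters).
from object_detection.utils import config_util

configs = config_util.get_configs_from_pipeline_file("pipeline.config")
train_config = configs["train_config"]

for option_name in ("random_adjust_brightness", "random_adjust_contrast",
                    "random_adjust_saturation", "random_distort_color",
                    "random_rgb_to_gray"):
    step = train_config.data_augmentation_options.add()  # new PreprocessingStep
    getattr(step, option_name).SetInParent()              # enable the option
```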
Furthermore, all the models were compared on the total testing loss along with their overall mAP to quantify their effects on model fitting. The percentage improvement of each model was also computed relative to either a baseline or the smallest model in the training. In this study, six mean Average Precision (mAP) measures from the COCO evaluation metrics were used to quantify the performance of the different object detection models, namely, the mAP averaged over IoU = 0.5:0.05:0.95, the mAP at IoU = 0.50, the mAP at IoU = 0.75, and the mAP for small, medium, and large objects as defined above. The resolution of the images used in this study was 1024 × 1024 pixels, meaning that small objects occupy less than about 0.1% of the image area and large objects occupy more than about 0.9%. The training set of the Global Wheat Dataset alone contains 3658 images with 163,644 individual bounding boxes, of which 2109 were small objects, 134,299 medium objects, and 27,236 large objects. The testing set contains 1477 images with 44,331 bounding boxes, of which 1132 were small objects, 24,851 medium objects, and 18,348 large objects. The mAP values were calculated at the end of each training session on the validation dataset.
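For reference, a small helper that bins a bounding box into the COCO size categories (using the standard 32² and 96² pixel-area thresholds, which match the approximate 0.1% and 0.9% figures quoted above for 1024 × 1024 images) could look like this:

```python
# Bin a bounding box into COCO size categories by pixel area
# (small < 32^2, medium 32^2-96^2, large > 96^2).
def size_category(width_px: float, height_px: float) -> str:
    area = width_px * height_px
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"

print(size_category(20, 30))    # small (600 px^2)
print(size_category(80, 60))    # medium (4800 px^2)
print(size_category(120, 100))  # large (12000 px^2)
```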
5. Results
In Table 2, the results from the first set of experiments are listed. For RetinaNet, the model with the smallest backbone, i.e., ResNet50, performed best among the RetinaNet backbones, with a mAP of 0.381; the ResNet101 and ResNet152 backbones had mAP values of 0.362 and 0.366, respectively. For the Faster-RCNN models, the model with the largest backbone, i.e., ResNet152, performed best, with a mAP of 0.418; the ResNet50 and ResNet101 backbones had mAP values of 0.377 and 0.403, respectively.
Table 3 presents the results of the second set of experiments, i.e., the image augmentation methods for the RetinaNet model with a ResNet101 backbone. The baseline model with no data augmentation had a mAP of 0.362. The first augmentation tested was random contrast; the resultant mAP was 0.351, meaning the model performed worse than the baseline. The next augmentation was random brightness, with a mAP of 0.385, a reasonable improvement over the baseline. Random saturation performed similarly, producing a mAP of 0.385. Random distort color was the best-performing augmentation among the methods tested, giving a mAP of 0.409. Finally, random RGB to gray-scale, with a mAP of 0.375, performed worse than the other methods except random contrast.
The results of the second set of experiments for the Faster-RCNN model with a ResNet101 backbone are shown in Table 4. The baseline mAP for Faster-RCNN with no augmentation was 0.403, higher than RetinaNet's baseline. This can be accounted for by Faster-RCNN being a two-stage detector, which is expected to be more robust and accurate than RetinaNet, a single-stage detector. The first augmentation method tested was random contrast, providing a mAP of 0.406, slightly greater than the baseline. A mAP of 0.414 was obtained when using the random brightness augmentation method during training, a considerable increase over the baseline. The next augmentation was random saturation, giving a mAP of 0.412, slightly lower than that of random brightness. The highest mAP, similar to the RetinaNet case, was provided by the random distort color augmentation method, with a value of 0.428. Lastly, the random RGB to gray-scale method gave a mAP of 0.418, comparable to the random brightness performance.
Furthermore, to better understand the performance of each image augmentation method, the testing loss was compared along with the mAP for both models. Methods that provided a higher mAP and a lower testing loss performed better overall than the other methods used to train the RetinaNet models. In Figure 11, random distort color is shown to have a testing loss of 1.305 with a mAP of 0.409, exhibiting the best performance for the RetinaNet model. The lowest mAP value was produced by the random contrast-trained model (0.351) and the maximum loss was produced by the baseline model (1.442). In the case of Faster-RCNN, a similar pattern was seen. The random distort color-trained model provided the lowest testing loss (1.503) and the maximum mAP (0.428) compared to the other Faster-RCNN models trained in this experiment. Among the Faster-RCNN models, the random contrast-trained model gave the highest testing loss (1.616) and the baseline model had the lowest mAP (0.403) (Figure 12). In Table 5, the percentage improvement provided by each augmentation method in the case of RetinaNet is displayed. The best-performing augmentation method, i.e., random distort color, gave a model improvement of 12.98%, whereas the random contrast method decreased the model performance by 3.03%. The percentage model performance improvements using different image augmentation methods for Faster-RCNN are presented in Table 6. Similar to the RetinaNet case, the best performance improvement was obtained using the random distort color method (6.2%). The least improvement with Faster-RCNN was produced by the random contrast method, i.e., 0.74%.
6. Discussion
In both the RetinaNet and Faster-RCNN cases, it was observed that the random contrast image augmentation method either decreased model performance or did not improve it significantly compared to the baseline-trained model. This can be attributed to the random contrast method, while altering pixel values, sometimes creating features that behave as negative examples during model training. The model learns these features even though they are not reliably indicative of the presence of the object of interest. The next step was to check how the random contrast augmentation method affects model performance when combined with the other augmentation methods.
The other four image augmentation methods, i.e., random brightness, random saturation, random distort color, and random RGB to gray-scale, were each used in combination with random contrast to train new models for both RetinaNet and Faster-RCNN. These were then compared to the models trained using only the corresponding individual augmentation method. Table 7 and Table 8 present the percentage changes in model performance when the other augmentation methods were combined with random contrast for training RetinaNet and Faster-RCNN, respectively. In both cases, random contrast did not provide any significant improvement except when used with random RGB to gray-scale for training RetinaNet. This may be because, in combination with RGB to gray-scale, random contrast adds more difficult examples for the model to train on, making the model more robust to unknown sets of images. Overall, random contrast, specifically when used with RetinaNet and Faster-RCNN trained on this agricultural dataset, did not provide any significant advantage: it either reduced or only slightly increased model performance compared to the other image augmentation methods, which altered the pixel values to produce diversity in the dataset. A mid-size dataset was used in this study; these findings might not apply to very small datasets, where only a small number of examples is available for training. For small-scale datasets, further investigation is required to understand the effects of the data augmentations.
7. Conclusions
Low-level and high-level descriptors can be used to explain a scene in an image. Low-level features relate to color, texture, etc., while high-level features deal with objects and the relative information of those objects. For any image-based dataset, a description of the scene in the images is required to explain the information the images provide. Real-world agricultural datasets can contain a variety of low-level and high-level features. Because of this, instances were found in the literature where different descriptions were given to explain the same type of scene. Numerous studies that attempted to define scenes in real-world agricultural datasets were reviewed to address the variations found in dataset descriptions. Common themes and patterns were then identified, and a definition was provided to describe scenes in real-world agricultural datasets as low-contrast complex backgrounds.
Furthermore, the effects of model size and photo-metric image augmentation methods on one-stage (RetinaNet) and two-stage (Faster-RCNN) object detection deep learning models were studied. There are countless examples in the literature stating that larger models perform better, so there was a need to study the effects of model size when training on low-contrast complex background agricultural datasets. It was observed that for one-stage detectors, smaller models performed better than larger models. The model sizes were varied using different backbones for the networks, including ResNet50, ResNet101, and ResNet152. In the case of two-stage detectors, model performance improved as the backbone size increased. In the second set of experiments, different photo-metric image augmentation methods were compared to understand the effect they have on model performance. It is often suggested that, in the case of limited data availability, image augmentations can be used to make a deep learning model more robust and generalized, but it was found in the literature that some augmentations might affect model performance negatively. Comparing different augmentation methods helped in understanding their effects on one-stage and two-stage detectors when training on low-contrast complex background applications. It was observed that, except for the random contrast image augmentation method, every other method significantly improved model performance. Even when random contrast was used with other augmentation methods, there was no significant improvement.
In future research, the study can be expanded to include other single-stage and two-stage detectors with backbones similar to the ones used in this study. For models with different backbone structures, a broader comparison can be made to evaluate the effects of different backbone structures containing a similar number of parameters. In addition, other datasets related to agricultural operations should be explored to verify the patterns observed in this study. This study used row crop images as its dataset; future studies can use image datasets from other agricultural focus areas, such as horticulture, floriculture, specialty crops, etc. This will help expand knowledge of object detection models for low-contrast complex background applications in agriculture.