Augmenting Crop Detection for Precision Agriculture with Deep Visual Transfer Learning—A Case Study of Bale Detection

Abstract: In recent years, precision agriculture has been studied as a promising means of increasing crop production with fewer inputs to meet the growing demand for agricultural products. Computer vision-based crop detection with unmanned aerial vehicle (UAV)-acquired images is a critical tool for precision agriculture. However, object detection using deep learning algorithms relies on a significant amount of manually prelabeled training data as ground truths. Detection of field objects, such as bales, is especially difficult because of (1) long-period image acquisitions under different illumination conditions and seasons; (2) limited existing prelabeled data; and (3) few pretrained models and research as references. This work increases bale detection accuracy with limited data collection and labeling by building an innovative algorithm pipeline. First, an object detection model is trained using 243 images captured under good illumination conditions in the fall from the crop lands. In addition, domain adaptation (DA), a kind of transfer learning, is applied to synthesize training data under diverse environmental conditions with automatic labels. Finally, the object detection model is optimized with the synthesized datasets. The case study shows that the proposed method improves bale detection performance, including the recall, mean average precision (mAP), and F measure (F1 score), from averages of 0.59, 0.7, and 0.7 (object detection alone) to averages of 0.93, 0.94, and 0.89 (object detection + DA), respectively. This approach could be easily scaled to many other crop field objects and will contribute significantly to precision agriculture.


Introduction
According to the United Nations population estimates and projections, the growing world population will be nearly 10 billion in 2050 [1]. By 2050, there will be an increase in food demand of 59-98% [2]. To increase crop production while minimizing inputs, the adoption of advanced computing technologies, including computer vision, machine learning, and big data analytics, has recently gained interest among agricultural researchers. Precision agriculture takes advantage of these technologies to minimize the inputs required, improve crop quality, and increase yields.
With the reduction in equipment costs, the increase in computing power, and the availability of non-destructive food assessment methods, many researchers and practitioners have focused on computer vision and machine learning to improve crop quality and yields [3]. Computer vision helps with object detection, and machine learning allows useful information to be extracted from the collected data, showing tremendous advantages over the traditional methods applied in agriculture [4].
A number of research efforts have shown that combining computer vision and machine learning techniques across the multiple periods of crop production and harvesting is promising [5]. Computer vision in agriculture can be readily applied to analyze digital images collected from the field and to provide high-level, understandable information to users [6]. For example, computer vision not only detects weeds quickly and effortlessly, but also enables accurate treatment with the help of a ground robot [7]. In addition, computer vision can detect diseases on crops and inform users so that they can take action [8,9].
Image Acquisitions: Using an unmanned aerial vehicle (UAV) is an efficient way to collect images as inputs for computer vision. UAVs have been widely used in precision agriculture as well as in many other fields, such as path planning, design, and wildlife rescue [10,11]. A UAV combined with computer vision can also contribute to remote sensing to help inform farmers about the geo-specific crop yield and identify crop diseases [12,13]. Sometimes, decisions must be made off-board once the data have been collected and processed by the UAV, based on the information that the computer vision technique extracts from the images [14,15]. For example, UAVs can be used to detect a potential issue and then obtain high-resolution images or inspect and apply treatments accordingly.
Bale detection challenges: Object detection methods are commonly sensitive to illumination and to changes in the object and background domains. A non-robust model can easily fail if it does not take into account the variation in light conditions [16,17]. Because of the diversity of illumination situations, seasons, and weather conditions, object detection is more complicated outdoors than indoors, where humans can maintain a consistent environment, as shown in Table 1.
To emphasize, the illumination and hue change are the most significant factors impacting the bale detection model performance. Illumination variation, including the change in light conditions, with/without shadow covering, plays a significant role in object detection in the context of outdoor practices. Patrício and Rieder [18] suggested that consistent light conditions between the source domain and target domain will decrease the difficulties of shaping accurate classification models built on the deep learning architecture. A similar conclusion has been drawn by Hornberg [19], in that adequate lighting in the environment can increase the reliability of the performance of the models based on the collected images.
Hue change due to season transitions and variation in light conditions is another key factor to be considered in precision agriculture. To build a reliable deep learning model, Baweja et al. [20] compensated for hue variation during the vineyard growing season, when the light is not strong enough, by adding extra light from strobe lighting mounted on the ground robot that captured the images.
Since a supervised deep learning-based object detection model needs a large number of images labeled as ground truths before training, its accuracy, and thus its detection performance, depends on the quality of the labeled data. One approach to improving the quality of the labeled data is to balance it by including a variety of images from the target domains listed in Table 1. However, guaranteeing this quality for every condition could require manually labeling a large number of images, which takes significant resources.
To reduce the manual labeling effort, style transfer methods have been developed. To minimize the discrepancy in domain distribution between the source and target domains, we propose a model that combines the convolutional neural network (CNN)-based YOLOv3 model with domain adaptation (DA), a representative method in transfer learning. Domain adaptation works very well where the tasks are similar and only the domain distribution differs between the source and target domains [21]. In the Methodology section, we describe the proposed biomass detection model built on CNN and DA. YOLOv3 was selected to build the CNN model because of its accuracy and speed in object detection [22]. To realize the DA, an unpaired image translation method, the cycle-consistent generative adversarial network (CycleGAN), was used to handle the image differences caused by illumination, hue, and clarity discrepancies.
Table 1

Lighting condition
To gain the efficiency of agriculture, different process routines are conducted to crops in the morning, afternoon, and night.
Decreasing the difficulties of shaping accurate classification models built on the deep learning architecture.

Shadow
Shadows are commonly seen during the daytime. This always happens in the rainy season. The images taken by the UAV include shadows in certain months.
Shadows crossing the objects decrease the classification accuracy for these kinds of objects. The scale of the background and the bale size also make it worse.

Seasonal Change (Target Domain 2)
Hue change
Farms with different plants have various harvest seasons. As a result, the bales and backgrounds vary across seasons.
The inconsistent changes in the background and bales across seasons decrease bale detection performance.

Haze
Haze weather sometimes happens with a temperature drop or precipitation change. This may cause a grain lifecycle adjustment, which needs to be monitored.
The high performance of supervised learning and semisupervised learning (object detection) in haze weather is always a challenge.

Snow covered
Tracking bales in winter and in a snow environment is also important for continuously feeding livestock.
Restoration-based algorithms may mislead or overfit the object compared to the original one. The snow weather reduces the features of the objects in the images.
The present study sought to test the proposed method by collecting data from the field by a UAV equipped with RGB cameras, including 243 images captured with good illumination conditions in the fall and 150 images in other conditions, with the baled biomass also collected. Manually labeling each baled biomass from these conditional images is essential to train the YOLOv3 model. In addition, we also needed to manually label the images collected from other conditions to test the accuracy of the prediction given by the model and provide validation of the method.
Remote Sens. 2021, 13, 23
In addition to using our proposed model, we also applied the traditional background subtraction algorithm developed by Li et al. [23] to the same data. The results show that our method achieved the best F scores, indicating that it performs well when dealing with the discrepancy in domain distribution caused by different outdoor environments. Part of the images was manually labeled, while the remaining images, with different illumination contexts, share the same labels through the use of CycleGAN for domain transfer. The processed images were used as inputs for the proposed YOLOv3 model to perform bale detection. The goal was to show that our proposed model, a combination of computer vision and domain adaptation, could improve the accuracy and efficiency of bale detection.
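For reference, the F measure used in this comparison is the harmonic mean of precision and recall. The following minimal sketch computes the three quantities from detection counts. The function name and the counts in the usage lines are hypothetical and are not results from this study.

```python
from typing import Tuple


def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one set of detection counts."""
    detected = true_pos + false_pos   # everything the detector reported
    actual = true_pos + false_neg     # everything present in the ground truth
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only (not data from this study).
p, r, f1 = precision_recall_f1(true_pos=9, false_pos=1, false_neg=3)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is a useful single score for comparing detectors.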
The key contributions of this work are listed in the following three points:
i. For bale detection under illumination conditions, a YOLOv3 model was built. The associated training dataset, with labels as the ground truths, will be released under conditions with the current work to fill the void in bale training datasets.
ii. We constructed an innovative object detection approach (algorithm pipeline), including YOLOv3 and domain adaptation (DA). This approach improves the capability of bale detection.
iii. We augmented the labeled training data with more scenarios using domain adaptation. Combined with our manually labeled data, this provides a valuable training dataset of over 1000 bale images, which will be publicly available after this publication.

Computer Vision in Precision Agriculture
A number of research studies have investigated the application of computer vision in different key steps in agriculture, including observing crop growing, detecting diseases, and facilitating crop harvest [24].
Crop Growth Monitoring: Computer vision techniques have been used to assess the nutritional status of plants. Romualdo et al. [25] conducted research on maize plants to diagnose the nitrogen nutritional status by implementing the computer vision technique at different development stages. Compared to the traditional method that relies on human observations, the computer vision technique improves the detection efficiency and accuracy. Pérez-Zavala et al. [26] proposed a computer vision approach to detect the grape bunches in vineyard scenes, relying on shape and texture descriptors and a bunch separation strategy to realize automatic monitoring of grapevine growth. Chandel et al. [27] applied deep learning models to monitor the water condition of crops and identified water stress with over 90% accuracy. Parra et al. [28] compared various edge detection filters for weed recognition in lawns and found that the sharpening filters provided the best results with low computing requirements.
Disease Detection: Computer vision techniques also help with disease detection in agriculture. Oberti et al. [29] implemented computer vision to detect powdery mildew on grapevine leaves, and the accuracy was improved significantly by adjusting the view angles from 40 to 60 degrees, hence improving the overall quality of the plants. Pourreza et al. [30] explored the application of a computer vision technique to detect Huanglongbing disease on trees infected by a citrus psyllid. To analyze the performance of the model, laboratory and field experiments were conducted, and the results showed that the new method improved the target disease detection accuracy from 95.5% to 98.5%. Beyond identifying a single disease, the computer vision technique also contributes to the classification of multiple diseases of crops. Maharlooei et al. [31] applied image processing technology to detecting and counting soybean aphids, achieving the identification and enumeration of mites with lower costs and a high accuracy in strong light conditions. Toseef and Khan [32] used a fuzzy inference system to build an intelligent mobile application that helps rural farmers diagnose diseases that commonly occur on wheat and cotton crops with 99% accuracy, reducing farmers' losses due to crop diseases and dramatically improving crop yields. Rustia et al. [33] applied an image and environmental sensor network to automatically detect greenhouse insect pests and achieved a 93% average temporal accuracy in terms of counting insect pests.
Crop harvest: Crop harvest is another aspect that benefits from the computer vision techniques. Barnea et al. [34] developed crop harvesting robots by using a color-agnostic shape-based 3D fruit detection technique on a registered image and depth to address the localization issue in precision agriculture, due to shape variations and occlusions. Lehnert et al. [35] designed an approach based on effective vision algorithms for harvesting sweet pepper and protecting the cropping system, demonstrated to be successful by the experiments of harvesting sweet peppers from modified and unmodified crops.
Dealing with the biomass after crop harvesting is essential. Biomass collection can provide economic benefits and, in certain cases, may also benefit future crops [36]. Biomass from crop fields is usually baled into a compact form before collection and transportation. In addition, stacking the bales is desirable because it allows efficient bale-hauling equipment to be used. Other benefits of putting bales into stacks include efficiently clearing the crop field for the next growing cycle; removing bales that would otherwise be a hindrance that adversely affects mechanical crop management operations; and shortening the time between harvest and planting.

Transfer Learning and Domain Adaptation
Transfer learning is a popular machine learning technique that aims to help with repetitive tasks by reusing an existing, already developed model. When labeled data are only available in a source domain, domain adaptation (DA), a common technique in transfer learning, can be applied, as shown in Figure 1. Even a small distribution change or domain shift between the source and target domains, due to illumination, pose, or image quality, can degrade the performance of machine learning models. DA provides an opportunity to mimic the human vision system, making it possible to perform new tasks in a target domain by using labeled data from one or more relevant source domains. A number of research studies have recently addressed the issue of domain shift.
To implement CNN techniques, a large image dataset with manually labeled targets is required, which is expensive and challenging to obtain [37]. By synthesizing images with DA techniques, one can reduce the number of images that need to be collected from the field and solve the problem that arises when labeled data cannot be acquired from the target domain [38]. Various studies have been conducted and have achieved promising results. Ganin et al. [39] trained a deep learning architecture, built from a few standard layers and an additional gradient reversal layer, on labeled images from the source domains together with unlabeled images from the target domains. Othman et al. [40] designed a domain adaptation network to overcome domain shift in classification scenarios where the labeled images from the source domain and the unlabeled ones from the target domain have completely different geographical features.
Overall, when it comes to the problems of domain shift between the source and target domains, the DA technique can not only reduce the costs of data preparation, but also improve image recognition [41,42].

Methodology
Bale detection method pipeline summary: Figure 2 shows the complete structure of the bale detection method proposed in this work, from image acquisition to creating the model and then to augmenting the model. We divided this pipeline into three steps, as follows: Step 1 trains a primary object detection model with YOLOv3, based only on the manually labeled initial-condition images.
Step 2 shows how we use the manually labeled ground truth images to generate more ground truth images with automatic labels. Then, in Step 3, we augment the object detection model from Step 1 using the mixed labeled ground truth images as the training data.
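The automatic labeling in Step 2 can be illustrated with a minimal sketch. It assumes that style translation changes only the appearance of an image (illumination, hue, clarity) and not the position of the bales, so each translated copy can reuse the bounding boxes of its source image. The `Box` fields, the function name, the file-naming scheme, and the domain names below are all hypothetical and are not taken from the released dataset or code.

```python
import os
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Box:
    """One labeled bale; coordinates are assumed normalized to [0, 1]."""
    class_id: int
    x_center: float
    y_center: float
    width: float
    height: float


def propagate_labels(source_labels: Dict[str, List[Box]],
                     target_domains: List[str]) -> Dict[str, List[Box]]:
    """Build the mixed training set: manual labels plus automatic labels."""
    mixed: Dict[str, List[Box]] = {}
    for image_name, boxes in source_labels.items():
        # Manually labeled source-domain image (used in Step 1).
        mixed[image_name] = list(boxes)
        stem, ext = os.path.splitext(image_name)
        for domain in target_domains:
            # Hypothetical name of the style-translated copy of this image.
            translated_name = f"{stem}_{domain}{ext}"
            # The translated image inherits the source boxes unchanged.
            mixed[translated_name] = list(boxes)
    return mixed


# Hypothetical example: one labeled fall image, two synthesized conditions.
manual = {"fall_001.jpg": [Box(0, 0.42, 0.57, 0.08, 0.06)]}
mixed = propagate_labels(manual, ["shadow", "snow"])
print(sorted(mixed))
```

With this scheme, the manual labeling cost is paid once per source image, while the training set used in Step 3 grows with the number of synthesized conditions.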
Step 1: Primary object detection
A YOLOv3 model was trained for primary bale detection using 243 images captured with good illumination conditions in the fall. We define these labeled images as the source domain. CNN-based object detection methods, such as Faster R-CNN, YOLO, and Mask R-CNN, have gained popularity among researchers and have been proven to be efficient [43][44][45]. YOLOv3 was released by Redmon and Farhadi in 2018 as an extension of the previous YOLO versions [46].
In this paper, YOLOv3 is implemented in the bale detection process, taking advantage of its accuracy and speed in object detection. Instead of using multiple networks for analysis, YOLOv3, as indicated by its name (You Only Look Once), passes the input image once through a convolutional neural network, lowering the costs and improving the performance significantly. In addition, the network splits the input into multiple regions and predicts bounding boxes and their classification probabilities for each one. By focusing on the global context of the image, YOLOv3 decreases the possibility of making a location classification error.
To implement YOLOv3, we used PyTorch to train the model and to make inferences, based on the Darknet-53 backbone, an architecture with 53 convolutional layers that YOLOv3 relies on as a deeper feature extractor. The initial weights between the layers were provided by the Darknet-53 backbone [46]. Leaky ReLU activation and normalization were added to every layer. Instead of using any form of pooling, which often contributes to a loss of low-level features, we applied a stride of 2 in convolutional layers to down-sample the feature maps. Stride refers to the step between successive applications of the filter to the input image. An image of size 416 × 416, for instance, is down-sampled to 13 × 13 by a total stride of 32. The shape of the input images is (m, 416, 416, 3). The output consists of bounding boxes together with the recognized classes. Each bounding box is defined by six numbers, $(p_c, b_x, b_y, b_h, b_w, c)$. When $c$ (class) is expanded into an 80-dimensional vector, 85 numbers are used to describe every single bounding box, as shown in Figure 3.
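The arithmetic in this paragraph can be checked with a short sketch. It only reproduces the numbers stated above (input size 416, total stride 32, 80 classes); the function names are illustrative and are not part of the YOLOv3 or PyTorch APIs.

```python
def downsampled_size(input_size: int, total_stride: int) -> int:
    """Side length of the feature map after down-sampling by total_stride."""
    if input_size % total_stride != 0:
        raise ValueError("input size must be a multiple of the total stride")
    return input_size // total_stride


def box_vector_length(num_classes: int) -> int:
    """Numbers per bounding box: p_c, then (b_x, b_y, b_h, b_w), then one score per class."""
    return 1 + 4 + num_classes


print(downsampled_size(416, 32))  # 13
print(box_vector_length(80))      # 85
```

Five stride-2 convolutions give a total stride of 2^5 = 32, which is how a 416 × 416 input ends up as a 13 × 13 map without any pooling layer.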
Similar to other object detectors, the features learned by the convolutional layers are used to predict the detection outputs, such as the coordinates of the bounding boxes and the class label. YOLOv3 uses 1 × 1 convolutions to make predictions, so the prediction map has the same size as the feature map it is computed from. Each cell in the prediction map predicts a fixed number of bounding boxes, as shown in Figure 4.
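As a rough illustration of how cells relate to bounding boxes, the sketch below finds the cell of a 13 × 13 prediction map that contains a box center and counts the boxes predicted by the whole map. It assumes box centers normalized to [0, 1] and three boxes per cell, which is the usual YOLOv3 setting but is not stated in this paper. The function names are hypothetical.

```python
from typing import Tuple


def responsible_cell(x_center: float, y_center: float, grid_size: int) -> Tuple[int, int]:
    """Return (row, col) of the grid cell that contains a normalized box center."""
    if not (0.0 <= x_center <= 1.0 and 0.0 <= y_center <= 1.0):
        raise ValueError("box center must be normalized to [0, 1]")
    # Clamp so that a center lying exactly on the right or bottom edge
    # still falls inside the last cell.
    col = min(int(x_center * grid_size), grid_size - 1)
    row = min(int(y_center * grid_size), grid_size - 1)
    return row, col


def boxes_in_prediction_map(grid_size: int, boxes_per_cell: int) -> int:
    """Total number of bounding boxes predicted by one prediction map."""
    return grid_size * grid_size * boxes_per_cell


print(responsible_cell(0.5, 0.5, 13))   # (6, 6)
print(boxes_in_prediction_map(13, 3))   # 507 (3 boxes per cell is an assumption)
```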
Similar to other object detectors, the features learned by the convolutional layers are passed on to predict the detection, such as the coordinates of the bounding boxes and the class label. YOLOv3 uses 1 × 1 convolutions for prediction, so the prediction map has the same spatial size as the feature map it is applied to. Each cell in the prediction map predicts a fixed number of bounding boxes, as shown in Figure 4.
Step 2: Augmenting the training data with domain adaptation
Domain adaptation, a kind of transfer learning, is used here to augment the training data with automatically labeled images of new scenarios. As shown in the lower left of Figure 2, images of more than two conditions are listed as Target Domain 1, 2, etc. Traditionally, all the target objects in the images need to be manually labeled. Our proposed method, which combines YOLOv3 with DA, not only reduces the laborious manual labeling work but also maintains the performance of the model by applying style transfer. The method draws on other state-of-the-art research that uses a similarly structured approach. For example, Song et al. [47] proposed an advanced subspace alignment algorithm combined with convolutional neural networks to classify remote sensing images, with domain adaptation treated on a theoretical level. Another fundamental work [48] proposes a pipeline of algorithms, including Faster R-CNN, DA, and H-divergence theory. While that work validates the approach on Cityscapes and other public datasets, it lacks a practical-use scenario. Khodabandeh et al. [49] insert noise during the pre-processing of the training datasets and DA, which makes the object detection model resilient to random noise. However, since object detection with DA is still under development, most related research is based on a few public datasets with limited practical implementation scenarios, especially in the agriculture domain. There is no similar approach or concept related to bale detection. The following approach is a first step toward augmenting the bale detection ability, taking advantage of state-of-the-art algorithms.
We labeled only the images captured under one condition and then collected more images with diverse illuminations, hues, and styles under different environments. Then we built a domain transfer model to convert the images of the initial condition into new images of the other conditions. Instead of manually labeling all the inputs required by the model, only part of the images was manually labeled, and the remaining inputs inherited the same labels automatically because style transfer leaves the positions of the objects unchanged. In this way, a more robust YOLOv3 model that performs accurately on the augmented styles of images could be achieved.
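The label sharing described above can be sketched as follows. This is a minimal illustration, assuming that each synthetic image is generated from exactly one labeled source image and that annotations are stored as normalized text lines (so they stay valid if the synthetic image is merely resized). The function name, file names, and annotation line are hypothetical and do not come from this work.

```python
from pathlib import Path
from typing import Dict, List


def labels_for_synthetic(synthetic_to_source: Dict[str, str],
                         source_labels: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Give each synthetic image the annotation lines of its source image."""
    shared: Dict[str, List[str]] = {}
    for synthetic_name, source_name in synthetic_to_source.items():
        source_key = Path(source_name).stem
        if source_key not in source_labels:
            raise KeyError(f"no manual label for source image {source_name}")
        # Copy the list so later edits to one label set do not affect the other.
        shared[Path(synthetic_name).stem] = list(source_labels[source_key])
    return shared


if __name__ == "__main__":
    # Hypothetical annotation: class, x_center, y_center, width, height.
    manual = {"fall_001": ["0 0.50 0.50 0.10 0.08"]}
    mapping = {"snow_001.jpg": "fall_001.jpg", "shadow_001.jpg": "fall_001.jpg"}
    print(labels_for_synthetic(mapping, manual))
```

Only the source-domain images need manual annotation in this scheme; every additional style generated from them reuses the same annotation at no extra labeling cost.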
The DA technique is applied to learn the translation mapping from the source domain (S) in the initial environment to the target domain (T) in the other environments, and vice versa, as shown in Figure 5. The images from the two domains are unpaired, i.e., not related in any way. CycleGAN [50] was implemented to transfer the styles between the two domains and to synthesize target-domain images from the source domain (S).
Two GANs were used to apply CycleGAN to the style transfer. Each one includes one generator and one adversarial discriminator. The generator Gen_(S,T) in the first GAN translates images from the source domain (S) to the target domain (T), while the adversarial discriminator D_T outputs the likelihood that an image is a real image from the target domain (T). Similarly, the generator Gen_(T,S) in the other GAN translates images from the target domain (T) to the source domain (S), and its adversarial discriminator D_S outputs the likelihood that an image is a real image from the source domain (S). I_S and I_T represent the sets of images from domains (S) and (T), and i_S ∈ I_S and i_T ∈ I_T denote individual images in domains (S) and (T), respectively.
T̂ represents the domain of the images synthesized in Figure 2, that is, the domain of the diverse seasons and illuminations of the synthetic images generated from the real initial-environment images, while Ŝ denotes the domain of the synthetic initial-environment images generated from the real other-environment images. By applying Gen_(S,T), images i_S ∈ I_S are transferred to synthetic images in T̂, while the corresponding adversarial discriminator improves the model by encouraging the translated images to be hardly distinguishable from the domain (T). Ideally, when an image translated from the source domain (S) to the target domain (T) is translated back to the source domain (S), we should recover the identical image. However, learning models are not perfect, and two different images will be obtained. The difference between the two images is measured by the cycle consistency loss. Equation (3) defines the loss in adversarial training. To train these generators and discriminators, we need to solve the optimization problem in Equation (4). Gradient descent is first applied to Equation (4), followed by backpropagation, to allow the generator Gen_(S,T) to complete the style transfer between the real initial-style images and the synthetic other-style images without changing the spatial relationship between the biomass in the images.
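The displayed equations referred to in this passage are not reproduced in the text. For orientation, the standard CycleGAN formulation of Zhu et al., written in the notation of this section, is sketched below. The weighting factor \lambda, the use of the L1 norm, and the correspondence between these expressions and the numbered equations of this work are assumptions.

```latex
% Cycle consistency loss: translate to the other domain and back, then compare.
\mathcal{L}_{cyc} =
  \mathbb{E}_{i_S \in I_S}\!\left[ \left\| Gen_{(T,S)}\!\left(Gen_{(S,T)}(i_S)\right) - i_S \right\|_1 \right]
  + \mathbb{E}_{i_T \in I_T}\!\left[ \left\| Gen_{(S,T)}\!\left(Gen_{(T,S)}(i_T)\right) - i_T \right\|_1 \right]

% Adversarial loss for the S -> T direction (the T -> S direction is symmetric).
\mathcal{L}_{adv}\!\left(Gen_{(S,T)}, D_T\right) =
  \mathbb{E}_{i_T \in I_T}\!\left[ \log D_T(i_T) \right]
  + \mathbb{E}_{i_S \in I_S}\!\left[ \log\!\left(1 - D_T\!\left(Gen_{(S,T)}(i_S)\right)\right) \right]

% Full objective: generators minimize, discriminators maximize.
\min_{Gen_{(S,T)},\, Gen_{(T,S)}} \; \max_{D_S,\, D_T} \;
  \mathcal{L}_{adv}\!\left(Gen_{(S,T)}, D_T\right)
  + \mathcal{L}_{adv}\!\left(Gen_{(T,S)}, D_S\right)
  + \lambda \, \mathcal{L}_{cyc}
```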
Step 3: Optimize the YOLOv3 model with the extended datasets from Step 2.
There are two methods we can apply to optimize the performance of the model: retraining and fine-tuning. Retraining a model using the extended data with proper preprocessing is straightforward and robust; however, it takes longer than fine-tuning.
A common way to transfer a trained model to a new dataset is fine-tuning, which is more efficient when the new dataset is small. Fine-tuning a trained model can not only reduce the probability of overfitting but also provide better generalization if the original dataset and the new dataset share similar domains. In this research, we applied both methods and kept the better of the two results.

Experiment Equipment
The input data, the baled housing biomass, were collected from the fields by a drone from the Arlington Research Station (Arlington, WI, USA). The drone, equipped with a 1-inch Exmor R CMOS sensor and a gimbal stabilizer that handles the lateral and vertical vibration, allowed us to collect images from different heights, as shown in Figure 5. Through each campaign, the locations of the baled biomass were identified by a Global Navigation Satellite System (GNSS) and their corresponding centers were surveyed by a Carlson Surveyor 2. These two additional systems are for validation of the location accuracy and as a contribution to the public database for future research.

Bales Data Collection and Description
All the images of the bale biomass in the fields were taken with one drone model. Images from two different heights, 200 ft and 400 ft, were captured over seven campaigns to provide different resolutions for testing the model performance. The size of the collected images was 5472 × 3648 pixels, corresponding to a 20-megapixel resolution, as shown in Figure 6. In addition, we created a second dataset by rescaling the collected images to 1080 × 720 with a 3:2 ratio, simulating a camera with less than 1-megapixel resolution. The numbers of images used in the experiments are shown in Table 2, as "Initial condition". A total of 300 images were used for training, validation, and testing. All these images were collected in the fall under good illumination conditions, without shadows. We also collected 128 real images under the other conditions as ground truths, both for training the CycleGAN model and for testing the performance. We used independent training, validation, and testing datasets when we conducted this research.
We tested the various cases without using their original images anywhere in the training or validation. More images under the other conditions were generated by the CycleGAN model.
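The resolution figures quoted for the two datasets can be checked with a few lines of arithmetic. The sketch below is only a sanity check of the stated image sizes; the helper names are hypothetical.

```python
from fractions import Fraction


def megapixels(width: int, height: int) -> float:
    """Pixel count expressed in millions of pixels."""
    return width * height / 1_000_000


def aspect_ratio(width: int, height: int) -> Fraction:
    """Width-to-height ratio reduced to lowest terms."""
    return Fraction(width, height)


if __name__ == "__main__":
    for width, height in [(5472, 3648), (1080, 720)]:
        print(width, height, megapixels(width, height), aspect_ratio(width, height))
```

Both sizes reduce to the same 3:2 ratio, so the rescaling changes the resolution without distorting the bales.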

Primary Bale Detection with YOLOv3 Corresponding to Step 1
The YOLOv3 detector trained with only the initial condition images in Step 1 was applied to detect bales in the real images. Although the training process does not include images under the extended conditions, we still include these images in the testing results for comparison with the optimized detection model. The testing results, in terms of precision, recall, mAP, and F1 score for each scenario, are presented in Table 3. The high precision indicates a low incidence of false positives, meaning that the algorithm rarely detected a bale where there was none. On the other hand, the low recall means that the algorithm fails to detect some of the bales in the image.

The prediction performance on the initial condition images achieved high values (0.92) of precision, recall, mAP, and F1 score. However, these four indices degrade for the extended conditions. The precision values for all conditions, except shadow, are over 0.85. As shown in Figure 8b, bales inside or partially covered by a shadow usually failed to be detected. The other three indices (recall, mAP, and F1 score) are all lower than expected for the extended conditions (average values are less than 0.59, 0.7, and 0.7, respectively). The F1 score is the harmonic mean of precision and recall; since recall was low, the F1 score was also low. The mean average precision (mAP) was low and varied across the different simulated scenarios. However, mAP was high in the haze condition, since all the haze images were collected with minor haze or fog, which may significantly blur the background rather than the bales. The general results are as expected, since there are few image samples of the extended conditions in the training datasets. Some undetected bales and low confidence scores can be seen in Figure 8.
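To make the relationship between these indices concrete, the sketch below computes precision, recall, and F1 from detection counts. The counts in the usage example are hypothetical and chosen only to show how a low recall pulls the F1 score down even when precision is high; they are not taken from Table 3.

```python
from typing import Tuple


def detection_metrics(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one set of detection counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts: few false alarms, but many missed bales.
    precision, recall, f1 = detection_metrics(true_pos=59, false_pos=5, false_neg=41)
    print(round(precision, 3), round(recall, 3), round(f1, 3))
```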

Augmenting the Training Data with CycleGAN Corresponding to Step 2
During Step 2, we built a CycleGAN model to convert the real images to synthetic (fake) images, as shown in Figure 9a. In the figure, real_A and real_B are real images, fake_B is the synthetic image generated from real_A, and rec_A is the image A reconstructed from fake_B. The second row follows the same idea as the first one. With this CycleGAN model, 1200 synthetic images were generated and used as the extended training dataset in Step 3.

Identity loss measures the discrepancy introduced when the generator is given a real sample that already belongs to its target domain; it regularizes the generator to produce high-fidelity translations and to leave such samples essentially unchanged. No extra change is needed for images that are almost indistinguishable from the target domain. Generally, a greater identity loss value is applied for unknown content. Figure 9b,c show the loss values during different and repeated training processes. These two plots show a slight reduction in some losses, especially cycle_A (in green), as expected. The six loss values and four loss values are therefore all partial references for the training status, but they do not by themselves determine whether training was successful. Given the purpose of these loss values, we do not expect a growing loss; a stable trend, with a flat or decreasing plot, is expected. More information about the model parameters and logic can be found in Zhu et al. [45]. Figure 10 shows some examples of the augmented bale images under multiple environmental conditions.
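As a point of reference, the identity loss in the CycleGAN formulation of Zhu et al. is commonly written as below, using the generator notation Gen_(S,T) and Gen_(T,S) from Step 2. Whether exactly this form and weighting were used in this work is an assumption.

```latex
% Identity loss: a generator fed a real image of its own target domain
% should return that image (almost) unchanged.
\mathcal{L}_{identity} =
  \mathbb{E}_{i_T \in I_T}\!\left[ \left\| Gen_{(S,T)}(i_T) - i_T \right\|_1 \right]
  + \mathbb{E}_{i_S \in I_S}\!\left[ \left\| Gen_{(T,S)}(i_S) - i_S \right\|_1 \right]
```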

Optimized YOLOv3 Model with Extended Datasets Corresponding to Step 3
The optimized YOLOv3 detector, trained with both the real images and the synthetic images in Step 3, was applied to the same testing datasets. Table 4 shows the testing results in terms of precision, recall, mAP@0.5, and F1 score for each scenario, which are compared with the performance of the primary YOLOv3 model from Step 1. The YOLOv3 models in Steps 1 and 3 perform similarly on bale detection under the initial condition, as shown in the row "Initial condition" in Tables 3 and 4. In most cases, the recall, mAP, and F1 score are clearly improved, from averages of 0.59, 0.7, and 0.7 to averages of 0.93, 0.94, and 0.89, respectively. All the significantly increased values are marked in green. The increase in recall indicates that most of the bales that could not be detected in Step 1 are detected in Step 3. Meanwhile, the precision stays at a similar level, with occasional reductions because both false positives and true positives occasionally increase. This result is strong evidence that using synthetic images from transfer learning is a reasonable approach to enhance the detection capability on images under new conditions. The same examples of the tested bale images under multiple environmental conditions, using the model trained in Step 3, are shown in Figure 11. These typical results under different conditions show the improvement compared to Figure 8.

Comparison and Advantages
To better understand the detection improvement on images under different environmental conditions, we plotted the F1 values of Step 1 and Step 3 under each condition separately, as shown in Figure 12. Under most conditions, the performance increases considerably, except for the initial condition, hue change (early winter), and haze; this is because improvement in object detection performance generally slows down once the accuracy exceeds 80%. Our aim is to improve the detection accuracy of the conditions with a lower accuracy (less than 80%). So, in our case, we expect to see a big jump for conditions such as illumination, shadow, hue change (summer), and snow, which all have an accuracy below 75%. The following analysis shows that we not only kept the original high performance but also increased the performance of the conditions that originally had a low accuracy.

Firstly, the initial condition already has a high accuracy of over 93% with either model. Similarly, the hue change (early winter) condition keeps a relatively high performance of around 90% before and after our approach. This method maintains the high accuracy score, with only a slight change, during the enhancement of the training dataset volume and the false-negative samples. Meanwhile, the haze images enlarge the base number when calculating the F1 score. Although the haze condition accuracy is high, we can still make improvements by collecting more and better-quality images season by season. However, this would take a long time and would entail continuous collection work for our lab, which is not the core contribution of this algorithm research. Secondly, for conditions such as illumination, shadow, hue change (summer), and snow, our method significantly improves the detection accuracy, by around 15%, 26%, 10%, and 28%, respectively.
Since more images are included to train the model, the detection accuracy in the initial condition is slightly compromised, while the performance for the other environmental conditions is improved significantly, in that the minimum F1 measure is above 80%. Adding more images from various conditions means that more diverse bale types are considered. This may increase the false positives and false negatives, resulting in slightly degraded precision, recall, and F1. This phenomenon is common in deep-learning object detection practice. Overall, the YOLOv3 + DA model demonstrates its advantage in augmenting the detection ability, with at least 80% accuracy for all conditions. Moreover, we estimated the time cost of manually labeling bales in all images, as shown in Table 5.
Step 1 only needs image labeling under the initial condition, which takes around 90 h. After that, we have two options to augment the bale detection model. One is to label every new image under all extended conditions, with 260 extra hours of work; the other is to train a CycleGAN model, with no labeling beyond the first 90 h. Since the general F1 score, precision, recall, and mAP from the proposed approach are all close to or above 0.9, this is sufficient for this specific task. Thus, the proposed method provides the additional advantages of saving time and labor.
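The labeling effort of the two options can be compared with a short calculation. The sketch below uses only the hours quoted above (90 h for the initial condition and 260 extra hours for the extended conditions); the function name is hypothetical.

```python
from typing import Tuple


def labeling_savings(initial_hours: float, extra_hours: float) -> Tuple[float, float]:
    """Return (total hours if everything is labeled by hand, fraction saved by skipping the extra labeling)."""
    full_manual_hours = initial_hours + extra_hours
    saved_fraction = extra_hours / full_manual_hours
    return full_manual_hours, saved_fraction


if __name__ == "__main__":
    total, saved = labeling_savings(initial_hours=90, extra_hours=260)
    print(f"full manual labeling: {total} h, saved with domain adaptation: {saved:.0%}")
```

The total agrees with the figure listed for labeling all conditions in Table 5.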

Figure 12. F1 performance of the Step 1 model and the Step 3 model under each environmental condition.

Train Approach                         Time Cost (Hours)
w/ Initial condition images            90
w/ Domain adaptation images            90
w/ Labeled all conditions images 1     350
1 "Labeled all conditions images" means manually labeling real images under all conditions and then training a model with these labeled data.

Conclusions
A YOLOv3 bale detection model combined with a domain adaptation approach is proposed in this paper, augmenting the ability for crop/bale detection across three seasons, different illumination conditions, and diverse weather conditions. This method is advantageous because it requires only limited manual labeling. In this work, only the images captured under the initial condition needed to be manually labeled as the source-domain data. Then the domain adaptation models (CycleGAN) were trained to transfer the source-domain images to the target domains (images under other conditions) while reusing the same annotation files. We thus effectively augmented the training datasets under extended conditions without extra manual labeling. After these two steps, we trained the YOLOv3 model again with the augmented training datasets. The optimized YOLOv3 model shows a significant improvement in general detection performance. This approach decreases the labor and time costs in the course of improving crop quality and yields. It also shows strong scalability to many other crops and will significantly reduce the cost of precision agriculture. Future work should include collecting more real images under more specific conditions, generating more synthetic images associated with these conditions, and combining the active learning method with the CycleGAN model, making the whole algorithm pipeline more robust and easier to use.