A UAV Open Dataset of Rice Paddies for Deep Learning Practice

Abstract: Unmanned aerial vehicles (UAVs) have recently been broadly applied in the remote sensing field. With the great number of UAV images now available, deep learning has been reinvigorated and has produced many results in agricultural applications. The popular image datasets for deep learning model training were generated for general-purpose use, in which the objects, views, and applications reflect ordinary scenarios. UAV images, however, exhibit different patterns, mostly from a look-down perspective. This paper provides a verified, annotated dataset of UAV images, described in terms of data acquisition, data preprocessing, and a showcase of CNN classification. The dataset was collected by a multi-rotor UAV platform flying a planned scouting routine over rice paddies. This paper introduces a semi-automatic annotation method based on the ExGR index to generate training data of rice seedlings. For demonstration, this study modified a classical CNN architecture, VGG-16, to run patch-based rice seedling detection. K-fold cross-validation was employed to obtain an 80/20 training/test split. The accuracy of the network increases with the number of epochs, and all divisions of the cross-validation dataset achieve 0.99 accuracy. The rice seedling dataset provides the training-validation dataset, patch-based detection samples, and the orthomosaic image of the field.


Introduction
Given global climate change and a projected increase of two billion in the world population over the next 30 years [1,2], sufficient yields of grain crops are considered in many countries one of the most important issues for maintaining food security. Remote sensing from satellites for land use [3][4][5][6] and agricultural monitoring [7][8][9] has been widely adopted since the space era [10]. Satellites carry multispectral sensors, hyperspectral sensors, panchromatic sensors, and synthetic aperture radar, which have been widely used for land use classification, agricultural monitoring and management, and disaster assessment [11][12][13][14]. The often-used satellites, such as Landsat, SPOT, Sentinel, and RADARSAT, provide a monthly- to weekly-level revisit cycle and up to meter-level spatial resolution [15][16][17][18]. However, limited by their temporal and spatial resolution, satellite images usually cannot provide real-time and highly detailed data for precision agriculture [19]. Thanks to the development of mechanical and electronic techniques, unmanned aerial vehicles (UAVs) have been broadly applied in the remote sensing field. Compared to satellite remote sensing, UAVs possess many advantages, such as ultra-high spatial resolution, flexible monitoring ability, and reasonable cost. Thus, UAVs have enabled various notable applications that combine multispectral data, thermal data, and field information to classify crop species, assess disasters, and monitor plant growth [20][21][22][23].
With the development of computing power and the great number of UAV images, deep learning techniques have been reinvigorated and have produced many results in agricultural applications. Egli and Höpke [24] developed a lightweight convolutional neural network (CNN) for automated tree species classification with high-resolution UAV images. Chen et al. [25] applied an object detection network to counting strawberries in ultra-high-resolution UAV images for yield prediction. Yang et al. [26] applied deep learning to UAV images to estimate rice lodging over a vast area. Li et al. [27] proposed an improved object detection model for high-precision detection of hydroponic lettuce seedlings. Pearse et al. [28] applied a CNN model for tree seedling detection to map and monitor the regeneration of forests in UAV images. Oh et al. [29] applied object detection to cotton seedling counting in UAV images to analyze plant density for precision field management.
Although deep-learning applications on UAVs are numerous, the UAV datasets vary with the application and few are freely accessible. The commonly used image datasets are CIFAR-10, ImageNet-1000, and COCO [30][31][32], released by the Canadian Institute For Advanced Research (CIFAR), the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), and Microsoft, respectively. Images in the above-mentioned datasets are for general-purpose use, in which the objects, views, and applications reflect ordinary scenarios. Images acquired from UAVs, in contrast, are mostly taken from a look-down perspective. This significant difference in viewing angle on the same objects results in a different context that degrades the applicability of general-use datasets to UAV deep-learning applications.
Rice is one of the major grain crops worldwide: over half of the world's population consumes rice as a staple food, and over 85% of that consumption occurs in Asia [33,34]. To precisely estimate the grain yield and quality of rice, determining the hill number of rice seedlings is key to assessing cultivation density and maturity uniformity in precision agriculture. This paper collected UAV images of rice seedlings in-field at the early growth stage from a UAV's look-down perspective. For demonstration, the rice seedling dataset was used to identify the number and positions of rice seedlings with a lightweight CNN classification architecture. The proposed CNN model is trained with a 5-fold cross-validation dataset, which reduces the effect of biased data on the model. In addition, the performance is evaluated by classification accuracy.
The aim of this paper is to provide a UAV image dataset of rice paddies for data sharing, making labeled and unlabeled data findable and accessible through domain-specific repositories. To this end, this paper focuses on describing the dataset: what methods were used to collect and produce the data, where the dataset may be found, and how to use the data, with useful information and a showcase.

Data Introduction
The dataset published on GitHub consists of the orthomosaic images, the training-validation dataset, and the demo dataset. The orthomosaic image (see Figure 1) is the image stitched from a series of nadir-like view UAV images. The dataset provides 13 images for consecutive growth stages, which were imaged in 2018, 2019, and 2020 as listed in Table 1. All the images are georeferenced in the TWD97/TM2 zone 121 (EPSG: 3826) projected coordinate system. The training-validation dataset (green bounding area) was generated by the method discussed in Section 2.4, and images were saved in a specific subfolder for each class. The demonstration dataset (red bounding area) is used for the test of object detection. This study clipped eight square images with an 8 m × 8 m area, each containing approximately one thousand hills of rice seedlings. The details of the demo dataset are discussed in Section 3.1.
Remote Sens. 2021, 13, 1358
Figure 1. An overview of field No. 80 (cyan bounding area). Image acquired on 7th August 2018. The green bounding area represents the area for the training-validation dataset, and the red bounding area represents the area for the object detection demonstration dataset.


Training-Validation Dataset
The training-validation dataset was collected by a multi-rotor UAV flying a planned scouting routine over a paddy operated by the Taiwan Agricultural Research Institute (TARI) in Wufeng District, Taichung, Taiwan. The data were collected on 7, 14, and 23 August 2018, between 07:03 and 08:00 local time. The UAV flew at a constant altitude and carried an RGB sensor at an approximately nadir view for the duration of data collection, using a 4-rotor commercial-range UAV, the DJI Phantom 4 Pro (Da-Jiang Innovations, Shenzhen, PRC) [35]. The equipped sensor is an RGB sensor with a 1-inch diagonal size and a focal length of 8.8 mm. The sensor parameters are listed in Table 2. The UAV flew nominally at a 20 m altitude above ground, yielding a spatial resolution of 5.3 mm/pixel. The ground speed was between 1.8 and 2.2 m/s and was relatively constant during data collection. Figure 2a depicts the data collection area over a satellite image, and Figure 2b depicts the flight routes (white dots) and the orthomosaic image overlapped on the satellite image. The designed route overlap was 80% and the side overlap was 75%, resulting in totals of 349, 299, and 443 images for the three missions, respectively. The details of the training-validation data collection missions are listed in Table 3.
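The reported spatial resolution can be cross-checked from the flight and sensor parameters. A minimal sketch, assuming a typical 1-inch sensor width of 13.2 mm and 5472 pixels across (values not stated in the text; altitude and focal length are from Table 2):

```python
# Approximate ground sampling distance (GSD) for the 2018 flights.
# ASSUMED values: a 1-inch-type sensor is ~13.2 mm wide with 5472 px across;
# focal length (8.8 mm) and altitude (20 m) are taken from the text.
def gsd_mm_per_px(altitude_m, focal_mm, sensor_width_mm, image_width_px):
    """GSD = altitude * pixel pitch / focal length, in mm/pixel."""
    pixel_pitch_mm = sensor_width_mm / image_width_px
    return altitude_m * 1000 * pixel_pitch_mm / focal_mm

gsd = gsd_mm_per_px(altitude_m=20, focal_mm=8.8,
                    sensor_width_mm=13.2, image_width_px=5472)
print(f"{gsd:.1f} mm/pixel")  # ~5.5 mm/pixel, close to the reported 5.3
```

The small difference from the reported 5.3 mm/pixel is consistent with the nominal (rather than exact) flight altitude.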

Expansion Dataset
To test the impact of environmental disturbances, additional UAV datasets acquired in 2019 and 2020 are also provided. The data were acquired in field No. 78, which is located next to field No. 80. In 2020, image data were acquired with another RGB sensor, the DJI Zenmuse X7, an interchangeable-lens camera equipped with a 24 mm focal length lens [36]. The details of this sensor are listed in Table 2. The designed flight height was 40 m, subject to the narrow FOV and high sensor resolution, to acquire approximately the same spatial resolution as the 2018 and 2019 UAV datasets.
The expansion data provide more UAV paddy images for challenging tests. Among them, several images show the influence of environmental disturbances, such as varying illumination, weather, soil moisture, and seedling sizes, and the presence of algae; these can be treated as expansion image datasets. Some examples are shown in Figure 3. To adapt to these disturbances, users can augment the data through photometric and geometric transformations or add noise to the original training set to learn more robust features [37]. The details of the expansion data acquisition missions are listed in Table 4.
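The augmentations suggested above can be sketched with NumPy alone; the jitter ranges below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(patch):
    """Apply simple photometric and geometric augmentations to an
    H x W x 3 uint8 patch (a sketch; jitter ranges are assumptions)."""
    out = patch.astype(np.float32)
    # photometric: random brightness/contrast jitter
    out = out * rng.uniform(0.8, 1.2) + rng.uniform(-20, 20)
    # additive Gaussian noise for robustness
    out = out + rng.normal(0.0, 5.0, size=out.shape)
    out = np.clip(out, 0, 255).astype(np.uint8)
    # geometric: random horizontal flip and 90-degree rotation
    if rng.random() < 0.5:
        out = out[:, ::-1]
    out = np.rot90(out, k=int(rng.integers(0, 4)))
    return out

patch = rng.integers(0, 256, size=(48, 48, 3), dtype=np.uint8)
aug = augment(patch)
print(aug.shape)  # (48, 48, 3)
```

For square patches such as the 48 × 48 pixel samples in this dataset, rotations and flips preserve the input size expected by the classifier.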

Data Preprocessing
UAV images were orthorectified and stitched with commercial software, Agisoft Metashape (St. Petersburg, Russia) [38], to form a single orthomosaic image. To extract the rice seedlings rapidly, this paper introduces a semi-automatic annotation method using the excess-green-minus-excess-red index (ExGR) to enhance the greenness of the images [39]. Yen's thresholding method was applied to obtain a binary map [40]. A morphological process was then employed to enhance the object features, and the centric point of every object was calculated using contour extraction from the OpenCV library [41]. Finally, the rice seedling objects can be cropped and saved as single images one by one, or used to generate the annotations for the object detection training set. The workflow of preprocessing is shown in Figure 4.
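The ExGR-based steps of this workflow can be sketched as follows. This is a simplified stand-in, not the authors' implementation: a fixed threshold replaces Yen's method, and scipy.ndimage replaces the OpenCV contour extraction; the 0.05 threshold is an assumption:

```python
import numpy as np
from scipy import ndimage

def exgr(rgb):
    """Excess-green-minus-excess-red index: ExGR = ExG - ExR,
    with ExG = 2g - r - b and ExR = 1.4r - g on channels scaled to [0, 1]."""
    r, g, b = [rgb[..., i].astype(np.float32) / 255.0 for i in range(3)]
    return (2 * g - r - b) - (1.4 * r - g)

def seedling_centroids(rgb, thresh=0.05):
    """Binarize the ExGR map (fixed threshold stands in for Yen's method),
    clean it up morphologically, and return object centroids (row, col)."""
    binary = exgr(rgb) > thresh
    binary = ndimage.binary_opening(binary, structure=np.ones((3, 3)))
    labels, n = ndimage.label(binary)
    return ndimage.center_of_mass(binary, labels, range(1, n + 1))

# toy check: one bright-green 10 x 10 patch on brownish soil
img = np.zeros((100, 100, 3), np.uint8)
img[:] = (120, 90, 60)                    # soil-like background
img[20:30, 20:30] = (40, 180, 40)         # green "seedling"
print(seedling_centroids(img))            # ~[(24.5, 24.5)]
```

Each centroid can then be used to crop a fixed-size training patch or to seed a bounding-box annotation.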


UAV Dataset of Rice Seedling Classification
One paddy image selected from the UAV dataset acquired on 7th August 2018 is adopted as training data for rice seedling classification. Training samples of UAV images extracted by binarization and morphological processing (discussed in Section 2.4) were manually verified by agricultural experts. The UAV images in this dataset are categorized into two classes, rice seedling and arable land, which contain 28 K and 26.5 K samples, respectively. The dataset comprises two annotated classes (Figure 5), 54.6 K samples in total, with each image 48 × 48 pixels in size. Table 5 shows the number of samples of each class for the training, validation, and testing of classification. The dataset was split in an 80/20 ratio of training/test data, the ratio most commonly adopted in deep learning applications [42]. In addition, a 10% subset of the test samples was used to validate the training result. A total of 43.7 K samples were used for training; 1.1 K and 9.8 K samples were used for validation and testing, respectively.

UAV Dataset of Rice Seedling Detection
In this paper, object detection annotations are provided for three serial missions: 7th, 14th, and 23rd August 2018. The training and validation images were cropped from eight subsets into 600 training samples; each subset generates 25 training samples with a size of 320 × 320 pixels, and each sample contains approximately 50 seedlings. The annotations were generated in PASCAL VOC [43] format with a graphical image annotation tool, LabelImg [44]. An example of these XML files is given in Appendix A to show the information about image size, classes, and coordinates of bounding boxes. Examples of the three growth stages of the rice seedling detection dataset are shown in Figure 6.
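Annotations in this layout can be read back with the Python standard library alone; a minimal sketch, where the tag names follow the PASCAL VOC convention and the sample class name "rice" is illustrative:

```python
import xml.etree.ElementTree as ET

def parse_voc(xml_text):
    """Parse a PASCAL VOC annotation into (width, height, boxes), where each
    box is (class_name, xmin, ymin, xmax, ymax)."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    w, h = int(size.find("width").text), int(size.find("height").text)
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append((obj.find("name").text,
                      int(bb.find("xmin").text), int(bb.find("ymin").text),
                      int(bb.find("xmax").text), int(bb.find("ymax").text)))
    return w, h, boxes

# a tiny annotation with one box, for illustration
xml_doc = ("<annotation><size><width>320</width><height>320</height>"
           "<depth>3</depth></size><object><name>rice</name><bndbox>"
           "<xmin>10</xmin><ymin>12</ymin><xmax>40</xmax><ymax>44</ymax>"
           "</bndbox></object></annotation>")
w, h, boxes = parse_voc(xml_doc)
print(w, h, boxes)  # 320 320 [('rice', 10, 12, 40, 44)]
```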

Data Application
The rice seedling dataset was fed to a deep learning classifier for training. The training phase involves hyperparameter tuning, including learning rates, decay ratio of learning rates, batch sizes, and the number of epochs. This study modified a classical CNN architecture, VGG-16 [45], to demonstrate a simple classification.

Demonstration of Rice Seedling Detection
To demonstrate data application in patch-based object detection scenarios, this paper clipped 8 images from the orthomosaic image (Figure 7), each covering an 8 m × 8 m region with a size of 1527 × 1527 pixels. The ground truth object detection annotations are also provided for the eight demo images in PASCAL VOC format.


Classification Model
This paper performed the image classification on the dataset using a convolutional neural network (CNN) algorithm modified from the classical VGG-16 architecture, chosen for its proven classification performance. The model was redesigned with a relatively simple network structure, keeping the iconic stacked-convolution structure but reducing the number of convolution layers, filters, and fully-connected layers, which decreases the number of parameters in the training phase and mitigates overfitting. The visualized architecture of the network is shown in Figure 8. The input images are 48 × 48 pixels and contain three visible bands (R, G, B). Table 6 shows the layer parameters of the model.


The layers in the CNN are defined as follows:
1. The first two convolution layers each comprise 6 filters with a kernel size of 3 × 3 pixels. Each convolution layer is followed by a rectified linear unit (ReLU) operation. This conception is adopted from the VGG-16 architecture; such so-called stacked convolutions can achieve nearly the same result with fewer parameters and computations than a larger convolution kernel. In addition, the convolution operation uses the same-padding option, which pads the boundary pixels before the convolution so that the output remains the same size as the input tensor.
2. The stacked convolution layers are followed by a batch-normalization operation, which speeds up convergence and prevents the vanishing-gradient problem, and a max-pooling layer with a kernel size of 3 × 3 pixels and a stride of 3.
3. The second stacked convolution layer and batch-normalization layer follow the same manner as the first, except that the convolution layers use 16 filters. The batch-normalization layer is followed by a max-pooling layer with a kernel size of 4 × 4 pixels and a stride of 4.
4. The first fully-connected layer comprises 64 neurons, followed by a ReLU and a dropout operation. Dropout mitigates overfitting by training only a random subset of active neurons; the dropout rate was set to 0.1.
5. The second fully-connected layer has two neurons, representing the two classes of images in the rice seedling dataset. The output layer applies a softmax activation function, forcing the sum of the output values to equal 1.0. This activation also limits each output value to between 0 and 1, so each output represents the probability of its class.
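The five-step description above can be assembled into a Keras model; a minimal sketch following the stated layer sizes, with initializers and other details left at Keras defaults rather than taken from the paper:

```python
from tensorflow.keras import layers, models

def build_model(num_classes=2):
    """A sketch of the simplified VGG-style classifier described above;
    layer sizes follow items 1-5, other hyperparameters are Keras defaults."""
    return models.Sequential([
        # stack 1: two 3x3 convolutions with 6 filters each (same padding)
        layers.Conv2D(6, 3, padding="same", activation="relu",
                      input_shape=(48, 48, 3)),
        layers.Conv2D(6, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=3, strides=3),
        # stack 2: two 3x3 convolutions with 16 filters each
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=4, strides=4),
        # classifier head: 64-neuron dense layer, dropout, softmax output
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.1),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_model()
```

With 48 × 48 inputs, the two pooling stages reduce the feature maps to 16 × 16 and then 4 × 4, so the flattened vector feeding the dense layers has 256 elements.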

Performance Evaluation
The following evaluation metrics were adopted in this study to evaluate the classification model [46].

Precision
Precision is the ratio of correct classifications to the total number of classifications in a specific class. A low precision indicates a large number of false positives. Precision can be represented as:

Precision_c = TP_c / (TP_c + FP_c)

where TP_c denotes the samples of the positive class correctly classified by the model, and FP_c denotes the samples the model misclassifies as the positive class.

Recall
Recall is the ratio of the number of correct classifications to the total number of samples of the class. A high recall indicates a small number of misclassified samples. Recall can be represented as:

Recall_c = TP_c / (TP_c + FN_c)

where FN_c denotes the samples of the positive class that the model misclassifies as the negative class.

Accuracy
Accuracy is the fraction of the model's classifications that are correct, calculated as the number of correct classifications divided by all classifications:

Accuracy = (TP_c + TN_c) / (TP_c + FP_c + FN_c + TN_c)

where TN_c denotes the samples that the model correctly classifies as the negative class.

F1-Score
The F1-score is the harmonic mean of precision and recall. This metric usually reflects the robustness of the classification task, and can be calculated as:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)
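The four metrics above can be computed directly from the per-class confusion counts; the toy counts below are illustrative, not results from the paper:

```python
def metrics(tp, fp, fn, tn):
    """Per-class precision, recall, accuracy, and F1-score as defined above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# e.g. a toy confusion for the "rice seedling" class
p, r, a, f = metrics(tp=95, fp=5, fn=5, tn=95)
print(round(p, 3), round(r, 3), round(a, 3), round(f, 3))  # 0.95 0.95 0.95 0.95
```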

Model Training
This study adopted the Python programming language to implement the preprocessing workflow and the classification. The deep learning framework is TensorFlow version 2.2 [47], and the libraries used are scikit-image, Matplotlib, and NumPy.
At the beginning of training, a Gaussian distribution was used to randomly initialize the weights of the layers. To train the network, the adaptive moment estimation (Adam) optimizer [48] was adopted with an initial learning rate of 5 × 10⁻⁵, a batch size of 128, and 20 epochs. To avoid possible bias from any particular division of the training dataset, k-fold cross-validation was introduced, with k set to 5 to obtain an 80/20 training/test split. The accuracy of the network increases with the number of epochs, and all divisions of the cross-validation dataset achieve a 0.99 accuracy, close to 1.0 (Figure 9). All divisions of the cross-validation dataset show a steady increase in validation accuracy and a steady decline in loss. As Figure 9 shows, the model performs well without overfitting. To choose the best of the five models, this paper compared the validation accuracy of each model, all of which are above 99.9%, and the one with the lowest validation loss (the fifth model) was chosen for the evaluation and demonstration of patch-based rice seedling detection.
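The 5-fold index generation can be sketched with NumPy alone (the seed and shuffling scheme are assumptions; the sample count matches the 54.6 K total reported above):

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation;
    with k = 5 each fold gives the 80/20 train/test split used above."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

train, test = next(kfold_indices(54600, k=5))
print(len(train), len(test))  # 43680 10920
```

Each of the five folds in turn serves as the held-out test set while the remaining four are used for training.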


Model Evaluation and Detection Demonstration Results
Five divisions of the test dataset were tested with the evaluation metrics using the second model, discussed in Sections 3.3 and 3.4, as shown in Table 7. The results indicate that the model possesses superior classification ability on all five test datasets, with the F1-scores of every class reaching 99.9%. Figure 10 shows the post-processing of the patch-based rice seedling detection in the Subset 7 demo image shown in Figure 7. The rice seedling detection consists of overlapped patch-based image detection and post-processing of heatmaps. Images for detection are subset into many overlapping patches (a sliding window) to form a long sequence of 48 × 48 pixel image sets, which are fed to the proposed classification model to output the probability of each pixel belonging to each class.
The classification results were reordered to form a heatmap (Figure 10a) whose size is identical to the original image. A threshold of 0.99 was then applied as the classification confidence (Figure 10b). An erosion operation with a diamond-shaped filter was applied to disconnect slightly adjacent objects (Figure 10c). Finally, the findContours() function from OpenCV was applied to extract objects, and the boundingRect() function was called to get the top-left position of each object and the width and height of its bounding box. To visualize the bounding boxes, the boxes were drawn in yellow with a width of 2 pixels on the raw image (Figure 10d).
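The heatmap-to-boxes steps above can be sketched as follows. This sketch uses scipy.ndimage as a stand-in for the OpenCV findContours()/boundingRect() calls, so it illustrates the logic rather than reproducing the authors' code:

```python
import numpy as np
from scipy import ndimage

def boxes_from_heatmap(heatmap, conf=0.99):
    """Turn a per-pixel class-probability heatmap into bounding boxes:
    threshold at the confidence level, erode with a diamond-shaped filter
    to split touching objects, then label connected components and take
    their extents as (x, y, width, height)."""
    binary = heatmap >= conf
    # diamond-shaped (4-connected) structuring element for the erosion
    diamond = ndimage.generate_binary_structure(2, 1)
    binary = ndimage.binary_erosion(binary, structure=diamond)
    labels, _ = ndimage.label(binary)
    boxes = []
    for sl in ndimage.find_objects(labels):
        y, x = sl
        boxes.append((x.start, y.start, x.stop - x.start, y.stop - y.start))
    return boxes

# toy check: one 5x5 high-confidence blob
hm = np.zeros((20, 20))
hm[5:10, 5:10] = 1.0
print(boxes_from_heatmap(hm))  # [(6, 6, 3, 3)] after the 1-pixel erosion
```

Note that the erosion shrinks each box by one pixel on every side; the original pipeline draws boxes on the raw image, so a corresponding dilation or margin could be added if exact extents matter.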
The comparison of the prediction image and the ground truth image of Subset 1 is presented in Figure 11, in which the detected seedlings are drawn with yellow bounding boxes. Due to the limited layout, the remaining images can be accessed from the web. Table 8 compares the hill number of rice seedlings from patch-based detection with the ground truth. Among the subsets, Subset 1 and Subset 4 show an error rate above 10% in the number of detected rice seedlings. To explore this issue, this paper focused on the highly undetected areas in these two images. The comparison between prediction images and ground truth images is shown in Figure 12. The undetected rice seedlings are visually smaller than the detected ones. This paper also provides images for the two consecutive growing stages after 7th August. The comparison between successes and failures shows that the undetected rice seedlings are generally smaller than the detected ones.
Figure 11. Comparison of the prediction images and ground truth images of Subset 1 in the detection demonstration.