A High-Resolution Spatial and Time-Series Labeled Unmanned Aerial Vehicle Image Dataset for Middle-Season Rice

: The existing remote sensing image datasets target the identiﬁcation of objects, features, or man-made targets but lack the ability to provide the date and spatial information for the same feature in the time-series images. The spatial and temporal information is important for machine learning methods so that networks can be trained to support precision classiﬁcation, particularly for agricultural applications of speciﬁc crops with distinct phenological growth stages. In this paper, we built a high-resolution unmanned aerial vehicle (UAV) image dataset for middle-season rice. We scheduled the UAV data acquisition in ﬁve villages of Hubei Province for three years, including 11 or 13 growing stages in each year that were accompanied by the annual agricultural surveying business. We investigated the accuracy of the vector maps for each ﬁeld block and the precise information regarding the crops in the ﬁeld by surveying each village and periodically arranging the UAV ﬂight tasks on a weekly basis during the phenological stages. Subsequently, we developed a method to generate the samples automatically. Finally, we built a high-resolution UAV image dataset, including over 500,000 samples with the location and phenological growth stage information, and employed the imagery dataset in several machine learning algorithms for classiﬁcation. We performed two exams to test our dataset. First, we used four classical deep learning networks for the ﬁne classiﬁcation of spatial and temporal information. Second, we used typical models to test the land cover on our dataset and compared this with the UCMerced Land Use Dataset and RSSCN7 Dataset. The results showed that the proposed image dataset supported typical deep learning networks in the classiﬁcation task to identify the location and time of middle-season rice and achieved high accuracy with the public image dataset.


Introduction
One of the main tasks of the State Statistics Bureau is sampling, investigating, monitoring, and estimating the yield of the crops for the sampling plot to calculate the gross domestic product (GDP) in agriculture. Rice is one of the important food crops in the developing world [1], and China produces approximately 30% of the global production [2]. As more than 65% of the Chinese population Wuhan University published a remote sensing image dataset named RSSCN7; the RSSCN7 dataset [15] contains seven classes from 400 images, and each image has a size of 400 × 400 pixels. Another group from Wuhan University published a large-scale open dataset for scene classification in remote sensing images named RSD46-WHU that contains 117,000 images with 46 classes [16]. The ground resolution of most classes is 0.5 m, and the others are approximately 2 m. Northwestern Polytechnical University published a benchmark dataset for remote sensing image scene classification, named NWPU-RESISC45 [17]. This is constructed from a list with a selection of 45 representative classes from all the existing datasets available worldwide.
Each class contains 700 images of size 256 × 256 pixels, and the spatial resolution ranges from 0.2 to 30 m for images sourced from the selection derived from these existing public datasets. The Aerial Image Dataset (AID) [18] is a large-scale dataset for the purpose of scene classification. Notably, as a large dataset, it contains more than 30 classes of buildings and residential and surface targets. The AID consists of 10,000 images of size 600 × 600 pixels. Each class in the dataset contains approximately 220 to 420 images with a spatial resolution that varies between 0.5 and 8 m. These image datasets are extracted from satellite images, such as Google Earth Imagery or the United States Geological Survey (USGS).
Another category of using remote sensing image datasets is for object detection. The Remote Sensing Object Detection Dataset (RSOD-Dataset) is an open dataset for object detection. The dataset includes aircraft, oil tanks, playgrounds, and overpasses, for a total of four kinds of objects [19]. The spatial resolution is approximately 0.5 to 2 m. The count of the images is 2326. The High-Resolution Remote Sensing Detection (TGRS-HRRSD-Dataset) contains 55,740 object instances in 21,761 images with a spatial resolution from 0.15 to 1.2 m [20]. The Institute of Electrical and Electronics Engineers Conference on Computer Vision and Pattern Recognition (IEEE CVPR) holds a challenge series on Object Detection in Aerial Images, which published the DOTA-v1.0 and DOTA-v1.5. DOTA-v1.5 contains 0.4 million annotated object instances within 16 categories, and the images are mainly collected from Google Earth, satellite JL-1, and satellite GF-2 of the China Centre for Resources Satellite Data and Application.
The third category of remote sensing image dataset application is for semantic classification. The Inria Aerial Image Labelling dataset [21] covers dissimilar urban settlements, ranging from densely populated areas with a spatial resolution of 0.3 m. The ground-truth data describe two semantic classes: building and not building. The National Agriculture Imagery Program (NAIP) dataset (SAT-4 and SAT-6) [22] samples image patches from a multitude of scenes (a total of 1500 image tiles) covering different landscapes, such as rural areas, urban areas, densely forested, mountainous terrain, small to large water bodies, agricultural areas, etc.
The fourth category of remote sensing image datasets is for remote sensing image retrieval (RSIR). The benchmark dataset for performance evaluation is named PatternNet [11] and contains 38 classes where each class consists of 800 images with a size of 256 × 256 pixels. Open Images is a dataset of more than nine million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives [23]. The dataset is annotated with 59.9 Million image-level labels spanning 19,957 classes.
The RSID also contains many UAV image datasets. The UZH-FPV Drone racing dataset consists of over 27 sequences [24] with high-resolution camera images. The UAV Image Dataset (UAVid) contains eight categories of street scene context with 300 static images with a size of 4096 × 2160. This dataset is a target for semantic classification [25]. The AISKYEYE team at the Lab of Machine Learning and Data Mining, Tianjin University, China, presented a large-scale benchmark with carefully annotated ground truth for various important computer vision tasks, named VisDrone, and collected 10,209 static images [26]. The VisDrone dataset is captured by various drone-mounted cameras, covering a wide range of aspects, including location. Peking University collected a drone image dataset in Huludao city and Cangzhou city named the Urban Drone Dataset (UDD) [27]. The newly released UDD-6 contains six categories for semantic classification. Graz University of Technology released the Semantic Drone Dataset that focuses on the semantic understanding of urban scenes acquired at an altitude of 5 to 30 m above ground [28]. The Drone Tracking Benchmark (DTB70) is a unified tracking benchmark on the drone platform [29]. The King Abdullah University of Science and Technology (KAUST) released a benchmark for a UAV Tracking Dataset (UAV123) [30], which supports applications of object tracking from UAVs.
However, public special image datasets for farm crops are insufficient. Middle-season corn high-resolution satellite imagery data have been collected at mid-growing season for the identification of within-field variability and to forecast the corn yield at different sites within a field [3]. For weed detection in line crops [6] and evaluation of late blight severity in potato crops [31], tests have been conducted with UAV images and deep learning methods for classification. However, there is currently no article that concentrates on middle-season rice.
As we can ascertain, in the existing remote sensing image datasets or the public image datasets, such as ImageNet, the samples of the images in these datasets always refer to a single object or objects with clear patterns. These image datasets support the ability to identify an object in a category but cannot distinguish the fine classification. Therefore, such datasets are not suitable for the applications of classification of one kind of feature varying in different places and time series.
In this paper, we proposed a UAV image dataset for middle-season rice. We hoped the image dataset can support the monitoring of the growth of middle-season rice. After that, by applying the deep learning methods with this image dataset, we can improve the accuracy and the efficiency of the State Statistics Bureau in agricultural investigation. We scheduled five villages of two cities in Hubei province for source acquisition from 2017 to 2019. The samples of the images have a size of 128 × 128 and 256 × 256, the spatial resolution is 0.2 m, and the total number of the samples is over 500,000 images. This image dataset represents a large area where middle-season rice is grown.
The contributions of this paper are as follows. First, we set up a high-resolution image dataset for middle-season rice using weekly collected images from drones during the growing period for three years in five villages of two cities. Those images reveal the yearly growth for middle-season rice in the fields of plains along the Yangtze River. This accumulative image dataset will help to monitor the growth of the crops.
Second, we applied the vector information of the fields to tag the samples with the spatial and temporal information automatically. Therefore, we obtained thousands of samples for the middle-season rice in different periods and places. This automatic tagged method can extend the building method for the remote sensing image dataset.
Last, we use a fine classification method to learn the spatial-temporal information of the middle-season rice. The image dataset can support different deep learning networks and achieve a good result. This strategy of conversion for fine classification of the image dataset will extend the applications for the original deep learning algorithms.

Study Area
The growing area of middle-season rice includes more than eight provinces in China and has a subtropical warm and humid monsoon climate. The growing area is approximately 16.5 million square kilometers and accounts for 68.1 percent of the total area sown with rice.
In this paper, the selected villages were located in the middle and east of the mid-season rice production area. Figure 1 shows the selected cities and the villages on the map, where five villages of two cities in Hubei are shown. Village 1 and 2 were selected to be in Xiantao city, southern Hubei province, with a geographical location of longitude 113 • 36 30 -113 • 36 47 and latitude 30 • 12 40 -30 • 12 55 . Village 3, 4, and 5 were selected in the administrative region of Zhijiang city, Hubei province, whose geographical location is between longitude 111 • 25 00 -112 • 03 00 east and latitude 30 • 16 00 -30 • 40 00 north. These regions can better reflect the middle and lower reaches of the Yangtze River in China. Hubei; the left city is Zhijiang and the right city is Xiantao; (b) the selected villages 3 and 4 in Zhijiang city, where we set three flight sites for UAV data acquisition in village 3 and two flight sites for data acquisition in village 4. We acquired the data for field blocks, not for the whole village.

The UAV and Camera System
We applied the model Phantom 4 of the DJI Spirits series with an RGB camera for data acquisition. The UAV platform and the camera were ordinary consumer products of low cost that are suitable for small areas less than 1 km and half an hour of operation. Table 1 describes the detailed parameters for the UAV platform and the camera. The model of the UAV is the DJI M200 commercial product, and this UAV can fly with a max speed of wind at 12 m/s, and reach a max speed of 82.8 km per hour. The maximum flying time was about 38 min; thus, we needed to exchange the battery three times for one flight task. The size of the drone was 887 × 880 × 378 mm, and the total payload was approximately 6.14 kg. This platform was equipped with an Hubei; the left city is Zhijiang and the right city is Xiantao; (b) the selected villages 3 and 4 in Zhijiang city, where we set three flight sites for UAV data acquisition in village 3 and two flight sites for data acquisition in village 4. We acquired the data for field blocks, not for the whole village.

The UAV and Camera System
We applied the model Phantom 4 of the DJI Spirits series with an RGB camera for data acquisition. The UAV platform and the camera were ordinary consumer products of low cost that are suitable for small areas less than 1 km 2 and half an hour of operation. Table 1 describes the detailed parameters for the UAV platform and the camera. The model of the UAV is the DJI M200 commercial product, and this UAV can fly with a max speed of wind at 12 m/s, and reach a max speed of 82.8 km per hour. The maximum flying time was about 38 min; thus, we needed to exchange the battery three times for one flight task. The size of the drone was 887 × 880 × 378 mm, and the total payload was approximately 6.14 kg. This platform was equipped with an Inertial Measurement Unit (IMU) and Global Positioning System (GPS) to record the position and pose for each shutting, and we also installed a system clock to record the date and time for the images. The camera system was the Zenmuse X4S, which has 1-inch Complementary Metal-Oxide-Semiconductor (CMOS) sensors in RGB colors. This sensor can acquire images with a resolution of 20 million pixels under a field of view (FOV) of 84 • . This camera system is equipped with three-axis consoles for stabilization during the flight. The UAV system and camera are all commercial products and were acquired not equipped. Considering a standardized description and expression for the growth process of middle-season rice, we introduce the term phenological stage in this paper. On the one hand, there are specialized definitions of each phenological stage, and the obvious differences in the stages can be manually judged by a field investigator. On the other hand, the high-resolution monitoring and modeling for the crop growth in each phenological stage can achieve a higher precision yield estimation.
There are models for product assessments based on the phenological stages, and we first adapted the Biologische Bundesanstalt, Bundessortenamt, and Chemical Industry (BBCH) scale [32] as the basis for the division of phenological stages, which uses decimal code to describe the growth of crops. Taking the existing definition of the middle-season rice phenological stage into consideration, we refer to seven phenological stage settings in this paper, that is, the transplanting, tillering, booting, heading, milk, dough, and maturity stages. Table 2 presents the phenological stages descriptions and provides the judgment of the growing outlook for each stage. We used the changes of leaves, blooming, grains, and the colors of the field to help the investigators decide which stage the field is in during the UAV flight for acquiring photos. Table 3 presents the data acquisition dates for village 2 from 2017 to 2019.
The phenological stages are an important index for growth monitoring not only for middle-season rice but also for a larger category of plants. However, is difficult for the investigators to distinguish tiny differences in a small interval from the text description of phenological stages for middle-season rice. In addition, the middle-season rice in Hubei possesses the same phenological stages, but there is a time difference of one day or several days in dates in different places and in different years. Figure 2 demonstrates an example of the seven phenological stages in the images of village 1 in Xiantao. The color of the images is different at each stage.  help the investigators decide which stage the field is in during the UAV flight for acquiring photos. Table 3 presents the data acquisition dates for village 2 from 2017 to 2019. The phenological stages are an important index for growth monitoring not only for middleseason rice but also for a larger category of plants. However, is difficult for the investigators to distinguish tiny differences in a small interval from the text description of phenological stages for middle-season rice. In addition, the middle-season rice in Hubei possesses the same phenological stages, but there is a time difference of one day or several days in dates in different places and in different years. Figure 2 demonstrates an example of the seven phenological stages in the images of village 1 in Xiantao. The color of the images is different at each stage.

Schedule for Data Acquisition
We set the flight altitude of the UAV to 70 m in Xiantao City and 110 m in Zhijiang City, considering the terrain of the villages, and the resolution of the original raster image was 18 cm and 29 cm for different altitudes, respectively. In addition, each image included four bands, that is, the RGB channels and the transparent channel.
We scheduled a weekly data acquisition by UAV in each village by considering the sowing time and the conditions of the field. Village 1 in Xiantao City was the first place for data collection, which began in June 2017, for a total of 11 weeks; the same data collection times were used for village 5 in 2019. In contrast, the other villages, village 2, village 3, and village 4 were also scheduled from June to October lasting for 13 weeks. The difference exists for the first two weeks in the last three villages because the data acquisition started earlier for two weeks.
We scheduled the data collection in each village in strict accordance with the growth cycle of middle-season rice and arranged it with the same flight altitude and the same periods of time from 10 a.m. to 2 p.m. We designed a table to record the data acquisition. Table 3 shows an example for village 2 from 2017 to 2019. We recorded the date for the UAV flight and the raw images.
We set a different overlap from 30 to 45% during the flight according to the colors of the field blocks, with a higher overlap for a tiny difference between the boundaries and a lower overlap for a clear distinction. This is the reason for the sizable difference in the number of images for each time.
We applied the Pix4dmapper software as the pre-processing tool for raw images. For data processing, we integrated all the photos of the village into one whole image that was saved as a geotiff format file. Figure 3a shows the result of the ortho-image after processing for the village.
In addition, we scheduled the data acquisition tasks to the members of the agriculture investigation team and arranged a vector data production job for the field blocks as well as the determination job of the block of middle-season rice and the phenological stage. The vector data were processed to match with the images. Figure 3b shows the result for the merging of the vector data with the image. In each field block, we noted the type of crops, and for the middle-season rice, the details of the phenological stage. This work was then repeated during each week. We needed to ensure the correct information regarding the vector and the attribution of each block.

Schedule for Data Acquisition
We set the flight altitude of the UAV to 70 m in Xiantao City and 110 m in Zhijiang City, considering the terrain of the villages, and the resolution of the original raster image was 18 cm and 29 cm for different altitudes, respectively. In addition, each image included four bands, that is, the RGB channels and the transparent channel.
We scheduled a weekly data acquisition by UAV in each village by considering the sowing time and the conditions of the field. Village 1 in Xiantao City was the first place for data collection, which began in June 2017, for a total of 11 weeks; the same data collection times were used for village 5 in 2019. In contrast, the other villages, village 2, village 3, and village 4 were also scheduled from June to October lasting for 13 weeks. The difference exists for the first two weeks in the last three villages because the data acquisition started earlier for two weeks.
We scheduled the data collection in each village in strict accordance with the growth cycle of middle-season rice and arranged it with the same flight altitude and the same periods of time from 10 AM to 2 PM. We designed a table to record the data acquisition. Table 3 shows an example for village 2 from 2017 to 2019. We recorded the date for the UAV flight and the raw images.
We set a different overlap from 30 to 45% during the flight according to the colors of the field blocks, with a higher overlap for a tiny difference between the boundaries and a lower overlap for a clear distinction. This is the reason for the sizable difference in the number of images for each time.
We applied the Pix4dmapper software as the pre-processing tool for raw images. For data processing, we integrated all the photos of the village into one whole image that was saved as a geotiff format file. Figure 3a shows the result of the ortho-image after processing for the village.
In addition, we scheduled the data acquisition tasks to the members of the agriculture investigation team and arranged a vector data production job for the field blocks as well as the determination job of the block of middle-season rice and the phenological stage. The vector data were processed to match with the images. Figure 3b shows the result for the merging of the vector data with the image. In each field block, we noted the type of crops, and for the middle-season rice, the details of the phenological stage. This work was then repeated during each week. We needed to ensure the correct information regarding the vector and the attribution of each block. As for the location information, we obtained the information by two methods and used crossvalidation to determine the information. The first method derived the information from the matched vector data, and the second method obtained the GPS information for each shooting moment during the flight. For convenience, we set the location information of the selected village information to all images acquired there. Considering the administrative district, we set the hierarchical folders from As for the location information, we obtained the information by two methods and used cross-validation to determine the information. The first method derived the information from the matched vector data, and the second method obtained the GPS information for each shooting moment during the flight. For convenience, we set the location information of the selected village information to all images acquired there. Considering the administrative district, we set the hierarchical folders from the village and city to province to store all the images; in this manner, the images can hold the location information on a different scale. The location information was saved as the name of each folder in the saved directory.

Methodology
With the purpose of supporting the machine learning methods to identify the images for the correct place and time, we set up the image dataset in three steps. The first step was data preparation. We scheduled the data acquisition to accompany the survey business of the agricultural investigation team of the National Bureau of Statistics. In this stage, we collected the high-resolution images, the vector bounding data of the field block, and the correct attributes for each field block; therefore, we labeled the data for the correct place and phenological stages.
The second stage is for sample generation. We applied commercial software for the image processing, and after that, we developed a tool to obtain the piece of the field with the image and the vector data. We used an extraction algorithm to sample the block images to generate a standard image size for the dataset that was suitable for most machine learning methods. Finally, we tested our image dataset with several machine learning networks, and the results show that the dataset can be used to perform regression by all of the methods with high accuracy. Figure 4 shows the overview flowchart of this paper and the detailed steps in each stage. As we mentioned above, the vector data of blocks and the labeling data of different phenological stages were collected with the help of investigation members.  . The overview flowchart of the processing. This is divided into three steps. The data preparation is for the data acquisition and preparation. The generation stage is building the dataset, and the test stage is the evaluation of the dataset.

Sample Generation
When we scheduled the data acquisition, we first assigned the data collection task to the investigation member who was on duty for all the data collection in one village. The data collection task in each village included the UAV images, which were recorded in .jpeg format with GPS, and date and time information. Applying Pix4DMapper data processing functions, the original images had been produced into the DOM in geotiff format. Based on the DOM, the agricultural investigation team drew the field blocks in the QGIS system, and the final vector data was in a shape file. The input data included the raster data of DOMs and the vector data of shp files of the field blocks. Additionally, the labeling data were stored in the vector data as attributes. We refer to the field block as the statistical unit, in this manner, we used the geometry area of the vector data to assess the image statistical accuracy.
The generated high-resolution UAV images usually had more details. The larger the image size is, the greater the computational power and video memory needed to be input into the convolutional neural network for processing. It was impossible to use the original image for training or testing by deep learning methods. The images also contained other crops in addition to the middle-season rice. Therefore, we cut the original large-size image into field blocks and used the label information to obtain the blocks of middle-season rice.
First, we clipped the labeled polygonal blocks on the raster image according to the vector. The AOI (area of interest) was applied to the raster image according to the label information in the vector and the polygon vertex coordinates. When the image and the vector data were matched, it was easy to cut the block out of the original image. With the help of the Geospatial Data Abstraction Library (GDAL), we developed a tool to cut hundreds of polygonal image blocks from the image that corresponded to geographical entities in the vector data. Figure 5a shows a good result. While it is ideal for all vector data to match with the image, an offset is always obtained. Figure 5b shows that the vector data acquired an offset to the field block of the image. There are two possible reasons: one is that an error is generated by projection, and the other reason is because the data changed, or a low-quality vector was produced. We needed a calibration for the whole image with the vector data, or we needed a careful adjustment for the vector data after the data acquisition.

Sample Generation
When we scheduled the data acquisition, we first assigned the data collection task to the investigation member who was on duty for all the data collection in one village. The data collection task in each village included the UAV images, which were recorded in .jpeg format with GPS, and date and time information. Applying Pix4DMapper data processing functions, the original images had been produced into the DOM in geotiff format. Based on the DOM, the agricultural investigation team drew the field blocks in the QGIS system, and the final vector data was in a shape file. The input data included the raster data of DOMs and the vector data of shp files of the field blocks. Additionally, the labeling data were stored in the vector data as attributes. We refer to the field block as the statistical unit, in this manner, we used the geometry area of the vector data to assess the image statistical accuracy.
The generated high-resolution UAV images usually had more details. The larger the image size is, the greater the computational power and video memory needed to be input into the convolutional neural network for processing. It was impossible to use the original image for training or testing by deep learning methods. The images also contained other crops in addition to the middle-season rice. Therefore, we cut the original large-size image into field blocks and used the label information to obtain the blocks of middle-season rice.
First, we clipped the labeled polygonal blocks on the raster image according to the vector. The AOI (area of interest) was applied to the raster image according to the label information in the vector and the polygon vertex coordinates. When the image and the vector data were matched, it was easy to cut the block out of the original image. With the help of the Geospatial Data Abstraction Library (GDAL), we developed a tool to cut hundreds of polygonal image blocks from the image that corresponded to geographical entities in the vector data. Figure 5a shows a good result. While it is ideal for all vector data to match with the image, an offset is always obtained. Figure 5b shows that the vector data acquired an offset to the field block of the image. There are two possible reasons: one is that an error is generated by projection, and the other reason is because the data changed, or a low-quality vector was produced. We needed a calibration for the whole image with the vector data, or we needed a careful adjustment for the vector data after the data acquisition. For the five villages and seven phenological stages in each village per year, we first obtained more than 12,000 image blocks. We used the image blocks to generate samples because the different blocks always exhibit a different outlook, even when planting occurs at the same time and with the same kind of seeds. The blocks typically have one kind of plant. Figure 6 shows the original image matched with the vector data in village 1 of 2018 and the generated block images for each field block. For the five villages and seven phenological stages in each village per year, we first obtained more than 12,000 image blocks. We used the image blocks to generate samples because the different blocks always exhibit a different outlook, even when planting occurs at the same time and with the same kind of seeds. The blocks typically have one kind of plant. Figure 6 shows the original image matched with the vector data in village 1 of 2018 and the generated block images for each field block. With the help of the vector data, we obtained the block images from the raw data. However, it was still very difficult to cut the samples directly due to the irregular shape and the edge containing redundant pixels. We applied a cross-over strategy for further processing to obtain the ideal small sample from the target polygonal image block, as shown in Figure 7a. The quadrant size was 256 pixels, and only when the quadrants of the four directions existed could we obtain the sample quadrant in the middle. For the whole image, we retrieved all valid quadrants, and from only these valid quadrants could the orange quadrant in Figure 7a be selected as the final sample. The size of the sample depended on the need of the machine learning algorithm; in this paper, we employed pixel sizes of 256 × 256 and 128 × 128.
Even for the ideal matched vector and image pairs, we obtained the block image from the original without a rectangle or square shape, as shown in Figure 7b, and we also found that the pixels near the boundary were always mixed with other features, such as the path or other plants. This block image shows that the image contains many blank pixels and grass pixels. To generate uniform and clear samples, we employ the strategy that begins at the left top of the image, following the sequence from left to right and top to bottom, obeying the cross-quadrant ruler to generate the samples in different sizes. There was always a clear difference in the texture pattern between the middle-season rice blocks and non-middle-season rice blocks. The samples between block images changed considerably. Figure  8a shows the samples for middle-season rice from the same block image in Xianto. Figure 8b shows the non-middle-season rice feature samples from the same source image. With the help of the vector data, we obtained the block images from the raw data. However, it was still very difficult to cut the samples directly due to the irregular shape and the edge containing redundant pixels. We applied a cross-over strategy for further processing to obtain the ideal small sample from the target polygonal image block, as shown in Figure 7a. The quadrant size was 256 pixels, and only when the quadrants of the four directions existed could we obtain the sample quadrant in the middle. For the whole image, we retrieved all valid quadrants, and from only these valid quadrants could the orange quadrant in Figure 7a be selected as the final sample. The size of the sample depended on the need of the machine learning algorithm; in this paper, we employed pixel sizes of 256 × 256 and 128 × 128.
Even for the ideal matched vector and image pairs, we obtained the block image from the original without a rectangle or square shape, as shown in Figure 7b, and we also found that the pixels near the boundary were always mixed with other features, such as the path or other plants. This block image shows that the image contains many blank pixels and grass pixels. To generate uniform and clear samples, we employ the strategy that begins at the left top of the image, following the sequence from left to right and top to bottom, obeying the cross-quadrant ruler to generate the samples in different sizes. With the help of the vector data, we obtained the block images from the raw data. However, it was still very difficult to cut the samples directly due to the irregular shape and the edge containing redundant pixels. We applied a cross-over strategy for further processing to obtain the ideal small sample from the target polygonal image block, as shown in Figure 7a. The quadrant size was 256 pixels, and only when the quadrants of the four directions existed could we obtain the sample quadrant in the middle. For the whole image, we retrieved all valid quadrants, and from only these valid quadrants could the orange quadrant in Figure 7a be selected as the final sample. The size of the sample depended on the need of the machine learning algorithm; in this paper, we employed pixel sizes of 256 × 256 and 128 × 128.
Even for the ideal matched vector and image pairs, we obtained the block image from the original without a rectangle or square shape, as shown in Figure 7b, and we also found that the pixels near the boundary were always mixed with other features, such as the path or other plants. This block image shows that the image contains many blank pixels and grass pixels. To generate uniform and clear samples, we employ the strategy that begins at the left top of the image, following the sequence from left to right and top to bottom, obeying the cross-quadrant ruler to generate the samples in different sizes. There was always a clear difference in the texture pattern between the middle-season rice blocks and non-middle-season rice blocks. The samples between block images changed considerably. Figure  8a shows the samples for middle-season rice from the same block image in Xianto. Figure 8b shows the non-middle-season rice feature samples from the same source image. There was always a clear difference in the texture pattern between the middle-season rice blocks and non-middle-season rice blocks. The samples between block images changed considerably. Figure 8a shows the samples for middle-season rice from the same block image in Xianto. Figure 8b shows the non-middle-season rice feature samples from the same source image.

The Image Dataset
In this paper, we collected the UAV high-resolution images to generate samples in two cities from 2017 to 2019. Table 4 shows the classes and number of samples in each class. There were seven phenological stages in one year for middle-season rice; we set up seven classes for each stage and one class for the others, so there was a total of eight classes for one city per year. We obtained 120 classes in total with more than 500,000 samples in the dataset.

The Image Dataset
In this paper, we collected the UAV high-resolution images to generate samples in two cities from 2017 to 2019. Table 4 shows the classes and number of samples in each class. There were seven phenological stages in one year for middle-season rice; we set up seven classes for each stage and one class for the others, so there was a total of eight classes for one city per year. We obtained 120 classes in total with more than 500,000 samples in the dataset.

The Splits of the Dataset
We simply split the total image dataset into two parts by randomly selecting 75% of the images as the training dataset and the other 25% for the testing dataset. All classes had two folders: one contained the training images, and the other folder stored the test images. The example of images for training and testing in Xiantao on 10 August 2017 is shown in Figure 9.  We simply split the total image dataset into two parts by randomly selecting 75% of the images as the training dataset and the other 25% for the testing dataset. All classes had two folders: one contained the training images, and the other folder stored the test images. The example of images for training and testing in Xiantao on 10 August 2017 is shown in Figure 9.

Comparison with Typical Remote Sensing Image Datasets
We compared our UAV image dataset with serval public image datasets for deep learning in Table 5. Many RSIDs are from Google Earth imagery with a spatial resolution from 0.3 to 30 m. The existing image datasets can support the land cover classification, scene classification, and object detection applications. Our UAV image dataset is also suitable for those applications, and, with the labeled semantic of middle-season rice, this dataset can support the semantic classification of rice. The class of the existing RSID has several to hundreds of samples per class, and there was no information regarding the acquisition time and the location.
In our UAV image dataset, the different places and different phenological stages corresponded to one class that had 3000 to 6000 samples. The existing RSIDs do not have the ability to classify or detect features of different times and places. We defined massive classes to represent the spatial and temporal information.

Tests and Analysis
To verify the effectiveness of the proposed image dataset, we carried out two tests. The first test was the fine classification for spatial and temporal information by deep learning methods. In this test, we examined the classification of mid-season or not in the two cities and five villages and then examined the seven different phenological stages in different places. We selected the PyTorch deep learning framework and four classical neural network algorithms, namely, AlexNet [33], VGG16 [34], ResNet [35], and DenseNet [36] to test and compare with the UCMerced Land Use Dataset [13] and RSSCN7 Dataset [15].
The training hardware adopted in this paper consisted of an NVIDIA GTX-1080 graphics card with 8 GB of video memory and an NVIDIA TITAN V with 12 GB of video memory.
At the beginning of the first test, we selected 1/10 of the images from the training set for the verification set. We also adjusted the learning rate and other hyperparameters to obtain better performance.
During testing, we carried out the tensor normalization (ToTensor) operation on all test set images; thus, the RGB values of the images that ranged from 0 to 255 were normalized to 0 to 1. We applied standardization accordingly (normalize) to speed up the convergence of the algorithm and subsequently applied random shuffle (shuffle) to the training set. Then, we divided the images into small batches in the network. The batch size of the experiment varied according to the algorithm; in this paper, the value employed in the experiment was between 32 and 128. A larger value of the batch size will significantly improve the speed, but the requirement for video memory will also significantly increase.
During the experiment, we recorded the cross-entropy loss between the prediction result of the test set and the real label by the model obtained through each cycle of (epoch) training. We tested each model for 200 epochs and then calculated the average accuracy for our dataset. The test results are shown in Figure 10. We examined the classification of mid-season or not in different villages, and we added them (places) to the network. In this case, we obtained the 40 classes for identifying the villages in two cities for three years. For the examined seven different phenological stages with the different places, we added them (stages) to the end of the network. The total number of classes was 105 where we could identify each stage of the different villages in each year. The accuracy is the right classification ratio.  Figure 10. We examined the classification of mid-season or not in different villages, and we added them (places) to the network. In this case, we obtained the 40 classes for identifying the villages in two cities for three years. For the examined seven different phenological stages with the different places, we added them (stages) to the end of the network. The total number of classes was 105 where we could identify each stage of the different villages in each year. The accuracy is the right classification ratio.   While there are large counts of samples in the training set, we obtained the accuracy in the validation set and test set. Table 6 shows the results for classification. On AlexNet, the accuracy of the validation set reached 91.00% in the phenological stages test and 98.01% in the middle rice and non-middle rice classification of different villages. The training session took approximately 13 h.
On VGG16 [37], the accuracy of the validation set reached 95.70% in the phenological stages test and 98.38% in the middle rice and non-middle rice classification of different villages, while the accuracy of the test set only reached 65.07% in the phenological stages test and 86.62% in the middle rice and non-middle rice classification of different villages. The training session took approximately 37 h.
On ResNet-50 [38], the accuracy of the validation set reached 97.50% in the phenological stages test and 98.80% in the middle rice and non-middle rice classification of different villages, while the highest accuracy of the test set was only 75.31% in the phenological stages test and 91.64% in the middle rice and non-middle rice classification of different villages. The training session took approximately 43 h. The results show that the UAV image dataset of middle-season rice supported the deep learning classification task of middle-season rice, which lays a foundation for the further addition of the regional classification experiments. The fine classification of images for spatial and temporal information can improve the efficiency for the agricultural investigation team. We needed six members for one village to do the manner classification and two days to finish the works of one phenological stage. By applying the deep learning method to our dataset, two members could finish the same work in two days. The high-resolution image dataset was able to improve the accuracy for agricultural investigation.
After training the models with our dataset, we tested if the reference image dataset could achieve the result by fine classification. We applied the UCMerced Land Use Datasets and RSSCN7 image dataset for testing.
As for the UCMerced Land Use Datasets dataset, we found that 100 images contained agriculture and 92 images were related to rice. We relabeled these images to fit the classification. For the RSSCN7 image dataset, there were about 400 images related to agriculture, and we also relabeled 295 images to fit our test. For the UCMerced Land Use Datasets dataset, we tested for 50 epochs, and we tested the RSSCN7 image dataset for 100 epochs. The accuracy was the average of the cross-entropy loss between the prediction and the labeled information. Table 7 shows the result of the fine classification of different villages of different cities. The two public image datasets had few images for agriculture, and scarce images for middle-season rice. Thus, the models had very low accuracy. To verify that our image dataset could support the public image classification as land use, we performed the second test. We used the published models and the parameters of classification tasks on three image datasets and compared the results. In this test, AlexNet, VGG, ResNet, and DenseNet were applied for land use classification. The result is shown in Table 8.
For land-use classification, the AlexNet on UCMerced Land Use Data Set obtained 93.1% [40] and RSSCN7 obtained 92.32%, while our image dataset obtained 95.95%. VGG16 achieved 93.98% on the UCMerced Land Use Data Set and 94.62% on the RSSCN 7 dataset [41]. Resnet [40] and DenseNet [42] obtained 91.92% and 97.70% on the UCMerced Land Use Dataset and 96.07% and 97.46% on the RSSCN 7 image dataset, respectively. Our image dataset obtained a higher accuracy classification for all networks.

Discussion
The purpose of the paper was to build an image dataset for middle-season rice based on high-resolution UAV images for accurate subclassification of the correct place and correct time by machine learning methods.
The data collected in this paper were based on agricultural investigation, and the operation process will generate considerable business data. To avoid specific business impacts, the generation of the dataset only applies the field block vector data, the place, and the time of collection. Due to business needs, data collection can be conducted on the same village for many years to form periodic image comparisons, which also reduces the amount of manual data processing. After completing the vector data production for field blocks in the first year, the same vector data will be used in subsequent years.
In practice, the vector data of the field block in the village changes little except for the adjustment of farmers in some areas. Due to the need to consider the data accuracy and data precision, this paper does not consider the method of automatic segmentation of land. After the calibration of the image, the vector data of the field block in the survey area can well fit the image. In the areas with large differences, the original image data can be checked or confirmed through field investigation and manually modified to ensure the proper amount of sampling area and the image nesting relationship. Finally, the generation of the standard quadrant can be determined.
The ground resolution of the image was a relatively high resolution of 0.2 m. In this paper, the UAV data collection was conducted for three consecutive years in Hubei province. The collected areas are mostly plains, and the area has an elevation between 20 and 50 m. Therefore, this was not sufficient to represent the middle and lower reaches of the Yangtze River with the large difference in elevation. The plants of middle-season rice may have a different outlook in mountains and hilly landforms. New places should be taken into consideration for further data acquisition.
It is difficult to obtain the time-series data in different locations for rice crops. The growing areas are distributed, and professional information about the growing stages is required. Although we scheduled careful data acquisition tasks and the images are of high quality, the machine learning algorithms still classified two different stages of images as one class. The outlook of the middle-season rice of different growing stages in different places always has little difference. In our implementation, we chose places that had an obvious outlook. The texture pattern of middle-season rice in the earlier phenological stage was uniform for rice shoots scattered in the paddy field. Another difference is that our dataset will add the class when places are extended, and the data continue to be acquired for additional years.
In this paper, we adopted consumer products, employing an unmanned aerial vehicle (UAV) system and a camera from DJI, not professional products with high-level camera modules, and positioning navigation equipment. The multi-resolution construction was only based on the highest resolution using a unified method and resampling generation. Pix4D was used in the image processing in this paper, and the accuracy of the results, the processing method, the degree of automation, and the efficiency of its image processing were not considered for optimization. Based on the processing results of widely accepted commercial software, the data processing time was not evaluated. The workload of data processing was heavy. The UAV image data collected at each sampling point required dozens of hours of image processing time on a single workstation.
Finally, four kinds of neural networks, including AlexNet, VGG, ResNet, and DenseNet, were tested for classification based on this image dataset. Our image dataset obtained a reasonable result of identifying the spatial and temporal information for the images by fine classification. AlexNet [33], VGG16 [34], ResNet [35], and DenseNet [36] achieved more than 90% accuracy. In this situation, our image dataset met the design of the target. We used these models to test with the UCMerced Land Use Dataset [13] and RSSCN7 Dataset [15], and all of them obtained an accuracy of less than 10%. The two image datasets contain few images of middle-season rice, and none of them came from the flight site villages. Therefore, using the public image dataset cannot support this application.
To judge our image set in land use or other public classification tasks, we compared the public two image datasets. The results of AlexNet [40], VGG16 [37] ResNet [38], and DenseNet [42] on the UCMerced Land Use Dataset were lower than our image dataset, due to the image resolution and the samples numbers. The same situation was found for the networks on the RSSCN 7 dataset.
This article does not focus on how to further improve the classification and the innovation of new networks. We only confirmed in this paper that the image dataset provided accurately reflected the situation of middle-season rice in the selected region, particularly at the time and under the specific climatic conditions.

Conclusions
This paper presents a high-resolution UAV image dataset of middle-season rice suitable for the middle and lower reaches of the Yangtze River. Through the acquisition of UAV images of five villages of two cities by the Yangtze River in Hubei in the past three years with the same cycle and the manual crop interpretation confirmation of an agricultural survey business, the vector data of regional plots were obtained, and the image dataset of the middle rice growth process was constructed. The dataset was divided into 15 categories according to the place and year.
Each category was subdivided into 11 or 13 phases according to the crop phenological stages, and more than 500,000 samples of size 256 × 256 were obtained in total. After the construction of the dataset, we used four common networks to conduct the binary classification of whether the data represented middle-season rice or not and performed the fine classification for the phenological stages and the place. The results show that this dataset achieved good results in many networks, satisfying the common classification application.
The monitoring of growth for rice crops is an interesting and difficult job. The deep learning method provides a possibility for performing this job. Our dataset supports an important foundation for the reorganization of middle-season rice with the timing of the phenological stages and locations. Combined with agricultural investigation information of the average product for a particular location, this allows us to predict the yield of middle-season rice. For each phenological stage, the agricultural investigator can find abnormal growth and send messages to the farmer.
Due to the limitations of the sample locations, the current datasets well fit the Hubei plains. However, for more suitable implementations for the Yangtze River, we should increase other sampling locations of the middle-season rice provinces, in particular the hilly and mountainous areas. Further, we will apply the new UAV system equipped with the multispectral sensors to support complex applications and improve the intelligent and automatic level of the agricultural production process.