Semantic Segmentation of Cabbage in the South Korea Highlands with Images by Unmanned Aerial Vehicles

Abstract: Identifying agricultural fields that grow cabbage in the highlands of South Korea is critical for accurate crop yield estimation. Grown for only a limited time during the summer, highland cabbage accounts for a significant proportion of South Korea's annual cabbage production and therefore has a profound effect on the formation of cabbage prices. Traditionally, labor-intensive and time-consuming field surveys are carried out manually to derive agricultural field maps of the highlands. Recently, high-resolution overhead images of the highlands have become readily available with the rapid development of unmanned aerial vehicles (UAVs) and remote sensing technology. In addition, deep learning-based semantic segmentation models have advanced quickly with recent improvements in algorithms and computational resources. In this study, we propose a semantic segmentation framework based on state-of-the-art deep learning techniques to automate the process of identifying cabbage cultivation fields. We operated UAVs and collected 2010 multispectral images under different spatiotemporal conditions to measure how well semantic segmentation models generalize. Next, we manually labeled these images at the pixel level to obtain ground-truth labels for training. Our results demonstrate that our framework detects cabbage fields well, not only in areas included in the training data but also in unseen areas. Moreover, we analyzed the effects of infrared wavelengths on the performance of identifying cabbage fields. Based on the results of our framework, we expect agricultural officials to reduce the time and manpower needed to identify highland cabbage fields by replacing field surveys.


Introduction
Monitoring the distribution of, and changes in, a region of interest (RoI) is a fundamental task in land-cover classification and is part of many applications such as urban management [1], land-use management [2], and crop classification [3]. Land-cover classification generates information on the status of land use. In the South Korean highlands in particular, it has been used to classify cabbage as well as potatoes [4]. The reason for identifying cabbage in the highlands is that these regions, despite their short growing seasons, are the major cultivation areas for highland cabbage. Therefore, it is important to develop land-cover classification methods to investigate the agricultural highlands of South Korea.
Traditionally, manual field surveys have been used to identify agricultural lands and derive land maps [4]. However, this method requires a significant amount of time and manpower. To replace manual field surveys, many studies have used remote sensing (RS) imagery from satellites and aircraft [5]. However, this approach is disadvantageous in that the images are mostly low-resolution and can be degraded by weather conditions or shadows, which complicates the collection of accurate information [6,7]. These issues have been a particular problem for the highlands of South Korea [4]. Recently, advances in RS technology have made high-resolution photography with unmanned aerial vehicles (UAVs) possible [8], creating a trend away from satellite and aircraft imagery. UAV imagery is less likely to be affected by weather because it is captured at lower altitudes, and its higher spatial and spectral resolution improves accuracy in rendering the RoI [8]. In this study, we operated UAVs in the highlands and collected high-resolution photographs taken under different spatiotemporal conditions to generate information on cabbage cultivation fields.
In general, multispectral sensors are mounted on satellites, aircraft, and UAVs to generate information on an RoI [9]. Multispectral sensors capture not only visible-wavelength reflectances but also infrared wavelengths, which enable the monitoring of vegetation growth [10,11]. Based on this theoretical background for agriculture, we equipped UAVs with a multispectral sensor and collected diverse information on cabbage cultivation fields in the South Korean highlands. In addition, we analyze the effect of infrared wavelengths on the performance of detecting cabbage fields at the pixel level.
Recently, convolutional neural networks (CNNs) have led to advances in computer vision tasks such as image classification [12], object detection [13], and semantic segmentation [14]. In particular, the semantic segmentation task aims to assign a class label to each pixel in an image [14]. Accordingly, CNN-based semantic segmentation algorithms have also been successfully applied to the analysis of UAV imagery. For instance, semantic segmentation algorithms were applied to UAV imagery to ascertain the ratio of roads in cities and to assess pavement crack segmentation on an airport runway [15,16].
In this study, we propose a semantic segmentation framework based on UAV imagery to automate the process of identifying cabbage fields in the South Korean highlands. First, UAV multispectral images of cabbage fields collected under various conditions were annotated with the help of agricultural experts. Second, we trained several semantic segmentation models and compared their performances. Moreover, we conducted extensive studies to analyze the generalization performance of our models. Finally, we analyzed the effects of infrared wavelengths on identifying cabbage fields at the pixel level. The main contributions of this study can be summarized as follows:
• To measure how well our framework generalizes despite differences in highland areas and shooting dates, we operated UAVs equipped with a multispectral sensor and collected multispectral images under different spatiotemporal conditions.
• Our proposed framework shows exceptional detection performance on test images collected from areas included in the training data but on different dates. Moreover, our method generalizes well to unseen areas not used during training.
• To analyze which wavelengths in multispectral images have a positive effect on detection performance, we experimented with four different combinations of input wavelengths and compared their detection performances. Based on the results, we demonstrate that the semantic segmentation model trained with blue, green, red, and red edge wavelengths is the most suitable for automating the identification of cabbage cultivation fields.
The remainder of this paper is organized as follows. In Section 2, we review the papers associated with semantic segmentation models based on CNN, land-cover classification, and applications of CNN to UAV imagery. In Section 3, we describe the proposed framework used to detect cabbage cultivation. In Section 4, we give a thorough analysis of the experimental results of different semantic segmentation models. In Section 5, we discuss our results in terms of semantic segmentation models and input wavelengths. In Section 6, we summarize our study with conclusions and future research directions.

Semantic Segmentation Models
Semantic segmentation is the task of assigning each pixel in an image to a class label [14]. Various CNN-based approaches have been proposed to tackle this task. First, fully convolutional networks (FCNs) [17] demonstrated performance improvements on the PASCAL Visual Object Classes dataset [18], a popular image segmentation benchmark. Models after FCNs have been based on encoder-decoder architectures, and more recently such CNN encoder-decoder architectures [19,20] have been studied in depth. The encoder module maps raw image pixels to a rich representation, a collection of feature maps with smaller spatial dimensions. The decoder module makes pixel-wise predictions by taking the representation emitted by the encoder and mapping it back to the original input image size. Meanwhile, spatial pyramid pooling (SPP) and atrous convolution have been developed to extract multi-scale information from input images [21,22]. In PSPNet [21], the original feature maps are pooled by max-pooling operations at different scales, producing outputs formed from different regions of the original maps; with SPP, each output is then upsampled to match the size of the original feature maps. Atrous convolution, or dilated convolution [22], is a convolution operation with different dilation factors that expands the receptive fields of CNN representations without losing resolution. Extracting multi-scale information, whether via SPP or atrous convolution, is helpful in that the resulting representations comprise feature maps of different scales, which prevents contextual information from vanishing. In this study, we experiment with models based on an encoder-decoder architecture capable of extracting multi-scale information.
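The resolution-preserving property of atrous convolution described above can be illustrated with a minimal PyTorch sketch (not code from the paper; the channel counts and input size are arbitrary): two 3 × 3 convolutions with the same number of parameters, where the dilated one covers a 5 × 5 receptive field while producing an output of the same spatial size.

```python
import torch
import torch.nn as nn

# Two 3x3 convolutions with identical parameter counts. Setting the padding
# equal to the dilation rate keeps the spatial resolution unchanged, so the
# dilated kernel widens the receptive field (3x3 -> effective 5x5) for free.
x = torch.randn(1, 8, 64, 64)

standard = nn.Conv2d(8, 8, kernel_size=3, padding=1, dilation=1)
atrous = nn.Conv2d(8, 8, kernel_size=3, padding=2, dilation=2)

print(standard(x).shape)  # torch.Size([1, 8, 64, 64])
print(atrous(x).shape)    # torch.Size([1, 8, 64, 64]) -- same size, wider context
```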

Land Cover Classification
Land cover classification is an important task for the RS community, which aims to observe the Earth and monitor its changes [5]. Early works mostly relied on low-resolution RS images to generate information about the Earth [23]. To extract information, each pixel in an RS image was classified based on a patch around it [24]. However, this method is computationally inefficient. Another disadvantage was that boundary information in RS images was fuzzy, leading to incorrect information [25]. With low-resolution RS images, it is difficult to extract accurate information about a target study area. Recently, advances in RS technology have made high-resolution images available for land cover classification tasks such as disease detection [26], yield estimation [27], and weed detection [28]. With high-resolution RS images, semantic segmentation methods were applied to classify whether each pixel in a vineyard is healthy or diseased [25]. Likewise, various kinds of information can be extracted with semantic segmentation methods at the pixel level, depending on what researchers aim to classify. In this study, we take a first approach to classifying cabbage fields in the highlands.

Applications of Semantic Segmentation for Agriculture
Overall, several studies have successfully applied semantic segmentation models to the agricultural data domain. One study applied a feature pyramid network to classify seven land types such as urban land, agricultural land, and water [29]. Other works include, but are not limited to, applying semantic segmentation models to detect weed plants or to identify cranberry fields [30,31]. In our study, we propose a framework based on state-of-the-art semantic segmentation models, including U-Net [19], SegNet [20], and DeepLab V3+ [32], to detect cabbage fields at the pixel level in the South Korean highlands. We conducted various experiments to compare model performance in identifying cabbage pixels under spatiotemporal conditions unseen in the training dataset.

Target Crop and Regions
The purpose of this study is to automate the process of identifying cabbage fields at the pixel level in the South Korean highlands. As shown in Figure 1, the vegetables on the left and right differ. The left vegetable is napa cabbage, or kimchi cabbage, which is mainly used to make the traditional Korean food kimchi. The right vegetable is the common Western cabbage. Our work focuses on detecting cabbages like those in the left figure. We propose a semantic segmentation framework to classify cabbage-growing regions that is invariant to spatiotemporal changes. In South Korea, high summer temperatures, humidity, and the proliferation of insects from June to August restrict the successful cultivation of cabbage everywhere except three highland areas [4]. We used UAV images taken in Maebongsan on 24 July 2019 and 5 August 2019, in Gwinemi on 5 August 2019, and in Anbanduck on 6 August 2019. We attached a multispectral sensor, RedEdge-MX (MicaSense, Inc., Seattle, WA, USA), to a rotary-wing UAV (DJI M2010) to collect images from the study areas. The sensor collects image data within specific wavelength ranges: blue (B, 475 ± 20 nm), green (G, 560 ± 20 nm), red (R, 668 ± 10 nm), red edge (RE, 717 ± 10 nm), and near-infrared (NIR, 840 ± 40 nm). Each image is 1280 (width) × 960 (height) pixels with a ground sampling distance (GSD) of approximately 12 cm.

Image Preprocessing
Misalignment between the five lenses of the RedEdge-MX requires an additional data processing step prior to model training. A visual example of band misalignment is shown in Figure 3: simply stacking the five bands along the channel dimension results in counterfactual images such as Figure 3f. To cope with the misalignment and create RGB images visible to the human eye, we applied the enhanced correlation coefficient (ECC) transformation [34,35]. Next, with the help of agricultural experts, we visually examined the RGB images and generated pixel-wise ground-truth labels of the cabbage fields, as shown in Figure 3h. We used four different combinations of input wavelengths to train the semantic segmentation models: RGB, RGB with RE, RGB with NIR, and RGB with both RE and NIR. We give comparisons with respect to the input types in Sections 4.3 and 5.2.

Semantic Segmentation Models
We use three different semantic segmentation models that have demonstrated strong performance across various image domains: U-Net [19], SegNet [20], and DeepLab V3+ [32]. U-Net was originally developed to distinguish neuronal structures in electron microscopy stacks. The model architecture consists of two modules: a contracting path (encoder) and an expanding path (decoder). In the contracting path, context information is extracted by a sequence of neural network blocks. Each block consists of two convolution layers, followed by a batch normalization layer, a rectified linear unit (ReLU), and a 2 × 2 max-pooling layer; five such blocks constitute the contracting path. In the expanding path, the rich context information from the contracting path is restored to the original input size. The expanding path consists of four blocks, each comprising a transposed convolution layer followed by a convolution layer, a batch normalization layer, and a ReLU. At every transposed convolution, feature maps from the corresponding block in the contracting path are cropped and copied over to alleviate the loss of border information incurred by the convolution and max-pooling layers. Combining the cropped feature maps of the contracting path with the feature maps of the expanding path enables the aggregation of multi-scale features. Another method to prevent the loss of border information is the mirroring strategy, which extrapolates the border regions of an input image as shown in Figure 4; this technique allows the border regions of the input to be predicted accurately. SegNet also has an encoder-decoder architecture and was developed for pixel-wise recognition of objects on roads in autonomous driving systems [20]. The encoder of SegNet is identical to the feature extractor of VGG-16 [36]. Moreover, the model records the max-pooling indices of the encoder and reuses them in the upsampling layers of the decoder.
Using max-pooling indices shortens the inference time of SegNet. We expected the use of max-pooling indices in SegNet to make the prediction boundaries of cabbage fields sharper.
Finally, DeepLab V3+ combines the advantages of encoder-decoder structures with a multi-scale feature extraction module [32]. The encoder-decoder structure generates pixel-wise predictions at the original input size and enables accurate detection of object boundaries by combining information from the encoder during decoder upsampling [32]. The multi-scale feature extraction module, called atrous spatial pyramid pooling (ASPP), consists of several atrous convolutions with different dilation rates. The ASPP module enlarges the receptive fields of the representation vectors and enables DeepLab V3+ to recognize objects of different sizes [32,37]. We assume that DeepLab V3+, which differs from U-Net and SegNet in its multi-scale feature extraction with atrous convolution, will perform more invariantly under conditions that differ from the training dataset. Figure 5 depicts the DeepLab V3+ structure used in our study, which comprises three modules: an encoder, an ASPP module, and a decoder. The encoder is an Xception model [38], which extracts representation vectors from the RGB and multispectral images. Next, the ASPP module performs atrous depthwise separable convolutions, operating on each channel separately, followed by a 1 × 1 convolution [39]; this reduces computational complexity while maintaining predictive performance. Finally, the decoder stacks the feature maps obtained from the second block of the encoder with those obtained from the ASPP module, performs a 1 × 1 convolution on the stacked feature maps, and upsamples the result to the input resolution for the final pixel-wise prediction.
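The parallel-atrous-branches idea behind ASPP can be sketched in a few lines of PyTorch. This is a simplified illustration rather than the exact DeepLab V3+ module: the dilation rates, channel widths, and the absence of image-level pooling and separable convolutions are our simplifications.

```python
import torch
import torch.nn as nn

# Simplified ASPP sketch: parallel 3x3 atrous convolutions with different
# dilation rates, concatenated along the channel axis and fused by a 1x1
# convolution. Padding equal to the rate keeps all branch outputs aligned.
class MiniASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return self.project(feats)

aspp = MiniASPP(in_ch=16, out_ch=8)
out = aspp(torch.randn(2, 16, 32, 32))
print(out.shape)  # torch.Size([2, 8, 32, 32])
```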
Our loss function for training the models, $\mathcal{L}_{cls}$, is a weighted categorical cross-entropy, formulated as follows:

$$\mathcal{L}_{cls} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c} \alpha_c \, y_{i,c} \log \hat{y}_{i,c},$$

where $c$ is the index of a ground-truth class, $\alpha_c$ is a weighting factor calculated from the ground-truth labels in our training dataset, $n$ is the number of pixels in the images of a mini-batch, $y_{i,c}$ is the target class value, and $\hat{y}_{i,c}$ is the predicted score that the $i$th pixel belongs to class $c$. We give comparisons with respect to the semantic segmentation models in Sections 4.3 and 5.1.
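In PyTorch, this class-weighted cross-entropy over pixels corresponds to `F.cross_entropy` with a `weight` argument. A minimal sketch follows; the alpha values below are placeholders, whereas the paper computes them from the class frequencies of the training labels.

```python
import torch
import torch.nn.functional as F

# Class-weighted pixel-wise cross-entropy for a binary (background/cabbage)
# segmentation task. Shapes follow PyTorch's convention for dense prediction.
logits = torch.randn(4, 2, 8, 8)          # (batch, classes, height, width)
target = torch.randint(0, 2, (4, 8, 8))   # per-pixel ground-truth class index
alpha = torch.tensor([0.3, 0.7])          # placeholder: up-weight the rarer class

loss = F.cross_entropy(logits, target, weight=alpha)  # scalar mini-batch loss
```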

Dataset
We took 1240 UAV images of Maebongsan on 24 July 2019, and used 1032 for training and 208 for model validation. To improve the diversity of the training and validation data, five data augmentation methods, namely horizontal flips, vertical flips, and rotations of 90, 180, and 270 degrees, were applied to both sets, yielding a final training set of 6192 images and a validation set of 1248 images. Through these augmentations, we aim to train semantic segmentation models that are robust to cabbage fields of different scales. Next, we composed three different testing sets to measure the generalization performance of the proposed method. The first testing set, denoted the MBS dataset, consists of 471 images taken of the Maebongsan area on 5 August 2019. Evaluation on the MBS dataset measures how well our models generalize to data collected under different temporal conditions. The second testing set, denoted the GNM dataset, consists of 156 images taken of the Gwinemi area on 5 August 2019. Lastly, 143 images of the Anbanduck area taken on 6 August 2019 comprise the third testing set, denoted the ABD dataset. Evaluations on the GNM and ABD datasets measure how well our models perform on data collected under different spatiotemporal conditions.
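The five augmentations, together with the original image, expand each set six-fold (1032 → 6192 training images, 208 → 1248 validation images). A minimal NumPy sketch, assuming images are stored as (height, width, channels) arrays:

```python
import numpy as np

# Original plus the five augmentations used in the paper: horizontal flip,
# vertical flip, and rotations of 90, 180, and 270 degrees.
def augment(image):
    return [
        image,
        np.flip(image, axis=1),  # horizontal flip
        np.flip(image, axis=0),  # vertical flip
        np.rot90(image, k=1),    # 90-degree rotation
        np.rot90(image, k=2),    # 180-degree rotation
        np.rot90(image, k=3),    # 270-degree rotation
    ]

versions = augment(np.zeros((960, 1280, 5)))
print(len(versions) * 1032)  # 6192 training images after augmentation
```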

Hyperparameters
We used three different semantic segmentation models in our experiments: U-Net, SegNet, and DeepLab V3+. All models were initialized with Xavier initialization [40]. We trained U-Net with the AdamW optimizer [41], using a learning rate of 0.01, for 100 epochs with a batch size of 12; the input height and width were fixed to 572. SegNet was trained with stochastic gradient descent for 100 epochs with a batch size of 16, using a learning rate of 0.01 and a momentum factor of 0.9; the input height and width were set to 224. Finally, we trained DeepLab V3+ with the AdamW optimizer, using a learning rate of 0.001 and a weight decay factor of 0.01, for 70 epochs with a batch size of 8; the input height and width were fixed to 513. All experiments were implemented with PyTorch 1.4.0 and conducted on a single NVIDIA TITAN RTX GPU.
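The three optimizer configurations above can be expressed directly as PyTorch optimizer constructors. This is a sketch only; `backbone` is a placeholder module standing in for each network, and any hyperparameter not stated in the text is left at its PyTorch default.

```python
import torch

# Placeholder module; in practice this would be the U-Net, SegNet, or
# DeepLab V3+ network respectively.
backbone = torch.nn.Conv2d(3, 2, kernel_size=1)

unet_opt = torch.optim.AdamW(backbone.parameters(), lr=0.01)
segnet_opt = torch.optim.SGD(backbone.parameters(), lr=0.01, momentum=0.9)
deeplab_opt = torch.optim.AdamW(backbone.parameters(), lr=0.001, weight_decay=0.01)
```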

Model Performance
We used the mean intersection over union (MIoU) metric [17] to quantify the performance of the semantic segmentation models used in this study. For a dataset of N images, MIoU is defined as

$$\mathrm{MIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{n^{(i)}_{11}}{n^{(i)}_{1\cdot} + n^{(i)}_{\cdot 1} - n^{(i)}_{11}},$$

where $n^{(i)}_{11}$ is the number of correctly classified cabbage pixels in the $i$th image, $n^{(i)}_{1\cdot}$ is the number of ground-truth pixels labeled as cabbage, and $n^{(i)}_{\cdot 1}$ is the number of pixels predicted as cabbage. The best hyperparameters were chosen based on the validation MIoU. We checked the MIoU score after every training epoch and saved the best model checkpoint for inference on the MBS, GNM, and ABD datasets. We compare the training and validation learning curves of our experiments in Figure 6. The learning curves of DeepLab V3+, shown in Figure 6a,d,g,j, exhibited the fastest convergence among the models, and the fluctuation ranges of the DeepLab V3+ validation MIoU were smaller than those of the other models. In addition, as can be seen in Figure 6f,i,l, SegNet had wide fluctuation ranges and was unstable after epoch 80.
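The MIoU computation above can be sketched in NumPy for the binary cabbage/background case; the function names are our illustrative choices, not from the paper.

```python
import numpy as np

# Per-image IoU for the cabbage class (label 1), following the definition:
# n11 / (n1. + n.1 - n11). MIoU averages this quantity over N images.
def cabbage_iou(pred, truth):
    n11 = np.logical_and(pred == 1, truth == 1).sum()  # correctly classified
    n1 = (truth == 1).sum()                            # ground-truth cabbage pixels
    m1 = (pred == 1).sum()                             # predicted cabbage pixels
    return n11 / (n1 + m1 - n11)

def miou(preds, truths):
    return float(np.mean([cabbage_iou(p, t) for p, t in zip(preds, truths)]))

truth = np.array([[1, 1], [0, 0]])
pred = np.array([[1, 0], [1, 0]])
print(cabbage_iou(pred, truth))  # 1 / (2 + 2 - 1) = 1/3
```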
As shown from the first to the fourth row of Figure 6, the more input wavelengths were used, the wider the fluctuation ranges became; we discuss the reasons in Section 5.2. As shown in Figure 6j-l, models trained with the NIR wavelength appear to suffer from overfitting. To demonstrate the applicability of our semantic segmentation framework, we measured how well it generalizes to images in the three test datasets collected under different spatiotemporal conditions. Each combination of input data and model was repeated ten times with different random seeds, and we report the averages and standard deviations of the MIoU metric. As shown in Table 1, DeepLab V3+ with RGB outperformed both U-Net and SegNet on the MBS dataset. Moreover, we observed better performance for all three models when using RGB images rather than the other input combinations. Although U-Net with RGB and RE achieved the best validation MIoU, DeepLab V3+ showed greater generalizability based on its MBS dataset MIoU. To verify whether DeepLab V3+ generalizes across different times and regions, we evaluated its predictive performance on the GNM dataset collected from Gwinemi. As shown in Table 2, DeepLab V3+ with RGB and RE performed best. We found that the RE wavelength has a positive effect on detection performance for images collected under different spatiotemporal conditions. In addition, the standard deviation of DeepLab V3+ with RGB and RE was the smallest. Further, Figure 8 provides visualizations of predictions on the GNM dataset. The example in the first row includes greenhouses; DeepLab V3+ with RGB and RE wavelengths succeeds in distinguishing the greenhouses from cabbage fields. The second example of Figure 8 demonstrates that DeepLab V3+ is also capable of distinguishing cabbage fields from weeds (dark green).
Last but not least, the third row's example shows that the land left fallow in the lower-left part of the image is correctly classified as not growing cabbage. The ABD dataset was composed of images photographed in Anbanduck. As can be seen from Table 2, DeepLab V3+ with RGB and RE performed best, as on the GNM dataset. We also observed that adding RE to the input improves detection performance. Figure 9 shows that our model successfully distinguishes fields of weeds, forest regions, and wind turbines from cabbage fields.

Comparisons of Models
We made several assumptions about the semantic segmentation models in Section 3.3. First, we assumed that DeepLab V3+ would be the most invariant to different spatiotemporal conditions; this was demonstrated in Tables 1 and 2. The ASPP module in DeepLab V3+ enabled the extraction of multi-scale information on cabbage fields by applying atrous convolutions with different dilation rates, and the enlarged receptive fields of the representation vectors had a positive influence on detection performance [42]. As can be seen from Figures 7-9, DeepLab V3+ accurately predicted cabbage fields of different sizes. Next, the assumption about U-Net's mirroring strategy did not hold: as shown in the U-Net visualizations of Figure 7, pixels in the border regions were still misclassified. Finally, SegNet's max-pooling indices, passed from the encoder, did not sharpen the prediction boundaries in Figures 7-9. In summary, DeepLab V3+ demonstrated superior performance compared to U-Net and SegNet. In particular, we claim that the multi-scale feature aggregation induced by the ASPP module was critical for identifying cabbage fields in the complex environment of the South Korean highlands under different spatiotemporal conditions. This has also been emphasized in several previous studies on extracting multi-scale information from UAV images for agriculture [42,43].

Difference in Input Wavelengths
Based on the detection performance comparisons in Section 4.3, we conducted a histogram-based analysis to identify why models trained with RGB and RE inputs outperformed other wavelength combinations under different spatiotemporal conditions. To visualize the reflectance distributions of each wavelength in Figure 10, we randomly selected 100 multispectral images from each of the train, MBS, GNM, and ABD datasets. Two-sample t-tests were then performed between each wavelength distribution in the training dataset and in the three testing datasets. The null hypothesis states that the mean reflectances of each wavelength in the training and testing datasets are equal. As shown in Table 3, the p-values were effectively zero, so we could reject the null hypothesis, demonstrating that a discrepancy exists between the wavelength distributions of the training and test sets. From these results, we claim that such differences degraded the performance of models trained with RGB, RE, and NIR naively stacked along the channel dimension [44]. Furthermore, we believe that these differences also caused the multispectral models to overfit the training data. A recent study pointed out that naively stacking RGB with NIR/RE bands can result in redundant information and mutual interference when training CNNs [45]. As shown in Table 2, we observed a similar phenomenon: models with RGB and RE outperformed models with the other combinations of wavelengths, while models with RGB, RE, and NIR showed the worst detection performance. In addition, the performance improvement of models with RGB and RE was larger than that of models with RGB and NIR. Based on these results, we claim that the RE wavelength has a more positive effect than the NIR wavelength on identifying cabbage fields under spatial and temporal differences.
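The per-band two-sample t-test described above can be sketched with SciPy. The reflectance samples below are synthetic stand-ins for the pixel values drawn from the training and testing images; the means, spreads, and sample size are our assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic per-band reflectance samples: a small but systematic shift in
# mean reflectance between the training and testing distributions.
rng = np.random.default_rng(0)
train_band = rng.normal(loc=0.30, scale=0.05, size=10_000)
test_band = rng.normal(loc=0.32, scale=0.05, size=10_000)

# Welch's t-test (unequal variances). A p-value near zero rejects the null
# hypothesis that the mean reflectances of the two datasets are equal.
t_stat, p_value = stats.ttest_ind(train_band, test_band, equal_var=False)
```

With large pixel samples, even a modest distribution shift yields a vanishing p-value, which matches the near-zero p-values reported in Table 3.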
We believe that our proposed method can be improved by incorporating several encoders to learn separate low-level representations for each wavelength including RE and NIR.

Conclusions
In this study, we have proposed a semantic segmentation framework based on UAV images to automate the process of identifying cabbage fields in the highlands of South Korea. First, we collected high-resolution multispectral images by operating UAVs over different highland areas in South Korea. Second, we applied the ECC transformation to handle misalignment between channels and generated pixel-wise ground-truth labels. We compared the performance of three semantic segmentation models and four combinations of input wavelengths in detecting cabbage cultivation fields and concluded that DeepLab V3+ trained on RGB and RE wavelengths performed best. We demonstrated that the model was effective in distinguishing cabbage fields from fields of weeds and buildings despite changes in operation dates and regions. Based on the results of our proposed framework, we expect agricultural officials to save time and manpower when collecting information about cabbage cultivation fields in the South Korean highlands by replacing manual field surveys. In future studies, we plan to apply semantic segmentation models to detect multiple crops, such as cabbages, peppers, and beans. We also expect to make better use of infrared wavelengths, such as RE and NIR, by incorporating several CNN-based encoders that learn low-level representations from each wavelength to enhance model performance.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study are not publicly available due to privacy and legal restrictions.