Uncertainty Analysis of Object-Based Land-Cover Classification Using Sentinel-2 Time-Series Data

Recently, time-series from optical satellite data have been frequently used in object-based land-cover classification. This poses a significant challenge to object-based image analysis (OBIA) owing to the presence of complex spatio-temporal information in the time-series data. This study evaluates object-based land-cover classification in the northern suburbs of Munich using time-series from optical Sentinel data. Using a random forest classifier as the backbone, experiments were designed to analyze the impact of the segmentation scale, features (including spectral and temporal features), categories, frequency, and acquisition timing of optical satellite images. Based on our analyses, the following findings are reported: (1) Optical Sentinel images acquired over four seasons can make a significant contribution to the classification of agricultural areas, even though this contribution varies between spectral bands for the same period. (2) The use of time-series data alleviates the issue of identifying the “optimal” segmentation scale. The findings of this study can provide a more comprehensive understanding of the effects of classification uncertainty on object-based dense multi-temporal image classification.


Introduction
There has been a progressive increase in the availability of open-source remote-sensing data (e.g., Landsat and Sentinel imagery). This allows the application of satellite image time-series (SITS) data in remote sensing-based land-cover classification [1][2][3][4][5][6]. Two common paradigms are used to exploit time-series information according to different input data types. For the first paradigm, the spectral features of multi-temporal images or features derived from them are used as inputs to a conventional supervised classification procedure [7][8][9][10][11]; these conventional classification procedures include support vector machines (SVM) [8] and random forests (RF) [9]. For the second paradigm, semantic features based on phenological information are directly utilized for classification [12][13][14]; a common method used for this purpose is dynamic time warping (DTW) [13,14].
Belgiu et al. [12] compared the performance of both time-series classification paradigms using the DTW method and an RF classifier. They found that the DTW framework, representative of the second paradigm as it only uses enhanced normalized difference vegetation index (NDVI) time-series, is not superior to the RF framework, which is representative of the first paradigm as it uses all of the features of individual spectral bands. Similarly, Pelletier et al. [15] recently claimed that RF may be the best method for remote sensing time-series image classification, i.e., better than the

Study Area and Dataset
This study used the suburbs to the north of Munich, Germany, as the study area (Figure 1). The first experimental site (Study Area 1) is far from the urban area of Munich, covering an area of approximately 53,731 ha, which mainly includes (coniferous) forests, grasslands, maize fields, cereal fields, and artificial land. Thus, this area is sufficiently representative of agricultural areas. The second site (Study Area 2) is located closer to the actual urban extent of Munich and covers an area of approximately 21,726 ha. The primary land-cover types are (mixed and broad-leaved) forests, water bodies, maize fields, cereal fields, and artificial land. Study Area 2 can be used to examine the mapping of suburban areas.
The optical Sentinel images (Level-2A) were downloaded from the Copernicus Open Access Hub (https://scihub.copernicus.eu/dhus/#/home). We selected temporal images with <20% cloud coverage according to the metadata, acquired between January and December 2018, yielding a total of 39 images. Subsequently, Study Areas 1 and 2 were extracted by clipping. Cloud-free images for these areas were selected for subsequent classification analysis to explore how the time-series images impact the OBIA. Thus, stacks of 20 and 22 images were obtained for Study Areas 1 and 2, respectively (see Table A1 in Appendix A). Given that images with high spatial resolution are preferred in OBIA [19], only the 10 m resolution bands (R, G, B, and near-infrared (NIR)) of the optical Sentinel images were employed in this study.

Methods
Remote Sens. 2020, 12, x FOR PEER REVIEW 4 of 18
Figure 2 presents the main steps employed to assess the uncertainty caused by integrating OBIA with SITS data. After preparing the data (e.g., clipping and stacking) as mentioned above, the input data that satisfied the conditions for both areas were generated. Then, a sampling process was conducted to generate the reference layer for labeling the segmented objects. Segmentation based on the multi-temporal images was then performed to delimit the outlines of homogeneous areas for classification. Feature selection, as an optional process, was carried out before RF classification. We note that it is possible to repeat the classification process by randomly separating the labeled objects to obtain enough classification accuracy records and serve various uncertainty analysis evaluations.

Segmentation of Multi-Temporal Images
Multi-resolution segmentation [29] was used to partition the images into homogeneous objects. This step was realized using the eCognition 9.0 commercial software. For segmentation, the red, green, blue, and near-infrared spectral bands of six images from Study Area 1 and seven images from Study Area 2 (Figure 3) were used, because using a large number of images complicates segmentation. Images spanning an entire calendar year (corresponding to the solid triangles in Figure 3) were used for segmentation to account for the characteristics of crop phenology, resulting in stacks of 24 and 28 layers for Study Areas 1 and 2, respectively. Here, the general parameter setting suggestions for multi-resolution segmentation were followed to ensure that the spectral information had the most important role during segmentation [18]; the color/shape parameters were set to 0.9/0.1 and the smoothness/compactness ratio was set to 0.5/0.5. The size of the segmented objects was controlled by the scale parameter (homogeneity threshold). Subsequently, different segmented layers were generated from scale 40 to 150 at increments of 10 to analyze the impact of scale on the accuracy of multi-temporal object-based classification; this has rarely been addressed in previous studies. Then, the segmentation results with feature information for each scale were exported for classification using Visual Studio 2010 and ArcEngine 10.0.
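The layer counts and the scale sweep quoted above follow from simple bookkeeping. The sketch below only reproduces that arithmetic; the function and constant names are illustrative assumptions, and the segmentation itself was performed in eCognition, not in Python.

```python
# Illustrative bookkeeping only; all names here are hypothetical.
BANDS = ("red", "green", "blue", "nir")  # the four 10 m bands used in the study

# Parameter values quoted in the text for multi-resolution segmentation
SEGMENTATION_PARAMS = {"color": 0.9, "shape": 0.1,
                       "smoothness": 0.5, "compactness": 0.5}


def stacked_layers(n_images, bands=BANDS):
    """Number of layers when every band of every image is stacked."""
    return n_images * len(bands)


def scale_sweep(start=40, stop=150, step=10):
    """Scale parameters from start to stop (inclusive) at a fixed increment."""
    return list(range(start, stop + 1, step))


print(stacked_layers(6), stacked_layers(7))  # Study Areas 1 and 2
print(scale_sweep())
```

With six and seven input images, this gives the 24 and 28 layers reported for the two study areas, and the sweep contains 12 scale values.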

Training and Validation of Data Collection
To obtain sample objects for classification, polygon-shaped sampling units were generated and labeled. For this step, visual interpretation keys were used based on expert knowledge and crop phenology information from the European Land Use and Coverage Area Frame Survey (LUCAS) and the CORINE Land-Cover (CLC) data updated in 2018. Subsequently, these reference polygons were obtained manually; Table 1 lists the total sample area of each class for both study sites. Then, the segmented objects at each segmentation scale were labeled according to the 50% overlap rule with these reference polygons [30]. Subsequently, 30% of the labeled objects were selected as training samples using the stratified random sampling strategy [18], whereas all of the labeled objects were used for validation.
When the CLC and LUCAS data were used for interpretation to obtain a reference layer, the LUCAS definitions were adopted wherever the definitions of the CLC classes differed from those of the LUCAS classes. However, as barley (class B13), common wheat (class B11), and oats (class B15) have similar growth cycles and there were limited samples of the barley and oat classes in the LUCAS dataset for the experimental sites, they were all treated as cereals in this study. Table 1 lists the detailed definition principles and also provides the relationship between the classes defined in this study and those of the CLC and LUCAS systems.
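The labeling and sampling steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the data layout are hypothetical, the 50% rule is read here as "strictly more than half of the object area", and the original implementation (Visual Studio with ArcEngine) is not reproduced.

```python
import random
from collections import defaultdict


def label_by_overlap(object_area, overlaps, threshold=0.5):
    """Return the class covering more than `threshold` of the object, else None.

    `overlaps` maps a class name to the area of the object that intersects
    reference polygons of that class (same units as `object_area`).
    """
    for cls, area in overlaps.items():
        if area / object_area > threshold:
            return cls
    return None


def stratified_sample(labels, fraction=0.3, seed=0):
    """Draw `fraction` of the object ids from each class (at least one each)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for obj_id, cls in labels.items():
        by_class[cls].append(obj_id)
    training = []
    for cls, ids in sorted(by_class.items()):
        k = max(1, round(fraction * len(ids)))
        training.extend(rng.sample(ids, k))
    return sorted(training)


# Toy example: 10 maize objects and 20 forest objects
labels = {i: "maize" for i in range(10)}
labels.update({i: "forest" for i in range(10, 30)})
train = stratified_sample(labels)
print(len(train), "training objects out of", len(labels))
```

Sampling within each class keeps the class proportions of the training set close to those of the labeled objects, which is the point of the stratified strategy.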

Classification Using Random Forest
Since its proposal by Breiman [31], the RF classification algorithm has been reported to outperform other supervised algorithms in extracting information from remote-sensing images [32,33]. As its name suggests, the algorithm randomly constructs a forest consisting of many decision trees that are trained independently of one another. After the forest is constructed using training samples, each new sample is passed to all decision trees, which make separate decisions. These decisions are taken as votes and the sample is assigned to the class with the highest number of votes. Based on previous studies [32], the RF model in this study used 479 trees and one randomly selected variable at each split; the 'randomForest' R package was integrated into the Visual Studio platform to implement classification for all of the images from both study areas.
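To make the bootstrap, random-variable, and voting steps concrete, the following is a deliberately simplified sketch: a forest of one-split trees (decision stumps) written in plain Python. It is not the 'randomForest' R package used in the study; the function names and the toy data are assumptions for illustration, and only the tree count (479) and the single candidate variable per split are taken from the text.

```python
import random
from collections import Counter


def train_stump(X, y, feature):
    """Best single-threshold split on one feature (fewest training errors)."""
    best = None
    for t in sorted({row[feature] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[feature] <= t]
        right = [lab for row, lab in zip(X, y) if row[feature] > t]
        l_cls = Counter(left).most_common(1)[0][0]
        r_cls = Counter(right).most_common(1)[0][0] if right else l_cls
        errors = (sum(lab != l_cls for lab in left)
                  + sum(lab != r_cls for lab in right))
        if best is None or errors < best[0]:
            best = (errors, feature, t, l_cls, r_cls)
    return best[1:]  # (feature, threshold, class if <=, class if >)


def train_forest(X, y, n_trees=479, seed=0):
    """Each tree sees a bootstrap sample and one randomly chosen variable."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feature = rng.randrange(n_features)         # one candidate variable
        forest.append(train_stump([X[i] for i in idx],
                                  [y[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Every tree votes; the class with the most votes wins."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]


# Toy objects with two invented features; not data from the study
X = [[0.10, 1.0], [0.20, 1.2], [0.15, 0.9],
     [0.80, 3.0], [0.90, 3.3], [0.85, 2.9]]
y = ["water", "water", "water", "maize", "maize", "maize"]
forest = train_forest(X, y)
print(predict(forest, [0.05, 0.8]), predict(forest, [0.95, 3.5]))
```

A full random forest grows deep trees and re-draws the candidate variables at every split; the stump version keeps only the parts the paragraph describes, namely independent trees and majority voting.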

Filtering Feature Subset and Temporal Characteristics Analysis
To obtain the feature patterns within a season, the frequency with which features were selected was evaluated for different periods. This differs from the approach of evaluating an individual feature using the feature importance index. Therefore, correlation-based feature selection (CFS) [34] was used to calculate the frequency of the selected features. CFS assesses the worth of a set of features using a heuristic evaluation function based on the correlation of features, and was found to be suitable for object-based classification in our previous study [24].
For feature evaluation and classification, the 20 images from Study Area 1 and 22 images from Study Area 2 were used. In this experiment, the inputs of the features for feature evaluation were derived from the red, green, blue, and near-infrared spectral bands of 10-m-resolution Sentinel data, resulting in stacks of 80 and 88 features for Study Areas 1 and 2, respectively. CFS was applied to these features repeatedly, maintaining a constant segmentation scale. This enabled the identification of the most used features in a certain period to determine the feature pattern in multi-temporal object-based classification.
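The heuristic behind CFS can be illustrated with its merit function, which rewards feature subsets that correlate with the class while penalizing redundancy among the features. The sketch below is an assumption-laden illustration: it uses the absolute Pearson correlation against numeric class codes as the correlation measure, which is a simplification and may differ from the measure used in [34], and the toy columns are invented.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def cfs_merit(features, target):
    """Merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    r_cf is the mean feature-class correlation and r_ff the mean
    feature-feature correlation of the k features in the subset.
    """
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = (sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


target = [0, 0, 1, 1]        # numeric class codes (toy example)
informative = [1, 2, 3, 4]   # tracks the class
noise = [1, -1, -1, 1]       # unrelated to the class
print(cfs_merit([informative], target))
print(cfs_merit([informative, noise], target))
```

In this toy case the subset that adds the uninformative column receives a lower merit than the informative column alone, which is why a search guided by this function tends to drop such features.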

Accuracy Evaluation and Statistical Tests
In this study, multi-temporal object-based classification was evaluated in terms of the overall accuracy (OA) and user's accuracy (UA) metrics. These metrics were calculated using the area-based accuracy evaluation framework [35]. The OA was used to analyze the uncertainty of the segmentation scale, the number of images used, and the feature pattern. UA was employed to analyze the class-specific classification uncertainty influenced by the number of images used and the image scale. In addition, Welch's t-test was conducted to compare the results obtained with and without feature selection.
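One simple reading of area-based evaluation is to weight every object by its area instead of counting objects. The sketch below follows that reading; it is an assumption and may differ in detail from the framework in [35]. The function name, the tuple layout, and the toy areas are hypothetical.

```python
from collections import defaultdict


def area_based_accuracy(objects):
    """Compute OA and per-class UA from (area, reference, predicted) tuples."""
    total = correct = 0.0
    mapped = defaultdict(float)          # area assigned to each predicted class
    mapped_correct = defaultdict(float)  # correctly assigned area per class
    for area, ref, pred in objects:
        total += area
        mapped[pred] += area
        if ref == pred:
            correct += area
            mapped_correct[pred] += area
    oa = correct / total
    ua = {cls: mapped_correct[cls] / mapped[cls] for cls in mapped}
    return oa, ua


# Toy objects (area in ha, reference class, predicted class); invented values
objects = [
    (10.0, "maize", "maize"),
    (5.0, "maize", "cereals"),
    (20.0, "forest", "forest"),
    (5.0, "cereals", "cereals"),
]
oa, ua = area_based_accuracy(objects)
print(oa, ua)
```

Here the UA of a class is the share of the area mapped as that class that truly belongs to it, so a large misclassified object lowers the UA more than a small one.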

Influence of Multi-Temporal Images and Segmentation Scale on Overall Accuracy (OA)
First, the relationship between the number of images, the segmentation scale, and the OA was assessed. To this end, the samples from both study areas were classified at various scales while the number of input images was incrementally increased; new images were added consecutively according to their day of year (DOY). The contours in Figure 4a,b show how the OA changes with the segmentation scale and the number of images used for both areas. Figure 4 indicates that the number of images has a much stronger influence on classification accuracy than the segmentation scale, and that the accuracy increases steadily with the number of images used. When the maximum number of images is used (i.e., 20), up to 80 features are utilized. However, the RF classifier can still effectively use the multi-temporal spectral information, and the results do not exhibit a significant Hughes phenomenon [36], i.e., a decline in classification accuracy caused by an excessive number of features. This may be attributed to the fact that images can contribute to the classification performance regardless of their specific acquisition time because of the different growth stages of crops [37]. We note that these results are not consistent with those presented by Stromann et al. [28], who argue that dimensionality reduction should be a key step in land-cover classification using SVM; however, this discrepancy can be attributed to their use of the more sensitive SVM classifier.
To analyze the classification stability with changes in the segmentation scale, the mean value and mean square error of the classification accuracies across the segmentation scales (from 40 to 150, at increments of 10) were calculated for each number of images. Figure 5 shows the results in the form of error bars, which indicate that the classification results at different scales differ more strongly when fewer images are included. In contrast, when more images are used during classification, scale variations have less influence on the classification accuracy (see Figure 5). Hence, we suggest that the use of multi-temporal data significantly alleviates the problem of identifying the "optimal" segmentation scale. This result is important because it means that the selection of scales, which has until now been a particularly difficult task in OBIA [18,19], is less critical in OB-SITS mapping. Furthermore, owing to this finding, the integration of OBIA and time-series analysis becomes more feasible.
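The error-bar summary behind this stability analysis can be sketched as follows. The function name and data layout are hypothetical, the spread is assumed here to be the root-mean-square deviation of the per-scale accuracies from their mean, and the accuracy values in the example are invented placeholders, not results from the study.

```python
import math


def scale_stability(oa_by_images):
    """Summarize OA across segmentation scales for each number of images.

    `oa_by_images` maps a number of images to the list of OA values obtained
    at the different scales. Returns {n_images: (mean, spread)}.
    """
    summary = {}
    for n_images, accuracies in sorted(oa_by_images.items()):
        mean = sum(accuracies) / len(accuracies)
        spread = math.sqrt(
            sum((a - mean) ** 2 for a in accuracies) / len(accuracies))
        summary[n_images] = (mean, spread)
    return summary


# Invented placeholder accuracies for three scales (NOT study results)
placeholder = {2: [0.70, 0.78, 0.74], 20: [0.90, 0.91, 0.92]}
summary = scale_stability(placeholder)
for n_images, (mean, spread) in summary.items():
    print(n_images, round(mean, 3), round(spread, 3))
```

A shrinking spread as the number of images grows is the pattern the error bars are meant to reveal: the choice of scale matters less once the time series is dense.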

Effect of Multi-Temporal Images on Category Accuracy
Here, the UA index was used to evaluate the class-specific classification uncertainty. According to the previous analysis of the OA, the segmentation scale has less influence on accuracy than the number of images used. Therefore, this section examines the effects of the number of images used on the classification accuracy for different classes. For this purpose, bar charts were plotted to show the classification accuracy for different classes with different numbers of images. The results show that the accuracies of seasonal crops (maize, cereals, and rapeseed) generally increased with the number of input images (see Figures 6 and 7). Figures 6 and 7 also show that the classification quality for winter and summer crops is significantly affected by the acquisition time of the data. For both areas, we observed that the potential to classify summer crops (maize) increased from spring to summer and stabilized toward late summer to autumn. For both winter crops (rapeseed and cereals), the input of winter images was necessary to improve the performance, especially in the case of rapeseed. The rapeseed classification performance decreased with the input of summer images; this effect was most notable when the segmentation scale was large. However, the use of all images in the same year improved the classification performance of rapeseed (see Figure 6).
In contrast, the classification accuracy for forests, grasslands, and artificial lands remained almost unchanged even if more images were used. This can be attributed to their spectral information being relatively stable throughout the year because the forest area in Study Area 1 is almost completely covered by coniferous forest. Despite this, we observed improvements in the accuracy for artificial lands and forests in Study Area 2 when more images were included as input (Figure 7). This is likely because this study area is closer to the urban area of Munich. Urban areas in Study Area 2 are more complicated due to the presence of various types of vegetation; a significant proportion of the forest areas in Study Area 2 comprise mixed and broad-leaved forests. This is consistent with Mendili et al. [38], who adopted optical Sentinel time-series data in urban mapping and suggested that the vegetation in an urban area affects the mapping of that area.
Based on the above analysis, we can conclude that the effect of the time of data retrieval on the classification quality can be explained with respect to the development stages of winter and summer crops. Furthermore, the recommendations for feature selection in the frame of crop mapping proposed by Veloso et al. [37] are acceptable. However, a decreasing trend was not observed in the classification accuracy for a single class when all images were used as input. Instead, a notable increase was observed in the classification accuracy of seasonal crops. Therefore, we recommend the use of as many Sentinel-2 images as possible within the year of interest to ensure an optimal classification performance, especially when the optical data in the time-series are not numerous (approximately 20 timestamps). Furthermore, excluding images from certain periods is not advised.

Effect of Segmentation Scale on Category Accuracy
To analyze the change pattern of the classification of a specific class, error bars with the mean value and mean square error were plotted to show the change in the UA for different classes when different numbers of images were used (Figures 8 and 9). As mentioned in the previous section, Figures 8 and 9 show that the accuracies of seasonal crops (e.g., maize and cereals) benefit more from an increasing number of input images. More importantly, as revealed by the error bars, when more images are used, there is a reduction in the fluctuation of the accuracy of seasonal crops caused by segmentation scale variations (Figures 8 and 9). This phenomenon was also observed in the overall classification accuracy (Figure 5).
The findings of this study are slightly different from those reported by Löw et al. [8]. According to Löw et al. [8], consistently accurate classification can generally be achieved using five images. They stated that dense Sentinel or Landsat data exhibit no advantages in time-series classification. However, this study demonstrates that using more images enhances the classification accuracy due to the contribution of additional images to seasonal crop recognition. Thus, we recommend that image selection should not be applied in multi-temporal object-based classification.

[Figure caption fragment: panels (a-f) show forest, grassland, artificial land, maize, water, and cereals, respectively.]

Feature Selection Response
The classification was repeated 10 times with CFS for each segmentation scale, and the frequency with which each feature was selected was calculated. In this section, for conciseness, only the experimental results for Study Area 1 are shown. In Figure 10, the size of the dot indicates the selection frequency of a specific band (y-axis) and date (x-axis) in the classification models. Only the 10 m resolution bands were evaluated. When only the acquisition time is considered, images from all seasons are selected with similar frequency, except for images taken during the spring-summer transition in June. In contrast, a comparison of the frequencies at which different bands are selected shows that bands 3 and 4 are often not chosen in winter, whereas bands 1 and 2 contribute more. This is likely because different bands respond to crops differently. For example, band 4 (NIR) is relatively sensitive to vegetation. However, vegetation coverage is lower in winter; hence, the NIR band cannot contribute significantly to analyses during this period. In summary, Sentinel images acquired over all four seasons yield significant contributions to the classification of agricultural areas. Hence, images taken during a certain period of time must not be excluded without careful inspection and consideration. Moreover, we do not recommend filtering the input data based on a timeline, as has been done in most previous object-based multi-temporal classification studies (e.g., Vieira et al. [39]).
In addition, the classification was repeated 10 times for each scale with and without feature selection, followed by Welch's t-test to compare their performance. As shown in Table 2, all of the p-values exceed the significance level of alpha = 0.05, except at scale 50. Therefore, for almost all segmentation scales, we can conclude that the mean value of the classification accuracies with feature selection is not significantly different from that with all features. Hence, according to the experimental results, feature selection is not required when RF classifiers are used. This is plausibly because RF classifiers themselves can overcome the limitations of dimensionality more satisfactorily than other classifiers [20].
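For reference, Welch's t-test compares two means without assuming equal variances. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom in plain Python; the function name is hypothetical and the two samples are invented numbers chosen for easy arithmetic, not accuracies from Table 2.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the first mean
    vb = variance(b) / len(b)  # squared standard error of the second mean
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Invented samples (NOT study results)
with_selection = [1.0, 2.0, 3.0, 4.0, 5.0]
all_features = [2.0, 4.0, 6.0, 8.0, 10.0]
t, df = welch_t(with_selection, all_features)
print(round(t, 4), round(df, 4))
```

The p-value would then be read from the t distribution with `df` degrees of freedom, for example through a statistics library, and compared with the chosen significance level.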

Figure 10. Selected frequency of a specific band and date. Band numbers 1, 2, 3, and 4 indicate red, green, blue, and near-infrared (NIR), respectively.

Conclusions
In this study, object-based land-cover classification using RF was applied to time-series optical Sentinel data. A systematic evaluation was conducted to understand classification uncertainty in object-based dense multi-temporal image classification, including the impact of the segmentation scale, spectral features, categories, frequency, and acquisition timing of optical satellite images. Subsequently, several important findings were obtained regarding the input of time-series data and the optimization of the segmentation scale.
The use of multi-temporal data significantly alleviates the problem associated with identifying an "optimal" segmentation scale. This finding is important because it makes the selection of scales, which was a challenge in OBIA, less important in OB-SITS mapping. As a result, the integration of OBIA and time-series analysis becomes more feasible. The findings of this study provide a scientific basis for the future application of Sentinel time-series data in conventional object-based supervised land-cover classification. We recommend the use of as many images as possible to enhance classification performance. Feature selection is an optional process when only limited Sentinel-2 images (e.g., approximately 20 timestamps) are used with RF as the classifier.

Acknowledgments: Sincere thanks to the anonymous reviewers and members of the editorial team for their comments and contributions.

Conflicts of Interest:
The authors declare no conflict of interest.