1. Introduction
The number of remote sensing platforms is continually increasing, and they are producing a tremendous amount of earth-observation image data. However, extracting essential information from these images remains challenging [1,2,3]. Image classification plays a fundamental role in this task, and great efforts have been dedicated to the development of image classification methods, such as data mining, deep learning [4,5], and object-based image analysis (OBIA). Among these techniques, OBIA has been widely applied to high-spatial-resolution image classification [6,7,8] because it can take full advantage of the spatial information captured within these images. In addition, many researchers consider OBIA an interesting and evolving paradigm for various applications, e.g., agricultural mapping [9,10], forest management [11,12], and urban monitoring [13,14].
Recent studies have frequently reported that OBIA can achieve good performance in remote sensing applications [7,15,16,17], mainly for two reasons. First, OBIA-based classification algorithms can reduce or even eliminate the salt-and-pepper noise that often appears in the results of pixel-based strategies [18]. This is because image segmentation is generally the first step in OBIA. This procedure partitions an image into several non-overlapping and homogeneous parcels (also called segments or objects) so that noisy pixels do not cause errors in the classification results [19]. Lv et al. [20] developed an object-based filter technique that can reduce image noise and accordingly improve classification performance. Second, compared with pixel-based classification approaches, OBIA is capable of utilizing various image features. Because the processing unit in OBIA is an object rather than a pixel, this paradigm can effectively employ object-level information [21,22]. At the object level, it is convenient to extract geometric and spatial contextual features that may enhance the discriminative power of the feature space. Motivated by Tobler’s first law of geography, Lv et al. [23] extracted object-based spatial–spectral features to enhance the classification accuracy of aerial imagery.
Although OBIA has the above-mentioned merits, its potential has not yet been fully realized. The first challenge originates from the image segmentation step. When segmenting a remote sensing image, over- or under-segmentation errors often occur, even when a state-of-the-art segmentation algorithm is employed [24,25,26]. OBIA classification suffers greatly from segmentation errors, especially under-segmentation [27,28]. The second issue is that, as with traditional pixel-based approaches, the performance of OBIA is limited by the quantity and quality of its training samples. Many real applications require as few training samples as possible to achieve sufficient classification accuracy, because collecting samples can be costly. Examples include mapping tasks in hazardous or remote areas, such as post-earthquake cities, landslide-affected zones, and outlying agricultural fields [29,30,31,32]. The accessibility of these areas is often limited, so sample acquisition through on-site visits is difficult or impossible. Sending drones or purchasing satellite images of much higher spatial resolution may help obtain ground-truth information, but at a significantly higher cost. Accordingly, the real problem is how to collect the most useful samples within a limited sample-collection budget. Active learning (AL) aims to solve this problem by guiding the user to select the samples that may optimally increase classification accuracy [33,34,35]. In this way, users do not have to spend time or money acquiring labels for useless samples, which are either redundant or have little effect on classification performance. Under the guidance of an AL method, a relatively small sample set that optimizes the separability of a classifier can be achieved, fulfilling the objective of attaining sufficiently high classification accuracy with a limited number of training samples. Though this seems tempting, applying AL techniques to OBIA is challenging, and related studies are comparatively rare.
In the field of remote sensing, it is common to apply AL to pixel-based classification. In these studies, AL addresses supervised classification problems in which only a small set of training pixels is available. By iteratively adding new samples to the training set, AL may help raise classification performance. The key objective of AL is thus to identify the most informative samples, i.e., those that optimally improve classification accuracy. In implementation, an AL method aims to identify the sample with the largest classification uncertainty [36,37,38], and there are three main ways to achieve this [39]. The first strategy uses information gain, generally formulated with Shannon entropy. The second, breaking-tie (BT), adopts the criterion of posterior probabilistic difference. The third, margin sampling, is mostly combined with a support vector machine. Several examples follow. Tuia et al. [34] constructed a BT algorithm that enables AL to detect new classes. By fusing entropy and BT approaches, Li et al. [40] developed a Bayesian-based AL algorithm for hyperspectral image classification. Inspired by the idea of region-partition diversity, Huo and Tang [41] implemented a margin-sampling-based AL method. Xu et al. [39] proposed a patch-based AL algorithm based on the BT criterion. Sun et al. [42] designed three AL methods based on a Gaussian process classifier and three modified BT strategies.
Compared with the above-mentioned pixel-based AL methods, relatively few efforts have documented object-based AL, although a few examples exist. Liu et al. [43] proposed an AL scheme based on information gain and the BT criterion to classify PolSAR imagery. Based on margin sampling and multiclass-level uncertainty, Xu et al. [44] implemented an object-based AL strategy for earthquake damage mapping. Ma et al. [45] developed an object-based AL approach that considers samples with zero and large uncertainty. These studies have made conspicuous contributions to OBIA and AL, but some issues remain. This article focuses on two of them.
The first issue is that the feature space in OBIA is generally much more complicated than its pixel-based counterpart. In the pixel-based paradigm, features are generally calculated from a single pixel, or a window of pixels centered at the target pixel; therefore, this process can only capture information within a limited spatial range. In comparison, object-based features contain more information, such as geometric and spatial contextual cues. This more complicated feature space poses a new challenge to traditional AL algorithms because AL relies on feature variables to quantify uncertainty. Accordingly, the more complicated feature set in OBIA requires new AL methods, a fact which motivated this work. For this purpose, this paper presents a new strategy for uncertainty measurement based on one-against-one (OAO) binary random forest (RF) classifiers. Though RF is a popular and successful classifier in remote sensing [46] and OBIA [47,48,49], it is interesting to test the OAO binary RF for AL method construction and to see whether it can finely estimate uncertainty when using object-level features.
Second, to the best of our knowledge, none of the previous studies investigated the effects of object feature types on AL performance. Though some previous studies have explored how to determine the most discriminative features, most of them focused on hyperspectral image classification [50,51]. Similar studies on OBIA are even rarer. OBIA generally provides four types of object-based features: geometric, spectral, textural, and contextual [49,52]. The discriminative power of the four feature types may vary widely across scenarios, and this can greatly influence the AL process. However, according to previous works, it is unclear how classification performance behaves when object-based AL uses different combinations of the four feature categories. In this work, the evaluation considers the effects of different object-feature types, which we consider a contribution in terms of experimental design.
According to the issues described above, this article proposes a new object-based AL algorithm by using an OAO-based binary RF model. It combines the posterior probabilistic outputs with a modified BT criterion to quantify classification uncertainty in a detailed way. Additionally, with the proposed AL approach, different combinations of the four object feature categories are tested to see which combination is the most appropriate for object-based AL.
This paper is organized as follows. Section 2 details the principle of the new object-based AL. Section 3 presents the experimental results, including tests of how different combinations of the four object feature types affect AL performance, as well as comparisons between the proposed algorithm and other competitive AL approaches. Section 4 and Section 5 provide the discussion and conclusion, respectively.
2. Methodology
In this part, we first introduce the basic concepts of AL in Section 2.1, for the convenience of describing the proposed algorithm. Then, Section 2.2 discusses the implementation of object-based AL, followed by a detailed description of the proposed AL approach in Section 2.3 and Section 2.4. The last sub-section introduces the object-based features used in this study.
2.1. Basics of Active Learning
An AL method consists of five parts: a training set T, a classifier C, a pool of unlabeled samples U, a query function Q, and a supervisor S. Table 1 describes a simple AL process. Steps 2 and 3 make up an iterative process in which the most important component is Q, since it determines whether the process can select good samples. Generally speaking, for successful AL, the classification accuracy of the output T should be evidently higher than that produced using the initial T. This depends on whether the samples found by Q are beneficial to the classification performance of C. To enhance the readability of this article, the meanings of the abbreviations and letter symbols related to the principle of an AL approach are listed in Table A1 of Appendix A.
Intuitively, a good Q can identify the samples with the highest classification uncertainty, because many hold that uncertain samples help raise the discriminative power of C. Accordingly, AL-related studies have largely focused on the design of Q. In implementation, Q is a criterion used to measure the classification uncertainty of unlabeled samples.
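The generic pool-based loop described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: `train`, `Q`, and `supervisor` are placeholders for the classifier-training routine, query function, and human labeler, and the 1-D "classifier" in the demo is purely toy.

```python
# Generic pool-based AL loop (Table 1). `train`, `Q`, and `supervisor`
# stand in for C-training, the query function, and the supervisor S.
def active_learning(T, U, train, Q, supervisor, q=1, max_samples=20):
    """Iteratively move the q most uncertain samples from U into T."""
    C = train(T)                                   # Step 1: train C on T
    while U and len(T) < max_samples:
        ranked = sorted(U, key=lambda u: Q(C, u))  # Step 2: rank by uncertainty
        picked = ranked[:q]                        # lowest score = most uncertain
        T = T + [(u, supervisor(u)) for u in picked]  # Step 3: label, update T
        U = [u for u in U if u not in picked]
        C = train(T)
    return T

# Toy demo: samples are numbers; "training" just stores labeled positions,
# and Q treats samples far from any labeled one as most uncertain.
train = lambda T: [x for x, _ in T]
Q = lambda C, u: -min(abs(u - x) for x in C)
supervisor = lambda u: int(u > 5)
T_final = active_learning([(0, 0), (10, 1)], [1, 2, 5, 8],
                          train, Q, supervisor, q=1, max_samples=4)
print(len(T_final))  # 4
```

Any concrete AL method differs only in how Q scores the unlabeled pool; the rest of the loop is shared.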
2.2. Object-Based Active Learning
The workflow of object-based AL is similar to that delineated in Table 1. However, since the processing unit in OBIA is an object, a sample differs from that of pixel-based methods. For pixel-based AL, a sample consists of a pixel’s label and feature vector; for object-based AL, a sample corresponds to an object, and so do its label and feature vector. This directly results in two differences between the two AL types.
First, the search space of object-based AL can be much smaller than its pixel-based counterpart because, for the same image, the number of objects is much lower than the number of pixels. This tends to simplify the sample search performed by Q, but this is not necessarily the case, mainly due to the second aspect. The second difference resides in the contents of the feature vector, because OBIA provides more feature types for a processing unit. OBIA allows for the extraction of geometric information and statistical features (e.g., mean, median, and standard deviation values for spectral channels). Thus, the object feature space can be larger and more complicated, posing a great challenge to the computation of Q.
Accordingly, Q in object-based AL should be able to finely estimate the appropriateness of an unlabeled sample. Previous studies concerning this aspect have attempted to split a multiclass classification problem into a set of binary classification procedures so that each class can be treated carefully in its sub-problem. There are two main schemes for doing so: one-against-all (OAA) and one-against-one (OAO) [42,53]. Suppose that there are L classes to be classified in an image; then, OAA divides the L-class problem into L binary classification cases. Each of these binary classifiers is trained using two groups of samples: the samples of one class and the samples of the other L − 1 classes. Though OAA is a widespread scheme, it may suffer from imbalanced training because the samples of L − 1 classes are allocated to one side of the training set. In comparison, OAO can avoid this issue, but it has to construct L·(L − 1)/2 binary classifiers, which is slightly more complicated than the OAA approach. This work adopts the OAO strategy because of this merit.
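The OAO decomposition can be illustrated with a short sketch; `oao_training_sets` is a hypothetical helper showing how the L·(L − 1)/2 two-class training subsets would be formed.

```python
from itertools import combinations

# One-against-one decomposition: L classes -> L*(L-1)/2 binary problems,
# each trained on samples of exactly two classes (helper names hypothetical).
def oao_pairs(labels):
    """All unordered class pairs present in the label list."""
    return list(combinations(sorted(set(labels)), 2))

def oao_training_sets(samples):
    """samples: list of (feature_vector, class_label) tuples.
    Returns one two-class training subset per class pair."""
    pairs = oao_pairs([y for _, y in samples])
    return {(a, b): [(x, y) for x, y in samples if y in (a, b)]
            for a, b in pairs}

samples = [([0.1], 1), ([0.2], 2), ([0.3], 3), ([0.4], 4)]
subsets = oao_training_sets(samples)
print(len(subsets))  # 6 binary problems for L = 4
```

Note how each subset contains only the samples of its two classes, which is why OAO avoids the class imbalance that OAA introduces.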
2.3. Random Forest-Based Query Model
Random forest (RF) was proposed by Breiman [54], and in recent years it has been successfully applied to diverse remote sensing applications. As its name indicates, the most intriguing feature of RF is its randomness, embodied in two aspects. First, RF is composed of a large set of decision trees (DTs), each of which is trained using a sample subset randomly selected from the total training set. This procedure adopts bootstrap sampling, which enhances the generalizability and robustness of RF. Second, each DT exploits a random subset of feature variables, which helps avoid over-fitting and further improves robustness.
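The two layers of randomness can be sketched as follows; `bootstrap_and_features` is an illustrative stand-in for the sampling performed before each tree is trained, not an RF implementation.

```python
import random

# Per-tree randomness in RF: a bootstrap sample of the training rows
# (drawn with replacement) and a random feature subset (without replacement).
def bootstrap_and_features(n_samples, n_features, mtry, rng):
    rows = [rng.randrange(n_samples) for _ in range(n_samples)]  # bootstrap
    feats = rng.sample(range(n_features), mtry)                  # feature subset
    return rows, feats

rng = random.Random(0)
rows, feats = bootstrap_and_features(100, 16, 4, rng)
print(len(rows), len(set(feats)))  # 100 4
```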
This work proposes a binary RF-based query model and applies it to object-based AL. The key task of this query model is to quantify the appropriateness of the tested samples and then select the most appropriate sample(s). To achieve this, the proposed algorithm comprises three steps.
Step (1): initialization. According to the OAO rule, for L (L > 2) classes, L·(L − 1)/2 binary classifiers are built using the initial training sample set. In implementation, each binary RF is trained using the samples of only two classes.
Step (2): test sample processing. A test sample is classified by each of the initialized binary classifiers. Each classifier produces a label (l) and an associated probability value (p). In this way, L·(L − 1)/2 pairs of l and p are obtained for a test sample.
Step (3): appropriateness estimation. Among the results obtained in the last step, the dominant class is identified. If the label of this class is ld, then there are nd binary classifiers that assign ld to the test sample. The maximum value of nd is L − 1 because, among the L·(L − 1)/2 binary classifiers, L − 1 classifiers involve each class. For the nd classifiers (suppose they make up a set Fd), the one producing the maximum uncertainty is chosen to reflect the degree of appropriateness for the test sample. This process can be formulated as Equation (1),
ma = min_{i∈Fd} (pi,1 − pi,2), (1)
where pi,1 and pi,2 represent the probability values of the ith binary classifier in Fd; for convenience, it is prescribed that pi,1 ≥ pi,2, and ma represents the appropriateness measure for a test sample. This equation implies that the class that is most confused with ld yields the highest level of classification uncertainty. Given that pi,2 = 1 − pi,1 in the case of binary classification, Equation (1) can be rewritten as
ma = min_{i∈Fd} (2pi,1 − 1), (2)
which is equivalent, up to a constant factor that does not affect the ranking, to
ma = min_{i∈Fd} (pi,1 − 0.5). (3)
In implementation, Equation (3) is adopted.
Equation (3) is similar to that proposed by Sun et al. [42], except that this work derives l and p using RF, while Sun’s method uses a Gaussian process classifier. In this study, the RF model implemented in OpenCV was adopted; this implementation allows the derivation of p only in the context of binary classification. More specifically, p is estimated here as the ratio between the number of DTs producing one class label and the total number of DTs.
To better understand the proposed query model, Figure 1 illustrates an example. The number of classes (L) is 4, so six binary classifiers are constructed using the initial training set, in which there are two samples for each class. For an unlabeled sample ui, each of the six classifiers produces a label and an associated probability value. It can be seen that 4 is the dominant class, class 2 is the one most confused with class 4, and the uncertainty value is 0.05 according to Equation (3).
In some real cases, more than one dominant class may exist after step (2). For example, if the label predicted by classifier F(2,4) were 2 instead of 4, classes 1, 2, and 4 would have the same number of prediction results, and all three could be considered dominant classes. The model of Equation (3) cannot handle this situation. To solve this problem, it is defined that, among the results of the multiple dominant classes, the minimum value of p − 0.5 is used as ma.
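Steps (2) and (3), including the multiple-dominant-class rule, can be sketched as follows. The classifier outputs in `preds` are made-up values arranged to mirror the Figure 1 setting (L = 4, class 4 dominant, F(2,4) least sure), not the paper's actual numbers.

```python
from collections import Counter

# Appropriateness measure m_a of Equation (3), with the tie rule for
# multiple dominant classes; probability values below are illustrative.
def appropriateness(predictions):
    """predictions: {(a, b): (label, p)} from the OAO binary classifiers,
    with p >= 0.5 the probability of the predicted label."""
    votes = Counter(label for label, _ in predictions.values())
    top = max(votes.values())
    dominant = {c for c, n in votes.items() if n == top}
    # m_a = min over the dominant class's classifiers of (p - 0.5);
    # when several classes tie for dominance, minimize over all of them.
    return min(p - 0.5 for label, p in predictions.values()
               if label in dominant)

preds = {(1, 2): (1, 0.60), (1, 3): (1, 0.70), (1, 4): (4, 0.80),
         (2, 3): (2, 0.65), (2, 4): (4, 0.55), (3, 4): (4, 0.90)}
print(round(appropriateness(preds), 2))  # 0.05
```

With these inputs, class 4 wins three of its L − 1 = 3 classifiers, and the least confident of them, F(2,4) with p = 0.55, sets ma = 0.05.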
2.4. The Proposed AL Algorithm
2.4.1. Details of the Proposed AL
With the query model described in the last sub-section, we can now provide the overall workflow of the proposed AL approach. The red-solid-line box in Figure 2 illustrates the detailed process of the proposed AL method. The most important part is the query function Q, which is an adaptation of the three steps delineated in Section 2.3 because, in the framework of AL, samples of high appropriateness should be selected and used to iteratively update T, U, and C. For illustrative purposes, the arrows are numbered to indicate the order of the steps.
It is worth noting that the sorting procedure of Q arranges the tested samples in ascending order, because the sample with the lowest ma value is considered to have the highest uncertainty. Moreover, to enable batch-mode AL, the first q (q ≥ 1) sorted samples are selected in the sample selection step. According to previous research, the batch mode can speed up AL, but it may compromise AL performance. Thus, q is deemed an important parameter and is analyzed in the experiment.
After the steps of Q, q unlabeled samples {ui | i = 1, …, q} are selected and labeled by a supervisor/user, and then U and T are updated accordingly: T = T ∪ {ui}, U = U \ {ui}. Note that if the number of samples remaining in U is less than q, the whole process is terminated. Another termination condition is that if the total number of samples in T exceeds a predefined threshold, the AL method ends. The output of the AL algorithm is an enlarged T, in which the added samples are expected to raise classification performance.
2.4.2. Details of the Whole Processing Chain
The overall process is shown in the upper part of Figure 2. The first step is image segmentation, the objective of which is to partition an image into several non-overlapping and homogeneous segments. In OBIA, unsupervised segmentation algorithms are generally used, and this study follows the same convention. A frequently adopted method, multi-resolution segmentation (MRS) [55], was used in this work. MRS is an unsupervised region-merging technique. Initially, it treats each pixel as a single segment. Then, according to a heterogeneity-change criterion based on spectral and geometric metrics [55], an iterative region-merging process is initiated. During this process, only segments that are mutually best fitting are merged. The mutual-best-fitting rule is explained in [55], and it can effectively reduce inappropriate merging. MRS has three parameters: a shape parameter, a compactness parameter, and scale. The former two are both within the range (0,1) and serve as weights in the spectral and geometrical heterogeneity measures [55]. Scale is generally considered the most important parameter because it controls the average size of the resulting segments. A high scale leads to large segments, so under-segmentation errors tend to occur, while a low scale results in small segments and, hence, over-segmentation errors. To avoid this issue, the optimal scale has to be found. Section 3.2 provides the related details.
After segmentation, samples are prepared for the subsequent steps. As shown in Figure 2, there are training (T) and unlabeled (U) samples. The former contains two parts: (1) the class label and (2) the feature vector. The latter has only part (2). Thus, a feature vector is extracted for every sample in T and U. Since the segment is the processing unit, segment-level features are computed, the details of which are given in Section 2.5.
The next procedure is the proposed AL, which is described in the aforementioned sub-sections. The output of AL is an enlarged T, which acts as the final training set for a classifier. In this study, an RF classifier is applied. Note that this RF differs from those mentioned in Section 2.3: this RF is a standard multi-class classifier, while those used in the AL query model are binary classifiers used for uncertainty quantification.
Then, classification is achieved using the aforementioned standard RF, which takes the feature vector of each segment as input and predicts a class label for that segment. This RF is trained using 5-fold cross-validation, and throughout this work, its two parameters, the number of decision trees (Ntree) and the number of split variables (mtry), are set to 300 and the square root of the total number of features, respectively. This setting was tested and found sufficient for this study.
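As a sketch of this configuration, the snippet below uses scikit-learn's random forest as a stand-in for the OpenCV implementation used in the paper; the data are synthetic placeholders, so the accuracy itself is meaningless.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 25))   # toy stand-ins for object feature vectors
y = (X[:, 0] > 0).astype(int)    # toy labels

# Ntree = 300 decision trees; mtry = sqrt(total number of features)
clf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                             random_state=0)
scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
print("mean CV accuracy:", round(float(scores.mean()), 2))
```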
To output the final classification result, the pixels of a segment are rendered with the same label as the prediction result for that segment. The result can then be used for classification evaluation and illustration.
2.5. Object-Based Feature Extraction
In OBIA, a processing unit is represented by several segment-level features that may contain spatial, spectral, textural, and contextual information. This may significantly lengthen the feature vector and complicate the feature space, which in turn may influence AL. To investigate these effects, object-based features frequently applied in OBIA studies are listed here and were tested in the experiment.
Four types of object-based features were used in this work: 10 geometric features, 3·BS spectral features, 3·BT textural features, and 3·BS contextual features, where BS is the number of spectral channels of the input image and BT is the number of textural feature bands. Table 2 details these features.
In the description of geometric features, the outer bounding box refers to a rectangle bounding the object, with edges parallel to the edges of the image. Such a bounding box is generally not the minimum one for the object, but this feature is frequently used because it is simple to compute and can reflect the relative geometric direction of an object.
To extract textural features, the grey-level co-occurrence matrix (GLCM) and principal component analysis (PCA) are adopted. First, 8 GLCM-based textural descriptors are calculated for each spectral band: mean, variance, homogeneity, contrast, dissimilarity, entropy, second moment, and correlation [56]. The grey-scale quantification level is set to 32, and 3 processing window sizes (3 × 3, 5 × 5, and 7 × 7) are utilized to capture multi-scale texture. The co-occurrence shifts in the horizontal and vertical directions are both set to 1. This configuration yields 3 × 8 × BS textural feature bands, which contain much redundant information. Accordingly, PCA is used to generate a concise set of textural feature bands. The PCA-transformed bands corresponding to the first four principal components are selected to derive the object-based textural features. For the datasets used in this study, the details of the PCA results are provided in Section 3.1.
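A minimal, dependency-light sketch of this pipeline follows: one band, one co-occurrence shift, two of the eight descriptors, and a bare-bones PCA via SVD. A real OBIA toolchain would compute all descriptors over sliding windows with a dedicated texture library; this only illustrates the mechanics.

```python
import numpy as np

def glcm(band, levels=32, shift=(0, 1)):
    """Symmetric, normalized grey-level co-occurrence matrix for one band."""
    q = np.floor(band / band.max() * (levels - 1)).astype(int)
    dy, dx = shift
    a = q[:q.shape[0] - dy, :q.shape[1] - dx].ravel()
    b = q[dy:, dx:].ravel()
    m = np.zeros((levels, levels))
    np.add.at(m, (a, b), 1)
    m += m.T                      # make symmetric
    return m / m.sum()

def contrast_homogeneity(m):
    """Two of the eight descriptors listed above."""
    i, j = np.indices(m.shape)
    return (float((m * (i - j) ** 2).sum()),
            float((m / (1 + (i - j) ** 2)).sum()))

def pca_first_k(bands, k=4):
    """bands: (n_pixels, n_bands); project onto the first k components."""
    X = bands - bands.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:k].T

rng = np.random.default_rng(0)
c, h = contrast_homogeneity(glcm(rng.random((64, 64))))
pcs = pca_first_k(rng.random((100, 8)), k=4)
print(c > 0 and 0 < h <= 1, pcs.shape)  # True (100, 4)
```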
For both spectral and contextual features, the average, median, and standard deviation are utilized. These three statistics reflect the statistical pattern of an object.
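Per-object spectral statistics can be sketched as follows; `object_spectral_features` is a hypothetical helper operating on an integer label map from any segmentation algorithm.

```python
import numpy as np

# For each segment: mean, median, and standard deviation of every
# spectral band, giving a feature vector of length 3 * BS per object.
def object_spectral_features(image, segments):
    """image: (rows, cols, BS) array; segments: (rows, cols) label map.
    Returns {segment_id: feature vector of length 3 * BS}."""
    feats = {}
    for sid in np.unique(segments):
        pixels = image[segments == sid]          # (n_pixels, BS)
        feats[sid] = np.concatenate([pixels.mean(axis=0),
                                     np.median(pixels, axis=0),
                                     pixels.std(axis=0)])
    return feats

rng = np.random.default_rng(0)
img = rng.random((20, 20, 4))                    # BS = 4 spectral bands
seg = np.zeros((20, 20), dtype=int)
seg[10:, :] = 1                                  # two toy segments
feats = object_spectral_features(img, seg)
print(len(feats), feats[0].shape)  # 2 (12,)
```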
Among the four types of object-based features, the spectral features are the most frequently used. Textural and contextual features are derived from spectral features, so these three types may exhibit some dependence. In contrast, geometric features are independent of the other three types, but whether they positively affect classification performance depends on the application at hand. In the following experiment, different combinations of the four feature types were tested to investigate their influence on AL performance.
3. Dataset
3.1. Satellite Image Data
Three sets of high-spatial-resolution multispectral images were employed to validate the proposed approach; they are shown in Figure 3, Figure 4 and Figure 5. The three sets are denoted “T1,” “T2,” and “T3.” Note that each dataset contains two scenes with similar landscape patterns and geo-contents. For convenience, the two images in “T1” are coded “T1A” and “T1B.” “T1A” was used for AL execution, while “T1B” was used for validation experiments (the images in “T2” and “T3” were coded and used in the same way). More specifically, AL was first applied to “T1A,” yielding samples selected from this image; these samples were then used to train an RF classifier that was subsequently employed to classify “T1B.” The objective of this experimental design was to see whether samples of good generalizability could be selected using the proposed AL technique. If the samples selected from “T1A” led to high classification accuracy for “T1B,” we could conclude that these samples had sound generalizability and, accordingly, that the AL method performed well.
The two images in “T1” were acquired by the Gaofen-1 satellite, while the scenes in “T2” and “T3” were all acquired by the Gaofen-2 satellite. The two Gaofen satellites are both Chinese remote sensing platforms designed to capture high-quality earth-observation imagery. The sensors on board Gaofen-1 and Gaofen-2 are similar, each including a multi-spectral and a panchromatic camera. Their spectral resolutions are the same, while the two satellites differ in spatial resolution. Both multispectral sensors have 4 spectral bands: near-infrared (NIR) (770–890 nm), red (630–690 nm), green (520–590 nm), and blue (450–520 nm). Their spatial resolutions (represented by ground sampling distance) are 8.0 and 3.24 m, respectively. In this study, all images were acquired by the multi-spectral sensors.
“T1A” and “T1B” were both acquired on 17 November 2018, but they were subsets extracted from different images. The sizes of “T1A” and “T1B” were 1090 × 818 and 1158 × 831 pixels, respectively. As can be seen in Figure 3, there were two large coal mine open-pits in “T1A,” while there was only one in “T1B.” The central-pixel coordinates of the two images were (E112°23′51″, N39°29′46″) and (E112°27′17″, N39°33′56″), respectively, indicating that the two scenes covered Antaibao, within Shuozhou city, Shanxi province, China. This is the largest coal mine open-pit in China, and it has experienced extensive mining activities since 1984. It still holds the highest daily production record of 79 thousand tons. However, the mining operations have produced significant ecological and hydrological effects on the local environment, so monitoring this area with remote sensing techniques is of environmental importance. The open-pit region covers a large area, and it is inconvenient and even dangerous to collect field data there; it is therefore meaningful to apply AL to the mapping of this area.
The two images of “T2” were acquired on the same day, 7 May 2017. Similar to “T1,” “T2A” and “T2B” were subsets extracted from two different scenes, with sizes of 1010 × 683 and 942 × 564 pixels, respectively. The central-pixel coordinates of “T2A” and “T2B” were (E108°19′6″, N41°7′17″) and (E108°4′13″, N41°0′4″), respectively; both subsets were thus located in Wuyuan county, western Inner Mongolia, China. “T2” is illustrated in Figure 4, and it is evident that both subsets covered agricultural landscapes. Because the acquisition date fell at the initial stage of the local agricultural calendar, many fields had no vegetation cover; instead, they were bare soil with different levels of moisture. Note that there were some immersed fields and damp ones. This is due to the practice of immersion, which reduces the saline and alkaline contents of the local soil so that farmers can grow crops such as corn, wheat, and sunflower. Though local agriculture is quite developed, transportation in this rural region is still inconvenient, which increases the cost and difficulty of field data collection. This fact makes AL useful for mapping this place.
“T3” differed from “T1” and “T2” in two respects. First, the two images in “T3” covered urban areas. Second, the two subsets in “T1” or “T2” were geographically quite close, while those of “T3” were very distant, which led to relatively large differences between the two “T3” subsets and was conducive to testing the generalizability of the proposed AL. “T3A” and “T3B” had sizes of 1069 × 674 and 995 × 649 pixels, and their central-pixel coordinates were (E116°4′42″, N30°36′55″) and (E113°31′20″, N23°8′8″), respectively. “T3A” was acquired on 2 December 2015, and “T3B” on 23 January 2015. “T3A” captured an industrial area of Wuhan City, China, while “T3B” covered the economic development area of Huangpu District, Guangzhou City, China.
Figure 5 exhibits the two images of “T3.” Though at first glance “T3A” and “T3B” have a similar urban appearance, their geo-objects differed considerably in spatial distribution and quantity. There was more vegetation cover and there were more bright buildings in “T3B” than in “T3A,” and the vegetation in “T3B” appeared more reddish. Moreover, there were more light-colored buildings in “T3A” than in “T3B.” These differences were mainly due to the differences in geo-location and acquisition time.
For the three datasets, since the two subset images came from different scenes that differed in solar illumination and atmospheric conditions, the spectral signatures of the same land cover type may not have been consistent. To alleviate this effect, atmospheric correction was performed using the Quick Atmospheric Correction tool in ENVI 5.0.
3.2. Sample Collection
The AL-based classification experiment required training and testing samples. In this study, we determined these samples by using visual interpretation and manual digitization. For this purpose, we hired 3 experienced remote sensing image interpreters to extract geo-objects in the three datasets. In this process, they used polygons for digitization since this scheme can speed up sample collection. To guarantee the correctness of the collected samples, the interpreters cross-checked the initially obtained samples, and the polygons with high certainty and confidence were finally used in this experiment. The right columns of
Figure 3,
Figure 4 and
Figure 5 show the resulting reference samples. The numbers of the sampled polygons are listed in
Table 3,
Table 4 and
Table 5. The following details the types of land use and land cover in the three image datasets.
In “T1,” there were four major geo-object types, namely coal mine (red), shadow (green), dark bare soil (blue), and bright bare soil (yellow). The colors in brackets correspond to the sample map displayed in the right column of
Figure 3. Coal mine and shadow were both dark and blackish in spectra, which made them easy to confuse and thus posed a great challenge to their differentiation. However, the geometric features of the 2 object types were quite different, since most shadow areas were thin and elongated, while this pattern was not evident for coal mine. The other 2 types were relatively easy to discriminate, mainly due to their distinctive spectral and textural appearance; however, their spectral and textural features had a large range of variance, which may have confounded classification results.
As for “T2,” we identified 5 land cover types, namely vegetation (red), watered field (green), bright bare soil (blue), dry bare soil (yellow), and moist bare soil (cyan). According to the local crop calendar, the vegetation in “T2” was mostly wheat, since other crop types such as corn and sunflower were not planted in early May. It is interesting to note that in the first half of May, the locals flood the crop fields with irrigation water before sowing the seeds of sunflower or some vegetables, which leads to the dark-color fields found in “T2.” The difference between watered field and moist bare soil is that the former is covered by water, while the latter is merely soil with relatively high moisture. Different from dry bare soil, the bright bare soil fields are not used for growing crops; they are used for stacking harvested crops and are usually very flat, resulting in high reflectance and, thus, a whitish appearance. The heterogeneous fields of vegetation, dry bare soil, and moist bare soil were not easy to discriminate, which can be recognized as a challenge for the classification of “T2.”
By carefully observing the two images of “T3,” we determined 4 major land cover categories: bright building, light color building, dark color building, and vegetation. The 3 building types have different spectral appearances because their roof materials are not the same. The bright buildings mostly have flat cement roofs, while the roofs of the light color buildings are made of metal. The dark color buildings have brick roofs, and their appearance is dark gray. Though there is a small lake and a thin river in “T3B,” we did not consider a water type since water objects do not exist in “T3A.” Note that the vegetation in “T3A” consists mainly of bushes, meadows, and low trees, with small areas and a light red color. In the other image, forests dominate the vegetation class, leading to a very reddish color. Such an inconsistency in spectra may have contributed to some classification errors for this dataset.
Note that in this experiment, for each dataset, we used the first image (“T1A,” “T2A,” or “T3A”) to run AL, resulting in an enlarged set of training samples. The resulting sample set was then used to train an RF classifier to finish the classification task of the second image (“T1B,” “T2B,” or “T3B”). The ground truth samples of the second image were employed to evaluate classification performance, while the samples of the first image were exploited for the labeling step of AL, which corresponds to the component
S in
Table 1. To be more specific, when running the AL method, some unlabeled samples were selected, meaning that their labels were unknown to the AL algorithm. The ground truth samples were used to label these samples so that they could be added to the training set.
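This cross-image protocol can be illustrated with a minimal sketch: a random forest is trained on the sample set of the first image and then applied to objects of the second. The feature arrays, class count, and scikit-learn usage below are illustrative assumptions, not the actual implementation of this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical object-level features/labels for image A (after AL enlargement).
X_a = rng.normal(size=(200, 10))
y_a = rng.integers(0, 4, 200)                  # e.g., 4 geo-object classes

# Image B: a similar scene, here simulated as a perturbed copy of A.
X_b = X_a + rng.normal(scale=0.1, size=X_a.shape)
y_b = y_a                                      # ground truth of image B

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_a, y_a)
pred_b = clf.predict(X_b)                      # classify the second image
oa = accuracy_score(y_b, pred_b)               # overall accuracy on B's ground truth
```

Only the second image's ground truth enters the accuracy computation, mirroring the evaluation design described above.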
The aforementioned experimental process generally requires a large number of human–computer interactions, especially when AL involves many iterations. To automate this procedure, we extracted the labeled samples in the first image and made their labels initially unknown to the AL approach (note that at the beginning of AL, only a very small training set is provided). When the AL determined some samples with good appropriateness, their labels were then given to the AL to enable the following AL steps. In this way, we could automatically test an AL method with many repetitions, each initialized with a different initial training sample set. This allowed the effects of the initial sample configuration on AL performance to be investigated, which we hold plays a significant part in the validation of an AL algorithm.
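The automated repetition scheme can be sketched as follows. Here `run_al` is a hypothetical placeholder for one complete AL execution with a simulated supervisor; the pool size, repetition count, and returned accuracy values are illustrative only.

```python
import random

def run_al(initial_ids, pool_ids, n_iter=9):
    # Placeholder: a real implementation would iteratively query, label,
    # retrain, and return one overall-accuracy value per AL iteration.
    return [0.7 + 0.02 * i for i in range(n_iter)]

pool_ids = list(range(500))                # labeled objects of the first image
curves = []
for rep in range(20):                      # repeated runs with different seeds
    random.seed(rep)                       # a different initial set each time
    initial_ids = random.sample(pool_ids, 10)
    curves.append(run_al(initial_ids, pool_ids))

# Average accuracy curve over all repetitions.
avg_curve = [sum(c[i] for c in curves) / len(curves) for i in range(9)]
```

Averaging over repetitions exposes how sensitive an AL method is to the initial sample configuration.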
Note that the AL method and the subsequent classification are all based on objects, i.e., the processing unit used in this study was a segment/object. However, the ground truth samples presented in
Figure 3,
Figure 4 and
Figure 5 are polygons containing non-overlapping pixels. To deal with this inconsistency, we needed to select the objects that matched the ground truth samples. Considering that the object boundaries did not often align with those of the sample polygons, we designed a matching criterion that is expressed by Equation (4),
H = 1 if n_m / n_o ≥ T_s; otherwise, H = 0, (4)
where
H is a test value for an object, and the object is selected only when H equals 1;
n_m represents the maximum number of object pixels overlapping with the sample polygon(s) of one class;
n_o denotes the number of pixels of the object under consideration; and
T_s is a user-defined threshold whose numerical scope is (0, 1). In this study, 0.7 was found to be sufficient for “T1A” and “T2A,” while for “T3A,” a smaller value, 0.3, was chosen because larger values led to too few selected objects. For a selected object, its label was identical to that of the
n_m pixels.
This criterion guarantees that only objects with relatively high homogeneity were adopted in AL execution. This is because when an object is too heterogeneous, it tends to be inherently under-segmented, which produces negative effects on the classifier’s performance.
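The selection rule above can be sketched as a short function. `match_object` and its input are hypothetical names: the per-pixel label list stands in for the rasterized sample polygons, with `None` marking pixels not covered by any polygon.

```python
from collections import Counter

def match_object(pixel_labels, ts):
    """Return the object's label if it passes the matching criterion, else None."""
    n_o = len(pixel_labels)                          # pixels in the object
    counts = Counter(l for l in pixel_labels if l is not None)
    if not counts:
        return None                                  # no overlap with any polygon
    label, n_m = counts.most_common(1)[0]            # dominant-class overlap
    return label if n_m / n_o >= ts else None        # H = 1 iff n_m / n_o >= T_s

# An object of 10 pixels, 8 of which fall inside "coal mine" polygons:
labels = ["coal mine"] * 8 + [None] * 2
assert match_object(labels, 0.7) == "coal mine"      # passes with Ts = 0.7
assert match_object(labels, 0.9) is None             # rejected with Ts = 0.9
```

Raising `ts` keeps only highly homogeneous objects, which explains why the lower value of 0.3 was needed for “T3A.”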
In the experiment, a small portion of the selected objects was used as the training set (T), and the rest were treated as the unlabeled set (U). When the query function (Q) computed the appropriateness measure for a sample in U, the label information of U was made unknown to the AL algorithm. Only when the AL selected some samples were their labels loaded into the AL, mimicking the human–computer interaction conducted by the supervisor S.
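This simulated-supervisor loop can be sketched as follows; `uncertainty` is a hypothetical stand-in for the query function Q, and the class labels and set sizes are illustrative.

```python
import random

random.seed(1)
hidden = {i: random.choice("ABCD") for i in range(100)}   # ground truth of U, hidden from AL
T = {0: hidden[0], 1: hidden[1]}                          # tiny initial training set
U = set(hidden) - set(T)

def uncertainty(sample_id):
    # Placeholder appropriateness measure; a real Q would use the
    # binary-RF based uncertainty described in the paper.
    return random.random()

for _ in range(9):                                        # 9 AL iterations
    query = max(U, key=uncertainty)                       # most appropriate sample
    T[query] = hidden[query]                              # supervisor S reveals its label
    U.remove(query)

assert len(T) == 11 and len(U) == 89
```

The key point is that `hidden` is consulted only after a sample has been selected, exactly mimicking the human labeling step.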
5. Discussion
There were two objectives in this work. First, a new object-based active learning (AL) algorithm based on binary random forest models was developed. To deal with the multiclass classification problem, the one-against-one strategy was adopted to better measure the classification uncertainty of the unlabeled samples in various situations. This aimed at more accurately estimating classification uncertainty so that more effective samples could be chosen during the AL process. Second, four different categories of object-based features were tested in the experiment to investigate whether AL performance could be affected by the change of feature combination. According to previous literature on AL in remote sensing, this aspect has rarely been analyzed, but we consider it a meaningful research line because the feature space in OBIA is generally complex and may have a large influence on classification accuracy and, consequently, on AL performance.
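To make the one-against-one idea concrete, the following is a minimal sketch of a pairwise uncertainty measure built from binary RF classifiers. The scoring rule (one minus the smallest pairwise probability margin) is one plausible formulation chosen for illustration, not the paper's exact formula; data and dimensions are synthetic.

```python
import numpy as np
from itertools import combinations
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))        # hypothetical object-level features
y = rng.integers(0, 3, 120)          # 3 classes -> 3 binary (one-vs-one) problems

def ovo_uncertainty(X_train, y_train, x_new):
    margins = []
    for a, b in combinations(np.unique(y_train), 2):
        mask = (y_train == a) | (y_train == b)
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        clf.fit(X_train[mask], y_train[mask])          # binary RF for classes a vs b
        p = clf.predict_proba(x_new.reshape(1, -1))[0]
        margins.append(abs(p[0] - p[1]))               # small margin = ambiguous pair
    return 1.0 - min(margins)                          # uncertain if any pair is ambiguous

u = ovo_uncertainty(X, y, rng.normal(size=6))
```

Samples with the largest `u` would be queried first, since an ambiguous class pair signals that labeling the sample is likely to be informative.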
In light of the objectives stated above, we further analyzed the experimental results based on the information presented in
Figure 10,
Figure 11 and
Figure 12. This was achieved by comparing the learning rates of different AL methods. Intuitively, the learning rate can be regarded as the improvement in classification accuracy obtained by using an AL approach. In this work, the learning rate was calculated as the difference between the highest average overall accuracy derived by using an AL method and the AL-free average overall accuracy.
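The computation is a one-liner; the accuracy values below are illustrative, not taken from the experiments.

```python
def learning_rate(al_free_oa, al_curve):
    """Highest average OA reached with AL minus the AL-free average OA."""
    return max(al_curve) - al_free_oa

# Hypothetical average-OA curve over AL iterations:
curve = [0.78, 0.81, 0.84, 0.86]
assert abs(learning_rate(0.76, curve) - 0.10) < 1e-9
```

A positive learning rate thus directly quantifies the accuracy gain attributable to the AL procedure.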
Figure 19 provides the learning rates of the six AL algorithms with eight different feature combinations. For the three datasets, M1 had the best learning rate in all feature-combination cases except the GSTC case of “T2,” where the learning rates of the six AL methods were quite similar. These results demonstrate that the proposed AL technique can effectively improve the classification accuracy for the three high-resolution datasets used.
Aside from the average classification performance obtained by using the 20 repetitive runs, we also analyzed the highest overall accuracy that occurred in the 20 runs of different feature combination cases, as revealed in
Figure 20. It is interesting to find that among the eight feature combinations, the best learning rate did not correspond to the highest overall accuracy in any of the three datasets. This demonstrates that to achieve the best classification performance using the proposed AL method, it is still necessary to test different initial training sets, feature combinations, and numbers of samples. Such a trial-and-error strategy may be time- and labor-consuming, but the results shown in
Figure 19 and
Figure 20 indicate that the proposed technique has a better chance of obtaining good classification accuracy, as compared to other AL approaches.
The patterns revealed by
Figure 19 and
Figure 20 imply that the optimal object-based feature combination varies across scenes, since the best feature combinations of the three datasets were not the same. This is easy to understand because the discriminative power of a feature set tends to change with the geo-content of the image. Recent studies on object-based AL have also reported different optimal feature combinations. Ma et al. [
45] adopted three object-based feature types (shape, spectra, and texture) in their AL experiment. Their datasets included an agricultural district and two urban areas. Gray-level co-occurrence matrix (GLCM) features were incorporated into their AL strategy and were found to positively affect the classification performance. Xu et al. [
44] also compared three types of features (shape, spectra, and texture) for the problem of earthquake damage mapping. They found that geometric features were very effective, and the highest classification accuracy was obtained when this feature type was combined with spectral and textural information. The inconsistencies among the aforementioned studies indicate that feature selection and engineering are necessary for object-based AL methods. Additionally, using all of the object-based feature types may not be optimal. To achieve the best AL performance for different images, it is suggested to test different feature configurations.
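Such a search over feature configurations can be sketched with a simple enumeration. The four type codes (assumed here to denote geometric, spectral, textural, and contextual features) and the `evaluate` scoring function are illustrative placeholders for a full AL-plus-classification run.

```python
from itertools import combinations

FEATURE_TYPES = ["G", "S", "T", "C"]   # assumed: geometry, spectra, texture, context

def evaluate(combo):
    # Placeholder score; a real run would train a classifier with the
    # selected feature types and return a validation accuracy.
    return len(combo) / len(FEATURE_TYPES)

# All non-empty subsets of the four feature types (15 candidates).
combos = [c for r in range(1, 5) for c in combinations(FEATURE_TYPES, r)]
best = max(combos, key=evaluate)
```

In practice, as noted above, the winning combination depends on the scene, so the enumeration should be repeated for each new dataset.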
As for the limitations of the proposed AL technique, we have summarized two points. For one thing, according to the discussion presented in the last paragraph, feature selection should be performed by the user to achieve optimal performance for the mapping problem at hand. This is because, for different landscape patterns, the best feature combination is not always the same, due to the variation in the discriminative power of diverse object-based features. We recommend that users of our method select the features with good separability for their image. For example, when mapping an urban-area image, if obvious geometric differences exist between two types of buildings, shape information should be considered in the formulation of AL.
For another, there is no automatic stopping criterion for the proposed AL approach. In other words, the user has to decide when to stop the AL process. In the experiments of this article, 9 iterations were adopted to plot
Figure 10,
Figure 11 and
Figure 12. Though it seems that this number of iterations is sufficient for the three image pairs, it may not be adequate for other datasets. For real applications, when ground-truth validation samples are not available, it is hard for users to determine the optimal iteration number. Developing an automatic stopping method for the proposed AL algorithm is a meaningful research direction and will be our future work. For the current version of our AL method, however, the user has to set the number of iterations as the stopping criterion. We recommend employing a high value for this parameter since, in many cases, good accuracy occurs when the number of samples is high, as pointed out by
Figure 16,
Figure 17 and
Figure 18.
6. Conclusions
An object-based active learning algorithm has been proposed in this article. The objective of this method is to select the most useful segment-samples, which are labeled by a supervisor and then added to the training set to improve classification accuracy. To this end, a series of binary RF classifiers and object-level features are used to quantify the appropriateness of a segment-sample, and samples with high appropriateness values are selected with high priority. Given that the object-based feature space is complex, it is difficult to accurately estimate sample appropriateness, but our experimental results indicate that the proposed approach can choose effective samples, mostly thanks to the binary RF classifiers, which allow for a detailed description of sample appropriateness using various types of object-based features.
To validate the proposed approach, three pairs of high-resolution multi-spectral images were used. For each pair, the first image was used for AL execution, resulting in an enlarged training set adopted for the classification task of the second image. The experimental results indicate that our AL method was the most effective in improving classification accuracy compared with five other AL strategies. Considering that the proposed AL algorithm relies on the information of feature variables for sample selection and that there are various types of object-based features, it is necessary to investigate the influence of feature combinations on the performance of an object-based AL. Thus, in our experiment, the AL-derived classification improvements were compared for eight feature combinations, and the best combination was determined for each of the three datasets. Interestingly, the optimal feature combination varied across datasets because the discriminative power of the four feature types tested in this study differed for different landscape patterns. Accordingly, we suggest that users of our AL method test the effects of different feature combinations to achieve the best accuracy.