Neighbor-Based Label Distribution Learning to Model Label Ambiguity for Aerial Scene Classification

Abstract: Many aerial images with similar appearances have different but correlated scene labels, which causes label ambiguity. Label distribution learning (LDL) can express label ambiguity by assigning each sample a label distribution. Thus, a sample contributes to the learning of its ground-truth label as well as correlated labels, which improves data utilization. LDL has gained success in many fields, such as age estimation, in which label ambiguity can be easily modeled on the basis of prior knowledge about local sample similarity and global label correlations. However, LDL has never been applied to scene classification, because there is no such knowledge about local similarity and label correlations, and thus it is hard to model label ambiguity. In this paper, we uncover the sample neighbors that cause label ambiguity by jointly capturing local similarity and label correlations, and we propose neighbor-based LDL (N-LDL) for aerial scene classification. We define a subspace learning problem that formulates the neighboring relations as a coefficient matrix regularized by a sparse constraint and by label correlations. The sparse constraint provides a few nearest neighbors, which captures local similarity. The label correlations are predefined according to the confusion matrices on validation sets. During subspace learning, the neighboring relations are encouraged to agree with the label correlations, which ensures that the uncovered neighbors have correlated labels. Finally, label propagation among the neighbors forms the label distributions, which amounts to label smoothing in terms of label ambiguity. The label distributions are used to train convolutional neural networks (CNNs). Experiments on the aerial image dataset (AID) and NWPU_RESISC45 (NR) datasets demonstrate that using the label distributions clearly improves the classification performance by assisting feature learning and mitigating over-fitting, and that our method achieves state-of-the-art performance.


Introduction
Aerial scene classification aims at classifying each aerial image into a scene label, which is typically cast as a single label learning (SLL) problem. Convolutional neural networks (CNNs) have been acknowledged as the most powerful approach for aerial scene classification [1,2]. The fact that some aerial scenes share similar appearances or objects causes the label ambiguity of aerial images. Some studies [3][4][5] handle label ambiguity through multi-label learning (MLL). Both SLL and MLL aim to answer the question 'which label can describe the sample?'. Different from SLL or MLL, label distribution learning (LDL) [6,7] handles the more ambiguous question 'how much does each label describe the sample?'. For a sample, its label distribution represents the degree to which each label describes the sample. In this way, samples are associated with multiple labels, and a sample can contribute not only to the learning of its ground truth, but also to the learning of correlated labels. Thus, each label is supplied with more training data. Using label distributions to express label ambiguity can boost CNN learning by improving data utilization [7][8][9][10][11].

Difficulties in Modeling Label Ambiguity
However, the label distribution of each sample is unavailable in the original training sets and needs to be constructed by modeling label ambiguity. It is generally known that label ambiguity is caused by similar samples that have different but correlated labels [7]. Existing LDL methods are invalid for scene classification for the following reasons:
1. The widely used label distribution is obtained by Gaussian smoothing from the sample ground truth to close labels [7][8][9][10][11], as shown in Figure 1. For age images, local sample similarity is known, i.e., similar images have close ages; global label correlations are also available, i.e., the age difference reflects label correlations. The Gaussian distribution properly models label ambiguity because both the local similarity and the label correlations are captured.
2. For scene classification or generic SLL problems, neither local similarity nor label correlations are known, and thus it is hard to model label ambiguity. As mentioned in Gao et al. [7], modeling label ambiguity is challenging due to the diversity of the label space.


Motivation and Fundamental Idea
Our goal is to construct label distributions for CNN training, which requires modeling the label ambiguity of aerial images. Our motivation is as follows:
1. We assume that label ambiguity is caused by sample neighbors having different but correlated labels. As shown in Figure 2, the images of scenes Center, Square, and Stadium share similar visual features; hence, the labels Square and Stadium cause the label ambiguity of the Center sample annotated by the red box. As label ambiguity originates from label correlations, we aim to uncover the local sample neighbors that satisfy the global label correlations.
2. Subspace learning [12][13][14] has the potential to uncover sample neighbors that satisfy a certain property. To find neighbors with discrimination ability, subspace learning has been adopted for semi-supervised learning (SSL) and clustering problems [15][16][17]. In subspace learning, the neighboring relations are formulated as a representation coefficient matrix that takes the samples as the dictionary and reconstructs each sample as a linear combination of the others, i.e., the self-expressiveness assumption [12]. The sample affinity graph is determined by the coefficient matrix. Discriminative neighbors require the matrix to be block-diagonal, i.e., neighboring relations occur only between samples of the same label. Many constraints have been proposed to pursue a block-diagonal matrix, such as the block-wise [18], low-rank [19], and group-sparse [20] constraints.
3. The subspace learning methods developed for SSL or clustering problems are suboptimal for modeling label ambiguity. Their discriminative neighbors are used for capturing reliable unlabeled samples in SSL or for producing accurate clustering membership. The block-diagonal matrix requires sample neighbors to have the same label, which is inappropriate for uncovering neighbors with different labels.

In this paper, we define a subspace learning problem to model label ambiguity, and the fundamental idea is shown in Figure 2. First, we predefine the global label correlations, which reflect that Center is similar to Stadium and Square but differs substantially from Rail S. In subspace learning, the objective function penalizes the coefficient matrix using the ℓ1 norm and the label correlations. The ℓ1 norm provides sparse neighbors, such as the Square image and the Stadium image, which captures local similarity. According to the label correlations, the neighboring relation between the training sample and the Rail S image is severely penalized, even though the two images are similar. Label propagation among the neighbors leads to a label distribution that expresses the label ambiguity of the training sample. The label noise of Rail S is eliminated by using the label correlations. Similarly, the label ambiguity of the M Res sample is caused by the D Res images and S Res images, but not by the Church image, even though the Church image is visually similar to the M Res sample.

Contributions
In this paper, we uncover the sample neighbors that cause the label ambiguity of aerial images and propose neighbor-based LDL for aerial scene classification, as shown in Figure 3. To be specific, a subspace learning problem is defined to uncover neighboring relations among samples, which includes an ℓ1 norm to capture local sample similarity and a constraint based on global label correlations. During subspace learning, sample neighbors are enforced to have correlated labels. Our method differs substantially from existing methods in two aspects:

1. Different from most subspace learning methods, our method is developed for modeling label ambiguity. Most subspace learning methods emphasize the discrimination ability of the neighboring relations, and the learned affinity graph is encouraged to be block-diagonal. Conversely, we aim to uncover neighboring relations among different but correlated labels. Figure 3 shows that our affinity graph is consistent with the global label correlations and is not block-diagonal.
2. Most LDL methods are invalid for generic SLL problems, whereas we model label ambiguity by jointly capturing local sample similarity and global label correlations. Although the data-dependent LDL (D2LDL) proposed by He et al. [21] has the potential to handle generic SLL problems, it only uses an ℓ1 norm to uncover sample neighbors and overlooks label correlations. In contrast, we introduce label correlations to uncover sample neighbors, which can reduce label noise, as explained in Figure 2.
Figure 3. Flowchart of our method.
The main contributions of this paper are two-fold:


1. We define a subspace learning problem, which jointly captures local sample similarity and global label correlations. The neighbor-based label distribution can robustly express label ambiguity.
2. To our knowledge, this is the first LDL work that can manage generic SLL problems. Experimental results demonstrate that using the label distributions can prevent CNNs from over-fitting and assist feature learning.
The remainder of this paper is organized as follows: Section 2 reviews related works, Section 3 formulates the proposed method in detail, Section 4 reports the experimental results, Section 5 presents the discussion, and Section 6 concludes this paper.

Deep Learning
Currently, deep learning has been acknowledged as the most successful and widely used approach for aerial scene classification. CNNs are able to produce task-specific deep features that are automatically learned from training sets [1,2], which requires little feature engineering by hand. Thus, the deep features enjoy better representation ability than the low-level features (e.g., local binary patterns) or mid-level features (e.g., bag of features). How to further improve the deep features remains a hot research topic [22][23][24][25]. Recently, the attention mechanism has been introduced into network structures to learn more discriminative features, such as the spatial attention [24] for capturing class-specific regions, or the channel attention [25] for selection of important features. As the deep features are usually redundant, some methods adopt meta-heuristic algorithms [26] to select the most effective features among the high-dimensional features [27,28], which leads to compact and robust features.
Overall, many studies related to deep learning focus on network structures or feature selection; however, there are limited works concerning label representations. In this paper, we build a label representation based on label distributions, which is used to guide the feature learning during network training.

Label Distribution Learning (LDL)
As an extension of MLL [3][4][5], LDL [6,7] is a paradigm for handling label ambiguity. Studies on LDL include two aspects: the former is to learn accurate classifiers [29][30][31], and the latter is to build label distributions that can express label ambiguity. The goal of this paper is to construct label distributions for aerial images. The strategies for constructing the distributions can be summarized as three types: The first type is the Gaussian-based label distribution [7], as shown in Figure 1. The Gaussian distribution has established its effectiveness in the fields of age estimation [8], pose estimation [10], and crowd counting [11]. The problem of age estimation can be cast as a classification problem by viewing each age as a label. The classification performance can be improved by using the Gaussian distribution for CNN training. Similarly, using the Gaussian distribution yields satisfactory performance for pose estimation and crowd counting.
The second type is incomplete label distribution learning (IncomLDL) [32], which aims to recover missing labels from partially labeled samples provided by humans. IncomLDL methods [32][33][34][35] usually regularize the label distribution matrix of all samples with a manifold constraint and a low-rank constraint. The former captures local similarity [33]. The latter models the label correlations that exist in the partially labeled samples (e.g., label co-occurrence), which forces the completed matrix to agree with the existing label correlations [32]. However, IncomLDL methods are invalid for generic SLL problems due to the lack of label correlations that can guide the construction of label distributions.
The third type is the adaptive label distribution [21,36,37,38], which aims to enhance the adaptation to data variations. Among the methods for adaptive label distributions, the data-dependent LDL (D2LDL) [21] has the potential to be transferred to scene classification problems; it models label ambiguity regarding human age through sample neighbors, and the label distributions are computed by label propagation. D2LDL adopts subspace learning to compute the sample affinity graph, and the coefficient matrix is constrained by the ℓ1 norm to obtain sparse neighbors. Compared to the Gaussian distribution, D2LDL is more flexible with respect to data variations. However, D2LDL only captures local similarity through the ℓ1 norm and overlooks label correlations, which may yield neighbors that conflict with label correlations and thus produce label noise, as explained in Figure 2.

Subspace Learning
Subspace learning [12,13] has gained success in semi-supervised learning and clustering problems [15][16][17][18][19]. The affinity graph produced by subspace learning can provide the graph Laplacian matrix for semi-supervised learning [17] and can also be used for spectral clustering [15]. The self-expressiveness assumption [12] states that each sample can be represented as a linear combination of the other samples. Thus, samples are embedded into many local subspaces expressed by the representation coefficient matrix, and samples that lie in the same subspace are neighboring. Formally, denote the sample features and the coefficient matrix as X and Z, respectively, where X ∈ R^{D×N}, Z ∈ R^{N×N}, and D, N represent the feature dimension and the sample number, respectively. Sparse subspace clustering (SSC) [12] is a typical subspace learning approach, which is formulated as:
min_{Z,E} ||Z||_1 + λ_e ||E||_1, s.t. X = XZ + E, diag(Z) = 0, (1)
where the ℓ1 norm ||·||_1 guarantees sparsity [39] and λ_e is a trade-off parameter. The representation error E ∈ R^{D×N} is incorporated to resist outliers. The operator diag(·) constructs diagonal matrices, and the constraint diag(Z) = 0 prevents Z from being an identity matrix. A non-zero coefficient Z(m, n) indicates that samples m and n lie in the same subspace and are neighboring. To ensure that the neighboring relations are non-negative and symmetric, the affinity graph A is defined as:
A = 1/2 (|Z| + |Z|^T). (2)
SSC can uncover adaptive and sparse neighbors and is robust to outliers. Thus, SSC and its variants [40][41][42][43] have achieved impressive clustering performance. In SSL or clustering problems, the desirable neighbor assignment is that the affinity graph has exactly C connected components (i.e., a block-diagonal structure), where C is the class number, and various constraints have been proposed to pursue this structure. The main symbols used in this paper are listed in Table 1.
Table 1. Symbol descriptions.

N: The number of training samples
C: The number of labels
D: The dimension of the sample features
y_n: The ground truth of the nth sample
Y ∈ R^{C×N}: The binary matrix of sample ground-truth labels
X ∈ R^{D×N}: The set of sample features
G ∈ R^{C×C}: The matrix of global label correlations
Z ∈ R^{N×N}: The matrix of representation coefficients
A ∈ R^{N×N}: The sample affinity graph
P ∈ R^{C×N}: The set of label distributions
P_r ∈ R^{C×N}: The set of rectified label distributions

Modeling Label Ambiguity
Label ambiguity originates from visually similar samples that have different but correlated labels. As shown in Figure 2, the label ambiguity of the training sample is caused by the similar images of Square and Stadium. In Figure 2, although the Rail S image is similar to the training sample, it is unreasonable to assume label ambiguity between the training sample and the Rail S image because scenes Center and Rail S are substantially different in terms of the global label correlations. According to the definition of LDL [6,7], label ambiguity should agree with label correlations. As shown in Figure 1, the Gaussian distribution only includes the labels that are near the sample ground truth. In the IncomLDL methods [32], the label distribution matrix is regularized by a low-rank constraint to capture label correlations, which forces the completed matrix to accord with the label correlations.
Our goal is to uncover the sample neighbors that cause label ambiguity. As explained in SSC [12], a non-zero coefficient Z(m, n) indicates that samples m and n are neighboring. To model label ambiguity, ideal neighboring relations should meet the following properties:
1. Agreement with local sample similarity. If samples m and n have similar features, samples m and n are neighboring, i.e., Z(m, n) ≠ 0.
2. Consistency with global label correlations. If samples m and n have substantially different labels, samples m and n are not neighboring, i.e., Z(m, n) = 0.

Uncovering Sample Neighbors
We define a subspace learning problem to uncover the ideal neighboring relations. Firstly, a global label correlation matrix G is predefined to formulate label differences, where G(i, j) is specified as a large value if labels i and j are substantially different, and G(i, j) is small if labels i and j are correlated. By introducing G into objective Function (1), the subspace learning problem is formulated as follows:
min_{Z,E} ||Z||_1 + α_z Σ_{m,n} G(y_m, y_n) |Z(m, n)| + λ_e ||E||_1, s.t. X = XZ + E, diag(Z) = 0, (3)
where α_z is a trade-off parameter that we set to 1.0. On the one hand, if the sample features X(·, m) and X(·, n) are similar, Z(m, n) is large because each sample tends to select similar samples to reconstruct itself. On the other hand, if samples m and n have substantially different labels, the large element G(y_m, y_n) enforces Z(m, n) = 0, and thus the neighboring relation between samples m and n is prohibited. Hence, objective Function (3) can uncover ideal neighboring relations that jointly capture local similarity and label correlations. Note that D2LDL [21] adopts standard SSC (i.e., objective Function (1)) to compute neighboring relations, which overlooks label correlations.
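To make the role of G in objective Function (3) concrete, the minimal NumPy sketch below builds the per-pair penalty weights G(y_m, y_n) for all training pairs; this N × N weight matrix is what the Optimization subsection later denotes Θ. The function name and the toy values are ours, not the paper's.

```python
import numpy as np

def pairwise_penalty(G, y):
    """Build the N x N penalty weights Theta(m, n) = G(y_m, y_n).

    G : (C, C) array of predefined global label differences.
    y : (N,) array of integer ground-truth labels.
    A large Theta(m, n) discourages a neighboring relation Z(m, n)
    between samples whose labels are substantially different.
    """
    return G[np.ix_(y, y)]  # Theta[m, n] = G[y[m], y[n]]

# toy usage: 3 labels, 4 samples (values are illustrative only)
G = np.array([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.8],
              [0.9, 0.8, 0.0]])
y = np.array([0, 0, 1, 2])
Theta = pairwise_penalty(G, y)
print(Theta.shape)  # (4, 4)
```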

Predefining Label Correlations
The global label correlations G should be properly predefined. A simple yet effective approach for evaluating label correlations is to utilize additional knowledge from confusion matrices [44,45]. Following this approach, we use the Library for support vector machines (LIBSVM) [46] to train classifiers on training sets, and compute confusion matrices on validation sets. We use pretrained CNNs as feature extractors, in which the activations of the penultimate FC layer serve as image features.
Denote the confusion matrix as H ∈ R^{C×C}, where element H(i, j) denotes the rate at which samples of label i are classified as label j. According to Wang et al. [44], the symmetric similarity matrix S ∈ R^{C×C} is computed as S = 1/2 (H + H^T). A large S(i, j) implies a similar label pair (i, j). In this paper, label pairs (i, j) that are in the top 20% in terms of similarity are regarded as correlated labels. Specifically, we sort all the non-diagonal elements {S(i, j)}_{i≠j} of S in descending order and delete the repeated values from the sequence. This leads to a sequence (s_1, s_2, ..., s_{N_0}), in which s_1 corresponds to the most similar label pair. Labels i and j are regarded as correlated if S(i, j) ≥ s_{⌊0.2 N_0⌋}, where ⌊·⌋ denotes the rounding-down operator. Denote the maximal and minimal values among the classification accuracies {S(i, i)}_{i=1∼C} as s_max and s_min, respectively. Since the role of G is to express label differences, G(i, j) should be inversely proportional to S(i, j). Soft mappings, illustrated in Figure 4, are defined to compute the elements of G (Equation (4)), where the hyper-parameters a_min, a_max, b_min, and b_max are set to 0.1, 0.3, 0.01, and 0.09, respectively. On the one hand, the label differences for correlated label pairs range from a_min to a_max, and a high S(i, j) leads to a small G(i, j). On the other hand, a low classification accuracy S(i, i) causes a relatively large G(i, j), as the samples of label i tend to be scattered.

Figure 4. Soft mappings from label similarities S to label differences G (one mapping for the diagonal elements and one for the non-diagonal elements).
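The pipeline above (confusion matrix H, similarity S, top 20% correlated pairs, mapping to G) can be sketched as follows. Since the exact soft mappings of Equation (4) are only described qualitatively here, the linear inverse mapping used for correlated pairs and the fixed value used for uncorrelated pairs are our own simplified stand-ins, and the diagonal-dependent adjustment controlled by b_min and b_max is omitted.

```python
import numpy as np

def label_differences(H, a_min=0.1, a_max=0.3, uncorrelated=1.0):
    """Derive label differences G from a confusion matrix H (C x C).

    Correlated pairs (top 20% of off-diagonal similarities) receive a small
    difference in [a_min, a_max], inversely related to S(i, j); all other
    pairs receive a fixed large difference ('uncorrelated', our placeholder).
    """
    C = H.shape[0]
    S = 0.5 * (H + H.T)                        # symmetric label similarity
    off_diag = ~np.eye(C, dtype=bool)
    vals = np.unique(S[off_diag])[::-1]        # descending, duplicates removed
    idx = max(int(np.floor(0.2 * len(vals))), 1) - 1
    thresh = vals[idx]                         # s_{floor(0.2 * N0)}
    G = np.full((C, C), uncorrelated)
    np.fill_diagonal(G, 0.0)
    corr = (S >= thresh) & off_diag
    s_lo, s_hi = S[corr].min(), S[corr].max()
    scale = (S[corr] - s_lo) / max(s_hi - s_lo, 1e-12)
    G[corr] = a_max - (a_max - a_min) * scale  # higher similarity -> smaller difference
    return G
```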


Optimization
We optimize objective Function (3) by resorting to the alternating direction method of multipliers (ADMM) [47]. First, we replace G with another correlation matrix Θ ∈ R^{N×N}, the elements of which are Θ(m, n) = G(y_m, y_n); Θ(m, n) is large when samples m and n have substantially different labels. In addition, an auxiliary matrix J ∈ R^{N×N} is introduced. Considering Θ and J, objective (3) is rewritten as the equivalent objective (5), in which the label-correlation penalty is expressed through an elementwise (Hadamard) product with Θ. Using ADMM, the augmented Lagrange function of objective (5) is constructed as in Equation (6), where B_1 and B_2 are Lagrange multipliers and µ is a penalty parameter. Since L(·) is separable, we can alternately update Z, J, E, B_1, and B_2 while fixing the others.

Update of Z:
We update Z by solving the subproblem in Equation (7), where t denotes the t-th iteration and U_t is an auxiliary matrix formed from the variables fixed at this iteration. The elements of Z can then be obtained by applying the soft-thresholding operator [47], as given in Equation (8). Solution (8) shows that a large Θ(m, n) encourages Z_{t+1}(m, n) = 0, which eliminates the neighboring relations between two substantially different labels y_m and y_n. Thus, the uncovered neighboring relations are consistent with the label correlations.
Update of J: Setting the derivative of L(·) with respect to J to zero yields the closed-form solution in Equation (9), which involves the constant term (X^T X + I)^{-1}.
Update of E: While the other variables are fixed, we update E by solving the subproblem in Equation (10); its elements can be obtained by applying the soft-thresholding operator [47].
Update of B_1 and B_2: The Lagrange multipliers are updated by the gradient ascent procedure in Equation (11).
For clarity, the ADMM algorithm for solving objective (3) is outlined in Algorithm 1.
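The soft-thresholding (shrinkage) operator invoked in the Z and E updates has the standard closed form S_tau(x) = sign(x) * max(|x| - tau, 0); a minimal sketch is given below. The exact quantities it is applied to in Equations (8) and (10) follow the paper's ADMM derivation, which is not reproduced here.

```python
import numpy as np

def soft_threshold(X, tau):
    """Elementwise soft-thresholding S_tau(x) = sign(x) * max(|x| - tau, 0).

    tau may be a scalar or an array broadcastable to X's shape, which allows
    a per-element threshold (e.g., one weighted by Theta(m, n)): a larger
    threshold pushes the corresponding coefficient to exactly zero.
    """
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)
```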

Algorithm 1: Solving the objective Function (3) through ADMM
Input: X, Θ, α_z, λ_e, µ_0
Initialize: compute the constant term (X^T X + I)^{-1} in Equation (9)
While the convergence conditions are not satisfied:
1: update Z_{t+1} according to Equations (7) and (8)
2: update J_{t+1} according to Equation (9)
3: update E_{t+1} according to Equation (10)
4: update B_{1,t+1} and B_{2,t+1} according to Equation (11)
5: update µ_t by µ_{t+1} = min(µ_max, ρ µ_t)
6: check the convergence conditions

Constructing Label Distributions
Label ambiguity can be modeled through the affinity graph A, and we construct the label distributions through label propagation. Denote the matrix of sample ground truth as Y, where Y ∈ R^{C×N} with Y(i, n) = 1 if i = y_n and Y(i, n) = 0 otherwise. Denote the label distributions of all samples as P, where P ∈ R^{C×N}; Y(·, n) and P(·, n) represent the ground truth and the label distribution of the nth sample, respectively. According to Equation (2), the affinity graph is determined by the coefficient matrix, i.e., A = 1/2 (|Z| + |Z|^T). The label distribution P(·, n) is computed by the following label propagation:
P(·, n) = Y A(·, n) / d_n, (12)
where d_n is the degree [15] of sample n (i.e., d_n = Σ_m A(m, n)) and the term A(·, n)/d_n represents the transition probabilities [48]; d_n acts as the normalization term to ensure that the elements of P(·, n) sum to one. Equation (12) shows that if sample m is a neighbor of sample n, a positive A(m, n) incorporates the label of sample m into the label distribution of sample n; if the two samples are not neighboring, A(m, n) = 0, and the label of sample m has no influence on P(·, n). Thus, sample n is associated with the labels of its neighbors, and P(·, n) describes the label ambiguity regarding sample n.
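A compact sketch of the propagation step: the affinity graph is symmetrized from the learned coefficients as in Equation (2), each column is normalized by its degree, and the one-hot label matrix is propagated through it. Shapes and names are ours.

```python
import numpy as np

def propagate_labels(Z, Y, eps=1e-12):
    """Label propagation over the affinity graph.

    Z : (N, N) representation coefficients from subspace learning.
    Y : (C, N) one-hot ground-truth matrix.
    Returns P (C, N), where P[:, n] = Y @ A[:, n] / d_n and d_n = sum_m A[m, n].
    """
    A = 0.5 * (np.abs(Z) + np.abs(Z).T)   # non-negative, symmetric affinity
    d = A.sum(axis=0) + eps               # degree of each sample
    P = Y @ (A / d)                       # columnwise transition probabilities
    return P
```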

Rectifying Label Distributions
We use the label distributions for CNN training, and thus the sample ground truth should account for the highest intensity in each distribution. Since images of the same class usually share similar features, the majority of a sample's neighbors have the same label as its ground truth, so in most label distributions the ground truth already has the highest intensity. However, the adaptively uncovered neighbors cannot ensure that the ground truth always occupies the highest intensity in all label distributions. Referring to D2LDL, which uses the ground truth to rectify label distributions [21], we compute the rectified distributions P_r according to Equation (13), in which the term 0.5Y ensures that y_n has the highest intensity in P_r(·, n). We use P_r for CNN learning.

CNN Learning Framework
The label distributions are used as label-level regularization in network learning. The network jointly learns the task of scene classification and the task of learning the label distributions P_r. The multitask loss is defined as
L = L_cls(I) + λ_l L_ldl(I), (14)
where I represents the set of training images and λ_l is a trade-off parameter that we set to 0.5. The term L_cls(·) adopts the common softmax loss, and the term L_ldl(·) uses the Kullback-Leibler (KL) divergence:
KL(P_r(·, n), P̂(·, n)) = Σ_i P_r(i, n) log( P_r(i, n) / P̂(i, n) ), (15)
where P̂(·, n) is the network output generated by a softmax function. Hence, L_ldl(·) is computed by accumulating the KL divergence in Equation (15) over the training images (Equation (16)).
The learning framework is supervised not only by the sample ground truth but also by the side information about the label ambiguity of the samples. Compared to the ground truth, the label distributions are more informative since they incorporate correlated labels. As mentioned in previous studies [7,9,10,49], informative label representations enable CNNs to learn robust features. Additionally, the label distribution amounts to label smoothing in terms of label ambiguity. As explained in References [49][50][51][52], label smoothing can prevent networks from over-fitting.
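A minimal PyTorch sketch of the multitask objective: the standard cross-entropy (softmax) classification loss plus a KL-divergence term between the network's softmax output and the rectified label distribution, weighted by λ_l = 0.5. It assumes the targets P_r are supplied as dense per-class distributions for each image in the batch; tensor names are ours.

```python
import torch
import torch.nn.functional as F

def multitask_loss(logits, y, P_r, lambda_l=0.5):
    """logits: (B, C) network outputs; y: (B,) ground-truth class indices;
    P_r: (B, C) rectified label distributions for the batch."""
    loss_cls = F.cross_entropy(logits, y)            # softmax loss on the ground truth
    log_q = F.log_softmax(logits, dim=1)              # log of the predicted distribution
    # KL(P_r || softmax(logits)), averaged over the batch
    loss_ldl = F.kl_div(log_q, P_r, reduction='batchmean')
    return loss_cls + lambda_l * loss_ldl
```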

Experiments and Results
To examine the role of modeling label ambiguity, the proposed method is applied to two CNN backbones, namely VGGNet (VGG) and ResNet, and we conduct experiments on two aerial scene datasets: the aerial image dataset (AID) [1] and NWPU_RESISC45 (NR) [2]. As shown in Figure 5, the two datasets are challenging due to the large intraclass variations and small interclass distinctions.
Following the common evaluation protocol of aerial scene classification [1,2], we train networks under the designated training ratios, and report the overall accuracy (OA) on testing sets. To obtain stable results, we report the average accuracies over 5 trials. For preprocessing, we resize all images to 224 × 224, subtract the mean from the resized images, and divide them by the standard deviation for each color channel.
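The preprocessing described above corresponds to a standard torchvision pipeline; a sketch is given below. The per-channel mean and standard deviation are placeholders (ImageNet statistics), since the statistics actually used are not specified here.

```python
from torchvision import transforms

# Placeholder statistics; substitute the per-channel mean/std actually used.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),      # resize all images to 224 x 224
    transforms.ToTensor(),              # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(MEAN, STD),    # subtract mean, divide by std per channel
])
```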

Network Backbones
The popular VGGNet and ResNet have demonstrated promising performance for aerial scene classification. We separately utilize VGG-16bn [60] and ResNet50 [61] as the backbones of our networks. The networks are initialized with weights pretrained on ImageNet, and the weights are optimized using stochastic gradient descent. The learning rate, momentum, and weight decay are set to 0.001, 0.9, and 0.0005, respectively. Each network is trained for 50 epochs with a minibatch size of 16. During training, the learning rate is reduced to one tenth when the loss values stop decreasing.
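This training configuration can be written with standard PyTorch components, as sketched below. ReduceLROnPlateau is one way to realize 'reduce the learning rate to one tenth when the loss stops decreasing'; the specific scheduler and its patience value are our assumptions, not stated in the paper.

```python
import torch
from torchvision import models

model = models.vgg16_bn(pretrained=True)   # older torchvision API; newer versions use weights=...
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
# Divide the learning rate by 10 when the monitored loss plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                       factor=0.1, patience=3)

# Inside the training loop (50 epochs, minibatch size 16), one would call:
#   loss = multitask_loss(logits, y, P_r); loss.backward()
#   optimizer.step(); optimizer.zero_grad()
#   scheduler.step(epoch_loss)
```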

Parameters for Subspace Learning
In objective Function (3), the feature matrix X is extracted from the penultimate FC layer of the pretrained networks (VGG or ResNet), and we reduce the feature dimension D to 100 through principal component analysis (PCA). As recommended by Elhamifar et al. [12], λ_e is calculated as β / min_n max_{m≠n} ||X(·, m)||_1, where β is set to 20, as in Reference [15]. The parameter µ_0 in the ADMM algorithm is set to β. For a fair comparison, λ_e and the ADMM parameters are kept the same for both D2LDL and our method.
All the algorithms are developed in Python under the PyTorch framework. The experiments are implemented on a workstation with an I7-8700K CPU and a Titan XP GPU.
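A sketch of this parameter setup: the pretrained activations are reduced to 100 dimensions with PCA, and λ_e = β / min_n max_{m≠n} ||X(·, m)||_1 with β = 20. The feature extraction itself is assumed to have been done beforehand, and the helper name is ours.

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_parameters(features, beta=20.0, dim=100):
    """features: (N, D0) pretrained CNN activations (penultimate FC layer).
    Returns the reduced feature matrix X (dim x N) and lambda_e."""
    X = PCA(n_components=dim).fit_transform(features).T   # (dim, N)
    col_l1 = np.abs(X).sum(axis=0)                         # ||X(:, m)||_1 per sample
    N = X.shape[1]
    # mu_e = min_n max_{m != n} ||X(:, m)||_1
    mu_e = min(np.delete(col_l1, n).max() for n in range(N))
    lambda_e = beta / mu_e
    return X, lambda_e
```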

Analysis of Label Distributions
To explore the uncovered sample neighbors and the constructed label distributions, we conduct experiments on the AID-0.2 dataset using VGG pretrained features. To intuitively illustrate the results of label propagation, we display the label distributions P rather than the rectified distributions P_r.


Agreement with Local Similarity
The images of the same scene may exhibit different label distribution patterns due to the large intraclass variations, but within the same class, similar images should share close distribution patterns. To observe the distribution patterns, we group the label distributions of each class into 3 clusters by K-means, as shown in Figure 6. The local similarity among the cluster 3 images reflects that these images share a similar building appearance with the Indus images. Accordingly, the label distributions in cluster 3 specify relatively high intensities on the Indus label, which agrees with the local similarity. The explanation is that a sample tends to select similar neighbors to reconstruct itself during subspace learning, and thus the images in cluster 3 have many Indus neighbors. Analogously, the local similarity among cluster 1 shows that the images contain obvious railway station buildings, and thus the distributions in cluster 1 are dominated by the Rail S label. Therefore, the label distributions agree with local similarity and can adapt to the intraclass variations. Additionally, our distributions also support the conclusions of recent studies [21,36,37,38,62,63]: label distributions should vary with data variations.
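The per-class clustering used for this visualization is plain K-means over the label-distribution vectors; a small sketch assuming scikit-learn is shown below, with function and variable names of our own choosing.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_distributions(P, y, label, n_clusters=3, seed=0):
    """Group the label distributions of one class into clusters.

    P : (C, N) label distributions; y : (N,) ground-truth labels.
    Returns the sample indices of the given class and their cluster ids."""
    cols = np.where(y == label)[0]
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    return cols, km.fit_predict(P[:, cols].T)   # one row per sample
```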


Consistence with Label Correlations
Neighboring relations that cause label ambiguity should be consistent with label correlations. Figure 7 shows the global label correlations G evaluated on the confusion matrices and the affinity graph A determined by the coefficient matrix Z. We have the following 3 observations:
1. The dark elements in G indicate the correlated label pairs. For example, the green boxes show that scenes Rail S and Indus are correlated. The explanation is that many Rail S images and Indus images are confused with each other in the validation sets.
2. The light elements in A are globally consistent with the dark elements in G. For example, the red boxes suggest that many Rail S samples have Indus neighbors. The explanation is that Z is penalized by G in objective Function (3), and thus the neighboring relations are consistent with the label correlations.
3. A is not block-diagonal: there are neighboring relations among samples of different labels. In contrast, the affinity graphs for SSL or clustering problems [15][16][17][18][19][40][41][42][43] are encouraged to be block-diagonal so as to yield discriminative neighbors.


Comparisons with D2LDL
D2LDL [21] exploits the ℓ1 norm to uncover sample neighbors but overlooks label correlations, which may cause label noise. The label noise reflects unreasonable neighbors, such as the Rail S image in Figure 2. Serious noise may cause distorted distributions, in which the sample ground truth fails to account for the highest intensity. We refer to the distribution P(·, n) as a distorted distribution if P(y_n, n) ≤ P(i ≠ y_n, n). Figure 8 shows the label distributions produced by D2LDL and our method, and N_d is the number of distorted distributions, which can be regarded as a measure of the noise level. We have the following 3 observations:
1. D2LDL can construct adaptive label distributions but results in many distorted distributions, which imply gross label noise. For example, the yellow rectangle shows that many Rail S samples are associated with label Indus, which reflects the correlation between Rail S and Indus.
2. Our method produces far fewer distorted distributions than D2LDL (138 vs. 411), which suggests a reduction of noise. Compared to the yellow rectangular region, the red rectangular region is 'cleaner', which indicates that fewer samples are contaminated by noisy labels. Therefore, label noise can be reduced by introducing label correlations to regularize the discovery of neighbors.
3. The distorted distributions of our method originate from severely confused image contents. As shown in Figure 9, the image semantics are confused with the ambiguous labels even by humans.
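The noise measure N_d simply counts how many distributions no longer have the ground-truth label as their largest entry; a one-line check is sketched below (ties are ignored for simplicity).

```python
import numpy as np

def count_distorted(P, y):
    """Number of distorted distributions: the ground truth y_n is not the
    largest entry of P(:, n). P: (C, N) distributions, y: (N,) labels."""
    return int(np.sum(P.argmax(axis=0) != y))
```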

Analysis of the Network Performance
In this section, we explore the proposed method from the perspective of classification performance and learned features.

Classification Accuracy
The label distributions are used as label representations for CNN training, and we adopt five methods to construct label representations, as described in Table 2.
Table 2. Methods for constructing label representations of training samples.

Original CNNs: Sample ground truth is used to train the original VGG and ResNet.
Label smoothing regularization (LSR) [50,52]: LSR is the standard label smoothing method, which handles label ambiguity by uniform label smoothing. The smoothing parameter is set to 0.1.
Confusion weighted loss (CWL) [44,45]: CWL encodes label ambiguity on the basis of the confusion matrices on validation sets. Label pairs with high confusion proportions are posited to be correlated, and the label intensity is smoothed from the sample ground truth to the correlated labels. The label representations are class-specific. We set the confusion thresholds [45] for AID and NR to 0.02 and 0.03, respectively, due to the lower accuracy achieved on NR.
D2LDL: D2LDL constructs neighbor-based label distributions but overlooks label correlations. The label distributions are rectified by Equation (13).
N-LDL: N-LDL jointly captures local similarity and label correlations. The label distributions are rectified by Equation (13).
In each comparison experiment, our label distributions are replaced by the label representations produced by the competing approaches, while the other parts of the CNN learning framework remain unchanged. The classification accuracies on the testing sets of AID-0.2 and NR-0.1 are listed in Table 3. The comparisons demonstrate the advantages of our method over the competing approaches, regardless of the dataset or network backbone. We observe the following:
1. LSR, CWL, and D2LDL all achieve higher accuracies than using the sample ground truth; hence, the efforts to address label ambiguity are beneficial for aerial scene classification.
2. CWL yields competitive performance compared to D2LDL, which demonstrates the role of incorporating label correlations into label representations.
3. Our methods clearly outperform CWL. The reason is that the label representations of CWL are class-specific and thus are too inflexible to adapt to the large intraclass variation of scene images. In contrast, our neighbor-based label distributions capture local sample similarity and can express different patterns of label distributions.
4. Our methods substantially outperform D2LDL, which highlights the effectiveness of considering label correlations. Comparing the confusion matrices, there are large accuracy gaps in the scenes of Center (0.81 vs. 0.78), Rail S (0.91 vs. 0.84), and School (0.74 vs. 0.69). These scenes are usually comprised of complicated visual contents and are prone to the involvement of diverse scene labels; accordingly, these scenes are susceptible to label noise. Facilitated by label correlations, our methods reduce the label noise and can robustly encode label ambiguity. Therefore, our label distribution is effective in representing complex scenes.


Feature Robustness
The label distributions are more informative compared to sample ground truth and can enhance feature robustness. We select the outputs of the fc7 layer in the VGG backbone as image features, and the corresponding two-dimensional representations generated by the t-Distributed Stochastic Neighbor Embedding (t-SNE) [64] are plotted in Figure 11. We observe that features produced by N-LDL-v are more compact than those produced by original VGG, such as the features of scenes Airport (Airpo), Park, and Port. This compactness suggests that feature robustness is improved by using the informative label distributions.


Comparisons with State-of-the-Art Methods
Our method is compared with previous methods for aerial scene classification, which are listed in Table 4 and have been validated on AID-0.2 and NR-0.1 in the original literature. We have the following observations:
1. Although we only use common network structures (i.e., ResNet), our method achieves performance comparable to recent deep learning methods that devise complicated network structures, such as Attention-GAN [22] and CapsNet [56]. The explanation is that the label distributions substantially improve data utilization by associating samples with multiple labels. Thus, our method is a simple yet effective approach for aerial scene classification.
2. SF-CNN [57] slightly surpasses our method on NR-0.1 (89.89 vs. 89.80). SF-CNN, namely the scale-free CNN, enlarges the sample number by 4 times through resizing each image into multiple sizes.


Experiments Using Different Sizes of Datasets
Our method achieves satisfactory performance on the small datasets (i.e., AID-0.2 and NR-0.1), and we further study our method on relatively large datasets by using data augmentation and relatively large training ratios. For data augmentation, training images are randomly cropped at 50% of the original image coverage, and the cropped and original images are flipped horizontally or vertically; thus, the sizes of the training sets are enlarged by 4 times. Label distributions of the augmented images are also constructed through subspace learning. The augmented training sets for AID-0.2 and NR-0.1 are denoted as Aug AID-0.2 and Aug NR-0.1, respectively. On the other hand, we increase the training ratios for the AID and NR datasets to 0.5 and 0.2, forming the data divisions AID-0.5 and NR-0.2, respectively. The validation ratios remain unchanged, and the rest of the images serve as testing sets. Both AID-0.5 and NR-0.2 are also commonly used data divisions [53][54][55][56][57]. The original VGG is selected as the baseline. Using different sizes of training sets, Figure 12 plots the learning curves and Figure 13 presents the classification results. We have the following observations:
(1) Our method can mitigate over-fitting problems.
Using different sizes of training sets, Figure 12 shows that the original VGG always yields saturated training accuracy, nearly 100%, which indicates over-fitting problems. However, N-LDL produces lower training accuracy but higher validation accuracy, and the gap between training accuracy and validation accuracy becomes smaller, which suggests less over-fitting and better generalization.
As mentioned by Szegedyet al. [50] and Pereyra et al. [51], despite using large datasets, networks are still prone to over-fitting due to networks learning to assign full probability to sample ground truth and thus outputting too-confident predictions. Some studies [50,51,62] demonstrate that label smoothing can alleviate over-fitting by maintaining reasonable ratios between the logits of the correct and incorrect classes. The label distributions are label smoothing in terms of label ambiguity. N-LDL enables networks to assign probability to correlated labels, which helps to improve generalization. (2) Our method is effective especially for small datasets. Figure 13 demonstrates that the accuracy improvements brought about by our method on the small datasets (e.g., AID-0.2) are more significant than that on the relatively large datasets (e.g., Aug AID-0.2 and AID-0.5). When the original VGG is trained with the small datasets, over-fitting problems seriously degrade the generation ability of learned features. By alleviating over-fitting, N-LDL-v significantly improves classification perfor- adaptive and reduce the damage caused by over-fitting. As a result, the improvements brought by N-LDL on the large datasets are degraded.
Our label distributions regularize the output distributions of networks, which can improve generalization. As explained by studies [65,66] that also regularize output distributions, such attempts to improve generalization are especially effective when the amount of training data is limited. Therefore, the benefit of our method is more significant for small datasets.

Figure 13. Classification results using different sizes of training sets on the (a) AID and (b) NR datasets.
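To make the augmentation setting above concrete, the following is a minimal sketch, assuming that "50% of the original image coverage" means a random crop retaining roughly half of the original area and that each image and its crop receive one horizontal or vertical flip; it is not the authors' released code, and all names are illustrative.

```python
# Hedged sketch of the augmentation described above (not the authors' exact code).
# Each call returns four images, matching the fourfold enlargement of the training set.
import math
import random
from PIL import Image, ImageOps

def augment(img: Image.Image):
    """Return the original, its flip, a random ~50%-area crop, and the crop's flip."""
    w, h = img.size
    cw, ch = int(w / math.sqrt(2)), int(h / math.sqrt(2))    # crop keeps ~50% of the area
    x, y = random.randint(0, w - cw), random.randint(0, h - ch)
    crop = img.crop((x, y, x + cw, y + ch))
    flip = random.choice([ImageOps.mirror, ImageOps.flip])   # horizontal or vertical flip
    return [img, flip(img), crop, flip(crop)]
```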
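The label distributions enter training as soft targets that replace the one-hot labels in the cross-entropy loss. The sketch below shows one plausible PyTorch formulation of such a soft-target loss; `model`, `loader`, and the way the label distributions are stored are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: train a classifier against a precomputed label distribution
# instead of a one-hot label, so probability mass is also assigned to correlated labels.
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, target_dist):
    """Cross-entropy between predicted and target distributions.

    logits:      (batch, num_classes) raw network outputs
    target_dist: (batch, num_classes) rows sum to 1 (e.g., N-LDL label distributions)
    """
    log_probs = F.log_softmax(logits, dim=1)
    return -(target_dist * log_probs).sum(dim=1).mean()

# Usage inside a standard training loop (hypothetical names):
# for images, label_dist in loader:           # label_dist built offline before training
#     loss = soft_cross_entropy(model(images), label_dist)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```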

Influence of Parameters and Time Efficiency
In this subsection, we discuss the influence of parameters and the time efficiency of the proposed method. The experiments are conducted using VGG pretrained features.

Influence of α_z
α_z controls the importance of the global label correlations in the objective Function (3). Figure 14 presents the resulting label distributions for different values of α_z on the AID-0.2 dataset. Similar to Figure 8, we display 10 label distributions per class on AID-0.2, since the training samples are too numerous to be fully presented. A too small α_z (e.g., 0.1) causes substantial noise because the effect of the label correlations is weak. Conversely, a too large α_z (e.g., 10) severely reduces the intensities of the correlated labels, which degrades the ability to express label ambiguity. Therefore, we fix α_z to 1.0 in our method.

Influence of D
We use pretrained CNN features for subspace learning. To accelerate subspace learning, we reduce the feature dimension, D, via PCA; a minimal sketch of this reduction is given below. Figure 15 plots the influence of different feature dimensions D. On the one hand, low feature dimensions substantially reduce the time needed to optimize Equation (3). On the other hand, when D ≥ 50, decreasing D has little influence on the classification accuracies. The explanation is that the original CNN features are high-dimensional and redundant, and reducing the feature dimension via PCA still preserves most of the image information. Thus, the features transformed by PCA can still convey the major image content and remain valid for uncovering sample neighbors. However, a too small D (e.g., 10) is insufficient to fully represent image semantics, which is harmful for constructing proper label distributions and thus degrades the classification performance. To balance time consumption and representation ability, we set D to 100.
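A minimal sketch of this dimension reduction, assuming the pretrained VGG features are available as a NumPy array; the file name is a hypothetical placeholder.

```python
# Illustrative sketch: reduce pretrained CNN features to D = 100 dimensions with PCA
# before subspace learning.
import numpy as np
from sklearn.decomposition import PCA

features = np.load("vgg_features.npy")         # hypothetical file, shape (N, 4096)

pca = PCA(n_components=100)                    # D = 100 balances speed and fidelity
reduced = pca.fit_transform(features)          # shape (N, 100), input to Equation (3)
print("retained variance:", pca.explained_variance_ratio_.sum())
```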

Time Efficiency
Table 5 summarizes the time consumption of each step for constructing our label distributions. Predefining the label correlations consumes the most time because it includes extracting the sample features. Optimizing the objective Function (3) is fast, and Figure 16b plots the convergence curves. The total time is relatively short, so it is feasible to construct our label distributions for network training.

Table 5. Time consumption (s) for constructing the label distributions of all the training samples.

Summary of the Experiment Results
According to the experiment results, the proposed N-LDL has the following advantages:

1. Our label distributions can robustly model label ambiguity for aerial images. Compared to D2LDL, our method substantially reduced label noise by incorporating label correlations, as shown in Figure 8.

2. Using our label distributions for network training yielded competitive classification performance. Compared to the label representations of LSR, CWL, or D2LDL, our label distributions led to higher accuracies, as presented in Table 3.

3. Our method can improve the generalization ability of networks and is especially useful for small datasets. Regularizing output distributions with our label distributions helps to mitigate over-fitting problems, as illustrated in Figure 13.

4. It is convenient and time-efficient to apply our method to common network structures. We only introduced label distributions to regularize network learning, and the time for constructing the label distributions was short, as listed in Table 5.

Why Does N-LDL Work Well?
The label representations produced by N-LDL possess low trace values and can capture intrinsic label correlations. The literature on IncomLDL [32,33] and MLL [67,68] has shown that label representations considering label correlations help to improve classification performance. On the basis of the low-rank assumption, a trace norm can be imposed on label representations to capture intrinsic label correlations [67]. Thus, the optimized label representations possess low trace values and can improve classification performance. Loosely speaking, label representations with low trace values have a strong ability to express label correlations; the toy sketch below illustrates this comparison. Figure 17 and Table 6 present the trace values of different label representations. As shown in Table 6, the sample ground-truth labels (i.e., Y) have the largest trace values because label correlations are ignored. In contrast, N-LDL achieves the lowest trace values and can thus express label correlations sufficiently, which leads to the high OA listed in Table 3.
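As a toy illustration of the trace-norm comparison (not the paper's data), the sketch below contrasts a one-hot label matrix with a smoothed one whose classes are grouped into invented correlated triples; the correlation-aware matrix has the smaller nuclear norm.

```python
# Toy comparison of trace (nuclear) norms: one-hot labels versus smoothed labels
# whose classes form invented correlated triples. All data here are synthetic.
import numpy as np

def trace_norm(L):
    """Sum of singular values of a (samples x classes) label matrix."""
    return np.linalg.norm(L, ord="nuc")

rng = np.random.default_rng(0)
n_samples, n_classes = 300, 30
labels = rng.integers(0, n_classes, n_samples)
Y = np.eye(n_classes)[labels]                  # one-hot ground truth, ignores correlations

# Smoothing matrix B: each class keeps 0.8 and leaks 0.1 to the other two classes of
# its (invented) correlated triple, so every row of Y @ B still sums to 1.
B = np.zeros((n_classes, n_classes))
for k in range(0, n_classes, 3):
    B[k:k + 3, k:k + 3] = 0.1
    np.fill_diagonal(B[k:k + 3, k:k + 3], 0.8)
L_smooth = Y @ B

print(trace_norm(Y), trace_norm(L_smooth))     # the correlation-aware matrix is lower
```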

Are There Better Label Distributions?
A limitation of N-LDL is that the label distributions are fixed during network training, and we hope that they can be dynamically updated to further improve the training process. Equation (3) uses the fixed global label correlations G to uncover sample neighbors, and the constructed label distributions remain fixed during CNN training. It would be preferable to jointly learn label correlations, label distributions, and feature representations in a unified framework, which may be characterized by a trace norm and graph convolutional networks (GCNs) [69,70]:

1. As label distributions are expected to have low trace values, a trace norm can be imposed on the network outputs (i.e., the predicted label distributions) in the loss function, which keeps the framework end-to-end trainable (a minimal sketch of such a penalty is given at the end of this subsection).

2. GCNs can learn to map an initial label graph (such as the S or G in this paper) into inter-dependent label embeddings that implicitly model label correlations. These label embeddings project sample features into network outputs, which encourages the predicted label distributions to be consistent with the label correlations.
In this paper, we fixed the label distributions for network training, which makes it convenient to apply N-LDL to various network structures. The experiment results demonstrated the effectiveness of the label distributions. In the future, we plan to build a unified framework that alternately updates the label distributions to further improve network training.
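For the first direction above, a trace-norm penalty on a mini-batch of predicted distributions could be added directly to the training loss so that the framework stays end-to-end trainable. The snippet below is a speculative sketch of that idea, not part of the proposed method; the weight `lam`, `soft_cross_entropy`, and all other names are illustrative.

```python
# Speculative sketch of future work: penalize the nuclear norm of the predicted
# label distributions in a mini-batch to encourage low-trace, correlation-aware outputs.
import torch
import torch.nn.functional as F

def trace_norm_penalty(logits):
    """Nuclear norm (sum of singular values) of the (batch x classes) predictions."""
    probs = F.softmax(logits, dim=1)
    return torch.linalg.svdvals(probs).sum()

# Possible combined objective inside a training loop (hypothetical names):
# loss = soft_cross_entropy(logits, label_dist) + lam * trace_norm_penalty(logits)
```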

Conclusions
In this paper, we proposed neighbor-based label distribution learning (N-LDL) for aerial scene classification, in which subspace learning is adopted to uncover the sample neighbors that cause label ambiguity. In subspace learning, the neighboring relations are regularized by a sparse constraint and the predefined label correlations, which jointly capture local similarity and label correlations. As a result, the uncovered neighbors share correlated labels, and the neighbor-based label distributions express the label ambiguity of the samples. The experiment results demonstrated that using the label distributions for network training can mitigate over-fitting and assist feature learning, and our method yielded competitive classification performance. Additionally, the proposed method has the potential to model label ambiguity for generic single label learning (SLL) problems.