Attribute Learning for SAR Image Classification

This paper presents a classification approach based on attribute learning for high spatial resolution Synthetic Aperture Radar (SAR) images. To explore the representative and discriminative attributes of SAR images, first, an iterative unsupervised algorithm is designed to cluster in the low-level feature space, where the maximum edge response and the ratio of mean-to-variance are included; a cross-validation step is applied to prevent overfitting. Second, the most discriminative clustering centers are sorted out to construct an attribute dictionary. By resorting to the attribute dictionary, a representation vector describing certain categories in the SAR image can be generated, which in turn is used to perform the classification task. The experiments conducted on TerraSAR-X images indicate that the learned attributes have strong visual semantics, which are characterized by bright and dark spots, stripes, or their combinations. The classification method based on these learned attributes achieves a higher average accuracy than the five methods used for comparison.


Introduction
Synthetic Aperture Radar (SAR) is characterized by day-and-night and all-weather imaging ability in remote sensing; moreover, SAR images contain rich information on the imaged area, i.e., the dielectric and geometrical characteristics of the observed object are relevant to the backscattering [1]. Therefore, SAR plays an increasingly important role in various applications, such as urban planning, environmental monitoring, geoscience research, etc. In recent years, with improvements of SAR systems in terms of spatial resolution, high resolution SAR images can provide more detailed and precise information on an observed scene; for this reason, the application of SAR data is highly popular in earth observation [2]. Meanwhile, it also creates challenges in SAR image classification because of the more sophisticated shapes, structures and other details of the target. Consequently, it would be highly desirable to interpret SAR images based on a multi-layer model with clear semantic attributes.
Over the last two decades, several methods have been proposed for SAR image interpretation. Roughly speaking, these methods can be categorized into, but are not limited to, statistical-based, texture-based and model-based methods. Statistical-based methods mainly rely on the fact that SAR images are characterized by statistical properties. Various statistical models have been proposed, e.g., Lognormal, Rayleigh, Fisher distribution [3], $G^0$ [4], etc. Texture-based approaches include the Gray-Level Co-occurrence Matrix (GLCM) [5], Gabor filter [6], sparse coding of wavelet polarization textons [7], etc. Model-based methods, e.g., Markov Random Fields (MRFs) [8], Conditional Random Fields [9], Bayesian Network (BN) [10], are widely used for image analysis. In addition, with improvements in spatial resolution, several works have considered spatial features, e.g., morphological profiles, attribute profiles [11][12][13]. Recently, the Bag-of-Words (BoW) model has been introduced for SAR image classification [14,15]; it is inspired by the texton representation of an image and is based on discriminative low-level features. The multi-layer model is a useful framework for SAR image classification tasks in the case of complex scenes [16]. However, high-level features extracted by the multi-layer model often lack semantics, and it is an intractable problem to directly establish a correspondence between the physical mechanisms of a SAR image and the semantics of the features at a high level.
Recently, attribute learning has received increasing attention in the computer vision community [17,18], and its effectiveness has been demonstrated in various applications. For example, image retrieval based on weak attributes was proposed in [19]; color attributes were applied to object detection [20]; in [21], between-class attributes were utilized for object classification when training and test classes are disjoint. However, the attributes of a SAR image are quite different from those of an optical image because of the coherent imaging mechanism of SAR. Consequently, there are two problems to be solved in the interpretation of SAR images. The first issue is how to explore the elemental attributes for coherent imagery. The second is that the features extracted by the multi-layer model often lack semantic attributes, namely, a semantic gap exists between the extracted features and the interpretation; therefore, explainable feature learning by the multi-layer model is another problem for understanding SAR images.
With the spatial resolution improvement in high-resolution SAR, the objects of interest are no longer limited to several pixels, and more complex and rich information is provided, such as the structures, shapes and other details. As shown in Figure 1, different objects in this SAR image, which was acquired by TerraSAR-X, exhibit evident visual semantics, i.e., bright and dark single spots, dense spots, a linear stripe, or their combination. In this paper, these visual semantics are referred to as the attributes of the SAR image. A classification method based on attribute learning for SAR images is presented. The main contribution is that we explore the learning of attributes for SAR images. In order to learn the discriminative attributes, an unsupervised clustering algorithm [22] is applied to find clustering centers in the space of low-level features, including the ratio of mean-to-variance and the maximum ratio of means in four different directions of a patch, which is sampled at multiple scales from the input image; then, the most discriminative clusters are sorted out to form an attribute dictionary. During the procedure for learning those attributes, cross-validation is applied at each step to prevent overfitting, and the most discriminative clusters are ranked based on the scores from a Support Vector Machine (SVM) [23]. Specifically, these learned attributes possess semantic properties, which is demonstrated by the experimental results, and the classification method shows promising performance.

Attribute Learning
The input image is first partitioned into several patches at multiple scales; then, low-level features, including the maximum edge response and the ratio of mean-to-variance, are extracted for each patch; the last step is iterative clustering in the feature space, and the discriminative and representative features are sorted out to construct an attribute dictionary.

Low-Level Feature Extraction
Because of the presence of speckle noise in SAR imaging, low-level features that are robust to the impact of speckle are desirable in SAR image representation. Here, ratio-based features, namely, the maximum edge response and the ratio of mean-to-variance, are selected for low-level image representation.
The edge response [24] is defined as the ratio of the mean values in two non-overlapping neighborhoods of a patch; to detect all potential edges, the edge response should be computed in all directions in the patch. Here, four elemental directions are considered, which are shown in Figure 2, i.e., 0°, 45°, 90° and 135°, and the maximum edge response is used to describe the edge of the local patch. The maximum edge response $r_m$ is defined as

$$r_m = \max_{i \in \{1, 2, 3, 4\}} r_i, \qquad (1)$$

where $r_i$ denotes the edge response in the $i$th direction; $r_i$ is given by [25] as a ratio of $u_{i1}$ and $u_{i2}$, the averages of the pixel values in the two neighborhoods. The ratio of mean-to-variance $r_{mv}$, which is a statistical feature of a patch, is defined by

$$r_{mv} = \frac{u}{\sigma}, \qquad (3)$$

where $u$ and $\sigma$ denote the mean and variance of the sampled patch.
Both $r_m$ and $r_{mv}$ are used to represent a patch; therefore, the extracted feature vector $q$, with two elements, is given by

$$q = (r_m, r_{mv})^T, \qquad (4)$$

where $(\cdot)^T$ represents the transpose operation.
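As an illustration only, the two ratio-based features can be sketched in plain Python. The sketch assumes a square patch (at least 2 × 2) stored as a list of rows with positive pixel values. It also assumes the normalized ratio $1 - \min(u_{i1}/u_{i2}, u_{i2}/u_{i1})$ for the directional edge response, a common form of the ratio edge detector that the text here does not confirm. All function names and the constant `EPS` are hypothetical.

```python
from statistics import mean, pvariance

EPS = 1e-12  # hypothetical guard against division by zero

# Signed offset of a pixel from the dividing line of each direction, with
# (dy, dx) measured from the patch centre (image rows grow downwards).
OFFSETS = {
    0: lambda dy, dx: dy,         # horizontal line: upper vs. lower half
    45: lambda dy, dx: dx + dy,   # diagonal from lower left to upper right
    90: lambda dy, dx: dx,        # vertical line: left vs. right half
    135: lambda dy, dx: dx - dy,  # diagonal from upper left to lower right
}


def neighbourhood_means(patch, direction):
    """Mean pixel value on each side of the line through the patch centre."""
    centre = (len(patch) - 1) / 2.0
    side_a, side_b = [], []
    for y, row in enumerate(patch):
        for x, value in enumerate(row):
            offset = OFFSETS[direction](y - centre, x - centre)
            if offset < 0:
                side_a.append(value)
            elif offset > 0:
                side_b.append(value)
            # Pixels exactly on the line belong to neither neighbourhood.
    return mean(side_a), mean(side_b)


def edge_response(u1, u2):
    """Assumed ratio form: near 0 for equal means, near 1 for a strong edge."""
    return 1.0 - min(u1 / (u2 + EPS), u2 / (u1 + EPS))


def max_edge_response(patch):
    """Maximum edge response r_m over the four directions."""
    return max(edge_response(*neighbourhood_means(patch, d)) for d in OFFSETS)


def mean_to_variance(patch):
    """Ratio of mean to variance r_mv over all pixels of the patch."""
    pixels = [value for row in patch for value in row]
    return mean(pixels) / (pvariance(pixels) + EPS)


def feature_vector(patch):
    """Two-element feature q = (r_m, r_mv) for one patch."""
    return (max_edge_response(patch), mean_to_variance(patch))


# Toy patch with a horizontal edge between a dark and a bright region.
patch = [[1, 1, 1, 1], [1, 1, 1, 1], [4, 4, 4, 4], [4, 4, 4, 4]]
r_m, r_mv = feature_vector(patch)
```

For the toy patch, the strongest response comes from the 0° direction, where the two neighborhoods have means 1 and 4. A perfectly uniform patch has zero variance, so `mean_to_variance` becomes very large; how such patches should be treated is left open here.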

Attribute Learning
The goal of this paper is to exploit a set of representative and discriminative mid-level features, which are expected to describe the attributes of a SAR image. In other words, the correspondence between these features and the attributes should be bridged. To explore and discover the underlying relationships between them, unsupervised clustering algorithms are needed for this task.
A large number of clustering methods, e.g., k-means [26], can be used to deal with a standard unsupervised clustering problem. However, as noted in [22], because of the low-level distance metrics adopted by these clustering methods, e.g., L1, Euclidean and cross-correlation, they often perform poorly in clustering tasks with mid-level features. To meet the discriminative requirement of the desired attributes, an iterative learning step, as illustrated in Figure 3, is employed in this paper. This procedure is implemented in Algorithm 1, and each step is summarized as follows.

Algorithm 1
    ...
    5: Classify on V; the top 5 members are sorted out for each new cluster
    6: Swap(T, V)
    7: Repeat steps 3 to 5
    8: if members are not changed in each cluster center
    9: end while
    10: return Attributes

The input full scene is divided into two equal but non-overlapping subsets, i.e., the training subset T and the validation subset V. Another required dataset A, called the assistant set, is formed by collecting images from the same SAR system. For all images in the training set T, first, n patches are sampled at multiple scales. For each image, the patch size ranges from a local region, e.g., 7 × 7 pixels, to the global size, and an overlapping sampling strategy, namely, a stride of s = 3 pixels, is applied to cover the entire image [15]. Then, low-level features are extracted for each patch, including the maximum edge response and the ratio of mean-to-variance; next, standard k-means is employed to cluster in the low-level feature space, where the number of cluster centers is k = n/4 and n is the number of patches in dataset T. Considering the representative requirement, cluster centers with fewer than three members are removed, and m clusters remain.
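The multiscale sampling described above can be sketched as follows; this is a minimal illustration under stated assumptions, not the authors' implementation. The function names and the list of patch sizes are hypothetical, since the text gives only the smallest size (7 × 7 pixels), the largest size (the whole image) and the stride. Appending a final origin so that the image border is always covered is also an assumption.

```python
def patch_origins(image_size, patch_size, stride):
    """Top-left corners of square patches laid out on a regular grid."""
    last = image_size - patch_size
    if last < 0:
        return []  # the patch does not fit into the image
    coords = list(range(0, last + 1, stride))
    if coords[-1] != last:
        coords.append(last)  # assumed: add one origin so the border is covered
    return [(y, x) for y in coords for x in coords]


def sample_multiscale(image_size, patch_sizes, stride):
    """All (patch_size, y, x) triples for a square image."""
    return [
        (size, y, x)
        for size in patch_sizes
        for (y, x) in patch_origins(image_size, size, stride)
    ]


# Hypothetical scale list for one 64 x 64 image with a stride of 3 pixels.
patches = sample_multiscale(64, [7, 16, 32, 64], stride=3)
n = len(patches)  # number of sampled patches
k = n // 4        # initial number of k-means cluster centers, k = n/4
```

With a stride smaller than the patch size the patches overlap, which matches the overlapping sampling strategy in the text; the count `n` then fixes the initial number of clusters.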
The next step is to train a linear SVM classifier for each cluster generated in the first step, where the positive examples are the members within each cluster, whereas the negative examples come from the assistant dataset A. Then, the trained classifiers are applied to perform classification on the cross-validation subset V, and the labeled samples are re-clustered to form new cluster centers. Note that the members of each cluster center come only from the top p = 5 samples with the highest classification scores. Parameters for attribute learning are summarized in Table 1. When the new clusters are constructed, the training set T and validation set V are exchanged; then, the above clustering and classification steps are iteratively repeated, and those clusters that were fired fewer than 2 times are removed. The procedure described above is iterated until convergence, i.e., until the members within each cluster do not change. Note that some low-level features are removed during this procedure since they are not representative and discriminative.
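The re-clustering rule of one iteration can be sketched as below, assuming the classifier scores on the validation subset are already available. This is an interpretation rather than the authors' code: the function name, the data layout and the toy scores are hypothetical, and a sample is taken to "fire" when its score exceeds −1, the threshold given in the parameter settings.

```python
def reselect_members(scores, p=5, fire_threshold=-1.0, min_fired=2):
    """Rebuild clusters from classifier scores on the validation subset.

    scores maps cluster_id -> {patch_id: score}. A cluster is kept only if
    at least min_fired patches score above fire_threshold; its new members
    are the p highest-scoring patches among those that fired.
    """
    new_clusters = {}
    for cluster_id, patch_scores in scores.items():
        fired = [(pid, s) for pid, s in patch_scores.items() if s > fire_threshold]
        if len(fired) < min_fired:
            continue  # cluster fired too rarely and is removed
        fired.sort(key=lambda item: item[1], reverse=True)
        new_clusters[cluster_id] = [pid for pid, _ in fired[:p]]
    return new_clusters


def members_unchanged(previous, current):
    """Convergence test: every cluster keeps exactly the same members."""
    as_sets = lambda clusters: {cid: set(m) for cid, m in clusters.items()}
    return as_sets(previous) == as_sets(current)


# Made-up scores for two clusters on the validation subset.
toy_scores = {
    "a": {1: 0.9, 2: -0.5, 3: -2.0, 4: 0.1},
    "b": {5: -1.5, 6: -0.9},
}
selected = reselect_members(toy_scores, p=2)
```

In the toy input, cluster "a" keeps its two best-scoring patches, while cluster "b" fires only once and is dropped. In a full loop, T and V would be swapped after each call and `members_unchanged` checked to decide when to stop.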

Attribute Dictionary Construction
The attribute dictionary consists of the most distinctive attributes, namely, the cluster centers with maximum separation. To construct such an attribute dictionary, an SVM classifier is first employed on the clustering centers, which have been generated during the attribute learning step, and new clusters are generated; then, classification scores are assigned to these new clusters, where the score of a cluster is computed by summing up the scores of the top r (where r > p) members within the new cluster based on the confidence of the SVM classification. Finally, the top K clusters are selected to construct the attribute dictionary $D = [d_1, d_2, \ldots, d_K]$, where $d_k$ ($k = 1, \ldots, K$) is the $k$th selected cluster center. Once this attribute dictionary has been built, the $i$th patch, sampled from the input image and represented by feature $q_i$, can be represented by a dictionary-based feature vector $v_i$. For all patches from an input SAR image, such a feature vector can be obtained by Equation (5). As a consequence, the input SAR image can be described by a statistical histogram $h = [h_1, h_2, \ldots, h_K]$, where $h$ is the sum-pooling of $\{v_i\}_{i=1}^{n}$, $n$ is the number of sampled patches, and the $k$th element $h_k$ of $h$ is the sum of the $k$th elements of $\{v_i\}_{i=1}^{n}$.
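The ranking step can be illustrated with a short sketch, assuming the per-member SVM scores of each cluster have already been computed. The function name, the cluster labels and the scores are made up for illustration; only the rule itself (sum the top r scores, keep the top K clusters) follows the text.

```python
def rank_clusters(cluster_scores, r, top_k):
    """Return the ids of the top_k clusters by the sum of their r best scores.

    cluster_scores maps cluster_id -> list of classifier scores of its members.
    """
    totals = {
        cid: sum(sorted(scores, reverse=True)[:r])
        for cid, scores in cluster_scores.items()
    }
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:top_k]


# Made-up scores for three hypothetical clusters.
toy_clusters = {
    "spot": [2.0, 1.5, 0.2, -0.3],
    "stripe": [1.0, 0.9, 0.8, 0.7],
    "mixed": [0.5, 0.1],
}
dictionary_ids = rank_clusters(toy_clusters, r=3, top_k=2)
```

Summing only the top r scores rewards clusters whose best members are detected with high confidence, regardless of how many weak members they contain.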

Classification with Attributes
As illustrated in Figure 4, the classification framework is composed of three steps, namely, low-level feature extraction, attribute detection, and classification. The first step is low-level feature extraction for the input SAR image; several patches are first sampled at multiple scales from the input image; then, the low-level features can be extracted by computing the maximum edge response and the ratio of mean-to-variance; these extracted features are the candidates to be detected for the representation of the input image.
The second step is attribute detection. The attributes for the test image can be sorted out by referring to the attribute dictionary, where the metric is based on the nearest neighbor. Note that these detected attributes correspond to the multi-scale patches sampled from the test image. This is equivalent to detecting visual words for the image.
The last step is the implementation of classification. When the attributes have been sorted out, a statistical histogram describing the attribute frequency can be computed as a final descriptor; then, a simple linear SVM is applied to perform classification.
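Attribute detection and the histogram descriptor can be sketched as follows. The sketch assumes a hard nearest-neighbor assignment with squared Euclidean distance on the raw two-element features, which is one plausible reading of "the metric is based on the nearest neighbor"; the function names and the toy values are hypothetical.

```python
def nearest_attribute(q, dictionary):
    """Index of the dictionary entry closest to feature q (squared Euclidean)."""
    distances = [sum((a - b) ** 2 for a, b in zip(q, d)) for d in dictionary]
    return distances.index(min(distances))


def attribute_histogram(features, dictionary):
    """Count how often each attribute is detected among the patch features."""
    histogram = [0] * len(dictionary)
    for q in features:
        histogram[nearest_attribute(q, dictionary)] += 1
    return histogram


# Made-up dictionary of K = 3 attributes and four patch features (r_m, r_mv).
toy_dictionary = [(0.1, 5.0), (0.8, 1.0), (0.4, 2.5)]
toy_features = [(0.75, 1.1), (0.7, 0.9), (0.15, 4.8), (0.5, 2.4)]
h = attribute_histogram(toy_features, toy_dictionary)
```

The resulting histogram `h` has one bin per attribute and sums to the number of patches; it would then be passed to a linear SVM as the image descriptor. Because the two features have different ranges, some rescaling before the distance computation may be advisable, but the text does not specify one.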

Data Set
To evaluate the performance of the proposed method, a data set is collected. The full scene was acquired by TerraSAR-X with single polarization (VV channel) from an area in Guangdong Province, China, on 24 May 2008. The full scene is composed of 7 categories as shown in Figure 5, including forest, river, hill, farmland, industrial area, urban region and others. Each category contains 160 images with a size of 64 × 64 pixels and a pixel spacing of about 1.25 m. The datasets T and V for model training are formed by sampling 100 images in each category, and the remaining images are used for testing. The assistant dataset A is formed by a collection of images from the same TerraSAR-X system.

Parameter Setting
There are 350 images in each of the datasets T and V, and 2000 images in the assistant dataset A. The minimum patch size used to extract low-level features is 7 × 7 pixels with a stride of 7 pixels, and the maximum size is as large as the entire image; the patches are generated by sampling the input image at 19 scales. During the initial clustering step, the number of cluster centers is k = n/4, where n is the number of patches in dataset T; a cluster member is considered to be fired when its SVM score is above −1. The total number of attributes sorted out to form the attribute dictionary is 3396, and the top 1000 discriminative attributes are used for image representation.

Results and Analysis
The 5 most discriminative patches for each category are shown in Figure 6. Here, the detected patches from (a) to (g) correspond to the categories in Figure 5.
As noted, the discriminative patches among categories are quite different; e.g., the patches in (b) are characterized by dense spots, the patches in (e) by stripes, and the patches in (f) by a combination of a single spot and a stripe. Moreover, there is a remarkable resemblance among the discriminative patches within each category, e.g., the characterization of dense spots in (a) and the linear bright stripe in (e). This demonstrates that the learned attributes are representative and discriminative, since the iterative algorithm, including the initial clustering, classification and re-clustering steps, is used during the attribute learning. The clustering step sorts out the frequently occurring cluster centers, namely, the representative attributes, while the classification step ensures that these selected attributes are clearly discriminative from the rest.
To evaluate the performance of the proposed method, five state-of-the-art methods are chosen for comparison, including GLCM [5], Gabor [6], GMRF [27], Particle Filter Sample Texton (PFST) [28] and BoW-MV [25]; the settings for each method are described as follows. Four statistical features are extracted for GLCM, i.e., contrast, correlation, energy and inverse difference moment. Gabor filters are implemented on 8 orientations and 3 linear scales, and the mean and variance of each sub-band form the Gabor texture feature. The GMRF texture features are generated by estimating the 4-level GMRF model with 12 parameters. For the PFST, the patch size is 5 × 5 pixels; there are 8 key points and 10 textons per class. The cell for local feature extraction in BoW-MV is 7 × 7 pixels. Table 2 shows the classification results; here, AL, the abbreviation of Attribute Learning, denotes the proposed method. Besides the classification accuracy for each class and the average accuracy (A.A.), the Kappa coefficients (Kappa) and Kappa confidence intervals (Kappa C.I.) [29] are also used for comparison. The average accuracy of the proposed method is slightly better than that of the other approaches; for example, compared with the state-of-the-art method PFST, the Kappa confidence interval, 0.83 to 0.89, is more confident, and the average accuracy and Kappa coefficient are improved by 7.5% and 0.06, respectively. Moreover, the classification accuracy of the proposed approach is clearly improved on regions with rich structure information, such as the industrial and hill areas; this is because the edge detector and the ratio of mean-to-variance are extracted as low-level features.
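For readers unfamiliar with the evaluation measures, the average accuracy and the Kappa coefficient can be computed from a confusion matrix as sketched below. The formula is the standard Cohen's Kappa; the function names are hypothetical, and the 2 × 2 confusion matrix is invented for illustration and is not taken from Table 2.

```python
def average_accuracy(confusion):
    """Mean of the per-class accuracies (rows are true classes)."""
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return sum(per_class) / len(per_class)


def cohen_kappa(confusion):
    """Cohen's Kappa: agreement corrected for chance agreement."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(size)) / total
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(row[j] for row in confusion) for j in range(size)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    return (observed - expected) / (1.0 - expected)


# Invented two-class confusion matrix (rows: true class, columns: predicted).
toy_confusion = [[50, 10], [5, 35]]
aa = average_accuracy(toy_confusion)
kappa = cohen_kappa(toy_confusion)
```

In the toy matrix, 85 of 100 samples lie on the diagonal, while the row and column totals imply a chance agreement of 0.51, so Kappa is noticeably lower than the raw accuracy. This is why Kappa is reported alongside the average accuracy.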

Conclusions
This paper has presented a classification method based on attribute learning for high spatial resolution SAR images. The key contribution is that attributes for SAR images have been extracted, which are characterized by bright and dark spots, stripes, or various combinations of them. In order to learn these attributes, an iterative procedure was designed, including an unsupervised clustering step, a classification training step, and a cross-validation step for preventing overfitting. The experiments conducted on TerraSAR-X data demonstrate that the extracted attributes are discriminative and representative, and the classification method based on the learned attributes achieved better accuracy, especially on regions with rich structure information. However, the proposed method is not rotation-invariant; further research is required to address this issue for performance improvement.

Figure 1. Correspondence between different objects in a Synthetic Aperture Radar (SAR) image and their ground truth. The SAR image is characterized by bright and dark spots, stripes, or various combinations of them; different objects exhibit distinct visual semantics.

Figure 3. Framework of attribute learning. Attributes are learned by iteratively repeating the clustering and classification procedure.

Figure 4. Framework of SAR image classification based on attributes. The first step is low-level feature extraction; then, the attributes that appear in the test image are detected, and the frequency of the detected attributes is described by a statistical histogram; finally, a linear SVM classifier is used to perform classification.

Figure 6. Five detected patches for each category. The detected patches from (a) to (g) correspond to the categories in Figure 5; patches from different categories exhibit distinct visual properties, whereas the five patches within each category demonstrate remarkable resemblance.

Table 1. Parameters for attribute learning.

Table 2. Performance comparison in terms of accuracy, average accuracy, Kappa coefficient and associated confidence interval.