Graph Convolutional Networks by Architecture Search for PolSAR Image Classification

Abstract: Classification of polarimetric synthetic aperture radar (PolSAR) images has achieved good results owing to the excellent fitting ability of neural networks trained with a large number of samples. However, the performance of most convolutional neural networks (CNNs) degrades dramatically when only a few labeled training samples are available. As a well-known class of semi-supervised learning methods, graph convolutional networks (GCNs) have recently gained much attention for classification with only a few labeled samples. As the number of layers in the network grows, the number of parameters increases dramatically, and it is challenging to determine an optimal architecture manually. In this paper, we propose a GCN based on neural architecture search (ASGCN) for the classification of PolSAR images. We construct a novel graph whose nodes combine both the physical features and the spatial relations between pixels or samples to represent the image. We then build a new search space, whose components are empirically selected from several graph neural networks, and develop a differentiable architecture search method to construct our ASGCN. Moreover, to handle the training of large-scale images, we present a new weighted mini-batch algorithm that reduces memory consumption and keeps the sample distribution balanced, and we analyze and compare it with similar training strategies. Experiments on several real-world PolSAR datasets show that our method improves the overall accuracy by as much as 3.76% over state-of-the-art methods.


Introduction
Polarimetric synthetic aperture radar (PolSAR) data have wide applications in agriculture, forestry, geology, oceanography, etc. [1][2][3][4]. In agriculture, PolSAR data are used for identifying crop species, monitoring crop growth, and assessing land conditions [5]. In forestry, PolSAR data are adopted to monitor fires and excessive logging as well as to estimate forest biomass [6]. In geology, PolSAR data are employed to analyze information such as geological structure, mineral distribution, surface roughness, ground coverage, and soil moisture [7]. PolSAR data classification is the key to data interpretation and one of the important research topics in PolSAR data processing.
The current classification methods for PolSAR data can generally be categorized as unsupervised, supervised, and semi-supervised learning. Unsupervised methods do not need labeled samples for training, while supervised methods use a certain number of labeled samples to train a classifier and then classify the unlabeled samples. More recently, semi-supervised learning has attracted increasing attention. It uses a few labeled samples and a large number of unlabeled samples for classification. With the development of deep learning, networks with more complex architectures can be designed for classification, and they have improved classification performance when only a few labeled samples are available.
As one of the semi-supervised learning methods, the graph convolutional network (GCN) [8,9] is a graph-structured learning network, and it is widely applied in modeling social networks, segmenting large point clouds, and predicting biomolecular structure. In SemiGCN [9], a graph with binary weights is constructed, and then several graph convolutional layers are stacked and followed by a Softmax function, so that the labels propagate to the unlabeled samples for semi-supervised classification. However, the above-mentioned works mainly construct their networks manually, and a delicate network has to be designed for the given data. It is well known that the performance of deep learning algorithms depends heavily on the architecture of the neural network, and it costs experts considerable effort to select and determine a suitable one for a specific application. A network with an optimal architecture has yet to be found, which remains challenging. In this paper, inspired by the success of Neural Architecture Search (NAS) [10], we propose a new graph convolutional network based on architecture search, called ASGCN, for PolSAR classification. We first build a fine-grained graph with varying weights, and then propose a weight-based mini-batch strategy to partition the graph into subgraphs. Moreover, we construct a search space for the architecture search of ASGCN, and use the subgraphs to search for an optimal architecture for classification.
The main contributions of this paper can be summarized as follows: We propose a novel ASGCN based on architecture search to automatically find the optimal network structure for feature learning and classification. A new search space is constructed for our ASGCN, which provides a variety of possibilities for model selection. Then a new training method is also presented, which can ensure the balance and diversity of sample distribution, and decrease memory and computational costs.
The rest of this paper is organized as follows. Section 2 briefly introduces the background. Our method of constructing ASGCN is presented in Section 3. Section 4 illustrates the experiments and results. The conclusion and future work are discussed in Section 5.

The Classification Methods of PolSAR Data
Most of the classification methods in the early years were unsupervised. Researchers relied on analyzing the scattering matrix and the covariance matrix for classification. For example, in [11], the authors perform polarimetric decomposition to yield the H/α components and then use a complex Wishart classifier. In [12], the pixels are divided into different scattering categories based on the Freeman and Durden decomposition [13], and then fine-grained classification is achieved by applying the Wishart classifier iteratively. More recently, statistical analysis and machine learning methods have been applied to categorization. For instance, the minimum stochastic distance is compared and analyzed in [14]. The Fuzzy K-means algorithm is employed for classification in [15,16]. In [17][18][19][20][21], the authors calculate statistics of the covariance matrix for classifying pixels. In [22], a new superpixel generation method named fuzzy super-pixel (FS) is proposed for PolSAR image classification. Deep learning-based methods have also been presented; for instance, in the Wishart Deep Belief Network (W-DBN) [23], restricted Boltzmann machines are stacked to model PolSAR data for classification.
In semi-supervised learning, the model uses a few labeled samples and a large number of unlabeled samples for classification. In [33], the authors proposed a combined method that uses both unsupervised clustering and a multi-layer perceptron for sample labeling. In [34], co-training based techniques are introduced for classification. A stochastic expectation-maximization algorithm was proposed in [35]. Graph-based methods have also been exploited for their solid mathematical foundation [36][37][38]. In these methods, a graph is defined using both the labeled and unlabeled samples as nodes, and the class labels then spread through the edges according to a designed optimization function, which completes the classification of all the samples. Not only the traditional semi-supervised methods, i.e., co-training and the graph-based methods [39][40][41][42][43][44][45], but also deep learning techniques are used for classification. In [46], a sparse manifold regularization (DSMR) together with a deep neural network is proposed for PolSAR feature extraction. A graph-based deep CNN for semi-supervised label propagation is presented in [47] for PolSAR categorization.

Neural Architecture Search
Neural Architecture Search (NAS) is an important branch of automated machine learning (AutoML) [48], which aims to find an architecture automatically instead of designing a neural network manually. It is a technique for automatically searching for the architecture of an artificial neural network, and it has been used to find effective architectures that outperform hand-designed ones. The search space, the search strategy, and the performance evaluation metric are the three core elements of an NAS algorithm [49]. The search space is the set of network architectures that can be searched, namely, the solution space. The search strategy is used to find the optimal network architecture in the search space. The performance evaluation metric is designed to evaluate the performance of the searched network architecture. The first work in NAS was proposed in [10] and obtained promising results based on a reinforcement learning algorithm. However, its high computational cost has prevented widespread adoption of this method. To solve this issue, differentiable architecture search (DARTS) [50] has been proposed, which makes the search space differentiable and greatly reduces the time consumed by the search. This brings great opportunities for the search of network architectures. DARTS can express the structure (search space) in a continuous form and allows efficient architecture search using gradient descent. NAS has been applied in a variety of areas in computer vision. However, the construction of graph convolutional networks using NAS for PolSAR classification has rarely been reported in the literature.

Graph Neural Networks
Here we introduce six graph neural networks and common modules below.
(1) SemiGCN [9]: This is a spectral-based graph convolutional network. It proposes a graph convolution rule that uses the first-order approximation of spectral convolution on graphs.
(2) Max-Relative GCN (MRGCN) [51]: It adopts residual/dense connections and dilated convolution in GCNs to solve the vanishing gradient and over-smoothing problems. It deepens the network from several layers to dozens of layers.
(3) EdgeConv [52]: EdgeConv is an edge convolution module proposed for constructing a dynamic graph CNN that models the relationships between the points of a point cloud. It concatenates the feature of the center point with the feature difference of the two points, and then feeds them into an MLP. EdgeConv ensures that the edge features integrate the local relationship between the points and the global information of the points.
(4) Graph Attention Network (GAT) [53]: GAT uses attention coefficients to aggregate the features of neighboring vertices to the central vertex. Its basic idea is to update the node features according to the attention weight of each node on its adjacent nodes. GAT uses masked self-attention layers to solve inductive problems.
(5) Graph Isomorphism Network (GIN) [54]: After each hop of aggregating the features of adjacent graph nodes, this network mixes the result with the original features of the central node. In the process of feature blending, a learnable parameter is introduced to adjust the node's own features, and the adjusted features are added to the aggregated adjacent features.
(6) TopKPooling [55]: It is a graph pooling method used in graph U-Net. Its main idea is to map the embeddings of the nodes into a one-dimensional space and to select the top K nodes as reserved nodes.
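Two of the modules above, the GIN-style aggregation in (5) and the top-K node selection in (6), can be illustrated with a minimal pure-Python sketch. It is a simplification under stated assumptions: the MLP that follows the GIN aggregation and the score-based gating used by the actual TopKPooling layer are omitted, and the function names `gin_aggregate` and `topk_select` are hypothetical.

```python
def gin_aggregate(features, neighbors, eps=0.0):
    """GIN-style mixing: (1 + eps) * own feature + sum of neighbour features.

    features:  list of per-node feature lists
    neighbors: list where neighbors[i] holds the indices adjacent to node i
    eps:       the adjustable self-weight (learnable in a real network)
    The MLP applied after aggregation in GIN is left out of this sketch.
    """
    out = []
    for i, h in enumerate(features):
        agg = [(1.0 + eps) * v for v in h]
        for j in neighbors[i]:
            agg = [a + b for a, b in zip(agg, features[j])]
        out.append(agg)
    return out


def topk_select(features, projection, k):
    """Score each node by a 1-D projection and keep the k highest-scoring nodes.

    Returns the kept node indices in ascending order. Gating of the kept
    features by their scores is an omitted detail.
    """
    scores = [sum(p * v for p, v in zip(projection, h)) for h in features]
    order = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])
```

In a library such as PyTorch Geometric these steps are provided as ready-made layers; the sketch only makes the arithmetic described in the text explicit.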

Proposed Method
In this section, we propose the ASGCN for the classification of PolSAR images. We show how to build the fine-grained graph, how to form the mini-batches, and how to search for the optimal architecture. The details are described as follows.

Graph Construction
Consider a PolSAR dataset that contains N_A samples or pixels. Each sample i is represented by its feature vector x_i ∈ Z^(b×1), i = 1, 2, ..., N_A, where b is the number of features. Each pixel contains scattering signals from the surrounding area and may be a combination of backscattered reflections from many surrounding objects. Here, we build an undirected graph whose nodes are the pixels or samples of the PolSAR image. The edges represent the relations between nodes: one edge connects two nodes, and the weight on the edge denotes the similarity between the two nodes.
Considering that the covariance matrix follows a complex Wishart distribution, we utilize the revised Wishart distance d_W [16] to measure the difference between the covariance matrices of two samples i and j. In this distance, θ equals 3 under the symmetry assumption that the returned radar signal is a three-dimensional (3-D) complex scattering vector, since the HV and VH combinations are identical; Σ_i is the covariance matrix of sample i, and Tr(·) denotes the trace of a matrix.
Furthermore, apart from the covariance matrix, the distributions of the other features of PolSAR data are still unknown. We therefore use the common Euclidean distance d_E to measure the difference between the other features of two samples i and j. The revised Wishart distance d_W and the Euclidean distance d_E may be on different scales, so we normalize them to the same scale. Since we use multiple features of the PolSAR data, e.g., 41 features, both distance measurements are taken into account in a combined feature distance, in which α ∈ [0, 1] is a coefficient that balances the contributions of the two distances.
Moreover, the spatial correlation in a PolSAR image is also important for classification, since nearby samples may come from the same category.
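For orientation, the feature distances described above can be written in the following form. This is a sketch under assumptions rather than a restatement of the original equations: it assumes the usual symmetric form of the revised Wishart distance, writes $\tilde{x}_i$ (a symbol introduced here) for the non-covariance features of sample i, uses hats (also introduced here) for the normalized distances without fixing the normalization, and assumes that α multiplies the Wishart term.

```latex
\begin{align}
d_W(i,j) &= \tfrac{1}{2}\,\mathrm{Tr}\!\left(\Sigma_i^{-1}\Sigma_j + \Sigma_j^{-1}\Sigma_i\right) - \theta,
  \qquad \theta = 3, \\
d_E(i,j) &= \lVert \tilde{x}_i - \tilde{x}_j \rVert_2, \\
d_F(i,j) &= \alpha\,\hat{d}_W(i,j) + (1-\alpha)\,\hat{d}_E(i,j),
  \qquad \alpha \in [0,1].
\end{align}
```

With this form, $d_W(i,i) = \tfrac{1}{2}\mathrm{Tr}(2I_3) - 3 = 0$, so identical covariance matrices are at distance zero, which is the role of the offset θ.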
The spatial distance d_S between samples i and j is defined from their coordinates, where (h_i, u_i), i = 1, 2, ..., N_A, denotes the coordinates of sample i in the PolSAR image. Then we have the weighted feature distance D, which combines the feature distance d_F with the logarithm of d_S. Note that the logarithm applied to d_S shrinks its value to a smaller scale that is comparable to the value of d_F. Based on the defined distance D, we construct a K-nearest neighbor (KNN) graph A ∈ R^(N_A × N_A), in which, for each node, only the weights of the edges to its K nearest neighbors are computed and all other entries are set to 0.
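The KNN graph construction can be sketched as follows, assuming the pairwise distance D has already been computed as a dense matrix. The mapping from distance to edge weight, exp(-d), and the symmetrization step are illustrative assumptions that the text does not specify, and `knn_graph` is a hypothetical name.

```python
import math


def knn_graph(dist, k):
    """Build a weighted KNN adjacency matrix from a symmetric distance matrix.

    dist: N x N list of lists with pairwise distances (e.g. the distance D)
    k:    number of nearest neighbours kept per node
    Entries that do not belong to a kept edge stay 0.
    """
    n = len(dist)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Candidate neighbours of node i, nearest first (self excluded).
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in order[:k]:
            weight = math.exp(-dist[i][j])  # assumed similarity kernel
            adj[i][j] = weight
            adj[j][i] = weight  # keep the graph undirected
    return adj
```

Because of the symmetrization, a node can end up with more than K non-zero entries when it is among the nearest neighbors of many other nodes. For an image-sized graph a sparse representation would be needed in practice; the dense lists here only keep the sketch short.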

Weight-Based Mini-Batch for Large-Scale Graph
Since most of the datasets acquired by remote sensing systems are back-scattered observations of broad areas, they are very large. When they are converted to a graph structure, the memory and computational costs may increase greatly. To address this issue, we propose a weight-based mini-batch strategy that transforms a large graph into multiple subgraphs and then selects a certain number of subgraphs for learning according to their weights, as shown in Figure 1. Firstly, the nodes of the whole graph A are partitioned into p (1 ≤ p ≤ N_A) parts A_1, A_2, ..., A_p using the graph clustering algorithm METIS [56], which is a fast algorithm for partitioning graphs. METIS consists of three stages: coarsening, partitioning, and uncoarsening. In the coarsening stage, the graph is transformed into a sequence of smaller graphs. In the partitioning stage, a 2-way partition is computed to divide the vertices into two parts. In the uncoarsening stage, the partition is projected back to the original graph by passing through the intermediate partitions. Compared with other graph clustering approaches, METIS constructs partitions in which within-cluster links far outnumber between-cluster links, so it captures the community structure of the graph better and faster. Therefore, we adopt this algorithm. Each part has the same number of nodes, except the final subgraph, and carries a weight that reflects how many times the part has been selected. Let the weight vector be v = [v_1, v_2, ..., v_p]. The higher the weight, the higher the probability of the part being selected. In every epoch, q clusters are sampled to form a new batch according to the probabilities given by the weight vector v.
Here, TE denotes the total number of epochs in the training phase. As the number of times a part has been selected increases, its probability of being chosen to form a batch decreases. With this simple method, we ensure that every part has the opportunity to connect, and we avoid the repeated use of the same part of the graph that random selection can cause and that makes the training phase unstable.
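The sampling step can be sketched in Python as follows. Since the exact probability formula is not given in this text, the sketch assumes that the weight of a cluster is the number of epochs minus the number of times it has already been selected, so that frequently used clusters become less likely; `sample_batch` and its parameters are hypothetical names.

```python
import random


def sample_batch(counts, q, total_epochs, rng):
    """Draw q distinct cluster indices, favouring rarely selected clusters.

    counts:       counts[i] is how often cluster i has been selected so far;
                  it is updated in place
    q:            number of clusters per batch
    total_epochs: TE, the total number of training epochs
    rng:          a random.Random instance
    Assumed weighting: v_i = TE - counts[i], floored at a tiny positive value.
    """
    weights = [max(total_epochs - c, 1e-6) for c in counts]
    candidates = list(range(len(counts)))
    chosen = []
    for _ in range(q):
        # Weighted draw without replacement among the remaining clusters.
        w = [weights[i] for i in candidates]
        pick = rng.choices(candidates, weights=w, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)
    for i in chosen:
        counts[i] += 1
    return chosen
```

In each epoch, the nodes of the chosen clusters, together with the edges among them, would form the subgraph used for one training step.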

The Architecture of Our ASGCN
In order to determine a superior architecture for our ASGCN given the input data, we present a NAS method to find a better solution. Firstly, we build a search space O, which contains many operators. Instead of the common CNN operations, such as convolution and max-pooling, we select nine operators that have proved effective in other GCN works: the multilayer perceptron (MLP), SemiGCN [9], MRGCN [51], EdgeConv [52], GAT [53], GIN [54], TopKPooling [55], skip-connect, and zero operations. Among them, the MLP operation is a fully connected layer that maps the input features to the output features without considering the edges between nodes. Skip-connect is a residual graph connection [57], which reduces the probability of over-fitting. Zero denotes a mapping whose output is zero.
We show our search strategy in Figure 2. It is inspired by the method in DARTS [50]. Suppose that our GCN consists of M cells, and each cell consists of N_T layers. Each layer l_i is a latent representation. Each directed edge between two layers outputs a feature map at every forward propagation of the neural network. The output of the cell is obtained by applying a reduction operation (e.g., concatenation) to all the intermediate nodes. The operation on each edge in the search space O is parameterized by architecture parameters γ^(i,j), where γ_op^(i,j) is the architecture parameter of operation op ∈ O from the i-th layer to the j-th layer (i < j). The output of the i-th layer that serves as one of the inputs of the j-th layer, c^(i,j)(l_i), is computed from the operations o(·, ·) ∈ O, where o(·, ·) is a convolution operation in the search space O and w_op^(i,j) is the weight parameter of the graph convolution op from the i-th layer to the j-th layer. Each intermediate layer is the sum of the contributions of all previous layers.

Architecture Searching
The trainable parameters are the architecture parameters γ and the weights w of the cell or network, and they are trained alternately. The loss function L_val is the cross-entropy on the validation set. When w is fixed, γ is updated by minimizing L_val(w, γ) on the validation dataset. When γ is fixed, w is updated by minimizing L_train(w, γ) on the training dataset. With the solutions for the architecture parameters γ and the weights w, we can determine one cell for our ASGCN. Similar to other NAS implementations, our ASGCN is formed by stacking N_c such cells together. Then we can fine-tune the whole network to optimize all the weights with the training set.
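The role of the architecture parameters γ on a single edge can be illustrated with a small sketch. It assumes the DARTS-style continuous relaxation [50], in which the output of an edge is a softmax-weighted sum of all candidate operations and, after the search, the operation with the largest parameter is kept. The toy operations below are stand-ins for graph convolutions, and all function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_edge(x, ops, gamma):
    """Softmax(gamma)-weighted sum of the candidate operations applied to x.

    x:     input feature vector of the edge (output of layer i)
    ops:   list of callables, one per candidate operation in the search space
    gamma: architecture parameters of this edge, one per operation
    """
    probs = softmax(gamma)
    outs = [op(x) for op in ops]
    return [sum(p * out[d] for p, out in zip(probs, outs)) for d in range(len(x))]


def derive_edge(op_names, gamma):
    """After the search, keep the operation with the largest parameter."""
    best = max(range(len(gamma)), key=lambda i: gamma[i])
    return op_names[best]


# Toy stand-ins for operators of the search space (not real graph layers).
def skip_connect(x):
    return list(x)


def zero_op(x):
    return [0.0] * len(x)


def double_op(x):
    return [2.0 * v for v in x]
```

During the search, γ and the operation weights w would be updated alternately by gradient descent on the validation and training losses; in an actual implementation an automatic differentiation framework such as PyTorch handles that step.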

Comparison on Methods of Graph Partition
Besides our weight-based mini-batch strategy for large-scale graphs, we analyze three other ways of partitioning a graph. First, we consider dividing the large graph A into N_q small non-overlapping subgraphs, each containing q nodes.
(1) Fixed: A common method is to use these N_q subgraphs for training in every epoch; this is called the Fixed method.
(2) Shuffle: Another method is to randomly select q nodes to constitute a subgraph for training in each epoch.
Obviously, the Fixed method causes great data loss. As shown in Figure 1, the white area of the adjacency matrix is the unused data. The fraction of lost data is (N_A × N_A − N_q × q × q)/(N_A × N_A), and the data utilization rate is (N_q × q × q)/(N_A × N_A). Assuming that N_A is divisible by q, we have N_q = N_A/q, and the data utilization rate becomes (N_A × q)/(N_A × N_A) = q/N_A. The larger q is, the higher the data utilization rate; q = N_A means that the whole graph is used for training. For the ClusterGCN method [58], the choice of the number of nodes in a cluster is crucial. If the cluster is too large, it causes a serious imbalance of the sample distribution, because data of the same category tend to be gathered together in the clustering process. As a result, the data of each batch are biased towards certain categories. If the cluster is too small, the method is no different from the Shuffle method.
Different from ClusterGCN, we propose to select nodes according to a certain weight. We set the number of clusters as p for clustering, and each cluster has its own weight v i . When a cluster is selected too frequently, the weight will be reduced. The probability of being selected is also reduced. In this way, we can avoid serious imbalance of sample distribution and increase the diversity of training samples, which leads to a stable training process.
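The data utilization rate of the Fixed method derived above can be checked numerically with a short sketch; `fixed_utilization` is a hypothetical helper, and the graph sizes used with it are illustrative values only.

```python
def fixed_utilization(n_nodes, q):
    """Fraction of adjacency-matrix entries used by the Fixed method.

    n_nodes: N_A, the number of nodes in the whole graph
    q:       number of nodes in each non-overlapping subgraph
    Uses N_q = N_A // q diagonal blocks of size q x q out of N_A x N_A entries.
    """
    if q <= 0 or q > n_nodes:
        raise ValueError("q must satisfy 0 < q <= n_nodes")
    n_sub = n_nodes // q
    return (n_sub * q * q) / (n_nodes * n_nodes)
```

For example, a graph with 1000 nodes split into blocks of 100 nodes uses 10% of the adjacency entries, in agreement with q/N_A. Taking, purely for illustration, N_A = 750 × 1024 = 768,000 and 2000 equal subgraphs (q = 384), the rate is 384/768,000 = 0.05%.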

Experiments and Analysis
We evaluated our proposed method on two real-world PolSAR datasets: Flevoland and San Francisco. Their details are given below. The Flevoland dataset is L-band, four-look PolSAR data with a resolution of 1 × 6 m. It has a size of 750 × 1024 pixels and was acquired by NASA/JPL AIRSAR in 1989 over the Netherlands. This dataset contains 15 terrain classes: stem beans, rapeseed, bare soil, potatoes, beet, wheat, peas, wheat2, lucerne, barley, wheat3, grasses, forest, water, and buildings. It is widely used to evaluate classification methods. The San Francisco dataset covers the bay area with the Golden Gate Bridge, and its size is 1300 × 1300 pixels. It is C-band, single-look, fully polarimetric SAR data acquired by the RADARSAT-2 sensor, and it includes five classes: water, vegetation, low-density urban, high-density urban, and developed.
We conducted all the experiments on a computer with two 1080Ti GPUs (each with 11 GB memory). Our method was compared with six state-of-the-art algorithms: two supervised methods, SVM [24] and CNN [30]; two unsupervised methods with pre-training, FS [22] and W-DBN [23]; and two semi-supervised methods, DSMR [46] and SemiGCN [9]. The code was written with PyTorch [59], and the GCN operators were implemented using PyTorch Geometric [60]. The initial random seed of the algorithm was fixed for fair comparison. We carried out each experiment 20 times and reported both the average overall accuracy (OA) and the standard deviation. For both the Flevoland and San Francisco datasets, the feature vectors had b = 41 features. Moreover, Lee filtering [61] with a 5 × 5 window was applied to all the datasets as pre-processing to reduce the influence of the speckle noise of the PolSAR data. The parameters of the algorithms were set as follows:
• SVM [24]: It was implemented with LibSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/, accessed on 1 April 2021). The kernel function was the Radial Basis Function, with gamma = 1 and penalty coefficient = 2 for the Flevoland dataset, and gamma = 2 and penalty coefficient = 3 for the San Francisco dataset.
• FS [22]: The number of super-pixels K was chosen in the interval [500, 3000], and the compactness of the super-pixels mpol was selected in the interval [20, 60].
• W-DBN [23]: W-DBN had two hidden layers, and the node numbers were set to 50 and 100, respectively. The threshold τ_0 was chosen in the interval [0.95, 0.99]. The learning rate was set to 0.01. ρ_0 was set to 0, and the window size was set to 3 or 5.
• CNN [30]: The network included two convolution layers, two max-pooling layers, and one fully connected layer. The sizes of the filters in the two convolutional layers were 3 × 3 and 2 × 2, respectively, and the pooling size was 2 × 2. The momentum parameter was 0.9, and the weight decay rate was set to 5 × 10^−4.
• DSMR [46]: The number of nearest neighbors and the regularization parameter λ were chosen in the intervals [10, 20] and [1 × 10^−3, 1 × 10^−4], respectively. The weight decay rate β was chosen in [1 × 10^−4, 1 × 10^−3].
• SemiGCN [9]: The number of hidden units was set to 32 or 64. The number of layers in the network was 3 or 4. Both normalization and self-connections were used. The learning rate and weight decay were set to 10^−3 and 5 × 10^−3, respectively.
• ASGCN (ours): The coefficient α of the distance weighting was in the range [0, 1]. The number of subgraphs p was in the range [2000, 3000]. The learning rate and weight decay were 10^−3 and 5 × 10^−3, respectively. The numbers of cells and hidden units are discussed in the experiments.
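The reporting protocol described above, overall accuracy averaged over repeated runs together with its standard deviation, can be sketched as follows. The helper names are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def overall_accuracy(predicted, truth):
    """Fraction of labeled samples whose predicted class equals the true class."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("predicted and truth must be non-empty and equally long")
    correct = sum(1 for p, t in zip(predicted, truth) if p == t)
    return correct / len(truth)


def summarize_runs(accuracies):
    """Mean and sample standard deviation of the OA over repeated runs."""
    return mean(accuracies), stdev(accuracies)
```

With 20 runs, `summarize_runs` would receive a list of 20 OA values, one per run.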

Architecture and Parameter Discussion
In this subsection, we take the Flevoland dataset as an example to analyze the architecture and parameters.
(1) Weight-based Mini-batch Algorithm
We verified our proposed strategy, the weight-based mini-batch for graph partition, and the results are shown in Figure 3. We used the searched architecture for the comparison. The number of cells was 3, and the number of layers was 3. The graph was clustered into 2000 subgraphs. The other parameters were set as follows: the value of K in the KNN algorithm: 32, batch size: (16, 50, 300), learning rate: 0.001, weight decay: 0.0005, hidden units: 32, gradient clip: 5. We compared the convergence curves of the test loss and the OA values of four methods: ClusterGCN [58], Shuffle, Fixed, and Weights (ours).
Experimental results indicated that our algorithm achieved better performance and more comprehensive data utilization than the other methods. The convergence curve of the loss of our method was more stable than those of the Shuffle and Fixed methods, and our method reached a lower loss than the three other methods. Moreover, the OA of our weight-based method was the highest among the four methods, since the weight assignment avoided an imbalanced sample distribution and thus resulted in a higher accuracy. The Shuffle method adopted random sampling each time, and its result was less stable than those of the other methods. The Fixed method used the same subgraphs, so its OA did not change from epoch to epoch.
(2) Ablation Study on Architectures
In order to better understand the effects of the choices of hyper-parameters, we conducted ablation studies on the number of cells and the number of hidden units for the ASGCN.
Firstly, fixing the parameters: the number of nearest neighbors and batch size were 30 and 400, respectively, we searched and showed the result of the top architecture in Figure 4. The name on each edge represented one of the operations from our search space. The edge without a name was a skip-connection. The inputs of each cell consisted of two previous cell outputs. The input of the first cell consisted of two identical graphs aggregated by MLP from original graph data. It can be seen that this cell included operations: gin, mr_conv, semi_gcn, and edge_conv.
Then we stacked from 2 to 4 cells, and varied the number of units among 16, 32, 64, and 128 for each architecture. The overall accuracies obtained are listed in Table 1. It can be seen that when the number of cells was 3 and the number of units was 32, the OA was the highest, 99.31%, and the memory consumption was about 0.0568 MB. When the number of cells increased, the depth of the network and the memory consumption also grew dramatically. However, the OA decreased. This indicated that the ASGCN should not be too deep, which may result in over-fitting for this dataset, and that three cells with 32 units each was a better choice for this dataset.
(3) Parameter Discussion
Moreover, we showed the influence of the number of nearest neighbors and of the batch size on the classification accuracy (OA) with the above architectures in Figure 5a,b, respectively. The number of nearest neighbors had a great influence on the final results. Higher classification accuracy could be obtained when the number of nearest neighbors was close to 32. When it increased further, the OA did not improve significantly, but too small a value was harmful. For example, 5 and 20 adjacent neighbors may have caused the loss of important connection information and led to the degradation of the algorithm's performance. It is worth noting that the time consumed in the search and evaluation process showed no significant difference for different numbers of nearest neighbors, but the time cost in the graph construction phase was much larger.
When the graph was divided into 2000 subgraphs, the performances with batch size 16 and 32 were poor. When the batch size exceeded 100, the classification accuracy of the algorithm became higher and started to remain stable. Small batch size resulted in incomplete information, thus the network could not process a large graph, and OA tended to float up and down.

Results on the Flevoland Dataset
For this dataset, the other parameters were set as follows. The number of hidden units was set to 32. The number of cells and the number of layers were both 3. The dataset was divided into 2000 partitions, and the batch size was 300. The number of nearest neighbors K was 32, and the gradient clip was 5.
The classification maps are shown in Figure 6 and the accuracies are listed in Table 2. ASGCN had the fewest misclassified pixels and the highest OA, 99.31%, among all the methods, and it achieved a large performance improvement in most categories, such as bare soil, wheat, peas, and water. Its OA was 13.81% higher than that of the classical SVM, whose performance was limited by the few labeled samples. Though CNN could take advantage of the labeled samples, it may have over-fitted with only 1% labeled samples for training, and it yielded an OA of 88.15%. The semi-supervised method DSMR was inferior to the GCN-based methods SemiGCN and ASGCN, which is probably because the graph convolutional network could extract more discriminative features for classification. Moreover, since SemiGCN was constructed manually, its architecture may not have been optimal for this dataset, and its OA was lower than that of our ASGCN. By using the weight-based cluster method and differentiable neural architecture search applied to a reasonable search space, our algorithm effectively avoided over-fitting and an unbalanced distribution of samples. It effectively improved the classification performance.

Results on the San Francisco Dataset
For this dataset, the number of hidden units was set to 24. The number of cells and the number of layers were both 3. The dataset was divided into 2000 partitions, and the batch size was 300. The number of nearest neighbors K was 32, and the gradient clip was 5.
The searched architecture for this dataset is shown in Figure 7. Moreover, Figure 8 shows the visual classification results, and Table 3 gives the corresponding classification result of each category. The results showed that our ASGCN achieved the best accuracy, 96.80%, among the studied algorithms. The semi-supervised methods DSMR, SemiGCN, and ASGCN showed a greater improvement in classification performance than the others, as they employ both the labeled and unlabeled samples to train the network. The classification accuracy of CNN was better than that of the SVM. However, CNN was slightly inferior to W-DBN. This is probably because having only 1% labeled samples greatly weakened the fitting capability of the network. Compared with SemiGCN, our method attained a higher OA. This is likely because its search space covered more possible situations and it could automatically search for the most suitable network structure for this dataset.
Figure 8. Classification results on the San Francisco dataset: … (c) SVM [24], (d) FS [22], (e) W-DBN [23], (f) CNN [30], (g) DSMR [46], (h) SemiGCN [9], and (i) ASGCN (ours).

Conclusions
In this work, we propose a new neural network, ASGCN, based on architecture search for PolSAR image classification. The PolSAR data are represented by a fine-grained graph, and a search space is constructed for the automatic search of an optimal ASGCN. Our method avoids a great deal of work in building networks manually. To address the memory cost caused by a large-scale graph, we propose a weight-based mini-batch strategy, which greatly reduces the memory cost in a single epoch and maintains stable convergence. The experimental results on typical datasets from different radar systems, i.e., Flevoland and San Francisco, indicate that our method outperforms state-of-the-art methods for classification in the majority of the tested cases. The advantages of our ASGCN have been demonstrated by the experiments: (1) Our ASGCN avoids the conventional manual design of the architecture, which may involve tedious work in tuning the structure and hyper-parameters. (2) The proposed search space enables our model to find an appropriate graph convolutional architecture for PolSAR classification, and it may provide some inspiration for similar applications. (3) The presented weight-based mini-batch strategy decreases the memory cost and makes training on large-scale datasets feasible. However, similar to other NAS-based algorithms, our ASGCN spends more time searching for an optimal architecture than other semi-supervised algorithms, such as DSMR [46] and SemiGCN [9], spend on training. Nevertheless, our ASGCN enhances the classification accuracy compared with some of the state-of-the-art methods. This may provide inspiration for the construction of new GCNs and for other automatic network design for PolSAR classification. In the future, more techniques will be studied to speed up the search process of our ASGCN. Moreover, we will investigate the other graph clustering approaches in [62,63] to improve our weighted mini-batch strategy.