PolSAR Image Classification with Lightweight 3D Convolutional Networks

Abstract: Convolutional neural networks (CNNs) have become the state of the art in optical image processing. Recently, CNNs have been applied to polarimetric synthetic aperture radar (PolSAR) image classification and have obtained promising results. Unlike optical images, PolSAR data carry unique phase information that expresses the structural information of objects. This special data representation makes 3D convolution, which explicitly models the relationship between polarimetric channels, perform better in the task of PolSAR image classification. However, deep 3D-CNNs involve a huge number of model parameters and expensive computational costs, which not only slows down interpretation during testing but also greatly increases the risk of over-fitting. To alleviate this problem, a lightweight 3D-CNN framework that compresses 3D-CNNs from two aspects is proposed in this paper. Lightweight convolution operations, i.e., pseudo-3D and 3D-depthwise separable convolutions, are adopted as low-latency replacements for vanilla 3D convolution. Further, fully connected layers are replaced by global average pooling to reduce the number of model parameters and save memory. Under the specific classification task, the proposed methods can remove up to 69.83% of the model parameters in the convolution layers of the 3D-CNN, as well as almost all the model parameters in the fully connected layers, which ensures fast PolSAR interpretation. Experiments on three PolSAR benchmark datasets, i.e., AIRSAR Flevoland, ESAR Oberpfaffenhofen, and EMISAR Foulum, show that the proposed lightweight architectures can not only maintain but also slightly improve the accuracy under various criteria.


Introduction
Polarimetric synthetic aperture radar (PolSAR), as one of the most advanced detectors in the field of remote sensing, can provide rich target information under all-weather, day-and-night conditions. In recent years, more and more attention has been paid to the development of PolSAR information extraction due to the good properties of PolSAR systems. In particular, PolSAR image classification has been extensively studied as the basis of PolSAR image interpretation.
Deep learning [1] has made remarkable progress in natural language processing and computer vision, and it has the potential to be applied in many other fields. Convolutional neural networks (CNNs), as one of the representative methods of deep learning, have shown strong abilities in the task of image processing [2]. It has been proved that CNNs can obtain more abstract feature representations than traditional hand-engineered filters. The generalization performance of machine learning-based image classification algorithms has been greatly improved with the rise of CNNs. Big data, advanced algorithms, and improvements in computing power are the key factors for PolSAR classification [24]. Complex-valued fully convolutional networks were proposed in [25] for PolSAR semantic segmentation. Sun et al. proposed a complex-valued generative adversarial network for semi-supervised PolSAR classification [26]. However, the development of complex-valued architectures is still in its infancy. To avoid complex operations, Liu et al. attempted to learn phase features independently [27]. A two-stream architecture was proposed to extract features from amplitude and phase respectively with the aid of a multi-task feature fusion mechanism [28]. It is worth noting that the PolSAR covariance matrix has been used as the input of CNNs in most studies [12,15,16,23,28], and the phase information is hidden between the input channels when each element of the upper triangle of the covariance matrix is regarded as a channel of the input. Recent works have revealed that, with the aid of 3D operations, channel-wise correlations can be plugged in as an additional dimension of convolution kernels to solve the problem of feature mining on special data (e.g., videos) [29,30]. Such improvements bring considerable advantages when processing PolSAR data with CNNs. Zhang et al.
introduced 3D operations for the first time to implement PolSAR classification [31], which effectively improved the performance of ordinary 2D-CNNs. Tan et al. integrated complex-valued and 3D operations, and proposed a complex-valued 3D-CNN for PolSAR classification [32]. However, the performance improvement brought by 3D convolutions comes at the cost of greatly increased model parameters [30]. A large number of model parameters limits the speed of classification, which hinders the practical deployment of 3D-CNNs and the development of real-time interpretation systems [33]. Lightweight alternatives to 3D convolution, e.g., pseudo-3D convolution [34] and depthwise separable convolution [35,36], are effective means of resolving this dilemma.
Based on the above analysis, the objective of this work is to find 3D-CNN architectures with low computational cost and competitive performance for PolSAR image classification. It can be observed that almost all model parameters of a CNN reside in the convolution and fully connected layers. For these two key components, lightweight strategies are developed in this paper to compress the network architecture and reduce the model complexity of 3D-CNNs. First, a pseudo-3D convolution-based CNN (P3D-CNN) is introduced, which replaces the convolution operations of 3D-CNNs with pseudo-3D convolutions. P3D-CNN uses two successive 2D operations to approximate the features extracted by 3D-CNNs. In addition, a 3D-depthwise separable convolution-based CNN (3DDW-CNN) is proposed in parallel. Different from P3D-CNN, 3DDW-CNN decouples the spatial-wise and channel-wise operations that were previously mixed together, in order to find more effective features than 3D-CNNs. The number of model parameters contained in the convolution layers can be greatly reduced in the two proposed lightweight architectures. Moreover, the fully connected layers of both architectures are eliminated and replaced by global average pooling layers [37]. This measure removes more than 90% of the model parameters of 3D-CNNs and greatly improves the computational efficiency. The dropout mechanism [38] is configured in the proposed architectures to further prevent over-fitting. The proposed architectures can be summarized as a lightweight 3D-CNN framework with more efficient convolution and fully connected operations, which may inspire the development of many other lightweight architectures. The number of trainable parameters and the computational complexity of the involved models are compared and analyzed, which illustrates the superiority of the lightweight architectures. The classification performance of the proposed methods is tested on three PolSAR benchmark datasets.
Experimental results show that considerable accuracy can be maintained by the proposed methods. The main contributions of this paper can be summarized as follows:
• Two lightweight 3D-CNN architectures are introduced for fast PolSAR interpretation during testing.
• A lightweight 3D-CNN framework is summarized. Compared with ordinary 3D-CNNs, the architectures under this framework have fewer model parameters and lower computational complexity.
• The performance of the lightweight architectures is verified on three PolSAR benchmark datasets.
The rest of this paper is organized as follows. In Section 2, the background of vanilla convolutions and their variants are introduced. The proposed methods are introduced in Section 3. The experimental results and analysis are presented in Section 4. The conclusion is discussed in Section 5.

Related Works
In this section, 2D convolution, 3D convolution and its lightweight versions, i.e., pseudo-3D convolution and 3D-depthwise separable convolution, are briefly analyzed. Formula expressions are avoided and graphical illustrations are used to facilitate understanding.

Vanilla Convolutions
2D convolution is the choice of most CNNs and can be used to extract information from the input maps. The process of the vanilla 2D convolution operation is shown in Figure 1, from which one can see that the output of a 2D convolution is always two-dimensional, i.e., one feature map, for any size of input. Therefore, 2D convolution can only extract spatial information, and it is ill-suited to data whose channels are correlated.
Figure 1. The process of vanilla 2D convolution. (b) When the input is c maps of size h × w, each kernel is k × k × c: the same operation as in (a) is performed on each channel and the resulting c 2D maps are summed. The outputs of the two sub-figures are 2D maps of the same size.
Vanilla 3D convolution (C3D) can be seen as an intuitive extension of 2D convolution, in which a dimension is added to extract more information [30]. As shown in Figure 1, the process of vanilla 2D convolution can be expressed as

z^(t,h) = Σ_{i=1}^{k_h} Σ_{j=1}^{k_w} x^(t)_{i,j} y^(h)_{i,j} + b^(h),

where t and h mean the t-th sliding window and the h-th convolution kernel, k_h and k_w represent the spatial kernel size, x^(t) and z^(t) denote the t-th input and the t-th output, and y^(h) and b^(h) denote the h-th kernel matrix and its bias. Similarly, C3D can be expressed as

z^(t,h) = Σ_{i=1}^{k_h} Σ_{j=1}^{k_w} Σ_{l=1}^{k_d} x^(t)_{i,j,l} y^(h)_{i,j,l} + b^(h),

where k_d represents the depth of the kernels. The process of C3D can be seen in Figure 2, where an extra depth dimension is added to the 2D convolution kernels. The difference between 2D and 3D convolutions can be seen by comparing Figure 1b with Figure 2a. Just as 2D convolutions maintain the spatial size of the inputs, 3D convolutions maintain the size of the depth dimension. In other words, a 2D convolution manipulates the input only spatially and always outputs maps, whereas C3D extracts features from the spatial and depth dimensions at the same time and outputs cubes. The latter undoubtedly contains more information as well as more model parameters to be trained.
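As an illustration (not taken from the paper), the single-kernel C3D operation above can be sketched in NumPy with an explicit sliding window; the input size and kernel size are arbitrary choices for demonstration:

```python
import numpy as np

def conv3d_single_kernel(x, y, b=0.0):
    """Naive valid-mode 3D convolution of one kernel y over a
    single-channel input cube x: each output entry is the windowed
    elementwise product summed, plus the bias b."""
    kh, kw, kd = y.shape
    H, W, D = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1, D - kd + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for l in range(out.shape[2]):
                window = x[i:i + kh, j:j + kw, l:l + kd]
                out[i, j, l] = np.sum(window * y) + b
    return out

# A 3x3x3 kernel applied to a 9x9x6 cube yields a 7x7x4 cube:
# the output keeps a depth dimension, unlike a 2D convolution,
# which would collapse the channels into a single map.
x = np.random.rand(9, 9, 6)
y = np.random.rand(3, 3, 3)
z = conv3d_single_kernel(x, y)
print(z.shape)  # (7, 7, 4)
```

The nested loops mirror the triple sum in the C3D formula; production code would of course use an optimized library routine instead.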
Figure 2. The process of vanilla 3D convolution (C3D). (b) With the same operations as in (a), c 3D cubes are obtained and summed. The outputs of the two sub-figures are 3D cubes of the same size.

Pseudo-3D Convolution
The process of pseudo-3D convolution (P3D) can be seen in Figure 3. Two successive sub-operations, acting on the spatial dimension and the depth dimension respectively, are used by P3D to simulate the effect of C3D. It has been proven that P3D can greatly reduce the number of trainable parameters while maintaining accuracy [34]. Figure 3. The process of pseudo-3D convolution (P3D). P3D is divided into two steps to achieve a low-latency approximation to C3D, with a nonlinear activation between the two. (a) Step 1: a 2D convolution is operated in the spatial dimension of the h × w × d input, with each kernel of size k × k × 1. (b) Step 2: a 1D convolution is operated in the depth dimension, with each kernel of size 1 × 1 × k, giving a final output of the same size as that of C3D. As shown in Figure 3, P3D decomposes the k × k × k C3D kernel into a k × k × 1 kernel and a 1 × 1 × k kernel. The number of model parameters of each kernel is thus reduced from k³ to k² + k = k(k + 1). Such divide-and-conquer modeling ideas are familiar and usually effective [35,39,40]. Intuitively, assigning clear task requirements to convolution operations can increase their productivity. Therefore, compared with C3D, P3D can not only reduce the number of model parameters but also slightly improve accuracy.
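The per-kernel saving of the P3D factorization can be checked with a few lines of arithmetic; this is a small illustrative sketch, not code from the paper:

```python
def c3d_params(k):
    # one vanilla 3D kernel: k x k x k weights
    return k ** 3

def p3d_params(k):
    # P3D factorization: a k x k x 1 spatial kernel
    # plus a 1 x 1 x k depth kernel, i.e. k^2 + k = k(k + 1)
    return k * k * 1 + 1 * 1 * k

for k in (3, 5, 7):
    saving = 1 - p3d_params(k) / c3d_params(k)
    print(k, c3d_params(k), p3d_params(k), f"{saving:.1%}")
# For k = 3, the 27 weights of a C3D kernel shrink to 12.
```

For the 3 × 3 × 3 kernels used later in the experiments, this is a reduction of more than half per kernel before any other compression is applied.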

3D-Depthwise Separable Convolution
The simultaneous existence of multiple convolution kernels underpins the powerful feature extraction capability of CNNs. In fact, the feature maps extracted by multiple convolution kernels can be regarded as many different kinds of features [41]. However, as the comparison of the two sub-figures in Figure 1 shows, multiple groups of convolution kernels multiply the number of parameters several times. Depthwise separable convolution [35] was proposed as an effective way to limit this increase in parameters; it realizes a very efficient replacement by decoupling the spatial and channel-wise operations of vanilla 2D convolution. For an h × w × c input map, 2D convolution kernels with a total size of k × k × c × c are required to produce an output of size h × w × c (performing the operation in Figure 1b c times with zero-padding). In contrast, depthwise separable convolution needs only kernels of size k × k × c × 1 + 1 × 1 × c × c to achieve the same effect. Given the good performance of depthwise separable convolution in 2D tasks, extending it to 3D tasks is a natural idea. A similar idea has also been considered in [36].
The improved strategy is straightforward: the 2D convolutions in 2D-depthwise separable convolution are replaced with 3D operations. The comparison between C3D and 3D-depthwise separable convolution is shown in Figure 4. It can be seen from Figure 4a that the C3D operation is implemented c times (different colors represent different groups of filters) to generate c 3D feature cubes. The process of 3D-depthwise separable convolution is shown in Figure 4b. Similar to the 2D case, 3D-depthwise separable convolution can also be divided into depthwise and pointwise operations. The kernels of the 3D depthwise convolution are shown in the second column of Figure 4b, and the 3D pointwise convolution kernels are shown in the fourth column. Obviously, the idea of depthwise separable convolution is inherited, and an extra dimension is added to implement 3D feature extraction. k × k × k × c × c model parameters are needed in Figure 4a; they can be decomposed into c 3D depthwise convolutions with k × k × k parameters each and c 3D pointwise convolutions with 1 × 1 × 1 × c parameters each. Therefore, the model complexity can be greatly reduced, which makes the operation usable with limited resources. Figure 4. The process of 3D-depthwise separable convolution in the same situation. All 2D operations of depthwise separable convolution are replaced by 3D operations in (b). First, a vanilla 3D convolution with a k × k × k kernel is applied to each channel of the input (3D depthwise convolution). Then, c 1 × 1 × 1 convolutions are applied to the intermediates (3D pointwise convolution), and an output of the same size as that of C3D is obtained.

Proposed Methods
In this section, the representation of PolSAR images is presented first; the PolSAR coherence matrix T is adopted as the starting point in this work. Then the implementation details of the proposed architectures are introduced.

Representation of PolSAR Images
A polarized scattering matrix can fully characterize the electromagnetic scattering properties of ground targets. The scattering matrix is defined as

S = [ S_HH  S_HV ; S_VH  S_VV ],

where S_PQ (P, Q ∈ {H, V}) represents the backscattering coefficient of the polarized electromagnetic wave emitted in the Q direction and received in the P direction. H and V represent the horizontal and vertical polarizations, respectively. According to the reciprocity theorem, the S matrix satisfies S_HV = S_VH. To describe the scattering properties of targets more clearly, the S matrix is usually transformed into the polarization coherence matrix or the polarization covariance matrix. The polarization vector and coherence matrix based on the Pauli decomposition are expressed as (4) and (5):

k = (1/√2) [ S_HH + S_VV, S_HH − S_VV, 2S_HV ]^T, (4)

[T] = k k^H. (5)
The polarization coherence matrix T is a Hermitian matrix, and all its off-diagonal elements are complex numbers. Generally, the upper triangular elements [T_11, T_12, T_13, T_22, T_23, T_33] are taken, and the complex ones are divided into their real and imaginary parts as the input of CNNs. At this point, nine real-valued numbers describe each pixel of a PolSAR image. The original 3D-CNN architecture used for PolSAR image classification [31] is shown in Figure 6a. Compared to that work, a deeper architecture can be seen in Figure 6b, in which the updated network has three additional convolution layers and the network width is reduced to alleviate the adverse effects of the increased depth. Such a 3D architecture can not only mine spatial relations but also explore the correlations between different elements of the polarization coherence matrix so as to extract more comprehensive information. Therefore, this architecture is chosen as the backbone of this paper. It is worth noting that building the network backbone is not the objective of this paper; rather, it allows the proposed lightweight methods to be compared with the ordinary ones in a fair environment. Although 3D-CNNs showed promising performance [31], they also brought slower interpretation due to more model parameters and higher complexity. For the architecture in Figure 6b, the computational difficulty centers mainly on the convolution and fully connected layers. Thus, lightweight improvements designed for these two parts are implemented.
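A minimal NumPy sketch of the nine-real-channel input representation described above is given below; the channel ordering is an illustrative assumption, as the paper does not specify one:

```python
import numpy as np

def t_matrix_to_channels(T):
    """Stack the six upper-triangular entries of a coherence matrix
    field T (H x W x 3 x 3, complex) into 9 real channels: the three
    real diagonals plus the real and imaginary parts of T12, T13, T23."""
    channels = [
        T[..., 0, 0].real, T[..., 1, 1].real, T[..., 2, 2].real,
        T[..., 0, 1].real, T[..., 0, 1].imag,
        T[..., 0, 2].real, T[..., 0, 2].imag,
        T[..., 1, 2].real, T[..., 1, 2].imag,
    ]
    return np.stack(channels, axis=-1)

# A random Hermitian field as a stand-in for real PolSAR data:
# A + A^H is Hermitian, so its diagonal is real as required.
A = np.random.rand(4, 5, 3, 3) + 1j * np.random.rand(4, 5, 3, 3)
T = A + np.conj(np.swapaxes(A, -1, -2))
x = t_matrix_to_channels(T)
print(x.shape)  # (4, 5, 9)
```

Each pixel is thus described by nine real values, and for the 3D architectures these nine channels form the depth dimension over which the 3D kernels slide.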

Lightweight 3D-CNNs for PolSAR Classification
The C3D operations of 3D-CNNs are replaced with the two lightweight convolution operations introduced above to reduce the computational complexity of the convolution layers. It can be seen from Figure 6c that only the type of convolution is changed; the depth, width, and kernel sizes are not modified. It can easily be shown that the lightweight convolution layers contain a similar number of model parameters to the 2D layers, and only half or even less than that of the C3D layers. A more detailed analysis of the changes in the number of model parameters is given later.
In the architecture shown in Figure 6a, the data is expanded into a 1D vector and enters the fully connected layers once the convolution processing is finished. The role of the fully connected layers is to reduce the dimension of the outputs of the convolution layers. The results of the fully connected layers are activated by the softmax function to achieve classification, which can be defined as

σ_softmax(x)_i = exp(x_i) / Σ_{k=1}^{j} exp(x_k),

where σ_softmax(x) means the softmax activation of the input x, i denotes the i-th category, and j is the number of categories. Thus, a j × 1 vector whose elements represent the probabilities of belonging to the corresponding categories is obtained as the final prediction. Improvements to the fully connected layers have attracted much attention because they hold more than 90% of the model parameters of CNNs. Global average pooling (GAP) has been shown to be a plug-and-play replacement for fully connected layers that saves computational resources [37]. As can be seen from Figure 7a, m × m three-channel 2D feature maps are flattened into a vector as the input of the fully connected layers. When the number of categories is J and the number of hidden nodes of the fully connected layers is H, the total number of parameters of the two fully connected layers is (m × m × 3 × H) + (H × J). When the input becomes multi-channel 3D features, this large number of parameters is further multiplied by the depth of the features. Such a large number of parameters not only brings computational difficulties but also increases the risk of over-fitting. In the proposed architectures, spatial global average pooling is performed as shown in Figure 7b. For each channel of the output feature cube, the process of the used GAP can be defined as

y = (1 / (h · w · d)) Σ_{i=1}^{h} Σ_{j=1}^{w} Σ_{l=1}^{d} x_{i,j,l},

where x and y represent the input and output of the GAP layer, d denotes the depth of the feature cube, and h and w represent its height and width.
The above operation is performed on each channel of the multi-channel 3D feature cubes, which greatly reduces the number of model parameters and thus cuts down the computational cost.
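The per-channel GAP step can be sketched as a simple mean reduction; the feature sizes below (8 × 8 × 6 cubes over 32 channels) are illustrative assumptions, not values from the paper's architecture:

```python
import numpy as np

def gap3d(x):
    """Global average pooling of one h x w x d feature cube:
    the whole cube collapses to a single scalar (its mean)."""
    return x.mean()

# A layer output of c cubes (h x w x d x c) is reduced to c values,
# versus the h*w*d*c inputs a fully connected layer would weight.
feat = np.random.rand(8, 8, 6, 32)
pooled = feat.mean(axis=(0, 1, 2))
print(pooled.shape)  # (32,)
```

Because the pooled vector's length equals the channel count, the classifier that follows needs no large weight matrix, which is exactly why GAP removes most fully connected parameters.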

Experiments
In this section, to evaluate the performance of the proposed methods, they are tested on three PolSAR benchmark datasets and compared with several alternatives. The experimental environment is a PC with an Intel Core i7-7700 CPU and 16 GB RAM. A deep learning toolbox [42] is utilized to minimize the difficulty of algorithm implementation.

Datasets and Settings
Three widely-used PolSAR benchmark datasets are employed in the experiments: AIRSAR Flevoland, ESAR Oberpfaffenhofen, and EMISAR Foulum. Figures 8-10 show their Pauli maps and ground truth maps, respectively.

AIRSAR Flevoland
As shown in Figure 8, an L-band, full polarimetric image of the agricultural region of the Netherlands is obtained through NASA/Jet Propulsion Laboratory AIRSAR [43]. The size of this image is 750 × 1024 and the spatial resolution is 0.6 m × 1.6 m. The ground truth map is shown in Figure 8b, which is adapted from [44]. There are 15 kinds of ground objects including buildings, rapeseed, beet, stem beans, peas, forest, lucerne, potatoes, bare soil, grass, barley, water, and three kinds of wheat, and a total of 184,592 image slices are contained in this dataset. The details of each category are shown in Table 1.

ESAR Oberpfaffenhofen
An L-band, full polarimetric image of Oberpfaffenhofen, Germany, with a scene size of 1200 × 1300, was obtained through the ESAR airborne platform [43]. Its Pauli color-coded image and ground truth map can be seen in Figure 9. The ground truth map is adapted from [45]. According to the ground truth, each pixel in the map is divided into three categories, built-up areas, wood land, and open areas, except for some unknown regions. A total of 1,307,142 image slices are contained in this dataset. The details of each category are shown in Table 2.

EMISAR Foulum
The last full polarimetric image used in the experiments is an L-band image taken by EMISAR over Foulum, Denmark. EMISAR is a fully polarimetric airborne SAR operating in the L and C bands with a resolution of 2 m × 2 m, mainly acquired and studied by the Danish Center for Remote Sensing (DCRS). Figure 10 shows its Pauli RGB image and ground truth map. The size of this image is 1000 × 1750. The labeling of the terrains in Figure 10b refers to [46,47], and each pixel in the map is divided into seven categories: lake, buildings, forest, peas, winter rape, winter wheat, and beet. There are 431,088 image slices in this dataset. The details of each category are shown in Table 3.

Experimental Setup
To validate the significance of the proposed PolSAR image classification framework, an ordinary CNN (CNN), a 2D-depthwise separable convolution CNN (DW-CNN), and a C3D-based CNN (3D-CNN) are chosen for comparison. Their architectures and hyperparameters are set as in Figure 6b, except for the type of convolution. The two proposed classifiers are denoted P3D-CNN and 3DDW-CNN for convenience. During training and testing, the kernel sizes are 3 × 3 for 2D convolutions and 3 × 3 × 3 for 3D convolutions. The dropout rate is 0.8 for the fully connected layers. An improved stochastic gradient descent optimization method [48] is chosen to train the involved architectures with a learning rate of 0.001.
To evaluate the performance of the algorithms mentioned in this paper, the overall accuracy (OA) and the kappa coefficient (Kappa) [49] are chosen as criteria. OA can be defined as

OA = Σ_{i=1}^{c} M_i / Σ_{i=1}^{c} N_i,

where c is the number of categories, and M_i and N_i denote the number of correctly classified samples and the total number of samples of the i-th category, respectively. Kappa can be defined as

Kappa = (n Σ_{i=1}^{c} H_ii − Σ_{i=1}^{c} H_i+ H_+i) / (n² − Σ_{i=1}^{c} H_i+ H_+i),

where n is the number of testing samples, H denotes the classification confusion matrix, and H_i+ and H_+i denote the sums of its i-th row and i-th column. The number of training epochs is important, as it determines whether the model converges. 9000 and 4500 samples of the AIRSAR Flevoland dataset, and 7000 and 3500 samples of the EMISAR Foulum dataset, are randomly chosen without overlaps as training and validation sets. Experiments with 3D-CNN are then carried out to find a suitable number of training epochs. The experimental results are shown in Figure 11.
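The two criteria can be computed directly from a confusion matrix; the small 3 × 3 matrix below is invented for illustration and is not experimental data from the paper:

```python
import numpy as np

def oa_and_kappa(H):
    """Overall accuracy and kappa from a c x c confusion matrix H,
    where H[i, j] counts samples of true class i predicted as j."""
    n = H.sum()
    oa = np.trace(H) / n
    # chance agreement from the row and column marginals
    expected = (H.sum(axis=1) * H.sum(axis=0)).sum() / n ** 2
    kappa = (oa - expected) / (1 - expected)
    return oa, kappa

H = np.array([[50, 2, 1],
              [3, 40, 2],
              [1, 1, 45]])
oa, kappa = oa_and_kappa(H)
print(round(oa, 4), round(kappa, 4))
```

Kappa discounts the agreement that would occur by chance, which is why it is reported alongside OA for imbalanced PolSAR class distributions.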
One can see from the experimental results that the training accuracy tends to stabilize after 100 epochs and the validation accuracy changes little after 200 epochs. Combining these two points, the number of epochs is set to 250 in the experiments to ensure convergence. When the training reaches this upper limit, the model with the highest OA on the validation set is selected as the final trained model to ensure the stability of the training process.
The size of the training set also needs to be carefully considered. Comparative experiments are conducted to find an appropriate number of training samples, in order to save memory as much as possible while guaranteeing the training effect. In the experiments, a certain number of samples (each size being twice the previous one for easy analysis) are randomly extracted from each category of labeled samples to form the training set. Two basic models, CNN [12] and 3D-CNN [31], are tested on different training sets of the three benchmarks. Note that the buildings category contains few samples, so in the experiments on the AIRSAR dataset the number of training samples for buildings is fixed to 600 whenever the extracted number exceeds 600. The experimental results are listed in Table 4, from which it can be seen that as the scale of the training set grows, the accuracy of both CNN and 3D-CNN shows an upward trend. Results on the AIRSAR and EMISAR datasets show that this upward trend eases when the number of training samples per category exceeds 1200 and 4000, respectively. Although a larger training set brings a slight improvement on the ESAR dataset, 1000 samples per category meet our needs. After obtaining the training set, samples numbering half of the training set are extracted from the remaining samples to form the validation set, and 30% of the remainder are taken as the testing set.

Results and Comparisons
Under the experimental environment and settings described earlier, the classification results of different methods are shown in Figures 12-14, and the accuracies are listed in Tables 5-7, respectively. Generally, the proposed methods achieve better performance than the compared ones. The experimental results on the AIRSAR Flevoland dataset can be seen from Table 5 and the classification results of the whole map are listed in Figure 12.
The results in Table 5 show that the proposed methods slightly improve the classification accuracy on this dataset. From the experimental results, it can be seen that the 3D networks perform better than the 2D networks, which confirms the importance of 3D convolutions for PolSAR classification. Furthermore, the OA and Kappa of the lightweight 3D convolution-based methods are higher than those of the ordinary 3D-CNN, especially in the identification of the rapeseed and wheat categories. This indicates that there is potential redundancy in C3D operations and that the lightweight strategies can improve not only the computational efficiency but also the classification performance. From the whole-map classification results in Figure 12, it can be seen that the proposed methods have a stronger capability for distinguishing between forest and grass. In addition, apart from rapeseed and the three types of wheat, the proposed methods are also effective for classifying beet and potatoes.
The experimental results on ESAR Oberpfaffenhofen can be seen in Table 6, and the classification results of the whole map are shown in Figure 13. On this dataset, the analysis is generally consistent with the previous one. The 3D models achieve better results than the 2D models under different criteria. In these experiments, 3DDW-CNN achieves the best performance, with a 1.37% improvement in OA and a 2.04% improvement in Kappa compared with the ordinary 3D-CNN. The P3D-based model also makes progress. Similar conclusions on different datasets further confirm the generalization performance of the proposed methods. The results overlaid with the ground truth map on ESAR Oberpfaffenhofen are shown in Figure 13, where it can be seen that serious confusion exists between built-up areas and woodland for the 2D models. This phenomenon is weakened in 3D-CNN, and the proposed methods further alleviate the problem. In addition, compared with the other methods, the proposed methods produce more complete and pure classification results for the open areas.
The experimental results on EMISAR Foulum can be seen in Table 7, and the classification results of the whole map are shown in Figure 14. Compared with the former two datasets, the EMISAR Foulum data, which contain quite complex terrain information, are not as widely used. Similar conclusions can be drawn from the experimental results shown in Table 7, where the proposed P3D-CNN achieves the best classification results. It is worth pointing out that although the results of 3DDW-CNN are slightly lower than those of 3D-CNN, such a small performance degradation (about 0.03%) is acceptable given the reduced computational complexity. One can see from Figure 14 that the following pairs of objects are easily misclassified: lake-peas, peas-winter wheat, and buildings-forest. The proposed methods show competitive performance in solving these problems overall, although the result of P3D-CNN for the lake is not very good.

Studies of Complexity
In the previous experiments, the classification performance of the proposed methods was verified. In this part, the number of trainable parameters and the computational complexity of the proposed methods are analyzed. An intuitive comparison of the number of trainable parameters and the overall accuracy of the involved models on the AIRSAR Flevoland dataset can be seen in Table 8. As shown in Table 8, P3D-CNN contains half the convolution-layer parameters of 3D-CNN, which is 1.44 times that of the 2D-CNN. 3DDW-CNN is even lighter, cutting about 70% of the trainable parameters in the convolution layers of 3D-CNN. As GAP is introduced to replace the fully connected layers, the total number of parameters contained in the models is greatly reduced. Meanwhile, the two proposed methods not only maintain the accuracy of 3D-CNN but also improve it slightly.
Furthermore, the value of the floating point operations (FLOPs) in the convolution layers of each method is calculated, which is a popular evaluation metric to compare the complexity of algorithms. FLOPs in convolution layers of the proposed and comparing methods are calculated. Then the comparison combining accuracy and complexity can be seen from Figure 15, in which the x-axis represents the value of convolution FLOPs, and the y-axis represents the overall accuracy. Four involved methods, i.e., CNN, two proposed ones, and 3D-CNN, are shown in the figure from the left to the right. Each one has three bars, which represent its OA on the three different datasets.
It can be seen that the proposed methods, i.e., the middle two of the four columns, not only have lower FLOPs but also slightly improve the classification accuracy compared with 3D-CNN (the rightmost column). This result verifies the theoretical analysis.

Conclusion
Inspired by recent lightweight improvements for deep neural networks, two lightweight 3D-CNN architectures are proposed in this paper for PolSAR image classification. Lightweight 3D convolutions, i.e., pseudo-3D and 3D-depthwise separable convolutions, are introduced to perform feature extraction and reduce the redundancy of 3D convolutions. Meanwhile, global average pooling is introduced to replace the fully connected layers, considering the huge number of model parameters they contain. In this way, over 90% of the model parameters of 3D-CNNs can be compressed so as to support high-precision interpretation on resource-constrained systems. Moreover, a general lightweight 3D-CNN framework is summarized, which can guide future research. Such a PolSAR-tailored classification framework can not only improve the running speed but also boost the performance of the convolutions. Experimental results on three PolSAR benchmark datasets show that the proposed architectures have promising classification performance and low computational complexity. In the future, complex-valued CNN architectures, weakly-supervised classification methods, and automatic hyperparameter optimization are all issues we will consider.