PolSAR image classification refers to assigning each pixel to a certain terrain type. Thus, in this paper, we propose a pixel-based classification approach named SPCNN. The proposed method consists of the following steps: 1. Construct a 3-D tensor block to represent each pixel of the PolSAR image, which contains rich channel-spatial information and is suitable as CNN input (a construction sketch is given below). 2. Design a 6-layer SPCNN to extract the channel-spatial features and conduct per-pixel classification. 3. Train the SPCNN.
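As a minimal illustration of the first step, the following Python sketch cuts a $w \times w \times d$ tensor block around each pixel of a channel-stacked PolSAR image. The window size, channel depth and reflection padding at the image borders are illustrative assumptions, not the exact construction described earlier in the paper.

```python
import numpy as np

def extract_blocks(image, w=9):
    """Cut a w x w x d tensor block around every pixel of an H x W x d image.

    Border pixels are handled by reflection padding; both the window size w
    and the padding scheme are illustrative assumptions.
    """
    r = w // 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    H, W, d = image.shape
    blocks = np.empty((H * W, w, w, d), dtype=image.dtype)
    for i in range(H):
        for j in range(W):
            blocks[i * W + j] = padded[i:i + w, j:j + w, :]
    return blocks
```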
2.2. Network Architecture
The architecture of SPCNN is shown in Figure 3; it contains an input layer, three convolutional layers, a fully connected layer and a softmax classifier [36] connected to the output. The input layer has the size of the input tensor block, denoted $w \times w \times d$, with spatial window $w$ and channel depth $d$. The size of the convolution filters in each convolutional layer is $3 \times 3 \times h \times g$, where $h$ and $g$ denote the value of the third dimension and the number of convolution filters, respectively. We use small convolution filters because filters with a small size (such as $3 \times 3$) combined with deeper architectures can generally obtain better results [37]. Using multiple small convolution filters instead of one large convolution filter yields the same receptive field with fewer parameters; for example, the receptive field of each pixel in the feature map obtained by two stacked $3 \times 3$ convolutions is the same as that of a single $5 \times 5$ convolution. Suppose that the number of convolution filters per convolutional layer is $k$; then the number of parameters of two $3 \times 3$ convolutions is $2 \times (3 \times 3 \times k) \times k = 18k^{2}$, which is less than the $(5 \times 5 \times k) \times k = 25k^{2}$ parameters of one $5 \times 5$ convolution.
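A quick arithmetic check of this parameter count in Python (biases ignored, $k$ input and $k$ output channels per layer):

```python
k = 64                           # channels per layer (illustrative value)
two_3x3 = 2 * (3 * 3 * k) * k    # two stacked 3x3 convolutions: 18k^2
one_5x5 = (5 * 5 * k) * k        # one 5x5 convolution: 25k^2
print(two_3x3, one_5x5)          # 73728 < 102400, i.e., 18k^2 < 25k^2
```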
In our network, the first convolutional layer has 64 filters of size $3 \times 3 \times d$, and the second and third convolutional layers have 32 filters each, of size $3 \times 3 \times 64$ and $3 \times 3 \times 32$, respectively. All convolution strides are fixed to 1 pixel. The fully connected layer contains 128 neurons, which produces a classification feature vector of length 128. The number of neurons in the output layer is $c$, which is equal to the number of classes. The rectified linear unit (ReLU) activation function [38] is applied to the three convolutional layers and the fully connected layer. In terms of training time with gradient descent, ReLU tends to be more efficient than other activation functions [15].
As shown in
Figure 3, the pooling layer is not used in our network. The pooling layer is commonly used to reduce the size of the feature maps, thereby reducing the number of parameters in the network. Nonetheless, the pooling operation may also lose useful information. Differing from image-based classification models, SPCNN is used for pixel-based classification; its input tensor block consists of pixels in a small spatial neighborhood, so it is unnecessary to carry out the pooling operation.
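The sketch below is one way to realize this architecture in PyTorch. The block size $w$, channel depth $d$ and the use of unpadded (valid) convolutions are assumptions not fixed by this section, and the softmax is deferred to the loss function, as is idiomatic; this is an illustrative reading of Figure 3, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SPCNN(nn.Module):
    """Sketch of SPCNN: three convolutional layers without pooling,
    a 128-neuron fully connected layer and a c-way output.
    Block size w and channel depth d are assumptions."""

    def __init__(self, d, w, c):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(d, 64, kernel_size=3, stride=1), nn.ReLU(),   # 64 filters
            nn.Conv2d(64, 32, kernel_size=3, stride=1), nn.ReLU(),  # 32 filters
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),  # 32 filters
        )
        m = w - 6  # three unpadded 3x3 convolutions shrink w by 6 in total
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * m * m, 128), nn.ReLU(),  # 128-D feature vector
            nn.Linear(128, c),  # class logits; softmax is applied in the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```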
2.3. Training SPCNN
Suppose $\{(x_i, y_i)\}_{i=1}^{n}$ is the training dataset, in which $x_i$ denotes the $i$-th observed sample and $y_i$ represents its ground-truth label during training. Training samples are fed into the input layer of size $w \times w \times d$. The samples are then filtered by the 64 convolution filters of the first convolutional layer; after nonlinear mapping, this layer produces 64 feature maps. Taking the 64 feature maps as input, the second convolutional layer produces 32 feature maps, and the third convolutional layer likewise produces 32 feature maps. After the fully connected layer, the 128-D classification feature vector is obtained, which is used as the input of the softmax classifier. Finally, SPCNN produces the predicted probability distribution over all classes for each pixel.
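Continuing the sketch above, a forward pass with illustrative values ($d = 9$, $w = 9$, $c = 5$) confirms the shape of the predicted distributions:

```python
model = SPCNN(d=9, w=9, c=5)            # illustrative depth, block size, class count
x = torch.randn(16, 9, 9, 9)            # a mini-batch of 16 input tensor blocks
probs = torch.softmax(model(x), dim=1)  # predicted distributions, shape (16, 5)
print(probs.shape, probs.sum(dim=1))    # each row sums to 1
```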
The loss function $L$ of the softmax classifier can be formulated as follows:

$$L(W, b) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} 1\{y_i = j\} \log \frac{e^{w_j^{T} a_i}}{\sum_{l=1}^{c} e^{w_l^{T} a_i}} \qquad (1)$$

Here, $c$ represents the number of classes, $w_j$ ($T$ is the transpose operator) denotes the weight vector of the $j$-th neuron of the output layer, $a_i$ is the output vector of the $i$-th sample, and $1\{\cdot\}$ is an indicator function: when $y_i = j$, $1\{y_i = j\}$ is equal to 1 and 0 otherwise.
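For the self-paced weighting described next, the per-sample losses $L_i$ are needed individually; in PyTorch this corresponds to cross-entropy with `reduction="none"`. A sketch:

```python
import torch.nn.functional as F

def per_sample_loss(logits, labels):
    # Softmax cross-entropy of Equation (1), one value per sample,
    # so that each L_i can later be compared with the age parameter.
    return F.cross_entropy(logits, labels, reduction="none")
```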
Differing from a traditional CNN, SPCNN first learns the easier samples of the training dataset and then gradually involves more samples in the training process. The weight variable $v = [v_1, v_2, \ldots, v_n]^{T}$ is introduced to represent the learning difficulty of the samples, where $v_i$ denotes the weight of the $i$-th sample. The optimization goal of SPCNN is to minimize the training loss under the weight distribution of the samples. Hence, according to the SPL model proposed by Kumar et al. [27], the objective function of SPCNN can be expressed as the sum of the weighted loss of Equation (1) and the regularization term $f(v; \lambda)$:

$$\min_{W, b, v} E(W, b, v; \lambda) = \sum_{i=1}^{n} v_i L_i(W, b) + f(v; \lambda) \qquad (2)$$
where $W$ and $b$ are the trainable parameters of the network, denoting the weight matrices and bias vectors, respectively, and $L_i(W, b)$ is the loss of the $i$-th sample as in Equation (1). $\lambda$ is the age parameter that controls the learning process, which is initialized before training. Here, the regularization term $f(v; \lambda)$ shown in Equation (3) is the self-paced regularizer, which determines the values of $v$. Meng et al. [39] have proposed several typical self-paced regularization terms. In our method, the binary regularization term is adopted; under its constraint, the weight of each sample is binary (0 or 1). The binary regularization term can be expressed as:

$$f(v; \lambda) = -\lambda \sum_{i=1}^{n} v_i \qquad (3)$$
Substituting Equation (3) into Equation (2), simplifying, and fixing $W$ and $b$, the weight $v_i^{*}$ that denotes the difficulty level of the $i$-th sample can be calculated by minimizing Equation (4), where $L_i$ is the abbreviation for $L_i(W, b)$:

$$v^{*} = \arg\min_{v \in \{0, 1\}^{n}} \sum_{i=1}^{n} \left( v_i L_i - \lambda v_i \right) \qquad (4)$$

whose closed-form solution is

$$v_i^{*} = \begin{cases} 1, & L_i < \lambda \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

Equation (5) illustrates that when the training loss of a sample is less than the age parameter $\lambda$, the sample is considered easy and its weight is set to 1; otherwise, its weight is set to 0. In the training process of SPCNN, the value of $\lambda$ gradually increases according to $\lambda^{(t+1)} = k \lambda^{(t)}$, where $k > 1$ is the pace parameter. As the age parameter becomes larger, the model incorporates more difficult samples into training.
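Under the binary regularizer, the weight update of Equation (5) reduces to a single thresholding step; a sketch:

```python
def spl_weights(losses, lam):
    # Equation (5): v_i = 1 if L_i < lambda, else 0 (binary regularizer).
    return (losses < lam).float()
```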
There are three groups of variables ($W$, $b$ and $v$) in the objective function of Equation (2), which are difficult to optimize simultaneously. We obtain the solution by alternating optimization according to the following steps; a code sketch of the complete loop follows the steps.
Step 1: Initialize the parameters of SPCNN: $W$, $b$ and $\lambda$. For the $W$ initializer, the scale of initialization is determined based on the numbers of input and output neurons [40]. The bias $b$ is simply initialized to 0. In general, the range of the training loss values is needed in advance to determine the initial value of $\lambda$. In our experiments, the initial value of $\lambda$ is set to the first quartile of the losses during training, namely the first cut point when all losses are sorted from small to large and divided into four equal parts. Then set the model optimization parameters, including the number of epochs, the learning rate and the pace parameter $k$.
Step 2: Apply the mini-batch gradient descent algorithm and backpropagation to train the model.
Step 2.1: Select a mini-batch of samples to feed into the network.
Step 2.2: Fix the parameters $W$ and $b$, obtain the output vector and training loss $L_i$ for each input sample through forward propagation, and then calculate the weight variable $v$ by Equation (5).
Step 2.3: Fix the weight variable $v$ and update the parameters $W$ and $b$ by mini-batch stochastic gradient descent (SGD) with momentum.
Step 2.4: Select a new mini-batch to continue optimizing the parameters until all the samples have been used.
Step 3: Update the age parameter, $\lambda \leftarrow k \lambda$. Repeat Step 2 until the number of epochs reaches the predefined threshold.
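Putting the steps together, the sketch below alternates the weight update of Equation (5) with weighted SGD steps and multiplies $\lambda$ by the pace parameter $k$ after each epoch. It reuses `per_sample_loss` and `spl_weights` from the sketches above; the learning rate, momentum, epoch count and pace value are illustrative choices, not the paper's reported settings.

```python
import torch

def train_spcnn(model, loader, epochs=50, lr=0.01, k=1.1):
    """Alternating optimization of W, b and v (illustrative hyper-parameters)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    lam = None                                      # age parameter lambda
    for epoch in range(epochs):
        for x, y in loader:                         # Step 2.1: next mini-batch
            losses = per_sample_loss(model(x), y)   # Step 2.2: forward pass, L_i
            if lam is None:                         # Step 1: first quartile of losses
                lam = losses.detach().quantile(0.25).item()
            v = spl_weights(losses.detach(), lam)   # Step 2.2: Equation (5)
            if v.sum() > 0:                         # Step 2.3: weighted SGD update
                opt.zero_grad()
                (v * losses).sum().backward()       # minimizes sum_i v_i * L_i
                opt.step()
        lam *= k                                    # Step 3: grow the age parameter
    return model
```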