3D Convolutional Neural Networks for Crop Classification with Multi-Temporal Remote Sensing Images

This study describes a novel three-dimensional (3D) convolutional neural network (CNN) based method that automatically classifies crops from spatio-temporal remote sensing images. First, a 3D kernel is designed according to the structure of multi-spectral multi-temporal remote sensing data. Second, a 3D CNN framework with fine-tuned parameters is designed for training 3D crop samples and learning spatio-temporal discriminative representations, with the full crop growth cycles being preserved. In addition, we introduce an active learning strategy into the CNN model to improve labelling accuracy up to a required threshold with the least effort. Finally, experiments are carried out to test the advantage of the 3D CNN in comparison to a two-dimensional (2D) CNN and other conventional methods. Our experiments show that the 3D CNN is especially suitable for characterizing the dynamics of crop growth and outperforms the other mainstream methods.


Introduction
Benefiting from the huge number of high-quality spectral-temporal images captured from Earth observation satellites or unmanned aerial vehicles (UAVs), automatic crop classification [1,2] is becoming a fundamental technology for yield estimation, economic assessment, crop transportation, etc. Conventional classification methods, such as the support vector machine (SVM) [3], K-nearest neighbor (KNN) [4], and maximum likelihood classification (MLC) [5], have been successfully applied to crop classification. However, these methods may be insufficient in complex agricultural circumstances. For example, many agricultural sectors of Chinese local governments still rely on heavy manual labelling work to ensure accurate reports. Therefore, it is still very important to develop new crop classification technologies.
Classification methods in remote sensing mainly consider two aspects: a feature extractor that transforms spatial, spectral, and/or temporal data into discriminative feature vectors, and a classifier that assigns each feature vector to a certain type. For crop or vegetation classification, the spatial and spectral features are typically extracted in two ways. One is to aggregate spectral bands into vegetation indices that represent the physical characteristics of vegetation, of which the normalized difference vegetation index (NDVI) is the most used. Wardlow et al. analyzed time-series MODIS 250 m enhanced vegetation index (EVI) and NDVI datasets for crop classification [1]. Xiao et al. also utilized NDVI. In contrast, Kussul et al. introduced two CNNs for crop classification: a 2D CNN for spatial feature learning and a one-dimensional (1D) CNN for spectral feature learning [28]. However, this strategy requires additional empirical processing to combine the learned features, and hinders a fully automatic representation learning procedure.
Three-dimensional (3D) convolution naturally suits spatio-temporal representations. Recently, some studies have utilized 3D CNNs for learning spatio-temporal features from videos [29,30], learning 3D structures from LiDAR point clouds [31], or learning spatio-spectral representations from hyperspectral images [32]. In general, 3D CNNs are not as widely applied as 2D CNNs, as the temporal dimension is usually not considered in computer vision and machine learning. Remote sensing images, however, generally provide dynamic or temporal information from which more could be extracted. For example, the relations among multi-temporal images of a certain crop are explicit: the rice growth cycle includes germination, tillering, spike differentiation, heading and flowering, milking, and a mature stage. Yet these temporal features are often partially ignored or represented by simplistic models. Some studies [5,12,13] simply concatenate temporal images to represent period information, while in [8], only images of the early growing season, late growing season, and dormant stage were selected. In theory, a proper 3D convolution can extract these spatial and temporal features simultaneously, in a more delicate and rigorous manner than a direct concatenation of reflectance images. In this study, we develop a 3D CNN framework to extract information from multi-temporal images and compare it to a 2D CNN and to conventional empirical methods such as SVM and KNN.
Deep learning requires sufficient manual samples, which are difficult to obtain in our situation, where crop types, varieties of a certain crop, and planting seasons vary from time to time. To achieve a satisfactory learning result with limited samples and a reduced amount of labor, we introduce a semi-automatic, semi-supervised active learning strategy [33]. Active learning iteratively picks the most helpful unlabeled samples (i.e., those expected to best improve model performance), according to the label scores predicted by the current CNN model, for manual checking and model retraining.
To our knowledge, a 3D CNN has not previously been applied to crop classification using multi-spectral multi-temporal remote sensing images, nor compared to a 2D CNN and other conventional methods in this setting. In Section 2, we introduce a 3D convolutional kernel, the 3D CNN structure, and an active learning strategy for crop classification. In Section 3, three tests are carried out to evaluate 3D CNN performance compared to the 2D CNN and conventional methods. Discussions and conclusions are given in Sections 4 and 5, respectively.

3D Convolution for Multi-Temporal Multi-Spectral Images
A deep CNN utilizes a multi-layer linear transformation wx + b, followed by a non-linear activation σ, to learn multi-level representations of the input data that are needed for the subsequent classification or detection:

y^l = σ(w^l x^(l−1) + b^l), (1)

where w is the weight template (i.e., convolution kernel), b is the bias vector of the current layer l, and x is the neuron input from the previous layer l − 1. w and b are the unknown parameters to be trained. The activation function can be a sigmoid or a rectified linear unit (ReLU) [34].
The form of the original input x^0 depends on the specific application. In crop classification, x^0 is typically collected from spectral-temporal images. A simple way is to treat all of the channels independently and clip same-size window patches around sample points or from shape files. A 2D convolution operation is commonly used to aggregate raw information into a more abstract representation for the next layer:

y_cd = σ( Σ_n Σ_{i,j} w_{ij,n} x_{(c+i)(d+j),n} + b ), (2)

where w_{ij,n} is the n-th shared M × M weight template for a certain feature, N denotes the number of original channels or feature maps of the previous layer, and x_{cd} and y_{cd} represent the input and output activation at location (c, d), respectively. For simplicity, we omit the layer notation l from Equation (1).
A 2D convolution operation collapses all of the information from separate channels onto a 2D plane (see Figure 1a). To prevent this, a 3D convolution can be used to extract and preserve the dynamic features through consecutive periods (see Figure 1b):

y_cde = σ( Σ_n Σ_k Σ_{i,j} w_{kij,n} x_{(c+i)(d+j)(e+k),n} + b ), (3)

where w_{kij} is a 3D tensor, k is the temporal indicator with length K, n indicates the n-th feature map of the previous layer, and x_{cde} and y_{cde} are the input and output activation at location (c, d, e), respectively. In practice, the temporal images also consist of multi-spectral channels; that is, the spatial, spectral, and temporal dimensions form a four-dimensional (4D) tensor. To treat the spatio-temporal information as the main characteristic, the spectral bands must be treated independently in the 3D convolution, just like the R, G, and B bands in a conventional 2D CNN. As in Figure 2, all temporal images of a certain wave band, for example red, are stacked up to form a 3D tensor, followed by a "green" stack, and so on. The indicator n in Equation (3) equals the stack number in the first layer of the 3D CNN. Conversely, to treat the spatio-spectral information as the main feature, one should instead stack the spectral images of a certain time first.
Figure 1. (a) 2D convolution, where ⊕ is the sum operator and all features (in different colors) in the temporal direction are collapsed. (b) 3D convolution, where ⊗3 indicates a 3D convolution operator with length 3 in the temporal direction. The operator is executed three times sequentially (red, green and blue arrows) through the temporal direction; the features pointed to by same-color arrows then contain temporal information, and the output map is also a 3D tensor.
Figure 2. The 3D convolution strategy for multi-temporal multi-spectral image input in this study. A 3D feature map is obtained after 3D convolution on the spatio-temporal images and accumulation over the different spectral bands (denoted by different border colors).
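The stacking arrangement of Figure 2 and the convolution of Equation (3) can be sketched in a few lines. The following is a minimal, illustrative implementation (function names and the naive triple loop are our own, not from the paper): each spectral band contributes one spatio-temporal stack (the index n in Equation (3)), and a single 3D kernel slides over the temporal, row, and column directions, summing over all band stacks.

```python
import numpy as np

def conv3d_valid(x, w, b=0.0):
    """Naive 3D convolution in the spirit of Equation (3): x has shape
    (stacks, T, H, W) -- one spatio-temporal stack per spectral band --
    and w has shape (stacks, K, M, M). Returns one 3D feature map of
    shape (T-K+1, H-M+1, W-M+1), summed over all stacks ('valid' mode,
    activation omitted for clarity)."""
    S, T, H, W = x.shape
    S2, K, M, _ = w.shape
    assert S == S2, "kernel must cover all band stacks"
    out = np.zeros((T - K + 1, H - M + 1, W - M + 1))
    for e in range(out.shape[0]):          # temporal position
        for c in range(out.shape[1]):      # row
            for d in range(out.shape[2]):  # column
                out[e, c, d] = np.sum(
                    x[:, e:e + K, c:c + M, d:d + M] * w) + b
    return out

# Toy sample: 4 bands, 14 temporal images, 8x8 patch (sizes assumed
# for illustration), with a 3x3x3 kernel as tuned in this study.
x = np.random.rand(4, 14, 8, 8)
w = np.random.rand(4, 3, 3, 3)
y = conv3d_valid(x, w)
print(y.shape)  # (12, 6, 6): the temporal dimension survives convolution
```

In a real framework the loop would be replaced by an optimized 3D convolution primitive; the point here is only that the output remains a 3D tensor, so temporal structure is preserved rather than collapsed as in Figure 1a.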

3D CNN Structure for Spatio-Temporal Images
We take the widely used neural network structure developed by Oxford's Visual Geometry Group (VGGNet) [35] as a template to train a deep convolutional neural network in which all 2D convolution operations are replaced by 3D convolutions. A small kernel (specifically, 3 × 3 in image space) is used to represent multi-level features, while the length of the temporal dimension can be set to 3, as in [30]; the kernel-size parameter is nevertheless also fine-tuned in the experiments. The layer number and the kernel number of each layer typically depend on the capacity of the inputs and the complexity of the problem. We chose a five-layer 3D convolutional network, with 32 or 64 kernels per layer, for all the experiments in this study, as shown in Figure 3. The input consists of m independent channels (typically four: blue, green, red and near infrared), each of size n × w × h in the temporal, width, and height directions, respectively. The first three layers consist of common convolution-pooling concatenations with 32, 32, and 64 neurons to be learned sequentially. To best preserve the temporal information, i.e., the complete crop growth cycle, uncompressed until the first fully connected layer (FC for short), a 1 × 2 × 2 average pooling strategy is applied. The last FC layer, with dimension c × 1 where c is the label number, is then fed to the loss function, here Maxout [36], to give each sample a score for each label. We set a fixed learning rate of 0.001 but use different random sampling strategies and batch sizes according to the sample capacity of the different tests.
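The effect of the 1 × 2 × 2 pooling can be checked with simple shape arithmetic. The sketch below (our own helper, assuming 'same'-padded convolutions so that only pooling shrinks the maps) tracks the feature-map size through the conv-pool blocks:

```python
def feature_shape(n, w, h, blocks=3, pool=(1, 2, 2)):
    """Track the feature-map size through the conv-pool blocks.
    Convolutions are assumed 'same'-padded, so only the 1x2x2
    pooling shrinks the spatial size, while the temporal length n
    (the crop growth cycle) is preserved untouched."""
    t, y, x = n, w, h
    for _ in range(blocks):
        t //= pool[0]
        y //= pool[1]
        x //= pool[2]
    return t, y, x

# 14 temporal images, 8x8 patch (illustrative sizes): after three
# blocks the image plane is 1x1 but all 14 time steps remain.
print(feature_shape(14, 8, 8))  # (14, 1, 1)
```

This is why the full growth cycle reaches the first FC layer uncompressed: pooling never touches the temporal axis.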

In addition to the 3D CNN, a conventional 2D CNN is also applied. Its structure, shown in Figure 4, is similar to that of the 3D CNN, which facilitates the comparison between the 2D and 3D CNN.
The network structures in Figures 3 and 4 are used both for a classic training-test process on discrete samples and for pixelwise classification of the whole image. In both cases, the input patch is abstracted into multi-level representations to classify the central point of the patch. In Sections 3.2 and 4 we will discuss the optimal patch size and the mixed-pixel problem.

An Active Learning Framework of CNN
Different from classification applications using abundant internet pictures, samples of agricultural crop types can be much more difficult to collect. A person can clearly discriminate a cat from a dog in pictures; however, one needs specific knowledge, or even field validation, to recognize rice from corn or wheat in satellite images. How to efficiently increase the number of high-quality samples is therefore important. An active learning framework is introduced into our 3D CNN based method to improve the classification accuracy to a required level. Active learning consists of three steps, applied iteratively. First, train a model (here, the CNN) with the currently available sample set. Second, label the unlabeled samples using the model, and detect and select the most salient ones, i.e., those most likely to improve the current model, for manual checking. Third, add the manually labelled samples to the current training set and iterate again. Table 1 specifies this flow for the CNN model. Equation (4) is the designed rule for picking out those salient samples.
In Equation (4), A and B are the largest and second-largest scores over the crop categories for an arbitrary sample, as given by the current model. The left inequality expresses that if the ratio of B to A is bigger than a given threshold T1 (e.g., 0.8), then the classifier cannot clearly distinguish the two candidate labels. The right inequality indicates that all scores are smaller than a given T2, i.e., no good candidate label exists. Half of the samples from each criterion are then picked out for checking according to their score ranking. Other criteria can also be used, for example the conflict between SVM and CNN, i.e., a sample being assigned different categories by the SVM and the CNN.
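The two criteria can be expressed in a few lines. The sketch below implements the selection rule as described; the value T2 = 0.5 is an illustrative assumption (the paper only gives T1 = 0.8 as an example), and the function name is ours:

```python
import numpy as np

def select_uncertain(scores, T1=0.8, T2=0.5):
    """Flag samples matching the Equation (4) criteria: either the ratio
    of the second-largest score B to the largest score A exceeds T1
    (two labels compete), or the largest score itself falls below T2
    (no good candidate label). scores has one row per sample."""
    s = np.sort(scores, axis=1)          # ascending per sample
    A, B = s[:, -1], s[:, -2]
    ambiguous = (B / A) > T1
    low_confidence = A < T2
    return ambiguous | low_confidence

scores = np.array([[0.50, 0.45, 0.05],   # ambiguous: 0.45/0.50 = 0.9 > 0.8
                   [0.90, 0.05, 0.05],   # confident: not selected
                   [0.40, 0.30, 0.30]])  # low maximum: 0.40 < 0.5
print(select_uncertain(scores))  # [ True False  True]
```

The flagged samples would then be ranked by score and the top half from each criterion sent for manual checking, as described above.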


Handcrafted shape files of the GF2 data are used as a reference for pixelwise classification. For discrete-sample classification, 400 training samples and 2000 testing samples with four labels (corn, tree, rice, and sorghum) were selected from the 2015 shape file; 50 training samples and 1500 testing samples with five labels (corn, rice, soybean, tree, and nursery land) were selected from the 2016 shape file.
As for the GF1 data, only discrete point samples are available, with nine labels: soybean, wheat, rice, corn, grass, forest, building, water, and road, each labelled with a different color in Figure 6. 1055 pixels are selected for training and 180 for testing. It is expected that the multi-temporal information can well discriminate crops with distinctive growth trends both from static objects such as buildings and from other crops.

Parameter Tuning of 3D CNN
We tune the input size, convolutional kernel size, layer number, pooling strategy, and their combinations in the 3D CNN. Practical deep learning is essentially a stochastic method that utilizes minibatch and stochastic gradient descent optimization. We therefore train each parameter setting 10 times and average the training results (Table 2). The model whose performance is nearest to the average training accuracy is selected for testing. All of the samples for the 2D CNN consist of m × n patches of w × h pixels, where m and n are the spectral and temporal numbers, and w and h are the width and height, respectively. Samples for the 3D CNN consist of m tensor patches of size n × w × h in the temporal, width, and height directions, respectively. Like m, n is fixed to the number of all temporal images to preserve the complete growth cycle. The other parameters of the CNN are set as follows: the epoch and iteration numbers are both set to 10,000 in GF2 2016, considering that all 50 samples can be trained at once; minibatch optimization is applied in GF1 and GF2 2015 with batch sizes 500 and 400, respectively; the last fully connected layer outputs a c × 1 vector with c the number of classes; and the learning rate is fixed at 0.001.
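The sample layout just described, m tensor patches of size n × w × h cut around each labelled pixel, can be sketched as follows (the helper name and the border-skipping policy are our own simplifications):

```python
import numpy as np

def extract_patch(stack, row, col, w=8, h=8):
    """Cut an m x n x w x h training sample centred on a labelled pixel
    from a (bands, times, H, W) image stack. The 8x8 patch size follows
    the tuning result in Table 2; points too close to the image border
    are skipped for brevity rather than padded."""
    r0, c0 = row - w // 2, col - h // 2
    if r0 < 0 or c0 < 0 or r0 + w > stack.shape[2] or c0 + h > stack.shape[3]:
        return None
    return stack[:, :, r0:r0 + w, c0:c0 + h]

stack = np.zeros((4, 14, 100, 100))      # 4 bands, 14 dates (illustrative)
patch = extract_patch(stack, 50, 50)
print(patch.shape)  # (4, 14, 8, 8): m stacks of n x w x h
```

Each such patch is labelled with the class of its central pixel, matching the classification setup of Figures 3 and 4.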
From Table 2, the 3 × 3 × 3 kernel performs slightly better than kernel sizes of 1 × 3 × 3 and 3 × 5 × 5. This is compatible with the results in video recognition [29], and indicates that 3 × 3 × 3 may be a robust setting for generic spatio-temporal representation. A patch size of 8 × 8 (w × h) is the most suitable for both the GF1 and GF2 data, as a larger patch size may cause the mixed-pixel problem. As for the number of convolutional layers ("3 L" for three layers, for example), three convolutional layers perform slightly better than two-layer and four-layer networks. Four or more layers are not necessary here, because after three 2× poolings the sample size has already been reduced to 1 in image space. For the pooling strategy, "A" and "M" denote average pooling and max pooling, respectively. Average pooling performed 1.3, 1.6, and 2.0 percentage points better than max pooling on the GF2 2015, GF2 2016, and GF1 data, respectively. The training process with the optimal parameters, i.e., those in the last column, is shown in Figure 7. The training converged quickly after several hundred iterations, and the test accuracy improved gradually and slightly over the following thousands of loops.


Comparison to 2D CNN and Empirical Methods
The 2D CNN parameters are tuned in the same way as those of the 3D CNN. The 2D CNN is set to an 8 × 8 patch size, a 3 × 3 kernel size, three convolutional layers, and an average pooling strategy (Figure 4). The loss function is the same as for the 3D CNN, and the only difference between the 2D and 3D CNN is the learned multi-level representations, guaranteeing a pure comparison between spatio-temporal and spatial features. Figure 7 shows the corresponding test accuracies of 94.0%, 97.2%, and 77.2% on the GF2 2015, GF2 2016, and GF1 data, respectively, compared to 94.7%, 98.9%, and 79.4% for the 3D CNN.
Concatenated NDVI [6,8] and original temporal image patches [12,13] are used as inputs for comparison. SVM, KNN, and PCA-KNN [38] are applied to both types of input. The results of all these methods, obtained after optimal parameter tuning as for the CNN, are listed in Figure 8.
According to Figure 8a, with concatenated temporal-image input, there are three notable points. First, the CNN based methods perform better than the conventional methods in all three tests; the latter perform worse both on the GF1 data, where all methods show overall low performance, and on the GF2 data, where all methods perform well. Second, the 3D CNN performs better than the 2D CNN. Table 3 shows the confusion matrices per category. In the GF2 2015 test, the diagonal elements of corn and tree for the 3D CNN are larger than those for the 2D CNN, indicating better selectivity; in the GF2 2016 test, corn is more easily discriminated from rice by the 3D CNN, and the relatively lower performance of the 2D CNN is amplified in the corresponding pixelwise classification tests (see Section 3.4). Since the 2D and 3D CNN share almost the same structure except for the learned representations, it can be inferred that the temporal features extracted by the 3D CNN contribute a 1.2% accuracy improvement over the 2D CNN on the GF2 data, and a 2.2% improvement on the GF1 data.
Third, SVM is the most effective of the conventional methods and performs close to the 2D CNN, while KNN and PCA-KNN both perform poorly in these tests. According to Figure 8b, with NDVI input, the first notable point is that the performance of all the CNN and conventional methods decreases considerably compared to the original data input. In the GF2 2015 test, the overall accuracy of the CNN methods dropped 12% on average and that of the conventional methods dropped 4%, which made the CNN methods worse than the conventional ones. In the GF2 2016 test, the overall accuracy of the CNN methods dropped 5%, while the other methods dropped 4%. This can be explained by the NDVI discarding some information necessary for crop classification; this information matters more for the CNN based methods, which prefer to learn multi-level representations from the original data and incline to overfit when the input carries less information. Considering the poor performance of the NDVI input, we only use the concatenated original data for both the CNN based and conventional methods in the following pixelwise classification experiments.
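For reference, the NDVI aggregation tested above reduces two of the four input bands to a single index per pixel. A minimal sketch (the small epsilon guarding against division by zero is our own addition):

```python
import numpy as np

def ndvi(red, nir, eps=1e-8):
    """Normalized difference vegetation index, the aggregated input
    tested in Figure 8b. Collapsing the spectral bands to one index
    discards spectral detail, which is one plausible cause of the
    accuracy drop observed for the CNN based methods."""
    return (nir - red) / (nir + red + eps)

red = np.array([0.10, 0.30])   # illustrative reflectance values
nir = np.array([0.50, 0.30])
print(ndvi(red, nir))          # high for dense vegetation, ~0 otherwise
```

A time series of such NDVI maps, concatenated over dates, is what the "NDVI input" variants in Figure 8b consume instead of the raw reflectance patches.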

Pixelwise Classification Results and Analysis
The pixelwise classification performances of all the methods are compared using the GF2 2015 and GF2 2016 data (see Figures 9 and 10). The quantitative results are listed in Table 4, where the maximum values in each row (in bold) indicate that in both tests the 3D CNN outperforms the other methods on the two most used indexes, overall accuracy (OA) and Kappa, both computed from the confusion matrix. In the 2015 test, the 3D CNN is only slightly better than the 2D CNN and the other methods. In this case, the spatial information may already be sufficient for label discrimination, with the temporal information contributing less to the classification accuracy.
However, in the 2016 test, the overall accuracy of the 3D CNN exceeds that of the 2D CNN and SVM by about 6 percentage points, and that of KNN and PCA + KNN by 30 points. On Kappa, the 3D CNN exceeds the 2D CNN and SVM by about 10 points, and KNN and PCA + KNN by more than 40 points. Table 5 reveals the main reason why the 3D CNN outperforms the 2D CNN, together with the prior knowledge that in Northeast China a certain variety of upland rice and corn are spectrally similar at almost every stage of the growing season. From a satellite view, it is hard to discriminate between the two classes from a single temporal image, or from simply concatenated temporal images, by the 2D CNN and the other conventional methods. The 3D CNN, however, could extract the subtle differences, i.e., the distinctive temporal features, from the temporal image series; it greatly reduces the uncertainty between rice and corn from 0.12 to 0.02 in the confusion matrix and increases Kappa from 0.820 to 0.924. Corresponding to the quantitative results of Table 5, the top part of Figure 10 provides further visual evidence, where a large area of rice was wrongly labelled as corn by all of the methods except the 3D CNN. It can be tentatively concluded that the 3D CNN functions robustly and accurately in different situations, whereas the other methods rely much more on distinguishable representations of 2D images.
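The two indexes used throughout Table 4 follow directly from the confusion matrix. A minimal sketch (with an illustrative two-class matrix, not data from this study):

```python
import numpy as np

def oa_kappa(cm):
    """Overall accuracy and Cohen's kappa from a confusion matrix
    (rows: reference labels, columns: predicted labels), the two
    indexes reported in Table 4."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    po = np.trace(cm) / n                       # observed agreement (OA)
    pe = (cm.sum(0) * cm.sum(1)).sum() / n**2   # chance agreement
    return po, (po - pe) / (1 - pe)

cm = [[90, 10],   # illustrative confusion matrix
      [ 5, 95]]
oa, kappa = oa_kappa(cm)
print(round(oa, 3), round(kappa, 3))  # 0.925 0.85
```

Kappa discounts the agreement expected by chance, which is why it separates the methods in Table 4 more sharply than OA does.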
Figure 9 also reveals that some sparse trees missed by the handcrafted shape file were rediscovered by the automatic methods. This caused a slight statistical bias; however, the impact is fair for the comparison of all the methods.

Active Learning Strategy for CNN
To efficiently increase the test accuracy up to the given threshold of 90% for the GF1 data, on which all of the methods performed poorly (below 80%), an iterative 3D CNN under the active learning strategy described in Table 1 is utilized. In each iteration, 100 unlabeled samples (about 11 samples per current label) are picked out by rule (4), with dynamic thresholds T1 and T2 chosen so that only the top 11 samples per label are selected for manual checking; these are then moved to the training dataset. The resulting test accuracy is shown in Table 6: for every 100 manual samples added, the test accuracy improves by 2~3%, up to 0.94 with 700 new samples in total. We also test the efficiency of adding random samples (RA). As can be seen in the third row of Table 6, 1500 new training samples have to be added to reach 90% test accuracy, whereas only 400 samples are needed with the active learning strategy. We further test another active learning strategy, in which the prominent samples, those assigned different labels by the 3D CNN and SVM, are selected for checking. The efficiency of this method (CNN + SVM) approaches that of our rule (4), but it requires an additional SVM classification process. This suggests that if a given accuracy cannot be achieved by the current model, a simple active learning strategy can easily be applied to save a large amount of manual labelling work.
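The alternative CNN + SVM criterion mentioned above reduces to selecting the samples on which the two classifiers disagree. A minimal sketch (function name ours):

```python
import numpy as np

def disagreement_samples(cnn_labels, svm_labels):
    """Alternative active-learning criterion: return the indices of the
    unlabeled samples on which the two classifiers (here the 3D CNN and
    the SVM) predict different categories; these would be sent for
    manual checking instead of the rule (4) selections."""
    cnn_labels = np.asarray(cnn_labels)
    svm_labels = np.asarray(svm_labels)
    return np.flatnonzero(cnn_labels != svm_labels)

print(disagreement_samples([0, 1, 2, 1], [0, 2, 2, 0]))  # [1 3]
```

Compared with rule (4), this needs no score thresholds, but it does require training and running the additional SVM in every iteration.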

Discussion
Besides the CNN parameters, the quality of the original samples is also important for remote sensing classification. A widely used preprocessing step is to segment the original images into hyper-pixels to obtain more "pure" samples. Considering that the 8 × 8 window patches used above may contain mixed classes, we test the performance of hyper-pixel inputs extracted from the GF1 data. The whole image is first segmented into hyper-pixels by the method developed by Liu et al. [39], and every 8 × 8 patch sample (shown as a 16 × 16 patch in Figure 11 for better visibility) is then obtained by superposing the corresponding hyper-pixel on an 8 × 8 background template. However, the test accuracy decreased by nearly 2 percentage points compared to that of our original input. One plausible explanation is that the segmentation operation may introduce confusing information, especially shapes, into the classification; the penalty of this exceeds the contribution of pixel purity. We did not test this with the 3D CNN because the hyper-pixels containing the same sample points show different shapes in different periods.
Table 6. Comparison between stochastic and active learning strategies.

Discussion
Besides the CNN parameters, the quality of the original samples is also important for remote sensing classification. A widely used preprocessing step is segmenting the original images into hyper-pixels to obtain "purer" samples. Considering that the 8 × 8 window patches used above may contain mixed classes, we test the performance of hyper-pixel inputs extracted from the GF1 data. The whole image is first segmented into hyper-pixels by the method developed by Liu et al. [39], and every 8 × 8 patch sample (shown as a 16 × 16 patch in Figure 11 for better visibility) is then obtained by superposing the corresponding hyper-pixel on an 8 × 8 background template. However, the test accuracy decreased by nearly 2 percent compared to that of our original input. One plausible explanation is that the segmentation operation introduces confusing information, especially shapes, and this penalty exceeds the contribution of pixel purity. We did not test this on the 3D CNN because hyper-pixels containing the same sample points have different shapes in different periods.

Another interesting question is whether spectral features, extracted from the combination of the red, green, blue, and infrared bands, also perform well in crop classification. The current 3D convolutional network can easily be modified to check this by first stacking the spectral images of a certain time, according to Figure 2. We applied the modified 3D tensor inputs to the GF1 dataset while keeping all of the CNN parameters the same. As Table 7 shows, the spatio-spectral 3D CNN obtained a much worse result than the spatio-temporal 3D CNN, and a slightly worse result than the 2D CNN. The latter may be caused by the lack of more spectral bands, or by the inevitable noise around the "standard" spectral curve of a given material, which is common in multi-spectral images.
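The superposition step described above (keeping only the hyper-pixel's own pixels inside the window, over a uniform background) can be sketched as follows; the function name, background value, and array layout are illustrative assumptions, not the authors' code.

```python
import numpy as np

def superpixel_patch(image, seg, center, size=8, fill=0):
    """Cut a size x size patch around `center`, keeping only pixels that
    belong to the same segment (hyper-pixel) as the center pixel;
    everything else is set to the background value `fill`.

    image: (H, W, C) array; seg: (H, W) integer segment labels from a
    segmentation method such as that of Liu et al. [39].
    """
    r, c = center
    h = size // 2
    patch = image[r - h:r + h, c - h:c + h].copy()
    mask = seg[r - h:r + h, c - h:c + h] == seg[r, c]  # same hyper-pixel as center
    patch[~mask] = fill                                # blank out other segments
    return patch
```

As the text notes, the masked-out region imposes the segment's shape on every sample, which is one plausible source of the observed accuracy drop.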
However, the processing of spatio-spectral information is not the focus of this study; the reader is referred to other recent articles [32]. Third, we discuss the performance of the 2D CNN, 3D CNN, and SVM in pixel-wise classification, which is the most common case in remote sensing applications, as opposed to the training-test procedure on discrete samples [8,12]. The overall accuracy of the 3D CNN exceeds that of the 2D CNN and SVM by about 3% on average, and by about 6% on the most challenging GF2 2016 data, where all of the other methods failed to discriminate corn from rice. The 3D CNN outperforms the 2D CNN purely through its learned spatio-temporal representations, since the two share the same parameters and classifier. The performance of the SVM approaches that of the 2D CNN, which indicates that, if we ignore the classifier difference, concatenated temporal images are a good representation, matching the state-of-the-art multi-level representations learned by a conventional CNN. This also partly explains why concatenated temporal images have been widely used in vegetation classification [12,13]. However, representing temporal features only by concatenation is oversimplified, and is not as good as the spatio-temporal representations extracted by the 3D CNN.
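Pixel-wise classification with a patch-based model is typically done by sliding a window over the padded image and labelling each center pixel; the following is a minimal sketch under that assumption, with `predict` standing in for any trained classifier (3D CNN, 2D CNN, or SVM) and the padding mode chosen for illustration.

```python
import numpy as np

def classify_image(image, predict, size=8):
    """Pixel-wise classification: slide a size x size window over the
    edge-padded image and label each center pixel with the patch
    prediction. `predict` maps a (size, size, C) patch to a class id.
    """
    h = size // 2
    hgt, wid = image.shape[:2]
    padded = np.pad(image, ((h, h), (h, h), (0, 0)), mode="edge")
    labels = np.empty((hgt, wid), dtype=np.int64)
    for r in range(hgt):
        for c in range(wid):
            labels[r, c] = predict(padded[r:r + size, c:c + size])
    return labels
```

In practice the per-pixel loop would be batched for a CNN, but the labelling scheme is the same.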
Finally, in this study only temporal images and NDVI are used as feature inputs. We obtained a result similar to that of Guerschman et al. [13], which also shows the advantage of using the whole spectral profile instead of the NDVI alone. Another method [8] utilized a feature combination of NDVI, humidity, and luminance. Luminance is a spatial feature that can be extracted from the original images, while humidity can be hard to retrieve accurately from spectral sensors over similar vegetation covers. Testing all of the empirical features used in earlier work is unrealistic; instead, we are interested in whether additional empirical features could improve the selectivity of CNN-based representations and further improve the performance of CNN methods. We added each NDVI layer to the corresponding original temporal image to train the 2D CNN. Table 8 shows that the two inputs obtained similar performance, which implies that a CNN-based method can learn near-optimal representations on its own, so that including extra empirical features is unnecessary.
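Appending the NDVI layer to each temporal image amounts to computing NDVI = (NIR − Red) / (NIR + Red) per pixel and stacking it as an extra channel. The band indices below are assumptions (GF1 WFV images order their four bands as blue, green, red, NIR); the small epsilon guards against division by zero.

```python
import numpy as np

def stack_ndvi(image, red_band=2, nir_band=3, eps=1e-8):
    """Append an NDVI channel to a multi-spectral image.

    image: (H, W, C) reflectance array; band indices are illustrative
    assumptions for a blue/green/red/NIR band ordering.
    NDVI = (NIR - Red) / (NIR + Red).
    """
    red = image[..., red_band].astype(np.float64)
    nir = image[..., nir_band].astype(np.float64)
    ndvi = (nir - red) / (nir + red + eps)
    return np.concatenate([image, ndvi[..., None]], axis=-1)
```

Applying this to every temporal image before training reproduces the augmented 2D CNN input compared in Table 8.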

Conclusions
Deep learning has been applied extensively in remote sensing applications. In this study, a novel 3D CNN based method is introduced for crop classification using multi-temporal remote sensing images. The results show that a 3D CNN with a proper structure performs better than either the conventional methods or the 2D CNN. According to the comparison of the 2D and 3D CNNs, 3D convolution is a better feature extractor for spatio-temporal remote sensing data: a common 2D convolution loses the temporal information due to its own mathematical restriction, while a conventional method may rely on an oversimplified empirical rule to represent temporal features. In addition, we introduced an active learning strategy to the CNN to improve the efficiency of manual labelling. It is very useful to embed this strategy in the sample preparation or model refinement stages when the required accuracy has not been achieved.
Furthermore, demonstrating the feasibility and efficiency of spatio-temporal feature representation with a 3D CNN could benefit not only crop classification, but also other modelling processes related to changes, trends, or dynamics in remote sensing data; for example, a high-dimensional deep learning framework could be embedded into studies of glacier melting, climate change, and seasonal forest dynamics.