Effective training of deep convolutional neural networks for hyperspectral image classification through artificial labeling

Hyperspectral imaging is a rich source of data, allowing for a multitude of effective applications. However, such imaging remains challenging because of the large data dimensionality and, typically, a small pool of available training examples. While deep learning approaches have been shown to be successful in providing effective classification solutions, especially for high-dimensional problems, they unfortunately work best when many labelled examples are available. To alleviate the second requirement for a particular dataset, the transfer learning approach can be used: first the network is pre-trained on some dataset with a large number of training labels available, then the actual dataset is used to fine-tune the network. This strategy is not straightforward to apply to hyperspectral images, as it is often the case that only one particular image of some type or characteristic is available. In this paper, we propose and investigate a simple and effective strategy of transfer learning that uses an unsupervised pre-training step without label information. This approach can be applied to many hyperspectral classification problems. The performed experiments show that it is very effective at improving the classification accuracy without being restricted to a particular image type or neural network architecture. The experiments were carried out on several deep neural network architectures and various sizes of labelled training sets. The greatest improvement in overall accuracy on the Indian Pines and Pavia University datasets is over 21 and 13 percentage points, respectively. An additional advantage of the proposed approach is the unsupervised nature of the pre-training step, which can be done immediately after image acquisition, without the need for the potentially costly expert's time.


Introduction
Classification of hyperspectral images (HSI) has many potential applications, e.g. land cover segmentation [1], mineral identification [2], or anomaly detection [3]. The classification algorithms used include both general models, e.g. the SVM [4], and dedicated approaches, taking into account spectral properties or spatial class distribution [5]. Recently there have been attempts to use Deep Learning Neural Networks (DLNN) for HSI classification. The motivation is that such methods have gained attention after achieving state-of-the-art results in natural image processing tasks [6]. Their unique ability to process an image using a hierarchical composition of simple features learned during training makes them a powerful tool in areas where manipulation of high-dimensional data is needed.
While DLNN can achieve very good accuracy scores, they have the drawback of requiring a large amount of training data for estimation of model parameters. Such data is not always available, as it is common to have a single HSI with just a handful of training labels available. To bridge the gap between this realistic scenario and DLNN requirements, we propose an approach that trains the DLNN in two stages, with the first (pre-training) stage using artificial labels. In the remainder of this section, we discuss the relevant related work, introduce the motivation of our approach, and state the hypothesis that is the base of our method.
A number of DLNN architectures have been proposed, inspired by mathematical derivations and/or neuroscience studies. The Convolutional Neural Networks (CNN) [7] are a special case of deep neural networks which were originally developed to process images, but are also used for other types of data like audio. They combine traditional neural networks with biologically inspired structure into a very effective learning algorithm. They scan multidimensional input piece by piece with a convolutional window, which is a set of neurons with common weights. Convolution window processes local dependencies (features) in the input data. The output corresponding to one convolutional window is called a feature map and it can be interpreted as a map of activity of the given feature on the whole input. The CNN remain one of the most popular architectures for DLNN classification in use today.
Other approaches include the generative architectures, e.g. the Restricted Boltzmann Machine (RBM) [8,9], Autoencoders (AE) [10] or Deep Belief Network (DBN) [11,12]. Yet another popular architecture is the Recurrent Neural Network (RNN) which, through directed cycles between units, has the potential of representing the state of processed sequence. They are applicable e.g. for time series prediction or outlier detection. The most popular types of RNN are Long Short-Term Memory (LSTM) networks [13] and Gated Recurrent Units (GRUs) [14]. They improve the original RNN architecture by dealing with exploding and vanishing gradient problem.
For classification of HSI data, the CNN is the most popular architecture. In [15] a simple CNN architecture is adapted to HSI classification; the lack of training labels is mitigated by adding geometric transformations to the available training data points. In [16] the authors use three kinds of convolutional windows: two of them are 3D convolutions which analyse spatial and spectral dependencies in the input picture, while the third is a 1D kernel. Next, the feature maps from these three types of convolutions are stacked one after the other and form the joint output of this first part of the network. The following layers consist only of one-dimensional convolutional kernels and residual connections. The authors of [17] introduce a parallel stream of processing with an original approach for spatial enhancement of hyperspectral data. The authors of [18] design a deep network that reduces the effect of the Hughes phenomenon (curse of dimensionality) and use an additional unlabelled sample pool to improve performance. In [19] the authors propose an alternative architecture called RPNet, based on prefixed convolutional kernels. It combines shallow and deep features for classification. Another architecture (MugNet) is proposed in [20] with a focus on simplicity of processing for classification of hyperspectral data with few training samples and a reduced number of hyperparameters. Yet another architecture is used in [21], where a multi-branch fusion network is introduced that merges multiple branches on an ordinary CNN. An additional L2 regularization step is introduced to improve the generalization ability with a limited number of training samples. The work [22] proposes a strategy based on the fusion of multiple convolutional layers. Two distinct networks, composed of similar modules but organized differently, are examined.
Other architectures are also used. For example, in [23] the authors utilize the sequential nature of hyperspectral pixels and use some variations of recurrent neural networks: Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) networks. Moreover, in [24] one-dimensional convolutional layers followed by LSTM units were used. Chen et al. [25] use artificial neural networks for feature extraction. They utilize stacked autoencoders (SAE) for feature extraction from pixels, and PCA for reduction of the spectral dimensionality of the training segments taken from the picture. Next, logistic regression is performed on this spectral (SAE) and spatial (PCA) extracted information. Another approach [26] uses SAE for an application study: detection of a rice-eating insect. RNN architectures are also employed, as they are suitable for processing the spectral vector data. The work [27] applies sequential spectral processing of hyperspectral data, using an RNN supported by a guided filter. In [28] the authors use multiscale hierarchical recurrent neural networks (MHRNNs) to learn the spatial dependency of nonadjacent image patches in the two-dimensional (2D) spatial domain. Another idea for analysing HSI is the spatial-spectral method, in which the network takes information not only from spectral bands but also from spatial dependencies of the image [16].
A significant problem in practical hyperspectral classification is the small number of training samples. It is related to the difficulty of obtaining verified labels [1], as often each pixel must be individually evaluated before labelling. Therefore, a reference hyperspectral classification experiment may assume as few as 1% of available samples per class [2]. A number of approaches have been exploited to deal with this difficulty, including combining spatial and spectral features [29], additional training sample generation [30], extending the classification algorithm with segmentation [31], or employing Active Learning [32].
For the DLNN classification, the lack of high volume of training data is a serious complication, as they typically require a lot of data to achieve high efficiency. Optimal use of DLNN in HSI classification would require learning them with just a few labelled samples. This may be obtained by searching for well-tailored architecture for specific task [15], however such approach requires relatively big validation set to obtain meaningful results. The other approach is to expand the available training set. It may be achieved either by artificially augmenting training set or using different dataset as a source for pre-training [33]. Another approach is to add a regularization step to improve the generalization ability with limited number of training samples [21]. A simplification of the network architecture for classification with few training samples is employed in the MugNet network [20]. Finally, where possible, the transfer learning approach is used, e.g. [34].
The transfer learning [35] uses training samples from two domains, which share common characteristics. A network is first pre-trained on the first domain, which has plentiful supply of training samples but does not solve the problem at hand. Then, the training is updated with the second domain, which adapts the weights to the actual problem.
Transfer learning is simple to apply in the case of convolutional neural networks (CNNs). In [36], authors compared different versions of transfer learning for CNNs in the case of natural images classification. They studied its effects depending on the number of transferred layers and whether they were fine tuned or not as well as depending on the differences between the considered datasets. In [37], authors used transfer learning on CNNs to recognize emotions from the pictures of faces. Other uses include evaluating the level of poverty in a region given its remote sensing images [38] and computer-aided detection using CT scans [39].
There have been applications of transfer learning in the general remote sensing (not-hyperspectral) images. In [40] deep learned features are transferred for effective target detection; negative bootstrapping is used for improving the convergence of the detector. A similar approach is applied in [41] where RNN network trained on multispectral city images is used to derive features for studying urban dynamics across seasonal, spatial and annual variance. The authors of [42] study the performance of transfer learning in two remote sensing scene classifications. The results show that features generalize well to high resolution remote sensing images. As the work [43] shows, transfer learning can be applied in remote sensing using RNN architectures also.
Recently, transfer learning has also been applied to HSI data. In [34], the authors took CNNs originally used for classifying well-known remote sensing hyperspectral images and applied them to images acquired from field-based platforms and concerning a different domain. The authors of [44] use an intermediate step of supervised similarity learning for anomaly detection in an unlabelled hyperspectral image. A different approach to transfer learning is proposed in [45], which explores the high-level feature correlation of two HSI. A new training principle processes both images simultaneously to estimate a common feature space for them. Yet another approach is proposed in [46], where HSI superresolution is achieved with the support of a high-resolution natural image. This natural image is used as a training reference, which is later adapted to the HSI domain. In [47], an iterative process combines training with updating the currently used training label set. Two specialized architectures (for spatial and spectral processing) are used. The training iteratively extends the current label set, starting from the initial expert's labels.
The above approaches do not apply to the arguably most popular practical scenario, where only a single HSI with a handful of labels is available. Moreover, getting the training labels often requires additional resources (e.g. expert consultation and/or a site visit). It is thus desirable to have unsupervised methods for realization of the pre-training step. The authors of [48] use outlier detection and segmentation to provide candidates for training of a target detector in HSI. This information is used to construct a subspace for target detection by transfer learning theory. This shows the potential of using an unsupervised approach, although limited to separation of target/anomaly points from the background. In the work of [33], a separate clustering step is used for generation of pseudo-labels, using a Dirichlet process mixture model. The network is trained on the pseudo-labels, then all but the last layer are extracted, and the final network is trained on the originally provided training labels. While this scheme is shown to be effective in the presented results, it relies on a complex non-neural preprocessing and on tailoring the DLNN configuration to each dataset separately. Also, the effect of the size of label areas and the effects on different architectures are not investigated. We show that similar gains can be made with a simpler preprocessing, independent of the DLNN architecture chosen. The authors of [49] propose to use sparse coding to estimate high-level features from unlabelled data from different sources. This approach does not require training data, but is tailored to the case where multiple inputs are available, preferably with diverse contents.
To close the gap between data-inefficient deep learning models and practical applications of HSI, we propose a method which takes advantage of the abundant unlabelled data points present in HSI. Precisely, we state a hypothesis: spatial similarity of unlabelled data points can be utilized to gain accuracy in hyperspectral classification. To corroborate our hypothesis, we construct a simple clustering method that assigns an artificial label to each pixel of the image based on its spatial location. This artificial dataset is used to pre-train a deep learning classifier. Next, the model is fine-tuned with the original dataset. Through a series of experiments we show the superiority of the proposed approach over the standard learning procedure. Our approach is motivated by two known phenomena: the cluster assumption [50] and the regularization effect of noise in classes [51,52,53]. We note that many remote sensing images share common properties, most notably the 'cluster assumption': pixels that are close to one another or form a distinct cluster or group frequently share the class label. Additionally, due to the simplistic form of our clustering method, we purposefully introduce noise in the labels used during the pre-training phase; however, as shown in [51], this label noise has little to no effect on final accuracy, as long as the number of properly labelled examples scales proportionally, which is our case.

Materials & Methods
Our method is to be applied in the following case: 1. Classification of pixels from a remote sensing hyperspectral image; 2. Neural networks used as a classifier; 3. Few training labels available.
In such situation, we propose to augment the training with a pre-training step that uses artificial labels, which are independent of the training labels. Inclusion of this pre-training step can be viewed as a modification of a transfer learning approach. Conventional transfer learning in this case would use a related dataset (source domain) with abundance of labels to pre-train, then the current dataset (target domain) to fine-tune. In our case, the source domain consists of every point in the hyperspectral image, while the target domain is composed of only the labelled samples. In the remainder of this Section we discuss: the spatial structure of hyperspectral images and the characteristics of neural network that make this approach feasible, and the details of its application. We also describe the experiments used to test the proposed approach.

Spatial structure of hyperspectral images
It is well known that remote sensing hyperspectral images contain spatial structure that can be exploited to improve classification scores when only a few training samples are available [54,31,5,55,30]. A segmentation can be applied to propose candidate pixels for labelling with high confidence [54] or to identify connected components for label assignment [31]. Class training samples can be extended through moderated region growing [5] or spatial filtering combined with spatial-spectral Label Propagation [55]. Finally, disagreement between spatial and spectral classifiers can be used to propose new samples [30]. A qualitative investigation of this phenomenon shows that hyperspectral pixels close to one another, whether spatially or spectrally, are likely to have the same class label, thus fulfilling the 'cluster assumption' [50]. This effect often leads to a blob-like structure of a hyperspectral dataset, observed in many hyperspectral classification problems (e.g. land cover labelling in remote sensing, paint identification in heritage science, scene analysis in forensics). A single class with samples in different parts of an image can be made of a number of blobs, which differ from each other because of, e.g., non-uniformity in class structure (e.g. the same class can contain differing crop types), spectral variations (e.g. the same crop in two areas can have differing properties due to sunlight exposure or soil type) or acquisition conditions (e.g. level of lighting, shadows).

Emergence of data-dependent filters in neural network training
During training, subsequent layers of a deep neural network form a representation of a local input data structure [56]. Given a data source, this representation, especially on lower layers, can be remarkably similar across different datasets. For example, in the problem of natural image classification the learned kernels resemble a Gabor filter bank [6,57], independent of the class set. This form of a filter can be shown to arise independently when independent components [58] or an effective sparse code [59] for natural images is estimated. Another case where data-dependent filters emerge is the pretext task approach, e.g. [60,61], where the network first learns to predict the input sequence without class labels, which are introduced at a fine-tuning stage to obtain the final classification model. Apparently, deep neural networks are able, at least in part, to extract an efficient class-independent data representation. This phenomenon has not been studied for hyperspectral images; however, it can be argued that a similar class-independent but data-dependent representation is being learned in training for hyperspectral image classification.

Methods used for proposed artificial labelling approach
Our method for creating artificial labels for the pre-training step is a simple segmentation algorithm which assumes the local homogeneity of samples' spectral characteristics. It works by dividing the considered image into k rectangles, where each of these rectangles has its own label. For an image of height h and width w, we divide its height into m roughly equal parts and its width into n roughly equal parts, so that k = m · n. We then get k rectangles, where each one's height equals approximately h/m, while its width equals approximately w/n. Each of these rectangles defines a different artificial class with a different label. A schematic is presented in Figure 1.
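The grid division described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the experiments: the function name `grid_labels` and the integer-arithmetic rule for placing the cell boundaries are assumptions made here, chosen so that the cells stay roughly equal when h or w is not divisible by m or n.

```python
import numpy as np

def grid_labels(h, w, m, n):
    """Return an (h, w) integer array of artificial labels for an m x n grid.

    Hypothetical helper: the boundary rule (integer division) is an assumption.
    """
    # Index of the grid row / grid column that each pixel row / column falls into.
    row_cell = (np.arange(h) * m) // h      # values in 0 .. m-1
    col_cell = (np.arange(w) * n) // w      # values in 0 .. n-1
    # Every (row_cell, col_cell) pair receives its own class index in 0 .. m*n-1.
    return row_cell[:, None] * n + col_cell[None, :]

# Example: a 145 x 145 image divided into a 5 x 5 grid, i.e. k = 25 artificial classes.
labels = grid_labels(145, 145, 5, 5)
print(labels.shape, np.unique(labels).size)
```

Note that no ground truth enters this computation; the labels depend only on the pixel position, so they can be generated as soon as the image dimensions are known.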
The function of artificial labels is for the network to learn class-independent blob patterns present in the data. This focuses the network training in the fine tuning on the actual training labels, with the network 'oriented' towards the features of the current image. It can also be of advantage in situations when a class is composed of multiple blobs, and not all of them have samples in the training set. In that case sufficiently correct labelling is unlikely to be obtained [62] with just the training samples, but the proposed grid structure forces the network to estimate features for the whole image. An additional advantage of this approach is to shift the potentially time consuming pre-training from the expert labelling moment to the acquisition moment. In other words, network training does not need to be held back until the expert's labels are available, but can commence right after the image is recorded.

Selected network architectures
In our experiments three architectures were tested, based on [16,15,63]. All three share a common approach of exploiting the local homogeneity of hyperspectral images; however, each one has its unique strengths and weaknesses, making them an interesting testbed for the universality of the proposed method. The first architecture [16] features a relatively high number of convolutional layers, which might be helpful in a transfer learning application. The second architecture [15] is, to the best of the authors' knowledge, one of the best networks that are trained on a limited number of samples per class. However, due to its constrained capacity, it may not benefit as much from the pre-training phase. The last of the considered convolutional neural networks [63] is conceptually the simplest of the three, which allows us to test our approach using a more conventional convolutional architecture.

Experiments
This subsection describes the experiments evaluating the proposed approach. We investigate the performance of the artificial label pre-training in the following four experiments: 1. Experiment 1 evaluates the accuracy improvement achieved by using the method.
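2. Experiment 2 investigates the variability introduced by the size and shape of the patches used to define artificial classes, using one of the datasets from Experiment 1.
3. Experiment 3 tests, on a specially designed hyperspectral test image, the hypothesis that a division into more numerous patches produces a better pre-training set than a division into less numerous ones.
4. Experiment 4 examines the claim about the emergence of data-dependent representations during neural network training using the proposed artificial labelling scheme.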
In the following subsections, the detailed descriptions of the conducted experiments are given, while in the Section 3 we present the results of the experiments.

Experiment 1
In this experiment, the proposed approach is evaluated using different hyperspectral images and neural network architectures to prove its robustness. For the experiment, we have used two well-known hyperspectral datasets: Indian Pines and Pavia University.
The Indian Pines dataset was collected by the AVIRIS sensor over the Northwest Indiana area. The image consists of 145 × 145 pixels. Each pixel has 220 spectral bands in the wavelength range from $0.4$ to $2.5 \times 10^{-6}$ m. Channels affected by noise and/or water absorption were removed (i.e. ...).
The Pavia University dataset was collected by the ROSIS sensor over the urban area of the University of Pavia in Italy. This image consists of 610 × 340 pixels. It has 115 spectral bands in the wavelength range from $0.43$ to $0.86 \times 10^{-6}$ m. The noisiest 12 bands were removed, and the remaining 103 were utilized in the experiments. Ground truth includes 9 classes, corresponding mostly to different building materials.
The two datasets were subjected to a feature transformation. For a given dataset, the mean $m_b$ of each hyperspectral band $b$ was calculated. Then, for each pixel $x$ and band $b$, the corresponding mean was subtracted, $x(b) := x(b) - m_b$.
For this experiment, all three of the previously introduced neural network architectures were used. As discussed previously, the training was divided into pre-training and fine-tuning stages. In pre-training, the data was labelled by assigning an artificial class to each block within a grid of dimensions 5 × 5. No ground truth data was used at this stage. In the fine-tuning stage, a selected number of ground truth labels was used. The number of training samples from each class was set at n = 5, 15, 50. This allowed us to observe the performance both in typical hyperspectral scenarios (small number of classes used) and deep network scenarios (larger number of samples per class available). Because the classification accuracy depends on the training set used in fine-tuning, each experiment was repeated n = 15 times for error reporting. The performance is reported as Overall Accuracy (OA) after fine-tuning. Additionally, Average Accuracy (AA) and the κ coefficient were inspected and the improvements verified with statistical tests.
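Drawing a fixed number of labelled pixels per class for the fine-tuning stage could look roughly as follows. This is a hedged sketch rather than the procedure actually used: the helper name `sample_per_class`, the convention that label 0 marks unlabelled pixels, and the toy ground-truth map are all assumptions introduced for illustration.

```python
import numpy as np

def sample_per_class(gt, n_per_class, rng, background=0):
    """Pick n_per_class labelled pixel indices (into gt.ravel()) from each class.

    Hypothetical helper; `background` marks unlabelled pixels (an assumption).
    """
    flat = gt.ravel()
    chosen = []
    for c in np.unique(flat):
        if c == background:
            continue
        idx = np.flatnonzero(flat == c)
        # Sample without replacement; small classes contribute all their pixels.
        chosen.append(rng.choice(idx, size=min(n_per_class, idx.size), replace=False))
    return np.concatenate(chosen)

rng = np.random.default_rng(0)
gt = rng.integers(0, 4, size=(20, 20))   # toy map: 0 = unlabelled, 1..3 = classes
train_idx = sample_per_class(gt, 5, rng)
print(train_idx.size)
```

Repeating the draw with a different seed for each run gives the independent training sets over which the mean and standard deviation of the accuracy can be reported.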

Experiment 2
The second experiment investigates the variability introduced by the size and shape of the patches used in artificial labelling.
For this experiment only the Indian Pines image introduced in Experiment 1 was used, as it is the more challenging of the two introduced datasets. As was the case with the previous experiment, the mean was subtracted. The network investigated is the architecture based on [16], chosen because it has the most potential to be affected by the transfer learning process.
In this experiment, first the grid size was investigated. The grid dimensions vary from 2 × 2, which gives 4 artificial classes, up to 72 × 72, which gives 5184 artificial classes. Furthermore, another way of creating artificial labels is considered: the image is divided into a given number of vertical stripes. A visualisation of different artificial labellings is presented in Figure 2.
The investigated patches were created by dividing the horizontal and vertical sides of an image into w = 2, 3, 5, 7, 9, 15, 19, 25, 29, 36, 39, 48, 72 equal parts. The vertical stripes were created by dividing the horizontal side of an image into s = 2, 5, 9, 16, 25, 36, 49, 81 equal parts (so in the case of s = 2, there are only 2 classes, located to the right and left of a single vertical line). The vertical stripes were included to observe whether the pixel distance affects the performance: for patches, all the pixels share a similar neighbourhood; for stripes, the top and bottom pixels have a notable spatial separation and, arguably, distant pixels should not be marked with the same class label without prior knowledge of the spatial class distribution. Note that in the case of patches made by dividing each side of an image into w = 29, 36, 39, 48, 72 equal parts, the size of a square patch is no larger than the size of the 5 × 5 processing window in the tested architecture. That means no sample fed to a network during the pre-training phase has a coherent class representation (i.e. a single class present in the window). This experiment was performed with 5 training samples per class and 50 experiment runs for each grid density and number of stripes.
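The relation between the number of parts, the number of artificial classes and the approximate patch size can be checked with a short calculation. The sketch below assumes the 145 × 145 Indian Pines image and the 5 × 5 processing window mentioned in the text; the variable names are illustrative only, and the patch side is the idealised value 145 / w rather than the exact integer size of any particular patch.

```python
side = 145      # Indian Pines is 145 x 145 pixels
window = 5      # side of the processing window in the tested architecture

patch_parts = [2, 3, 5, 7, 9, 15, 19, 25, 29, 36, 39, 48, 72]
stripe_parts = [2, 5, 9, 16, 25, 36, 49, 81]

# Patches: w parts per side give w * w artificial classes of side about 145 / w.
patch_table = [(w, w * w, side / w) for w in patch_parts]
# Stripes: s vertical stripes give s artificial classes of width about 145 / s.
stripe_table = [(s, s, side / s) for s in stripe_parts]

# Grid densities whose patch side does not exceed the processing window.
at_or_below_window = [w for w, _, patch_side in patch_table if patch_side <= window]

for w, n_classes, patch_side in patch_table:
    print(f"w={w:2d}  classes={n_classes:4d}  patch side ~ {patch_side:.2f} px")
print("patch side <= window for w in", at_or_below_window)
```

For w = 25 the idealised patch side is 5.8 pixels, still above the window side, while from w = 29 onwards it is 5 pixels or less.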

Experiment 3
In this experiment, we test the hypothesis that a division into more numerous patches produces a better pre-training set than a division into less numerous ones. We investigate this using a specially designed hyperspectral test image.
In this experiment, we use an image of paints from a museum's collection. This dataset [64] was collected by the SPECIM hyperspectral system in the Laboratory of Analysis and Nondestructive Investigation of Heritage Objects (LANBOZ) in the National Museum in Kraków. This image consists of 455 × 310 pixels. Each pixel has 256 spectral bands in the wavelength range from 1000 to 2500 nm. Ground truth consists of manual annotations of different green pigments used in the mixture of paints for various painting regions. The image of oil paints on paper was used, selected from the four available, as it was considered one of the more challenging images.
The layout of classes present in this image was especially designed to verify hyperspectral classifiers. The different chemical compositions of the paints used introduce variations of class spectra, yet at the same time all paints are variations of the green pigment with a more or less greenish hue. The classification problem is thus difficult, but not exceedingly so. The regular grid layout, with different thicknesses of paint and fragments where one pigment overpaints another, introduces spatial diversity in the spectra. Since the image is artificially created, the ground truth can be precisely marked. The original purpose of the image was to evaluate identification of copper pigments, which are difficult to differentiate by other (non-hyperspectral) sensors. Here we take advantage of its regularity by complementing the original ground truth ($n_{GT} = 5$ classes) with a joined set ($n_{GT-2} = 2$ classes) and a split set ($n_{GT-10} = 10$ classes). Those two sets of modified ground truth allow us to compare the proposed grid scheme, as tested in Experiments 1 and 2, with a ground-truth-based pre-training with more and fewer classes than the original set. We argue that the regular layout of this image is more suited for this experiment than e.g. the Indian Pines or Pavia University images; the use of an additional dataset allows us to further verify the generalization potential of our approach.
In the case of this dataset, the mean was subtracted as in the previous experiments. Additionally, the standard deviation $\sigma_b$ of each hyperspectral band $b$ was calculated and then all pixels were divided by the corresponding standard deviation value, $x(b) := x(b) / \sigma_b$. In this experiment, as in the previous one, the neural network based on [16] was used. The training size was equal to 5 training samples per class and there were 50 experiment runs for each examined case.
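The band-wise transformation described above (mean subtraction followed by division by the standard deviation) can be sketched as below. The function name `standardize_bands`, the (height, width, bands) array layout and the random toy cube are assumptions made for illustration; the sketch also assumes that no band is constant, since a zero standard deviation would lead to division by zero.

```python
import numpy as np

def standardize_bands(cube):
    """Subtract the per-band mean m_b, then divide by the per-band std sigma_b.

    Hypothetical helper; `cube` is assumed to have shape (height, width, bands).
    """
    pixels = cube.reshape(-1, cube.shape[-1]).astype(np.float64)
    m_b = pixels.mean(axis=0)        # one mean per spectral band
    sigma_b = pixels.std(axis=0)     # one standard deviation per spectral band
    return ((pixels - m_b) / sigma_b).reshape(cube.shape)

rng = np.random.default_rng(0)
cube = rng.normal(loc=3.0, scale=2.0, size=(8, 6, 4))   # toy data, 4 bands
out = standardize_bands(cube)
print(out.mean(axis=(0, 1)), out.std(axis=(0, 1)))
```

After the transformation each band has zero mean and unit standard deviation over the image, so bands with very different dynamic ranges contribute on a comparable scale.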
The following cases were investigated: 1. The performance of DLNN with pre-training performed with 2 classes prepared from joining the ground truth classes (GT-2).
2. The performance of DLNN with pre-training performed with 10 classes prepared by splitting the ground truth classes (GT-10).

Experiment 4
In this experiment, we examine the claims from subsection 2.2 about the emergence of data-dependent representations during neural network training using the proposed artificial labelling scheme with noisy labels. To this end, we visualised internal network parameters resulting from network training using the t-SNE algorithm [65]. In the experiment, we used the neural network architecture based on [16] and the Indian Pines dataset described in subsection 2.5.1. We trained the network on the dataset using the following scenarios: 1. The network was trained using 1600 labelled samples, with 200 samples per class. This scenario represents the neural network trained with abundant information about the data: unrealistic, but convenient from the point of view of the network's requirements.
2. The network was trained using 40 labelled samples, with 5 samples per class. This scenario represents the neural network trained with very limited information about the data: a realistic, but difficult learning problem.
3. The network was trained using only the artificial labels created as explained in subsection 2.3. Therefore, the network did not 'see' the true labels and could create its internal representations based only on the noisy labels provided for training.
4. The network was trained using the complete pre-training and fine-tuning scheme introduced in this section. That is, first it was pre-trained using artificial labels as in point 3, and then all layers except the last were fine-tuned using a training set analogous to the one from point 2. This scenario was introduced to help explain the impact of the fine-tuning step in our approach.
As a result, we obtain four trained neural networks. Next, using a validation dataset, we extract the activations of the next-to-last layer of each considered network and apply the t-SNE algorithm, a method for visualising high-dimensional data, to check whether the layers right before the classification layers learned useful data representations.

Results
This section presents the results of the experiments introduced in subsection 2.5.

Experiment 1
The results of the first experiment are presented in Table 1. Each column presents the results for one type of network, and each row for a given dataset and number of training examples. Each table cell presents the results with and without pre-training as Overall Accuracy in percent, together with the standard deviation. The results in Table 1 were computed from a batch of n = 15 independent runs for each case. The value of n was chosen to provide robust results, after a set of preliminary runs with different values of n. A Mann-Whitney U test was performed on the results to confirm the statistical significance of the improvement gained with the proposed method. As Overall Accuracy can be sensitive to class imbalance, Average Accuracy and the κ coefficient were computed for additional verification and inspected for any decrease in performance.
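All three scores can be derived from a confusion matrix. The following sketch shows one way to compute them and why OA alone can be misleading under class imbalance. The function name `classification_scores` and the handling of the degenerate case are assumptions of this illustration.

```python
def classification_scores(y_true, y_pred):
    """Return (OA, AA, kappa) computed from true and predicted class labels."""
    classes = sorted(set(y_true) | set(y_pred))
    index = {c: i for i, c in enumerate(classes)}
    k = len(classes)

    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1

    n = len(y_true)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]

    # Overall Accuracy: fraction of all samples classified correctly.
    oa = sum(cm[i][i] for i in range(k)) / n

    # Average Accuracy: mean of per-class accuracies (recalls), so every
    # class counts equally regardless of its size.
    recalls = [cm[i][i] / row_sums[i] for i in range(k) if row_sums[i] > 0]
    aa = sum(recalls) / len(recalls)

    # Cohen's kappa: agreement corrected for the agreement expected by chance.
    pe = sum(r * c for r, c in zip(row_sums, col_sums)) / (n * n)
    # Kappa is undefined when pe == 1; this sketch reports 0.0 in that case.
    kappa = (oa - pe) / (1 - pe) if pe != 1 else 0.0
    return oa, aa, kappa
```

For a toy problem with an 8:2 class imbalance in which every sample is assigned to the majority class, this gives OA = 0.8 but AA = 0.5 and κ = 0, which is the kind of discrepancy the additional scores are meant to expose.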
The presented results show that application of the proposed method leads to a definite and consistent improvement in accuracy across different images, numbers of ground truth labels used and network architectures. In all but one case, the improvement is statistically significant, and in some cases it exceeds 20 percentage points. The most challenging scenario is the one with 5 training samples per class. Even the average overall accuracy achieved by the architecture originally examined on a small training set [15] does not exceed 67% on the Indian Pines dataset. After the application of the proposed method, performance improves to 72.8% OA. The largest improvement is seen for the architecture [16]: on the IP dataset with only 5 training samples per class in the fine-tuning procedure, the average OA improves from 52.62% to 74.04%, a gain of 21.42 percentage points. This is to be expected, as this architecture has the most potential to benefit from additional training samples. Considering these improvements, the results of the experiment support the stated hypothesis and the validity of the proposed approach. The qualitative evaluation of selected realizations (corresponding to the median score) is presented in Figures 4 and 5.
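The significance test used for these comparisons can be illustrated with a small, self-contained sketch. It computes the Mann-Whitney U statistic with average ranks for ties and a two-sided p-value from the tie-corrected normal approximation, without continuity correction. Both the approximation and the function name `mann_whitney_u` are assumptions of this illustration; in practice a statistics library routine such as `scipy.stats.mannwhitneyu` would normally be used.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic of sample x, p-value). Tied values receive average
    ranks and the variance includes the usual tie correction.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])

    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold equal values with 1-based ranks i+1..j.
        avg_rank = (i + 1 + j) / 2.0
        for pos in range(i, j):
            ranks[pos] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0

    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a difference
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value
```

Applied to two lists of 15 overall accuracies each (one with and one without pre-training), a p-value below the chosen threshold indicates a statistically significant difference between the two configurations.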

Experiment 2
The results of the experiment are presented in Table 2. For each grid size or the number of stripes, the overall accuracy and the standard deviation are given. These statistics are based on 50 experiment runs for each artificial labelling scheme.
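The two artificial labelling layouts compared in this experiment can be sketched as follows. The sketch assumes that a grid size such as 5 × 5 denotes the number of cells along each image axis and that stripes run horizontally; the function names `grid_labels` and `stripe_labels` are hypothetical, and the actual implementation may differ in these details.

```python
def grid_labels(height, width, n_rows, n_cols):
    """Label every pixel with the index of the grid cell that contains it.

    The image is divided into n_rows x n_cols rectangular cells, so the
    result contains n_rows * n_cols artificial classes numbered from 0.
    """
    return [
        [(r * n_rows // height) * n_cols + (c * n_cols // width)
         for c in range(width)]
        for r in range(height)
    ]


def stripe_labels(height, width, n_stripes):
    """Label every pixel with the index of its horizontal stripe.

    A stripe layout is a grid with a single column, so each artificial
    class spans the full image width instead of a local neighbourhood.
    """
    return grid_labels(height, width, n_stripes, 1)
```

For a 145 × 145 pixel image, `grid_labels(145, 145, 5, 5)` and `stripe_labels(145, 145, 25)` both produce 25 artificial classes, but only the grid confines each class to a compact local area.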
It can be seen that the score rises sharply until the number of artificial classes reaches approximately the number of original classes (at 5 × 5; note that the original IP ground truth leaves a sizeable portion of background unmarked, which most probably would contribute some additional classes if marked). After that value, there is a declining trend. It can be noted that the scores are higher with smaller patches. It seems viable to conclude that when the original number of classes is unknown, it is better to overestimate than to underestimate it. In the latter case, it is possible that even a chance guess would provide a satisfactory performance. The stripes do not form as good a training set as the rectangular grid, which confirms the initial supposition that artificial classes should be confined to local areas. Some improvement, however, is still seen, which supports our overall proposition that general artificial labelling can be used for improving the DLNN performance without precise estimation of the artificial class patch size.
Table 1: The result of the first experiment. Each row presents Overall Accuracy (OA), Average Accuracy (AA) and Cohen's kappa (κ) for a given scenario. IP denotes the Indian Pines dataset and PU the Pavia University dataset; the further differentiation is by the number of samples per class in fine-tuning. Accuracies are given as averages with standard deviations, with and without pre-training, for the three investigated network architectures: [16], [15] and [63].

Experiment 3
Table 3 presents the results of the third experiment. The overall accuracy was calculated based on n = 50 runs for each examined scenario. Here, the original performance (GT) can be significantly improved by the grid-based artificial labelling (see results for 5 × 5, 20 × 20, 30 × 30). However, in this case the performance gain can be compared against that obtained with label sets created from ground truth data (GT-2, GT-10). As can be expected, the ground truth data provides a higher performance; however, the artificial labelling provides half of that gain with no prior information needed. The ground truth experiments GT-2 and GT-10 also confirm the observation that splitting classes is a better option than joining them. The latter observation provides additional support for the conclusion that many small classes (dense grid) are preferable to few large ones (sparse grid).
Table 3: The third experiment results. Evaluation of pre-training on the Pigments dataset using the proposed approach and classes created from ground truth. The objective was to compare the performance of artificial labels of different sizes with those created through splitting or joining the ground truth.

Experiment 4
The results of the experiment are presented in Figure 6. As expected, the network trained using 1600 true-labelled samples generated good internal representations, as shown by the good separability of the classes. In contrast, the network trained using only 5 samples per class did not generate representations allowing the separation of samples of different classes. In scenario 3, the classes are clearly better separated than in scenario 2, though not as well as in scenario 1. Moreover, we did not observe any noticeable differences between scenarios 3 and 4. We argue that the presented results suggest that useful data-related representations emerge during neural network training with the proposed artificial labelling scheme, even before the fine-tuning step.

Our results confirm the validity of our proposition: a simple artificial labelling through grouping of the samples based on a local neighbourhood provides an efficient transfer learning scheme. It brings significant improvements in accuracy across datasets and DLNN configurations. The results for different datasets, which have distinctive ground truth layouts, suggest that the improvement does not result from a chance alignment of the grid with the regularity of a particular ground truth pattern. The local structure is also important, as shown by the advantage of grid division over stripe division. The generally better performance of a higher over a lower number of artificial classes suggests that, for transfer learning, it is less important to locate the exact number of classes than to isolate and learn their components, perhaps for better internal feature representation.
We view the main advantage of the proposed method as enhancing the training of a neural network for hyperspectral remote sensing classification. The proposed pre-training offers a number of benefits:
1. It enhances the training of neural networks in the hyperspectral classification scenario. With the low number of training samples in typical scenarios (e.g. 5 to 15 per class, sometimes even fewer), the number of free network parameters can be several orders of magnitude higher than the number of training samples, which poses a risk of overtraining.
2. By splitting the training into two phases, it shifts some of the computational burden of network training to the time before an expert is called in to perform labelling, making more effective use of the expert's time.
3. The larger number of training samples available can be of use when different network architectures are compared for the same problem, or when searching the hyperparameter space.
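The imbalance mentioned in the first point can be made concrete with a short calculation. The layer sizes below describe a purely hypothetical small network and are not those of the architectures [16], [15] or [63]; the number of classes and samples per class are likewise illustrative values.

```python
def conv_params(kernel_h, kernel_w, channels_in, channels_out):
    """Weights plus biases of a 2D convolutional layer."""
    return kernel_h * kernel_w * channels_in * channels_out + channels_out


def dense_params(units_in, units_out):
    """Weights plus biases of a fully connected layer."""
    return units_in * units_out + units_out


# Hypothetical network: two 3x3 convolutions on a 200-band input,
# followed by two fully connected layers ending in 16 class outputs.
layer_sizes = [
    conv_params(3, 3, 200, 64),
    conv_params(3, 3, 64, 128),
    dense_params(128, 256),
    dense_params(256, 16),
]
total_params = sum(layer_sizes)

# Hypothetical labelled training set: 5 samples for each of 16 classes.
n_classes = 16
samples_per_class = 5
n_samples = n_classes * samples_per_class

# Number of free parameters per labelled training sample.
ratio = total_params / n_samples
```

Even this modest hypothetical network has more than three orders of magnitude more free parameters than labelled samples, which illustrates the risk of overtraining when only a few labels per class are available.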
An open question is whether a clustering algorithm like [33], or outlier segmentation [48], could be adapted here to achieve greater efficiency. It is probable that a more complex artificial labelling algorithm could outperform the proposed solution; however, even in that case, a simple, generally applicable heuristic that improves performance can be of value. Our approach shares its motivation with self-taught learning [49], in that we want the classifier to derive a high-level input representation from the unlabelled data; however, we use the same data for both training stages and instead change the label set. It also avoids combining neural and non-neural approaches, and prevents introducing additional assumptions through the manual selection of the latter.
A qualitative examination of the pre-training results shows that some class structure is visible after pre-training (see examples in Figure 7). When comparing pre-training images associated with better or worse final (after fine-tuning) results, we noticed no identifiable features of this structure. However, the general level of structure visible after pre-training relates to the final performance. The network architecture based on [63] is the best at learning the grid of artificial classes and also the worst at the final classification. The other two networks, based on [16,15], have more complex pre-training results and correspondingly better final results. This suggests that the training scheme and/or network architecture functions as a form of regularization that prevents overtraining, and that the pre-training classification result could possibly also be used to control pre-training and avoid overtraining. The emergence of partial class structure in the pre-training phase, which does not use ground truth and hence can be viewed as unsupervised processing, also suggests that this approach can be adapted to solve unsupervised tasks, e.g. clustering or anomaly detection.
To provide additional verification, we analysed per-class classification scores for both datasets, using the data from the first experiment and the same Mann-Whitney U test with P < 0.05. As could be expected, performance gains are unequal, as classes differ in their overlap and general difficulty of classification. However, the individual classes showed improvement in most cases. Across 198 tests, the improvement was statistically significant in 104 cases; of the remaining cases, in 39 an accuracy of 100% was achieved irrespective of pre-training, and in 32 pre-training improved the mean of the class score. In the remaining 23 cases, where the pre-training score mean was lower than the reference, the average difference was below two percentage points. The proposed method can thus be viewed as 'not damaging' to individual class scores.
Additionally, a batch of experiments was performed to analyse sensitivity to small variations in hyperparameter settings; the results were very similar to those presented. A separate experiment was conducted to analyse the time requirements of training the networks. The results of this experiment are presented in Figure 8. They show that it is more important to train the network during the pre-training stage than during the fine-tuning stage (the results clearly improve when moving vertically within the grid in Figure 8, as opposed to moving horizontally). For lower numbers of pre-training iterations (10k to 50k), even a moderate increase leads to a definite improvement in classification accuracy. The results also suggest that it could be possible to reduce the training time in both stages without sacrificing the effectiveness of classification. Moreover, it can be presumed that choosing a different number of iterations for the pre-training and fine-tuning stages could lead to even better results than those presented in this work (for example, training the network for 90k iterations in the pre-training stage and 20k iterations in the fine-tuning stage yielded an accuracy of 81.34%).
Analysing the results in Table 1, one can notice that pre-training improves accuracy in some networks more than in others. We suspect that an important factor determining such differences is the capacity of the neural networks. Our reasoning is that an artificial neural network with a greater number of parameters is better able to process the information contained in the entire image, which we utilize in the pre-training phase. Therefore, the architecture [15], which has the smallest number of parameters, achieves a smaller gain in accuracy than the other two networks.
However, one must be aware that a number of other factors affect network performance. In particular, architectures [16,15] were designed for the task of HSI classification. This applies especially to the architecture [15], which was studied on small training data sets and therefore has competitive accuracy even without pre-training. On the other hand, architecture [63] was designed for a slightly different training regimen, which may explain why it achieves worse results than the other two.
Our approach could be used in semi-automatic systems like [66], which use only a part of the annotation, and could be made fully unsupervised. Furthermore, we believe this is one approach to self-taught learning [49] that can be helpful in diverse applications of deep learning models. We note, however, that optimization would require further studies to address the issue of which layers benefit most from this scheme, similarly to [36]. Our experiments show that the proposed scheme is largely resistant to incorrect estimation of the number of classes, hence its parametrization can be considered low-cost. It can also be viewed as a confirmation of the traditional software development principle of 'divide and conquer', and of the even older maxim 'divide et impera'.
We have presented and verified a simple method for pre-training of DLNN for hyperspectral classification, based on the hypothesis that the spatial similarity of unlabelled data points can be utilized to gain accuracy in hyperspectral classification. In the first experiment, we showed that for all three neural network architectures tested, and for both reference datasets, the proposed procedure leads to an improvement in classification accuracy for a small number of training samples. In the second and third experiments, we analysed the properties of the proposed method; the obtained results suggest that the number and shape of the pixel blobs have an impact on the effectiveness of the method. Specifically, we conclude from the second experiment that it is safer to underestimate the size of a label cluster than to overestimate it, as this reduces the chance of joining separate classes. This conclusion is in line with the results of the third experiment, from which we also conclude that it is better to split ground truth classes than to join them.
The absence of a requirement for training labels provides an important advantage: it shifts the need for the expert's participation and data labelling from the start of the data analysis process to its late stages. This allows the potentially long time between the acquisition and the start of the data interpretation stage to be used for pre-training the network, and decreases the delay between the expert's labelling and obtaining the classification result. Considering the length of time required to train deep neural networks, this is a significant advantage for their applications. An additional benefit is that multiple unannotated images can be used in the pre-training stage, potentially increasing the robustness of the result.
[...] [16], [15] and [63] respectively. Columns present the three cases of number of true training samples per class in fine-tuning (5s, 15s and 50s). For each result, the Overall Accuracy (OA), Average Accuracy (AA) and κ coefficient are reported. Isolated grey points mark locations of the training samples, and are excluded from the evaluation.
[...] [16]. On the y-axis, the number of epochs for the pre-training stage is written, while on the x-axis the number of epochs for the fine-tuning stage is written. The results suggest the relative importance of the pre-training stage in comparison to the fine-tuning stage.