Hyperspectral imaging technology combines the power of spectroscopy with digital optical imagery. It offers the potential to explore the physical and chemical composition of depicted objects, which uniquely marks their behaviour when interacting with a light source at different wavelengths of the electromagnetic spectrum. Historically, the technology was introduced in the remote sensing field [1]. However, it quickly spread to numerous other fields, such as food quality and safety assessment, precision agriculture, medical diagnosis, artwork authentication, biotechnology, pharmaceuticals, defence, and home security [2], among others. Notwithstanding this richness of information, hyperspectral imaging comes with considerable challenges. These mostly relate to the high-dimensional space of the generated images, owing to the presence of numerous contiguous spectral bands; the large volume of data, which is incompatible with the limited amount of training data available; and the associated high computational cost. Such challenges render traditional computer vision algorithms insufficient to process and analyse such images [3].

The high dimensionality problem of hyperspectral images has motivated several works. In general, two types of techniques have been developed in this regard, operating in either a supervised [4] or an unsupervised manner: band-selection techniques [5,6], which select the most informative subset of the bands, and feature extraction techniques [7], which transform the data to a lower dimension. In [8], a feature extraction method called improving distribution analysis (IDA) is proposed, which aims to simplify the data and reduce the computational complexity of the HSI classification model. Ref. [9] proposes a new semi-supervised feature reduction method to improve the discriminatory performance of the algorithms. Because finding a small number of bands that can represent hyperspectral remote sensing images is difficult, ref. [10] proposes a random projection algorithm that is applicable to large images and maintains class separability.

Driven by this progress, hyperspectral image classification has received significant attention in recent decades, leading to the development of high-performing methods. In [11], a synthesis of traditional machine learning and contemporary deep learning techniques is elucidated for the classification of hyperspectral images. However, the need for large and expensive labelled training data has constrained many deep learning methodologies to concentrate predominantly on the spectral dimension. The presence of contiguous bands induces redundancy, making the inclusion of spatial features paramount to achieving distinct separability among the various classes. Several studies have leveraged deep learning to exploit both spatial and spectral contexts, enhancing the classification performance of models in hyperspectral image (HSI) classification tasks. In [12], the authors exploit deep learning techniques for the HSI classification task, proposing a method that utilises both the spatial and spectral context to enhance the performance of the models. Ref. [13] presents a joint spatial–spectral HSI classification method based on a different-scale two-stream convolutional network and a spatial enhancement strategy. Ref. [14] proposes an HSI reconstruction model based on a deep CNN to enhance the spatial features. The use of Graph Convolutional Networks (GCN) [15] and multi-level Graph Learning Networks (MGLN) [16] further underscores the diversity of approaches to HSI classification. Present research endeavours predominantly concentrate on pixel-level, single-label classification, with less emphasis on patch-level, multi-label classification, thereby not fully harnessing the wealth of information encapsulated in hyperspectral images. Spectral unmixing algorithms [17] have been instrumental in refining the HSI classification task by emphasising spectral variability. Several works [18,19] have focused on isolating and distinguishing multiple spectral mixes, or endmembers, within individual pixels. To tackle the constraints imposed by the scarcity of labelled hyperspectral data, ref. [20] proposes two methods specifically tailored for hyperspectral analysis: an unsupervised data augmentation technique that augments the spectrum of the image, and a spectral structure extraction method. Both methods aim at optimising classification accuracy and overall performance in situations characterised by few labelled samples. Concurrently, in [21], the challenges arising from limited and segregated datasets are addressed by introducing a paradigm for hyperspectral image classification that allows a classification model to be trained on one dataset and evaluated on different datasets/scenes. However, due to differences between the scenes, a gap exists between the categories of the different HSI datasets. To narrow this gap, the authors utilise label semantic representations to facilitate label association across a spectrum of diverse datasets, offering a holistic approach to hyperspectral classification amidst data limitations.
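As an illustration of the dimensionality-reduction line of work surveyed above, a Gaussian random projection in the spirit of [10] can be sketched as follows. The cube size, band count, and target dimension are arbitrary toy values chosen for illustration, not the settings of the cited work.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy hyperspectral cube: 50 x 50 pixels, 200 contiguous spectral bands
# (illustrative sizes, not those of any benchmark dataset).
cube = rng.normal(size=(50, 50, 200))
pixels = cube.reshape(-1, 200)            # one spectrum per row

# Gaussian random projection to k dimensions. By the Johnson-Lindenstrauss
# lemma, pairwise distances (and hence class separability) are
# approximately preserved with high probability.
k = 30
R = rng.normal(size=(200, k)) / np.sqrt(k)
reduced = pixels @ R

print(reduced.shape)  # (2500, 30)
```

Because the projection matrix is data-independent, it scales to large images: no eigendecomposition of the full covariance is required, in contrast to PCA-style feature extraction.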
In our study, we adopt a deep learning approach in a supervised learning framework, focusing particularly on dissecting the spatial extent of images into patches of smaller dimensions. Our model is designed to preserve and exploit the joint spatial–spectral attributes, enabling the identification of multiple entities within a confined spatial area through learned features. The learning process that enables the acquisition of this knowledge is facilitated through a deep autoencoder followed by a classifier. Moreover, given the likelihood of the co-occurrence of multiple objects/entities within a depicted region, we formulate the prediction task as a multi-label prediction problem, ensuring the robustness of the model to the complex scenarios encountered in hyperspectral images. Various classification networks with two components are evident in the existing literature; however, the conceptualisation of the tasks of these networks predominantly concentrates on specific aspects of data and training. Herein, data predominantly pertain to pixel-level, single-label instances. Even when patches from the source images are used as input data, these are densely sampled and attributed to a single label corresponding to the centre pixel of the patch, in order to maintain the original count of labelled pixels. The training approach for such two-component networks typically employs the Cascade training scheme, as used in [22]. Within the Cascade scheme, the feature extraction component of the network, specifically the autoencoder, is subject to independent pre-training to produce accurate reconstructions of the input. Subsequently, the decoder is substituted with a classifier, which is then trained to predict the pertinent label(s) whilst the weights of the pre-trained encoder are frozen.
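The two stages of the Cascade scheme can be sketched with a minimal linear autoencoder and a sigmoid multi-label classifier trained by hand-coded gradient descent. The dimensions, learning rates, and toy data below are illustrative assumptions, not the actual architecture or settings of [22].

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples, 20 spectral features, 3 possible (multi-)labels.
X = rng.normal(size=(100, 20))
Y = (X @ rng.normal(size=(20, 3)) > 0).astype(float)

# Stage 1: pre-train a linear autoencoder (encoder W_e, decoder W_d)
# to reconstruct the input from an 8-dimensional hidden representation.
W_e = rng.normal(scale=0.1, size=(20, 8))
W_d = rng.normal(scale=0.1, size=(8, 20))
init_mse = np.mean((X @ W_e @ W_d - X) ** 2)
for _ in range(500):
    H = X @ W_e                        # hidden representation
    G = (H @ W_d - X) / len(X)         # gradient of 0.5 * MSE w.r.t. output
    W_e -= 0.05 * X.T @ (G @ W_d.T)
    W_d -= 0.05 * H.T @ G
final_mse = np.mean((X @ W_e @ W_d - X) ** 2)

# Stage 2: substitute the decoder with a multi-label (sigmoid) classifier
# and train it while the pre-trained encoder weights stay frozen.
W_c = np.zeros((8, 3))
for _ in range(500):
    H = X @ W_e                        # frozen encoder: W_e is not updated
    P = 1 / (1 + np.exp(-(H @ W_c)))   # per-label probabilities
    W_c -= 0.3 * H.T @ (P - Y) / len(X)
```

The defining property of the scheme is visible in stage 2: only `W_c` is updated, so the classifier must work with whatever features the unsupervised pre-training happened to produce.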
Additionally, the literature reflects the presence of a Joint training scheme, as seen in [23]. However, the central objective of such studies remains primarily the analysis of the reconstruction capability of the autoencoder, with less focus on classification tasks. The Joint training approach implies a concurrent training process for both components: in each epoch, the autoencoder strives for a precise reconstruction from a compressed, hidden representation, whereas the classifier concurrently optimises to predict the accurate label(s) using the hidden representation as its input. In our research, we have performed a meticulous and systematic examination of the diverse procedures, or schemes, potentially applicable to training such a two-component network. In this context, we have delineated three distinct training methodologies for our network, namely the Iterative, the Joint, and the Cascade training schemes.
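In the same toy linear setting, the Joint scheme can be sketched as a single loop optimising both objectives at once; the loss weighting `lam`, the dimensions, and the data are illustrative assumptions rather than the paper's actual configuration. The key point is that the encoder receives gradient from both objectives in every step.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 100 samples, 20 spectral features, 3 possible (multi-)labels.
X = rng.normal(size=(100, 20))
Y = (X @ rng.normal(size=(20, 3)) > 0).astype(float)

W_e = rng.normal(scale=0.1, size=(20, 8))   # encoder
W_d = rng.normal(scale=0.1, size=(8, 20))   # decoder
W_c = np.zeros((8, 3))                      # classifier head
lam = 0.5   # weight of the classification term (assumed hyperparameter)

init_mse = np.mean((X @ W_e @ W_d - X) ** 2)
for _ in range(500):
    H = X @ W_e                            # shared hidden representation
    P = 1 / (1 + np.exp(-(H @ W_c)))       # per-label probabilities
    G_rec = (H @ W_d - X) / len(X)         # grad of 0.5 * MSE
    G_cls = (P - Y) / len(X)               # grad of BCE w.r.t. logits
    W_d -= 0.05 * H.T @ G_rec
    W_c -= 0.05 * lam * H.T @ G_cls
    # Joint scheme: the encoder is updated with gradients from BOTH the
    # reconstruction and the classification objectives simultaneously.
    W_e -= 0.05 * X.T @ (G_rec @ W_d.T + lam * G_cls @ W_c.T)
final_mse = np.mean((X @ W_e @ W_d - X) ** 2)
```

Compared with the Cascade scheme, the supervised signal here shapes the hidden representation from the very first epoch, which is precisely the property the training-scheme comparison in this paper examines.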
The contributions of this paper are twofold. First, this paper elevates hyperspectral image analysis by emphasising patches annotated with multiple labels, intending to resonate more profoundly with the spatial–spectral characteristics and the richness of information inherent in the images. This approach stands in stark contrast to the practice predominantly adhered to in the existing literature of focusing on single-pixel, single-label analysis. Our observations highlight a perceptible decline in the performance of a pixel-level, multi-class classifier when trained on patches annotated with single labels corresponding to the centre pixel. We conclude that leveraging the spatial–spectral extent available in such images, in tandem with multi-labelled ground truth, enhances the classifier's ability to learn the latent, valuable features embedded within hyperspectral images. Second, we systematically explore three distinct training schemes within the paradigm of multi-label prediction. The findings of our study, supported by the attained results, hint at the predominance of the Joint scheme. However, the Iterative scheme shows a promising propensity for early sharing of learnable features between the feature extraction component and the classifier. This attribute translates to higher performance, particularly in scenarios where the data are complex, and to a mitigation of overfitting. Furthermore, our results highlight the efficacy of this scheme over the ubiquitously employed Cascade training scheme, especially evident on datasets marked by an increased number of samples with multi-labelled ground truth, leading to enhanced performance metrics. These results are also observed in experiments we conducted on a patch-level, single-label classification task.
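The contrast between conventional centre-pixel single-label annotation and the patch-level multi-label annotation advocated here can be illustrated on a toy labelled scene. The class map, patch size, and helper functions below are invented for illustration and are not the datasets or code of this work.

```python
import numpy as np

# Toy ground-truth map: a 6 x 6 scene with three classes (0, 1, 2).
gt = np.zeros((6, 6), dtype=int)
gt[:, 3:] = 1          # right half of the scene is class 1
gt[4:, :2] = 2         # bottom-left corner is class 2

def centre_label(gt, r, c, size=3):
    # Conventional annotation: the single label of the patch's centre pixel.
    return gt[r + size // 2, c + size // 2]

def multi_label(gt, r, c, size=3, n_classes=3):
    # Patch-level annotation: a binary vector flagging every class that
    # appears anywhere inside the patch.
    patch = gt[r:r + size, c:c + size]
    y = np.zeros(n_classes, dtype=int)
    y[np.unique(patch)] = 1
    return y

# A 3x3 patch straddling all three regions:
print(centre_label(gt, 3, 1))   # -> 0 (only the centre pixel's class)
print(multi_label(gt, 3, 1))    # -> [1 1 1] (all three classes are present)
```

The single centre-pixel label discards the other classes genuinely present in the patch; the multi-label vector retains them, which is the information the proposed formulation exploits.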