A generic self-supervised learning (SSL) framework for representation learning from spectral-spatial features of unlabelled remote sensing imagery

Remote sensing data has been widely used for various Earth Observation (EO) missions such as land use and cover classification, weather forecasting, agricultural management, and environmental monitoring. Most existing remote sensing data-based models are based on supervised learning that requires large and representative human-labelled data for model training, which is costly and time-consuming. Recently, self-supervised learning (SSL) has enabled models to learn representations from orders of magnitude more unlabelled data. These representations have been proven to boost the performance of downstream tasks and have potential for remote sensing applications. The success of SSL is heavily dependent on a pre-designed pretext task, which introduces an inductive bias into the model from a large amount of unlabelled data. Since remote sensing imagery has rich spectral information beyond the standard RGB colour space, the pretext tasks established in computer vision based on RGB images may not extend straightforwardly to the multi/hyperspectral domain. To address this challenge, this work designs a novel SSL framework that is capable of learning representations from both the spectral and spatial information of unlabelled data. The framework contains two novel pretext tasks for object-based and pixel-based remote sensing data analysis methods, respectively. Through the evaluation of two typical downstream tasks (a multi-label land cover classification task on Sentinel-2 multispectral datasets and a ground soil parameter retrieval task on hyperspectral datasets), the results demonstrate that the representation obtained through the proposed SSL achieves a significant improvement in model performance. Compared with currently available SSL methods, the proposed method, which emphasizes both spectral and spatial features, outperforms these existing SSL methods on multispectral and hyperspectral remote sensing datasets.

Xin Zhang, Liangxiu Han*
Index Terms—Remote sensing, Self-supervised learning, Spectral and spatial features, Object-based method, Pixel-based method

I. INTRODUCTION

Earth observation through remote sensing data provides an unbiased, uninterrupted, and borderless view of human activities and natural processes. Through exploiting data collected from various aircraft and satellite systems equipped with multi/hyperspectral sensors, ranging from medium to very high spatial resolution, together with advanced data analysis and machine learning, people can gain digital information and insight to guide multiple applications concerning any corner of the planet [1]. In particular, remote sensing data analysis is essential for many applications such as environmental monitoring, natural resource management, disaster response, urban planning, and climate change studies [2], [3]. With the rapid development of sensor technology, the complexity of remote sensing data has increased significantly due to the rapid improvement in its spatial and spectral resolution, which poses challenges to remote sensing data analysis [4].
In general, existing remote sensing data analysis methods usually contain two fundamental components: data processing and feature extraction. Depending on the spatial and spectral resolution of the data, data processing methods can be broadly divided into pixel-based and object-based methods. Specifically, the most commonly used pixel-based methods take each individual pixel as input and utilize the rich spectral information for subsequent feature extraction tasks [5]. They are suitable for low to medium spatial resolution remote sensing data. As the spatial resolution of the data increases, individual pixels are no longer able to cover an object target on the ground. Object-based methods were introduced to segment an image into objects containing spectral and spatial information (e.g. shape/geometry and structure) [6] for subsequent feature extraction and analysis.
Regarding feature extraction, traditional machine learning methods, such as Classification and Regression Trees (CART) [7], Support Vector Machines (SVM) [8], and Random Forests (RF) [9], have been widely used for extracting features from remotely sensed data for various tasks, such as land cover classification [10], carbon emission estimation, and biomass estimation [11], [12]. The recent development of deep learning methods [13], [14], such as convolutional neural networks, has shown promise in remote sensing applications and achieved state-of-the-art performance [15], since the convolution operation is able to capture spatial-spectral information [16].
However, most existing machine/deep learning-based methods are supervised, which requires extensive annotated datasets. Acquiring such well-labelled data is labour-intensive and time-consuming. Recently, self-supervised learning has been proposed to learn patterns from unlabelled data and has been effectively applied in many fields such as computer vision [17], [18], natural language processing [19], [20], and object detection [21], [22]. Essentially, self-supervised learning consists of two steps: first, an auxiliary or pretext task using pseudo-labels (i.e., auto-generated labels) helps initialize the model parameters, which are then used for boosting downstream tasks such as classification, segmentation, and object detection.
Until now, a few SSL methods have been applied directly to remote sensing applications [23], [24], including land use/cover mapping [25], change detection [26], and nitrogen prediction [27]. Most of the existing self-supervised methods for remote sensing analysis are straightforward extensions of methods from the computer vision domain. These pretext tasks are designed to learn spatial features from RGB data, such as inpainting [28], disrupting the spatial order of the data, and randomly rotating the image [29]. However, given that remote sensing imagery contains spectral bands beyond the standard RGB colour space (i.e., both spectral and spatial information), it is insufficient to directly extend pretext tasks designed for RGB images to remote sensing data. To the best of our knowledge, there is currently no pretext task designed for spectral-spatial information extraction. Therefore, this work proposes a generic SSL framework for both spatial and spectral feature learning from label-free remote sensing data. This SSL can directly learn a high-level representation from remote sensing imagery to promote the performance of both pixel-based and object-based downstream tasks. The main contributions of this work are as follows:

1) We propose a generic SSL framework for both pixel-based and object-based remote sensing applications, with two novel pretext tasks. One reconstructs the spectral profile from masked data, which can be used to extract a representation of pixel information and improve the performance of downstream tasks belonging to pixel-based analysis. The other identifies objects from multiple views of the same object. These multiple views, including global views, local views, and spectral views, are derived from extensive spatial and spectral transformations of the data to allow the model to learn representations from the spatial-spectral information of the data. These representations can be used to improve the performance of downstream tasks belonging to object-based analysis.

2) We demonstrate that the proposed SSL is a novel way to learn representations from unlabelled large-scale remote sensing data. The proposed SSL method is applied to two downstream tasks on large multispectral and hyperspectral remote sensing datasets: a multi-label land cover classification on Sentinel-2 multispectral datasets and a ground soil parameter retrieval on hyperspectral datasets. We also compared the proposed method with existing SSL frameworks. The results show that the proposed SSL method, which emphasizes the spectral and spatial features in remote sensing data, achieves higher performance than the other three methods.

3) We analyse the impact of spatial-spectral features on the proposed SSL performance and visualize the features learned by SSL, which contributes to a deeper understanding of what makes a self-supervised feature representation useful for remote sensing data analysis.

II. RELATED WORK

A. Remote sensing analysis methods
In recent years, the amount of available remote sensing data has increased significantly, and so have its spatial and spectral resolution, which brings challenges to remote sensing analysis methods. Unlike conventional digital imagery, which captures electromagnetic emissions with only three bands (red, green, and blue) in the visible spectrum, remote sensing imagery has a cube form, often with multiple bands [30], covering a wider range of spectra, including the visible, infrared, and radio wave ranges. In most remote sensing imagery analysis methods, the spectral information of each image pixel, made up of hundreds of spectral bands, plays an important role. Another fundamental feature of remote sensing data is spatial information, which normally includes the texture, shape, and edges of the ground object. In most remote sensing analysis methods, extracting valid features from spectral and spatial information is the most vital component. Generally, feature extraction methods can be broadly divided into two categories: supervised and self-supervised or unsupervised learning [23].
Supervised learning is the most frequently used method for feature extraction from labelled data. Numerous traditional machine learning methods such as SVM [31], [32], RF [33], [10], and boosted decision trees [34], [35] have been widely used for feature extraction from remote sensing data. In recent years, deep learning has shown increasing success in a variety of computer vision tasks and is being used in remote sensing applications [15], [36]. In [14], deep learning methods are used in both pixel-based and object-based remote sensing applications and have demonstrated superior performance over traditional machine learning methods; the authors also evaluate the performance of a variety of deep learning models in land cover and object detection tasks. In [37], Google trained a deep learning model for land cover mapping using the Sentinel-2 10 m dataset, enabling real-time land cover prediction on a global scale.
However, supervised learning on remote sensing data requires large labelled datasets for model training, which poses several challenges. One of the big challenges is that manual annotation of big remote sensing data is expensive, time-consuming, labour-intensive, and subject to individual bias. Another major challenge for supervised learning in remote sensing is the location sensitivity of annotations: the accuracy of supervised learning methods relies on the location and distribution of the selected annotation areas, which makes these methods lack transferability.
Self-supervised learning provides a paradigm to address these challenges by training models with unlabelled data [38]. In general, the traditional self-supervised methods used in remote sensing applications cluster pixels in a dataset based on statistics only, without any user-defined training classes [39], [40]. The two most frequently used algorithms are ISODATA [41] and K-Means [42]. However, these traditional methods are designed for clustering, grouping, and dimensionality reduction [43], [44], and do not extract features for further analysis. In recent years, a new SSL research trend has emerged: learning representations without labels using deep learning models. These representations can be used to boost the performance of downstream applications, which has great potential for remote sensing.

B. Self-supervised learning (SSL) on remote sensing
In general, self-supervised learning (SSL) involves two tasks: a self-supervised pretext task and real downstream tasks. The pretext task aims to train a network by optimizing its objective in a self-supervised manner, using pseudo-labels (i.e., auto-generated labels) of unlabelled data to help initialize the model parameters. Through carefully designed pretext tasks, the network gains the ability to capture high-level representations of the input. Afterwards, the network can be transferred to supervised downstream tasks for real-world applications.
The success of SSL is heavily dependent on how well the pretext task is designed. The pretext task implicitly introduces an inductive bias into the model learned from a large amount of unlabelled data. If not designed properly, the model will only be able to find low-level features, which are difficult to use for real downstream tasks. Several pretext tasks have been proposed for self-supervised representation learning using visual common sense, such as predicting the rotation angle [29], predicting relative patch positions [45], and solving jigsaw puzzles [46].
There are two common strategies for pretext task design: 1) generative-based pretext tasks that reconstruct the input data (such as discriminating images created from distortion [47]), f(x) → x, or predict a label c that is self-generated from context and data augmentation, f(x) → c; and 2) contrastive learning [48] based pretext tasks that contrast inputs x1 and x2 that have similar meanings (for example, the encoded features of two different views of the same image should match [49], [50]), |f(x1) − f(x2)| → 0. Table I summarises the representative approaches for different types of pretext tasks.
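The contrastive objective |f(x1) − f(x2)| → 0 is usually realised with an InfoNCE-style loss over a batch, where matched views sit on the diagonal of a similarity matrix. The following is our own minimal NumPy sketch of this idea, not code from any of the cited works:

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """Toy InfoNCE contrastive loss.

    Rows of z1 and z2 are embeddings of two views of the same batch of
    images; row i of z1 and row i of z2 form a positive pair, and all
    other rows in the batch act as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                       # cosine similarities
    # Numerically stable log-softmax over each row.
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    idx = np.arange(len(z1))                       # positives on the diagonal
    return float(-log_prob[idx, idx].mean())
```

When the two sets of embeddings are well aligned the loss approaches zero; when positives are misaligned the loss grows, which is the signal that pulls matched views together.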
Naturally, these pretext tasks have also been used for remote sensing applications in a self-supervised manner. In [25] and [58], the random rotation pretext task is used to learn representations from RGB and SAR remote sensing data as a generative-based SSL; these representations are then used to boost remote sensing classification tasks. In [59] and [24], inpainting and relative position pretext tasks are used for segmentation and classification. In recent years, contrastive learning based SSL has also been widely used in remote sensing [60], [61], [27], [62]. Tile2Vec was the first self-supervised work using contrastive learning for remote sensing image representation learning [63]; it proposes a triplet loss that pulls neighbouring tiles in an image closer together while pushing spatially distant tiles further apart. [64] uses SimCLR-like contrastive learning to pre-train HSI classification models to reduce the requirement for massive annotations.
It is worth mentioning that most current pretext tasks are designed for RGB images, where spatial features are the primary features considered; only a few simple spectral augmentations [65], [66] are used for view generation. Remote sensing imagery contains rich spectral bands beyond the standard RGB image. Therefore, straightforward extensions to the multi/hyperspectral domain of methods established in computer vision may not be suitable. To the best of our knowledge, there is currently no self-supervised pretext task designed for spectral-spatial information extraction in the context of remote sensing analysis. In this paper, to address these limitations and exploit the spectral and spatial features of remote sensing data, we propose a novel SSL framework to capture spectral-spatial patterns from massive unlabelled remote sensing data.

III. THE PROPOSED METHOD
The proposed SSL aims to learn useful high-level representations that keep both the spatial and spectral information of label-free remote sensing data, and demonstrates that these representations can be used to boost downstream remote sensing tasks. The proposed SSL can be used for remote sensing analysis at both object and pixel levels:

1) One is an object-based SSL (ObjSSL), which is a kind of contrastive learning. This method is suitable for extracting features from high to very high spatial resolution remote sensing data. ObjSSL proposes a joint spatial-spectral aware multi-view pretext task, which is a classification problem. It uses a cross-entropy loss to measure how well the network can classify the representation amongst a set of multi-views of one target.
2) The other is a pixel-based SSL (PixSSL), which is a kind of generative learning suitable for low to medium spatial resolution images. We propose a spectral-aware pretext task for reconstructing the original spectral profile. A spectral-masked auto encoder-decoder is designed to learn meaningful latent representations.
The framework of the proposed SSL method is shown in Fig. 1. The first part is SSL training with unlabelled data; the trained representations and network are then used for downstream tasks through knowledge transfer, with a specific decoder added after the network for each specific task. We evaluate the performance of the SSL using two downstream tasks: a multi-label land cover classification task on Sentinel-2 multispectral datasets and a ground soil parameter retrieval task on hyperspectral datasets.

TABLE I: Representative approaches for different types of pretext tasks

Generative based:
- Denoising AE [51]: Reconstruct a clear image from noisy input
- Masked AE (MAE) [28]: Reconstruct randomly masked patches
- GANs [52]: Adversarial training with a generator and a discriminator
- Wasserstein GAN [53]: Train the generator to produce samples as close as possible to the real data distribution
- Relative position [45]: Predict the relative positions of random patch pairs
- Rotation [29]: Predict the rotation angle of a randomly rotated image
- Jigsaw puzzle [46]: Predict the correct order of the puzzle pieces

Contrastive learning based:
- MoCo V1-V3 [54], [55], [56]: Store negative samples in a queue and perform momentum updates to the key encoder
- SwAV [49]: Contrastive learning via online clustering
- BYOL [50]: Match a student network with a predictor to a slowly moving average teacher encoder
- SimSiam [57]: Explore the simplest contrastive SSL designs

A. Object-based SSL (ObjSSL)
ObjSSL is a contrastive learning method for learning representations of remote sensing data. The idea of contrastive learning is to learn representations that bring similar data points (positive pairs) closer while pushing randomly selected points (negative pairs) further away, or to maximize a contrastive mutual information lower bound between different views. The pretext task of ObjSSL is a classification problem that uses a contrastive loss to measure how well the model can classify the representation among a set of unrelated negative and positive samples. In this work, the positive samples are generated by discerning the representations of augmented views of the same data, while the negative pairs assume that different images in a batch during model training represent different categories. The flowchart of the work is shown in Fig. 2. There are two main parts in ObjSSL:

1) A novel multi-view pretext task that generates positive pairs for ObjSSL by producing different views of remote sensing data from both spectral and spatial perspectives. It is a composition of multiple data augmentation operations, including spectral-aware augmentation, regular augmentation, and local and global augmentation.

2) A self-distillation framework that uses two networks, a student network and a teacher network, to learn the representation from multi-views of the data. The student network is trained to match the output of a given teacher network.
1) Multi-view pretext task: In ObjSSL, positive pairs are generated by applying data augmentation to create noisy versions of the original samples. Appropriate data augmentation is essential for learning good, generalizable embedding features: it introduces variations to the original images without modifying their semantic meaning, thus encouraging the model to learn the essential features. In [67], the authors demonstrated that the composition of multiple data augmentation operations is crucial in defining contrastive prediction tasks that yield effective representations. In this work, a joint spatial-spectral aware multi-view pretext task is proposed to generate positive pairs of data for ObjSSL. It consists of a composition of multiple data augmentation operations: 1) regular augmentation, 2) local and global spatial augmentation, and 3) spectral-aware augmentation.
a) Regular augmentation: Regular augmentation includes common data transformations such as random rotation and zooming, Gaussian blur, and random noise.
b) Local and global augmentation: Local and global augmentation is used to generate views of different spatial areas. Given an input image X of size 120×120, the output of this augmentation is a set containing global views and several local views of smaller resolution. We assume that the original data contains the global context. The small crops, called local views, use an image size of 36×36, covering less than 50% of the global view; we assume they contain the local context. The two kinds of views are then fed into the self-distillation framework: all local views are passed through the student, while only the global view is passed through the teacher. This encourages the student network to interpolate context from a small crop and the teacher network to interpolate context from the bigger image.

c) Spectral-aware augmentation: Spectral-aware augmentation is a data transformation performed in parallel with local and global augmentation. Traditional colour-based augmentation applies a set of random transformations on random channels, including variations between channels, which inevitably changes the spectral order and the relative positions of the channels. In this work, the spectral-aware augmentation drops random channels (30%-50%) and replaces them with a value of zero. This guarantees that the relationships and relative positions of the different channels do not change. This view is passed through the student encoder, encouraging the student network to learn the full spectral context from the teacher network.
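The spatial and spectral augmentations above can be sketched as follows. This is our own minimal NumPy illustration under the stated sizes (120×120 global view, 36×36 local view, 30%-50% channel dropping); the function names are ours and the paper's exact augmentation pipeline may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_aware_aug(x, drop_min=0.3, drop_max=0.5):
    """Zero out a random 30%-50% subset of spectral channels (axis 0),
    leaving the order and relative positions of the surviving bands
    unchanged.  x: array of shape (C, H, W)."""
    c = x.shape[0]
    n_drop = int(c * rng.uniform(drop_min, drop_max))
    drop = rng.choice(c, size=n_drop, replace=False)
    out = x.copy()
    out[drop] = 0.0                     # dropped channels become zero
    return out

def local_crop(x, size=36):
    """Random 36x36 local view cut from a 120x120 global view."""
    _, h, w = x.shape
    i = rng.integers(0, h - size + 1)
    j = rng.integers(0, w - size + 1)
    return x[:, i:i + size, j:j + size]
```

In the framework described above, local crops would go through the student encoder while the full-size view goes through the teacher, and the spectral view goes through the student as well.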
2) Self-distillation framework: The self-distillation framework consists of teacher and student networks (encoders), which share the same structure but have different parameters (θt and θs). In this work, the Spectral-Spatial Vision Transformer [27], designed to extract spectral and spatial features from remote sensing data, is selected as the encoder. From a given image X, we generate a set of different views (X1, X2, X3, ...) by data augmentation. X1 and X2 are fed into the teacher and student encoders separately, and the outputs are the probability distributions Pt and Ps. This can be formulated as:

P(X) = softmax(g_θ(X) / τ)

where g is the encoder with parameters θ, and τ is a temperature parameter that controls the sharpness of the output distribution. We learn to match these distributions by minimizing the cross-entropy loss between Pt and Ps.
In this work, the teacher is a momentum teacher: the teacher's weights θt are an exponential moving average of the student's weights θs. The update rule for the teacher's weights is:

θt ← λ θt + (1 − λ) θs

with λ following a cosine schedule from 0.96 to 1 during training. The algorithm is summarized in Fig. 3.
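The momentum-teacher update above can be sketched directly. This is our own minimal illustration; only the update rule and the schedule endpoints (0.96 → 1) come from the text, while the exact cosine parameterisation is an assumption:

```python
import math

def ema_update(theta_t, theta_s, lam):
    """Momentum-teacher update: theta_t <- lam*theta_t + (1-lam)*theta_s.
    theta_t / theta_s are flat lists of parameters for illustration."""
    return [lam * t + (1 - lam) * s for t, s in zip(theta_t, theta_s)]

def cosine_lambda(step, total, base=0.96, final=1.0):
    """Cosine schedule for lam, rising from `base` at step 0 to `final`
    at the last step (one common parameterisation; an assumption here)."""
    return final - (final - base) * (math.cos(math.pi * step / total) + 1) / 2
```

Because only the student receives gradients, the teacher changes slowly, which stabilises the targets the student is trained to match.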

B. Pixel-based SSL (PixSSL)
PixSSL is a generative SSL method in which the pretext task is to reconstruct the original input while learning a meaningful latent representation. Fig. 4 shows the architecture of PixSSL. In this work, from a pixel view, we design a spectral information reconstruction task to learn latent representations from the rich spectral information of remote sensing data. There are three main innovations in PixSSL:

1) To ensure the relationships and relative positions of the different spectral channels remain unchanged, a spectral reconstructive pretext task is introduced to recover each pixel's spectral profile from masked data. Based on our experiments, we find that masking 50% of the spectral information yields a meaningful self-supervisory task.

2) An encoder-decoder architecture is designed to perform this pretext task. The encoder is used to generate a meaningful latent representation, and the decoder is used to recover the masked spectral profile.

3) Pixel-based analysis methods require processing every pixel within an image, which significantly increases the amount of computation. To optimise computational efficiency, our proposed encoder operates on a subset of the spectral data (the unmasked bands) to reduce the input size. Meanwhile, since the aim of the SSL is to train an encoder that generates meaningful latent representations for downstream tasks, we only add a lightweight decoder to reconstruct the spectral profile, reducing computational consumption.
1) The spectral-masked pretext task: The self-supervised pretext task in PixSSL aims to recover spectral information from masked data. In this work, we use a high masking ratio to randomly mask each pixel's spectral profile. The high ratio largely eliminates redundancy, resulting in a pretext task that cannot be easily solved by extrapolation from visible neighbouring bands. Through the PixSSL performance experiment, we have demonstrated that masking 50% of the spectral information yields a meaningful latent representation. The algorithm is summarized in Fig. 5.
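The masking step can be sketched as follows, assuming a 1-D spectral profile per pixel. This is our own illustration, not the authors' code; only the 50% ratio and the random-band masking come from the text:

```python
import numpy as np

def mask_spectrum(spectrum, ratio=0.5, rng=None):
    """Randomly mask `ratio` of a pixel's spectral bands.

    The encoder sees only the visible bands; the decoder must
    reconstruct the full profile.  Returns (visible_bands,
    kept_indices, mask) where mask is True for masked bands.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = spectrum.shape[0]
    n_keep = n - int(n * ratio)
    keep_idx = np.sort(rng.permutation(n)[:n_keep])  # keep band order intact
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return spectrum[keep_idx], keep_idx, mask
```

Sorting the kept indices preserves the relative order of the surviving bands, matching the requirement that channel relationships remain unchanged.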
2) The spectral-masked auto encoder-decoder network: In this work, we propose a spectral-masked autoencoder that reconstructs the original spectral information given only a partial spectrum. Our approach has an encoder that maps the pixel's spectral information to a latent representation, and a decoder that reconstructs the spectral profile from the latent representation. Fig. 4 illustrates the flowchart of PixSSL. We use an asymmetric design in which the encoder operates only on the partial (unmasked) spectrum, while the decoder reconstructs the full spectrum from the latent representation and mask tokens. The last layer of the decoder is a linear projection whose number of outputs equals the number of spectral channels of the data. The loss function computes the mean squared error (MSE) between the reconstructed and original data in the pixel space.
a) Encoder: The encoder in this work is a transformer encoder applied only to unmasked data; only 50% of the spectral channels are used by the encoder during SSL training. Our encoder embeds patches by a linear projection with added positional embeddings, and then processes the resulting set via a number (N) of encoder blocks.
There are four main parts in the encoder, as shown in Fig. 4: the Multi-Head Self-Attention layer (MSP), Multi-Layer Perceptrons (MLP), Layer Norm, and residual connections, which were introduced during the evolution of CNNs [68].
Multi-Head Self-Attention layer (MSP): The MSP is the core of the transformer. It consists of several self-attention blocks (h) that integrate multiple complicated interactions between different elements in the sequence (Fig. 6). The self-attention mechanism can perform non-local operations, capturing long-range dependencies/global information between selected patches in the image [69]. Here, we denote the input of the model as a sequence of n patches (p1, p2, ..., pn) by P ∈ R^(n×d), where d is the embedding dimension of each patch. The goal of self-attention is to capture the interaction amongst all n patches by encoding each patch in terms of the global contextual information. This is done by defining three learnable weight matrices to transform Queries (W_Q ∈ R^(d×d_q)), Keys (W_K ∈ R^(d×d_k)), and Values (W_V ∈ R^(d×d_v)), where d_q = d_k. The input P is first projected onto the Queries (Q), Keys (K), and Values (V) using these weight matrices:

Q = P W_Q, K = P W_K, V = P W_V

The output of the self-attention layer is:

A = softmax(Q K^T / √d_k) V

The self-attention computes the dot product of each query with all keys, which is then normalized using the softmax operator to obtain the attention scores. Each patch becomes the weighted sum of all patches in the image, where the attention scores give the weights.
Each self-attention block then has its own learnable weights (W_Qi, W_Ki, W_Vi, i ∈ h). The outputs of the h self-attention blocks (A_i) in multi-head attention are concatenated into a single matrix and subsequently projected with another weight matrix W_m. The operation is shown in Fig. 6 and can be formulated as:

MSP(P) = Concat(A_1, ..., A_h) W_m

Then, to build a deeper model, a residual connection is employed around each module, followed by Layer Normalization [70]. Layer Norm is the normalization method used in the NLP area instead of the Batch Norm common in vision tasks. It is applied in every block, as it does not introduce any new dependencies between the training images, which helps to improve training time and generalization performance. The operation can be written as:

S = LayerNorm(MSP(P) + P) (9)
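The multi-head operation above can be sketched in NumPy. This is an illustrative implementation of the standard scaled dot-product attention the section describes, not the authors' code; variable names and the equal head-splitting of the embedding dimension are our assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(P, W_q, W_k, W_v, W_m, h):
    """P: (n, d) patch embeddings.  Per head:
    A_i = softmax(Q_i K_i^T / sqrt(d_h)) V_i; heads are concatenated
    and projected by W_m, as in MSP(P) = Concat(A_1..A_h) W_m."""
    n, d = P.shape
    d_h = d // h                        # per-head dimension
    Q, K, V = P @ W_q, P @ W_k, P @ W_v
    heads = []
    for i in range(h):
        sl = slice(i * d_h, (i + 1) * d_h)
        scores = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(d_h))
        heads.append(scores @ V[:, sl])  # weighted sum of all patches
    return np.concatenate(heads, axis=1) @ W_m
```

Each row of `scores` sums to one, so every output patch is a convex combination of all input patches, which is exactly the non-local behaviour described above.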

Multi-Layer Perceptrons (MLP)
An MLP is a particular case of a feedforward neural network where every layer is a fully connected layer. An MLP is added at the end of each transformer block, containing two fully connected layers (Fc1 and Fc2) with the Gaussian Error Linear Unit (GELU) activation. It has been proven to be an essential part of the transformer that stops, or drastically slows down, rank collapse during model training [71]. Residual connections are applied after every block, as they allow the gradients to flow through the network directly without passing through non-linear activations. The output of the MLP can be written as:

Z = LayerNorm(Fc2(GELU(Fc1(S))) + S) (10)

b) Decoder: The input to the decoder is the full set of tokens consisting of (i) encoded visible patches and (ii) mask tokens. Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted. We add positional embeddings to all tokens in this full set; without them, mask tokens would have no information about their location in the image. The decoder consists of another series of transformer blocks and is only used during pre-training to perform the reconstruction task; only the encoder is used to produce representations for downstream tasks.
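The MLP sub-block with GELU, residual connection, and LayerNorm can be sketched as follows. This is our own minimal NumPy illustration using the common tanh approximation of GELU; the layer widths are arbitrary assumptions:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the Gaussian Error Linear Unit."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def layer_norm(x, eps=1e-5):
    """Normalize each token (last axis) to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp_block(S, W1, b1, W2, b2):
    """Two fully connected layers (Fc1, Fc2) with GELU, then a residual
    connection and LayerNorm, mirroring the pattern of Eq. (9)."""
    hidden = gelu(S @ W1 + b1)          # Fc1 + activation
    out = hidden @ W2 + b2              # Fc2 projects back to d
    return layer_norm(out + S)          # residual + LayerNorm
```

The residual path (`+ S`) lets gradients bypass the non-linearity, which is the property the text credits for stable training of deep stacks.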

IV. EXPERIMENTS EVALUATION
This section illustrates the capabilities of the presented algorithm in two typical application scenarios and datasets. There are two main experiments. In the first, we evaluate the performance of ObjSSL through a downstream multi-label classification task, with the most common medium-resolution multispectral Sentinel-2 data selected as the data source. In the second, we measure the performance of PixSSL on hyperspectral data through a soil parameter regression task.

A. ObjSSL performance evaluation
In this work, we evaluate the performance of the proposed ObjSSL through a downstream multi-label classification task. We have conducted three types of experiments: 1) Sensitivity analysis of the proposed approach. In this experiment, we perform the sensitivity analysis of the proposed approach under different settings and strategies. Firstly, we analyze the downstream task performance with and without the spectral-aware augmentation to evaluate the impact of the designed pretext task. Then we report the model performance with 5%, 25%, 50%, and 100% of the training data, with and without SSL, to demonstrate the effect of SSL on the supervised classification task. 2) Comparison with existing SSL methods. A comparative experiment reports the accuracy of the proposed algorithm against three recent contrastive learning SSL methods: MoCo-V2 [57], BYOL [50], and DINO [49]. 3) Comparison with existing backbones. A comparative experiment reports the accuracy of the proposed algorithm against three commonly used deep learning classification networks: VGG16 [72], ResNet50 [68], and the Vision Transformer [73].
1) Data Collection: The public BigEarthNet dataset [74] is selected for this experiment. 125 Sentinel-2 tiles acquired between June 2017 and May 2018 over 10 European countries (Austria, Belgium, Finland, Ireland, Kosovo, Lithuania, Luxembourg, Portugal, Serbia, and Switzerland) are initially selected. All tiles are atmospherically corrected by the Sentinel-2 Level-2A product generation and formatting tool (sen2cor). Each tile is then divided into non-overlapping image patches of size 120x120. Each image patch was annotated with the multiple land-cover classes (i.e., multi-labels) provided by the CORINE Land Cover database of the year 2018 (CLC 2018). The CLC Level-3 nomenclature is interpreted and rearranged into a new nomenclature of 19 classes (see Table II). Ten classes of the original CLC nomenclature are maintained in the new nomenclature, 22 classes are grouped into 9 new classes, and 11 classes are removed. There are a total of 519,284 patches. Since data may have been acquired over the same geographical area at different times, results may be unreliable if images of the same place appear in both the training and prediction sets. To avoid this issue, the training and validation sets do not share images acquired over the same geographical area; we use the data split in [75], [76]. 2) Evaluation Metrics: The performance evaluation of a multi-label classification method requires the analysis of several factors, not just the number of correct predictions, and therefore requires a more complex analysis than in the single-label case. In this work, various classification-based and ranking-based metrics with varying characteristics are selected to accurately evaluate the proposed approach. Under the category of classification-based metrics, results are provided in terms of five performance metrics: 1) Accuracy, 2) Precision, 3) Recall, 4) F1-Score
and 5) Hamming loss (HL). These metrics are calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)

where TP, FP, FN and TN indicate the counts of true positives, false positives, false negatives and true negatives, respectively. We use the macro average for the overall Precision, Recall and F1-Score. A macro average computes the metric independently for each class and then takes the average, which is preferable when the class data are imbalanced.
The F1-Score is the weighted harmonic mean of the correct prediction rates among the considered ground reference labels and the multi-label predictions.
The Hamming loss (HL) is the average Hamming distance between the ground reference labels and the predicted multi-labels. It is defined as follows:

HL = (1 / n_labels) · Σ_{j=1}^{n_labels} 1(ŷ_j ≠ y_j)

where ŷ_j is the predicted value for the j-th label of a given sample, y_j is the corresponding true value, and n_labels is the number of classes or labels. Under the category of ranking-based metrics, results are provided in terms of three performance evaluation metrics: 1) Ranking loss (RL); 2) Coverage (COV); and 3) Label ranking average precision (LRAP). All the ranking-based metrics are defined with respect to the rank of the j-th label in the class-probability output of a multi-label classification approach for the i-th image. Unlike the classification-based metrics, ranking-based metrics are calculated by giving equal importance to each sample of the test set.
Accordingly, the ranking loss (RL) is the rate of wrongly ordered label pairs (i.e., cases where the probability of a label that is irrelevant to the image is higher than that of a ground reference label):

RL = (1/N) Σ_i |{(j, k) : j ∈ Y_i, k ∉ Y_i, s_ik > s_ij}| / (|Y_i| · |Ȳ_i|)

The coverage (COV) calculates the average number of labels that must be included in the prediction list of a multi-label classifier so that all ground reference labels are predicted:

COV = (1/N) Σ_i max_{j ∈ Y_i} rank_ij

For each ground reference label, the label ranking average precision (LRAP) calculates the rate of higher-ranked ground reference labels:

LRAP = (1/N) Σ_i (1/|Y_i|) Σ_{j ∈ Y_i} |{k ∈ Y_i : rank_ik ≤ rank_ij}| / rank_ij

where Y_i is the set of ground reference labels of the i-th image, Ȳ_i its complement, s_ij the predicted probability of label j for image i, and rank_ij the rank of label j when the probabilities are sorted in descending order. It is worth noting that smaller values of the Hamming loss, ranking loss and coverage indicate better performance, whereas higher values of the accuracy, precision, recall, F1-Score and LRAP are associated with better performance.
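As a concrete reference, the macro-averaged classification metrics, the Hamming loss, and the ranking loss described in this section can be sketched in NumPy as below. This is an illustrative sketch (function names and the epsilon smoothing are assumptions), not the evaluation code used in the paper.

```python
import numpy as np

def macro_prf1(y_true, y_pred):
    """Macro-averaged Precision, Recall and F1 for binary multi-label arrays
    of shape (n_samples, n_labels): per-class metrics, then the plain mean."""
    eps = 1e-12
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision.mean(), recall.mean(), f1.mean()

def hamming_loss(y_true, y_pred):
    # Average Hamming distance between reference and predicted label sets.
    return float((y_true != y_pred).mean())

def ranking_loss(y_true, scores):
    """Rate of wrongly ordered label pairs: an irrelevant label scored
    strictly higher than a relevant one, averaged over samples."""
    per_sample = []
    for t, s in zip(y_true, scores):
        pos, neg = s[t == 1], s[t == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue  # undefined for all-positive or all-negative samples
        bad_pairs = (neg[:, None] > pos[None, :]).sum()
        per_sample.append(bad_pairs / (len(pos) * len(neg)))
    return float(np.mean(per_sample))
```

In practice, library implementations such as those in scikit-learn can be used for these metrics; the sketch above only makes the definitions explicit.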
3) Experimental Setup: The model training in this work has two main steps. The first is SSL training without labels; the second is supervised training, in which we can choose either the SSL-generated weights or the default (random) weights as the initial weights.
The SSL training uses the AdamW optimizer [77] and a batch size of 64, distributed over 3 GPUs (GeForce RTX 2080 Ti). The learning rate is linearly ramped up to 1e-3 during the first ten epochs. After this warmup, we decay the learning rate with a cosine schedule. The weight decay also follows a cosine schedule from 0.04 to 0.4. We train for 100 epochs.
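The warmup-then-cosine learning rate schedule and the cosine weight decay schedule described here can be sketched as follows. The function names and the per-epoch granularity are illustrative assumptions; the values (1e-3 peak, 10 warmup epochs, 100 total epochs, weight decay from 0.04 to 0.4) follow the text.

```python
import math

def lr_at_epoch(epoch, total_epochs=100, warmup_epochs=10,
                peak_lr=1e-3, final_lr=0.0):
    """Linear warmup to peak_lr over warmup_epochs, then cosine decay."""
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

def wd_at_epoch(epoch, total_epochs=100, start=0.04, end=0.4):
    """Cosine schedule that moves the weight decay from 0.04 up to 0.4."""
    progress = epoch / total_epochs
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * progress))
```

Note the weight decay schedule runs in the opposite direction to the learning rate: it starts low and rises, which increasingly regularizes the model late in training.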
For the supervised training, we first transfer the weights learned from SSL training to initialize the model. The AdamW optimizer is used for 100 epochs with a cosine-decay learning rate scheduler and 20 epochs of linear warmup. A batch size of 64, a lower initial learning rate of 1e-4, and a weight decay of 0.05 are used for model training.

B. PixSSL performance evaluation
In this work, we evaluate the performance of the proposed PixSSL through a downstream parameter regression task. The objective of the task is to estimate soil parameters, specifically potassium (K), phosphorus pentoxide (P2O5), magnesium (Mg), and pH, from hyperspectral images captured over agricultural areas in Poland.
1) Data Collection: In this work, the data from the AI4EO hyperspectral challenge is selected [78]. The dataset comprises 2,886 patches in total (2 m GSD), of which 1,732 patches are for training and 1,154 patches for evaluation. The patch size varies (depending on the agricultural parcel) and is on average around 60x60 pixels. Each patch contains 150 contiguous hyperspectral bands (462-942 nm, with a spectral resolution of 3.2 nm). Fig. 7 shows the data representation of band 60 and the spectral profile of one patch. 2) Experimental Setup: Two experiments are proposed to evaluate the performance of PixSSL. The first determines the best masking ratio; the second is a comparative experiment in which three existing methods are selected for comparison. The baseline is a machine learning pixel-based method: we treat each patch as a single pixel by averaging all the values of each waveband within the patch, and the resulting spectral profile of each patch is used as the input. CatBoost [79], one of the state-of-the-art machine learning regression models, is selected for regression. The root mean squared error (RMSE) and the R-squared (R2) are used to evaluate model performance. In the baseline model, the 1,732 spectral profiles extracted from the patches are used for model training. Then we perform PixSSL on all datasets: we extract around 400 spectral profiles (each from a 3x3 area) per patch (60x60 pixels), so 1,732 x 400 = 692,800 profiles are used for pre-training without labels. Finally, we run the downstream regression task to evaluate the representations with two methods: linear probing and fine-tuning (Fig. 8).
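The two profile-extraction steps described here (one averaged profile per patch for the CatBoost baseline, and roughly 400 profiles from 3x3 areas per patch for pre-training) can be sketched in NumPy as below. The function names and the non-overlapping block-averaging scheme are assumptions for illustration; the paper does not specify the exact sampling code.

```python
import numpy as np

def patch_mean_profile(patch):
    """Average a hyperspectral patch (H, W, bands) into one spectral profile,
    as in the pixel-based baseline that treats each patch as a single pixel."""
    return patch.reshape(-1, patch.shape[-1]).mean(axis=0)

def sample_local_profiles(patch, window=3):
    """Average every non-overlapping window x window area into one profile,
    yielding 400 profiles from a 60x60 patch when window=3."""
    h, w, b = patch.shape
    h, w = h - h % window, w - w % window  # crop to a multiple of the window
    blocks = patch[:h, :w].reshape(h // window, window, w // window, window, b)
    return blocks.mean(axis=(1, 3)).reshape(-1, b)
```

With a 60x60 patch and window=3, `sample_local_profiles` returns a (400, 150) array, matching the roughly 400 profiles per patch cited in the text.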
In linear probing (SSL LP), a decoder (linear layer) is stacked on top of the encoder and only the decoder is trained with access to the labels. Since the encoder has already been trained in the first stage, we freeze all of its parameters during the downstream task training.
In fine-tuning (SSL FT), a similar procedure is followed. In the first stage, the encoder is trained without access to the labels, and all its parameters are used as the initialization for the second stage, in which a decoder is stacked on top of the backbone and the whole model is trained with access to the labels. Notice that we use a smaller learning rate on the encoder to avoid large shifts in weight space. For the SSL pre-training, we use the AdamW optimizer and a batch size of 512, distributed over 3 GPUs (GeForce RTX 2080 Ti). The learning rate follows the linear scaling rule [80], lr = base_lr x batch_size / 256, and is linearly ramped up during a warmup period. After this warmup, we decay the learning rate with a cosine schedule. The weight decay also follows a cosine schedule from 0.04 to 0.4. For the supervised training on the downstream regression task, we first transfer the weights learned from SSL to initialize the model. The AdamW optimizer is used for 100 epochs with a cosine-decay learning rate scheduler and 20 epochs of linear warmup. This learning rate scheduler applies only to the decoder; a lower learning rate (1e-6) is used on the encoder.
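The two learning-rate choices in this setup, the batch-size scaling rule and the smaller fixed rate on the encoder, can be sketched as follows. The base batch size of 256 is the convention usually associated with the linear scaling rule, and the parameter-group layout mirrors how optimizers such as AdamW accept per-group learning rates; both details are assumptions here, not taken from the paper.

```python
def scaled_lr(base_lr, batch_size, base_batch=256):
    # Linear scaling rule: the learning rate grows in proportion to batch size.
    return base_lr * batch_size / base_batch

def fine_tune_param_groups(encoder_params, decoder_params,
                           decoder_lr, encoder_lr=1e-6):
    """Optimizer parameter groups for SSL fine-tuning: the scheduled rate
    drives the decoder, while the encoder keeps a much smaller fixed rate."""
    return [
        {"params": list(decoder_params), "lr": decoder_lr},
        {"params": list(encoder_params), "lr": encoder_lr},
    ]
```

Keeping the encoder at a tiny learning rate preserves the representation learned during pre-training while the randomly initialized decoder is trained at full speed.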

V. RESULTS
A. ObjSSL performance 1) Sensitivity Analysis of the Proposed Approach: In this section, we first evaluate the impact of spectral-aware data augmentation on SSL. We report the class-based performance of the proposed model with and without the spectral-aware data augmentation (Table III). The table shows that the model using spectral-aware SSL achieves the highest score for each class compared to the SSL model without the spectral-aware operation. The average Precision, Recall and F1 scores of the proposed SSL method are 78.66%, 66.52% and 71.10%, which are 5.64%, 4.40% and 5.05% higher than those of the model without the spectral-aware operation. This result demonstrates that spectral information in remote sensing data plays a key role in ground object classification. One of the motivations of SSL is to learn useful representations from unlabeled data and then fine-tune them with few labels for the supervised downstream task. Here, we evaluate the effect of SSL on the downstream task, especially when the amount of training data is limited. Fig. 9 shows the model performance with 5%, 25%, 50% and 100% of the training data. The results show that the accuracy of supervised classification on the validation dataset drops significantly when less than 50% of the data is used for training: the F1 score and LRAP are only 22.4% and 32.4% when using 5% of the training data, which indicates that the model is overfitting. When using SSL weights for fine-tuning, the model can achieve the best accuracy using only 5% of the training data.
2) Comparison among the existing SSL frameworks: In the second experiment, we compare the classification results of different SSL frameworks with our proposed method. We pre-train the model with four SSL frameworks (MoCo-V2 [57], BYOL [50], DINO [49] and the proposed SSL) and then fine-tune the representations on 50% of the training data. Table IV shows the model performance; the performance without SSL is included as a reference. The results show that SSL can help the model converge on limited data. Compared to the other three SSL frameworks, the proposed SSL method emphasizes the spectral and spatial features in remote sensing data and achieves higher performance.
3) Comparison among the existing networks: Table V shows the classification-based and ranking-based metrics obtained by the proposed method and three of the most popular deep learning networks: VGG16, ResNet50 and ViT. Since BigEarthNet provides 393,000 training images, the amount of data is sufficient for most visual tasks, and all the deep learning models achieve satisfactory accuracy on the supervised task. Due to the residual connections introduced by ResNet [68], ResNet obtains higher accuracy than VGG16, as demonstrated in most computer vision tasks [81]. ViT [73], as a newer computer vision architecture that utilizes a transformer instead of a CNN to extract features, achieves accuracy close to VGG16. Our method introduces a channel information learning module into ViT, and the results show higher performance than both ResNet and ViT. A minor improvement in accuracy is also obtained with the addition of SSL, in both classification-based and ranking-based metrics.

B. PixSSL performance
Fig. 11 shows the influence of the masking ratio. The optimal ratios are high; a ratio of 50% works well for self-supervised representation learning.
In this section, we evaluate the performance of PixSSL on a downstream regression task. Fig. 10 shows the R2 and RMSE accuracy of the baseline method and the proposed SSL method. With the traditional machine learning pixel-based method, the R2 of soil parameter prediction is around 0.85. With the SSL representation, the R2 of SSL LP increases to 0.93-0.95 on P, K, and Mg regression. There is no significant improvement in pH regression, since the pH values on the ground are close to each other. When we fine-tune the final layer of the encoder, the R2 of the proposed model improves to over 0.95. The result indicates that the representations learned by SSL provide a better prediction of soil properties than using the original spectral information only.

VI. DISCUSSION
In this work, we propose an SSL framework for feature extraction from remote sensing data at both pixel-based and object-based scales. By validating on downstream tasks, our results demonstrate that the new representation of the data learned by SSL can achieve better performance on downstream tasks than using the original data only. In general, the representations learned by SSL are abstract and cannot be interpreted directly. In the following section, we visualize the representations and discuss their potential value.

A. The representation of ObjSSL
In ObjSSL, a novel multi-view pretext task is proposed to generate representations from an unlabeled dataset. In our experiments, we demonstrate that the proposed unsupervised learning method exhibits three main advantages: 1) With the joint spatial-spectral-aware pretext task, the deep learning model obtains both spectral and spatial features from the remote sensing data. The classification performance of some spectrum-sensitive categories, such as Mixed forest, Coniferous forest, Natural grassland and sparsely vegetated areas, Wetlands and Arable land, has been significantly improved. 2) The representations generated by self-supervised learning improve the performance of downstream tasks. 3) After pre-training with self-supervised learning, the deep learning model converges faster and better in supervised training on a limited dataset. This shows that self-supervised learning generalizes well to the spectral-spatial features in the data.
In Fig. 12, we visualize the attention maps for the different heads of the last layer of the encoder after ObjSSL. Column a) is the original data displayed through the red, green, and blue channels. In column b), we adjust the brightness of the image for better display. Columns c) and d) visualize the different attention maps of the last layer of the encoder after ObjSSL. The attention maps attend to different semantic regions of an image, which demonstrates that the representations obtained by SSL reflect the semantic information of the data. We believe that this representation has the potential to be used in land cover/use tasks. In Fig. 13, we represent each BigEarthNet class by the average feature vector over its validation data. We run t-SNE for 5,000 iterations and present the resulting class embeddings in Fig. 13. The result shows that the representation learned by ObjSSL recovers structure between classes, and similar ground objects are grouped: the water-related classes, such as inland (17) and marine waters (18), are at the bottom. Broad-leaved forests (8), Coniferous forests (9), and Mixed forests (10) are grouped in the middle. Natural grassland and sparsely vegetated areas (11), Moors, heathland, and sclerophyllous vegetation (12), Transitional woodland/shrub (13), Beaches, dunes, sand (14), and Inland wetlands (15) are grouped in the top right. Arable land (2) and Permanent crops (3) are on the left.

B. The representation of PixSSL
In PixSSL, a reconstruction pretext task is proposed to generate representations from unlabeled data. Our experimental results demonstrate that the representations obtained by SSL can significantly improve the accuracy of pixel-based analysis tasks. In Fig. 14, we display how PixSSL reconstructs the masked spectrum; column a) shows the original spectral profiles.
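The band-masking step underlying this reconstruction pretext task can be sketched as below. The 50% default ratio follows the masking-ratio result reported earlier; the function name, the uniform-random sampling scheme, and the fixed seed are illustrative assumptions.

```python
import numpy as np

def mask_spectrum(profile, mask_ratio=0.5, seed=0):
    """Randomly hide a fraction of spectral bands; the pretext task is to
    reconstruct the hidden values from the visible ones."""
    rng = np.random.default_rng(seed)
    n_bands = profile.shape[0]
    n_mask = int(n_bands * mask_ratio)
    masked_idx = rng.choice(n_bands, size=n_mask, replace=False)
    visible = np.delete(profile, masked_idx)
    return visible, np.sort(masked_idx)

profile = np.linspace(0.1, 0.6, 150)  # a toy 150-band spectral profile
visible, masked_idx = mask_spectrum(profile)
print(visible.shape, masked_idx.shape)  # (75,) (75,)
```

During pre-training, the model sees only `visible` (plus positional information for `masked_idx`) and is penalized on how well it reconstructs the hidden band values.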

C. Challenges and future directions
In this work, we have demonstrated that SSL can enhance the performance of remote sensing applications with remarkable efficiency by greatly reducing the dependence of deep models on large amounts of annotated data. Nevertheless, as an emerging field within computer vision, it still faces the following hurdles. 1) Computing efficiency. SSL usually requires significant computational resources due to the large amount of pre-training data, complex and varied data augmentations, large training batch sizes, and more training epochs than existing supervised learning. Meanwhile, with the growing popularity of very large NLP models, such as BERT [82], ChatGPT [83], and LaMDA [84], SSL is also widely used to train such mega models, which poses a serious challenge to computational resources. At present, little work has been done to reduce the computational cost of SSL, but this is an important factor in practical use. Efficient data loading, model design, parallel computing and hardware acceleration therefore remain to be explored. 2) Prompt engineering, also known as contextual prompting, refers to methods for communicating with large deep learning models to guide their behavior towards desired outcomes without updating their weights [85], [86], [87]. Self-supervised learning often requires tremendous computational resources to train large models, which poses challenges for non-enterprise researchers. Prompt engineering is an empirical science and does not require large computational resources. The effectiveness of prompt engineering methods can vary considerably between models, and therefore requires extensive experience and experimentation. We believe this will also be a major area of research for SSL in the future.

VII. CONCLUSION
In this work, we have proposed a generic self-supervised learning framework for remote sensing data at both object and pixel levels. The proposed SSL method learns a target representation that covers both spatial and spectral information from massive unlabelled data. This representation has been shown to achieve superior performance on downstream remote sensing tasks compared with using the original data as input. More importantly, this approach can alleviate the problem of expensive labelling of remote sensing data in traditional supervised learning. In this paper, we have designed two experiments with real data. One is a land cover classification task based on Sentinel-2 multispectral datasets, for which we selected an object-based analysis approach; the results demonstrate that our proposed ObjSSL outperforms other traditional SSL methods that are not designed to extract both spectral and spatial features. The other is a ground soil parameter retrieval task on hyperspectral datasets, for which we selected a pixel-based analysis method to utilise the rich spectral information. The results demonstrate that the proposed PixSSL can learn improved spectral representations by recovering the spectral information from the masked data. We also visualize the learned representations of the proposed SSL, and the results show that our SSL can learn representations from both the spectral and spatial information of unlabelled datasets. We believe that this approach has the potential to be effective in a wider range of remote sensing applications, and we will explore its utility in more remote sensing applications in the future.

Fig. 11. Linear probing and fine-tuning regression results for predicting four soil parameters of the ground target. We report R2 and RMSE accuracy on the validation set for the proposed self-supervised method and the machine learning-based method.

Fig. 12. Visualization of the attention maps from the last layer of the encoder. a) and b) show the RGB version of the data and its brightness enhancement. c) and d) visualize the different attention maps of the last layer of the encoder after ObjSSL.

Fig. 13. t-SNE visualization of BigEarthNet classes as represented using ObjSSL. For each class, we obtain the embedding by taking the average feature over all images of that class in the validation set.

Fig. 14. Reconstructions of spectral information using PixSSL. The predictions differ from the original spectral information but are essentially close, which indicates that the method generalizes.

TABLE I. A REPRESENTATIVE COLLECTION OF PRETEXT TASKS IN THE EXISTING SSL METHODS.
Table II shows the number of images of each class in the training and validation sets.

TABLE II. NUMBER OF IMAGES OF EACH CLASS.

TABLE IV. RESULTS OBTAINED BY MOCO-V2, BYOL, DINO AND THE PROPOSED SSL PRE-TRAINING, FINE-TUNED ON 50% OF THE TRAINING DATA.

TABLE V. RESULTS OBTAINED BY VGG16, RESNET50, VIT AND THE PROPOSED METHOD WITH AND WITHOUT SSL.