Automatic Laboratory Martian Rock and Mineral Classification Using Highly-Discriminative Representation Derived from Spectral Signatures

Abstract: The optical properties of rocks and minerals provide a reliable way to measure their chemical and mineralogical composition because of their specific reflection behaviors, which is also the key insight behind most automatic identification and classification approaches. However, because of the diversity of rocks and minerals, inter-category spectral similarity poses a great challenge to automatic identification and classification. Therefore, this paper develops a recognition and classification approach for rocks and minerals using a highly discriminative representation derived from their raw spectral signatures. More specifically, a transformer-based classification approach integrated with category-aware contrastive learning is constructed and trained in an end-to-end manner. This forces instances of the same category to remain close together while pushing instances of dissimilar categories far apart in the high-dimensional feature space, producing a highly discriminative feature representation of the rocks and minerals. Qualitative and quantitative experiments are conducted on a laboratory sample dataset with 30 types of rocks and minerals shared by the National Mineral Rock and Fossil Specimens Resource Center. The spectral information of the laboratory rocks and minerals is captured using a multi-spectral sensor that is a duplicate of the payload onboard the Zhurong rover. Quantitative results demonstrate that the developed approach can effectively distinguish the 30 types of rocks and minerals, with a high overall accuracy of 96.92%. Moreover, the developed approach outperforms other existing methods, with an average difference of 4.75% in overall accuracy.
Furthermore, we visualized the derived highly discriminative features of the different types of rocks and minerals by projecting them onto a two-dimensional map, where instances of the same category tend to be mapped to nearby locations and instances of dissimilar categories to distant locations with high probability. Compared with the raw spectral feature space, the clusters are better formed in the derived highly discriminative feature space, which further confirms the promising representation capability.


Introduction
Rocks and minerals are among the major planetary surface features. Since minerals are stable under the known ranges of temperature and pressure, a rock, made of [...] specific representation learning method to map mineral signatures of interest across the CRISM image database. To deal with the problem of limited samples, Li et al. [18] adopted the idea of transfer learning to fine-tune the parameters of a pre-trained convolutional neural network.
To date, some institutes and organizations have also been devoted to establishing laboratory spectral databases of rocks and minerals (e.g., the Berlin Emissivity Database and the United States Geological Survey (USGS) spectral library version 7) to study their spectral behaviors [5,34–38]. In this way, the reflectance spectra of terrestrial rocks and minerals, covering the visible to near-infrared wavelength range, are explored and analyzed under laboratory conditions, which further supports research on planetary origin and geological evolution [4,27,39,40]. Similarly, this work captures the spectral information of laboratory rocks and minerals using a multi-spectral sensor, which duplicates the payload of its counterpart onboard the Zhurong rover [41], and focuses on deriving highly discriminative feature representations from the raw spectral information for the classification task.
Although numerous studies related to detection and classification have been presented [2,25,29,42–45], automatic rock and mineral classification remains greatly challenging for the following reasons. First, according to the literature, the majority of existing classification approaches use only a few types of rocks and minerals to verify their robustness and effectiveness, and the number of categories in these works is generally less than 10 [25,29,43]. An increasing number of categories can result in severe inter-category similarity, which poses a challenge for accurate classification. A dataset containing as many types of rocks and minerals as possible is therefore necessary for verifying the generalization of classification approaches. Second, although many feature extraction and selection strategies have been developed for describing the spectral and texture information of rocks and minerals [25,29,43], inter-category similarity might still weaken the descriptive capability and reduce the classification quality [46] as the number of categories increases. A highly discriminative feature representation, which enhances the intra-category similarity and enlarges the inter-category variation, therefore plays a significant role in the classification task.
To address the challenges mentioned above, we develop a rock and mineral classification approach using a highly discriminative representation derived from the original spectral signatures. Moreover, experiments are carried out on multi-spectral data captured from 30 types of rocks and minerals to evaluate the robustness and reliability of the developed approach both qualitatively and quantitatively. Our main contributions in this work are as follows.
(1) To efficiently achieve the classification task, we design a transformer-based classification approach for generating a highly discriminative feature representation of both rocks and minerals, in which the inter-category representation variation is enlarged and the intra-category representation similarity is strengthened;
(2) A category-aware contrastive learning scheme is integrated within the developed transformer-based classification approach. The parameters of the whole network are learned and trained in an end-to-end multi-task manner. Consequently, remarkable distinctions among different types of rocks and minerals emerge in their high-dimensional feature space;
(3) We demonstrate the reliability and robustness of the developed approach on a dataset containing rocks and minerals with complicated categories, which is significant for investigating the generalization ability of the developed approach.
The rest of this paper is organized as follows. Section 2 describes the experimental data and the developed classification approach in detail. Section 3 presents the experimental results and analysis for evaluating the developed classification approach both quantitatively and qualitatively. Section 4 discusses the sensitivity of parameters and analyzes the strengths and limitations of the developed classification approach. This paper concludes with a summary of future research considerations in Section 5.

Remote Sens. 2022, 14, 5070

Materials and Methods
In this section, we introduce the proposed method in detail. First, the details of data acquisition are described. Second, we give an overview of the proposed method. Subsequently, we present the remaining implementation details of the proposed method.

Data Acquisition
When the number of categories is large, the probability of classification error becomes relatively large. This is because each category is surrounded by a large number of neighboring categories [47], which results in severe inter-category spectral similarity. Consequently, a sample dataset of rocks and minerals with as many categories as possible is acquired from the National Mineral Rock and Fossil Specimens Resource Center. The National Mineral Rock and Fossil Specimens Resource Center is devoted to the digitization and sharing of China's rock mineral specimens, offering more than 170,000 state-owned specimens with scientific value, including minerals, rocks, and fossils. As a data provider, its goal is to promote the understanding of geo-resources and to provide resources for academia, education, and scientific popularization in the field of geoscience through collecting, organizing, and sharing rock mineral specimen resources.
For the experiments in this work, we reviewed the relevant literature on the distribution of rocks and minerals on the Martian surface and selected 30 types of rocks and minerals, e.g., hydrous minerals. All the rocks and minerals used are from the National Mineral Rock and Fossil Specimens Resource Center, including muscovite (Mu), hematite (He), calcite (Ca), galena (Gal), kaolinite (Ka), talcum (Ta), pyrite (Pyr), chalcopyrite (Cha), gray tiemannite (GT), stibnite (St), gabbro (Gab), quartz (Qu), chlorite (Chl), serpentine (Se), smaltite (Sm), tennantite (Te), gypsum (Gy), graphite (Gr), crystal (Cr), suanite (Sua), aragonite (Ar), basalt (Bas), fluorite (Fl), psilomelane (Ps), saponite (Sa), goethite (Go), barite (Bar), sulfur (Sul), biotite vermiculite (BV), and pyroxenite (Pr). Moreover, we enrich the number of categories of rocks and minerals in the sample dataset as much as possible, although a few rocks and minerals on the list are very uncommon and even unlikely to be found on Mars. By establishing a complicated sample dataset, the generalization of the proposed method can be better evaluated.
Generally, most rocks and minerals can be characterized and classified by their unique physical properties (e.g., hardness, luster, color, cleavage, fracture). For instance, barite can be reliably recognized based on its three directions of right-angle cleavage and its sugary appearance. Gypsum is characterized by its softness and its three directions of unequal cleavage. Biotite vermiculite is almost always much darker in color than muscovite. Talcum has a greasy feel. Psilomelane occurs as botryoidal and stalactitic masses with a smooth shining surface and submetallic luster. Quartz is characterized by its glassy luster, conchoidal fracture, and crystal form. The most obvious physical properties of serpentine are its green color, patterned appearance, and slippery feel. Stibnite typically forms coarse, irregular masses or radiating sprays of needlelike crystals, and its distinguishing characteristics are easy fusibility, a bladed habit, perfect cleavage in one direction, lead-gray color, and soft black streaks.
Figure 1 illustrates examples of different types of both rocks and minerals. Similar mineralogy or appearance among some types of rocks and minerals makes them very challenging to differentiate via direct visual interpretation. For example, both aragonite and calcite, which share the same formula, are tabular, prismatic, or needlelike, often with steep pyramidal or chisel-shaped ends, and can form columnar or spreading aggregates. Basalt consists mainly of plagioclase and pyroxene minerals, similar to gabbro, but the former is fine-grained and the latter is coarse-grained. The yellow color and metallic luster are the most obvious physical properties of chalcopyrite, which has a similar appearance to pyrite. Chlorite is usually green in color and has a foliated appearance, inelastic cleavage, and an oily to soapy feel, but its variable chemical composition makes it a difficult specimen to identify.
Galena, which is often associated with sphalerite, calcite, and fluorite, can be easily recognized by its metallic cleavage planes, lead-gray and silvery color, and black streaks. Hematite has an extremely variable appearance, from earthy to submetallic to metallic luster, but it always produces a reddish streak. Smaltite crystallizes in the cubic system with the same hemihedral symmetry as pyrite.
The reflection behavior of rocks and minerals primarily depends on their own mineralogical compositions and physical properties [4]. To capture the reflectance spectra of rocks and minerals in this work, an eight-band multi-spectral camera, which is a duplicate of the payload onboard the Zhurong rover, is used as the multi-spectral sensor [9]. It provides spectral data at the following wavelengths: 480 nm, 525 nm, 650 nm, 700 nm, 800 nm, 900 nm, 950 nm, and 1000 nm. The specifications of the multispectral camera are listed in Table 1.
In our implementation, we use only the multi-spectral camera to capture the eight-band multispectral images of all rock/mineral samples at approximately 12:12 pm, during a period of bright sunshine. The solar elevation angle and the shooting angle of the camera are approximately 60° and 37°, respectively. All rock/mineral samples are placed on the ground, and the vertical height of the camera above the ground is 1.8 m. As a result, the size of the captured image corresponding to each rock or mineral is H × W × 8, where H denotes the image height and W denotes the image width. Because Spectralon's optical properties make it ideal as a reference surface in remote sensing and spectroscopy, we used Spectralon as a near-Lambertian reference standard.
The images of both the standard reference Spectralon and the rocks/minerals are captured concurrently, which guarantees the consistency of observation conditions. After capturing the multi-spectral images of all samples, we compute the reflectance using Equation (1), chosen for its simple and intuitive computation [48]:

$$R = \frac{G_m}{G_{ref}} \times R_{ref} \quad (1)$$

where G_m denotes the average gray value of the query sample, G_ref denotes the average gray value of the standard reference Spectralon, R_ref denotes the reflectance coefficient of the standard reference Spectralon, which is generally known and measured in the laboratory, and R denotes the computed reflectance of the query sample.
Figure 2 shows some typical examples of the spectral curves of rocks and minerals, illustrating how the band positions and shapes change with the unique spectral signatures that arise from their physical characteristics. In the real world, small variations in the composition of rocks and minerals might occur, which often cause shifts in the position and shape of absorption bands in the spectrum. We group the spectral curves of some of the samples used into one subfigure because of their similar spectral characteristics. It can be observed that these shifts in the position and shape of absorption bands might result in similarity and even overlap of spectral curves, i.e., severe inter-category similarity, such as in the case of chalcopyrite and fluorite, or pyrite and saponite.
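The reflectance computation in Equation (1) can be illustrated with a minimal Python sketch, applied band by band. The gray values and the panel reflectance coefficient below are hypothetical placeholders chosen for illustration, not measurements from this work, and the function name is our own.

```python
def reflectance(g_sample, g_ref, r_ref):
    """Per-band reflectance R = (G_m / G_ref) * R_ref, following Equation (1)."""
    if not (len(g_sample) == len(g_ref) == len(r_ref)):
        raise ValueError("all inputs must provide one value per band")
    return [gm / gr * rr for gm, gr, rr in zip(g_sample, g_ref, r_ref)]


# Hypothetical average gray values of one sample in the eight bands
# (480, 525, 650, 700, 800, 900, 950, and 1000 nm).
g_sample = [52.0, 60.0, 75.0, 80.0, 88.0, 90.0, 91.0, 93.0]
# Assumed average gray values of the Spectralon panel in the same bands.
g_ref = [200.0] * 8
# Assumed reflectance coefficient of the panel (illustrative value only).
r_ref = [0.99] * 8

spectrum = reflectance(g_sample, g_ref, r_ref)
```

Because the sample and the reference panel are imaged concurrently, the ratio G_m / G_ref cancels the common illumination term, which is why a single known coefficient R_ref is enough to recover R.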
Table 1. Specifications of the multispectral camera.
Pixel size: 5.5 μm
Focal length: 50 mm
Imaging distance: [1.5 m, ∞)

The Developed Classification Approach of Rocks and Minerals
It is common knowledge that the reflection behaviors of rocks and minerals depend on their own physical characteristics, which is the key insight behind most existing rock and mineral classification approaches. Despite their unique spectral signatures, the inter-category spectral similarity among different types of rocks and minerals might still reduce the classification quality as the number of categories increases. Therefore, using a transformer encoder as the backbone, we integrate it with contrastive learning and develop a classification method for rocks and minerals based on the highly discriminative representation derived from the original spectral signatures.
Figure 3 shows the pipeline of the developed classification method, which consists of a transformer-based feature encoder module and a multi-task loss function for optimization. The former remarkably enhances the descriptive capability of the derived feature representation [31], while the latter causes the feature representations of similar categories to aggregate and the feature representations of dissimilar categories to separate from each other. In our developed approach, an image patch x ∈ R^(h×w×d) (where h is the height, w is the width, and d is the number of channels; in this work, d = 8) is first flattened into a vector, which serves as the input of the transformer-based feature encoder module.

Transformer-Based Feature Encoder Module
The vision transformer, as a deep learning model, shows a remarkable capability to capture rich dependency information between variables [49]. We investigate the adaptation of the vision transformer for basic visual feature extraction in order to generate the highly discriminative feature representation. Unlike convolutional neural networks (CNNs), which gradually expand the field of view by repeatedly convolving the information around the kernel layer by layer, the transformer-based method uses stacked multi-head attention modules, which give it a strong ability to model long-range dependencies. Recently, the vision transformer has adopted the attention mechanism and achieved promising results for image classification. Generally, an input image is first converted into a sequence of tokens by dividing it into patches of a certain size and then linearly projecting each patch into a token. Then, when the sequence of tokens is passed into a vision transformer model, attention weights are calculated between every pair of tokens simultaneously. That is to say, the attention weight α_ij of token z_j with respect to token z_i is learned, which indicates the information relevant to each token. After calculating α_ij for all pairs of i and j, we update each token z_i to z_i' using a weighted sum of all tokens followed by a nonlinear ReLU layer, as defined in Equations (2)-(4), where d denotes the dimension of the key vector, W_k denotes the key weight matrix, W_q denotes the query weight matrix, W_v denotes the value weight matrix, W_r and W_o are the transformation matrices, and b_1 and b_2 are the bias terms. One set of (W_k, W_q, W_v) matrices is called an attention head, and each layer in a vision transformer model has multiple attention heads, i.e., the attention mechanism is run several times in parallel. While each attention head attends to the tokens that are relevant to each token, the model can do this for different definitions of "relevance" with multiple attention heads.
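For reference, one form of Equations (2)-(4) that is consistent with the symbol definitions above is sketched below. This is an assumed form based on the standard scaled dot-product attention followed by a two-layer ReLU transformation; the intermediate symbol u_i is introduced here only for illustration and does not appear in the text.

```latex
\begin{align}
\alpha_{ij} &= \operatorname{softmax}_{j}\!\left(\frac{(W_q z_i)^{\top}(W_k z_j)}{\sqrt{d}}\right) \tag{2}\\
u_i &= \sum_{j} \alpha_{ij}\, W_v z_j \tag{3}\\
z_i' &= W_o \operatorname{ReLU}\!\left(W_r u_i + b_1\right) + b_2 \tag{4}
\end{align}
```

Under this reading, Equation (2) normalizes the query-key similarities over all tokens j, Equation (3) forms the weighted sum of value vectors, and Equation (4) applies the nonlinear transformation that yields the updated token.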
As shown in the subfigure on highly discriminative representation learning in Figure 3, the transformer-based feature encoder module is composed of a sequence of blocks, each of which contains a multi-head attention module. Following this, three successive fully connected layers are used to produce the highly discriminative feature representation for the final label predictions. Empirically, the sizes of the three successive fully connected layers are set to h × w × d, 360, and the number of classes, respectively. As a result, the output of the whole network, i.e., the output of the last fully connected layer, is taken as the feature representation of rocks and minerals derived from the developed method.
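The attention step of the encoder described in this subsection can be sketched in plain Python for a single head. This is an illustrative sketch under the assumption of standard scaled dot-product attention; it omits the ReLU transformation, the multi-head concatenation, residual connections, and layer normalization, and the helper names and toy matrices are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(tokens, w_q, w_k, w_v):
    """Return (attention weights, updated tokens) for one attention head."""
    queries = [matvec(w_q, z) for z in tokens]
    keys = [matvec(w_k, z) for z in tokens]
    values = [matvec(w_v, z) for z in tokens]
    d = len(keys[0])  # dimension of the key vectors
    weights, updated = [], []
    for q in queries:
        # Scaled dot product between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        alpha = softmax(scores)
        weights.append(alpha)
        # Weighted sum of the value vectors.
        updated.append([sum(a * v[c] for a, v in zip(alpha, values))
                        for c in range(len(values[0]))])
    return weights, updated


# Toy example: three 2-dimensional tokens and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, updated = attention_head(tokens, identity, identity, identity)
```

In the toy example, the first token attends equally to itself and to the third token, and less to the second token, because its dot product with the second token is zero.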

Multi-Task Loss Function for Optimization
The parameters of the whole network are learned, from a large amount of high-quality labeled training data, to map a set of feature representations of rocks and minerals to a set of categories. Generally speaking, the problem of learning is cast as a search or optimization problem, which navigates the space of possible parameter sets of the network in order to make good, or good enough, predictions. In the context of an optimization process, the function used to evaluate a candidate solution (i.e., a set of parameters) is referred to as the loss function, and the value calculated by the loss function is referred to simply as the "loss". Typically, a deep learning model is trained using the stochastic gradient descent optimization algorithm, and the parameters are updated using the backpropagation of error algorithm so that the next evaluation reduces the error, which means that we are searching for the candidate solution with the lowest loss. Here, we primarily discuss how to design an effective loss function that jointly optimizes both the cross-entropy loss and the category-aware contrastive loss. By measuring the difference between the predicted output and the ground truth, it guides the developed model towards convergence. As a consequence, a highly discriminative feature representation is generated by the learned model.
Due to its fast convergence, the cross-entropy loss L_cls is used to guarantee accurate classification of the categories, as defined in Equation (5):

$$L_{cls} = -\sum_{i=1}^{C} \hat{y}_i \log(y_i) \quad (5)$$

where C denotes the total number of categories, ŷ_i denotes the true label of the training sample for the i-th category, and y_i denotes the prediction of our developed model for the i-th category. In addition to the cross-entropy loss, class separation in the latent feature space is also a desirable characteristic for discriminating among different types of rocks and minerals. Therefore, we define the category-aware contrastive loss, as given in Equations (7) and (8). The right subfigure in Figure 3 illustrates the role of the category-aware contrastive loss. Intuitively, the goal of the category-aware contrastive loss is to force instances of the same category to remain close together while pushing instances of dissimilar categories far apart in the latent feature space.
More specifically, we consider the set of categories as K = {1, 2, ..., C} ⊂ N+, where N+ represents the set of positive integers. For each category i ∈ K, a mean feature representation p_i is maintained as the center of the corresponding cluster in the latent feature space. Together, these make up a set of category-specific mean feature representations, namely P = {p_1, ..., p_C}, which is used for computing the category-aware contrastive loss. Since the parameters of the whole network are learned in an end-to-end manner, the mean feature representation corresponding to each category gradually evolves during training. Inspired by the contrastive clustering method [50], a queue q_i with a fixed length is maintained for the i-th category to store its associated feature representations. As a result, a feature representation store F_store = {q_1, ..., q_C} stores the category-specific feature representations in the corresponding queues. After every I_p iterations, a set of new category-specific mean feature representations P_new is calculated. Following this, the existing set of category-specific mean feature representations, namely P, is updated by weighting P and P_new with a momentum constant η, as defined in Equation (6):

$$P = \eta P + (1 - \eta) P_{new} \quad (6)$$

In our implementation, η is set to 0.99, while I_p is set to 2000.
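The feature store and the momentum update of the category-specific means described above can be sketched as follows. This is a minimal illustration, assuming the update takes the convex-combination form P = ηP + (1 − η)P_new; the class name `PrototypeStore`, the queue length, and the choice to initialize a prototype with its first computed mean are our own assumptions rather than details given in the text.

```python
from collections import deque


class PrototypeStore:
    """Fixed-length feature queues per category plus momentum-updated means."""

    def __init__(self, num_classes, queue_len, eta=0.99):
        self.eta = eta
        # One bounded queue q_i per category; old features drop out automatically.
        self.queues = [deque(maxlen=queue_len) for _ in range(num_classes)]
        self.prototypes = [None] * num_classes

    def add(self, label, feature):
        """Store one feature vector in the queue of its category."""
        self.queues[label].append(list(feature))

    def update(self):
        """Recompute queue means and blend them into the stored prototypes.

        Intended to be called every I_p training iterations.
        """
        for c, queue in enumerate(self.queues):
            if not queue:
                continue  # no features seen yet for this category
            dim = len(queue[0])
            new = [sum(f[k] for f in queue) / len(queue) for k in range(dim)]
            old = self.prototypes[c]
            if old is None:
                # Assumption: the first mean initializes the prototype.
                self.prototypes[c] = new
            else:
                self.prototypes[c] = [self.eta * o + (1 - self.eta) * n
                                      for o, n in zip(old, new)]


store = PrototypeStore(num_classes=2, queue_len=4, eta=0.99)
store.add(0, [0.0, 0.0])
store.add(0, [2.0, 2.0])
store.update()              # prototype of category 0 becomes the mean [1.0, 1.0]
store.add(0, [4.0, 4.0])
store.update()              # blended: 0.99 * 1.0 + 0.01 * 2.0 per dimension
```

With η close to 1, each update moves a prototype only slightly towards the new queue mean, which keeps the cluster centers stable while the encoder is still changing.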
After obtaining the set of category-specific mean feature representations, let f_c denote a feature representation produced by an intermediate layer of the transformer-based feature encoder module for an instance of category c. To force instances of the same category to remain close together while pushing instances of dissimilar categories far apart in the high-dimensional feature space, the category-aware contrastive loss is defined as follows:

$$L_{cont}(f_c) = \sum_{i=1}^{C} \ell(f_c, p_i) \quad (7)$$

$$\ell(f_c, p_i) = \begin{cases} D(f_c, p_i), & i = c \\ \max\{0, \Delta - D(f_c, p_i)\}, & \text{otherwise} \end{cases} \quad (8)$$

where D(·) denotes any distance function (e.g., Euclidean, cosine), and ∆ defines how close a similar and a dissimilar instance can be. In our implementation, the value of ∆ is empirically set to 2.0. Finally, we simultaneously learn the shared parameters through the backpropagation algorithm [51] to provide better generalization of the developed classification approach [52]. The total loss used in this work is defined as a weighted sum of the category-aware contrastive loss and the cross-entropy loss, as in Equation (9):

$$Loss_{total} = \lambda \times L_{cont} + L_{cls} \quad (9)$$

where λ is experimentally set to 0.01, L_cont denotes the category-aware contrastive loss, and L_cls denotes the cross-entropy loss, which measures the cross entropy between the ground truth and the predicted output of our developed model. By minimizing the total loss, our developed classification approach is progressively trained until convergence.
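As a worked illustration of how the two loss terms combine, the following sketch evaluates the total loss for a single instance. It assumes a hinge-style contrastive term (the distance to the prototype of the instance's own category, plus max(0, ∆ − distance) for every other category), a Euclidean distance for D, and a one-hot ground truth for the cross-entropy term; the function names and toy numbers are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(feature, label, prototypes, margin=2.0, dist=euclidean):
    """Pull the feature to its own prototype, push it from others up to the margin."""
    loss = 0.0
    for i, prototype in enumerate(prototypes):
        d = dist(feature, prototype)
        loss += d if i == label else max(0.0, margin - d)
    return loss


def cross_entropy(probs, label, eps=1e-12):
    """Cross entropy for a one-hot ground truth: -log of the true-class probability."""
    return -math.log(max(probs[label], eps))


def total_loss(feature, probs, label, prototypes, lam=0.01, margin=2.0):
    """Weighted sum: lam * L_cont + L_cls."""
    l_cont = contrastive_loss(feature, label, prototypes, margin)
    return lam * l_cont + cross_entropy(probs, label)


# Toy example with three categories in a 2-dimensional feature space.
prototypes = [[0.0, 0.0], [3.0, 0.0], [1.0, 0.0]]
feature = [0.5, 0.0]          # an instance of category 0
probs = [0.7, 0.2, 0.1]       # hypothetical softmax output of the classifier

l_cont = contrastive_loss(feature, 0, prototypes)
loss = total_loss(feature, probs, 0, prototypes)
```

In this example the second prototype is already farther away than the margin and contributes nothing, while the third prototype lies within the margin and is penalized, which is the behavior that pushes dissimilar categories apart.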

Experimentation and Analysis
To evaluate the reliability and robustness of the developed classification method, in this section we perform both qualitative and quantitative analyses on the multi-spectral images of 30 types of rocks and minerals from the National Mineral Rock and Fossil Specimens Resource Center. First, a brief description of the evaluation criteria is given. Then, we describe the experimental setting in detail. Finally, we qualitatively and quantitatively analyze the category classification results, where the experiments are conducted using the optimal parameters obtained and discussed in Section 4.

Evaluation Criteria
To measure the classification quality of rocks and minerals, we evaluate the results using precision (Pr), recall (Re), F1-score, and overall accuracy (OA), as defined in Equations (10)-(13). Recall is a measure of completeness, precision is a measure of correctness, and the F1-score, also called the balanced F-score, is the harmonic mean of precision and recall. Overall accuracy is the sum of the true positives and true negatives divided by the total number of queried individuals. The equations are as follows:

$$Pr = \frac{TP}{TP + FP} \quad (10)$$

$$Re = \frac{TP}{TP + FN} \quad (11)$$

$$F_1 = \frac{2 \times Pr \times Re}{Pr + Re} \quad (12)$$

$$OA = \frac{TP + TN}{TP + TN + FP + FN} \quad (13)$$

where TP denotes the number of positive objects that are correctly determined as positive ones, TN denotes the number of negative objects that are correctly determined as negative ones, FN denotes the number of positive objects that are incorrectly determined as negative ones, and FP denotes the number of negative objects that are incorrectly determined as positive ones.
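The four criteria can be computed directly from the confusion counts of one category. The sketch below uses the standard definitions of precision, recall, F1-score, and accuracy; the counts are hypothetical and the zero-division guards are our own addition.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return (precision, recall, F1-score, accuracy) from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy


# Hypothetical counts for one category treated as the positive class.
precision, recall, f1, accuracy = binary_metrics(tp=90, tn=880, fp=10, fn=20)
```

For a 30-category problem, these per-category values would typically be obtained by treating each category in turn as the positive class and all others as negative.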

Implementation Details
The experiments were run on a machine with two NVIDIA GeForce RTX 3080Ti GPUs, 128 GB of memory, the Ubuntu 20.04 operating system, CUDA 10.0, and cuDNN 7.5. The developed classification approach for generating the highly discriminative features is implemented with the PyTorch library [53] and trained in an end-to-end manner. We use the Adam optimizer with betas = (0.9, 0.999) and weight decay = $1 \times 10^{-6}$. We train with a batch size of 64 and an initial learning rate of $10^{-4}$, which is adjusted throughout training by a cosine annealing scheduler. We use dropout with p = 0.1 for regularization. The number of heads in the multi-head attention modules is 8. We use an L = 3 layer transformer with a residual connection around each embedding update and layer normalization.
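For readers unfamiliar with cosine annealing, the following sketch evaluates the standard cosine annealing schedule without restarts, starting from the initial learning rate of 10^-4 stated above. The total number of steps and the minimum learning rate are not given in the text, so `total_steps=100` and `min_lr=0.0` are assumptions chosen only for illustration.

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Learning rate at `step` under cosine annealing (no restarts).

    `total_steps` and `min_lr` are assumed values; only `base_lr` is from the text.
    """
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at every step of an assumed 100-step run.
schedule = [cosine_annealing_lr(step, total_steps=100) for step in range(101)]
```

The rate starts at the initial value, reaches half of it at the midpoint, and decays smoothly to the minimum at the end of training.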
With regard to our laboratory sample dataset, we imaged all 30 rock and mineral samples. To compare the spectra of rocks and minerals and to reduce spectral noise, image patches with fixed-size coverage are extracted. For example, when the patch size is set to 9 × 9, each image patch is a 9 × 9 × 8 group of pixels, which serves as the input to the developed transformer-based classification approach described in Section 3.3. The effect of the image patch size on the classification performance is discussed in Section 4.1. In our implementation, we randomly selected a subset of the image patches for training and used the rest for validation and testing. The training, validation, and test split ratios are 60%, 10%, and 30%, respectively. Table 2 lists the number of image patches in the training, validation, and test sets.
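The random 60%/10%/30% split can be sketched as follows. This is an illustrative version only: the function name, the fixed seed, and the choice to split a single pooled list of patch identifiers (rather than splitting per category) are assumptions, since the text does not specify these details.

```python
import random

def split_patches(patch_ids, train=0.6, val=0.1, seed=0):
    """Shuffle patch identifiers and split them into train/validation/test lists.

    The test set receives whatever remains after the train and validation shares.
    """
    ids = list(patch_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility (assumption)
    n_train = round(train * len(ids))
    n_val = round(val * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# 1000 hypothetical patch identifiers.
train_ids, val_ids, test_ids = split_patches(range(1000))
```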

Rock Classification Results
The classification performance relies to some extent on the descriptive ability of the generated feature representation of rocks and minerals. In our implementation, the classification task is to distinguish 30 types of rock and mineral samples. Table 3 lists the classification results generated by the developed method in terms of precision, recall, and F1-score per class, as well as overall accuracy. Quantitatively, the developed method effectively distinguishes the 30 types of rock and mineral samples, with a high overall accuracy of 96.92% and an average F1-score of 97.11%. It is worth noting that some types of rocks and minerals, such as muscovite, talcum, and pyrite, even achieve an F1-score of 100%. The results summarized in Table 3 suggest that the feature representation derived from the developed method is highly discriminative for the classification task. That is to say, the generated feature representation can reliably and robustly describe the differences among different types of rocks and minerals.
Inevitably, there are some cases of incorrect classification. From the classification results of the trained optimal model, it is observed that sulfur is easily misclassified as crystal, which results in its low recall. Furthermore, the low precision and low recall for chlorite and saponite are primarily due to these two categories being confused with each other. In the visual comparison shown in Figure 4, there are largely overlapping regions between sulfur and crystal, as well as between chlorite and saponite, in the raw spectral signatures, which illustrates the high similarity between them. This is one of the most important factors responsible for the low precision or recall in the classification results.

T-SNE Visualization in the Discriminative Feature Space
In this work, the key insight behind the developed method is to clearly separate different types of rocks and minerals in the high-dimensional feature space. The T-distributed stochastic neighbor embedding (T-SNE) technique [54] is a dimensionality reduction technique whose goal is to project high-dimensional data into a two-dimensional map space. In this space, instances of the same category tend to be modeled by nearby locations and those of dissimilar categories by distant locations with high probability. That is to say, the feature representation is highly discriminative if the clusters belonging to the same categories are well formed. For a more detailed description of the T-SNE technique, please refer to Appendix A. To qualitatively analyze the highly discriminative feature representation derived from the transformer-based classification method, in this subsection we visualize the quality of the clusters formed in the high-dimensional feature space using T-SNE. Theoretically, the number of clusters in the two-dimensional visualization maps should correspond to the number of categories in the rock and mineral dataset.
In our implementation, the output of the developed method, i.e., the last fully-connected layer, is taken as the derived feature representation of the rocks and minerals and serves as the input to the scikit-learn T-SNE package [55] for visualization. For comparison, we project both the original spectral signatures and the feature representation derived from the developed method into the two-dimensional map space. Figure 5 compares the two projections. As shown in Figure 5a, there are severely overlapping regions among different types of rocks in the two-dimensional visualization map. This suggests that inter-category spectral similarity is present among different types of rocks and minerals, which would degrade the classification performance. In contrast, as shown in Figure 5b, the clusters are well formed from the derived feature representation for the different types of rocks and minerals, which suggests its highly discriminative representation capability.
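The projection step can be sketched with scikit-learn's `TSNE` class as follows. The synthetic features, the number of categories, the feature dimension, and the `perplexity` and `random_state` values are all assumptions for illustration; the settings actually used for Figure 5 are not stated in the text.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for learned features: 3 categories, 30 instances each,
# 16 dimensions (all hypothetical values, not the paper's data).
rng = np.random.default_rng(0)
n_classes, n_per_class, n_dims = 3, 30, 16
centers = rng.normal(scale=10.0, size=(n_classes, n_dims))
features = np.vstack(
    [center + rng.normal(size=(n_per_class, n_dims)) for center in centers]
)
labels = np.repeat(np.arange(n_classes), n_per_class)

# Project the high-dimensional features onto a two-dimensional map.
embedding = TSNE(n_components=2, perplexity=10.0, random_state=0).fit_transform(
    features
)
```

The rows of `embedding` can then be drawn as a scatter plot colored by `labels`; applying the same projection to the raw spectral signatures gives the comparison described above.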

Discussion
In this section, we examine the parameter sensitivity of the developed classification approach, including the size of image patches, the number of transformer layers, the number of transformer heads, and the integration of the category-aware contrastive loss. We then compare the proposed method with other methods and briefly discuss the strengths and limitations of the developed rock classification method.

Effect of the Size of Image Patches on Classification Results
To compare the spectra of rocks and minerals for automatic classification, in our implementation, spectra were extracted from image patches with fixed-size coverage. In this subsection, we test different image patch sizes, namely 5 × 5, 7 × 7, 9 × 9, and 11 × 11, to examine the effect of the patch size on the classification results. Table 4 lists the results, which show that the classification accuracy varies between 95.96% and 96.92%. As mentioned in Section 2.1, small changes in the composition of rocks and minerals cause shifts in the position and shape of absorption bands in the spectrum. As the patch size increases, the image patches alleviate the effect of such composition variations, which enhances the classification performance to some extent, although it also increases the computational burden. Taking both accuracy and computational cost into consideration, we set the image patch size to 9 × 9 as the optimal parameter in our implementation.
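The trade-off between patch size and input size can be illustrated with a short sketch that cuts a multi-spectral image cube into fixed-size patches. The 45 × 45 × 8 random cube, the function name, and the non-overlapping stride are assumptions for illustration; the text does not state how patches were sampled from the images.

```python
import numpy as np

def extract_patches(cube, patch_size, stride=None):
    """Cut an (H, W, bands) cube into patch_size x patch_size x bands patches.

    By default the stride equals the patch size, i.e., patches do not overlap
    (an assumption made for this sketch).
    """
    stride = stride or patch_size
    height, width, _ = cube.shape
    patches = [
        cube[r:r + patch_size, c:c + patch_size, :]
        for r in range(0, height - patch_size + 1, stride)
        for c in range(0, width - patch_size + 1, stride)
    ]
    return np.stack(patches)

# Hypothetical 8-band image cube filled with random reflectance values.
cube = np.random.default_rng(0).random((45, 45, 8))

# Patch array shape and number of values per patch for each tested size.
shapes = {k: extract_patches(cube, k).shape for k in (5, 7, 9, 11)}
values_per_patch = {k: k * k * cube.shape[2] for k in (5, 7, 9, 11)}
```

The number of values per patch grows quadratically with the patch size (from 200 at 5 × 5 to 968 at 11 × 11 for 8 bands), which is one way to see why larger patches bring a heavier computational burden.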

Effect of the Number of Transformer Layers on Classification Results
In our implementation, we extract the highly discriminative feature representation from image patches encoded by a series of transformer layers. To evaluate the effect of the number of transformer layers, we vary it from 1 to 7 in steps of 1. As with deeper convolutional neural networks, the developed method is capable of fitting the training data perfectly and also performs well on the test data as the number of transformer layers increases. Figure 6 shows the experimental results. When the number of transformer layers rises from 1 to 3, the overall accuracy varies by approximately 0.77%. However, increasing the number of transformer layers also makes model convergence very difficult due to vanishing gradients, which might degrade the classification performance when the number of transformer layers increases from 3 to 7.

Effect of the Number of Transformer Heads on Classification Results
In addition to the number of transformer layers, the developed transformer-based classification method aggregates the inter-channel and intra-channel information through the multi-head self-attention mechanism. Thus, its accuracy might depend on the number of transformer heads. To examine this effect, we set the number of heads to 1, 2, 4, and 8, since the number of multi-spectral bands must be divisible by the number of heads. In this way, the multi-spectral bands are split across the attention heads so that each head can process its share independently, while the correlations are extracted and aggregated for each attention head. From the experimental results shown in Figure 7, we can conclude that the classification accuracy is sensitive to the number of transformer heads, with a difference in overall accuracy of more than 1.01%. Using multiple heads enables the transformer-based feature encoder module to capture richer interpretations of the multi-spectral bands.
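The divisibility constraint can be checked with a few lines of code. The sketch below assumes that the 8 in the 9 × 9 × 8 patch is the number of spectral bands and that each pixel's band vector is what gets split across heads; the function names and the toy band values are hypothetical.

```python
def valid_head_counts(embed_dim):
    """All head counts that divide the embedding dimension evenly."""
    return [h for h in range(1, embed_dim + 1) if embed_dim % h == 0]

def split_across_heads(vector, num_heads):
    """Split one band vector into equal, contiguous chunks, one per head."""
    if len(vector) % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    head_dim = len(vector) // num_heads
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]

# Toy reflectance values for 8 bands (illustrative only).
bands = [0.11, 0.15, 0.22, 0.30, 0.34, 0.41, 0.47, 0.52]

heads = valid_head_counts(len(bands))   # candidate head counts for 8 bands
chunks = split_across_heads(bands, 4)   # each of 4 heads sees 2 bands
```

Under this assumption, the only admissible head counts for 8 bands are 1, 2, 4, and 8, which matches the settings tested above.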

Effect of Category-Aware Contrastive Loss on Classification Results
To generate the highly discriminative feature representation, we combine the cross-entropy loss with the category-aware contrastive loss and optimize the resulting multi-task loss. By introducing the category-aware contrastive loss, we attempt to force instances of the same category to remain close together while pushing instances of dissimilar categories far apart, as shown in Figure 3. Table 5 compares the results obtained without the contrastive loss and with the developed method in terms of F1-score per class and overall accuracy. The comparison shows an improvement of approximately 0.72%, which suggests that the generated feature representation is more discriminative than that obtained without the category-aware contrastive loss. Hence, we conclude that the category-aware contrastive loss helps enhance the descriptive capability of the generated feature representation.

Comparisons with Other Methods
To further evaluate the descriptive performance of the developed method, we also compare it with other methods that apply frequently used classifiers, such as the decision tree [56], random forest [57], and support vector machine (SVM) [58], to the raw spectral signatures of rock samples. Furthermore, we compare our results with another neural network, namely ConvNet [59]. Only one simple convolutional neural network is used because the inputs in this work are small image patches, which cannot be processed by more complex architectures with many pooling operations. Table 6 lists the classification results of the different methods. From the experimental results, we can observe that the overall accuracy and average F1-score of the developed method exceed those of the other methods, with average differences of 4.75% and 4.97%, respectively. Moreover, although the developed method achieves a lower F1-score for crystal than the other methods, it remarkably improves the classification performance, especially for identifying hematite, chalcopyrite, chlorite, serpentine, smaltite, saponite, etc., which further confirms the superior discriminative ability of the produced feature representation.

Table 6. Comparisons with other methods in F1-score per class and overall accuracy (%). Abbreviations: muscovite (Mu), hematite (He), calcite (Ca), galena (Gal), kaolinite (Ka), talcum (Ta), pyrite (Pyr), chalcopyrite (Cha), gray tiemannite (GT), stibnite (St), gabbro (Gab), quartz (Qu), chlorite (Chl), serpentine (Se), smaltite (Sm), tennantite (Te), gypsum (Gy), graphite (Gr), crystal (Cr), suanite (Sua), aragonite (Ar), basalt (Bas), fluorite (Fl), psilomelane (Ps), saponite (Sa), goethite (Go), barite (Bar), sulfur (Sul), biotite vermiculite (BV), and pyroxenite (Pr). The best performance is highlighted in bold.

Summary and Outlook
To address the challenges posed by inter-category spectral similarity, we present a category recognition approach for both rocks and minerals using the discriminative representation derived from their spectral signatures. The developed method combines a transformer-based recognition approach with category-aware contrastive learning and is trained in an end-to-end multi-task manner. The advantages of the proposed approach are as follows: (1) Unlike convolutional neural networks, the transformer has a strong ability to model long-range dependencies. (2) We define a category-aware contrastive loss that forces instances of the same category to remain close together while pushing instances of dissimilar categories far apart. Therefore, the highly discriminative feature representation derived from the developed approach helps enhance the descriptive capability and alleviate the inter-category spectral similarity in the classification task. Furthermore, we establish a rock sample database with 30 types of rocks and carry out the experimental analysis both quantitatively and qualitatively. The experimental results confirm the robustness and reliability of the developed method. The developed method effectively distinguishes the 30 types of rock samples, with a high overall accuracy of 96.92%. Additionally, the overall accuracy and average F1-score of the developed method exceed those of other common methods, with average differences of 4.75% and 4.97%, respectively. It is well known that the more rock categories there are, the more severe the inter-category spectral similarity becomes. Establishing rock databases with more types of rock samples will enable us to test the generalization of the developed method in future work. Additionally, in the developed method, the category-aware contrastive loss and the cross-entropy loss are aggregated with a weight that is set empirically.
In our future research, an adaptive weighting solution is required to improve the performance stability of the method.