Lithological Classification by Hyperspectral Images Based on a Two-Layer XGBoost Model, Combined with a Greedy Algorithm

Lin, Nan; Fu, Jiawei; Jiang, Ranzhe; Li, Genjun; Yang, Qian

doi:10.3390/rs15153764

¹ School of Geomatics and Prospecting Engineering, Jilin Jianzhu University, Changchun 130118, China

² Jilin Province Natural Resources Remote Sensing Information Technology Innovation Laboratory, Changchun 130118, China

³ Qinghai Geological Survey Institute, Xining 810012, China

⁴ Key Laboratory of Geological Processes and Mineral Resources of the Northern Qinghai-Tibet Plateau, Xining 810012, China

⁵ Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Changchun 130102, China

* Author to whom correspondence should be addressed.

Remote Sens. 2023, 15(15), 3764; https://doi.org/10.3390/rs15153764

Submission received: 13 June 2023 / Revised: 23 July 2023 / Accepted: 27 July 2023 / Published: 28 July 2023

(This article belongs to the Special Issue The Use of Hyperspectral Remote Sensing Data in Mineral Exploration)

Abstract

Lithology classification is important in mineral resource exploration, engineering geological exploration, and disaster monitoring. Traditional laboratory methods for the qualitative analysis of rocks are limited by sampling conditions and analytical techniques, resulting in high costs, low efficiency, and the inability to quickly obtain large-scale geological information. Hyperspectral remote sensing technology can classify and identify lithology using the spectral characteristics of rock, and is characterized by fast detection, large coverage area, and environmental friendliness, which provide the application potential for lithological mapping at a large regional scale. In this study, ZY1-02D hyperspectral images were used as data sources to construct a new two-layer extreme gradient boosting (XGBoost) lithology classification model based on the XGBoost decision tree and an improved greedy search algorithm. A total of 153 spectral bands of the preprocessed hyperspectral images were input into the first layer of the XGBoost model. Based on the tree traversal structural characteristics of the leaf nodes in the XGBoost model, three built-in XGBoost importance indexes were split and combined. The improved greedy search algorithm was used to extract the spectral band variables, which were imported into the second layer of the XGBoost model, and the bat algorithm was used to optimize the modeling parameters of XGBoost. The extraction model of rock classification information was constructed, and the classification map of regional surface rock types was drawn. Field verification was performed for the two-layer XGBoost rock classification model, and its accuracy and reliability were evaluated based on four indexes, namely, accuracy, precision, recall, and F1 score. The results showed that the two-layer XGBoost model had a good lithological classification effect, robustness, and adaptability to small sample datasets. 
Compared with traditional machine learning models, the two-layer XGBoost model showed superior performance. The accuracy, precision, recall, and F1 score on the verification set were 0.8343, 0.8406, 0.8350, and 0.8157, respectively. The variable extraction ability of the constructed two-layer XGBoost model was significantly improved. Compared with traditional feature selection methods, the GREED-GFC method, when applied to the two-layer XGBoost model, yielded more stable rock classification performance, higher lithology prediction accuracy, and the smallest number of extracted features. The lithological distribution information identified by the model was in good agreement with the lithology information verified in the field.

Keywords:

hyperspectral; lithology classification; XGBoost; feature selection

1. Introduction

Lithological classification information is important and basic for mineral resource exploration and geological disaster monitoring. Understanding the spatial distribution characteristics and variability of surface lithology is of great significance for regional geological mapping and mineral resource potential prediction in areas with high altitudes and poor transportation [1,2,3,4]. Traditional lithological mapping involves aerial photo examination and mapping based on interpretation keys, examination of the rocks in the field, rock sampling, and their examination in the laboratory. However, relying mainly on the subjective judgment of the staff to determine the mapping unit results in a lot of work to be completed in the field survey work, which requires a high cost in human, financial, and material resources [5,6,7]. In this sense, an efficient lithologic classification method is urgently needed to identify large-area lithological distribution information. Hyperspectral remote sensing lithological classification techniques infer the rock properties through the spectral characteristics of rocks, determine the lithology types and spatial distribution of the surface, and combine the classification results into a geological map and output expression. Compared with traditional lithology classification methods, remote sensing lithology classification has technical advantages, such as macro-scale capability, rapid processing, and non-pollution, and thus is often used as the main means of geological exploration for large-area remote sensing geological survey and thematic mapping [8,9,10,11]. It is necessary to develop high-precision geological and mineral exploration by investigating the application potential of hyperspectral remote sensing technologies in large-scale lithology classification and mineral resource exploration, which is conducive to realizing the deep integration of hyperspectral remote sensing technology, geological survey, and green exploration.

Aerospace hyperspectral remote sensing is a new type of earth observation technology, and was gradually formed based on its vigorous development in the early 1980s. It can be used to capture the spatial distribution of surface materials and record reflection spectra through hundreds of closely aligned wavelengths, to realize the effective identification of different lithological information [12,13,14,15,16]. Many scholars have researched lithology classification using this technology, focusing on spectral matching methods and classification model improvement [17,18]. With the enrichment of spectral categories and data of rock and ore in standard spectral libraries, such as the ASTER and USGS libraries, the accuracy of lithology classification based on spectral matching methods has been greatly promoted. Certain research achievements have been made in remote sensing lithology classification by spectral matching methods, such as Spectral Angle Mapping, the spectral goodness-of-fit technique, and the cross-correlation spectral matching method [19,20]. However, the lithological mapping method based on the similarity between spectral data is easily affected by the imaging conditions and the selection of the standard spectra of rocks. Finding the best matching threshold in practical application is often difficult, and the matching results are greatly affected by human factors. With the development and introduction of machine learning theory, many drawbacks of traditional spectral matching methods for lithology mapping have been overcome. The remote sensing lithological classification method based on a machine learning model can establish a mapping relationship between spectral data and lithology through mathematical statistics, relying on training samples with strong computational power and high classification accuracy. The nonlinear relationship and enrichment law among data can be deeply mined [16,21]. 
Nevertheless, as the number of bands in hyperspectral data has gradually increased, traditional machine learning algorithms have shown great limitations in learning from high-dimensional spectral image data. The data can contain a large amount of redundant information, so classification accuracy does not increase in proportion to the number of features; instead, it exhibits the Hughes phenomenon, in which accuracy first rises and then decreases as features are added. As a result, these machine learning models cannot extract the spectral features of ground objects well and cannot learn an accurate mapping between those spectral features and the categories, which reduces classification accuracy. Overfitting and increased computational complexity can also occur [22,23].

To compensate for the band redundancy in hyperspectral images, scholars have proposed and improved a large number of spectral feature extraction methods; for example, traditional dimensionality reduction methods, such as the LASSO algorithm [24,25], the successive projections algorithm [26,27], the wavelet transform [28,29], principal component analysis [30,31], and artificial intelligence feature extraction algorithms such as the particle swarm optimization algorithm [32,33], ant colony optimization algorithm [34,35], and artificial bee colony algorithm [36,37]. However, the above methods still face challenges such as local optimality, dimensional disaster, small data sample size, and high computational complexity, because they cannot effectively identify and retain features with low individual value but high combined value. Thus, it is difficult to accurately extract lithological spectral features in hyperspectral images containing many high-dimensional spectral channels. Compared with traditional feature selection methods, tree-based methods, such as random forest (RF) [38,39], gradient boosting decision tree (GBDT) [40,41], and Light Gradient Boosting Machine (Light GBM) [42,43], can extract the optimal feature system that suits the needs of the model in a more targeted way, and their decision-making process is more similar to human thinking, making the model easy to understand. Its dimensional reduction speed is fast, and it can process continuous and discrete data and solve the multi-output problem. The method does not need to regularize and normalize the input data in advance, exhibiting significant advantages for feature extraction of high-dimensional data. However, the criteria for feature selection in most tree models are determined through information gain and segmentation sample number. 
These criteria often have disadvantages, such as low stability and the masking of important enumeration features in the feature selection process due to single constraints. Therefore, based on the principle of the tree model, this study uses a greedy algorithm to comprehensively consider the segmentation sample number, average gain, and average coverage of the regression tree (CART) to improve the traditional tree model. The greedy algorithm is a simple and fast optimization design method. It is often combined with optimization measures according to the current situation to optimize the selection, and the local optimal solution is obtained. Therefore, it is not necessary to repeatedly trace the results, so it saves a lot of time in finding the optimal solution, and the calculation result of the local optimal solution is close to the solution result of the overall optimal solution. Using a greedy algorithm to apply the idea of ensemble learning in feature selection, the ideal feature selection strategy can be generated to address problems faced by traditional tree-based feature selection methods, such as overfitting functions, the masking of enumerated features, and high computational complexity. This enables the efficient extraction of rock spectral features.

Machine learning technology has been widely applied in lithology classification because of its excellent feature mining and data fitting ability. Examples include extreme learning machines [44,45], logistic regression [46,47], back-propagation neural networks [48,49], support vector machines [50,51,52], and multi-layer perceptrons [53,54]. The commonly used lithologic classification models can be divided into three categories: the space vector type, the neural network type, and the linear type. The space vector type, represented by the support vector machine algorithm, is sensitive to parameter adjustment and kernel function selection. It takes up more memory and computation time, which makes large-scale lithologic sample training difficult [55]. The linear type, represented by the logistic regression algorithm, is sensitive to multicollinearity among the independent variables and tends to underfit, resulting in low classification accuracy [56,57,58]. The neural network type is prone to falling into local optima, resulting in significant declines in model classification performance [59]. In recent years, ensemble learning has become one of the research hotspots in machine learning. It builds multiple learners of the same or different kinds and combines them with different combination methods to obtain a combined learner with a better learning effect. The extreme gradient boosting decision tree (XGBoost) is a high-precision distributed ensemble learning algorithm based on gradient boosting, proposed by Dr. Tianqi Chen of the University of Washington in 2016. The XGBoost model overcomes many challenges faced by traditional machine learning algorithms, such as high sensitivity to sample data, high computational complexity, and overfitting of the model. It can extract the optimal features of multiple variables and update and adjust the gradient of the base learner for classification and identification. 
As a meta-algorithm framework, XGBoost supports parallel computation of feature importance and of the base learners during gradient boosting, which greatly improves the training speed on large-scale samples. Regularization is introduced to control the complexity of the model, and a default direction can be specified for missing values, which significantly improves the generalization ability of the model while preventing overfitting [60,61,62].

This study aims to construct a lithological classification method using hyperspectral images based on ensemble learning to solve the problem of machine learning’s difficulty in quickly mining effective feature information in high-dimensional remote sensing data. Under the new model of integrating the feature selection and classification process, the accuracy and efficiency of surface lithological information extraction in high-altitude areas with difficult traffic conditions are improved. The Lenghu region of Haixi Mongol and the Tibetan Autonomous Prefecture of Qinghai Province were chosen as the study area, and an experimental study was carried out. Based on the preprocessing of the ZY1-02D hyperspectral remote sensing images, the feature selection process of the model was improved based on the principle of the XGBoost model and the greedy search algorithm. Firstly, the three indexes built in XGBoost were used for feature coarse screening to quickly reduce the feature dimensions. The greedy algorithm was integrated into the XGBoost algorithm, and the feature fine screening was carried out according to the feature importance ranking. The idea of integration was used to update the single selection condition to multiple conditions for parallel selection, so as to obtain the optimal feature subset for lithology classification and improve the classification effect. The accuracy and reliability of the model were examined by evaluation indexes and field verification. The main contributions of this paper are to propose a two-layer XGBoost algorithm rock information classification model and improve the importance selection method of the XGBoost model by employing a greedy search algorithm. In the case of ensuring the accuracy of lithological classification, it effectively solves the problem that it is difficult to extract variables with small correlation, low dimension, and small redundancy from high-dimensional hyperspectral data. 
It demonstrates the importance of the feature selection process in machine learning for hyperspectral lithology classification and other fields, and can help geoscientists use machine learning to provide references for mineral resource prediction, mining plans, and disaster prevention measures.

2. Materials and Methods

2.1. Study Area

The study area (38°31′–38°53′N, 93°33′–93°09′E) is located in the eastern section of the Altun Mountain and the Lenghu region at the junction of Qinghai and Gansu Provinces, in the area south of the Qilian Mountains and on the northern margin of the Qaidam Basin. Administratively, it falls under the jurisdiction of Dagaidan Town and Lenghu Town of Haixi Mongol and Tibetan Autonomous Prefecture, Qinghai Province. The rock in this area is well exposed, the rock types are complete, the plants are mostly herbaceous, and there is almost no vegetation above 4500 m. It is therefore an ideal area for the study of lithological information extraction technology. The Altun Mountain Range in the area is characterized by steep cuttings and overlapping peaks and valleys. The range, belonging to a highly eroded tectonic high-mountain area, runs in the NW direction with a higher terrain in the northwest. In contrast, the southwest area is a low plateau hill with complex lithological characteristics, mostly Gobi and desert. The study area is in the junctional zone of the northern margin of the Qaidam Basin. It has complex geological structural characteristics, since it was affected by multi-cycle and multi-stage orogenic movements and has a long history of tectonic evolution. The regional structural backbone comprises compressional structures, among which reverse faults are the main ones, followed by tensional or torsional faults. According to their distribution direction, these faults can be divided into NW trending, nearly EW trending, and NE trending groups. The three groups of faults show the characteristics of multi-phase activity. 
The exposed strata feature the following rock types: schist and gneiss formation of the Paleoproterozoic Dakendaban Group, which are medium and high-grade metamorphic rock series; volcanic rock and clastic rock groups of the Early Paleozoic Tanjianshan Group, belonging to stratigraphic sequences that comprise shallow metamorphic clastic rocks and metamorphic intermediate-basic volcanic rocks with bioclastic limestone and dolomite marble; nearshore terrigenous clastic–carbonate formation of the Early Carboniferous Huaitoutala Formation, wherein the sedimentary environment belongs to the coastal–shallow marine facies. The intrusive rocks in the area were mostly formed by late Variscan intermediate-acid magmatic activity, and the main lithologies are monzonitic granite and syenite granite, mainly in the form of a batholith. Caledonian basic–neutral magmatic activity was also responsible for the intrusive rocks, and the main lithologies include granodiorite and gabbro. With the intrusion of rock mass in each period, various types of dyke penetration were formed, mainly including granite, quartzite, gabbro, and syenitic dikes (Figure 1).

2.2. Data and Methodology

After a detailed analysis of the distribution of geological structures and stratigraphic–lithological features, based on the geological and mineral data of the study area, the ZY1-02D hyperspectral images of the study area were preprocessed with radiometric calibration and atmospheric correction. The accuracy of the preprocessing results was verified by field rock spectral measurements. The improved greedy search algorithm, combined with the three importance indicators, was used in the first layer of the two-layer XGBoost model to select the optimal spectral band variables from the original bands. The selected variables were fed into the second layer of the model, and the modeling parameters were optimized with the bat algorithm (BA). The extraction model of rock classification information was then constructed. Accuracy, precision, recall, F1 score, and prediction results were used to evaluate the classification accuracy and generalization ability. Based on the calculation results of the model and field verification, the regional surface lithological spatial distribution map was drawn. The specific process is shown in Figure 2.

2.2.1. Data Acquisition and Preprocessing

Imagery from the ZY1-02D satellite, provided by the China Centre for Resources Satellite Data and Application, was selected as the source of remote sensing data. This satellite is equipped with two cameras capable of capturing visible near-infrared and hyperspectral information (Table 1). In this study, hyperspectral images covering the study area in October 2022 were obtained and preprocessed. Because striping is obvious in the shortwave infrared (SWIR) band data of the satellite, the “global de-stripe” method was used to remove the stripes, and the overlapping bands and the bands with serious water vapor interference were eliminated. On the ENVI software platform, the image data were radiometrically calibrated, and the FLAASH module was used to perform atmospheric correction to obtain the true reflectance of the ground objects. The processed image is shown in Figure 3.

To verify the accuracy of the spectral correction of the ZY1-02D hyperspectral images, the images were overlaid with the existing geological maps of the study area, and 50 spectra of the corresponding endmembers of monzonitic granite and marble were randomly selected based on the known lithological distribution information. Correlation analysis was then performed against the spectral data of the two rocks measured in the field, and the Pearson correlation coefficient was used to evaluate the processing effect of the hyperspectral images (Figure 4). The results showed that the image spectral reflectance curves of the two selected rocks and the field-measured spectral curves were similar in morphological characteristics, the positions of the characteristic absorptions were basically consistent, and the spectral shapes were in good agreement. The Pearson correlation coefficient of most bands was above 0.8. The correlation coefficient curve of marble was lower overall than that of monzonitic granite, but its average value was still above 0.7. The ZY1-02D hyperspectral images therefore have high spectral correction accuracy and can meet the accuracy requirements of rock information extraction.
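The correlation check described above can be illustrated with a short sketch. It assumes that, for a given band, the image-derived and field-measured reflectance values are available as paired lists; the function name `pearson_r` and the sample values are hypothetical and are not taken from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of the centered values.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical reflectance values for one band (illustrative only).
image_reflectance = [0.21, 0.25, 0.30, 0.34, 0.40]
field_reflectance = [0.20, 0.26, 0.29, 0.36, 0.41]
r = pearson_r(image_reflectance, field_reflectance)
print(f"Pearson r = {r:.3f}")
```

Repeating this calculation band by band would give a correlation curve of the kind summarized in the text; how the study paired image and field spectra in detail is not stated here, so the pairing above is only an assumption.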

2.2.2. Classification Principle of the XGBoost Model

XGBoost is a high-precision distributed ensemble learning algorithm based on gradient boosting, proposed in 2016 as an optimization and improvement of GBDT. It overcomes many challenges faced by traditional boosting ensemble learning algorithms and provides an efficient and flexible approach to the design of ensemble learning algorithms. As a meta-algorithm framework, it supports parallel gradient boosting of the base learner, which greatly improves the speed of model training. In addition, XGBoost updates the base learner using not only the first derivative of the loss but also the second derivative [63,64]. The objective function of XGBoost model training can be expressed as:

\mathrm{Obj} = L + \sum_{k = 1}^{K} Ω (f_{k})

(1)

where $L$ is the error term and $\sum_{k = 1}^{K} Ω (f_{k})$ is the complexity function term, summed over the $K$ trees.

L = \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}

(2)

{\hat{y}}_{i} = \sum_{k = 1}^{K} f_{k} (x_{i}), \quad f_{k} \in F

(3)

where $x_{i}$ is the training data; $y_{i}$ is the label corresponding to $x_{i}$; $F$ is the space of candidate trees; $K$ is the number of trees; and ${\hat{y}}_{i}$ is the predicted value for the training data $x_{i}$.

The error term $L$ is the sum of the squared errors between the predicted values and the corresponding actual values. The result of each iteration corrects the residual of the previous iteration. Assuming that the prediction result obtained in step $s$ of the model is ${\hat{y}}_{i}^{(s)}$, the prediction result of each step can be expressed as:

{\hat{y}}_{i}^{(s)} = \sum_{k = 1}^{s} f_{k} (x_{i}) = {\hat{y}}_{i}^{(s - 1)} + f_{s} (x_{i})

(4)

The prediction result ${\hat{y}}_{i}^{(s)}$ obtained in step $s$ is the sum of the predicted value $f_{s} (x_{i})$ in that step and the estimated value from step $s - 1$. Substituting this result into the error term gives the objective function:

\mathrm{Obj}^{(s)} = \sum_{i = 1}^{n} {(y_{i} - ({\hat{y}}_{i}^{(s - 1)} + f_{s} (x_{i})))}^{2} + \sum_{k = 1}^{s} Ω (f_{k})

(5)

For the training in step $s$, the results of step $s - 1$ and earlier steps are known and can be treated as constants in the further derivation. The complexity function term $Ω (f_{s})$ is mainly affected by the number $T$ of leaf nodes in the tree and the corresponding weights $w_{j}$ of the leaf nodes.

Ω (f_{s}) = γ T + \frac{1}{2} λ \sum_{j = 1}^{T} w_{j}^{2}

(6)

where $γ$ and $λ$ are hyperparameters that control the number and weight of leaf nodes, respectively.

If multiple data points are located on the same leaf node, their corresponding weight values are the same, so the sum of all data points in the objective function can be converted to the sum of the leaf nodes. In addition, after the second-order Taylor expansion of the error term and substitution of the complexity function into the objective function, the objective function can be expressed as:

\mathrm{Obj}^{(s)} = \sum_{i = 1}^{n} [g_{i} w_{q (x_{i})} + \frac{1}{2} h_{i} w_{q (x_{i})}^{2}] + γ T + \frac{1}{2} λ \sum_{j = 1}^{T} w_{j}^{2}

(7)

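where $g_{i}$ and $h_{i}$ are the first and second derivatives of the error term with respect to the prediction of step $s - 1$, and $w_{q (x_{i})}$ is the weight of the leaf node $q (x_{i})$ to which sample $x_{i}$ is assigned.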
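To make Equation (7) concrete, the following sketch evaluates it for a single leaf under the squared-error loss of Equation (2). It relies on two standard results that are not written out in the text: for this loss the first and second derivatives with respect to the previous prediction are $2({\hat{y}}_{i}^{(s - 1)} - y_{i})$ and $2$, and the leaf weight that minimizes the per-leaf quadratic is $-G / (H + λ)$, where $G$ and $H$ are the sums of those derivatives over the samples in the leaf. All function names and sample values are hypothetical.

```python
def grad_hess_squared_error(y_true, y_prev):
    """First and second derivatives of (y - y_hat)^2 with respect to y_hat,
    evaluated at the previous-step prediction."""
    grads = [2.0 * (p - y) for y, p in zip(y_true, y_prev)]
    hess = [2.0 for _ in y_true]
    return grads, hess


def optimal_leaf_weight(grads, hess, lam):
    """Weight w* = -G / (H + lambda) that minimizes the per-leaf part of Eq. (7)."""
    return -sum(grads) / (sum(hess) + lam)


def leaf_objective(grads, hess, lam, w):
    """Contribution of one leaf to Eq. (7), excluding the gamma * T term."""
    return sum(grads) * w + 0.5 * (sum(hess) + lam) * w * w


# Three hypothetical samples that fall on the same leaf.
y_true = [1.0, 1.2, 0.8]
y_prev = [0.5, 0.5, 0.5]          # prediction after step s - 1
g, h = grad_hess_squared_error(y_true, y_prev)
w_star = optimal_leaf_weight(g, h, lam=1.0)
print(w_star, leaf_objective(g, h, 1.0, w_star))
```

Because all samples on a leaf share one weight, the sum over samples in Equation (7) collapses to the quadratic in `leaf_objective`, which is why a closed-form weight exists.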

2.2.3. XGBoost Feature Selection Principle

The XGBoost algorithm is based on the tree model and uses tree-by-tree learning. Each tree fits the deviation of previous trees, and it uses a feature parallel method to calculate the features to be split, which can carry out feature extraction of high-dimensional data. When the algorithm constructs a new feature set, it first continuously adds trees and performs feature splitting to continuously grow each tree. Each added tree is equivalent to learning a new function to fit the residual of the last prediction. In the process of splitting, the leaf nodes of each tree can be regarded as combinations of the features selected by the tree when searching for the optimal segmentation point. The tree generated by each iteration is based on the combined features learned by the previous tree, in order to learn new feature combinations and to continuously fit the true value. The leaf nodes of the final tree are the multi-dimensional combination of the original features. To minimize the cost of the segmentation tree, XGBoost considers the gain of each feature as the segmentation point. According to the weight of all leaf nodes, the formula is:

\mathrm{Gain} = \sum_{\mathrm{left}} w + \sum_{\mathrm{right}} w - \sum_{\mathrm{nosplit}} w

(8)

where $w$ is the weight of each leaf node.
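Equation (8) compares the children of a split with the unsplit node. As an illustration, the sketch below uses the gradient-sum form of the split gain that is common in the XGBoost literature, $\frac{1}{2} [G_{L}^{2} / (H_{L} + λ) + G_{R}^{2} / (H_{R} + λ) - (G_{L} + G_{R})^{2} / (H_{L} + H_{R} + λ)] - γ$. This notation is an assumption brought in for the example and differs from the weight-based notation of Equation (8); the function names and data are hypothetical.

```python
def structure_score(grad_sum, hess_sum, lam):
    """Score G^2 / (H + lambda) of a node holding the given derivative sums."""
    return grad_sum ** 2 / (hess_sum + lam)


def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left and right children."""
    return 0.5 * (
        structure_score(g_left, h_left, lam)
        + structure_score(g_right, h_right, lam)
        - structure_score(g_left + g_right, h_left + h_right, lam)
    ) - gamma


def best_threshold(values, grads, hess, lam=1.0, gamma=0.0):
    """Scan the sorted values of one feature and return (best_gain, threshold)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    g_total, h_total = sum(grads), sum(hess)
    g_left = h_left = 0.0
    best = (float("-inf"), None)
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += grads[i]
        h_left += hess[i]
        if values[i] == values[nxt]:
            continue  # no threshold can separate equal values
        gain = split_gain(g_left, h_left, g_total - g_left,
                          h_total - h_left, lam, gamma)
        if gain > best[0]:
            best = (gain, 0.5 * (values[i] + values[nxt]))
    return best


# One hypothetical band with four samples.
band_values = [0.1, 0.2, 0.8, 0.9]
grads = [-1.0, -1.0, 1.0, 1.0]
hess = [2.0, 2.0, 2.0, 2.0]
gain, threshold = best_threshold(band_values, grads, hess, lam=1.0)
print(gain, threshold)
```

The scan evaluates every candidate cut point of one feature; running it for each band and keeping the largest gain is the per-node search that the paragraph above describes.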

According to the leaf-splitting characteristics of the CART tree algorithm, XGBoost provides three feature importance metrics for feature extraction: feature segmentation number (FS), feature average coverage (AC), and feature average gain (AG). They are computed as follows:

\mathrm{FScore} = | X |

(9)

\mathrm{AverageGain} = \frac{\sum \mathrm{Gain}_{x}}{\mathrm{FScore}}

(10)

\mathrm{AverageCover} = \frac{\sum \mathrm{Cover}_{x}}{\mathrm{FScore}}

(11)

where $X$ is the set of features classified to leaf nodes, $\mathrm{Gain}$ is the node gain value of each leaf node in $X$ when it is segmented, and $\mathrm{Cover}$ is the number of samples falling on each node in $X$.
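Equations (9)–(11) can be traced on a small example. The sketch below assumes that the splits of a trained model are available as hypothetical `(feature, gain, cover)` records; the record format, the band names, and the function name are illustrative only. In the xgboost Python package, comparable quantities are returned by `Booster.get_score` with `importance_type` set to `"weight"`, `"gain"`, or `"cover"`.

```python
from collections import defaultdict


def importance_metrics(splits):
    """Compute FS, AG and AC per feature from (feature, gain, cover) records."""
    count = defaultdict(int)
    gain_sum = defaultdict(float)
    cover_sum = defaultdict(float)
    for feature, gain, cover in splits:
        count[feature] += 1            # Eq. (9): number of splits on the feature
        gain_sum[feature] += gain
        cover_sum[feature] += cover
    return {
        f: {
            "FS": count[f],
            "AG": gain_sum[f] / count[f],    # Eq. (10): average gain
            "AC": cover_sum[f] / count[f],   # Eq. (11): average cover
        }
        for f in count
    }


# Hypothetical split records from a small model.
splits = [
    ("band_12", 4.0, 50.0),
    ("band_12", 2.0, 30.0),
    ("band_87", 9.0, 80.0),
]
metrics = importance_metrics(splits)
print(metrics)
```

In this toy case `band_12` ranks first by FS while `band_87` ranks first by AG and AC, which shows how the three indexes can order the same features differently.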

2.2.4. Principle of the Greedy Algorithm

The greedy algorithm is a typical heuristic algorithm that solves a problem top-down: it divides the problem into stages and approaches the overall optimal solution through a sequence of locally optimal choices. It makes successive greedy choices in an iterative way, and each greedy choice reduces the problem to a smaller subproblem. In the process of feature selection, the greedy search strategy helps to keep the results stable and close to optimal. When solving a problem, the overall goal is analyzed first and then processed in layers to refine and decompose it into branch goals. Finally, the greedy strategy for achieving the goal is set, and the problem is solved iteratively around that strategy. The algorithm is easy to implement on a computer, and its time cost is low. The optimal substructure is shown in Figure 5.

2.2.5. BA Parameter Optimization Algorithm Principle

The bat algorithm (BA) is a heuristic search algorithm that simulates bats using sonar to detect prey and avoid obstacles. The optimal search process is modeled on bats flying to find prey: the fitness value of the problem being solved is used to select the locations of the bats during computation, and better feasible solutions iteratively replace worse ones in a survival-of-the-fittest manner. The search stops when the set conditions are met, and the optimal solution is output. During the search process, bats automatically adjust the wavelength of their sound pulses according to their distance from the prey. After the global search, the flight speed and spatial position of each bat are updated, and the fitness value of the objective function is calculated. The speed and spatial position update formulas are:

f_{i} = f_{\min} + (f_{\max} - f_{\min}) β

(12)

v_{i}^{t} = v_{i}^{t - 1} + (x_{i}^{t} - x_{*}) f_{i}

(13)

x_{i}^{t} = x_{i}^{t - 1} + v_{i}^{t}

(14)

where $v_{i}^{t}$ and $v_{i}^{t - 1}$ denote the flight speed of bat individual $i$ at times $t$ and $t - 1$; $x_{i}^{t}$ and $x_{i}^{t - 1}$ denote the position of bat individual $i$ at times $t$ and $t - 1$; and $x_{*}$ represents the global optimal location. The frequency $f_{i}$ is the pulse frequency of bat individual $i$ during the search, $β$ is a random number between 0 and 1, and $f_{\min}$ and $f_{\max}$ are the minimum and maximum values of the pulse frequency range. The sound intensity and frequency are updated according to the pulse loudness attenuation coefficient and the pulse frequency increase coefficient.
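The update rules in Equations (12)–(14) can be sketched as a small search loop. This is a simplified illustration, not the configuration used in the study: the loudness and pulse-rate updates are omitted, positions are clipped to the search bounds, and a move is kept only if it does not worsen the bat's fitness. The function name, parameter defaults, and the toy objective are assumptions; when tuning XGBoost, the objective would instead be something like a validation error as a function of the hyperparameters.

```python
import random


def bat_search(objective, dim, bounds, n_bats=15, n_iter=100,
               f_min=0.0, f_max=2.0, seed=0):
    """Minimize `objective` with a simplified bat algorithm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    fit = [objective(p) for p in pos]
    best_idx = min(range(n_bats), key=lambda i: fit[i])
    best_pos, best_fit = list(pos[best_idx]), fit[best_idx]
    initial_best = best_fit
    for _ in range(n_iter):
        for i in range(n_bats):
            beta = rng.random()
            freq = f_min + (f_max - f_min) * beta              # Eq. (12)
            vel[i] = [v + (x - b) * freq                       # Eq. (13)
                      for v, x, b in zip(vel[i], pos[i], best_pos)]
            candidate = [min(hi, max(lo, x + v))               # Eq. (14), clipped
                         for x, v in zip(pos[i], vel[i])]
            cand_fit = objective(candidate)
            if cand_fit <= fit[i]:      # keep only non-worsening moves
                pos[i], fit[i] = candidate, cand_fit
            if cand_fit < best_fit:     # track the global best location x_*
                best_pos, best_fit = list(candidate), cand_fit
    return best_pos, best_fit, initial_best


def sphere(p):
    """Toy objective with its minimum at the origin."""
    return sum(x * x for x in p)


best_pos, best_fit, initial_best = bat_search(sphere, dim=2, bounds=(-5.0, 5.0))
print(best_pos, best_fit)
```

Because the global best is replaced only by a strictly better candidate, the returned fitness can never be worse than the best fitness of the initial population.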

2.2.6. Principle of Improved GREED-GFC Method

When many feature variables are available, efficient feature selection can significantly improve the accuracy of model prediction. The XGBoost algorithm used in this study computes an importance score for each feature once the CART trees have been built; a high score indicates that the feature was used frequently in constructing the model’s decision trees. The three built-in XGBoost feature importance indicators have different strengths and limitations. The FS index tends to give high scores to numerical features: a variable with many distinct values offers more candidate split points, so it is split on more often, which can mask important enumeration features. The AC index is more balanced for enumeration features and is neither prone to overfitting the objective function nor affected by its dimensionality; however, it does not reach its best performance without the participation of downstream business parties. The AG index uses the concept of entropy increase and quickly finds the most direct features, but the large gap between the head and tail values of its feature ranking tends to complicate subsequent optimization. To overcome the low stability and suboptimal feature subsets that result from selecting features with a single index, a new greedy search method (GREED-GFC) was constructed for feature subset search, based on the greedy search principle and the three XGBoost importance measures. The three importance rankings were split and recombined to generate a combined ranking scheme, and the greedy algorithm was embedded in the feature selection process so that the best available choice was made at every step, moving from local toward global optimization.
The feature subset with the highest classification accuracy was taken as the feature selection result, which supports the stability and optimality of the result. First, the importance of each feature was ranked using the three indexes, namely FS, AC, and AG. Second, the features at each position of these rankings were substituted into the model in turn, and prediction accuracy was used as the evaluation index to find the best feature at each step. The implementation used in this study is shown in Figure 6.
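The two steps above can be sketched in Python as follows. This is one possible reading of the procedure, not the implementation shown in Figure 6: the function name `greedy_rank_search`, the `score` callback (which in practice would train a classifier on the candidate bands and return its accuracy), and the band names are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def greedy_rank_search(
    rankings: Dict[str, Sequence[str]],
    score: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Greedy forward search over several importance rankings.

    rankings: index name (e.g. "FS", "AC", "AG") -> features, most important first.
    score:    hypothetical callback returning the accuracy obtained with a subset.
    """
    depth = max(len(ranked) for ranked in rankings.values())
    selected: List[str] = []
    best_subset: List[str] = []
    best_score = float("-inf")

    for position in range(depth):
        # Candidates: the feature that each index places at this rank position.
        candidates: List[str] = []
        for ranked in rankings.values():
            if position < len(ranked):
                feature = ranked[position]
                if feature not in selected and feature not in candidates:
                    candidates.append(feature)
        if not candidates:
            continue
        # Greedy step: keep the candidate whose addition scores highest.
        trial = {c: score(selected + [c]) for c in candidates}
        winner = max(candidates, key=lambda c: trial[c])
        selected.append(winner)
        # Remember the best-scoring subset seen so far.
        if trial[winner] > best_score:
            best_score = trial[winner]
            best_subset = list(selected)

    return best_subset, best_score


# Toy usage with made-up band names and a made-up scoring rule.
rankings = {
    "FS": ["b3", "b1", "b7"],
    "AC": ["b1", "b3", "b9"],
    "AG": ["b7", "b9", "b1"],
}
useful = {"b1", "b7"}


def toy_score(subset: List[str]) -> float:
    return sum(f in useful for f in subset) - 0.1 * len(subset)


subset, accuracy = greedy_rank_search(rankings, toy_score)
print(subset, round(accuracy, 2))
```

Passing the scoring step in as a callback keeps the search independent of the classifier, so the same loop could be driven by any model that reports an accuracy for a band subset.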

2.2.7. Evaluation Accuracy Index

In this study, sensitivity and specificity, two basic indexes used to evaluate the diagnostic accuracy of medical test methods, were borrowed to evaluate the extraction of lithology information. Sensitivity, or the true positive rate (TPR), is the proportion of people who are actually sick that are correctly identified as patients; specificity, which equals one minus the false positive rate (FPR), is the proportion of people who are actually free of disease that are correctly identified as non-patients. Accuracy, precision, recall, and F1 score were used to evaluate the performance of the model. TP denotes the number of target lithological information units correctly classified as target units; TN, the number of non-target lithological information units correctly classified as non-target units; FN, the number of target lithological information units misclassified as non-target units; and FP, the number of non-target lithological information units misclassified as target units.

Based on the above definition, the four evaluation indexes of lithology prediction accuracy, precision, recall, and F1 score can be formulated as follows:

$$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{15}$$

$$\mathrm{precision} = \frac{TP}{TP + FP} \tag{16}$$

$$\mathrm{recall} = \frac{TP}{TP + FN} \tag{17}$$

$$F1\text{-}\mathrm{score} = \frac{2TP}{2TP + FP + FN} \tag{18}$$
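As a quick check of Equations (15)–(18), the four indexes can be computed directly from the confusion counts. The sketch below is illustrative only: the function name and the example counts are assumptions, and it treats a single target lithology against all others rather than reproducing the seven-class evaluation of this study.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 score from confusion counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of raising when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts for one target lithology versus all other classes.
metrics = classification_metrics(tp=80, fp=20, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```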

3. Results

3.1. Sample Data Selection

The selection of training samples is a key step in constructing a hyperspectral lithology classification model, and the rationality and reliability of the selection directly affect the classification of lithology information. Based on the existing geological data in the study area, marble, monzonitic granite, syenite granite, diorite, gabbro, granite porphyry, and granodiorite were determined to be the main rock types in the test area. The spectra measured in the field were resampled to the spectral resolution of the ZY1-02D image, and the seven lithologic spectra measured in the field were taken as the standard spectra. The pixel spectra of the hyperspectral image were then extracted by spectral angle matching and used as the training samples of the classification model. Spectral angle matching measures spectral similarity by the angle between the pixel spectrum vector and the reference spectrum vector: the smaller the angle, the more similar the two spectral curves. The number of lithologic samples was adjusted through the spectral angle threshold set in the experiment. To ensure accurate and reliable extraction of training samples, the following steps were used: (1) The spectral angle was calculated for each of the seven lithologies, and its cosine was taken to obtain a cosine gray map. (2) Histogram statistics were computed on the gray image; the cosine values were divided into equal intervals, following the principle that a low angle corresponds to a high cosine, and the number of pixels in each interval was counted. (3) The least squares regression method was used to obtain the lower limit of the gray anomaly.
(4) The inverse cosine of this lower limit was taken to obtain the optimal threshold of the spectral angle. Sample selection is shown in Figure 7. The number of samples extracted for each lithology by spectral angle matching is shown in Table 2. A total of 3215 samples were extracted, of which 80% were used as the training set and 20% as the test set.
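The spectral angle and the conversion of a cosine lower limit into an angle threshold (step 4) can be illustrated with a short Python sketch. The function names, the example spectra, and the cosine limit of 0.995 are assumptions for illustration; the least squares step that produces the lower limit in this study is not reproduced here.

```python
import math
from typing import Sequence


def spectral_angle(pixel: Sequence[float], reference: Sequence[float]) -> float:
    """Angle in radians between a pixel spectrum and a reference spectrum."""
    dot = sum(p * r for p, r in zip(pixel, reference))
    norm_pixel = math.sqrt(sum(p * p for p in pixel))
    norm_reference = math.sqrt(sum(r * r for r in reference))
    cosine = dot / (norm_pixel * norm_reference)
    # Guard against rounding slightly outside [-1, 1] before taking acos.
    cosine = max(-1.0, min(1.0, cosine))
    return math.acos(cosine)


def angle_threshold(cosine_lower_limit: float) -> float:
    """Step (4): convert a cosine lower limit into a spectral angle threshold."""
    return math.acos(cosine_lower_limit)


# Hypothetical four-band spectra; a brighter copy of the reference has angle ~0.
reference = [0.21, 0.25, 0.30, 0.28]
pixel = [0.42, 0.50, 0.60, 0.56]
other = [0.30, 0.22, 0.18, 0.35]

threshold = angle_threshold(0.995)  # assumed lower limit, for illustration only
for name, spectrum in (("pixel", pixel), ("other", other)):
    angle = spectral_angle(spectrum, reference)
    print(name, round(angle, 4), angle <= threshold)
```

Because the angle depends only on the direction of the spectrum vector, a uniformly brighter or darker pixel of the same material still matches its reference, which is why the measure suits sample extraction under varying illumination.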

3.2. Feature Selection Results

Lithology spectral information obtained by hyperspectral remote sensing often contains redundant information. Feature selection reduces repeated features while retaining valid ones, which improves the accuracy of the model and shortens its training time. In this study, before the improved feature selection process was applied, the spectral data were standardized and a model was built with the traditional XGBoost method. The importance of features was calculated according to three indicators, FS, AC, and AG, and the cumulative feature importance threshold K was set at 80% to select the number of important features. Figure 8 shows the importance score and threshold of each indicator in the corresponding bands. The important characteristic bands calculated according to AC and FS are irregularly distributed, whereas those obtained by AG are uniformly distributed. Across the 153 bands, the values of AC, FS, and AG show obvious, regular, and gentle fluctuations, respectively. Based on these three indicators, the improved GREED-GFC method was used for feature band selection on the raw data (Figure 8d). With the GREED-GFC method, the accuracy of the model peaked at the 15th iteration of the greedy search and then declined slightly, mainly because the redundancy introduced by additional characteristic variables degraded prediction performance while also increasing the computational load. Table 3 shows the number and proportion of important features extracted by the four methods before and after the improvement. After the improvement, which combines the three indicators through the greedy strategy, the GREED-GFC method required only 15 features. This indicates that the GREED-GFC method reduces both the number of feature variables and the computational complexity.
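The cumulative importance threshold K used for the single-index baselines can be sketched as follows. The function name and the importance scores are hypothetical; the sketch only shows how features are kept, in descending order of importance, until their normalized cumulative score first reaches K.

```python
from typing import Dict, List


def select_by_cumulative_importance(
    importances: Dict[str, float], threshold: float = 0.80
) -> List[str]:
    """Keep the top-ranked features until cumulative importance reaches threshold."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    selected: List[str] = []
    cumulative = 0.0
    for name, value in ranked:
        selected.append(name)
        cumulative += value / total  # normalize so that all scores sum to 1
        if cumulative >= threshold:
            break
    return selected


# Hypothetical importance scores for four bands.
scores = {"band_12": 50, "band_47": 25, "band_88": 15, "band_130": 10}
print(select_by_cumulative_importance(scores, threshold=0.80))
```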

3.3. Hyperparameter Optimization with XGBoost Algorithm

When a machine learning model is used for regression or classification, reasonable selection of the model hyperparameters plays an important role in improving model performance. The bat algorithm is an effective strategy for parameter optimization: it is simple to operate, has few parameters, is potentially parallelizable, and is broadly applicable, and it can search effectively for the optimal hyperparameter combination of a model. In the classification process, the number of decision trees (n_estimator), the maximum depth of the tree (max_depth), the penalty factor (gamma), and the learning rate (learning_rate) determine the learning performance of the XGBoost model, and their settings influence the predicted results. In this study, the bat algorithm was used to optimize the parameters of the XGBoost models constructed with the four feature extraction methods. The frequency and intensity ranges of the pulse were set to between 0 and 1, and the loudness attenuation coefficient, frequency increase coefficient, population size, and number of iterations were set to 0.9, 0.9, 20, and 50, respectively; these values were used as the default parameters to initialize the bat algorithm. The choice of fitness function has a global influence on the optimization algorithm. In this experiment, the accuracy of the classification results was used as the fitness value when optimizing the parameters of the XGBoost model. Figure 9 shows how classification accuracy varied with the number of iterations of the algorithm: accuracy increased gradually as the number of iterations increased. The parameter values that gave the maximum accuracy were therefore selected as the modeling parameters of the XGBoost model. The optimization results for each parameter are shown in Table 4.
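A compact Python sketch of a bat algorithm with the settings quoted above (frequency range 0 to 1, both coefficients 0.9, 20 bats, 50 iterations) is given below. It follows the commonly published form of the algorithm rather than the code used in this study. The toy fitness function stands in for the classification accuracy of an XGBoost model, and the 0.1 step scale of the local random walk is an assumption of this sketch.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def bat_search(
    fitness: Callable[[Sequence[float]], float],
    bounds: Sequence[Tuple[float, float]],
    n_bats: int = 20,
    n_iter: int = 50,
    f_min: float = 0.0,
    f_max: float = 1.0,
    alpha: float = 0.9,  # loudness attenuation coefficient
    gamma: float = 0.9,  # pulse rate increase coefficient
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Maximize `fitness` inside the box `bounds` with a basic bat algorithm."""
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(x: Sequence[float]) -> List[float]:
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    fit = [fitness(p) for p in pos]
    loud = [1.0] * n_bats
    rate0 = [rng.random() for _ in range(n_bats)]
    rate = [0.0] * n_bats

    best = max(range(n_bats), key=lambda i: fit[i])
    best_pos, best_fit = list(pos[best]), fit[best]

    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Frequency-driven velocity and position update.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (x - b) * freq for v, x, b in zip(vel[i], pos[i], best_pos)]
            cand = clip([x + v for x, v in zip(pos[i], vel[i])])
            # Local random walk around the current best (0.1 scale is assumed).
            if rng.random() > rate[i]:
                cand = clip([
                    b + 0.1 * (hi - lo) * mean_loud * rng.uniform(-1.0, 1.0)
                    for b, (lo, hi) in zip(best_pos, bounds)
                ])
            cand_fit = fitness(cand)
            # Accept with probability tied to loudness, then quieten the bat
            # and raise its pulse rate.
            if cand_fit >= fit[i] and rng.random() < loud[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha
                rate[i] = rate0[i] * (1.0 - math.exp(-gamma * t))
            if cand_fit > best_fit:
                best_pos, best_fit = list(cand), cand_fit
    return best_pos, best_fit


def toy_fitness(x: Sequence[float]) -> float:
    # Stand-in for model accuracy; its maximum of 0 lies at (0.3, 0.7).
    return -((x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2)


best_x, best_value = bat_search(toy_fitness, [(0.0, 1.0), (0.0, 1.0)])
print([round(v, 3) for v in best_x], round(best_value, 5))
```

In a real run, each position would be decoded into n_estimator, max_depth, gamma, and learning_rate (rounding the integer-valued ones), and the fitness call would train and score a model, which is what makes the search expensive.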

3.4. Accuracy Assessment of the GREED-GFC Method

The XGBoost lithology classification models constructed with the four feature selection methods were each optimized by the bat algorithm, and their performance was evaluated by four indexes: accuracy, precision, recall, and F1 score. Figure 10 shows the evaluation results on the training set and the test set for the models constructed with the four feature selection methods, before and after the improvement. The improved GREED-GFC method performed better than the three conventional methods, with the ranking GREED-GFC > AG >> FS > AC. Notably, for all four feature selection methods the test performance was close to the training performance, which supports the generalization ability of the XGBoost algorithm. Among the four sets of results, the model constructed with the GREED-GFC method produced the best evaluation indexes: the accuracy, precision, recall, and F1 score on the test set were 0.8343, 0.8406, 0.8350, and 0.8157, respectively.

3.5. Model Classification Results and Evaluation

To extract lithology information in the study area, the XGBoost algorithm was implemented in Python to construct a lithology classification model. The preprocessed spectral data were fed into the first-layer training model with the three importance indexes (FS, AC, and AG). The fifteen characteristic spectral bands extracted by GREED-GFC were used as input variables of the second layer of the two-layer XGBoost model, and the n_estimator, max_depth, gamma, and learning_rate values obtained by the bat algorithm were used as the optimal parameter combination. Classification models were constructed for the seven typical rocks (marble, monzonitic granite, syenite granite, diorite, gabbro, granite porphyry, and granodiorite). For comparison, three widely used machine learning models, namely the support vector machine (SVM), random forest (RF), and artificial neural network (ANN) models, were also applied to hyperspectral lithology classification. The main parameters of each model were optimized by the bat algorithm (Table 5). Although both the ANN and SVM models can learn nonlinear relationships, their classification performance was not satisfactory, with overall accuracies of only 0.8192 and 0.8177, respectively (Table 6). RF and XGBoost, which are based on ensemble learning, showed stronger predictive ability, and the two-layer XGBoost model produced the best indexes, with an overall accuracy of 0.8343. Among the models compared, the two-layer XGBoost model was the most effective lithology classification method.

The lithology information of the study area was extracted with the two-layer XGBoost model, and the spatial distribution map of the surface lithology was drawn on a GIS software platform (Figure 11). To verify the classification results, the existing geological data of the study area were introduced, and the main distribution areas of the rocks classified by the model were compared with them by overlay analysis. According to the spatial distribution analysis, the seven lithologies are mainly distributed in the southeastern, northeastern, and northern parts of the study area, with a zonal distribution along the tectonic line, and are also scattered in the northwestern part. Granite porphyry and granodiorite were identified over large areas: granite porphyry is concentrated in the southeast of the study area, and granodiorite is distributed in the north, northeast, and southeast. Marble, monzonitic granite, and gabbro are densely distributed, whereas diorite and syenite granite are sparsely distributed throughout the area. The lithology classification obtained from the hyperspectral images is highly consistent with the rock categories of the established geological maps. In addition to the rock types known from the existing geological maps, new lithologies were identified. As shown in Figure 11a, a small amount of granite porphyry was identified in addition to the existing marble, monzonitic granite, granodiorite, and diorite. In Figure 11b, the clastic rock formation of the Dakendaban Group in the central region is subdivided into granodiorite, monzonitic granite, diorite, granite porphyry, and other rock types. Under normal circumstances, these rock types are rare in the clastic rock formation of the Dakendaban Group; a possible explanation for the model's result is as follows.
The study area is a high-altitude area with poor accessibility, and the largest scale of the existing geological map is 1:100,000, whereas the hyperspectral image, with a spatial resolution of 30 m, offers finer detail than the geological map. The model therefore identified new lithologies in the higher-resolution imagery. In Figure 11c, the major lithologies identified include granodiorite, monzonitic granite, and granite porphyry. In addition, the model categorized the clastic rock formation of the Dakendaban Group in the central part of the region into syenite granite, monzonitic granite, granite porphyry, and granodiorite. However, a portion of the diorite labeled as Permian in the existing geological map of the northwest region was identified as granodiorite and syenite granite. This result may be related to the similarity of the major minerals in these lithologies and to the small sample size. These results suggest that, compared with the existing geological map, classification of lithology information from hyperspectral remote sensing images offers improved accuracy and higher reliability, and that the newly identified rock types can effectively supplement the existing geological data.

To verify the accuracy of the results, a field survey was conducted in August 2022, guided by the spatial location and extent of the lithology information extracted by the two-layer XGBoost model. The lithology information extracted in the eastern section of the Altun Mountain and the Lenghu region was verified on site. Owing to the limitations of field conditions, only areas with concentrated lithology information were selected as key investigation areas. The spatial distribution of the field verification points is shown in Figure 12a, and field photographs taken near the verification points are shown in Figure 12b. Region 1 in Figure 12a is the northwestern part of the study area, where nine points were collected. Field investigation determined that the model results at six points are fully consistent with the marble interlayers in the Early Carboniferous Huaitoutala formation, and those at the other three points are fully consistent with the marble interlayers in the lower clastic rock formation of the Cambrian–Ordovician Tanjianshan Group. Region 2 is the highest-altitude part of the study area, where geological tectonic activity is strong; seven points were collected there. The monzonitic granite identified by the model at the four points in the northwest of this region corresponds to light red, medium-grained monzonitic granite of the middle Permian, and that at the two points in the southeast corresponds to light red to light flesh-red, medium-grained porphyritic monzonitic granite of the middle Permian. At the remaining verification point, the rock is monzonitic granite, but the model identified it as granodiorite.
Ten points were collected in region 3. Verification shows that two points contain a small amount of syenite granite, as identified by the model, and four points are middle Permian gray-green to dark gray, medium- to fine-grained diorite dikes. At the remaining four points, the granodiorite identified by the model was not found; diorite was present at all of them. Analysis of the rock spectra together with the prediction results shows that the spectral curves of diorite and granodiorite are similar and their absorption positions are close, which indicates that the diorite and granodiorite information extracted using image spectral curves as end-member spectra overlaps to a certain extent. Region 4 in Figure 12a is an area where lithology classification information is concentrated; seven points were verified there. The field investigation shows that granite porphyry is the main rock type in this area, followed by gabbro. Six points are consistent with the lithology information extracted by the improved model; at the remaining point, the model identified granite porphyry as monzonitic granite. In total, thirty-three points were checked in the four regions, of which twenty-seven were correctly recognized by lithology type, giving a correct recognition rate of 81.8%. This result further supports the reliability and accuracy of the two-layer XGBoost model.
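The reported recognition rate can be reproduced from the per-region counts given in the text. The tally below is a reading of those counts (for region 3, the two syenite granite points and four diorite points are taken as the correct ones), not data taken from a table in the paper.

```python
# (points checked, points recognized correctly) per verification region,
# as read from the field verification description.
regions = {
    "region 1": (9, 9),   # 6 + 3 marble points, all consistent
    "region 2": (7, 6),   # 4 + 2 monzonitic granite points, 1 misidentified
    "region 3": (10, 6),  # 2 syenite granite + 4 diorite points, 4 misidentified
    "region 4": (7, 6),   # 6 consistent points, 1 misidentified
}

total_points = sum(points for points, _ in regions.values())
total_correct = sum(correct for _, correct in regions.values())
rate = total_correct / total_points

print(f"{total_correct}/{total_points} = {rate:.1%}")  # prints "27/33 = 81.8%"
```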

4. Discussion

4.1. Analysis of Different Training Samples of the Two-Layer XGBoost Model

In lithology classification with hyperspectral images, the selection of training samples is a key factor affecting the performance of the model. When sample data are scarce, the proportions of the training set and test set affect the classification result, and an inappropriate proportion can lead to overfitting. A reasonable split between the training set and test set is therefore important. This study analyzed the influence of training sets of different proportions on the classification performance of the two-layer XGBoost model, to further verify the robustness and adaptability of the model to small sample datasets and to reflect the generalization of the XGBoost classifier. The hold-out method with stratified sampling was used to split the 3215 samples into training and test sets at ratios of 2:8, 3:7, 4:6, 5:5, 6:4, and 7:3. The performance of the two-layer XGBoost model at each training set size was evaluated by accuracy, precision, recall, and F1 score. Figure 13 shows the evaluation results for the different ratios. As the proportion of the training set increased, each evaluation index also increased, and the overall trend was a rise followed by stabilization. When the ratio reached 5:5, the increase in the four indexes slowed, which indicates that the model approaches its best training result at this ratio. When the ratio of the training set to the test set reached 7:3, the four indexes were 0.8243, 0.8316, 0.8257, and 0.8105, respectively. At the 8:2 ratio used in Section 3.1, all four evaluation indexes increased further, but the increment was small.
The two-layer XGBoost classification model therefore maintains its accuracy when the size of the training set changes to a certain degree, and it performs similarly well on the different datasets. The model shows good robustness and adaptability and has clear advantages when trained on small sample datasets.
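A stratified hold-out split of the kind described above can be sketched in plain Python as follows. The function name and the class counts are hypothetical (they are not the per-lithology counts of Table 2); the point is only that each lithology contributes the same fraction of its samples to the training set at every ratio.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_holdout(
    labels: Sequence[str], train_fraction: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split sample indices into train and test sets, class by class."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * train_fraction)
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test


# Hypothetical label vector with two lithologies.
labels = ["marble"] * 50 + ["gabbro"] * 30
for train_fraction in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
    train_idx, test_idx = stratified_holdout(labels, train_fraction)
    print(train_fraction, len(train_idx), len(test_idx))
```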

4.2. Feature Selection Analysis of the Two-Layer XGBoost Model

The two-layer XGBoost model provides an ensemble learning strategy with high efficiency and good generalization ability in response to the multicollinearity and the large number of characteristic variables encountered in hyperspectral prediction. The “curse of dimensionality”, which includes high feature dimensionality, an overwhelming amount of invalid data, and unclear relations among features, significantly affects the prediction performance of a model. To further compare the improved feature selection method presented in this paper with traditional feature selection methods, the GREED-GFC method, in which the XGBoost importance indexes are combined with the greedy algorithm, was compared with three traditional feature selection algorithms: PCA, SPA, and LASSO. The average classification accuracy after construction of the XGBoost model was taken as the evaluation standard, and 50 test experiments were conducted for each method. The resulting average accuracies of rock classification on the training set and test set are shown in Figure 14.

The average accuracy of the XGBoost model on the test set ranges from 0.7637 to 0.8333 across the four feature selection methods, and the number of effective features extracted is between 15 and 19 (Figure 14). Among the three traditional feature selection algorithms, the model based on the SPA method shows significantly better prediction accuracy, with a range of 0.65–0.9, and extracts 17 features. Notably, the prediction model based on the GREED-GFC method has higher prediction accuracy (0.7–0.94) and fewer extracted features (15) than the model using the SPA method, which indicates that the GREED-GFC method in the first layer of the improved two-layer XGBoost model gives the best extraction result when many feature variables are present. When dealing with massive, high-dimensional sample data, each newly generated CART tree in the two-layer XGBoost model searches for and approximates the objective function through continuous splitting on the values of the feature variables, and the leaf nodes generated during splitting correspond to the segmentation points of each feature variable. By improving the greedy search algorithm and combining the three indicators (FS, AG, and AC), issues such as the segmentation of the feature space, the masking of enumeration features, and differences in importance were taken into account, which enables the optimal variable combination to be determined and thereby achieves efficient feature extraction from massive high-dimensional sample data.

4.3. The Potential and Limitations of the Two-Layer XGBoost Model

The model improvement method based on the XGBoost framework combined with a greedy algorithm provides a high-precision learning strategy for hyperspectral lithological classification. During classification, each tree of the two-layer XGBoost model learns the residuals of the previous N − 1 trees, and the final predicted value of a sample is the sum of the values predicted by all trees. The model is optimized using not only first-order derivative information but also a second-order Taylor expansion of the cost function, which improves the generalization ability of the model and helps prevent overfitting, giving it a good classification result. On this basis, this study introduced the greedy algorithm into the feature selection process of XGBoost. The improved two-layer XGBoost model is better suited to extracting effective feature variables from high-dimensional data, strengthens the contribution of the new feature set to the model and the heuristic guidance of the computational process, reduces the running time needed to search for efficient feature combinations, and improves the performance of the ensemble learning model in hyperspectral surface rock type extraction. However, the two-layer XGBoost model still has some limitations. The greedy algorithm used in the feature selection process usually selects a locally optimal solution at each step. A locally optimal solution serves as a basis for reaching the globally optimal solution only in some problems, not all, so its use must be considered in the context of the actual problem. In this study, the greedy algorithm was used to improve the three single-feature selection indexes, on the premise that multiple single-selection methods had already been used to rank the feature importance of all bands of the hyperspectral image.
Greedy selection was then carried out in order of feature importance, from largest to smallest. At each greedy step, the candidates were drawn from the multiple rankings, so the current optimal solution was already the best solution among the multiple single selections; at the same time, classification accuracy was used as the fitness function, which kept the classification accuracy improving steadily and largely avoided the local optimum problem. Additionally, because of the high complexity of the sampling conditions and of the model parameter optimization, more time was needed to process large-scale data. The sampling process and the model should therefore be simplified in order to realize efficient large-scale data processing and provide more detailed and reliable data for resource and mineral exploration.
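For reference, the additive prediction and the second-order approximation mentioned at the start of this subsection are usually written as below. The notation follows the standard XGBoost formulation and is an assumption here ($f_k$ is the $k$-th tree, $l$ the loss, $\Omega$ the regularization term, and $g_i$, $h_i$ the first and second derivatives of the loss); it may differ from the symbols used elsewhere in this paper.

```latex
\hat{y}_i^{(N)} = \sum_{k=1}^{N} f_k(x_i) = \hat{y}_i^{(N-1)} + f_N(x_i)

\mathcal{L}^{(N)} \approx \sum_{i=1}^{n} \left[ l\!\left(y_i, \hat{y}_i^{(N-1)}\right)
    + g_i f_N(x_i) + \tfrac{1}{2} h_i f_N^{2}(x_i) \right] + \Omega(f_N)

g_i = \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(N-1)}\right)}{\partial\, \hat{y}_i^{(N-1)}},
\qquad
h_i = \frac{\partial^{2} l\!\left(y_i, \hat{y}_i^{(N-1)}\right)}{\partial \left(\hat{y}_i^{(N-1)}\right)^{2}}
```

For a squared-error loss, $g_i$ is proportional to the residual of the first $N - 1$ trees, which is why each new tree can be described as learning the residuals of its predecessors.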

5. Conclusions

In this study, we proposed a method to identify lithology information using hyperspectral images, with ZY1-02D hyperspectral images as the data source. Seven typical rocks in the Lenghu region of Qinghai Province, namely marble, monzonitic granite, syenite granite, diorite, gabbro, granite porphyry, and granodiorite, were classified and studied. A greedy search algorithm was introduced into the feature selection process of the XGBoost model and combined with three traditional single extraction indexes to select the feature variables, and an improved two-layer XGBoost model was proposed to identify lithology information. The number of characteristic variables extracted by the improved model was significantly reduced, which lowered the computational complexity of the model and effectively improved the accuracy, precision, recall, and F1 score of its predictions. The GREED-GFC method, which incorporates the greedy search algorithm at the first layer of the two-layer XGBoost model, showed advantages over traditional feature selection methods in the testing experiments: in multi-group tests, the model constructed by this method had the smallest fluctuations in prediction accuracy, the highest lithology classification accuracy, and the fewest extracted features. The learner based on the improved two-layer XGBoost model had the best stability. In the comparison of multiple machine learning models, the two-layer XGBoost model performed well in lithology classification, and its evaluation indexes were higher than those of the traditional machine learning models. In the analysis of multiple groups with different proportions of training samples, the four indicators of the two-layer XGBoost model changed only slightly, which indicates that the improved model has good robustness and adaptability to small sample datasets.
Field verification results indicated that the two-layer XGBoost model was reliable and that the accuracy of its classification results was higher than that of the existing geological data. The new rock types identified by the model can effectively supplement the existing geological data and improve the accuracy of their spatial distribution.

Author Contributions

Conceptualization, N.L. and J.F.; methodology, R.J.; validation, Q.Y.; investigation, G.L.; writing—original draft preparation, N.L. and J.F.; writing—review and editing, N.L. and J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was sponsored by the Science and Technology Development Project of Jilin Province (20210203016SF), the National Natural Science Foundation of China (41702357 and 52178042), and the Scientific and Technological Transformative Special Project of Qinghai Province (2020-SF-150).

Acknowledgments

The authors would like to thank the China Centre for Resources Satellite Data and Application for providing ZY1-02D data. We are most grateful to the anonymous reviewers and the editors for their critical and constructive reviews of this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Liu, T.; Jin, X.; Gu, Y.; IEEE. Sparse Multiple Kernel Learning for Hyperspectral Image Classification Using Spatial-spectral Features. In Proceedings of the 6th International Conference on Instrumentation and Measurement, Computer, Communication and Control (IMCCC), Harbin, China, 21–23 July 2016; Harbin Institute of Technology: Harbin, China, 2016; pp. 614–618. [Google Scholar]
Huang, W.; Li, W.; Xu, J.; Ma, X.; Li, C.; Liu, C. Hyperspectral Monitoring Driven by Machine Learning Methods for Grassland Above-Ground Biomass. Remote Sens. 2022, 14, 2086. [Google Scholar] [CrossRef]
Lin, N.; Jiang, R.; Li, G.; Yang, Q.; Li, D.; Yang, X. Estimating the heavy metal contents in farmland soil from hyperspectral images based on Stacked AdaBoost ensemble learning. Ecol. Indic. 2022, 143, 109330. [Google Scholar] [CrossRef]
Xu, Y.; Wang, J.; Xia, A.; Zhang, K.; Dong, X.; Wu, K.; Wu, G. Continuous Wavelet Analysis of Leaf Reflectance Improves Classification Accuracy of Mangrove Species. Remote Sens. 2019, 11, 254. [Google Scholar] [CrossRef] [Green Version]
Feng, Y.; Lv, J.; Su, J. Feature Preserving Compression for Hyperspectral Remote Sensing Images. In Proceedings of the 4th IEEE Conference on Industrial Electronics and Applications, Xian, China, 25–27 May 2009; pp. 3834–3847. [Google Scholar]
Banskota, A.; Wynne, R.H.; Thomas, V.A.; Serbin, S.P.; Kayastha, N.; Gastellu-Etchegorry, J.P.; Townsend, P.A. Investigating the Utility of Wavelet Transforms for Inverting a 3-D Radiative Transfer Model Using Hyperspectral Data to Retrieve Forest LAI. Remote Sens. 2013, 5, 2639. [Google Scholar] [CrossRef]
Yu, Y.; Peng, Y.; Jiang, T.; Na, J. An endmember extraction method based on PCA and a new SGA algorithm. In Proceedings of the Applied Optics and Photonics China (AOPC) Conference—Optical Sensing and Imaging Technology, Xiamen, China, 25–27 August 2020. [Google Scholar]
Zhou, L.; Ma, X.; Wang, X.; Hao, S.; Ye, Y.; Zhao, K. Shallow-to-Deep Spatial-Spectral Feature Enhancement for Hyperspectral Image Classification. Remote Sens. 2023, 15, 261. [Google Scholar] [CrossRef]
Su, H.; Du, Q.; Chen, G.; Du, P. Optimized Hyperspectral Band Selection Using Particle Swarm Optimization. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2659–2670. [Google Scholar] [CrossRef]
Li, J.; Ding, S.; IEEE. Spectral Feature Selection with Particle Swarm Optimization for Hyperspectral Classification. In Proceedings of the International Conference on Industrial Control and Electronics Engineering (ICICEE), Xian, China, 23–25 August 2012; pp. 414–418. [Google Scholar]
Gao, W. Improved Ant Colony Clustering Algorithm and Its Performance Study. Comput. Intell. Neurosci. 2016, 2016, 4835932. [Google Scholar] [CrossRef] [Green Version]
Yu, Y.; Xie, X.; Tang, Y.; Liu, Y. Feature Selection for Cross-Scene Hyperspectral Image Classification via Improved Ant Colony Optimization Algorithm. IEEE Access 2022, 10, 102992–103012. [Google Scholar] [CrossRef]
Zhang, Y.; He, C.-l.; Song, X.-f.; Sun, X.-y. A multi-strategy integrated multi-objective artificial bee colony for unsupervised band selection of hyperspectral images. Swarm Evol. Comput. 2021, 60, 100806. [Google Scholar] [CrossRef]
Ou, X.; Wu, M.; Tu, B.; Zhang, G.; Li, W. Multi-Objective Unsupervised Band Selection Method for Hyperspectral Images Classification. IEEE Trans. Image Process. 2023, 32, 1952–1965. [Google Scholar] [CrossRef] [PubMed]
Xia, J.; Ghamisi, P.; Yokoya, N.; Iwasaki, A. Random Forest Ensembles and Extended Multiextinction Profiles for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 202–216. [Google Scholar] [CrossRef] [Green Version]
Li, J.; Zhang, H.; Zhao, J.; Guo, X.; Rihan, W.; Deng, G. Embedded Feature Selection and Machine Learning Methods for Flash Flood Susceptibility-Mapping in the Mainstream Songhua River Basin, China. Remote Sens. 2022, 14, 5523. [Google Scholar] [CrossRef]
Xu, S.; Liu, S.; Wang, H.; Chen, W.; Zhang, F.; Xiao, Z. A Hyperspectral Image Classification Approach Based on Feature Fusion and Multi-Layered Gradient Boosting Decision Trees. Entropy 2021, 23, 20. [Google Scholar] [CrossRef]
Peng, S.; Xi, X.; Wang, C.; Dong, P.; Wang, P.; Nie, S. Systematic Comparison of Power Corridor Classification Methods from ALS Point Clouds. Remote Sens. 2019, 11, 1961. [Google Scholar] [CrossRef] [Green Version]
Banga, A.; Ahuja, R.; Sharma, S.C. Performance analysis of regression algorithms and feature selection techniques to predict PM2.5 in smart cities. Int. J. Syst. Assur. Eng. Manag. 2023, 14, 732–745. [Google Scholar] [CrossRef]
Dev, V.A.; Eden, M.R. Gradient boosted decision trees for lithology classification. Comput. Aided Chem. Eng. 2019, 47, 113–118. [Google Scholar]
Lu, S.; Li, M.; Luo, N.; He, W.; He, X.; Gan, C.; Deng, R. Lithology Logging Recognition Technology Based on GWO-SVM Algorithm. Math. Probl. Eng. 2022, 2022, 1640096. [Google Scholar] [CrossRef]
Liu, H.; Wu, Y.; Cao, Y.; Lv, W.; Han, H.; Li, Z.; Chang, J. Well Logging Based Lithology Identification Model Establishment Under Data Drift: A Transfer Learning Method. Sensors 2020, 20, 3643. [Google Scholar] [CrossRef]
Yu, Z.; Wang, Z.; Zeng, F.; Song, P.; Baffour, B.A.; Wang, P.; Wang, W.; Li, L. Volcanic lithology identification based on parameter-optimized GBDT algorithm: A case study in the Jilin Oilfield, Songliao Basin, NE China. J. Appl. Geophys. 2021, 194, 104443. [Google Scholar] [CrossRef]
Li, J.; Wu, L.; Lu, W.; Wang, T.; Kang, Y.; Feng, D.; Zhou, H. Lithology Classification Based on Set-Valued Identification Method. J. Syst. Sci. Complex. 2022, 35, 1637–1652. [Google Scholar] [CrossRef]
Li, J.-c.; Zhao, D.-l.; Ge, B.-F.; Yang, K.-W.; Chen, Y.-W. A link prediction method for heterogeneous networks based on BP neural network. Phys. A-Stat. Mech. Its Appl. 2018, 495, 1–17. [Google Scholar] [CrossRef]
Deng, C.; Pan, H.; Fang, S.; Konate, A.A.; Qin, R. Support vector machine as an alternative method for lithology classification of crystalline rocks. J. Geophys. Eng. 2017, 14, 341–349. [Google Scholar] [CrossRef] [Green Version]
Moser, G.; Serpico, S.B. Combining Support Vector Machines and Markov Random Fields in an Integrated Framework for Contextual Image Classification. Ieee Trans. Geosci. Remote Sens. 2013, 51, 2734–2752. [Google Scholar] [CrossRef]
Rani, N.; Mandla, V.R.; Singh, T. Performance of image classification on hyperspectral imagery for lithological mapping. J. Geol. Soc. India 2016, 88, 440–448. [Google Scholar] [CrossRef]
Mou, D.; Wang, Z.; Tan, X.; Shi, S. A variational inequality approach with SVM optimization algorithm for identifying mineral lithology. J. Appl. Geophys. 2022, 204, 104747. [Google Scholar] [CrossRef]
Bressan, T.S.; de Souza, M.K.; Girelli, T.J.; Chemale Junior, F. Evaluation of machine learning methods for lithology classification using geophysical data. Comput. Geosci. 2020, 139, 104475. [Google Scholar] [CrossRef]
Khorram, F.; Morshedy, A.H.; Memarian, H.; Tokhmechi, B.; Zadeh, H.S. Lithological classification and chemical component estimation based on the visual features of crushed rock samples. Arab. J. Geosci. 2017, 10, 324. [Google Scholar] [CrossRef]
Ethem, A. Back Matter. In Introduction to Machine Learning; MIT Press: Cambridge, MA, USA, 2014; pp. 615–616. [Google Scholar]
Zhang, G.; Wang, Z.; Li, H.; Sun, Y.; Zhang, Q.; Chen, W. Permeability prediction of isolated channel sands using machine learning. J. Appl. Geophys. 2018, 159, 605–615. [Google Scholar] [CrossRef]
Guo, Y.; Liu, Y.; Oerlemans, A.; Lao, S.; Wu, S.; Lew, M.S. Deep learning for visual understanding: A review. Neurocomputing 2016, 187, 27–48. [Google Scholar] [CrossRef]
Saporetti, C.M.; da Fonseca, L.G.; Pereira, E.; de Oliveira, L.C. Machine learning approaches for petrographic classification of carbonate-siliciclastic rocks using well logs and textural information. J. Appl. Geophys. 2018, 155, 217–225. [Google Scholar] [CrossRef]
Rusk, N. Deep learning. Nat. Methods 2016, 13, 35. [Google Scholar] [CrossRef]
Pan, S.; Zheng, Z.; Guo, Z.; Luo, H. An optimized XGBoost method for predicting reservoir porosity using petrophysical logs. J. Pet. Sci. Eng. 2022, 208, 109520. [Google Scholar] [CrossRef]
Gu, Y.; Zhang, D.; Bao, Z. Lithological classification via an improved extreme gradient boosting: A demonstration of the Chang 4+5 member, Ordos Basin, Northern China. J. Asian Earth Sci. 2021, 215, 104798. [Google Scholar] [CrossRef]
Han, R.; Wang, Z.; Wang, W.; Xu, F.; Qi, X.; Cui, Y. Lithology identification of igneous rocks based on XGboost and conventional logging curves, a case study of the eastern depression of Liaohe Basin. J. Appl. Geophys. 2021, 195, 104480. [Google Scholar] [CrossRef]
Guo, L.; Li, Z.; Tian, Q.; Guo, L.; Wang, Q. Prediction of CSG splitting tensile strength based on XGBoost-RF model. Mater. Today Commun. 2023, 34, 105350. [Google Scholar] [CrossRef]
Chandrahas, N.S.; Choudhary, B.S.; Teja, M.V.; Venkataramayya, M.S.; Prasad, N.S.R.K. XG Boost Algorithm to Simultaneous Prediction of Rock Fragmentation and Induced Ground Vibration Using Unique Blast Data. Appl. Sci. 2022, 12, 5269. [Google Scholar] [CrossRef]

Figure 1. (a) Map of Northwest China; (b) Qinghai Province, China; (c) geological map of the Cold Lake area in Qinghai Province.

Figure 2. Flowchart of the lithology classification method based on hyperspectral images.

Figure 3. Hyperspectral image scene (R: 705.19 nm, G: 507.69 nm, B: 447.09 nm) of Qinghai Cold Lake in October 2021.

Figure 4. Image spectrum, measured spectrum, and correlation coefficients of monzonitic granite (a) and marble (b).

Figure 5. Greedy optimal substructure.

Figure 6. Flowchart of the GREED-GFC method.

Figure 7. Statistical diagram of the spectral angle threshold of lithologic samples. (a) The optimal spectral angle threshold is determined from the partition cosine value; (b) the optimal threshold is used to determine the number of samples.

Figure 8. Feature extraction process before and after the XGBoost improvement. (a) FS method feature selection process; (b) AG method feature selection process; (c) AC method feature selection process; (d) GREED-GFC method feature selection process.

Figure 9. Bat algorithm optimization process.

Figure 10. Evaluation results of the four feature selection methods. (a) Evaluation results on the training set; (b) evaluation results on the test set.

Figure 11. Spatial distribution of lithologic information extracted by the two-layer XGBoost model. (a) Comparative analysis area one; (b) comparative analysis area two; (c) comparative analysis area three.

Figure 12. Distribution of field verification points (a) and field investigation (b).

Figure 13. Accuracy, precision, recall, and F1 score (weighted average) of the two-layer XGBoost algorithm for the training set and test set. (a) Accuracy; (b) precision; (c) recall; (d) F1 score.

Figure 14. Average prediction accuracy of the four feature selection methods.

Table 1. Main specifications of the ZY1-02D HSI.

Specification	ZY1-02D HSI
Wavelength (μm)	0.4–2.5
Spectral bands	VNIR: 76; SWIR: 90
Spectral resolution	VNIR: 10 nm; SWIR: 20 nm
Swath width	60 km
Spatial resolution	30 m

Table 2. Selection of rock training samples.

Rock Type	Optimal Spectral Angle Threshold (rad)	Optimal Partition Cosine	Number of Samples	Overall Percentage (%)
Marble	0.167	0.986	432	13.42
Monzonitic granite	0.195	0.981	504	15.68
Syenite granite	0.173	0.985	447	13.91
Diorite	0.155	0.988	401	12.46
Gabbro	0.184	0.983	476	14.79
Granite porphyry	0.179	0.984	463	14.39
Granodiorite	0.191	0.982	494	15.35
Sum	--	--	3215	100
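
The two threshold columns of Table 2 appear to be linked: each optimal partition cosine equals the cosine of the corresponding optimal threshold when that threshold is read as a spectral angle in radians. The sketch below checks this relation and shows how such a threshold might be used to accept a pixel as a training sample. The radian interpretation, the helper names `spectral_angle` and `is_candidate_sample`, and the "angle at or below threshold" acceptance rule are assumptions made for illustration, not details confirmed by the article.

```python
import math

# (optimal threshold, optimal partition cosine) per rock type, from Table 2.
# Reading the threshold as an angle in radians is an assumption.
TABLE_2 = {
    "Marble": (0.167, 0.986),
    "Monzonitic granite": (0.195, 0.981),
    "Syenite granite": (0.173, 0.985),
    "Diorite": (0.155, 0.988),
    "Gabbro": (0.184, 0.983),
    "Granite porphyry": (0.179, 0.984),
    "Granodiorite": (0.191, 0.982),
}


def spectral_angle(pixel, reference):
    """Angle in radians between two spectra treated as vectors."""
    dot = sum(p * r for p, r in zip(pixel, reference))
    norm = math.sqrt(sum(p * p for p in pixel)) * math.sqrt(sum(r * r for r in reference))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def is_candidate_sample(pixel, reference, threshold):
    """Hypothetical rule: keep a pixel whose angle to the reference spectrum is within the threshold."""
    return spectral_angle(pixel, reference) <= threshold


# Difference between cos(threshold) and the tabulated cosine, per rock type.
cosine_gap = {rock: abs(math.cos(t) - c) for rock, (t, c) in TABLE_2.items()}

for rock, gap in cosine_gap.items():
    print(f"{rock}: |cos(threshold) - cosine| = {gap:.5f}")
```

Under this reading, a smaller threshold (for example, diorite at 0.155) demands a closer spectral match than a larger one (monzonitic granite at 0.195).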

Table 3. Number and proportion of variables selected by the four feature selection methods before and after the improvement.

No.	Feature Selection Method	Number of Features	Proportion (%)
1	FS	22	14.38
2	AC	25	16.34
3	AG	20	13.07
4	GREED-GFC	15	9.80
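
The proportion column of Table 3 can be reproduced, up to rounding, by dividing each method's feature count by the number of input bands. The sketch below assumes the denominator is the 153 preprocessed spectral bands stated in the abstract; the variable names are illustrative only.

```python
# Number of preprocessed spectral bands fed to the first XGBoost layer,
# as stated in the abstract (assumed here to be the denominator of Table 3).
TOTAL_BANDS = 153

# Number of features retained by each feature selection method (Table 3).
selected = {"FS": 22, "AC": 25, "AG": 20, "GREED-GFC": 15}

# Share of the input bands kept by each method, in percent.
proportions = {
    method: round(100 * count / TOTAL_BANDS, 2)
    for method, count in selected.items()
}

for method, pct in proportions.items():
    print(f"{method}: {selected[method]} of {TOTAL_BANDS} bands = {pct:.2f}%")
```

Read this way, GREED-GFC keeps the fewest bands of the four methods, roughly one band in ten.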

Table 4. Hyperparameter values for each feature selection method.

Method	n_estimators	max_depth	gamma	learning_rate
FS	76	20	2.5	0.6
AC	48	16	2	0.5
AG	54	19	1.5	0.3
GREED-GFC	60	18	3	0.3

Table 5. Hyperparameters optimized by the bat algorithm.

Model	Hyperparameters Optimized by the Bat Algorithm
SVM	gamma = 0.7; C = 3; kernel = RBF
RF	n_estimators = 47; max_depth = 20; min_samples_split = 2
ANN	learning_rate = 0.25; lambda = 5; L = 6
Two-layer XGBoost	n_estimators = 60; max_depth = 18; learning_rate = 0.3
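
As a rough illustration of how the reported values for the two-layer XGBoost model could be passed to a classifier, the sketch below collects them in a dictionary keyed by the parameter names of XGBoost's scikit-learn style wrapper. The `gamma` value is taken from the GREED-GFC row of Table 4. The use of `XGBClassifier`, and the names `SECOND_LAYER_PARAMS` and `build_second_layer`, are assumptions for illustration; the article does not state which XGBoost interface the authors used.

```python
# Values reported for the two-layer XGBoost model (Table 5), plus gamma from
# the GREED-GFC row of Table 4. Mapping them onto the scikit-learn style
# XGBoost wrapper is an assumption.
SECOND_LAYER_PARAMS = {
    "n_estimators": 60,
    "max_depth": 18,
    "gamma": 3,
    "learning_rate": 0.3,
}


def build_second_layer(params=SECOND_LAYER_PARAMS):
    """Return an XGBClassifier with the reported values, or None if xgboost is unavailable."""
    try:
        from xgboost import XGBClassifier

        return XGBClassifier(**params)
    except ImportError:
        # xgboost (or its scikit-learn dependency) is not installed.
        return None


model = build_second_layer()
print("classifier ready" if model is not None else "xgboost not installed")
```

The returned classifier would then be fitted on the bands retained by the feature selection step, rather than on all input bands.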

Table 6. Evaluation results of the four models.

Rock Type	SVM	RF	ANN	Two-Layer XGBoost
Marble	0.8188	0.8289	0.8287	0.8277
Monzonitic granite	0.8186	0.8291	0.8178	0.8455
Syenite granite	0.8255	0.8324	0.8192	0.8456
Diorite	0.7946	0.7973	0.7988	0.8082
Gabbro	0.8188	0.8266	0.8292	0.8486
Granite porphyry	0.8267	0.8195	0.8192	0.8196
Granodiorite	0.8179	0.8347	0.8194	0.8398
Overall accuracy	0.8177	0.8247	0.8192	0.8343
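
To make the per-class comparison in Table 6 easier to read, the sketch below finds the best-scoring model for each rock type and counts how often each model comes first. The values are copied from the table; the variable names are illustrative only.

```python
from collections import Counter

MODELS = ["SVM", "RF", "ANN", "Two-Layer XGBoost"]

# Per-class scores from Table 6, in the column order given by MODELS.
TABLE_6 = {
    "Marble": [0.8188, 0.8289, 0.8287, 0.8277],
    "Monzonitic granite": [0.8186, 0.8291, 0.8178, 0.8455],
    "Syenite granite": [0.8255, 0.8324, 0.8192, 0.8456],
    "Diorite": [0.7946, 0.7973, 0.7988, 0.8082],
    "Gabbro": [0.8188, 0.8266, 0.8292, 0.8486],
    "Granite porphyry": [0.8267, 0.8195, 0.8192, 0.8196],
    "Granodiorite": [0.8179, 0.8347, 0.8194, 0.8398],
}

# Best-scoring model for each rock type.
best_model = {
    rock: MODELS[scores.index(max(scores))] for rock, scores in TABLE_6.items()
}

# How many rock types each model wins.
wins = Counter(best_model.values())

for rock, model in best_model.items():
    print(f"{rock}: {model}")
print(dict(wins))
```

On these figures the two-layer XGBoost model scores highest for five of the seven rock types, while RF leads for marble and SVM for granite porphyry.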

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
