1. Introduction
Lithological classification provides basic information for mineral resource exploration and geological disaster monitoring. Understanding the spatial distribution and variability of surface lithology is of great significance for regional geological mapping and for predicting mineral resource potential in high-altitude areas with poor transportation [1, 2, 3, 4]. Traditional lithological mapping involves examining aerial photographs and mapping from interpretation keys, examining rocks in the field, sampling them, and analyzing the samples in the laboratory. However, because mapping units are determined mainly by the subjective judgment of the surveyors, a large amount of work must be completed in the field, at a high cost in human, financial, and material resources [5, 6, 7]. An efficient lithological classification method is therefore urgently needed to map lithological distribution over large areas. Hyperspectral remote sensing lithological classification infers rock properties from the spectral characteristics of rocks, determines the lithology types and their spatial distribution at the surface, and presents the classification results as a geological map. Compared with traditional lithological classification methods, remote sensing lithological classification offers technical advantages such as wide-area coverage, rapid processing, and no pollution, and is therefore often the main means of large-area remote sensing geological survey and thematic mapping [8, 9, 10, 11]. Investigating the application potential of hyperspectral remote sensing in large-scale lithological classification and mineral resource exploration is necessary for developing high-precision geological and mineral exploration, and is conducive to the deep integration of hyperspectral remote sensing technology, geological survey, and green exploration.
Aerospace hyperspectral remote sensing is a relatively new earth observation technology that has developed rapidly since the early 1980s. It captures the spatial distribution of surface materials and records reflectance spectra at hundreds of closely spaced wavelengths, which allows different lithologies to be identified effectively [12, 13, 14, 15, 16]. Many scholars have studied lithology classification with this technology, focusing on spectral matching methods and on improving classification models [17, 18]. As standard spectral libraries such as the ASTER and USGS libraries have gained more rock and ore categories and data, the accuracy of lithology classification based on spectral matching has improved greatly. Spectral matching methods such as spectral angle mapping, the spectral goodness-of-fit technique, and cross-correlation spectral matching have produced useful results in remote sensing lithology classification [19, 20]. However, lithological mapping based on the similarity between spectra is easily affected by imaging conditions and by the choice of standard rock spectra. In practice, the best matching threshold is often hard to find, and the matching results depend heavily on human judgment. The introduction of machine learning has overcome many of these drawbacks. Remote sensing lithological classification based on machine learning establishes a statistical mapping between spectral data and lithology from training samples, offers strong computational power and high classification accuracy, and can mine the nonlinear relationships and enrichment patterns in the data in depth [16, 21]. Nevertheless, as the number of bands in hyperspectral data increases, traditional machine learning algorithms show clear limitations in learning from high-dimensional spectral image data. The data can contain a large amount of redundant information, so classification accuracy does not increase in proportion to the number of features. Instead, the Hughes phenomenon appears, in which accuracy first rises and then falls as features are added. As a result, these models cannot extract the spectral features of ground objects well or learn an accurate mapping between those features and the categories, which reduces classification accuracy. Overfitting and increased computational complexity can also occur [22, 23].
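To make the threshold sensitivity of spectral matching concrete, the following is a minimal sketch of spectral angle mapping in plain Python. The function names, the toy three-band spectra, and the `max_angle` threshold are assumptions chosen for illustration and are not taken from this study.

```python
import math

def spectral_angle(pixel, reference):
    """Return the angle (radians) between two spectra treated as vectors."""
    dot = sum(p * r for p, r in zip(pixel, reference))
    norm_p = math.sqrt(sum(p * p for p in pixel))
    norm_r = math.sqrt(sum(r * r for r in reference))
    if norm_p == 0.0 or norm_r == 0.0:
        raise ValueError("spectra must be non-zero")
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, dot / (norm_p * norm_r)))
    return math.acos(cos_theta)

def classify_pixel(pixel, library, max_angle):
    """Label a pixel with the closest library spectrum, or None if no
    reference lies within max_angle (the user-chosen matching threshold)."""
    best_label, best_angle = None, max_angle
    for label, reference in library.items():
        angle = spectral_angle(pixel, reference)
        if angle <= best_angle:
            best_label, best_angle = label, angle
    return best_label

# Hypothetical three-band reference spectra (illustrative values only).
library = {"granite": [0.2, 0.4, 0.6], "marble": [0.6, 0.4, 0.2]}
label = classify_pixel([0.1, 0.2, 0.3], library, max_angle=0.1)
```

The label assigned to each pixel depends directly on `max_angle` and on which reference spectra are placed in `library`, which is the dependence on threshold and standard spectra described above.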
To address the band redundancy in hyperspectral images, scholars have proposed and improved a large number of spectral feature extraction methods. These include traditional dimensionality reduction methods, such as the LASSO algorithm [24, 25], the successive projections algorithm [26, 27], the wavelet transform [28, 29], and principal component analysis [30, 31], as well as artificial intelligence feature extraction algorithms, such as the particle swarm optimization algorithm [32, 33], the ant colony optimization algorithm [34, 35], and the artificial bee colony algorithm [36, 37]. However, these methods still face challenges such as local optima, the curse of dimensionality, small sample sizes, and high computational complexity, because they cannot effectively identify and retain features with low individual value but high combined value. It is therefore difficult for them to accurately extract lithological spectral features from hyperspectral images with many high-dimensional spectral channels. Compared with traditional feature selection methods, tree-based methods, such as random forest (RF) [38, 39], gradient boosting decision tree (GBDT) [40, 41], and Light Gradient Boosting Machine (LightGBM) [42, 43], can extract the feature set that best suits the model in a more targeted way, and their decision-making process resembles human reasoning, which makes the model easy to understand. They reduce dimensionality quickly, can process both continuous and discrete data, and can handle multi-output problems. They do not require the input data to be standardized or normalized in advance, which is a significant advantage for feature extraction from high-dimensional data. However, most tree models select features using information gain and the number of sample splits. Because each of these criteria is a single constraint, they often suffer from low stability and can mask important enumerated features during feature selection. Therefore, building on the principle of the tree model, this study uses a greedy algorithm that jointly considers the number of sample splits, the average gain, and the average coverage of the classification and regression tree (CART) to improve the traditional tree model. The greedy algorithm is a simple and fast optimization method. At each step it makes the choice that is best under the current state according to a chosen optimization measure, which yields a locally optimal solution. Because it never backtracks over earlier choices, it saves considerable time in the search, and the locally optimal solution is often close to the global optimum. Applying the idea of ensemble learning to feature selection through a greedy algorithm yields a feature selection strategy that addresses the problems faced by traditional tree-based feature selection methods, such as overfitting, the masking of enumerated features, and high computational complexity. This enables the efficient extraction of rock spectral features.
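The greedy idea described above can be sketched as a generic forward feature selection loop. This is an illustration under stated assumptions rather than the procedure used in this study: the `score` callable, the `min_gain` stopping rule, and the toy band names are all hypothetical, and in practice `score` might be a cross-validated accuracy of a classifier trained on the candidate subset.

```python
def greedy_forward_selection(candidates, score, min_gain=0.0):
    """Greedily grow a feature subset.

    candidates: iterable of feature names (e.g. band identifiers).
    score: callable mapping a list of features to a number (higher is better).
    min_gain: stop when the best addition improves the score by no more than this.
    """
    selected = []
    remaining = list(candidates)
    best_score = score(selected)
    while remaining:
        # Evaluate each remaining feature when added to the current subset.
        trial_scores = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trial_scores, key=lambda t: t[0])
        if top_score - best_score <= min_gain:
            break  # no candidate helps enough; earlier picks are never revisited
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

def toy_score(features):
    # Hypothetical scoring: bands "b1" and "b2" are redundant with each other,
    # "b3" adds independent information, and "b4" is noise with a small cost.
    chosen = set(features)
    value = 0.0
    if "b1" in chosen or "b2" in chosen:
        value += 0.5
    if "b3" in chosen:
        value += 0.3
    if "b4" in chosen:
        value -= 0.05
    return value

selected, best = greedy_forward_selection(["b1", "b2", "b3", "b4"], toy_score)
# selected -> ['b1', 'b3']
```

In this toy case the loop keeps one of the two redundant bands, adds the independent band, and stops without ever reconsidering its earlier choices, which is why the search is fast but only locally optimal.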
Machine learning has been widely applied in lithology classification because of its excellent feature mining and data fitting abilities. Examples include extreme learning machines [44, 45], logistic regression [46, 47], back-propagation neural networks [48, 49], support vector machines [50, 51, 52], and multi-layer perceptrons [53, 54]. The commonly used lithological classification models fall into three categories: space vector, neural network, and linear models. Space vector models, represented by the support vector machine, are sensitive to parameter tuning and kernel function selection. They consume more memory and computation time, which makes training on large-scale lithological samples difficult [55]. Linear models, represented by logistic regression, are sensitive to multicollinearity among the independent variables and are prone to underfitting, which results in low classification accuracy [56, 57, 58]. Neural network models are prone to falling into local optima, which significantly degrades classification performance [59]. In recent years, ensemble learning has become a research hotspot in machine learning. It builds multiple learners of the same or different kinds and combines them in different ways to obtain a combined learner that performs better. The extreme gradient boosting decision tree (XGBoost) is a high-precision, distributed gradient boosting ensemble algorithm proposed by Dr. Tianqi Chen of the University of Washington in 2016. The XGBoost model overcomes many challenges faced by traditional machine learning algorithms, such as high sensitivity to sample data, high computational complexity, and overfitting. It can extract the optimal features from many variables and iteratively update the base learners by gradient for classification and identification. As a meta-algorithm framework, XGBoost supports parallel computation of feature importance and of the base learners during gradient boosting, which greatly speeds up training on large-scale samples. It introduces regularization to control model complexity and can assign a default direction to missing values, which significantly improves the generalization ability of the model while preventing overfitting [60, 61, 62].
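For reference, the regularized objective that controls model complexity in XGBoost can be written as follows. The notation follows the published XGBoost formulation and is not defined elsewhere in this paper, so the symbols should be read as assumptions of this note.

```latex
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}
```

Here $l$ is a differentiable loss between the label $y_i$ and the prediction $\hat{y}_i$, $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, $w$ is the vector of leaf weights, and $\gamma$ and $\lambda$ are the regularization coefficients. Larger values of $\gamma$ and $\lambda$ penalize trees with many leaves or large leaf weights, which is how the regularization term limits overfitting.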
This study aims to construct a lithological classification method for hyperspectral images based on ensemble learning, in order to solve the difficulty that machine learning has in quickly mining effective feature information from high-dimensional remote sensing data. By integrating feature selection and classification in one model, the method improves the accuracy and efficiency of surface lithological information extraction in high-altitude areas with difficult traffic conditions. The Lenghu region of the Haixi Mongol and Tibetan Autonomous Prefecture, Qinghai Province, was chosen as the study area for an experimental study. After preprocessing the ZY1-02D hyperspectral remote sensing images, the feature selection process of the model was improved based on the principle of the XGBoost model and the greedy search algorithm. First, the three importance indexes built into XGBoost were used for coarse feature screening to quickly reduce the feature dimension. The greedy algorithm was then integrated into XGBoost, and fine feature screening was carried out according to the feature importance ranking. Following the idea of ensemble learning, the single selection condition was replaced with multiple conditions applied in parallel, so as to obtain the optimal feature subset for lithology classification and improve the classification results. The accuracy and reliability of the model were examined with evaluation indexes and field verification. The main contributions of this paper are a two-layer XGBoost model for classifying rock information and an improved feature importance selection method for XGBoost that employs a greedy search algorithm. While maintaining the accuracy of lithological classification, the method effectively addresses the difficulty of extracting a low-dimensional set of weakly correlated, low-redundancy variables from high-dimensional hyperspectral data.
The study demonstrates the importance of the feature selection process when machine learning is applied to hyperspectral lithology classification and other fields, and it can help geoscientists use machine learning to inform mineral resource prediction, mining plans, and disaster prevention measures.
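The coarse screening step can be illustrated with a short sketch. It assumes that the three built-in indexes correspond to the `weight`, `gain`, and `cover` importance types that the XGBoost Python package exposes through `Booster.get_score(importance_type=...)`. The union rule in `coarse_screen`, the rank-sum ordering, and all numeric values below are hypothetical choices made for illustration, not the authors' procedure.

```python
def coarse_screen(importances, keep_top):
    """Keep features that rank in the top `keep_top` under at least one index.

    importances: dict mapping index name -> {feature: score}.
    Combining the indexes by union is an assumption made for illustration.
    """
    kept = set()
    for scores in importances.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        kept.update(ranked[:keep_top])
    return sorted(kept)

def order_by_rank_sum(importances, features):
    """Order features by the sum of their ranks across all indexes (best first).

    Assumes every index assigns a score to every feature.
    """
    total_rank = {f: 0 for f in features}
    for scores in importances.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        for position, f in enumerate(ranked):
            if f in total_rank:
                total_rank[f] += position
    return sorted(features, key=lambda f: total_rank[f])

# Hypothetical importance scores for four bands (illustrative values only).
importances = {
    "weight": {"b1": 10, "b2": 3, "b3": 7, "b4": 1},
    "gain": {"b1": 0.2, "b2": 0.9, "b3": 0.5, "b4": 0.1},
    "cover": {"b1": 5.0, "b2": 4.0, "b3": 8.0, "b4": 1.0},
}
kept = coarse_screen(importances, keep_top=2)    # -> ['b1', 'b2', 'b3']
ranking = order_by_rank_sum(importances, kept)   # -> ['b3', 'b1', 'b2']
```

A greedy fine screening pass would then walk such a ranking and keep a feature only when adding it improves the classification score. Band `b2` survives the coarse step here only because of its high gain, which shows how using several indexes in parallel can retain a feature that a single criterion would discard.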
5. Conclusions
In this study, we proposed a method to identify lithological information from hyperspectral images. ZY1-02D hyperspectral images were used as the data source. Seven typical rocks in the Lenghu region of Qinghai Province, namely marble, monzonitic granite, syenite granite, diorite, gabbro, granite porphyry, and granodiorite, were classified and studied. A greedy search algorithm was introduced into the feature selection process of the XGBoost model and combined with three traditional single extraction indexes to select the feature variables. An improved two-layer XGBoost model was proposed to identify lithological information. The improved model extracted significantly fewer feature variables, which lowered its computational complexity and effectively improved the accuracy, precision, recall, and F1 score of its predictions. The GREED-GFC method, obtained by adding the greedy search algorithm to the first layer of the two-layer XGBoost model, showed advantages over traditional feature selection methods in the testing experiments. In multi-group tests, the model built with this method showed the smallest fluctuation in prediction accuracy, the highest lithology classification accuracy, and the smallest number of extracted features. The learner based on the improved two-layer XGBoost model had the best stability. In the comparison of multiple machine learning models, the two-layer XGBoost model performed well in lithology classification, and its evaluation indexes were higher than those of the traditional machine learning models. In the analysis of multiple groups with different proportions of training samples, the four indicators of the two-layer XGBoost model changed only slightly, which indicates that the improved model is robust and adapts well to small sample datasets.
Field verification results indicated that the two-layer XGBoost model was reliable and that the accuracy of its classification results was higher than that of the existing geological data. The new rock types identified by the model can effectively supplement the spatial distribution accuracy of the existing geological data.