Intelligent Identification for Rock-Mineral Microscopic Images Using Ensemble Machine Learning Algorithms

Identifying rock-mineral microscopic images is an important task in geological engineering. Microscopic mineral image identification, which is usually carried out in the lab, is tedious and time-consuming. Deep learning and convolutional neural networks (CNNs) provide a way to analyze mineral microscopic images efficiently and automatically. In this research, a transfer learning model for mineral microscopic images is established based on the Inception-v3 architecture. Image features of four minerals, K-feldspar (Kf), perthite (Pe), plagioclase (Pl), and quartz (Qz or Q), are extracted using Inception-v3. Based on these features, six machine learning methods, logistic regression (LR), support vector machine (SVM), random forest (RF), k-nearest neighbors (KNN), multilayer perceptron (MLP), and Gaussian naive Bayes (GNB), are adopted to establish the identification models. The results are evaluated using 10-fold cross-validation. LR, SVM, and MLP perform best among the single models, with accuracies of about 90.0%, which shows that they are well suited to high-dimensional feature analysis. These three models are therefore selected as the base models in model stacking, with LR as the meta classifier for the final prediction. The stacking model achieves 90.9% accuracy, which is higher than all the single models, indicating that model stacking effectively improves model performance.


Introduction
At present, image pattern recognition is widely used in image data analysis, especially in the earth sciences. The analysis of rock microscopic images and the classification and identification of minerals are fundamental tasks in geological research. As the first step in studying a rock-mineral sample in the lab, microscopic identification is an important way to determine the rock-mineral type and properties. It is also the basis of geochemical analysis, such as major, minor, and isotope element tests. To date, rock-mineral microscopic image identification has been conducted manually by engineers and scholars, so it depends on the operator's experience and skill. It is inefficient and time-consuming, and the result is determined largely by the operator's knowledge. Incorrect rock-mineral recognition impairs subsequent work, which may lead to wasted resources and economic loss. It is therefore crucial to develop efficient, robust, and objective automatic recognition techniques for rock-mineral microscopic images.
Researchers have combined computer vision and machine learning to study the automatic classification and identification of rock-mineral microscopic images. Singh et al. [1] extracted 27 image features. We integrated the deep learning model and machine learning methods to undertake a comprehensive analysis of the rock-mineral OPS images. The results indicate that the Inception-v3 model can extract effective features. The machine learning algorithms based on the perceptron prove to be outstanding. Model stacking is also proposed to further improve model performance.

Methodology
In this research, transfer learning with a deep learning model is combined with several machine learning algorithms to identify rock-mineral microscopic images. The pre-trained deep learning model is a convolutional neural network (CNN). The features are generated by transfer learning using the Inception-v3 model. Based on the extracted features, logistic regression (LR), support vector machine (SVM), random forest (RF), k-nearest neighbors (KNN), multilayer perceptron (MLP), and Gaussian naive Bayes (GNB) are applied to establish the models. By comparing the performance of the different models, the best-performing models are selected as the base models in model stacking. The schematic for rock-mineral microscopic image identification is shown in Figure 1.

Convolutional Neural Network
The deep learning model is trained on a data set using a convolutional neural network architecture. A convolutional neural network (CNN) is a kind of feed-forward neural network designed for identifying unstructured data, such as images, text, and sound. It usually consists of convolutional layers and pooling layers. Unlike in a fully connected layer, each neuron in a convolutional layer is connected only to certain neurons in the previous layer, which form a small square area in the image pixel matrix. The size of this small area is called the receptive field, which spans the height and width of the image. There are no separate parameters for image depth; because color information is also significant in model training, the convolution is applied across the whole color space.
The neurons in one convolutional layer share the same weights to recognize certain patterns from the previous layer. These patterns should have translation invariance, which means the features should be independent of their coordinates in the image. As a result, all the neurons that use the same kernel share the same parameters, which is called parameter sharing. Since each kernel can recognize only one pattern, there are several kernels in one layer to identify multiple patterns in different places of the image. The pooling layer is another significant concept in CNNs; it decreases the feature dimension and the computation cost. Like convolutional layers, pooling layers are connected to a local region of the previous layer. Unlike convolutional layers, pooling layers are determined by a fixed rule rather than by parameters learned in the model training process. In CNNs, max pooling and mean pooling are commonly used. The computation in each neuron of the CNN is as follows:
$$ f(x) = \mathrm{act}\left(\sum_{i,j} \theta_{ij}\, x_{ij} + b\right) \tag{1} $$
where $f(x)$ is the output, $\mathrm{act}$ is the activation function, $\theta$ is the weight matrix, $x_{ij}$ is the input, and $b$ is the bias.
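The parameter sharing and pooling described above can be illustrated with a minimal sketch. It assumes a single-channel image, stride 1, no padding, and a ReLU activation; the toy image, the kernel values, and the function names are hypothetical and are not taken from the Inception-v3 model.

```python
def relu(v):
    """A common choice for the activation function act."""
    return max(0.0, v)


def conv2d_valid(image, kernel, bias=0.0, act=relu):
    """Slide one shared kernel over a 2-D image (stride 1, no padding).

    Every output neuron computes act(sum(theta * x) + b) with the SAME
    kernel weights, which is the parameter sharing described in the text.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            s = sum(kernel[i][j] * image[r + i][c + j]
                    for i in range(kh) for j in range(kw))
            row.append(act(s + bias))
        out.append(row)
    return out


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; it has no trainable parameters."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


# Toy 5 x 5 image with a bright vertical stripe (hypothetical data).
image = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]
kernel = [[-1, 1], [-1, 1]]  # responds to a dark-to-bright vertical edge
feature_map = conv2d_valid(image, kernel)  # 4 x 4 feature map
pooled = max_pool2d(feature_map)           # 2 x 2 after pooling
```

The kernel fires only where the stripe begins, regardless of the row, and pooling halves each spatial dimension while keeping the strongest response.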

Inception-v3 and Transfer Learning
Compared to GoogLeNet (Inception-v1), Inception-v3 [17] has made substantial progress. It integrates all the updates in Inception-v2 and adds some new improvements. In model design, the SGD (stochastic gradient descent) optimizer is replaced by RMSProp (root mean square propagation). In the classifier, LSR (label-smoothing regularization) is added after the fully connected layer. In the convolutional layer, the 7 × 7 kernel is replaced by a 3 × 3 kernel. Normalization is also used, and regularization is added to the loss function to avoid overfitting. Figure 2 shows the compressed view of the Inception-v3 model. The model begins with 3 convolutional layers and 2 pooling layers, followed by 2 convolutional layers and 1 pooling layer, and finally 11 mixed layers, the dropout layer, the fully connected layer, and the softmax layer.
The operations of convolution and padding are conducted repeatedly on the image in each layer. Table 1 shows the specific process of convolution and padding, including the data transformation at each step. Before the softmax layer, the image is converted to a 2048-dimension vector, as shown in Figure 1. This high-dimension vector encodes several geometric and optical features, some of which are presented in Section 4.

In most machine learning applications, the model is established from scratch even when an existing model has been trained on similar data. Rebuilding the model each time is a waste of resources. Considering the similarity of the different models, transfer learning can be applied to establish a new model using the knowledge already obtained. Meanwhile, deep learning model training faces two problems: the data is often insufficient and the computation is slow. With a pre-trained model, however, a well-performing model can be trained using little data. Considering the relationship between the source domain and the target domain, the new model can be established using transfer learning.
In retraining the deep learning model, the extracted features are adopted to train the new model [29]. As shown in Figure 2, the images are the input, and all the convolutional and pooling layers are reused in the new model training. In other words, Inception-v3 is used as a feature extractor, and the extracted 2048-dimension features are fed to multiple machine learning algorithms rather than to the softmax layer. The new model is then established using model stacking. The whole process is shown in Figure 2. In our research, images of K-feldspar (Kf), perthite (Pe), plagioclase (Pl), and quartz (Qz or Q) are employed.

Machine Learning Algorithms
Recently, machine learning has been increasingly used in classification and pattern recognition. In our research, the input data are complex non-linear features generated using the Inception-v3 model. LR, SVM, RF, KNN, MLP, and GNB are selected to establish the identification model. LR, SVM, and MLP are based on the perceptron; RF is a tree model; KNN is a non-parametric model; and GNB is a probability-based model. Training these different kinds of machine learning methods on the high-level features (deep model features) helps in searching for the best-performing model.

Logistic Regression (LR)
Logistic regression is a generalized linear model. A linear model first combines the input variables, and the nonlinear sigmoid function then maps the result to a probability. The LR model can be expressed as Equation (2):
$$ h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}} \tag{2} $$
where $h_\theta(x)$ is the probability and $\theta^{T} x$ is the linear combination of the related variables. The probability ranges from 0 to 1 along an S-shaped curve, and the decision boundary $h_\theta(x) = 0.5$ (that is, $\theta^{T} x = 0$) splits the feature space into two parts.
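As a small numerical illustration of the logistic model, the sketch below evaluates the sigmoid of a linear combination for one sample. The weights and the sample are hypothetical values chosen for readability, and the two-class form is shown for simplicity; the four-mineral problem in this paper would need a multi-class extension, which is not covered here.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """h_theta(x): sigmoid of the linear combination theta^T x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


theta = [0.5, -1.0, 2.0]  # hypothetical weights; first entry is the intercept
x = [1.0, 2.0, 1.0]       # leading 1.0 multiplies the intercept
p = predict_proba(theta, x)       # z = 0.5 - 2.0 + 2.0 = 0.5
label = 1 if p >= 0.5 else 0      # threshold at the decision boundary
```

A score of zero maps to a probability of exactly 0.5, which is why the boundary between the two predicted classes is the hyperplane where the linear combination vanishes.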

Support Vector Machine (SVM)
The support vector machine (SVM) was proposed by Cortes and Vapnik [30]. Suppose that the training data are $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i$ is an input vector, $n$ is the number of training samples, and $y_i \in \{-1, +1\}$ is the label. In classification with SVM, the target is to find the hyperplane that maximizes the margin between the two classes, which is determined by the support vectors. The objective function and the constraints are shown in Equation (3):
$$ \min_{w,\, b,\, \xi} \ \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i\,(w^{T} x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n \tag{3} $$
where $w$ is the adjustable weight vector, $\|w\|$ is its Euclidean norm, $b$ is the bias, $\xi_i$ is the slack variable, which is used to relax the constraints, and $C$ is the penalty parameter, which makes a trade-off between margin and misclassification.
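To make the roles of the slack variables and the penalty parameter concrete, the following sketch evaluates the standard soft-margin objective, half the squared norm of the weights plus C times the total slack, for a fixed linear hyperplane. The data, the weights, and the function name are hypothetical; this only evaluates the objective and is not a solver.

```python
def svm_objective(w, b, X, y, C):
    """Return (objective value, slack per sample) for a linear soft-margin SVM.

    The smallest slack that satisfies y_i (w . x_i + b) >= 1 - xi_i with
    xi_i >= 0 is max(0, 1 - y_i (w . x_i + b)).
    """
    slacks = [
        max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
        for xi, yi in zip(X, y)
    ]
    margin_term = 0.5 * sum(wj * wj for wj in w)
    return margin_term + C * sum(slacks), slacks


# Hypothetical two-class data on one informative axis.
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
y = [1, 1, -1]
w, b, C = [1.0, 0.0], 0.0, 1.0

value, slacks = svm_objective(w, b, X, y, C)
```

Only the second sample lies inside the margin, so it alone contributes slack; increasing C would penalize that violation more heavily relative to the margin term.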

Random Forest (RF)
Random forest (RF), proposed by Breiman [31], combines multiple decision trees. Compared to traditional bagging, the base learner in RF is the decision tree, and random attribute selection is introduced into the training of each tree. When an input vector $x$ is fed to the RF model, $K$ decision trees $\{T_k(x)\}_{k=1}^{K}$ are generated independently of each other. The RF model is expressed in Equation (4). Each tree makes a prediction, and voting is applied to make the decision: the label predicted by the majority of the decision trees is regarded as the final prediction.
To reduce the correlation between the decision trees, bootstrap aggregating is adopted. Each decision tree is generated on a different training subset, which improves the generalization and robustness of the model.
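The two ingredients named above, bootstrap sampling of the training set and majority voting over the trees, can be sketched in a few lines. The tree training itself is omitted, and the per-tree labels below are hypothetical placeholders rather than outputs of a fitted model.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement (one training subset per tree)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
subset = bootstrap_sample(10, rng)

votes = ["Qz", "Kf", "Qz", "Pl", "Qz"]  # hypothetical labels from 5 trees
final_label = majority_vote(votes)
```

Because sampling is done with replacement, each subset typically repeats some samples and omits others, which is what makes the trees differ from one another.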

K-Nearest Neighbors (KNN)
K-nearest neighbors (KNN) is a simple, non-parametric algorithm. It is a lazy algorithm that does not build a generalized model from the training data. The steps in KNN are as follows:
• Calculate the distance between the test point and each training point;
• Sort the distances from smallest to largest;
• Select the K points with the smallest distances;
• Count the frequency of each class among the K points;
• Return the most frequent class among the K points as the prediction.
If the data have n attributes, the distance between two samples can be calculated based on these attributes. The Euclidean distance is usually selected. For example, given two samples $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$, the Euclidean distance between X and Y is shown in Equation (5):
$$ d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^{2}} \tag{5} $$
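The KNN procedure with the Euclidean distance can be written directly from its definition. The sketch below is a minimal version for illustration; the two-dimensional points and their labels are hypothetical, whereas the features in this study are 2048-dimensional.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority label among its k nearest training points."""
    # Compute all distances and sort them from smallest to largest.
    distances = sorted(
        (euclidean(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Keep the k closest points and count the labels among them.
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical training points and labels.
train_X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]]
train_y = ["Qz", "Qz", "Pl", "Pl", "Pl"]

prediction = knn_predict(train_X, train_y, [0.2, 0.1], k=3)
```

No model is fitted in advance: all the work happens at prediction time, which is why KNN is described as a lazy algorithm.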

Multilayer Perceptron (MLP)
Multilayer perceptron (MLP) is a computing network inspired by biological neural networks. Generally, an MLP consists of three kinds of layers: the input layer, the hidden layer, and the output layer, as shown in Figure 3. The numbers of neurons in the input, hidden, and output layers, the network architecture, and the learning rate are the parameters to be selected when developing an MLP model. The MLP model is trained with a set of known input and output data. During training, the weights and biases are adjusted to reduce the error between the network output and the target output, until the network output matches the desired output closely enough. The training process terminates automatically when the error falls below a threshold or the maximum number of epochs is exceeded.

Figure 3. Structure of the MLP: input layer, hidden layers, and output layer.

Gaussian Naive Bayes (GNB)
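Gaussian naive Bayes (GNB) is a supervised learning method based on Bayes' theorem. GNB assumes that the features are independent of each other given the label. For the label $y$ and the features $x_1$ to $x_n$, the probability relationship can be expressed as follows:
$$ P(y \mid x_1, \ldots, x_n) = \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)} \tag{6} $$
Because the features are assumed to be independent of each other, Equation (6) can be expressed as Equation (7):
$$ P(y \mid x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)} \tag{7} $$
$P(x_1, \ldots, x_n)$ is a constant for a given input, thus we can analyze Equation (8):
$$ \hat{y} = \arg\max_{y} \ P(y) \prod_{i=1}^{n} P(x_i \mid y) \tag{8} $$
In GNB, the features obey the Gaussian distribution, as shown in Equation (9):
$$ P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}} \exp\left(-\frac{(x_i - \mu_y)^{2}}{2\sigma_y^{2}}\right) \tag{9} $$
where $\mu_y$ and $\sigma_y^{2}$ are the mean and variance of feature $x_i$ within class $y$.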
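The Gaussian naive Bayes rule can be sketched as follows: estimate a prior, a mean, and a variance per class and feature, then pick the class with the largest posterior. Log-probabilities are used to avoid numerical underflow, and a small constant is added to each variance; both are implementation choices assumed here, not details from the paper. The data and function names are hypothetical.

```python
import math
from collections import defaultdict


def fit_gnb(X, y):
    """Estimate (prior, per-feature means, per-feature variances) per class."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # 1e-9 keeps the variance strictly positive (an assumed safeguard).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gnb(model, x):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda label: log_posterior(*model[label]))


# Hypothetical two-feature samples for two mineral classes.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.4], [2.8, 0.6]]
y = ["Qz", "Qz", "Qz", "Kf", "Kf", "Kf"]

model = fit_gnb(X, y)
prediction = predict_gnb(model, [1.1, 2.0])
```

The evidence term in the denominator is never computed, since it is the same for every class and does not change which class attains the maximum.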

K-Fold Cross-Validation
In k-fold cross-validation, the original data is randomly divided into k equal-sized subsamples. One subsample is set as the validation data, and the remaining k − 1 subsamples are taken as the training data. The cross-validation process is repeated k times (k folds), with each of the k subsamples used as the validation data exactly once. The results from all folds together give a comprehensive evaluation. The advantage of this method is that all data are used for both training and validation, and each sample is used for validation exactly once. K-fold cross-validation is therefore more objective than simple cross-validation. The process is shown in Figure 4, where the blue fold is set as validation. The mean value, E, over the k folds is taken as the final evaluation.
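The fold construction and the averaging of the per-fold results can be sketched as below. The sample count of 481 matches the image total reported in this paper; the shuffling seed, the function names, and the scoring callback are hypothetical. Since 481 is not divisible by 10, the folds are only approximately equal in size.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(score_fn, n_samples, k):
    """Return (mean score E, per-fold scores); score_fn(train, val) -> float."""
    scores = [score_fn(train, val)
              for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / k, scores


splits = list(k_fold_indices(481, 10))
```

In practice `score_fn` would train a model on the training indices and return its accuracy on the validation indices; the first returned value is then the mean E used as the final evaluation.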

Model Stacking
Model stacking [32] is a model ensemble method that differs from bagging and boosting. Two-stage training is conducted to establish the model. The process of model stacking is as follows:
• The base models are trained on the same dataset using k-fold cross-validation (usually k = 5 or 10);
• The m base models with the best performance are selected to make predictions, and k-fold cross-validation is also employed.
There are thus two stages in model stacking. In the first stage, m base classification models are selected to build new features. The robustness of the new features as training data is ensured by the adoption of 5-fold cross-validation. In the second stage, LR is commonly chosen as the meta-model to build the model and make the final prediction. The process of model stacking is shown in Figure 5.
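The first stage of stacking, in which each base model predicts only on data it was not trained on, can be sketched as follows. This is an illustration under simplifying assumptions: the base learners are hypothetical nearest-class-mean classifiers rather than LR, SVM, and MLP, the folds are assigned without shuffling, and the meta-features are raw predicted labels. The second stage, training the meta-model on these features, is not shown.

```python
def out_of_fold_features(base_models, X, y, k=5):
    """Build stage-one meta-features.

    base_models: list of callables fit(train_X, train_y) -> predict(row).
    Each sample's meta-features come from models that never saw it in training.
    """
    n = len(X)
    meta = [[None] * len(base_models) for _ in range(n)]
    for fold in range(k):
        val_idx = list(range(fold, n, k))
        held_out = set(val_idx)
        train_X = [X[i] for i in range(n) if i not in held_out]
        train_y = [y[i] for i in range(n) if i not in held_out]
        for m, fit in enumerate(base_models):
            predict = fit(train_X, train_y)
            for i in val_idx:
                meta[i][m] = predict(X[i])
    return meta


def make_centroid_model(feature):
    """Hypothetical base learner: nearest class mean on a single feature."""
    def fit(train_X, train_y):
        sums, counts = {}, {}
        for row, label in zip(train_X, train_y):
            sums[label] = sums.get(label, 0.0) + row[feature]
            counts[label] = counts.get(label, 0) + 1
        means = {label: sums[label] / counts[label] for label in sums}

        def predict(row):
            return min(means, key=lambda lab: abs(row[feature] - means[lab]))
        return predict
    return fit


# Hypothetical, well-separated two-class data.
X = [[0.1, 1.0], [0.2, 0.9], [0.0, 1.1], [0.3, 0.8], [0.1, 1.2],
     [1.0, 0.1], [0.9, 0.2], [1.1, 0.0], [0.8, 0.3], [1.2, 0.1]]
y = ["Qz"] * 5 + ["Pl"] * 5

base_models = [make_centroid_model(0), make_centroid_model(1)]
meta_X = out_of_fold_features(base_models, X, y, k=5)
```

Each row of `meta_X` holds one prediction per base model and would serve as the input to the meta-model in the second stage. Keeping the predictions out-of-fold prevents the meta-model from learning on outputs that the base models produced for their own training samples.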

Data Collection and Preprocessing
In the field, a rock commonly consists of an aggregate of two or more different minerals. In the identification of the rock-mineral thin section, the main task is to distinguish each mineral and recognize them under the microscope. In this research, 1-mm thin-section images of K-feldspar (Kf), perthite (Pe), plagioclase (Pl), and quartz (Qz or Q) are applied to establish the model; the microscope is shown in Figure 6. The mineral images can be obtained using the camera on the top of the microscope. Under the microscope carrier, there is a halogen lamp applied as the light source. The focus can be tuned using the knob beside the microscope. The four minerals exist together with other minerals. The target images will be cut from the whole thin section images. The target mineral image should cover most of the region in the cut image. Finally, there are a total of 481 images in all the classes. The specific information is listed in Table 2. In 10-fold cross-validation, the data is divided into training and validation datasets with a 90/10 split in each cycle. The four minerals' thin section images are shown in Figure 7. Nine samples in each group are presented.

Model Establishment and Evaluation
In the feature extraction of OPS microscopic rock-mineral images using Inception-v3, there is no special limitation on the raw data. The images are resized automatically to 299 × 299 × 3 before training, where 299 denotes the height and width of the image, and 3 denotes the three channels of RGB (red, green, and blue). The feature map visualization based on the image in Figure 7a is shown in Figure 8. It shows 3 feature maps of each layer in the first 15 layers and so presents the process of feature extraction. The features extracted in different layers are different. It can be seen that some features, such as chromatic aberration and texture, can be extracted using the Inception-v3 model.
Based on the extracted features, LR, SVM, RF, KNN, MLP, and GNB are adopted to establish the prediction models using Scikit-learn [33]. The data is split into training data and test data with a 90/10 ratio in every fold of cross-validation. Most of the algorithm parameters are left at their defaults. The selected parameters of each machine learning method are listed in Table 3.
Based on the model training parameters set in Table 3, all the models are evaluated using 10-fold cross-validation. Because 10-fold cross-validation is applied, the mean value and standard deviation of the accuracy can be used to present the model performance. The accuracy and its standard deviation for each model are summarized in Table 4. LR, SVM, and MLP perform well on the extracted features, with higher accuracy than the other models, at about 90.0%.
Since LR, SVM, and MLP are outstanding among all the models, the three models are employed as the base models in model stacking, with the same parameters as in Table 3. In the first training stage, the three models generate new features, and 5-fold cross-validation is applied to show objective accuracy; in the second training stage, the final model is established using LR. Model selection is significant to the performance of the stacking model: all the base models should perform well, because a base model with low accuracy has a negative influence on the final result. There are no special constraints on the new features after model selection. The evaluation results of the stacking model and the three single models are shown in Table 5.
Table 5 shows that the stacking model has the highest accuracy, and its accuracy standard deviation is relatively low. In the process of model stacking, the parameters of all the involved models are fixed. This indicates that model stacking can improve the prediction performance without further parameter tuning. Meanwhile, considering the accuracy standard deviation, the stacking model is relatively stable.

Conclusions
In this research, the deep learning model Inception-v3 is adopted to extract high-level features of quartz and feldspar microscopic images. Different machine learning methods and 10-fold cross-validation are adopted. The highest accuracy among the single models is 90.6% (SVM) and the lowest is about 78.0% (GNB), which indicates that the features extracted by Inception-v3 are effective. The deep learning models and the extracted features can be applied in smart identification of rock-mineral thin sections.
Furthermore, based on the extracted features, the six machine learning methods (LR, SVM, MLP, RF, KNN, and GNB) are applied to make predictions. The results show that LR, SVM, and MLP perform best, which means the methods based on the perceptron are effective on the high-level features. The accuracy of the three models is about 90.0%.
Since LR, SVM, and MLP have outstanding performance, the three models are selected as the base models for model stacking. The 5-fold cross-validation is also applied to evaluate the stacking model. The result shows that the stacking model performs better than the single models, with an accuracy of 90.9%. This indicates that model stacking is also effective for high-dimensional features.
In the future, more types of mineral samples should be added to train the model. Then, the microscope, the computer, and the model can be integrated to identify rock-mineral thin sections automatically.

Conflicts of Interest:
The authors declare no conflict of interest.