1. Introduction
Mineral recognition is a basic yet important task in geological surveys. It not only enriches the map of mineral resources on the earth, but can also be used to estimate the hidden mining volume and economic potential of minerals, providing geological information for subsequent mineral exploration. Traditionally, mineral recognition is professional work that distinguishes minerals according to shape, optical properties, and mechanical properties, requiring rich knowledge and experience or special equipment. However, this process depends heavily on human experience and is inefficient and costly. Recently, with the rapid development of artificial intelligence, a considerable number of methods have been proposed to solve geological problems in a smarter and more convenient way by using an artificial neural network (ANN) [1,2,3,4,5,6,7,8].
According to the difference in input images, the current research for mineral recognition methods with ANN can be organized into three groups: Microscopic Image-Based methods, Raman Spectra Image-Based methods, and Photo Image-Based methods.
(1) Microscopic Image-Based methods: Baykan and Yılmaz [9] employed a multilayer perceptron neural network (MLPNN) with one hidden layer for mineral classification, which is based on the RGB data of plane-polarized and cross-polarized microscope images, and achieved 94.07% average accuracy for five minerals. An approach using clustering algorithms and morphological analysis to determine colors and shapes for computing the composition of rocks from micrographs was proposed in [10], without reporting the number of mineral types or the accuracy. Izadi et al. [11] presented a two-level cascade neural network classification approach, which first recognized the minerals based on color parameters and then identified those minerals rejected at the first level based on texture features under plane- and cross-polarized light. An overall accuracy of 93.81% was obtained for the recognition of 23 test minerals. Maitre et al. [12] proposed an approach to automate mineral grain recognition from optical microscope images, which relied on data processing steps such as superpixel generation, feature extraction, and data cleaning, and on machine-learning algorithms that classify vectors of mineral features, identifying eight kinds of mineral particles with an accuracy of approximately 90%. A complex ensemble model was proposed in [13], which used Inception-v3 [14] to extract features of the microscopic images, selected logistic regression (LR), support vector machine (SVM), and multilayer perceptron (MLP) as the base models, and chose the LR model as the final meta-classifier. The combined model recognized four minerals with an accuracy of 90.9%. The work of [15] used five machine-learning algorithms to classify the scanning electron microscope images of 12 minerals and reached accuracies of 86–92%. For all the above methods, the acquisition of microscopic images is equipment dependent and inconvenient. Additionally, the number of recognized mineral types is small due to the limited samples [9,11,12,13,15].
(2) Raman Spectra Image-Based methods: Raman spectroscopy has been widely used as a mature auxiliary tool for mineral recognition. An artificial neural network was trained using Raman spectra of minerals to distinguish six minerals in igneous rocks, achieving 83% accuracy [16]. The work of [17] proposed full-spectrum matching algorithms that reach 96.5% average accuracy for six minerals without model training. Due to the lack of a large-scale Raman spectrum image set, it is difficult to train learning-based methods, and at test time it is also hard to obtain the Raman spectrum of a sample in the wild. Mineral recognition based on Raman spectra is therefore difficult to apply widely.
(3) Photo Image-Based methods: Compared to the above two groups, mineral photo images can be obtained conveniently due to the popularity of digital cameras and smartphones. Therefore, mineral recognition based on mineral photo images has attracted increasing attention. Recently, Zeng et al. [18] combined mineral photos with Mohs hardness to achieve a Top1 accuracy of 90.6% for 36 common minerals using a deep neural network, in which the Mohs hardness of the corresponding mineral was input into the model manually to assist the image recognition. Without the Mohs information, the Top1 accuracy of the model dropped drastically to 78.3%. Although useful, the reliance on Mohs hardness reduces the adaptability and universality of the algorithm. Liu et al. [19] extracted the texture features of images using the Inception-v3 model [14] and established a color model with the K-means algorithm, and then combined the two models to obtain a comprehensive recognition model, which achieved a Top1 accuracy of 74.2% for 12 minerals. Peng et al. [20] also used the Inception-v3 model but combined the softmax loss with the center loss to identify 19 minerals. This method obtained a Top1 accuracy of 86%, 5 percentage points higher than that of the softmax loss alone. Although the center loss improves the recognition accuracy by reducing the intra-class distance, it greatly slows the convergence of the model and makes training more difficult.
To address the above issues in mineral photo image recognition, namely the reliance on additional geological information [18], incomplete feature extraction [19], and the training difficulty introduced by the modified loss function [20], this paper proposes a deep learning model based on feature fusion and online hard sample mining that uses mineral photos only. Here, ResNet-50 is used to extract features of the mineral images, and the low-level features are then merged with the high-level features to improve the model performance, because low-level features such as color and texture are important for mineral recognition. Meanwhile, a weighted top-k loss is also proposed to exploit the available information of both hard and easy samples, further improving the recognition accuracy.
The remainder of this paper is organized as follows. Section 2 presents the proposed method in detail. Section 3 describes the mineral dataset and experimental results, followed by the experimental analysis in Section 4. Finally, we conclude in Section 5.
2. Method
In this section, a mineral photo-recognition model based on the deep residual network ResNet-50, combining multi-resolution feature fusion and online hard sample mining weighted top-k loss, is designed.
2.1. Backbone Network ResNet-50
In the past years, many excellent backbone networks have been proposed, for example LeNet [21], AlexNet [22], VGGNet [23], Inception [14,24], ResNet [25], and EfficientNet [26]. By introducing a residual structure, ResNet can improve network performance by increasing the number of layers while avoiding exploding or vanishing gradients. It has become one of the most widely used convolutional neural network (CNN) backbones for feature extraction. The structure of ResNet-50 is shown in Table 1. The input image is resized to 224 × 224 and then goes through a convolution layer (conv1) and a max pooling operation with a stride of 2. Features are subsequently extracted through four residual layers (Layer1, Layer2, Layer3, and Layer4). Next, a global average pooling (GAP) operation is conducted to obtain a 1 × 1 × 2048 feature, which is then flattened and input into a fully connected (FC) layer, and the probabilities of the mineral types are output.
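The forward pass described above can be followed as a sequence of feature-map shapes. The sketch below is only illustrative: the strides and channel counts are those of the standard ResNet-50 architecture, not values taken from Table 1, and the function name `resnet50_shape_trace` is hypothetical. The default of 22 output classes matches the number of mineral types used in this paper.

```python
def resnet50_shape_trace(input_size=224, num_classes=22):
    """Return (stage, spatial_size, channels) tuples for a ResNet-50 forward pass.

    Assumes the standard ResNet-50 layout; padding is chosen so that each
    stride-2 stage halves the spatial size exactly for a 224 x 224 input.
    """
    trace = []
    size = input_size

    # conv1: 7x7 convolution with stride 2 and 64 output channels.
    size //= 2
    trace.append(("conv1", size, 64))

    # 3x3 max pooling with stride 2 (channel count unchanged).
    size //= 2
    trace.append(("maxpool", size, 64))

    # Four residual stages: Layer1 keeps the resolution, the others halve it.
    stages = [
        ("Layer1", 1, 256),
        ("Layer2", 2, 512),
        ("Layer3", 2, 1024),
        ("Layer4", 2, 2048),
    ]
    for name, stride, channels in stages:
        size //= stride
        trace.append((name, size, channels))

    # Global average pooling collapses each channel to a single value.
    trace.append(("GAP", 1, 2048))

    # The fully connected layer maps the 2048-d vector to class scores.
    trace.append(("FC", 1, num_classes))
    return trace


if __name__ == "__main__":
    for stage, size, channels in resnet50_shape_trace():
        print(f"{stage:8s} {size:3d} x {size:3d} x {channels}")
```

The trace makes the resolution gap between stages explicit: under these assumptions Layer2 produces 28 × 28 maps with 512 channels, whereas Layer4 produces 7 × 7 maps with 2048 channels.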
2.2. Feature Fusion
The convolution operation and the design of convolutional neural networks mean that the features extracted in the network are hierarchical. That is to say, the low layers respond to basic features, such as color and edges. As the number of layers increases, the complexity of the features increases, and more class-specific features are extracted. Generally, high-level features are used for the classification of different kinds of objects, such as the objects in the ImageNet dataset. However, due to differences in chemical composition, crystallization, and chemical properties, minerals present a variety of colors, crystal forms, hardness, and luster, which appear intuitively as different colors, shapes, transparencies, and textures in mineral photo images. So, low-level features such as colors, shapes, and textures remain important for mineral recognition, yet they are ignored by existing methods.
In this paper, a mineral recognition method fusing low-level and high-level features is proposed to complement the high-level features and increase recognition accuracy. In detail, mineral photo images are input into ResNet-50, which is pretrained on the ImageNet dataset, to obtain the features of the four residual layers. Layer3 and Layer4 output the high-level features, and the features of Layer4 are usually used as the most discriminative features to classify objects since they have the largest receptive field and the richest semantic information. The features from Layer1 and Layer2 are often considered low-level features. Since features from different layers have different spatial sizes and numbers of channels, GAP is performed before feature fusion to reduce the features to a uniform height and width (1 × 1). Then, the low-level features and the high-level features are merged via concatenation to obtain the fused features. The fused features thus contain not only rich high-level semantic information but also abundant low-level detail. Finally, the model output is obtained by passing the fused features through the fully connected (FC) layer. Figure 1 displays the example model when the features from Layer2 and Layer4 are fused.
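The fusion step can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the paper: feature maps are represented as nested lists in channel × height × width order, the helper names are hypothetical, and the Layer2 and Layer4 shapes (512 × 28 × 28 and 2048 × 7 × 7) are those of a standard ResNet-50 with a 224 × 224 input.

```python
def global_average_pool(feature_map):
    """Reduce a C x H x W feature map (nested lists) to a length-C vector."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def fuse_features(low_level, high_level):
    """Pool both maps to 1 x 1 and concatenate them along the channel axis."""
    return global_average_pool(low_level) + global_average_pool(high_level)


def constant_map(channels, size, value):
    """Hypothetical helper: a C x size x size map filled with one value."""
    return [[[value] * size for _ in range(size)] for _ in range(channels)]


# Toy stand-ins for the Layer2 and Layer4 outputs (assumed shapes).
layer2 = constant_map(512, 28, 1.0)
layer4 = constant_map(2048, 7, 2.0)

fused = fuse_features(layer2, layer4)

if __name__ == "__main__":
    print(len(fused))  # 512 + 2048 channels
```

Because GAP removes the spatial dimensions, maps of different resolutions can be concatenated directly, and the FC layer then operates on a 2560-dimensional vector in this Layer2 plus Layer4 configuration.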
2.3. Loss Function
The loss function defines the goal of network learning. Usually, the proportion of simple samples, which display the features of the minerals clearly, is much larger than that of hard ones, whose features appear confusing due to inappropriate imaging or the indistinct characteristics of the mineral itself. So, in the training process, the network can easily learn the obvious features from the simple samples, while further mining of the features of hard samples is neglected because the small number of hard samples contributes little to the total loss. As a result, once network learning reaches a certain level, the existing loss functions cannot force the network to learn the implicit information contained in the hard samples any further, which restricts the improvement of the network performance.
For mineral recognition, certain minerals display large intra-class differences and small inter-class differences due to subtle visual distinctions, which results in misrecognition. Paying more attention to these hard samples can therefore help the model achieve higher accuracy. Top-k loss [27] was proposed by Zhang to perform online hard sample mining (OHSM) for face detection. The core of the OHSM algorithm is to select hard samples with large loss values as training samples to learn the network parameters. Here, we consider the top-k loss, denoted as $L_{topk}$, for mineral photo recognition.
Let $l_i$ denote the softmax loss value of the $i$-th sample in a batch, which can be described as (1), where $z_i$ represents the $i$-th sample's output from the model, which is a $C$-dimensional vector with one entry $z_{i,j}$ per class, and $y_i$ is the index of the sample's true class. The softmax loss of a batch is the average of all losses in the batch, shown as (2), where $N$ is the batch size. Denote $l_1, \dots, l_N$ as $l'_1, \dots, l'_N$ in descending order like (3). $L_{topk}$ is defined as the average of the top $kN$ loss values, as shown in Equation (4), where $k$ is a percentage.
$$l_i = -\log \frac{\exp(z_{i,y_i})}{\sum_{j=1}^{C} \exp(z_{i,j})} \tag{1}$$
$$L_{softmax} = \frac{1}{N} \sum_{i=1}^{N} l_i \tag{2}$$
$$l'_1 \ge l'_2 \ge \cdots \ge l'_N \tag{3}$$
$$L_{topk} = \frac{1}{kN} \sum_{i=1}^{kN} l'_i \tag{4}$$
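The per-sample softmax loss, the batch average, and the top-k average described above can be sketched as follows. The function names are hypothetical, and rounding the number of hard samples to the nearest integer (with a minimum of one) is an assumption, since the text does not say how a non-integer count is handled.

```python
import math


def softmax_loss(logits, label):
    """Softmax (cross-entropy) loss of one sample from its raw class scores."""
    # Subtract the maximum score before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def batch_softmax_loss(losses):
    """Average of all per-sample losses in a batch."""
    return sum(losses) / len(losses)


def top_k_loss(losses, k):
    """Average of the largest losses, keeping a fraction k of the batch.

    Assumption: the number of hard samples is round(k * N), at least 1.
    """
    if not losses:
        raise ValueError("losses must not be empty")
    if not 0 < k <= 1:
        raise ValueError("k must lie in (0, 1]")
    n_hard = max(1, int(round(k * len(losses))))
    hardest = sorted(losses, reverse=True)[:n_hard]
    return sum(hardest) / n_hard


if __name__ == "__main__":
    batch = [0.1, 2.0, 0.5, 1.5]          # toy per-sample losses
    print(batch_softmax_loss(batch))      # mean over all four samples
    print(top_k_loss(batch, 0.5))         # mean over the two hardest samples
```

With a fraction of 1 the top-k loss coincides with the batch softmax loss, and smaller fractions concentrate the gradient on the samples that the network currently handles worst.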
In the top-k loss, only the largest loss values in each training batch, namely those of the top fraction $k$ of samples, which are considered hard samples, are selected for learning parameters. The gradients are then computed from these hard samples in backward propagation, which ensures that the network pays more attention to hard samples and effectively mines more implicit information, improving the performance of the network. Although top-k loss is effective in face detection, a task that relies mainly on semantic information, mineral recognition relies heavily on low-level features, which are clearly presented in the simple samples. To utilize all samples effectively and balance the roles of hard and simple samples, a weighted top-k loss is proposed in this paper, which ensures that the network pays attention to the hard samples while still taking simple samples into account, improving the performance of the network. The weighted top-k loss function is shown in Equation (5), where the losses ranked in the top fraction $k$ belong to hard samples and the remaining ones to easy samples, and a weight coefficient balances the two groups. When the weight coefficient removes the contribution of the easy samples, the weighted top-k loss is the top-k loss, and when it weights hard and easy samples equally, the weighted top-k loss is the softmax loss.
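The weighting idea can be illustrated with a short sketch. The exact form of Equation (5) is not assumed here. Instead, the code uses a normalised weighted mean, which is one form consistent with the two limiting cases stated above: the top-k loss when easy samples receive zero weight, and the softmax loss when all samples are weighted equally. The function name, the parameter name `easy_weight`, and the rounding of the hard-sample count are all assumptions made for illustration.

```python
def weighted_top_k_loss(losses, k, easy_weight):
    """Weighted mean of per-sample losses that emphasises hard samples.

    Assumed form (not necessarily the paper's Equation (5)): hard samples get
    weight 1, easy samples get weight `easy_weight` in [0, 1], and the sum is
    divided by the total weight.
    """
    if not losses:
        raise ValueError("losses must not be empty")
    if not 0 < k <= 1:
        raise ValueError("k must lie in (0, 1]")
    if not 0 <= easy_weight <= 1:
        raise ValueError("easy_weight must lie in [0, 1]")

    n = len(losses)
    # Assumption: the hard-sample count is round(k * N), clamped to [1, N].
    n_hard = min(n, max(1, int(round(k * n))))

    ordered = sorted(losses, reverse=True)
    hard, easy = ordered[:n_hard], ordered[n_hard:]

    total = sum(hard) + easy_weight * sum(easy)
    total_weight = n_hard + easy_weight * len(easy)
    return total / total_weight


if __name__ == "__main__":
    batch = [0.1, 2.0, 0.5, 1.5]                  # toy per-sample losses
    print(weighted_top_k_loss(batch, 0.5, 0.0))   # reduces to the top-k loss
    print(weighted_top_k_loss(batch, 0.5, 1.0))   # reduces to the batch mean
    print(weighted_top_k_loss(batch, 0.5, 0.5))   # in between the two
```

Intermediate weights keep a gradient signal from the easy samples, whose clear low-level features matter for minerals, while still letting the hard samples dominate the loss.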
5. Conclusions
This paper proposes a recognition method for common minerals using only mineral photos. In this method, we fuse the multi-resolution features extracted by ResNet-50 and propose a weighted top-k loss function to balance the importance of hard and easy samples. Since low-level features, such as color and shape, are important for mineral recognition, the fused features supplement the high-level information that suffers from missing details, and the weighted top-k loss function better balances the roles of hard and easy samples in network learning. Together, they effectively improve the accuracy of mineral photo recognition. On a dataset of 14,986 images of 22 common minerals, the experimental results show that the proposed method achieves a Top1 accuracy of 88.01% and an mAP of 87.15%, surpassing the Top1 accuracy of Inception-v3, EfficientNet-B0, and ResNet-50 with softmax loss by margins of 1.88%, 1.29%, and 0.86%, respectively. The experimental analysis illustrates the strong feature extraction performance of the proposed method and indicates that, aside from improving the algorithm, collecting more diverse samples with clear and discriminative characteristics is a feasible and effective way to increase recognition accuracy.