Deep-Learning-Based Automatic Mineral Grain Segmentation and Recognition

Abstract: A multitude of applications in engineering, ore processing, mineral exploration, and environmental science require grain recognition and the counting of minerals. Typically, this task is performed manually with the drawback of monopolizing both time and resources. Moreover, it requires highly trained personnel with a wealth of knowledge and equipment, such as scanning electron microscopes and optical microscopes. Advances in machine learning and deep learning make it possible to envision the automation of many complex tasks in various fields of science at an accuracy equal to human performance, thereby avoiding placing human resources into tedious and repetitive tasks, improving time efficiency, and lowering costs. Here, we develop deep-learning algorithms to automate the recognition of minerals directly from the grains captured from optical microscopes. Building upon our previous work and applying state-of-the-art technology, we modify a superpixel segmentation method to prepare data for the deep-learning algorithms. We compare two residual network architectures (ResNet 1 and ResNet 2) for the classification and identification processes. We achieve a validation accuracy of 90.5% using the ResNet 2 architecture with 47 layers. Our approach produces an effective application of deep learning to automate mineral recognition and counting from grains while also achieving a better recognition rate than reported thus far in the literature for this process and other well-known, deep-learning-based models, including AlexNet, GoogleNet, and LeNet.


Introduction
The advent of machine learning and automated classification has demonstrated the potential of technology in many fields, such as medical/health, legal, transportation, and mining [1][2][3][4][5][6]. For example, in exploration geology and mining, the process of identifying economic minerals has always been done manually, where a specialized and trained individual (a mineralogist) is required to identify mineral grains, such as gold, diamond indicator minerals, or sulfides, to discover new deposits [7][8][9]. This manual process has many limitations, including errors in identification and mineralogist fatigue, and it is time-consuming and, hence, costly [10]. Moreover, trained mineralogists are able to count around 60 grains per minute without distractions, and the result is a grain percentage rather than the more useful area percentage [11]. With new advances in technology, mineral grain identification and counting can now be performed using optical microscopy and scanning electron microscopy (SEM). However, even with SEM technology, the process remains expensive and time-consuming. A scanning electron microscope costs between USD 0.5 and USD 2 million and requires highly qualified personnel to operate.
Nevertheless, the process of identifying and counting mineral grains in sands or sediments is a crucial step for many mineral exploration and engineering projects, environmental studies, and mining (extractive metallurgy); for example, minerals can be economic (e.g., ore, building materials) or toxic (e.g., acid mine drainage production or release of toxic elements such as lead or arsenic) [12,13]. The use of certain sands in building materials can be a major problem, and identifying such grains is crucial in engineering projects [14,15]. In glacial sediments (tills) and soils, the number and count of certain mineral grains can indicate the proximity of a potential deposit; in diamond exploration, for instance, certain minerals, such as chromium-bearing pyrope or diopside, are used to confirm the presence of proximal diamond deposits [16,17].
Machine learning offers an alternative to manual identification. Recent advances in deep learning for image-based tasks offer the possibility of automating, at least partially, grain identification and counting, saving time and money. Moreover, as opposed to relying on SEM [18], the deep-learning-based approach can potentially be carried out in the field, in remote areas where mineral potential is high. Such a method would allow a more rapid identification of economic or toxic minerals and thus support effective environmental surveys [19]. This automated approach could work in real time to sort minerals moving along a conveyor [20]. A robot with specialized tools and equipment could be used to capture images of grains as it explores the terrain [21]. If images are tagged with a location, real-time processing is not obligatory, which simplifies the challenge of embedding a deep-learning model in a remote and potentially smaller computer.
In this paper, we propose an automated machine-learning approach to classify grains from a sample using optical microscopy that builds upon our previous work published in [22]. With this approach, the task of mineral identification requires minimal human intervention. The images of grains are collected using inexpensive photomicrographic systems or through the use of robotic machines or automated microscopes. The images (photomicrographs) can then be processed to isolate the individual grains within the complete image and, thus, classify and count these grains. Our approach uses an improved superpixel method that segments the grains quickly and automatically. For deep learning to solve this task, the segmentation must be accurate enough for the model to automatically learn the features representing each class. The segmented grains can then be used as input to the trained deep-learning model. Although deep learning frequently outperforms classical machine learning, it is only recently that mineral identification has been investigated with deep learning. With the new segmentation and state-of-the-art deep-learning models, we can achieve better results than observed in published classical machine-learning-based approaches.

Literature Review
Currently, there are mainly two distinct methods for grain recognition: traditional engineering devices [23,24] and computational methods [25].

Traditional, Device-Based Methods
Traditional methods for classifying and counting mineral grains rely on the use of SEM or optical microscopes. The use of an optical microscope is the most common method for estimating mineral abundance in sediments or milled rock, although this requires highly trained personnel to sort the mineral grains. Mineral sorting is possible using the specific polarized transmitted and reflected light properties of minerals and the morphological properties of the grains. Advances in the use of optical microscopes have been successfully applied to mineral grain analyses, although the main limitations discussed above remain [26][27][28][29]. Significant improvement of this method will require a technical breakthrough. Automated SEM provides an alternative means of counting minerals [11,30], and the SEM-based approaches include QEMSCAN, TIMA-X, and MLA [31]. SEM uses a focused electron beam to scan the material and generate an image of the grains. The interaction of the electrons with atoms on the grain surface provides additional information captured by the various sensors (e.g., X-ray fluorescence) to determine the chemical composition of the mineral. SEM output includes the chemical composition with grain size, shape, and proportion. Grain counting can be performed using an electron microprobe [32]; however, this method is time consuming [33].
In [34], the authors presented an image processing workflow for characterizing pore- and grain-size distributions in porous geological samples using SEM images and X-ray microcomputed tomography (µCT). Their samples included the Buff Berea, Berea, Nugget, Bentheimer, and Castlegate sandstones and the carbonate Indiana Limestone. The produced 2D distribution from the SEM appeared biased toward smaller sizes. In [35], the authors developed a grain count technique using a laser particle counter sensor (Wenglor) to count stainless-steel beads and sand grains of different size classes. They compared the count with that obtained using high-speed cameras. They found that the Wenglor can count grain sizes between 210 ± 3 µm and 495 ± 10 µm and that only grains passing through the center of the beam were counted. In [36], the authors used a less expensive light microscope able to produce images of grain shape profiles sufficient in quality for identification and counting. Their key finding was that roundness, sphericity, circularity, ModRatio, and aspect ratio were the key shape parameters for differentiating grains.

Computer Vision-Based Computational Methods
Computational or machine-learning methods are increasingly applied in a multitude of spheres, including automated driving and navigation, automated image recognition, automated medical diagnosis, and agricultural processes [37][38][39]. The ability to apply machine-learning tools to a vast suite of applications also extends to the environmental and geological sciences.
The integration of machine learning to automate the process of mineral grain recognition was first explored by Maitre et al. [22]. The authors used linear iterative clustering segmentation to generate superpixels, thereby isolating individual grains. The applied feature extraction method, using a series of classifiers, produced an 89% recognition rate. In [25], cluster analysis through a k-means algorithm for mineral recognition divided the data set into categories according to similarity computed by a distance measure, e.g., Euclidean distance. Baklanova and Shvets extracted the colors and textures of grains using a stereoscopic binocular microscope. However, the authors did not compare the clusters found with labeled clusters that actually belonged to a certain species of minerals. In fact, their work was used only to classify rocks and not minerals and, thus, is only applicable to petrography. Other methods of mineral classification, although limited to copper minerals, have produced an acceptable accuracy of approximately 75% using laser-induced breakdown spectroscopy (LIBS) analyzers [40]. In [41], the authors classified heavy minerals collected from rivers. Using 3067 grains in 22 classes, they achieved 98.8% accuracy using 26 decision attributes and a random forest algorithm.

Materials and Methods
Our approach consisted of four main stages (Figure 1). The first stage involved data collection, and the second involved preprocessing the original mosaic and SEM images to remove noise and outlier objects. In the third stage, the grains were segmented using contour- and superpixel-based techniques. We selected five classes for recognition, choosing the classes with the greatest number of grains. In the final stage, we input the segmented grains into various convolutional neural network (CNN) models.

Data Set Acquisition
We collected 10 kg of till grains from the field, and sediments were sieved to less than 1 mm. The samples were then processed with a fluidized bed to obtain a superconcentrate of heavy minerals (approximately 100 mg) containing approximately 2 million grains smaller than 50 µm. The superconcentrate was sprinkled onto carbon tape to provide a black backdrop for the images. Images were then obtained using a camera mounted onto a binocular microscope, and we created a photomosaic. To acquire the groundtruthed data, i.e., mineral grain identities, we acquired a backscattered image of the grains using SEM with X-ray fluorescence [42]. The groundtruthed data were the mineral map and referenced with the RGB mosaic. The end result, after using the motorized conventional microscope and 6-megapixel camera, was a mosaic image of approximately 2 GB (34,674 × 33,720 pixels) to be used as the data set for the machine-learning algorithm. We acquired 238 fields of view with a 10% overlap between adjacent fields in the images. Figure 2 shows the sample of the grains and the corresponding, annotated SEM image.

Data Preprocessing
The original image background consisted of outlier grains that are not part of the SEM annotated image; therefore, preprocessing, using various morphological operations, served to remove outlier particles. An outlier grain is a phantom image of a grain lying outside of the field of view. To reduce processing time, we cropped the original image to include only 1/3 of the original image by discarding 12,000 border pixels on all sides that did not contain grains. This new image was further divided into 5608 × 5608 equally sized subimages. We considered only five classes for classification because of unbalanced data and a low number of instances for some of the discarded classes.
The groundtruthed image was converted into a binary image, and morphological operations (dilation, hole filling, and erosion) were applied to remove the outlier grains, the background, and other noise. The largest filled segment of the SEM-based, labeled image was extracted by discarding all outlier grains and other noise. Erosion shrank the objects in the input image on the basis of kernel size. Similarly, dilation enlarged the objects in the input image on the basis of kernel size. We applied a kernel size of 7 × 7. The erosion and dilation of the binary image were calculated using Equations (1) and (2), respectively, where A represents the original binary image, and B represents the kernel:

$$A \ominus B = \{ z \mid B_z \subseteq A \} \qquad (1)$$

$$A \oplus B = \{ z \mid (\hat{B})_z \cap A \neq \emptyset \} \qquad (2)$$

In Equations (1) and (2), $B_z$ denotes the kernel B translated by z, and $\hat{B}$ denotes the reflection of B about its origin. Figure 3 shows the outcome of the different preprocessing steps and the mapping of the SEM ground truth image and the original image based on the processed SEM binary image.
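The erosion and dilation steps described above can be illustrated with a minimal pure-Python sketch. It assumes a binary image stored as a list of rows, a square kernel, and pixels outside the image treated as background; the function names `erode` and `dilate` are hypothetical, and the toy example uses a 3 × 3 kernel rather than the 7 × 7 kernel of the paper so that the result is easy to check by hand. A production pipeline would more likely call an image-processing library.

```python
def erode(image, k):
    """Binary erosion with a k x k square kernel (k odd).

    Pixels outside the image are treated as background (0).
    """
    h, w, r = len(image), len(image[0]), k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # A pixel survives only if the whole kernel window is foreground.
            out[y][x] = int(all(
                0 <= y + dy < h and 0 <= x + dx < w and image[y + dy][x + dx]
                for dy in range(-r, r + 1) for dx in range(-r, r + 1)))
    return out


def dilate(image, k):
    """Binary dilation with a k x k square kernel (k odd)."""
    h, w, r = len(image), len(image[0]), k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # A pixel is set if any foreground pixel lies in the kernel window.
            out[y][x] = int(any(
                0 <= y + dy < h and 0 <= x + dx < w and image[y + dy][x + dx]
                for dy in range(-r, r + 1) for dx in range(-r, r + 1)))
    return out


# Toy image: one 4 x 4 "grain" plus an isolated one-pixel speck of noise.
image = [[0] * 8 for _ in range(8)]
for y in range(1, 5):
    for x in range(1, 5):
        image[y][x] = 1
image[6][6] = 1

eroded = erode(image, 3)       # the grain shrinks, the speck disappears
opened = dilate(eroded, 3)     # the grain is restored, the speck stays gone
```

Eroding and then dilating with the same kernel is only one possible ordering; the text lists dilation, hole filling, and erosion, so the sequence shown here is an illustration of the two operators rather than the authors' exact pipeline.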

Grain Segmentation
We used superpixel segmentation to separate the mineral grains (see Algorithm 1). The image was first converted to binary, and morphological operations (erosion and dilation) were applied to separate the grains from each other. To convert the image into binary, the image threshold was calculated using Otsu's method [43]. Using the resulting binary image, we calculated the total number of external, closed contours to represent the possible grains in the image. Contours are closed curves traced along the edges of objects whose pixels share the same value or intensity. The contour count C then serves as the seed for the superpixel segmentation method, rather than a fixed number K. We applied Equation (3) to calculate the grid interval between the centers of approximately equal-sized superpixels for an input image of size N.
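The seeding procedure above can be sketched in a few lines of standard-library Python. Two assumptions are made here and should be read as such: first, the grid interval is taken to follow the usual SLIC form, S = sqrt(N / C), with the contour count C in place of a fixed K; second, 8-connected foreground blobs are counted as a stand-in for external contours, which agrees with an external-contour count as long as no blob sits inside a hole of another blob. The function names are hypothetical.

```python
from math import sqrt


def otsu_threshold(pixels, levels=256):
    """Return the grey level that maximises the between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * c for i, c in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(levels):
        w0 += hist[t]              # number of pixels with value <= t
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0, mean1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def count_blobs(binary):
    """Count 8-connected foreground blobs (stand-in for external contours)."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                count += 1
                seen[y][x] = True
                stack = [(y, x)]
                while stack:       # flood-fill the whole blob
                    cy, cx = stack.pop()
                    for ny in range(cy - 1, cy + 2):
                        for nx in range(cx - 1, cx + 2):
                            if (0 <= ny < h and 0 <= nx < w
                                    and binary[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                stack.append((ny, nx))
    return count


def grid_interval(n_pixels, n_seeds):
    """Assumed SLIC grid interval S = sqrt(N / C)."""
    return sqrt(n_pixels / n_seeds)


# Toy 6 x 8 grey image: dark background with two bright "grains".
gray = [[10] * 8 for _ in range(6)]
for y, x in [(1, 1), (1, 2), (2, 1), (4, 5), (4, 6)]:
    gray[y][x] = 200

t = otsu_threshold([p for row in gray for p in row])
binary = [[int(p > t) for p in row] for row in gray]
C = count_blobs(binary)            # seed count taken from the image itself
S = grid_interval(6 * 8, C)        # spacing between superpixel centres
```

The point of the design is that the number of superpixels adapts to each subimage: a crowded subimage yields a large C and a small grid interval, while a sparse one yields fewer, larger superpixels.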
The superpixel segmentation method relies on oversegmenting the image while simultaneously decreasing the complexity of the image processing tasks. We applied a simple linear iterative clustering (SLIC) method to produce high-quality segmentation in a timely manner [44]. The method performs local k-means clustering of the image pixels using color similarity and proximity in the subimages. The method operates in the five-dimensional labxy space, where l, a, and b are the components of the pixel color vector in the CIELAB color space, and x and y are the pixel coordinates, which determine the spatial distances. To merge the color proximity and spatial proximity distances, we normalized the distances using Equations (4) and (5). Clustering the pixels in the labxy space requires the distance measure D, which favors approximately equal-sized superpixels.
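As a hedged illustration of how color and spatial proximity can be merged into one distance, the sketch below uses one widely cited SLIC form, D = sqrt(d_c^2 + (d_s / S)^2 m^2), where d_c is the CIELAB distance, d_s the pixel distance, S the grid interval, and m a compactness weight. This form and the default m = 10 are assumptions; the normalization in the paper's Equations (4) and (5) may differ. The function name `slic_distance` is hypothetical.

```python
from math import sqrt


def slic_distance(pixel, center, S, m=10.0):
    """Combined labxy distance between a pixel and a cluster centre.

    pixel and center are (l, a, b, x, y) tuples; S is the grid interval
    and m the (assumed) compactness weight.
    """
    l1, a1, b1, x1, y1 = pixel
    l2, a2, b2, x2, y2 = center
    d_color = sqrt((l1 - l2) ** 2 + (a1 - a2) ** 2 + (b1 - b2) ** 2)
    d_space = sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
    # Dividing by S normalises the spatial term by the expected superpixel size.
    return sqrt(d_color ** 2 + (d_space / S) ** 2 * m ** 2)


# Same colour, 5 pixels apart, grid interval 5: only the spatial term counts.
d_spatial_only = slic_distance((50, 0, 0, 3, 4), (50, 0, 0, 0, 0), S=5)

# Same position, colour differs by (3, 4, 0): only the colour term counts.
d_color_only = slic_distance((53, 4, 0, 0, 0), (50, 0, 0, 0, 0), S=5)
```

A larger m makes superpixels more compact and regular, whereas a smaller m lets them follow color boundaries, such as grain edges, more closely.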

The segmentation provided the xy coordinates of each superpixel. The method was further enhanced by increasing the contrast of the images to allow the discrimination of the grain borders. In Maitre et al. [22], the superpixel method was applied using a fixed-size input seed value for the superpixels. This approach worked well for the color feature-based method with classical machine-learning methods; however, this method did not rely on deep learning. Thus, we proposed to automate the calculation of the seed values in the segmentation method to prepare the data for deep-learning networks. The comparisons of the superpixel boundaries and the outcome for the segmented grains for both methods are presented in Figures 4 and 5, respectively.

Algorithm 1 (surviving fragment): … (15, 15)); B_c ← find external contours (B_e, chain approx simple); select n classes with maximum grain count.

Grain Class Annotation
We selected five main classes on the basis of the number of segmented grains for each class and the group distribution of visually similar, rock-forming minerals (Table 1). We selected six types of individual grains that were further mapped to five classes, including the background class. These segmented images were labeled by mapping the original subimages to the SEM-based subimages using the superpixel-based method. The bounding-box method was then applied to extract each grain as a rectangular image. The grains with a height:width ratio greater than 1.75 were discarded. A total of 21,091 images were segmented, and these formed the final data set, divided into five classes. Albite grain and quartz grain images were merged into one class because they are visually similar, rock-forming minerals. The sample images of albite and quartz grains shown in Figure 6 clearly indicate their visual similarity. Augite grain and tschermakite grain images were also merged into one class because of their visual similarity, as illustrated by the samples in Figure 7. The background class contained images that were entirely black, contained very small grains (the total number of nonblack pixels was less than 256), or contained noise in the background. Figure 8 shows sample images of the background class. For the experiments, the images of these five classes were divided into 20% for testing, and the remaining 80% was divided again into 80%/20% for the training/validation sets.
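The two filtering rules stated above (discard crops whose height:width ratio exceeds 1.75, and treat crops with fewer than 256 nonblack pixels as background) can be sketched as follows. The sketch assumes a single-channel crop stored as a list of rows, takes "nonblack" to mean a nonzero value, and applies the ratio test first; the order of the tests and the function name `classify_crop` are assumptions rather than details given in the text.

```python
def classify_crop(crop, max_ratio=1.75, min_pixels=256):
    """Return 'discard', 'background', or 'grain' for a bounding-box crop.

    crop is a 2D list of grey values; 0 is treated as black.
    """
    height, width = len(crop), len(crop[0])
    if height / width > max_ratio:
        return "discard"                      # too elongated to keep
    nonblack = sum(1 for row in crop for px in row if px != 0)
    # Crops with very little grain material go to the background class.
    return "background" if nonblack < min_pixels else "grain"


tall = [[255] * 20 for _ in range(40)]        # ratio 2.0 -> discarded
empty = [[0] * 20 for _ in range(20)]         # no nonblack pixels
solid = [[255] * 20 for _ in range(20)]       # 400 nonblack pixels

labels = [classify_crop(c) for c in (tall, empty, solid)]
```

For an RGB crop, the nonblack test would need to consider all three channels, for example by counting pixels in which any channel is nonzero.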

Table 1 (partial; only the rows for classes C4 and C5 survive). C4: Hypersthene; any class > 256 pixels; 988 images; Hypersthene; None. C5: Background; 6106 images.

ResNet Models for Grain Recognition
As computer vision and artificial intelligence tasks grow more difficult, deep neural network models are becoming increasingly complex. Such powerful models demand more training data to prevent overfitting. Recent deep-learning methods have been successfully applied to artificial intelligence [45,46]. Interest in convolutional neural networks (CNNs) surged in 2012 with AlexNet, which built on LeNet. New CNN-based models have since been developed, including GoogleNet and residual neural networks (ResNet) [47][48][49]. A major advantage of CNNs is their ability to learn the critical features best representing the data without any human intervention.
ResNet overcomes model complexity and the vanishing gradient problem, producing satisfactory accuracies by training deeper networks [50]. Each ResNet block comprises four layers. The first is a weight layer, expressed as $Z_{n+1} = W_{n+1} X_n + Y_{n+1}$. The second is a ReLU layer, a nonlinear layer, expressed as $X_{n+1} = H(Z_{n+1})$, and the third is another weight layer, $Z_{n+2} = W_{n+2} X_{n+1} + Y_{n+2}$. $X_n$ is the input to these three layers combined, and their output is $F(X_n) = Z_{n+2}$. All these variables are matrices, and the subscripts denote the layer numbers. In ResNet, a skip or shortcut link bypasses the three layers and passes $X_n$ directly to an adder. Thus, $F(X_n)$ is added to $X_n$ before passing through the fourth layer, a second ReLU, which generates $X_{n+2} = H(Z_{n+2} + X_n)$.
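The four layers of a residual block can be traced with a small pure-Python sketch. It treats the input as a vector and the weight layers as dense matrices, which is a simplification: the blocks in the paper are convolutional. The helper names `relu`, `affine`, and `residual_block` are hypothetical.

```python
def relu(v):
    """H(.): element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def affine(W, x, y):
    """Return W x + y for a matrix W (list of rows) and vectors x, y."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W, y)]


def residual_block(x_n, W1, Y1, W2, Y2):
    z1 = affine(W1, x_n, Y1)     # Z_{n+1} = W_{n+1} X_n + Y_{n+1}
    x1 = relu(z1)                # X_{n+1} = H(Z_{n+1})
    z2 = affine(W2, x1, Y2)      # Z_{n+2} = W_{n+2} X_{n+1} + Y_{n+2} = F(X_n)
    # Shortcut: add the untouched input, then apply the second ReLU.
    return relu([f + x for f, x in zip(z2, x_n)])   # X_{n+2} = H(F(X_n) + X_n)


zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
no_bias = [0.0, 0.0]

# With all weights at zero, F(X_n) = 0 and the block passes X_n through.
passthrough = residual_block([1.5, 2.0], zeros, no_bias, zeros, no_bias)

# With identity weights, the residual branch adds its own contribution.
combined = residual_block([1.0, -2.0], identity, no_bias, identity, no_bias)
```

Because of the final ReLU, the zero-weight block reproduces its input exactly only when that input is nonnegative, which is the case for activations coming out of a preceding ReLU.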
The term skip, or shortcut, connection refers to the path that carries the input X to the adder. Because X is passed directly from one layer to another, the residual branch can learn F(X) = 0, so that the block simply passes X through when the identity is the desired mapping. If this shortcut connection were absent, the network would need to learn weight layers equivalent to the identity matrix multiplied by X, which adds more complexity to the task. When the identity is not the desired mapping, the network learns F(X) in the usual way through backpropagation. In this case, it is easier to train F(X) to be the residual D(X) − X, which results in the desired output of D(X) when added to X using the shortcut connection. Because the shortcut connection does not require weights, the gradient passes through it unchanged, which helps overcome the vanishing gradient problem.
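A short worked derivation may make the gradient argument concrete. It is a sketch that treats $X_n$ as a vector, and it introduces two symbols that do not appear in the original text: $S$ for the adder output and $\mathcal{L}$ for the training loss. $H$ denotes the ReLU and $F$ the residual branch.

```latex
\begin{aligned}
S &= X_n + F(X_n), \qquad X_{n+2} = H(S),\\
\frac{\partial S}{\partial X_n} &= I + \frac{\partial F}{\partial X_n},\\
\frac{\partial \mathcal{L}}{\partial X_n}
  &= \left(I + \frac{\partial F}{\partial X_n}\right)^{\!\top}
     \left(H'(S) \odot \frac{\partial \mathcal{L}}{\partial X_{n+2}}\right).
\end{aligned}
```

The identity term $I$ comes from the shortcut, so part of the gradient reaches $X_n$ without being multiplied by any weight matrix, even when $\partial F / \partial X_n$ is small.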
Stacking a sequence of ResNet blocks produces a deeper ResNet architecture with low training errors and excellent accuracies. The ResNet blocks might require pooling layers when the convolution or weight layers generate an F(X) matrix whose size differs from that of the original X matrix. The pooling step resizes X to match the size of the F(X) matrix before X is added to F(X). This can be achieved by adding (W·X) to F(X), where W is a zero-padded matrix that fills in the rows and columns missing from the original X.
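The resizing of X can be illustrated with a small NumPy sketch. It assumes the simplest case, in which F(X) has more rows than X and W is an identity block padded with zero rows; the helper name `zero_padded_projection` and the example shapes are hypothetical.

```python
import numpy as np

def zero_padded_projection(d_out, d_in):
    """Return W of shape (d_out, d_in): an identity block padded with zeros."""
    w = np.zeros((d_out, d_in))
    k = min(d_out, d_in)
    w[:k, :k] = np.eye(k)
    return w

x = np.arange(6, dtype=float).reshape(2, 3)   # original X with 2 rows
f_x = np.ones((4, 3))                         # F(X) with 4 rows after the weight layers
w = zero_padded_projection(f_x.shape[0], x.shape[0])
out = f_x + w @ x                             # shapes now agree: (4, 3)
```

The first two rows of `out` contain F(X) plus X, and the remaining rows contain F(X) alone, because the padded rows of W contribute zeros.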

ResNet Version 1
We used two ResNet architectures, referred to as "ResNet 1" and "ResNet 2". Figure 9 details the design of the ResNet 1 architecture at the block level. The shortcut connections introduce no additional parameters and therefore do not add to the risk of overfitting, which makes ResNet an efficient deep-learning network even for hundreds of network layers. In ResNet 1, a convolutional layer splits the feature map into two at the beginning, and the filter size is doubled to map the convolutional layer, batch normalization layer, and ReLU layer to 32 × 32 × 16, 16 × 16 × 32, and 8 × 8 × 64, respectively, on the basis of the i and j values, where i represents how many times the filter size must be doubled, and j represents the number of ResNet block iterations on the basis of N. The deep-network performance is enhanced by normalizing the layer inputs using the batch normalization block. ResNet 1 has an input image dimension of 48 × 48 × 3, with each layer in the architecture consisting of a convolutional layer, a batch normalization layer, and a rectified linear unit (ReLU).
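The three stage sizes quoted for ResNet 1 follow a simple halve-and-double pattern, which the following sketch reproduces. The function name and its default arguments are illustrative assumptions chosen to match the sizes stated above; the exact rule used in this work is the one given in Figure 9.

```python
def resnet1_stage_shapes(size=32, filters=16, stages=3):
    """List (height, width, channels) per stage: halve the map, double the filters."""
    shapes = []
    for _ in range(stages):
        shapes.append((size, size, filters))
        size //= 2      # the feature map is split into two
        filters *= 2    # the filter count is doubled
    return shapes

shapes = resnet1_stage_shapes()
print(shapes)  # [(32, 32, 16), (16, 16, 32), (8, 8, 64)]
```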

ResNet Version 2
The ResNet 2 architecture at the block level is detailed in Figure 10, and the filter size for each step is calculated using the flowchart in Figure 11. As for ResNet 1, the feature maps are initially split into two, and the filter maps are doubled. A bottleneck connection is introduced in ResNet 2, with the filter size calculated as shown in Figure 11. In addition, the block size of the skip connection is tripled. The three layers within a residual function block are convolutional layers sized [1 × 1], [3 × 3], and [1 × 1]: the 1 × 1 layers decrease and then increase the input dimensions, and the 3 × 3 layer is the bottleneck that operates on the reduced dimensions. The stages of ResNet 2 begin with a 32 × 32 × 16 convolutional layer in step 1, which produces an output of size 32 × 32 × 64. Step 2 produces a 16 × 16 × 128 output, and step 3 produces an 8 × 8 × 256 output. These ResNet 2 outputs are based on the i and j values, where i represents how many times the filter size must be doubled, and j represents the number of ResNet block iterations based on N.
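The three output sizes quoted for ResNet 2 can be reproduced with a short sketch. The rule coded here (a fourfold expansion in the first step, then a doubling in each later step) is an assumption chosen only because it matches the stated sizes; the authoritative rule is the flowchart in Figure 11, and the function name is hypothetical.

```python
def resnet2_stage_shapes(size=32, filters_in=16, stages=3):
    """List (height, width, channels) of each stage output for a bottleneck ResNet."""
    shapes = []
    for i in range(stages):
        # Assumed rule: the first stage expands the filters 4x, later stages 2x.
        filters_out = filters_in * 4 if i == 0 else filters_in * 2
        shapes.append((size, size, filters_out))
        filters_in = filters_out
        size //= 2      # the feature map is halved before the next stage
    return shapes

shapes = resnet2_stage_shapes()
print(shapes)  # [(32, 32, 64), (16, 16, 128), (8, 8, 256)]
```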
Note that, in both ResNet 1 and ResNet 2, after the initial concatenation of the blocks in the sequence weights → batch normalization → ReLU, the concatenated sequenced block is repeated. The main difference between the two architectures lies in the sequence that follows this initial weight, batch normalization, and activation block. For ResNet 1, the sequence is convolutional block → batch normalization block → activation block, whereas, in ResNet 2, the sequence is batch normalization block → activation block → convolutional block.
ResNet 1 uses postactivation, whereas ResNet 2 uses preactivation.
In ResNet 1, the second ReLU nonlinearity is applied after adding F(X) to X. In ResNet 2, this last ReLU nonlinearity is removed, which allows the sum of the residual mapping and the identity mapping to pass unchanged to the next block. In addition, during backpropagation, the gradient at the output layer is passed back directly to the input layer, which helps overcome the vanishing gradient problem in deep-learning networks that have hundreds or thousands of layers, thereby improving their performance and reducing the associated training errors.
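The difference between the two orderings can be sketched with NumPy. This is an illustration under simplifying assumptions, not the implementation used in this work: the convolutions are replaced by plain matrix products, the batch normalization has no learned scale or shift, and all function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def batch_norm(z, eps=1e-5):
    """Normalize each feature (row) over the batch (columns); no learned parameters."""
    mean = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def postactivation_block(x, w1, w2):
    """ResNet 1 ordering: weights -> batch norm -> ReLU, with a ReLU after the addition."""
    h = relu(batch_norm(w1 @ x))
    f = batch_norm(w2 @ h)
    return relu(x + f)

def preactivation_block(x, w1, w2):
    """ResNet 2 ordering: batch norm -> ReLU -> weights, with no ReLU after the addition."""
    h = w1 @ relu(batch_norm(x))
    f = w2 @ relu(batch_norm(h))
    return x + f

x = np.array([[1.0, -2.0], [-3.0, 4.0]])
zero = np.zeros((2, 2))
pre_out = preactivation_block(x, zero, zero)    # identity passes through untouched
post_out = postactivation_block(x, zero, zero)  # final ReLU clips the negative entries
```

With zero weights, the preactivation block returns `x` exactly, negative entries included, whereas the postactivation block returns `relu(x)`. This is the sense in which the sum passes unchanged to the next block in ResNet 2.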
For both ResNet models, we fine-tuned the hyperparameters experimentally. The final hyperparameter settings were activation function = ReLU, learning rate = 0.001, number of epochs = 50, and batch size = 20. These hyperparameters produced the experimental results discussed in Section 4.
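For reference, the reported settings can be collected in a small configuration object. The class name, field names, and the `steps_per_epoch` helper are hypothetical conveniences; only the four values come from the text, and the training-set size used in the example is made up.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters reported for both ResNet models."""
    activation: str = "relu"
    learning_rate: float = 0.001
    epochs: int = 50
    batch_size: int = 20

    def steps_per_epoch(self, n_train):
        """Number of mini-batches needed to see n_train images once."""
        return math.ceil(n_train / self.batch_size)

cfg = TrainConfig()
steps = cfg.steps_per_epoch(1000)   # 1000 is a hypothetical training-set size
total_updates = steps * cfg.epochs
```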

Experimental Results
The experimental setup comprised a high-performance computing machine with 256 GB of memory and an Nvidia Tesla V100 graphics processing unit (GPU) with 5120 CUDA cores. We used Python 3.8 to program all phases, including preprocessing, classification, and identification. The data set was split so that 80% was used for training and the remaining 20% was held out for testing. The 80% training portion was then divided again into 80% training and 20% validation subsets. We tested various numbers of epochs and chose the value that avoided both over- and underfitting. We tested various parameter settings for ResNet 1 and ResNet 2 to obtain the optimal results and evaluated their performance against the better-known deep-learning approaches of LeNet, AlexNet, and GoogleNet.
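The nested split described above leaves 64% of the data for training, 16% for validation, and 20% for testing. A minimal sketch of such a split is given below; the function name, the random seed, and the data-set size of 1000 are assumptions for illustration, and the shuffling strategy actually used in this work is not specified.

```python
import numpy as np

def split_indices(n, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle n indices, hold out a test set, then split the rest into train/validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(n * test_frac))
    test, rest = idx[:n_test], idx[n_test:]
    n_val = int(round(len(rest) * val_frac))   # 20% of the remaining 80%
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 640 160 200
```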
ResNet 1 and ResNet 2 achieved higher validation accuracies than LeNet, AlexNet, and GoogleNet (Table 2). The validation accuracy of ResNet 2 was slightly higher than that of ResNet 1. We obtained these scores by applying the segmentation methods presented in [22]. In that paper, the authors achieved a global accuracy of 89% using an RF classifier; however, their data were not effective when deep-learning algorithms were applied. We used superpixel segmentation combined with the proposed ResNet architectures to produce much higher validation accuracies than those achieved in [22]. LeNet, AlexNet, and GoogleNet produced validation accuracies ranging from 74.4% to 86.3%, with the highest accuracy achieved by AlexNet, as shown in Table 3. However, the proposed ResNet 1 and ResNet 2 achieved higher validation accuracies of 89.8% and 90.6%, respectively. Compared with the highest validation accuracy achieved in [22], which was 49%, our proposed method represents a relative increase of 84.69%. The highest validation accuracy of 90.5%, produced by the ResNet 2 architecture with 47 layers, sets a new benchmark for researchers in the field of grain recognition. It is also a relative improvement of 1.69% over the accuracy achieved in [22] using an RF classifier.
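The two percentage improvements quoted in this paragraph are relative changes with respect to the earlier accuracies, computed from the 90.5% figure. The short check below reproduces them; the function name is hypothetical.

```python
def relative_increase(new, old):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0

vs_deep_learning = round(relative_increase(90.5, 49.0), 2)
vs_rf_classifier = round(relative_increase(90.5, 89.0), 2)
print(vs_deep_learning)  # 84.69
print(vs_rf_classifier)  # 1.69
```

In absolute terms, the same comparisons correspond to gains of 41.5 and 1.5 percentage points, respectively.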
We varied the number of layers for ResNet 1 and ResNet 2 to determine the best parameters for achieving the highest accuracy. The best accuracy for ResNet 1 was achieved using 74 layers (Table 4); however, although there was a slight improvement going from 32 to 74 layers, training times increased markedly for 74 layers. Hence, ResNet 1 with 32 layers was the chosen architecture for this application. For ResNet 2, we found the highest validation accuracy using 47 layers, accompanied by a reasonable training time. Although the training time almost doubled between 29 and 47 layers, the increased validation accuracy justified using the 47 layers for this application. We compared the various ResNet-model-layer combinations in terms of training accuracy (Figure 12), validation accuracy (Figure 13), training loss (Figure 14), and validation loss (Figure 15). A consistent pattern emerged of ResNet 1 (32 layers) and ResNet 2 (47 layers) being the best models of the series.
The confusion matrices in Figure 16 show the comparison of each class's accuracy for the best proposed model (ResNet version 2 with 47 layers). The left confusion matrix shows the percentage accuracies for each class, and the right confusion matrix shows correctly classified grain images for each class. The results in the confusion matrix indicate that the classes C1 and C5 achieved higher accuracies as they had more grain images for the training.
When we compared our ResNet 2 model (47 layers) with techniques published in the recent literature, applying each published method to our grain data set, we observed that the superpixel-based grain segmentation and the ResNet 2 (47 layers) clearly outperformed the existing techniques and achieved the highest accuracy values (Table 5).
Table 5 (partial): [50], Neighborhood component analysis quadratic SVM, 39.72%
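Per-class accuracies of the kind reported in a confusion matrix such as Figure 16 can be computed from the raw counts as sketched below. The 3 × 3 matrix is made-up example data, not the results of this work, and the sketch assumes that rows hold the true classes and columns the predicted classes.

```python
import numpy as np

def per_class_accuracy(cm):
    """Fraction of each true class (row) that was classified correctly."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)

# Hypothetical counts for three classes; rows = true class, columns = predicted class.
cm = np.array([[90,  5,  5],
               [10, 30, 10],
               [ 2,  8, 40]])
acc = per_class_accuracy(cm)          # per-class accuracies
overall = np.trace(cm) / cm.sum()     # overall accuracy across all classes
```

In this made-up example, the class with the most images also has the highest per-class accuracy, which mirrors the pattern noted for classes C1 and C5.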

Discussion and Conclusions
We presented two improved residual network architectures to automate the detection and counting of individual mineral grains. These algorithms, ResNet 1 and ResNet 2, are modified versions of ResNet. We adopted the superpixel segmentation method and applied preprocessing techniques to provide the seed for the segmentation method, which made the data more appropriate for deep-learning algorithms. The ResNet 2 architecture with 47 layers produced the highest validation accuracy of 90.5%. To our knowledge, this is the highest reported accuracy achieved using deep-learning networks for this particular application.
Few papers explore the use of machine-learning techniques and deep-learning algorithms for the automatic recognition, classification, and counting of grain minerals; however, the existing approaches offer benchmarks against which we can compare our results. Our ResNet 1 and ResNet 2 outperformed the deep-learning algorithms LeNet, AlexNet, and GoogleNet in the automatic grain detection and counting application. Despite these very encouraging results, improvements must be made before our deep-learning techniques can be applied in the field. The data set must be enhanced to eliminate problems of mislabeling, unbalanced data, and fusion. Moreover, the developed approach is limited by:
a. The scarcity of mineral data sets. A key contribution of this work is the development of such a data set, because such data sets are not readily available for grain mineral classification.
b. Unbalanced data for different classes. In the developed data set, the number of images available differed between classes.
c. The need for high-performance GPUs for training. We had access to a GPU system; however, the training step still required a considerable amount of time.
Future work will include developing data sets for the purpose of grain mineral recognition and enhancing new and current methods to achieve a higher recognition rate with more mineral classes. These advances will include applying various image fusion and registration techniques to greatly improve the mapping of the original images with the labeled images. We will also explore other techniques for segmentation that may enhance accuracy. These may include the region-growing-based method, fuzzy C-means, and deep-learning segmentation.