1. Introduction
Texture analysis in images is important in a wide range of industries. It is not precisely defined, since image texture itself is not precisely defined, but intuitively, image texture analysis attempts to quantify qualities such as roughness, smoothness, heterogeneity and regularity as a function of the spatial variation in pixel intensities.
In materials, image texture analysis can be used to derive quantitative descriptors of the distributions of the orientations and sizes of grains in polycrystalline materials. Almost all engineering materials have texture, which is strongly correlated with their properties, such as mechanical strength and resistance to stress corrosion cracking and radiation damage. In this sense, image textures and textures in materials are closely related.
In the case of metalliferous ores or rocks, image texture provides critical information with regard to the response of the materials during mining and mineral processing [1,2]. For example, more energy is generally required to liberate finely disseminated minerals from ores, i.e., ores with fine textures, than is the case with coarsely textured ores.
Characterization of the textures of materials, whether related to the microstructures of metals, mineral ores or the surface properties of materials, requires steps beyond simple qualitative descriptions to quantitative descriptors that can be used in models to predict the behavior of the metals or ores in a given process or system [2].
A variety of methods to quantify the textural appearance of materials has been proposed, as discussed comprehensively in a recent review by Ghalati et al. [3]. This includes a review of traditional approaches based on engineered features, as well as more recent approaches based on the use of learned features.
Few studies have been reported in which methods were assessed comparatively. For example, Kerut et al. [4] reviewed quantitative texture analysis applied to the images generated by echocardiography, categorizing the methods as statistical, fractal, frequency domain and wavelet-based. The latter was still an emerging approach at the time but was considered to be state-of-the-art, with 2D Haar dyadic wavelets recommended in particular.
In mineral processing, Kistner et al. [5] considered the application of grey level co-occurrence methods, local binary patterns, wavelets, steerable pyramids and textons and found that the latter approach was best able to capture patterns associated with flotation froth images, particulate solids and slurry flows.
More recently, transfer learning by use of convolutional neural networks has emerged as a strong contender as a state-of-the-art method in texture analysis [6]. Fu and Aldrich [7] found that AlexNet provides better flotation froth descriptors than methods based on the use of grey level co-occurrence matrices, wavelets and local binary patterns. Other authors, such as Mormont et al. [8] and Xiao et al. [9], have also conducted texture analysis with transfer learning methods. Mormont et al. [8] found that DenseNet and ResNet yielded marginally the best features for classification of histological images, while Xiao et al. [9] concluded that transfer learning methods outperformed traditional methods, with InceptionV3 performing best overall.
While these studies serve as a guide to the comparative merits of different texture analytical methods, the relative strengths of the different approaches are still not well documented in the literature. This study reviews developments in quantitative image texture analysis and assesses the feasibility of state-of-the-art methods, with a focus on textures associated with material processing. This includes an assessment of different variants of transfer learning, based on zero, partial and full retraining of the feature layers of convolutional neural networks.
In the following section, image texture analysis in metal processing is briefly reviewed. This is followed by an explanation of the analytical methodology of the study. In Sections 4–6, the methodology is applied to case studies, and in Sections 7 and 8, the results are discussed and the conclusions of the study are summarized.
4. Case Study 1: Voronoi-Simulated Material Microstructures of Different Grain Size
In the first case study, 1000 simulated textures each of Classes A and B were generated from identical bivariate uniform distributions, except that the textures in dataset A were based on 100 data points each and those in dataset B on 105 data points each. This meant that the simulated grain sizes in the Class B dataset were on average smaller than those in the Class A dataset. Examples of these simulated textures are shown in Figure 5.
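As an illustration, the following minimal sketch shows how such a texture could be generated: seed points are drawn from a bivariate uniform distribution and each pixel takes the grey level of its nearest seed, yielding a Voronoi tessellation of simulated grains. The image size, grey-level assignment and function name (voronoi_texture) are illustrative assumptions, not details from the original study.

```python
import numpy as np

def voronoi_texture(n_points, size=224, seed=None):
    """Simulate a grain texture as the Voronoi tessellation of random seeds."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(0, size, (n_points, 2))   # bivariate uniform seed points
    greys = rng.uniform(0.0, 1.0, n_points)     # one grey level per grain
    yy, xx = np.mgrid[0:size, 0:size]
    # Squared distance from every pixel to every seed; the nearest seed
    # determines the grain (and hence the grey level) of the pixel.
    d2 = (xx[..., None] - pts[:, 0]) ** 2 + (yy[..., None] - pts[:, 1]) ** 2
    return greys[np.argmin(d2, axis=-1)]

img_a = voronoi_texture(100)   # Class A: coarser simulated grains
img_b = voronoi_texture(105)   # Class B: slightly finer simulated grains
```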
It should be noted that the analysis done by the different algorithms strictly focuses on the appearance of the images, as determined by the distributions of the pixels in the images. This could be closely related to the simulated material textures in the sense of the random orientation of the simulated grains in polycrystalline materials, but it is not a direct measurement of the material texture as such. The same would apply to Case Study 2.
The convolutional neural networks used in this investigation were built using a PyTorch backend, and all the experiments were run on a graphics processing unit (GPU) on the Google Colab platform. In general, two approaches were adopted. In the first approach, the Voronoi images were classified on the basis of features extracted by use of traditional algorithms, namely GLCM, LBP and textons.
In the second approach, the Voronoi images were classified on the basis of features extracted directly by convolutional neural networks pretrained on images from a different domain (hereafter referred to as direct deep feature extraction), as well as by partially retrained and fully retrained convolutional neural networks; the networks considered were AlexNet, VGG19, GoogLeNet, ResNet50 and MobileNetV2.
More specifically, with direct deep feature extraction, all the Voronoi images were passed through the abovementioned five CNNs and the features generated in the layer immediately preceding the last fully connected layer were used as predictors for classification of the images. By removing this last layer and freezing the weights of the model, the network can be regarded as a feature extractor. The traditional feature sets and the direct deep feature sets were then used as input to a random forest model to evaluate their corresponding classification performance in distinguishing texture A from texture B.
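In PyTorch, this amounts to replacing the final fully connected layer with an identity mapping and freezing the remaining weights. The following is a minimal sketch assuming torchvision's pretrained ResNet50; the batch size and the use of random input are purely illustrative.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Load an ImageNet-pretrained network and freeze all its weights.
model = models.resnet50(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replacing the last fully connected layer with an identity mapping makes the
# forward pass return the activations of the layer immediately preceding it.
model.fc = nn.Identity()
model.eval()

# A normalized image batch of shape (N, 3, 224, 224) yields an (N, 2048)
# feature matrix, used as predictors for the downstream classifier.
with torch.no_grad():
    features = model(torch.randn(8, 3, 224, 224))
print(features.shape)  # torch.Size([8, 2048])
```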
In the partially retrained and fully retrained networks, the original network architecture remained unchanged; the only difference from direct deep feature extraction was that either the later layers (partial retraining) or all the layers (full retraining) of the network were unfrozen for further training. The features were extracted from the same layer as in direct deep feature extraction, that is, the layer immediately preceding the last fully connected layer.
It should be noted that for the partially retrained networks, the unfrozen layers differed between the CNNs. For AlexNet and VGG19, the weights were unfrozen from the second-to-last convolutional block; for GoogLeNet, from the second-to-last inception module; for ResNet50, from the second-to-last residual module; and for MobileNetV2, from the second-to-last inverted residual block.
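A minimal sketch of this freezing scheme for ResNet50 is given below, assuming that the second-to-last residual stage corresponds to layer3 in torchvision's implementation (the stages are layer1 through layer4); the equivalent block names differ for the other CNNs.

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet50(pretrained=True)

# Freeze all weights first.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the weights from the second-to-last residual stage onwards.
for stage in (model.layer3, model.layer4):
    for param in stage.parameters():
        param.requires_grad = True

# Replace the classification head for the two-class problem; a freshly
# initialized layer is trainable by default.
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the unfrozen parameters need to be passed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
```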
As mentioned above, random forest classification models were used in combination with the traditional feature sets and the direct deep feature sets to assess the quality of the features quantitatively. The mean value of the out-of-bag (OOB) accuracy over 30 runs was used as the indicator of classification performance. The hyperparameters used in the random forest models are summarized in Table 2 and were also used in Case Studies 2 and 3; one of these hyperparameters is expressed in terms of the number of features extracted by each algorithm.
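As an illustration of this evaluation protocol, the sketch below computes the mean OOB accuracy of a feature matrix over 30 runs with scikit-learn; the tree count and the max_features setting are placeholders for the values in Table 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mean_oob_accuracy(X, y, n_runs=30, n_trees=100):
    """Mean out-of-bag accuracy of a feature set (X) over repeated runs."""
    scores = []
    for seed in range(n_runs):
        rf = RandomForestClassifier(
            n_estimators=n_trees,      # placeholder for the value in Table 2
            max_features="sqrt",       # a function of the number of features
            oob_score=True,
            random_state=seed,
        )
        rf.fit(X, y)
        scores.append(rf.oob_score_)
    return np.mean(scores)
```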
During partial or full retraining of the CNNs, the Voronoi images were randomly split into training and test datasets (split ratio 80:20), with the latter used as an independent test set to validate the generalization of the neural network models. The training set was further randomly shuffled, with 75% of it used to train the models and the remainder allocated to a validation set. The adaptive moment estimation (Adam) algorithm [58] was used as the optimizer in this work. Hyperparameter optimization was done by grid search, with the optimal learning rate and L2 penalty differing between the CNNs. For most models, the optimal initial learning rate was 1 × 10⁻⁵ with a weight decay parameter of 1 × 10⁻⁷ (L2 penalty). Optimal batch sizes and numbers of epochs varied as well. To counter overfitting, image augmentation was applied in the training stage by randomly rotating, shearing, shifting and horizontally flipping the original images. The retrained CNN models were used as end-to-end classifiers to discriminate between the two classes of simulated material textures.
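A sketch of this setup is shown below, assuming torchvision transforms for the augmentation; the dataset path and the augmentation magnitudes (rotation, shear and shift ranges) are illustrative assumptions.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, models, transforms

# Augmentation: random rotation, shearing, shifting and horizontal flipping.
train_transform = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), shear=10),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# In practice, the validation and test subsets would use a deterministic
# (non-augmenting) transform; this is omitted here for brevity.
dataset = datasets.ImageFolder("voronoi_textures/", transform=train_transform)

# 80:20 split into (training + validation) and an independent test set,
# followed by a 75:25 split of the former into training and validation sets.
n_test = len(dataset) - int(0.8 * len(dataset))
trainval, test_set = random_split(dataset, [len(dataset) - n_test, n_test])
n_val = len(trainval) - int(0.75 * len(trainval))
train_set, val_set = random_split(trainval, [len(trainval) - n_val, n_val])

# Adam with the learning rate and weight decay (L2 penalty) that were
# optimal for most models.
model = models.resnet50(pretrained=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-7)
```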
To discriminate between Classes A and B, features were extracted from each image in the texture datasets with the GLCM, LBP and texton algorithms, as well as with AlexNet, VGG19, GoogLeNet, ResNet50 and MobileNetV2, each of the latter applied in three ways: direct deep feature extraction, partial retraining and full retraining.
The classification performance of the models is summarized in Table 3, together with the number of features associated with each model. The three methods used with each CNN algorithm are indicated by different superscripts; for example, AlexNet, AlexNet* and AlexNet** refer to AlexNet with direct deep feature extraction, partial retraining and full retraining, respectively.
As can be seen from Table 3, all the traditional feature sets could discriminate between Classes A and B reasonably well, with accuracies ranging from 57.69% to 64.82%. Not all the direct deep feature sets performed as well as the traditional feature sets; GoogLeNet underperformed, while the others performed similarly.
As expected, the partially retrained networks improved the accuracy to at least 62.75% (among the three traditional models, only textons exceeded this value), and the fully retrained networks further improved the accuracy to at least 65.75%. The best model in Case Study 1 was the fully retrained GoogLeNet, which achieved an accuracy of 74.25%. In general, retraining of the CNNs improved the classification accuracy, with deeper retraining leading to further improvement.
By making use of t-SNE plots, it is possible to visualize the extracted features in a two-dimensional score space and qualitatively assess the discriminative power of the different models. The t-distributed stochastic neighbor embedding (t-SNE) algorithm [59,60] constructs points embedded in a low-dimensional space so that the relative similarities between the points in the original high-dimensional space are approximated as closely as possible in the low-dimensional space. It does so by iteratively moving the embedded points to minimize the Kullback–Leibler divergence between the two similarity distributions.
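A minimal sketch of such a score plot with scikit-learn is shown below; `features` and `labels` stand for any of the extracted feature matrices and their class labels, and the perplexity and other t-SNE settings are left at their defaults as an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_plot(features, labels):
    """Two-dimensional t-SNE score plot of a feature set, colored by class."""
    labels = np.asarray(labels)
    # Embed the high-dimensional feature vectors into two dimensions.
    scores = TSNE(n_components=2, random_state=0).fit_transform(features)
    for cls in np.unique(labels):
        mask = labels == cls
        plt.scatter(scores[mask, 0], scores[mask, 1], s=8, label=str(cls))
    plt.legend()
    plt.show()
```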
Figure 6 shows the t-SNE score plots of the features extracted by the different methods, together with the corresponding classification accuracies. The retrained networks produce better separation between the two classes than the direct deep feature extraction methods and the traditional feature extractors.
Consistent with the further improvement in accuracy obtained with features from the fully retrained networks, the separation between Classes A and B becomes clearer still. These visualizations confirm that retraining the CNNs improves the discrimination between the two classes of microstructures and that the depth of retraining makes a difference.
6. Case Study 3: Real Textures in Ultrahigh Carbon Steel for Material Classification
In the final case study, a subset of the public Ultrahigh Carbon Steel Micrograph DataBase (UHCSDB) [61] of 961 scanning electron microscopy (SEM) UHCS micrographs was considered, with the aim of classifying the textures according to the effect of the heat treatment or annealing conditions on the spheroidite morphology. The classification dataset was limited to four annealing conditions with at least 20 micrographs each, all at the same temperature (970 °C) and with the same cooling method (water quench). The only difference between the four classes A–D was the annealing time, namely 90 min, 3 h, 8 h and 24 h, respectively. The classification dataset contained 660 images and was constructed by cropping four smaller 224 × 224 pixel equalized images from the center of each original micrograph. Examples from each class are shown in Figure 9.
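A hedged sketch of this dataset construction is given below; the exact crop geometry is not specified in the text, so a 2 × 2 grid of patches around the image center and PIL's histogram equalization are assumptions for illustration.

```python
from PIL import Image, ImageOps

def center_crops(path, size=224):
    """Four size x size patches in a 2 x 2 grid around the image center."""
    img = ImageOps.equalize(Image.open(path).convert("L"))  # histogram equalization
    cx, cy = img.size[0] // 2, img.size[1] // 2
    crops = []
    for dx in (-size, 0):
        for dy in (-size, 0):
            crops.append(img.crop((cx + dx, cy + dy, cx + dx + size, cy + dy + size)))
    return crops
```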
As before, the same classification framework as in Case Studies 1 and 2 was used in Case Study 3. To minimize the variability of the model performance, the two fully retrained networks were each trained multiple times on several different splits of the training set (including the validation set) and tested on the corresponding independent test sets.
The classification performance of the different models in Case Study 3 is summarized in Table 5, together with the dimension of the corresponding feature set. Classification performance was based on the percentage of images correctly classified, using the feature sets as predictors. As can be seen in Table 5, one of the direct deep feature extraction methods achieved performance comparable to that of the traditional texton model. Furthermore, the classification accuracy with the two retrained CNNs improved to 87.31–91.04% with partial retraining and to near-perfect accuracies of 97.01–97.76% with full retraining.
The performance of the fully retrained MobileNetV2 appeared to be slightly better than that of the fully retrained GoogLeNet, despite the trade-off MobileNetV2 makes between performance and speed. To further highlight the performance of GoogLeNet and MobileNetV2, their receiver operating characteristic (ROC) curves are given in Appendix A.
As before, the features can be visualized in the t-SNE score plots in Figure 10. Only one feature set of each fully retrained network (which was trained multiple times) is shown here for visualization purposes. Textons can separate the four classes reasonably well, but overlap can still be seen between successive classes. Similarly, the CNN features extracted directly, without retraining of the feature layers of the networks, separate the four classes, although overlap remains.
The partially retrained CNN features separate the four classes somewhat better, with the four classes generally located in a sequence of regions, although some overlap persists for specific classes (e.g., Class B and the other classes). In contrast, the fully retrained CNN features form four distinct clusters in the feature space and can thus separate the four classes nearly perfectly.
To further demonstrate the classification performance of the fully retrained MobileNetV2, the confusion matrix on the test set is shown in Table 6. There are 20–48 samples per class in the test set. The actual observations are presented in the rows and the predicted labels in the columns of the table; the numbers on the diagonal represent correct predictions, while the off-diagonal numbers give insight into the failures of the model. As can be seen from Table 6, Classes C and D could be distinguished from the other classes perfectly, while the few errors made by the model were associated with distinguishing Class A from Class D and Class B from Class A. This is consistent with the corresponding t-SNE plot and may in part be owing to the small dataset available, as well as to the statistical variation in the images themselves, both of which could have contributed to the difficulty of the problem.
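For reference, the sketch below shows how such a confusion matrix is computed with scikit-learn; the label vectors are toy examples, not the actual test-set predictions.

```python
from sklearn.metrics import confusion_matrix

y_true = ["A", "A", "B", "C", "D", "D"]   # actual annealing classes (toy data)
y_pred = ["A", "D", "A", "C", "D", "D"]   # model predictions (toy data)

# cm[i, j] counts samples of class i predicted as class j: the diagonal holds
# the correct predictions, the off-diagonal entries the misclassifications.
cm = confusion_matrix(y_true, y_pred, labels=["A", "B", "C", "D"])
print(cm)
```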
7. Discussion
Owing to their deep architectures and large parameter sets, convolutional neural networks are displacing traditional methods as the state of the art in an increasing number of image recognition applications. However, training large and complex convolutional neural networks entirely from scratch can be prohibitively costly, assuming that sufficient data for training are available in the first place. The use of pretrained networks is therefore a potentially attractive approach to automated image recognition.
All three case studies demonstrate the effectiveness of pretrained convolutional neural networks, both as competent feature extractors and as end-to-end classifiers. These networks were originally trained to capture domain-specific features from ImageNet images, which may not be optimal for classifying images from other domains (e.g., material textures). Nevertheless, the case studies have shown that they are able to achieve at least equivalent (without retraining) or better (with partial or full retraining) performance in texture classification compared with traditional feature extraction methods (specifically GLCM, LBP and textons).
When the networks were used directly as deep feature extractors, the classification accuracy obtained with their features was generally similar to that obtained with the traditional algorithms in Case Studies 1 and 2, and slightly better in Case Study 3. The performance of the unretrained deep learning models was probably inhibited by the strong dissimilarity between the source and target databases, as well as by the comparatively small sizes of the simulated microstructure and ultrahigh carbon steel image datasets. Compensating for this dissimilarity and data scarcity requires retraining of the top layers, or all the layers, of the network.
When the pretrained networks are retrained, fine-tuning of the weights is accomplished by backpropagation. The earlier layers of these networks have learned generic features that are useful across different tasks, while the later layers have learned more complex features specific to the domain of the source dataset. It is therefore sensible to freeze the earlier layers and retrain the later ones, forcing the networks to learn high-level features relevant to the new dataset. Moreover, full retraining, with all the layers unfrozen, led to markedly higher accuracy in all the case studies.
In future work, other lightweight and effective convolutional neural networks (e.g., ShuffleNet and EfficientNet) should be explored further for texture analysis. Full retraining of these networks should be considered in the first instance, owing to the increasing computational power of devices and improved access to high-performance computing (HPC) facilities. More advanced training strategies, based on the use of transfer learning and progressive image resizing [62], could also yield richer, hierarchical textural feature sets.
In addition, the efficacy of convolutional neural networks specifically designed for textural feature extraction, such as T-CNN [63,64], B-CNN [65] and Deep-TEN [66], would also need to be considered, as the advantages of these networks have not yet been established.
Finally, future work is likely to focus increasingly on the interpretation of deep learning models [67,68]. This would not only increase the acceptance of models that can explain why a certain texture or image is identified as such, but could also enhance the analyst's understanding of the physical attributes of materials that affect their performance in different applications.