Framework of Specific Description Generation for Aluminum Alloy Metallographic Image Based on Visual and Language Information Fusion

The automatic generation of language descriptions is an important task in the intelligent analysis of aluminum alloy metallographic images, and is crucial for the high-quality development of the non-ferrous metals manufacturing industry. In this paper, we propose a methodological framework to generate the language description for aluminum alloy metallographic images. The framework consists of two parts: feature extraction and classification. In the process of feature extraction, we used ResNet (residual network) and a CNN (convolutional neural network) to extract visual features from metallographic images. Meanwhile, we used LSTM (long short-term memory), FastText, and TextCNN to extract language text features from questions. Then, we implemented a fusion strategy to integrate these two kinds of features. Finally, we used the fused features as the input of the classification network. This framework turns the description generation problem into a classification task, which greatly simplifies the generation process of the language description and provides a new idea for the description of metallographic images. Based on this basic framework, we implemented seven different methods to generate the language description of aluminum alloy metallographic images, and their performance comparisons are given. To verify the effectiveness of the framework, we built an aluminum alloy metallographic image dataset. Extensive experimental results show that this framework can effectively accomplish the given tasks.


Introduction
Aluminum alloy is one of the most widely-used non-ferrous metal materials in industry. Because of its good performance, it is widely used in aviation, aerospace, navigation, railways, highways, and other fields [1][2][3][4][5][6]. The properties of aluminum alloy mainly depend on its microstructures, and metallographic analysis is the main method to evaluate its microstructures [7][8][9][10]. In practice, material science experts evaluate the properties of aluminum alloys by observing and analyzing the given metallographic images. The analysis of complex metallographic images generally requires a lot of time and energy from experts, and suffers from poor repeatability due to the different experience of participants [11].
In order to solve these problems, more and more scholars have begun to pay attention to research into intelligent metallographic image processing and analysis methods [12][13][14][15]. In recent years, many automatic metallographic image-processing and analysis methods have been proposed, which can greatly improve the efficiency of metallographic analysis tasks. According to different functions, these methods can be divided into four categories: microstructural classification, segmentation, quantitative calculation, and grain boundary extraction.
The aim of the microstructural classification method is to classify different microstructures in a given metallographic image. For example, Decost and Holm proposed a computer vision approach for automatic analysis and classification of microstructural image data. This approach was able to classify microstructures into one of seven groups with greater than 80% accuracy [16]. In Gola et al.'s paper [17], a data-mining process is presented based on a support vector machine (SVM), which was able to distinguish between different microstructures of the two-phase steels.
Microstructural segmentation methods aim to segment the different microstructures in a given metallographic image. For example, Jiang et al. applied an improved SLIC (simple linear iterative clustering) algorithm and a region-merging technique to automatically segment grain regions [18]. Albuquerque et al. applied multilayer perceptron and self-organizing map neural network topologies to segment microstructures from metallographic images [19]. In Bulgarevich et al.'s paper [20], a fast random forest-based method is proposed for reliable and automatic segmentation of typical steel microstructures. In Albuquerque et al.'s paper [21], a neural network-based method is proposed for automatic segmentation of nickel alloy secondary phases from SEM (scanning electron microscope) images. In Papa et al.'s paper [22], the automatic segmentation of graphite particles in metallographic images is achieved by using Otsu, SVM, Bayesian, and optimum-path forest methods. Deep learning methods have dramatically improved upon conventional machine learning techniques due to their strong ability to learn the hierarchical latent features of high-dimensional data [23,24]. These methods have been successfully applied in metallographic image segmentation. Azimi et al. [25] proposed a fully convolutional neural network (FCNN) accompanied by a max-voting scheme to segment given microstructures of low-carbon steel. In Ma et al.'s paper [26], the DeepLab network was used for Al-La alloy metallographic image segmentation. These deep-learning-based methods achieved satisfactory results. However, they always needed a large amount of hand-labeled data to achieve accurate microstructural segmentation. In order to solve this problem, a fast automatic labeling method was proposed to label metallographic images quickly [27].
Microstructural quantitative calculation aims to obtain the quantitative information from the given metallographic image, such as the size, shape, and distribution of the different microstructures. For example, in references [28] and [29], conventional digital image processing methods are used for automatic quantification of microstructural features.
Grain boundary extraction aims to extract the grain boundaries. For example, in Xu et al.'s paper [30], an improved mean shift method is presented for automatically extracting grain boundaries, solving the problem of grain boundary blurring or disconnection. In Journaux et al.'s paper [31], the directional wavelets and mathematical morphology are used for grain boundary extraction.
Unlike the existing metallographic image processing methods, in this paper, we focus our attention on the automatic generation of language descriptions from metallographic images. The purpose is to automatically generate a description of the content of a given metallographic image similar to one obtained from material science experts. It is an important part of an intelligent metallographic image analysis system. The automatic generation of a language description of metallographic images is a very challenging task, because it requires the combination of image processing and natural language processing. Recently, many methods have been proposed for describing natural scene images [32][33][34]. These methods detect multiple objects and their spatial relationships, and then generate a language description to fit these constituent parts. Such language descriptions are often general.
In contrast to natural scene images, aluminum alloy metallographic images need a more specific language description, which is useful for subsequent analysis. To address this requirement, we propose a novel method to automatically generate specific language descriptions from aluminum alloy metallographic images. Inspired by Antol, Wu, and Kazemi et al.'s papers [35][36][37], we considered the aluminum alloy metallographic image description task as a classification problem. This method consists of two parts. The first part, feature extraction, extracts and fuses the best visual and language features for use in the generation of the language description of aluminum alloy metallographic images. The second part, classification, predicts a class to generate a natural language description based on the extracted features.
We summarize the contributions of this paper as follows.
(1) We achieved automatic generation of the language description for given aluminum alloy metallographic images. In this framework, the aluminum alloy metallographic image description task can be considered as a classification problem, which greatly simplifies the generation process of language description and provides a new idea for the description of metallographic images.
(2) We used ResNet [38][39][40] and a convolutional neural network (CNN) to extract visual features from metallographic images. Meanwhile, we used LSTM [41], FastText [42], and TextCNN [43] to extract language text features from given questions. Moreover, we present a comparative analysis of the seven combination strategies applied to generate the natural language description of aluminum alloy metallographic images.
(3) The proposed method can not only obtain the language description, but also obtain the attention map. This attention map can correctly reflect the high attention area of the given aluminum alloy metallographic image. This is helpful for the professionals to analyze the aluminum alloy metallographic images.
This paper is organized as follows: Section 1 introduces prior work and our contributions. In Section 2, we introduce the proposed method, including the feature extraction scheme and classification method. Section 3 presents the performance comparisons, attention map analysis, and convergence analysis. The paper is concluded in Section 4.

The Proposed Methods
The automatic generation of the language description is an important part of the automated analysis system of aluminum alloy metallographic images. The basic framework of our proposed method is shown in Figure 1. It consists of two parts: feature extraction and classification. Unlike typical automatic language description generation methods, the input of our method includes not only an image, but also a text question associated with this image. This question is important because it steers the method toward a specific language description that is helpful for analysis of the metallographic image. In the feature extraction scheme, we extract a metallographic image feature and a text question feature at the same time, and then merge them into one latent feature. The classification method is used to classify the obtained latent feature and obtain the specific language description of the given aluminum alloy metallographic image. The output includes not only the language description, but also an attention map of the given metallographic image. This attention map is learned automatically by the proposed deep neural network. From the attention map, we can find the key visual features that affect the generation of the language description. This provides valuable additional information for analyzing aluminum alloy metallographic images. In the next section, we introduce the proposed method in detail.



Feature Extraction and Fusion Scheme
The aim of the feature extraction scheme is to transform the given aluminum alloy metallographic images and corresponding questions from image and text data space to latent feature space. For this purpose, the deep neural network is used due to its strong ability to learn the hierarchical latent features of given metallographic images and questions. This scheme consists of three parts: metallographic image feature extraction, question text feature extraction, and features fusion.
For the metallographic image feature extraction, the latent visual feature z_I can be computed by

z_I = F_I(I; θ_I), (1)

where I is the given metallographic image, F_I represents a convolutional neural network (CNN), and θ_I represents the parameters of F_I. In this paper, we used a CNN and an improved ResNet. The size of the input image was 224 × 224 pixels, and the size of the image features extracted by ResNet was 14 × 14. In order to keep the 14 × 14 size of the feature map, the CNN used to extract image features consisted of four convolutional layers and four pooling layers. In the training process, the parameter θ_I is adjusted to fit the given metallographic image training dataset.

Similarly, in the process of question feature extraction, the latent question text feature z_Q can be computed by

z_Q = F_Q(Q; θ_Q), (2)

where Q is the given question, F_Q represents a deep neural network, and θ_Q represents the parameters of F_Q. In this paper, we used an improved LSTM, FastText, and TextCNN.

The latent visual feature z_I and the text feature z_Q have different dimensions; therefore, we designed a fusion method to integrate the two features. Let θ_1 be the 1 × 1 convolution layer of depth 512; we have

z*_I = CONV(z_I; θ_1), (3)

and

z*_Q = CONV(z_Q; θ_1), (4)

where CONV is the convolution operator, and z*_I and z*_Q have the same dimension. Therefore, we can compute

z_C = z*_I + z*_Q. (5)

Let θ_2 be the 1 × 1 convolution layer of depth 2; we have

α = softmax(CONV(ReLU(z_C); θ_2)), (6)

where softmax is the softmax function and ReLU is the rectified linear unit activation function. The softmax classifier is the most popular classifier, and many experiments have shown that it can obtain satisfactory results; therefore, we used the softmax function as the classifier in our framework. Then we can compute the attention-weighted visual feature

z_w = F_w(z_I, α), (7)

where F_w is the weighted average operator. The fused feature z_U can be obtained by

z_U = F_C(z_w, z_Q), (8)

where F_C is the concatenate operator.
For easy description, we define

z_U = F_U(z_I, z_Q; θ_U), (9)

where F_U represents the fusion operator and θ_U is the parameter of the fusion network.
In the process of feature fusion, our purpose was to fuse the visual feature with the text feature. However, the image feature dimension differed from the text feature dimension: the dimension of the image feature was 14 × 14 × 2048, while the dimension of the question feature was 1 × 1024. We therefore expanded the dimensions of the text feature so that the image feature and question feature had the same dimensions, and then added them together to obtain the fusion feature.
To summarize, the fusion strategy in the form of pseudo-code is shown in Algorithm 1, as follows:

Algorithm 1: The fusion strategy of latent visual and text features.
Inputs: Latent visual feature z_I, text feature z_Q, and network parameter θ_U.
Output: Fused feature z_U.
• Compute z*_I = CONV(z_I; θ_1) and z*_Q = CONV(z_Q; θ_1).
• Compute the attention map α = softmax(CONV(ReLU(z*_I + z*_Q); θ_2)).
• Compute the weighted visual feature z_w = F_w(z_I, α).
• Output the fused feature z_U = F_C(z_w, z_Q).
In the training step, the parameters θ I , θ Q , and θ U will be adjusted to fit the given metallographic image training dataset.
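As a sanity check, the fusion pipeline described above can be sketched in NumPy, treating each 1 × 1 convolution as a per-location linear map. The layer sizes, the two-glimpse attention, and the random weights here are illustrative assumptions for this sketch, not the trained networks of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative shapes: a 14x14 grid of 2048-d visual features, a 1024-d question feature.
H, W, C_I, C_Q, D = 14, 14, 2048, 1024, 512
z_I = rng.standard_normal((H, W, C_I))
z_Q = rng.standard_normal(C_Q)

# A 1x1 convolution of depth 512 is a linear map applied at each spatial location.
W1_I = rng.standard_normal((C_I, D)) * 0.01
W1_Q = rng.standard_normal((C_Q, D)) * 0.01
z_I_star = z_I @ W1_I                               # (14, 14, 512)
z_Q_star = np.broadcast_to(z_Q @ W1_Q, (H, W, D))   # text feature tiled over the grid

# A 1x1 convolution of depth 2 followed by a softmax over spatial locations
# yields two attention maps (assumed here to be two "glimpses").
W2 = rng.standard_normal((D, 2)) * 0.01
logits = np.maximum(z_I_star + z_Q_star, 0) @ W2    # ReLU, then depth-2 conv
alpha = softmax(logits.reshape(H * W, 2), axis=0)   # (196, 2); each column sums to 1

# Weighted average of the visual features under each attention map,
# then concatenation with the question feature gives the fused feature.
z_w = (alpha.T @ z_I.reshape(H * W, C_I)).reshape(-1)  # (2 * 2048,)
z_U = np.concatenate([z_w, z_Q])
print(z_U.shape)  # (5120,)
```

With these assumed sizes, the fused feature is the 4096-d attention-pooled visual vector concatenated with the 1024-d question vector.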

Classification Method
The aluminum alloy metallographic image description task aims to generate a specific and accurate language description of aluminum alloy metallographic images. In fact, in the metallographic analysis of aluminum alloys, people often ask a limited set of questions. We can list these questions and give the corresponding answers. Therefore, we can consider the aluminum alloy metallographic image description task as a classification problem, which reasonably simplifies the generation process of the language description. The input of the classifier is the fusion feature, and the output is the language description, as shown in Figure 1.
Let c_k represent the k-th description and p(c_k | z) be the probability that the feature z generates the k-th description, which is given by the softmax transformation of linear functions of the feature variable:

p(c_k | z) = exp(θ_{A,k}^T z) / Σ_{j=1}^{K} exp(θ_{A,j}^T z), (10)

where θ_A denotes the parameters of the classifier. The best description can be obtained by

c* = c_y, y = argmax_{k=1:K} p(c_k | z), (11)

where K is the number of descriptions and c = {c_1, c_2, ..., c_K} is the description set. The network parameter θ is defined by θ = {θ_I, θ_Q, θ_U, θ_A}, which is very important for generating an accurate language description. In order to deal with the problem of parameter estimation, we use maximum likelihood estimation (MLE) to compute the network model parameter θ. Suppose that we are given a training dataset D = {I_n, Q_n, t_n; n = 1:N}. Here, I_n is the n-th metallographic image, Q_n is the n-th question, and t_n ∈ {0, 1}^K is a binary class label vector with Σ_{k=1}^{K} t_n^(k) = 1; the 1-of-K coding scheme sets t_n^(k) = 1 if the input {I_n, Q_n} belongs to class c_k. The likelihood of a label vector is then

p(t_n | I_n, Q_n; θ) = Π_{k=1}^{K} p(c_k | I_n, Q_n; θ)^{t_n^(k)}. (12)

Using maximum likelihood estimation, we can compute the network model parameters θ by solving the following optimization problem:

θ* = argmin_θ −Σ_{n=1}^{N} Σ_{k=1}^{K} t_n^(k) ln p(c_k | I_n, Q_n; θ), (13)

which is known as the cross-entropy loss function for the multiclass classification problem. The stochastic gradient descent (SGD) algorithm is used to solve this optimization problem.
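The softmax classifier, the cross-entropy loss, and a single SGD step can be illustrated in NumPy as follows; the feature dimension, learning rate, and random data are illustrative assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative sizes: K = 11 descriptions, a 5120-d fused feature.
K, D = 11, 5120
theta_A = rng.standard_normal((D, K)) * 0.01  # classifier parameters

def p_desc(z):
    """p(c_k | z): softmax over linear functions of the fused feature."""
    return softmax(z @ theta_A)

def cross_entropy(z, t):
    """Multiclass cross-entropy for a one-hot (1-of-K) label vector t."""
    return -np.sum(t * np.log(p_desc(z) + 1e-12))

# One SGD step on a single (feature, label) pair.
z = rng.standard_normal(D)
t = np.eye(K)[3]                       # ground-truth description c_4
grad = np.outer(z, p_desc(z) - t)      # gradient of the loss w.r.t. theta_A
lr = 0.001
before = cross_entropy(z, t)
theta_A -= lr * grad
after = cross_entropy(z, t)
print(after < before)  # True: the step reduces the loss on this sample
```

In practice the gradient is propagated through the fusion and feature extraction networks as well, so θ_I, θ_Q, and θ_U are updated jointly with θ_A.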
To summarize, our metallographic image description method in the form of pseudo-code is shown in Algorithm 2, as follows:

Algorithm 2: The metallographic image description method.
Inputs: Training dataset D, metallographic image I, and question Q.
Output: The description c*.
Step 1: Initialization:
• Initialize the network parameter θ.
Step 2: Optimize θ by using D:
• while not converged do
• Compute the network parameter θ* by solving Equation (13) using the SGD algorithm.
• end while
Step 3: Generate the description by using {I, Q} and θ:
• Compute the latent feature z_U by using Algorithm 1.
• Generate the language description c* = c_y.

Experimental Dataset
In order to verify the proposed method, we built the experimental dataset, which contained 180 aluminum alloy metallographic images and 180 natural scene images. The natural scene images were obtained by randomly sampling from the COCO (common objects in context) public dataset. The aluminum alloy metallographic images were taken by a metallographic microscope, and consisted of 100 5-series metallographic images and 80 6-series metallographic images. These metallographic images included six different types of phases, namely Mg2Si, FeMnAl6, FeAl3, MnAl6, Si, and FeMnSiAl6. Two typical aluminum alloy metallographic images are shown in Figure 2.

In addition, in this dataset, we designed four questions and eleven language descriptions for each image according to the practical requirements of aluminum alloy metallographic image analysis, as shown in Table 1. In our experiment, we have eleven classes or descriptions in total, as shown in the second row, and each class corresponds to one combination of a given metallographic image and question. In order to clarify this relationship, we marked the questions and descriptions in Table 1. For example, the first question is labeled 1, and the corresponding descriptions are also labeled 1.

Performance Comparison
The aim of this section is to analyze the performance of the proposed description method. Let D_v = {I_n, Q_n, C_n} represent the test dataset and f_θ represent our proposed method. We used the accuracy (ACC), which is widely used in the literature, to evaluate the proposed method f_θ. It can be calculated by the following formula:

ACC = (1/N) Σ_{n=1}^{N} δ(c*_n = C_n),

where δ(c*_n = C_n) is the indicator function, equal to 1 if c*_n = C_n and 0 otherwise, c*_n is the estimated result, and C_n is the ground truth.

In addition, we implemented seven different methods on the basis of the proposed framework. For easy description, let f_i, i = 1:7, denote the i-th method. The networks used in these seven methods are shown in Table 2, where F_I denotes the visual feature extraction network and F_Q denotes the text feature extraction network. In our framework, these two networks are critical.

The detailed network structures and parameter settings are as follows. In the process of training, we used a fixed number of epochs (500) and an initial learning rate of 0.001. In ResNet, the number of residual blocks was 50 or 16. In CNN, there were four convolutional layers and four pooling layers. In LSTM (long short-term memory), the output dimension was 1024 or 256. In TextCNN, the word embedding had 100 dimensions, and there were three convolutional layers and three pooling layers. In FastText, the word embedding had 100 dimensions and there were two linear layers. Moreover, we used the SGD optimizer with L2 regularization; the weight decay was 0.02, which accelerated the training process of the model.
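The ACC metric above is simply the mean of the indicator function over the test set. A minimal sketch, with hypothetical predictions and labels:

```python
import numpy as np

def accuracy(pred, truth):
    """ACC = (1/N) * sum over n of indicator(c*_n == C_n)."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return float(np.mean(pred == truth))

# Hypothetical estimated classes vs. ground-truth classes.
print(accuracy([1, 2, 3, 3, 5], [1, 2, 3, 4, 5]))  # 0.8
```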
In the experiments, we used cross-validation to ensure the accuracy of the evaluation results. We divided the dataset D into six mutually exclusive subsets of the same size, D = D_1 ∪ D_2 ∪ D_3 ∪ D_4 ∪ D_5 ∪ D_6 with D_i ∩ D_j = ∅ (i ≠ j). We used five subsets as the training set and the remaining subset as the test set, and thus obtained six experimental results, as shown in the second to seventh columns of Table 3. The last column of Table 3 shows the average of the six experiments, and the first row denotes the experiment number. From these experimental results, we can see that all seven methods achieved more than 90% accuracy, and the third method f_3 (ResNet34 and LSTM1024) had the best average accuracy. Therefore, we can conclude that the proposed method can accurately generate the language description of the aluminum alloy metallographic image. In addition, the box plots of the experimental results obtained by the seven methods are shown in Figure 3, where the red plus signs denote outliers and the red lines denote medians. From Figure 3, we can observe that: (1) the outliers are caused by the first experiment, so we computed the medians without using the outliers; (2) all median values are concentrated between 0.96635 and 0.97907, an interval of length less than 0.013; and (3) the experimental results are mainly distributed between 0.94962 and 0.99917, an interval of length less than 0.050. Therefore, we can conclude that the proposed method has good robustness.
In addition, we randomly divided the dataset into six mutually exclusive subsets of the same size for the experiment, i.e., 6-fold cross-validation. In this way, we could improve the stability of the evaluation. However, because the dataset was divided randomly, some subsets may not contain all classes, which leads to some outliers. For example, from Table 3 and Figure 3, we can see that subset 1 does not contain all classes.
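The 6-fold partition described above can be sketched as follows. The sample count (360 images) matches the dataset size, while the random seed is an arbitrary assumption; note that a stratified split would avoid folds that miss some classes.

```python
import random

def k_fold_indices(n, k=6, seed=0):
    """Randomly partition n sample indices into k mutually exclusive,
    equally sized folds (assumes k divides n, as with 360 samples and k = 6)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    fold = n // k
    return [idx[i * fold:(i + 1) * fold] for i in range(k)]

folds = k_fold_indices(360, 6)
# In each round, one fold is the test set and the other five form the training set.
test = set(folds[0])
train = [i for i in range(360) if i not in test]
print(len(test), len(train))  # 60 300
```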
The training times of the seven different methods are shown in Table 4. We can see that the training times of the third and fourth methods were both less than 20 minutes.

Attention Map Analysis
Our method generates not only the language description, but also an attention map of a given metal micrograph. In this section, we will analyze the importance of attention images. Figure 4 shows an example of an attention map experimental result.
The attention map is learned automatically by the proposed deep neural network and can correctly reflect the high-attention area of the given aluminum alloy metallographic image. It is a probability map extracted from a convolutional neural network. In our network, we fed the fusion feature into a convolutional neural network to obtain the attention map. The process was as follows: (1) we used a two-layer convolutional network to process the fusion feature and obtained two initial attention maps, (2) we averaged these two attention maps pixel-wise to get the final attention map and transformed the pixel values into probability values, and (3) the attention map was rendered in pseudo-color.
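Step (2) above, pixel-wise averaging followed by conversion to probability values, can be sketched as follows. Normalizing by the sum is one plausible reading of the conversion to a probability map, and the 14 × 14 map size and random inputs are illustrative assumptions.

```python
import numpy as np

def attention_probability_map(m1, m2):
    """Average two attention maps pixel-wise and normalize the result
    so the values form a probability distribution over locations."""
    avg = (np.asarray(m1, float) + np.asarray(m2, float)) / 2.0
    return avg / avg.sum()

rng = np.random.default_rng(2)
m1, m2 = rng.random((14, 14)), rng.random((14, 14))
prob = attention_probability_map(m1, m2)
print(round(prob.sum(), 6))  # 1.0
```

The resulting probability map can then be colorized (e.g., with a jet-style colormap) and overlaid on the original image for inspection.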

In Figure 4, the left image shows the system input, which includes the given aluminum alloy metallographic image and the question of interest. The output includes the attention map and the language description, as shown in the middle image. In the attention map, red denotes the regions with high attention. For convenient analysis, we overlay the attention map on the original image, as shown in the right image. From Figure 4, we can observe that the main microstructures are distributed in the regions with high attention. This verifies the effectiveness of the attention maps. Therefore, we can conclude that attention maps are helpful for the analysis of aluminum alloy metallographic images.

Convergence Analysis
The aim of this experiment is to analyze the convergence of the proposed framework. The loss and accuracy curves over time for the seven different methods are shown in Figure 5. From Figure 5, we can observe that all seven methods converge, and that methods f_1, f_2, f_3, and f_4 converge faster than the other three methods. This verifies the convergence of the proposed methods.


Conclusions
In this paper, we proposed a basic framework to generate the language description for aluminum alloy metallographic images. The framework casts the description task as a classification problem and includes feature extraction and classification. Using this basic framework, we implemented seven different methods to generate the language description of aluminum alloy metallographic images. Extensive experimental results show that this framework can effectively accomplish the given tasks and has good convergence. In the future, we plan to investigate the use of semantic segmentation of metallographic images for further improvement. In addition, we are also interested in the use of high-level semantic priors of microstructures.
