Heated Metal Mark Attribute Recognition Based on Compressed CNNs Model

Abstract: This study considered heated metal mark attribute recognition based on compressed convolutional neural networks (CNNs) models. Based on our previous works, the heated metal mark image benchmark dataset was further expanded. State-of-the-art lightweight CNNs models were selected. Technologies of pruning, compressing and weight quantization were introduced and analyzed. Then, a multi-label model training method was devised. Moreover, the proposed models were deployed on Android devices. Finally, comprehensive experiments were conducted. The results show that, with the fine-tuned compressed CNNs model, the recognition rates of the attributes metal type, heating mode, heating temperature, heating duration, cooling mode, placing duration and relative humidity were 0.803, 0.837, 0.825, 0.812, 0.883, 0.817 and 0.894, respectively. The best model obtained an overall performance of 0.823. Compared with traditional CNNs, the adopted compressed multi-label model greatly improved the training efficiency and reduced the space occupation, with a relatively small decrease in recognition accuracy. The running time on Android devices was acceptable. The proposed model is therefore applicable to real-time applications and convenient to implement on mobile or embedded devices.


Introduction
The expansion of the modern construction industry and material technology means that many metal components are used in domestic appliances. In the event of a fire, experts hope to find important clues at the scene. Metal is an important source of such clues and was therefore investigated. At a fire scene, metal components survive because they are non-flammable. Meanwhile, special marks are left on the surface of a metal component due to physical and chemical changes when it is heated. The conditions of a fire scene are complicated, and marks on metal components are influenced by heating temperature, heating duration, cooling mode, etc. These attributes are very important in fire science since they are useful indications for analyzing the location, source and situation of a fire. Different attribute conditions result in different oxidation reactions on the metal surface. Each fire scene is distinctive and very hard to restore. Therefore, it is sensible to recognize heated metal attributes based on the mark image.
Traditional methods rely on human experts who use knowledge of physics and chemistry to analyze heated metal attributes. Table 1 demonstrates the inspection for trace and physical evidences from fire scene (a National standard of People's Republic of China) [1]. It is basically a relationship between the color of heated metal and its heating temperature. Changes of metallographic organization due to heating temperature were studied by Wu et al. [2,3]. Macro-inspection and micro-analysis were used to record the attribute value on the object surface by Xu et al. [4]. Stereo microscopes and electron microscopes have been used to find changes in the chemical composition and organization structure of Zn-Fe when heated. However, these methods have two main drawbacks: (1) they rely on human experts for qualitative analysis; and (2) they are impractical to implement as they offer little automation. To solve these problems, this paper presents a correlation construction between heated metal mark images and attributes based on machine learning technology and a completely data-driven method. Image recognition is a traditional problem and has been studied for decades in the computer vision and machine learning fields. The image object is first represented by feature vectors, and then a classifier is learned in feature space with training data.
Feature extraction and representation play key roles, and many works have been published. To deal with image translation, scale variation, rotation, illumination and distortion, many expert-designed features have been proposed. SIFT (scale invariant feature transform) was introduced by Lowe [5,6]. It expresses the gradient direction of a local image region and encodes an image patch as a 128-D feature vector. HOG (histograms of oriented gradients) was proposed by Dalal and Triggs [7]. It is computed from a group of gradient orientation histograms on image sub-regions. The sizes of the block and cell determine the dimension of the HOG descriptor. To improve the efficiency of SIFT, SURF (speeded-up robust features) was proposed by Bay et al. [8]. Scale-space extrema of the determinant of the Hessian matrix are used to compute interest points, and the SURF feature is determined with Haar wavelets. The LBP (local binary patterns) descriptor was introduced by Wang et al. [9]. It defines an 8-bit number to record the difference between a pixel and its eight neighbors. The frequencies of all 8-bit numbers are counted to represent the feature vector of an image's local region. Based on these basic feature extraction and representation methods, various improvements have also been proposed, and global feature organization methods have been designed. The BoVW (bag of visual words) model, proposed by Li et al. [10], is one of the most widely adopted methods. Each local patch of an image is mapped to a clustered visual word, and the histogram of visual word frequency is used to represent the whole image. SPM (spatial pyramid matching) was proposed by Grauman and Darrell [11]. It improves BoVW by dividing an image at multiple resolutions.
Since 2012, with the advent of large-scale labeled datasets and GPUs, deep learning, especially convolutional neural networks (CNNs), has achieved great success. A CNN is essentially a multi-layered neural network with cascaded nonlinear processing units for feature extraction and representation. Its excellent performance derives from: (1) complex model representation with millions of parameters; and (2) completely automatic model optimization and adjustment. LeCun et al. [12,13] designed LeNet, a successful small-scale CNNs model used in handwritten mail zip code recognition. A medium-scale CNNs model, AlexNet, proposed by Krizhevsky et al. [14], won the ImageNet 2012 competition by a significant margin over non-deep learning methods. More powerful models, e.g., ZFNet, VGGNet, Inception and ResNet, were designed successively by Zeiler et al. and others [15][16][17][18]. They improve CNNs by using more layers, small convolutional filters, flexible convolutional filter sizes, combined width and depth of the model, and optimal and robust model training methods. ResNet achieved an excellent Top-5 error (3.57%) and outperformed humans for the first time in the ImageNet 2015 classification task. Some relevant works have been reported. A rail surface defect type recognition method based on a CNNs model was introduced by Shahrzad et al. [19]. The model contains three convolutional layers, three max-pooling layers and two fully connected layers, and 24,408 object images were collected and labeled. A bearing fault diagnosis method was proposed for 10-type fault classification based on CNNs and an improved Dempster-Shafer algorithm by Li et al. [20]. The CNN model used in that work contains only three convolutional layers and one fully connected layer. By model ensemble, the final result is combined from various evidence. A steel defect area characterization method was proposed by Psuj [21], utilizing a magnetic multi-sensor matrix transducer. The basic model has three convolutional layers, three max-pooling layers and one fully connected layer. Three combined models are adopted for classification. In total, 35,000 simulated images were generated in that work. A hot-rolled steel sheet surface defect classification method was designed by Zhou et al. [22]. Eight surface defect types are defined and 14,400 sample images are used. A civil infrastructure damage detection method was designed by Cha et al. [23]. The structure of the model used in this method is the same as in Ref. [21]. Small patches are cut and manually annotated as crack or intact, and there are 40,000 sample images in the dataset. A structural surface damage detection method based on Faster R-CNN was proposed in Ref. [24]. Five types of surface damage are defined and ZFNet is selected as the backbone model. In total, 2366 images were collected as the dataset in that study. These works are similar to ours. However, they are application-oriented and use simple CNN model structures. Models containing fewer than eight layers are used in Ref. [19][20][21][22][23], and ZFNet is adopted in Ref. [24]. The CNN model structures are relatively small and state-of-the-art models are not adopted. In addition, a complete experimental evaluation and analysis is still needed, covering training parameters, various CNN structures, etc.
Our previous work performed a case study on heated metal attribute recognition based on CNNs [25]. We analyzed and selected seven heated metal attributes. A raw image set was generated with special capture devices (vacuum resistance furnace, muffle furnace, and gasoline burner; test chamber with constant temperature and humidity; and microscope) and a benchmark dataset was organized (900 image samples, each labeled with seven attributes). The relationship between attributes and mark images was learned with state-of-the-art CNNs models (Inception-v4, Inception-v3, ResNet, VGG16, etc.). Experimental evaluations were conducted for various model structures, batch sizes, data augmentation settings, and training algorithms. The present work continues that research. In this study, the benchmark dataset was first expanded with 900 additional images. Then, compressed CNNs models were analyzed to increase the model efficiency. Because a heated metal mark image carries seven attributes, a model based on multi-label training was devised to accomplish the recognition task in a single pass. Moreover, the compressed CNNs models were deployed on Android platforms. Finally, experiments were conducted from various aspects.
The main contributions of this paper are threefold: (1) The benchmark image dataset was further expanded (doubled). (2) Compressed CNNs models were adopted and a new model training method was proposed based on multi-label. (3) Models were deployed and tested on Android platforms.

Problem Statement
According to the definition in National standard of People's Republic of China GB/T42327905.3-2011 (inspection methods for trace and physical evidences from fire scene-Part 3: Ferrous metal work) [1], metal type, heating mode, heating temperature, heating duration, cooling mode, cooling humidity and placing duration were selected as the attributes to be recognized from a heated metal mark image. The explanation of each attribute and its corresponding value range configuration are given in Table 2. For simplicity, attribute i is abbreviated as $a_i$. Most of these are the same as in our previous work [25], except that the heating mode in this study was divided into four types: vacuum, muffle furnace, gasoline burner and carbon.
Given a heated metal mark image, its attributes are identified by a classifier model. The relationship can be formulated as Equation (1):
$$y = f(x) \tag{1}$$
where $x$ is an image of a heated metal mark, $y$ denotes its attributes and $f(\cdot)$ is the classifier model.
$x$ can be expressed in width × height × channel form. $y$ can be expressed as a 7-D vector, each element corresponding to an attribute.

Benchmark Dataset Expansion
In our previous work, 900 sample images were generated and labeled. We used the same method to create image samples in this work. Galvanized steel and cold rolled steel were selected as basic research objects. The metal plates were cut to equal size (1.0 cm × 1.0 cm × 1.0 mm). According to the requirements in Table 2, four devices (vacuum resistance furnace, muffle furnace, gasoline burner and carbon furnace) were used to simulate four heating scenes. After heating at a specific temperature ($a_3$) for a specific duration ($a_4$), the metals were placed in a test chamber with constant temperature and humidity. Thus, attributes $a_5$, $a_6$ and $a_7$ were applied. A special-purpose microscope was used to capture the heated metal mark image samples. All devices were the same as in our previous work, so the figure demonstrations are omitted here. Each image sample was captured at a resolution of 2152 × 1616 pixels and labeled with seven attribute values as described in Table 2. In this way, 900 new image samples were generated; Figure 1 gives some examples.

Methodology
In our study, deep learning models were deployed on mobile or embedded products. This is important for two reasons: (1) It is practical for experts to investigate with mobile intelligent equipment at the fire scene instead of using a bulky server in the lab. Investigating at the scene rather than in the lab is a reasonable choice. We want to recognize the attributes of heated metal marks while disturbing the fire scene as little as possible, and the mark of heated metal may change if it is taken back to the lab. (2) Many applications are very sensitive to the response time of the program, and even a small delay in service response has a significant impact on users. As more and more applications rely on deep learning models for their core functions, low-latency inference becomes increasingly important, whether models are deployed in the cloud or on the mobile side.
One solution is to perform model inference on high-performance cloud servers and transfer model inputs and outputs between clients and servers. However, this solution poses many problems, such as high computing costs, massive data migration over mobile networks, user privacy and increased latency. Model compression technology offers an alternative for these scenarios, since it requires fewer resources to perform inference. This was the focus of our research. Key technologies, top compressed CNNs models and the proposed multi-label classification method are described in this section.

Weight Pruning
Network weight pruning-based methods explore the redundancy in model parameters and try to remove noncritical ones. Weight pruning curtails redundant parameters completely from neural networks so that one can even skip computations for pruned weights.
Srinivas and Babu [26] explored a data-free pruning method. Han et al. [27] proposed a method to reduce the total parameters and operations. In Ref. [28], all convolutional filters are ranked by $\ell_1$-norm at each pruning iteration, and the m filters with minimum value are deleted. Anwar et al. [29] adopted N particle filters for N convolutional layers. Each convolutional unit is assigned a value according to its accuracy on a small validation dataset, and the lower one is removed. Pruning is considered as a combinatorial optimization problem in Ref. [30]. In Ref. [31], each sparse convolutional layer can be performed with a few convolution kernels followed by a sparse matrix multiplication. Lebedev and Lempitsky [32] imposed group sparsity constraints on convolutional filters to prune entries of the convolution kernels in a group-wise fashion. In Ref. [33], a group-sparse regularizer on neurons is introduced during the training stage to learn compact CNNs with reduced filters. The method in Ref. [34] adds a structured sparsity regularizer on each layer to reduce trivial filters, channels, or even layers. In filter-level pruning, all of the aforementioned works use $\ell_{2,1}$-norm regularizers.
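As a rough illustration of the filter-ranking idea attributed to Ref. [28], the following NumPy sketch removes the m filters with the smallest $\ell_1$-norm from one convolutional layer. The function name `prune_filters_l1`, the weight layout (filters, input channels, height, width) and the random test weights are assumptions made for this example; it is not the implementation used in Ref. [28] or in this study.

```python
import numpy as np

def prune_filters_l1(weights, m):
    """Drop the m filters with the smallest l1-norm (hypothetical helper).

    weights: array of shape (n_filters, in_channels, k, k)
    returns: (pruned_weights, kept_indices)
    """
    n_filters = weights.shape[0]
    if not 0 <= m < n_filters:
        raise ValueError("m must satisfy 0 <= m < number of filters")
    # l1-norm of each filter: sum of absolute values over all of its weights
    norms = np.abs(weights).reshape(n_filters, -1).sum(axis=1)
    # indices of the filters to keep, restored to their original order
    keep = np.sort(np.argsort(norms, kind="stable")[m:])
    return weights[keep], keep

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 3, 3, 3))
w[2] *= 0.01  # make two filters nearly zero so that they rank lowest
w[5] *= 0.01
pruned, kept = prune_filters_l1(w, m=2)
print(pruned.shape, kept)
```

In a full pipeline, the corresponding input channels of the next layer would also have to be removed and the network fine-tuned afterwards; both steps are omitted here.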

Quantization and Sharing
Network weight quantization compresses the model by reducing the number of bits required to represent each weight. It generally divides continuous variation data into discrete values and assigns each specific datum to a fixed value. For example, if a weight is represented with a 32-bit floating-point number and we want to indicate a weight with 100 quantified values, then 7-bit representation is sufficient.
Generally, K-means clustering is a simple and convenient solution to the problem of quantizing CNNs weights [35], as shown in Equation (2):
$$\arg\min_{C} \sum_{i=1}^{k} \sum_{w \in c_i} \lvert w - c_i \rvert^2 \tag{2}$$
where $C = \{c_1, c_2, \ldots, c_k\}$ denotes the cluster centers to be computed, and $w$ denotes an original weight. The objective is to minimize the squared error between each weight and the center it belongs to. As a result, each $w$ is quantized to one cluster center. If the number of cluster centers is set to k, then $\log_2(k)$ bits are used to represent the weight value. Vanhoucke et al. [36,37] proposed 8-bit quantization and 16-bit fixed-point representation. These brought significant speedup and reduced memory usage with little loss in accuracy. There are also many methods that directly train CNNs with binary weights, e.g., BinaryConnect [38], BinaryNet [39], and XNOR-Net [40]. The main idea is to learn binary weights or activations directly during model training. The method in Ref. [41] reduced the precision of weights to ternary values. A HashedNets model was proposed, in which a low-cost hash function is used to group weights into hash buckets for sharing [42]. In Ref. [43], a simple regularization method based on soft weight-sharing was proposed.
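The K-means weight quantization described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the procedure of Ref. [35]: the function name `kmeans_quantize`, the linear initialization of the centers, the fixed number of iterations and the random weight matrix are all choices made for this example.

```python
import numpy as np

def kmeans_quantize(weights, k, n_iter=20):
    """Quantize weights to k shared values with plain 1-D K-means (sketch)."""
    w = weights.ravel()
    # assumed initialization: centers spread evenly over the weight range
    centers = np.linspace(w.min(), w.max(), k)
    for _ in range(n_iter):
        # assign every weight to its nearest center
        idx = np.abs(w[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            members = w[idx == j]
            if members.size:  # empty clusters keep their previous center
                centers[j] = members.mean()
    # final assignment against the updated centers
    idx = np.abs(w[:, None] - centers[None, :]).argmin(axis=1)
    return centers, idx.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
centers, idx = kmeans_quantize(w, k=16)
w_q = centers[idx]                            # de-quantized weights
bits = int(np.ceil(np.log2(len(centers))))    # bits per stored index
mse = float(np.mean((w - w_q) ** 2))
print(bits, mse)
```

Only the small codebook `centers` and the per-weight indices `idx` need to be stored, which is where the $\log_2(k)$ bits per weight come from.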

Matrix Factorization
To reduce the time complexity, tensor factorization is a commonly used method. It is usually based on low rank approximation theory, and a high-dimension tensor can be approximated by multiple one-dimensional tensor products.
Lebedev et al. [44] proposed a canonical polyadic (CP) decomposition based method that decomposes one network layer into five layers with low complexity. The optimal solution was hard to compute with stochastic gradient descent (SGD) weight fine-tuning. Denton et al. [45] exploited the redundancy of convolutional layers and devised a tensor decomposition method. It treated two-dimensional tensor decomposition as singular value decomposition, and three-dimensional tensor decomposition as two-dimensional decomposition.
Zhang et al. [46] applied singular value decomposition (SVD) to the parameter matrix, and proposed a nonlinear optimization method that does not rely on SGD. The cumulative reconstruction error of the previous layer is considered in asymmetric reconstruction. Jaderberg et al. [47] used rank-1 convolutional filters to generate M independent basis feature maps, so that K × K convolutional filters can be decomposed into 1 × K and K × 1 filters. The output is linearly reconstructed with learned weights. Tai et al. [48] proposed a method for training low-rank constrained networks. A global optimizer is used for matrix factorization and the redundancy of convolutional filters can be reduced.
Kim et al. [49] proposed a model with one or more tensor trained layer. Tensor is trained for tensor compressing and its filters are generated based on SVD approximation. According to redundancy inside and among channels, sparse decomposition was conducted on channels [31]. Convolutional operation with high cost can be transformed into matrix multiplication. The matrix is then sparsified with regularization term.
Among these model compression methods, matrix factorization based methods are most widely used. Many top compressed CNNs models focus on this, which is illustrated in the next subsection.
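To make the low-rank idea concrete, the sketch below factorizes a single weight matrix with a truncated SVD and compares parameter counts. It is a generic illustration, not the method of any cited reference; the helper name `low_rank_factorize`, the matrix sizes and the synthetic near-rank-8 matrix are assumptions for this example.

```python
import numpy as np

def low_rank_factorize(W, r):
    """Approximate W (out x in) by A @ B with A: (out x r), B: (r x in)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]   # fold the top-r singular values into the left factor
    B = Vt[:r, :]
    return A, B

rng = np.random.default_rng(0)
# synthetic weight matrix: rank 8 plus a small amount of noise
W = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 512))
W = W + 0.01 * rng.normal(size=(256, 512))

A, B = low_rank_factorize(W, r=8)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
params_before = W.size
params_after = A.size + B.size
print(params_before, params_after, rel_err)
```

Replacing one dense layer by two thinner layers in this way trades a small reconstruction error for fewer parameters and multiplications; real networks are rarely this close to low rank, so fine-tuning is usually needed afterwards.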

Top Compressed CNNs Models
In this subsection, some state-of-the-art compressed CNNs models used in our study are introduced.

MobileNet
MobileNets was designed for mobile and embedded vision applications. It is built primarily from depthwise separable convolutional operations [50], which factorize a standard convolution into a depthwise convolution and a pointwise convolution. The depthwise convolution applies a single filter to each input channel, and then the pointwise convolution combines the outputs by linear combination.
A standard convolution operation has the following computational cost:
$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \tag{3}$$
where $M$ and $N$ are the numbers of input and output channels, $D_K \times D_K$ is the size of the filters, and $D_F \times D_F$ is the size of the feature map. MobileNets splits this into two separate operations, one for filtering and one for combining. Batch normalization and ReLU nonlinearity are used in each layer. The depthwise convolution operation has the following computational cost:
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F \tag{4}$$
A linear combination of the output of the depthwise convolution via 1 × 1 convolution is needed to generate new features. Thus, the total computational cost of depthwise separable convolution is as follows:
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F \tag{5}$$
The ratio of the reduced computational cost to that of the standard convolution is as follows:
$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2} \tag{6}$$
Moreover, to make the model smaller and faster, two hyper-parameters are proposed, the width multiplier $\alpha$ and the resolution multiplier $\rho$, which represent the ratio of reduced channels and the size of reduced feature maps, respectively. Finally, the computational cost of the depthwise separable convolution with parameters $\alpha$ and $\rho$ can be expressed as follows:
$$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F \tag{7}$$
MobileNet V2 was proposed as a further improvement. It is constructed with inverted residuals and linear bottlenecks, which reduce the number of parameters and the information loss in activation operations. Combined with single shot detector lite (SSDLite) for object detection, MobileNet V2 is reported to be 35% faster than MobileNet V1, and to have 20× less computation and 10× fewer parameters than YOLO V2.
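The cost comparison between standard and depthwise separable convolution can be checked numerically. The sketch below assumes the cost expressions given in the MobileNets paper [50], written in terms of the symbols defined above ($D_K$, $M$, $N$, $D_F$, $\alpha$, $\rho$); the function names and the example layer size are illustrative choices and do not come from this study.

```python
def standard_conv_cost(dk, m, n, df):
    """Multiply-adds of a standard convolution: D_K*D_K*M*N*D_F*D_F."""
    return dk * dk * m * n * df * df

def depthwise_separable_cost(dk, m, n, df, alpha=1.0, rho=1.0):
    """Depthwise plus pointwise cost with width (alpha) and resolution (rho) multipliers."""
    m_a, n_a, df_r = alpha * m, alpha * n, rho * df
    depthwise = dk * dk * m_a * df_r * df_r   # one filter per input channel
    pointwise = m_a * n_a * df_r * df_r       # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

# assumed example layer: 3 x 3 filters, 512 -> 512 channels, 14 x 14 feature map
dk, m, n, df = 3, 512, 512, 14
std = standard_conv_cost(dk, m, n, df)
sep = depthwise_separable_cost(dk, m, n, df)
ratio = sep / std
thin = depthwise_separable_cost(dk, m, n, df, alpha=0.5, rho=0.5)
print(std, sep, ratio, thin)
```

For 3 × 3 filters the ratio is dominated by the $1/D_K^2$ term, so the separable form needs roughly one ninth of the computation of the standard convolution when the number of output channels is large.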

SqueezeNet
SqueezeNet was proposed for preserving accuracy with few parameters [51]. A novel building block, the Fire module, is used as the core structure in SqueezeNet.
Three main strategies are adopted to construct the Fire module. First, 3 × 3 filters are replaced with 1 × 1 filters, which makes the number of parameters 9× smaller. Second, the number of input channels to the filters is decreased. The number of parameters in one standard layer can be represented as $N_{channel} \times N_{filter} \times S_{filter}$, where $N_{channel}$ is the number of input channels, $N_{filter}$ is the number of filters and $S_{filter}$ is the size of a filter. A squeeze layer is proposed to reduce $N_{channel}$, so that the total number of parameters can be further decreased. Third, downsampling is performed late in the network. Layers have small activation feature maps if their stride is larger than 1, and larger activation feature maps can lead to higher performance.
To accomplish the above strategies, the Fire module was designed, which consists of a squeeze layer and an expand layer. The squeeze layer has only 1 × 1 convolution filters (Strategy 1) and the expand layer has 1 × 1 and 3 × 3 convolution filters. Three hyperparameters are set: $s_{1\times1}$, $e_{1\times1}$ and $e_{3\times3}$, which represent the number of 1 × 1 convolution filters in the squeeze layer, and the numbers of 1 × 1 and 3 × 3 convolution filters in the expand layer, respectively. $s_{1\times1}$ is set to be less than ($e_{1\times1} + e_{3\times3}$), so the squeeze layer helps to limit the number of input channels to the expand layer (Strategy 2). The SqueezeNet model is constructed by stacking many Fire modules. The number of filters per Fire module is increased gradually, and max-pooling with stride 2 is performed at certain intervals (Strategy 3).
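The effect of the squeeze layer on the parameter count can be illustrated with a small calculation. The helper names and the example channel numbers below are assumptions made for this sketch, and bias terms are ignored; it only applies the channels × filters × filter-size counting rule stated above.

```python
def fire_module_params(c_in, s1x1, e1x1, e3x3):
    """Weights in one Fire module (biases ignored)."""
    squeeze = c_in * s1x1 * 1 * 1     # 1 x 1 squeeze filters on the module input
    expand_1 = s1x1 * e1x1 * 1 * 1    # 1 x 1 expand filters on the squeezed maps
    expand_3 = s1x1 * e3x3 * 3 * 3    # 3 x 3 expand filters on the squeezed maps
    return squeeze + expand_1 + expand_3

def plain_conv_params(c_in, n_filters, k=3):
    """Weights in a plain k x k convolution with the same number of output channels."""
    return c_in * n_filters * k * k

# assumed example: 96 input channels, s1x1 = 16, e1x1 = e3x3 = 64
c_in, s1x1, e1x1, e3x3 = 96, 16, 64, 64
fire = fire_module_params(c_in, s1x1, e1x1, e3x3)
plain = plain_conv_params(c_in, e1x1 + e3x3)
print(fire, plain, plain / fire)
```

Because the 3 × 3 filters only see the $s_{1\times1}$ squeezed channels instead of all input channels, the module needs far fewer weights than a plain 3 × 3 layer with the same output width.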
The evaluation demonstrates that the SqueezeNet architecture has 50× fewer parameters than original AlexNet and maintains AlexNet-level accuracy on ImageNet. Based on SqueezeNet, some works implement it on field programmable gate array (FPGA), and the model parameters can be stored entirely within FPGA and there is no need to access off-chip storage.

ShuffleNet
ShuffleNet was proposed by Zhang et al. [52]. In this method, pointwise group convolutions are first used to reduce the costly dense 1 × 1 convolution computation. Then, a novel channel shuffle operation is designed to overcome the side effects of group convolution, which helps information flow across different feature channels.
Group convolution is an effective way to significantly reduce computation cost. However, each output is derived only from certain input channels. This blocks the feature exchange among channel groups, so the optimal representation cannot be obtained. A channel shuffle operation is therefore proposed in Ref. [52] to build associations between input and output channels. For a convolutional layer with g groups whose output has g × n channels, the output channel dimension is reshaped into (g, n), transposed and flattened as the input of the next layer.
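The reshape, transpose and flatten steps of the channel shuffle can be written directly in NumPy. This is a minimal sketch on a single feature map in (channels, height, width) layout; the function name and the layout are assumptions for illustration, not code from Ref. [52].

```python
import numpy as np

def channel_shuffle(x, g):
    """Shuffle channels of x with shape (channels, height, width) across g groups."""
    c, h, w = x.shape
    if c % g != 0:
        raise ValueError("number of channels must be divisible by g")
    n = c // g
    # (g * n, h, w) -> (g, n, h, w) -> (n, g, h, w) -> (g * n, h, w)
    return x.reshape(g, n, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

# six channels labeled 0..5, split into g = 2 groups of n = 3 channels
x = np.arange(6).reshape(6, 1, 1)
order = channel_shuffle(x, g=2).ravel().tolist()
print(order)
```

After the shuffle, each group of consecutive channels contains channels from every original group, so the next group convolution can mix information across groups.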
A ShuffleNet unit is formed by a 1 × 1 pointwise group convolution layer followed by a channel shuffle operation layer. The ShuffleNet architecture is mainly built from a stack of ShuffleNet units. This structure has less computational cost under the same conditions. Let the input be c × h × w with bottleneck channels m. Then $hw(2cm + 9m^2)$ floating-point operations (FLOPs) are needed for a ResNet unit and $hw(2cm + 9m^2/g)$ FLOPs for a ResNeXt unit, while only $hw(2cm/g + 9m)$ FLOPs are needed for a ShuffleNet unit.
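The three FLOPs expressions can be compared for a concrete input size. The sketch below simply evaluates the formulas; following the ShuffleNet paper [52], the second expression is taken to describe a ResNeXt unit. The function names and the example values of c, h, w, m and g are assumptions chosen for illustration.

```python
def resnet_unit_flops(c, h, w, m):
    """hw(2cm + 9m^2): bottleneck unit with dense convolutions."""
    return h * w * (2 * c * m + 9 * m * m)

def resnext_unit_flops(c, h, w, m, g):
    """hw(2cm + 9m^2/g): 3 x 3 convolution split into g groups."""
    return h * w * (2 * c * m + 9 * m * m / g)

def shufflenet_unit_flops(c, h, w, m, g):
    """hw(2cm/g + 9m): grouped 1 x 1 convolutions and a depthwise 3 x 3 convolution."""
    return h * w * (2 * c * m / g + 9 * m)

# assumed example: 240 input channels, 28 x 28 feature map, 60 bottleneck channels, 3 groups
c, h, w, m, g = 240, 28, 28, 60, 3
res = resnet_unit_flops(c, h, w, m)
rex = resnext_unit_flops(c, h, w, m, g)
shf = shufflenet_unit_flops(c, h, w, m, g)
print(res, rex, shf)
```

For these values the ShuffleNet unit needs roughly one sixth of the FLOPs of the ResNet unit, which is why wider feature maps can be afforded within the same computation budget.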
It is reported that, compared with the MobileNet architecture, the ShuffleNet model obtains superior performance, with an absolute 7.8% lower ImageNet Top-1 error at a cost of about 40 million floating-point operations (MFLOPs). The speedup on hardware has also been tested. With comparable performance, ShuffleNet achieves a 13× speedup over AlexNet on an off-the-shelf ARM-based core device.
In the latest version, ShuffleNet V2, a channel split operation is proposed. The input feature channels are first split into two branches. One branch remains the same, and the other branch is computed with a 1 × 1 convolution, a 3 × 3 depthwise separable convolution and a 1 × 1 convolution. Then, the two branch features are concatenated and a channel shuffle operation is applied. After the channel shuffle, the process is repeated for the next unit.
The report demonstrates that ShuffleNet V2 is about 40% faster than ShuffleNet V1 and about 16% faster than MobileNet V2. With 500 MFLOPs, ShuffleNet V2 is 58% faster than MobileNet V2 and 63% faster than ShuffleNet V1.

Multi-Label Classification
For an input heated metal mark image, we aimed to recognize its attributes of metal type, heating mode, heating temperature, heating duration, cooling mode, placing duration and relative humidity. Each attribute could be trained with its own model, giving seven separate models in total. However, this is inefficient in computation time and storage space, even with compressed models. In this study, a multi-label classification method was adopted. Seven attributes were recognized in a single test with one unified CNNs model. Figure 2 gives the basic procedure. For each attribute $a_i$, its type was represented with one-hot encoding. Feature vectors of dimension 2, 4, 4, 4, 2, 3 and 2 were thus encoded for attributes $a_1$-$a_7$, respectively. All attributes shared the same backbone network model. All outputs were formed into a tiled vector, and the ground truth labels were concatenated in the same pattern. Finally, the objective function was formulated as follows.

[Figure 2. Basic procedure of multi-label classification: backbone CNNs model and output.]
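The tiled one-hot label layout described above can be sketched as follows. The per-attribute vector sizes (2, 4, 4, 4, 2, 3, 2) are taken from the text. The loss is only an assumption for illustration, namely a sum of per-attribute softmax cross-entropies over the segments of the tiled vector; the objective function actually used in this study is not reproduced here, and all function names are hypothetical.

```python
import numpy as np

# one-hot vector sizes for attributes a_1..a_7, as stated in the text
ATTR_SIZES = [2, 4, 4, 4, 2, 3, 2]

def encode_labels(class_ids):
    """Concatenate per-attribute one-hot vectors into one tiled label vector."""
    parts = []
    for cid, size in zip(class_ids, ATTR_SIZES):
        one_hot = np.zeros(size)
        one_hot[cid] = 1.0
        parts.append(one_hot)
    return np.concatenate(parts)

def multi_label_loss(logits, target):
    """Assumed objective: sum of softmax cross-entropies, one per attribute segment."""
    loss, start = 0.0, 0
    for size in ATTR_SIZES:
        z = logits[start:start + size]
        t = target[start:start + size]
        shifted = z - z.max()                        # numerically stable log-softmax
        log_p = shifted - np.log(np.exp(shifted).sum())
        loss += -(t * log_p).sum()
        start += size
    return loss

def decode_prediction(logits):
    """Pick the most likely class inside each attribute segment."""
    preds, start = [], 0
    for size in ATTR_SIZES:
        preds.append(int(np.argmax(logits[start:start + size])))
        start += size
    return preds

y = encode_labels([1, 3, 0, 2, 0, 1, 1])
confident = multi_label_loss(10.0 * y, y)      # logits that agree with the labels
uniform = multi_label_loss(np.zeros(21), y)    # uninformative logits
print(y.shape, confident, uniform, decode_prediction(10.0 * y))
```

With this layout a single backbone produces one 21-D output, and all seven attributes are read from it in one forward pass.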

Experiment Setup
The performance of heated metal mark attribute recognition with compressed CNNs models was evaluated on the generated benchmark dataset. In this experiment, Python was used as the programming language, TensorFlow was adopted as the deep learning framework and Keras was selected as the library. All experiments were run on a PC with a Pentium I5-8 series CPU, 32 GB RAM, an Nvidia GTX TitanXp GPU with 12 GB memory, and Ubuntu OS.

Recognition Accuracy Evaluation
Recognition accuracy was used to evaluate the recognition performance on the different attributes, as defined in Equation (9):
$$R_i = \frac{N_i^{correct}}{N_i^{all}} \tag{9}$$
where $R_i$ is the recognition accuracy for attribute $a_i$, $N_i^{all}$ is the number of all testing samples containing $a_i$, and $N_i^{correct}$ is the number of samples for which attribute $a_i$ was correctly recognized. We divided the dataset into six subgroups with attribute values equally distributed. Five randomly chosen subgroups (1500 image samples) were used for training and the remaining subgroup (300 image samples) was used for testing. The results were obtained by averaging the five independent tests.
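The per-attribute recognition accuracy, i.e., the number of correctly recognized samples divided by the number of test samples for each attribute, can be computed as in the sketch below. The function name and the tiny label arrays are made up for illustration and are not data from the experiments.

```python
import numpy as np

def attribute_accuracy(y_true, y_pred):
    """Accuracy per attribute for integer class labels of shape (n_samples, n_attributes)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n_all = y_true.shape[0]                       # every sample carries every attribute
    n_correct = (y_true == y_pred).sum(axis=0)    # correct predictions per attribute
    return n_correct / n_all

# toy example with four samples and three attributes
y_true = np.array([[0, 1, 2], [1, 1, 0], [0, 3, 2], [1, 0, 1]])
y_pred = np.array([[0, 1, 2], [1, 2, 0], [0, 3, 1], [1, 0, 1]])
acc = attribute_accuracy(y_true, y_pred)
overall = float(acc.mean())   # average over attributes
print(acc, overall)
```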
MobileNet, ShuffleNet and SqueezeNet were used as backbone compressed CNNs models for evaluation. For model input, the sample image size was set to 224 × 224 × 3 pixels. The number of epochs was set to 50 and the batch size to 32. Adam was used as the optimization method. The learning rate was set to an initial value of 0.001 and the momentum to 0.9. Dropout was set to 0.2.
The results of average recognition accuracy are shown in Table 3. The structure of the CNNs models and the data augmentation setting are listed in the first and second columns, respectively. For data augmentation, commonly used transformations were adopted, including random cropping, vertical and horizontal flipping, and perturbation of brightness, saturation, hue and contrast. When the model was trained with data augmentation, 40% of the training images in each batch were augmented; otherwise the probability was 10%. For $a_1$, ShuffleNet with data augmentation obtained the best performance, with a value of 0.803. For $a_2$, SqueezeNet with data augmentation obtained the best performance, with a value of 0.837. For $a_3$, SqueezeNet with data augmentation obtained the best performance, with a value of 0.825. For $a_4$, ShuffleNet with data augmentation obtained the best performance, with a value of 0.812. For $a_5$, MobileNet with data augmentation obtained the best performance, with a value of 0.883. For $a_6$, MobileNet and ShuffleNet with data augmentation obtained the best performance, with a value of 0.817. For $a_7$, ShuffleNet with data augmentation obtained the best performance, with a value of 0.894. In overall performance, the ShuffleNet model ranked first.
We found that models trained with data augmentation obtained better performance than those without, with about 2% accuracy improvement. It can be concluded that data augmentation is an effective way to train better CNN models, especially for large-scale CNNs with many parameters and insufficient training data. Figure 3 shows misclassified sample images. Each row corresponds to an attribute. The red text represents the ground truth label, while the yellow text represents the predicted result. Galvanized steel and cold rolled steel normally have different corrosion degrees under the experimental conditions. The misclassified sample images of $a_1$ showed similar corrosion degrees. For heating temperature, higher temperatures lead to more corrosion and rougher texture. The misclassified sample images of $a_3$ came from adjacent temperatures. Similar causes apply to $a_4$, $a_6$ and $a_7$. For $a_2$ and $a_5$, the reason for misclassification is hard to describe even for field professionals. Commonly used large-scale datasets mainly contain natural scenes, animals, etc. These are easy for humans to distinguish, and the differences are easy to explain visually. The heated metal mark image we studied is a special kind of object, whose mark is caused by complex physical and chemical reactions. Moreover, the benchmark dataset we generated inevitably contains noise, which may influence the model performance. Further research, with the help of other professionals, is needed to explore the underlying principles.

Batch Size Evaluation
Training with different batch sizes (8, 16, 32 and 48) was evaluated. Figure 4 shows the model accuracy versus training epoch. Here, the average accuracy over the seven attributes was used. Data augmentation was used and the other parameters were set the same as for the experiments presented in Section 5.2.
As shown in the figure, all models converged after about 40 training epochs. Models trained with a batch size of 32 obtained better performance, and outperformed the other models by about 2%. SqueezeNet was more stable and smooth during training, while MobileNet and ShuffleNet fluctuated more. As expected, with bigger batch sizes the gradient descent direction was computed more accurately and model training was smoother. Smaller batch sizes led to more randomness, and it was harder to achieve optimal performance.

Single Label Model vs. Multi-Label Model
The straightforward way to train models is to train an independent CNN model for each attribute. This is called the single label model. The single label model and the multi-label model were compared. A single label model was trained separately for each attribute a_i. The results are shown in Table 4.
The results show that models with single label training performed better than those with multi-label training, by about 1-2%. There were some differences across attributes, but the overall trends were consistent.
With multi-label training, model parameters can be shared. The model size was greatly reduced, by 7×, at the cost of some performance loss. Multi-label training is not a trivial task, as the parameter updates needed for recognizing different attributes can conflict. The loss scales of different attributes may differ greatly, so training cannot be equally coherent for all seven attributes. Therefore, the learning of the shared parameters was unavoidably affected.
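To make the loss-scale issue concrete, the following is a minimal sketch of a multi-label objective, assuming one softmax classification head per attribute on top of a shared backbone and a weighted sum of per-attribute cross-entropy losses. The exact loss formulation is not given in this section, so this structure, the function names, the number of classes per attribute, and the example logits are all illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy loss of one classification head for one sample."""
    return -math.log(softmax(logits)[target])

def multi_label_loss(head_logits, targets, weights=None):
    """Weighted sum of per-attribute losses; returns (total, per_attribute)."""
    if weights is None:
        weights = [1.0] * len(head_logits)
    per_attribute = [cross_entropy(z, t) for z, t in zip(head_logits, targets)]
    total = sum(w * l for w, l in zip(weights, per_attribute))
    return total, per_attribute

# Hypothetical head outputs for one image (class counts are made up).
head_logits = [
    [2.0, 0.1],              # metal type
    [0.3, 0.2, 0.1],         # heating mode
    [0.1, 0.2, 3.0, 0.4],    # heating temperature
    [1.5, 0.2, 0.1],         # heating duration
    [0.2, 0.1],              # cooling mode
    [0.1, 2.5, 0.3],         # placing duration
    [0.4, 0.3, 0.2],         # relative humidity
]
targets = [0, 2, 2, 0, 1, 1, 0]

total, per_attribute = multi_label_loss(head_logits, targets)
print("per-attribute losses:", [round(l, 3) for l in per_attribute])
print("total loss:", round(total, 3))
```

Heads whose loss is much larger than the others dominate the gradient that reaches the shared parameters. Passing non-uniform `weights` is one common way to rebalance them, though whether it would help on this dataset would need to be tested.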

Compressed Model vs. Heavy Model
Different CNN models vary in the depth and width of their layers and in the number, size and shape of their filters, which leads to different structures, parameter counts and complexity. Compressed models and heavy models were compared. VGG16, ResNet50 and Inception were selected as heavy models. The results are shown in Table 5. Heavy CNN models performed better than compressed models on all attributes. Inception obtained an average performance of 0.854, which was 2.4% better than ShuffleNet. The main reason is that heavy models contain more complex structures and more parameters, which benefits feature extraction and representation. However, the performance differences between compressed models and heavy models were not large, at only about 1-2%.

Running Time Evaluation
The running time of different CNN models was evaluated. Training and testing times of the MobileNet, SqueezeNet, ShuffleNet and ResNet50 models were measured for various batch sizes (8, 16, 32 and 48). Table 6 gives the experimental results.
MobileNet had the longest training time of the three compressed CNN models, at 0.192 s, 0.368 s, 0.736 s and 1.104 s per training iteration for batch sizes of 8, 16, 32 and 48, respectively. SqueezeNet had the shortest training time, about 80% of MobileNet's. For testing, SqueezeNet had the lowest cost, at 0.0026 s. Compared with the ResNet50 model, the compressed CNN models greatly improved running efficiency. All execution times were measured on a PC. In terms of storage, MobileNet, SqueezeNet and ShuffleNet required 9.6 M, 3.1 M and 5 M, respectively, while ResNet50 required 94.7 M. This also demonstrates the space efficiency of compressed CNN models. The SqueezeNet model ran 10× faster than the ResNet50 model and reduced the storage space by 30×.
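The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported in this section; the variable names are hypothetical, and the storage figures are assumed to be in megabytes.

```python
# MobileNet training time per iteration (seconds), keyed by batch size.
mobilenet_iter_time = {8: 0.192, 16: 0.368, 32: 0.736, 48: 1.104}

# Time spent per training image: iteration time divided by batch size.
per_image_time = {b: t / b for b, t in mobilenet_iter_time.items()}

# Storage required by each model (assumed to be in MB).
model_size_mb = {
    "MobileNet": 9.6,
    "SqueezeNet": 3.1,
    "ShuffleNet": 5.0,
    "ResNet50": 94.7,
}

# How many times smaller each compressed model is than ResNet50.
size_ratio = {
    name: model_size_mb["ResNet50"] / size
    for name, size in model_size_mb.items()
    if name != "ResNet50"
}

for b, t in per_image_time.items():
    print(f"batch size {b:2d}: {t * 1000:.1f} ms per image")
for name, r in size_ratio.items():
    print(f"{name}: {r:.1f}x smaller than ResNet50")
```

The per-image training cost of MobileNet stays at roughly 23-24 ms for every batch size, so the iteration time grows almost linearly with batch size. The storage ratio for SqueezeNet comes out at about 30.5, consistent with the 30× reduction stated above.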

Android Devices Deployment
The models were trained on a PC server running Linux with the TensorFlow framework. However, they could not be run directly on mobile devices; some essential conversion and deployment steps were needed. The compressed CNN models were deployed on Android platforms, and their performance was also tested.
On Linux, the CNN models were stored in the *.h5 file format. They were first converted into the *.pb format for deployment on Android devices. After format conversion, the file sizes of the MobileNet, SqueezeNet and ShuffleNet models were 2.82 MB, 4.02 MB and 9.06 MB, respectively, which were similar to their sizes in PC format. Table 7 gives the results of model testing on the selected Android platforms. Snapdragon 626, Snapdragon 845 and Kirin 970 were used for testing. As can be seen from the results, the mobile devices showed good efficiency and could execute the operation within tens of milliseconds, and thus could support real-time applications. The Kirin 970 achieved the best performance, taking 0.00076 s to execute the SqueezeNet model. This might be due to the Neural Network Processing Unit it contains.

Conclusions
Heated metal marks are important evidence for fire scene analysis. Automatic heated metal attribute recognition using deep learning methods has become popular. To further improve model efficiency, this study considered heated metal mark image attribute recognition based on compressed CNN models. We expanded the benchmark dataset. Three well-known compressed CNN models were used as backbone structures, and a multi-label training method was adopted. Comprehensive experiments were conducted and analyzed, covering recognition rate, the influence of batch size, compressed model vs. heavy model, single label model vs. multi-label model, etc. Moreover, the compressed CNN models were deployed and tested on Android devices.
From this study, it can be concluded that compressed CNN models greatly improve both time and space efficiency, while recognition accuracy remains within an acceptable range. According to the experimental evaluation, ShuffleNet has the best overall recognition accuracy, and SqueezeNet has the lowest running time. Therefore, users can choose a model based on their actual demands.