Impact of Image Compression on the Performance of Steel Surface Defect Classification with a CNN

Machine vision is increasingly replacing manual steel surface inspection. The automatic inspection of steel surface defects makes it possible to ensure the quality of products in the steel industry with high accuracy. However, the optimization of inspection time presents a great challenge for the integration of machine vision in high-speed production lines. In this context, compressing the collected images before transmission is essential to save bandwidth and energy, and improve the latency of vision applications. The aim of this paper was to study the impact of quality degradation resulting from image compression on the classification performance of steel surface defects with a CNN. Image compression was applied to the Northeastern University (NEU) surface-defect database with various compression ratios. Three different models were trained and tested with these images to classify surface defects using three different approaches. The obtained results showed that trained and tested models on the same compression qualities maintained approximately the same classification performance for all used compression grades. In addition, the findings clearly indicated that the classification efficiency was affected when the training and test datasets were compressed using different parameters. This impact was more obvious when there was a large difference between these compression parameters, and for models that achieved very high accuracy. Finally, it was found that compression-based data augmentation significantly increased the classification precision to perfect scores (98–100%), and thus improved the generalization of models when tested on different compression qualities. The importance of this work lies in exploiting the obtained results to successfully integrate image compression into machine vision systems, and as appropriately as possible.


Introduction
Automation is one of the major challenges of Industry 4.0. It consists of optimizing industrial processes with automated systems and integrating technologies into manufacturing processes to increase productivity and autonomy, improve labor conditions, and simplify certain operations [1]. However, in the context of Industry 4.0, equipment automation must be combined with efficient data exchange to build production systems that enable smart, decentralized, and data-informed decision making while minimizing human interaction with processes [2].
Nowadays, industrial automation has become a concept intrinsically linked to the Internet of Things (IoT). The exploitation of the data generated by IoT sensors makes it possible for machines to communicate with each other and determine actions in real time to adapt immediately to requirements, from manufacturing to maintenance and even to market demands. The data coming from the physical environment are then sent to a web platform and processed, thus facilitating decision making, especially in process changes [3]. Concretely, the IoT makes it possible to develop the interconnectivity of the different tools and systems of the manufacturing chain by exploiting intelligent sensor data using big data and artificial intelligence (AI), which provides numerous advantages and benefits for main industrial operations such as manufacturing process automation and monitoring [4], predictive maintenance [5], resource and inventory management [6], and quality control [7].
In addition, machine vision (MV) represents a key technology and a powerful support for IoT automation solutions [8]. MV is an AI technique that allows the analysis of images captured by cameras. It is capable of recognizing an image, understanding it, and processing the resulting information [9]. The technical advances in terms of cameras and lighting systems, as well as computer resources, have considerably expanded the field of application of MV and have opened up this technology to all industrial sectors [10]. MV is an extremely strong complement to IoT automation technologies. Automated inspection systems can work faster and more accurately than manual quality control, and they immediately surface relevant data for decision makers when defects and exceptions are detected. MV provides several advantages and performs various controls that would otherwise require different equipment [11].
The main components of an MV system are lighting, lens, image sensor, vision processing, and communications. Lighting illuminates the part to be inspected, highlighting its features so that they are clearly visible to the camera. The lens then captures the image and presents it to the sensor. Finally, the sensor converts this light into a digital image that is sent for analysis [12].
Vision processing refers to the mechanism of extracting information from the captured image; it can be performed internally in a standalone vision system or externally in a PC-based system. Advances in AI, specifically deep learning (DL) algorithms, have brought a revolution to MV by introducing nontraditional and efficient solutions that have made it possible to create high-performance vision applications. Indeed, the advent of the convolutional neural network (CNN) has opened up MV to the industrial sector, and has made this technology an attractive investment for companies seeking to automate tasks. CNNs can identify features that are not visible to humans, facilitate studies, and automate actions, thus saving a lot of time and energy [13].
MV is increasingly becoming a key component of the steel industry's production lines [14,15]. Due to the limitations of the production conditions, the surface of metals inevitably shows various types of defects; for example, scratches, surface cracks, and rolling scale. These defects not only affect the appearance of the product, but also reduce the properties of the product such as corrosion resistance and fatigue strength, which can result in huge economic waste. Automatic surface defect inspection has become a major necessity in the metal fabrication industry, as it detects manufacturing defects with high accuracy and speed. By using a new DL-based approach, it is possible to inspect all types of difficult metal surfaces with precision and repeatability [16,17].
The images captured by cameras can be stored and processed at the edge of the network itself or on a remote server. However, due to the storage and processing capabilities of IoT objects, which are often very limited due to their size, energy, power, and computational capacity, processed images are generally sent to a remote server for analysis and storage [18]. Furthermore, the communication between IoT devices is primarily wireless, as they are usually installed in geographically dispersed locations. Wireless channels are unreliable and often present high distortion rates. Therefore, the main challenge is to ensure that the appropriate type of image is obtained, with the most-optimized size and at the right quality level. In this context, compression of collected images before transmission and storage is essential to reduce costs in terms of bandwidth and storage capacity [19].
Image compression is the application of data compression to digital images by reducing the redundancy of data in images. The purpose of compression is to reduce the memory capacity required for image storage and to accelerate its transmission [20]. The objective of this research was to study the impact of the quality degradation resulting from image compression on the performance of steel surface defect classification with a CNN. Figure 1 presents an example of a compressed image with three different parameters, resulting in different image quality degradations. Image classification was successful for the first two qualities, but failed for the most degraded image. The image used in this illustration was taken from the Northeastern University (NEU) surface-defect database used in the performed experiments in this research [21].

Figure 1. A "Scratches" defect image from the NEU surface-defect database [21], compressed with three different parameters.
Initially, we created several degraded image datasets by applying standard image compression algorithms to an image database. Afterwards, we performed three different experiments using these datasets: in the first, we trained and tested the models with compressed image datasets using the same compression parameters. In the second experiment, we trained the models using a dataset compressed with a certain quality, but tested them using all the other datasets compressed with different qualities. In the third experiment, we evaluated the impact of training models with compression-based data augmentation on the classification performance of CNNs. Each model was trained once with a mixture of the different qualities and then evaluated on all compressed datasets. This paper's main contributions are as follows:
• Identifying the parameters that can be used to compress images as much as possible, without losing the accuracy of classification with a CNN;
• Evaluating the impact of image compression on the classification performance of a CNN that is trained and tested using compressed image datasets with the same parameters;
• Investigating the impact of image compression on the classification performance of a CNN trained and tested using compressed image datasets with different parameters;
• Studying the benefit of compression-based data augmentation on the classification performance of a CNN.
This paper is organized as follows. Related work is presented in Section 2. Section 3 outlines the theoretical background. Section 4 provides a detailed description of the methodology. Section 5 presents an overview of the results and discussion. Section 6 concludes the paper.

Related Work
The availability of massive amounts of image data has enabled the application of DL models, and in particular CNNs, which now surpass various machine learning approaches in performance and are widely used for a variety of different tasks. However, the storage and transmission of large amounts of images are challenging. Therefore, compressing the collected images before transmission is essential to save bandwidth and energy, and to improve the latency of vision applications. Since the image degradation induced by different lossy compression algorithms can affect the performance of CNN models, which are vulnerable to image manipulation, various research activities have been performed in the past few years to study the impact of image compression on the performance of CNNs.
Jo, Y.Y. et al. [22] analyzed the impact of image compression on the performance of DL-based models for classifying mammograms into "malignant" cases that lead to cancer diagnosis and treatment, or "normal" and "benign" nonmalignant cases that do not require immediate medical intervention. This paper showed that training on images using augmentation based on compression improved models when tested on compressed data, and that moderate image compression did not have a substantial impact on the classification performance of the DL-based models.
Bouderbal, I. et al. [23] analyzed some image preprocessing techniques for real-time object detection in the context of autonomous vehicles. They examined the impact of image resolution and compression on the accuracy of road object detection. To this end, several experiments were performed on the state-of-the-art YOLOv3 detector. The experimental results showed that the detector was resilient to compression, provided that the compression quality level remained sufficient.
Poyser, M. et al. [24] investigated the impact of common image and video compression techniques on the performance of DL architectures. They focused on JPEG and H.264 (MPEG-4 AVC), which are lossy image and video compression techniques commonly used in network-connected image and video devices and infrastructures. The impact on the performance of five distinct tasks was analyzed: human pose estimation, semantic segmentation, object detection, action recognition, and monocular depth estimation. The results of this study revealed a nonlinear and nonuniform relationship between network performance and the applied level of lossy compression.
Steffens, C.R. et al. [25] evaluated the robustness of several high-level image recognition models and examined their performance in the presence of different image distortions. They proposed a testing framework that emulated bad exposure conditions, low-range image sensors, lossy compression, and commonly observed noise types. The results of this work in terms of accuracy, precision, and F1 score indicated that most CNN models were marginally affected by mild miss-exposure, heavy compression, and Poisson noise. On the other hand, severe exposure defects, impulse noise, or signal-dependent noise resulted in a substantial decrease in accuracy and precision.
Ghazvinian, Z. et al. [26] investigated the impact of JPEG 2000 compression on deep convolutional neural networks for metastatic cancer detection in histopathological images. The authors found that their CNN model was robust against compression ratios up to 24:1 when it was trained on uncompressed high-quality images. They also demonstrated that a model trained on lower-quality (i.e., lossy compressed) images showed significantly improved classification performance for the corresponding compression ratio.
Manual surface inspection is a time-and effort-consuming process, which makes automation of surface defect classification very important for product quality control in the steel industry. However, the traditional methods cannot be properly applied on production lines due to their low accuracy and slow speed. Accordingly, several methods of automatic surface defect inspection have been proposed in previous research. Gradually, researchers have focused on developing new approaches based on deep neural networks for the analysis of steel surfaces in order to improve the classification accuracy and speed [27][28][29]. Other works have investigated the potential of transfer learning methods for the steel defect classification problem in order to overcome the DL training issues that require large processing capacity, especially when dealing with large amounts of data [30][31][32]. Several researchers have studied some specific types of steel surface defects, such as scratches, scrapes, abrasions, and cracks, in order to improve the detection and classification performance for these types of defects [33,34].
Our review of related works has shown that most research has focused on the study and development of DL models in order to reduce implementation costs and improve the efficiency of surface-defect inspection systems. However, little emphasis has been placed on images, which are a key component for training and building successful DL models. The objective of this paper was to evaluate the impact of quality degradation resulting from image compression on the performance of steel surface defect classification with a CNN. The results of this research can be exploited to integrate image compression into surface-defect inspection systems in order to reduce bandwidth and storage costs, and improve latency.

Theoretical Background
DL is an AI approach that is derived from machine learning, in which the machine is capable of learning on its own, as opposed to programming, where it simply executes predefined rules [35]. DL is based on an artificial neural network that imitates the human brain in processing data and creating models that are used in decision making. This network consists of tens or even hundreds of neural layers, each layer receiving and interpreting information from the previous layer [36]. Incorrect answers are eliminated at each step and sent back to the upstream levels to adjust the mathematical model. Progressively, the program reorganizes the information into more complex blocks. When this model is subsequently applied to other cases, it will normally be able to solve problems that it has never encountered before. Training data is crucial for building DL models. Indeed, the system performs better when it accumulates different experiences. DL is used in many fields: image recognition [37], automatic translation [38], autonomous driving [39], intelligent robots [40], etc.
DL has often been proposed in image recognition for MV applications and has shown promising performance; it uses CNNs to perform classification tasks by identifying features in learning images [41]. CNNs are a particular form of multilayer neural network whose connection architecture is inspired by the visual cortex of mammals. Their conception is based on the discovery of visual mechanisms in living organisms, which allows them to categorize information from the most simple to the most complex. A CNN architecture consists of a succession of processing blocks that extract the features that discriminate the class of an image from the others. A processing block is composed of one or several convolution layers that analyze the characteristics of the input image; correction layers, often called "ReLUs" in reference to the activation function (rectified linear units); and pooling layers that reduce the size of the intermediate image. The blocks follow each other until the final layers of the network, which classify the image and calculate the error between the prediction and the target value: the fully connected layer and the loss layer. It is the way in which the convolution, correction, and pooling layers are interconnected within the processing blocks, and the processing blocks with each other, that determines the particularity of the network architecture. This architecture is defined as a result of applied research work [42,43].
The convolution layer is a stack of convolutions. Indeed, several convolution kernels traverse the image and generate several output feature maps. The specific parameters of each convolution kernel are defined according to the information that is sought in the image [44].
The correction or activation layer applies a nonlinear function to the output feature maps of the convolution layer. Making the data nonlinear facilitates the extraction of complex features that cannot be modeled by a linear combination of a regression algorithm. The rectified linear unit (ReLU) is the most widely used activation function [45]. The formula of this function is given in Equation (1):

ReLU(x) = max(0, x). (1)

The pooling step is a subsampling process. Generally, a pooling layer is inserted regularly between the correction and convolution layers. By reducing the size of the feature maps, and thus the number of network parameters, the computation time is accelerated, and the risk of overfitting is reduced [46].
The fully connected layer classifies the image using the features extracted by the sequence of processing blocks. It is fully connected because all the inputs of the layer are connected to the output neurons of the layer. Each neuron attributes to the image a probability value of belonging to a given class among the possible classes [47].
The loss layer is the final layer of the network. It calculates the error between the network prediction and the actual value. In a classification task, the random variable is discrete, as it can take only the values 0 or 1, representing the belonging (1) or not (0) to a class. This is why the most common and suitable loss function is the cross-entropy function [48]. The formula of this function is given in Equation (2):

H(y, ŷ) = −Σᵢ yᵢ log(ŷᵢ), (2)

where yᵢ is the target value for class i and ŷᵢ is the predicted probability of class i.
A CNN is thus simply a stack of several layers: convolution, pooling, ReLU correction, and fully connected, as shown in Figure 2. Each image received as input is filtered, reduced, and corrected several times, to finally form a vector. In a classification problem, this vector contains the class affiliation probabilities.
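Equations (1) and (2) can be checked numerically; the sketch below implements both with NumPy (the function names `relu` and `cross_entropy` are illustrative, not taken from the paper):

```python
import numpy as np

def relu(x):
    # Equation (1): ReLU(x) = max(0, x), applied element-wise.
    return np.maximum(0.0, x)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Equation (2): H(y, y_hat) = -sum_i y_i * log(y_hat_i),
    # where y_true is a one-hot target and y_pred a probability vector.
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))  # negative inputs are zeroed, positive inputs pass through

y_true = np.array([0, 0, 1, 0, 0, 0])                   # one-hot: class 3 of 6
y_pred = np.array([0.05, 0.05, 0.7, 0.1, 0.05, 0.05])   # softmax-like output
print(cross_entropy(y_true, y_pred))  # loss shrinks as y_pred[2] approaches 1
```

Note that for a one-hot target, the sum in Equation (2) reduces to the negative log-probability of the true class, which is what the example computes.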
Compression is the process of reducing the number of bits that are needed to represent data. Compressing data optimizes storage capacity and file-transfer speed. It reduces costs in both areas. Compression algorithms are distinguished by three essential parameters: the compression ratio, the compression quality, and the speed of compression and decompression [49]. Compression can be done with or without loss.
Lossless compression keeps the original file intact and allows the restoration of its original state without losing any bits during decompression. This method is commonly used to compress executable, text, and worksheet files in which the loss of words or numbers can modify the information. Lossy compression definitively removes redundant or unimportant bits by degrading the quality of the original file to reduce further the storage size [50]. This approach is generally applied to audio or visual data, which may be significantly altered without being perceptible to humans.
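The lossless round trip described above can be demonstrated with Python's standard zlib module (a minimal sketch; the sample data is arbitrary):

```python
import zlib

# Highly redundant sample data, which compresses well.
original = b"steel surface defect " * 100

compressed = zlib.compress(original, level=9)
restored = zlib.decompress(compressed)

# Lossless: every bit of the original is recovered after decompression.
assert restored == original

# The compressed representation is much smaller than the original.
print(len(original), len(compressed))
```

By contrast, a lossy codec such as JPEG has no such round-trip guarantee: the discarded information cannot be recovered.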
The Joint Photographic Experts Group (JPEG) format is a lossy compression method that achieves a high compression ratio with acceptable quality. These two benefits make it one of the most popular image formats, particularly on the web, where storage and transfer constraints are important. The algorithm is especially effective on images with smooth color variations, such as photographs. The principle of the JPEG algorithm for image compression is as follows: an image is sequentially decomposed into blocks of 8 × 8 pixels, and the compression then works only on these pixel blocks [51]. Then, a discrete cosine transform (DCT) is applied to each block of pixels. The DCT operation evaluates the amplitude of changes from one pixel to another in order to identify high and low frequencies [52]. Afterwards, quantization attenuates the high frequencies of the image that were detected by the DCT [53]. Indeed, quantization reduces the importance of high-contrast areas (high frequencies) that are not easily perceived by the human eye [54]. The main limitations of the JPEG compression algorithm are the tiling effect that appears at high compression ratios and the destructive, irreversible nature of the compression.
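The trade-off controlled by the JPEG quality parameter can be exercised with the Pillow library (assumed available; the synthetic gradient image below stands in for a real photograph):

```python
import io
from PIL import Image

# Build a synthetic 200x200 grayscale image with smooth variations,
# the same resolution as the NEU surface-defect samples.
img = Image.new("L", (200, 200))
img.putdata([(x + y) % 256 for y in range(200) for x in range(200)])

# Encode at several JPEG quality settings and compare encoded sizes:
# lower quality means stronger quantization and a smaller file.
for quality in (90, 50, 10):
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    print(f"quality={quality}: {buf.tell()} bytes")
```

At very low quality settings, the 8 × 8 block structure becomes visible as the tiling effect mentioned above.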

Methodology
Compression reduces image size and optimizes costs in terms of bandwidth and storage memory. Therefore, image compression can improve latency in MV applications by reducing image transfer and processing times, which improves the performance of these applications and facilitates their integration into high-speed production lines. Considering that the image degradation induced by the different lossy compression algorithms can decrease the accuracy of image identification, a study of the impact of these techniques on image classification with CNNs is elaborated here in order to determine the best way to integrate image compression into MV systems without losing classification precision.
In this study, we performed three different experiments. In the first experiment, the datasets used to train and test the models were compressed with the same compression parameters. In the second experiment, we trained the models on a dataset compressed with a certain quality, but tested them on all the datasets compressed with the other qualities. In the third experiment, we studied the impact of training models with compression-based data augmentation: each model was trained once on a mixture of the different qualities and then evaluated on all compressed datasets. The compression parameters used in these experiments are presented in Table 1. In addition, we noted the compression ratio (CR), which evaluates the compression efficiency of an algorithm for an image [55]. We used the formula given in Equation (3) to calculate the CR:

CR = n1 / n2     (3)

where n1 is the original image size and n2 is the compressed image size.

Three different models were investigated in this study. The first was a simple CNN model with three convolutional layers followed by two dense layers and an output layer with six classes, as shown in Figure 3.

The second model was Vgg16, a very deep CNN with a very high number of parameters. Due to its depth and the number of fully connected nodes, it takes a long time to train [56]. Vgg16 has five blocks of convolutional layers, in which we used rectified linear units (ReLUs) as the activation function, with MaxPooling for downsampling between the convolutional blocks. After the last convolution and MaxPooling layer, we passed the data to the dense layers: we flattened the vector that came out of the convolutions and added two dense layers of 4096 units and a dense Softmax layer of 6 units. The Vgg16 architecture used in this experiment is presented in Figure 4.

The third model was MobileNet, a vision model for TensorFlow designed to efficiently maximize accuracy while taking into account the limited resources of an embedded or device-based application. MobileNet is a small, low-latency, low-power model that can be configured to meet the resource constraints of a variety of use cases. MobileNet's architecture is built on depthwise separable convolution layers, except for the first layer, which is a full convolutional layer. Each depthwise separable convolution layer consists of a depthwise convolution and a pointwise convolution. Counting depthwise and pointwise convolutions as separate layers, MobileNet has 28 layers [57]. The MobileNet architecture used in this experiment is displayed in Figure 5.

We used a dense layer of 6 units with a Softmax activation at the end of all models, as we had 6 classes to predict in the output: the six surface-defect types contained in the Northeastern University (NEU) surface-defect database.
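The compression ratio from Equation (3) reduces to a one-line helper; the sketch below assumes sizes are given in bytes, which is a convenience rather than something the paper specifies.

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Compression ratio CR = n1 / n2 (Equation (3)):
    original image size divided by compressed image size."""
    if compressed_size <= 0:
        raise ValueError("compressed size must be positive")
    return original_size / compressed_size

# A 120 kB image compressed down to 30 kB gives CR = 4.0
print(compression_ratio(120_000, 30_000))  # -> 4.0
```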
Finally, it is important to mention that these three models, with their distinct architectures, were chosen in order to study the impact of image compression on CNN models with different characteristics.
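As an illustration, the simple CNN described above might be sketched in Keras as follows. The paper specifies only the layer structure (three convolutional layers, two dense layers, a six-class output); the filter counts, kernel sizes, dense widths, and input resolution below are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_simple_cnn(input_shape=(200, 200, 1), num_classes=6):
    """Three conv/pool blocks, two dense layers, and a 6-way Softmax output.
    Filter counts (32/64/128) and dense widths (128/64) are assumptions."""
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_simple_cnn()
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```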
For data collection, the Northeastern University (NEU) surface-defect database was used to train and test our models [29]. The dataset, published by Northeastern University, contains 1800 (200 × 200) grayscale images, with 300 samples of each of six kinds of typical surface defects. Each image identifies a typical surface defect of the hot-rolled steel strip that occurs on steel surfaces during the fabrication process. Our aim was to make the models classify those defects into six categories: crazing (Cr), inclusion (In), patches (Pa), pitted surface (PS), rolled-in scale (RS), and scratches (Sc). The NEU surface-defect database presents two challenges: intraclass defects have large differences in appearance, while interclass defects have similar appearances; in addition, the defect images are influenced by lighting and material changes. Sample images of the six types of surface defects are shown in Figure 6.

Figure 6. Samples of six kinds of typical surface defects from the NEU surface-defect database. Each row shows one example image from each of the 300 samples of a class [21,29]. Figure 6 is adapted from [21], with permission from © 2013 Elsevier.
For validation, the dataset was divided at a ratio of 80:20: 80% of the dataset was used as training data, and the remaining 20% was used to test the models after training. The developed models were trained for 20 epochs with the Adam optimizer [58] using a batch size of 30. The original dataset may not have been sufficient to train a deep CNN; therefore, we used the ImageDataGenerator class of the TensorFlow API to generate an additional set for training. The surface-defect classification was written in Python. The models were developed using the Keras package and executed on a Google Colab high-performance GPU.
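The 80:20 split described above can be sketched as a generic shuffle-and-hold-out; the seed and the use of NumPy arrays are assumptions, but the proportions match the paper (1440 training and 360 test images out of 1800).

```python
import numpy as np

def train_test_split(images, labels, test_fraction=0.2, seed=42):
    """Shuffle the dataset and hold out the last fraction for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    n_test = int(len(images) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (images[train_idx], labels[train_idx],
            images[test_idx], labels[test_idx])

# 1800 NEU images at 80:20 -> 1440 for training, 360 for testing
X = np.zeros((1800, 200, 200), dtype=np.uint8)
y = np.repeat(np.arange(6), 300)  # 300 samples per defect class
X_tr, y_tr, X_te, y_te = train_test_split(X, y)
print(len(X_tr), len(X_te))  # -> 1440 360
```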
Evaluating a DL model is as important as creating it. We created models to run on new and unseen data; therefore, extensive evaluation was necessary to create a robust model. In this experiment, we dealt with a classification problem: we used labeled data to predict to which class an object belonged. Therefore, we used a confusion matrix to evaluate our models. The confusion matrix is a cross-table between the actual values and the predictions that goes beyond classification accuracy by showing the correct and incorrect predictions for each class, as shown in Figure 7. This matrix identifies four categories of outcomes: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

Several performance indicators could be derived from the confusion matrix. Recall measured the capacity of our model to correctly predict positive classes. The focus of recall was the TP classes: it indicated the number of positive classes that the model correctly predicted. We used the formula given in Equation (4) to calculate the recall:

Recall = TP / (TP + FN)     (4)

Precision indicated the quality of our model when the prediction was positive. Precision examined the positive predictions, and indicated how many positive predictions were true. We used the formula given in Equation (5) to calculate the precision:

Precision = TP / (TP + FP)     (5)

The F1 score was the weighted average of precision and recall, and included both FP and FN. The F1 score was defined by the formula given in Equation (6):

F1 = 2 × (Precision × Recall) / (Precision + Recall)     (6)

In addition, we were particularly interested in the behavior of two major metrics used to evaluate compression encoders with respect to the obtained performance of the models: the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM).
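Equations (4)–(6) translate directly from the confusion-matrix counts; a minimal sketch:

```python
def recall(tp: int, fn: int) -> float:
    """Equation (4): fraction of actual positives the model recovered."""
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    """Equation (5): fraction of positive predictions that were correct."""
    return tp / (tp + fp)

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Equation (6): harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Example: 90 true positives, 10 false positives, 30 false negatives
print(precision(90, 10))     # -> 0.9
print(recall(90, 30))        # -> 0.75
print(f1_score(90, 10, 30))
```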
PSNR is a measure of distortion used in digital images, particularly in image compression. It is the most commonly used measure to quantify the performance of encoders by evaluating the difference between the original and compressed representations at a pixel-by-pixel level [59]. PSNR is defined by the formula given in Equation (7):

PSNR = 10 × log10(MAX_I² / MSE)     (7)

where MAX_I represents the maximum pixel value in the original image, and MSE is the mean-squared error defined by the formula given in Equation (8):

MSE = (1/n) × Σ_{i=1}^{n} (Y_i − Ŷ_i)²     (8)

where Y_i represents the pixels of the original image, Ŷ_i represents the pixels of the compressed image, and n is the number of pixels of the image.
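Equations (7) and (8) can be sketched in NumPy as follows, assuming 8-bit grayscale images (so MAX_I = 255):

```python
import numpy as np

def psnr(original: np.ndarray, compressed: np.ndarray,
         max_i: float = 255.0) -> float:
    """Peak signal-to-noise ratio (Equation (7)) in dB, built on the
    mean-squared error of Equation (8)."""
    mse = np.mean((original.astype(np.float64)
                   - compressed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * np.log10(max_i ** 2 / mse)

# A uniform error of 16 grey levels gives MSE = 256 and PSNR ≈ 24.05 dB
a = np.full((200, 200), 100, dtype=np.uint8)
b = a + 16
print(round(psnr(a, b), 2))  # -> 24.05
```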
PSNR assesses how close the compressed image is to the original in terms of signal. However, it does not reflect the visual quality of the reconstruction, and cannot be considered an objective metric of the visual quality of an image. Therefore, SSIM was developed to evaluate the visual quality of a compressed image compared to the original image [60]. Unlike PSNR, SSIM is based on the visible structures in the image. Its purpose is to measure the similarity between two given images, instead of the pixel-to-pixel difference measured by PSNR. The SSIM between a compressed image x and an original image y is defined by the formula given in Equation (9):

SSIM(x, y) = ((2µ_x µ_y + c1)(2cov_xy + c2)) / ((µ_x² + µ_y² + c1)(σ_x² + σ_y² + c2))     (9)

where:
• µ_x is the average of x and µ_y is the average of y;
• σ_x² is the variance of x and σ_y² is the variance of y;
• cov_xy is the covariance of x and y;
• c1 = (k1·L)² and c2 = (k2·L)² are two variables to stabilize the division with a weak denominator;
• L is the dynamic range of the pixel values (typically 2^(bits per pixel) − 1);
• k1 = 0.01 and k2 = 0.03 by default.
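Equation (9) computed globally over two images keeps the structure of the formula visible; note that production implementations typically average SSIM over a sliding local window rather than using a single global window, so this is a simplified sketch.

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, L: float = 255.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    """Single-window SSIM (Equation (9)); real encoders usually
    average this quantity over many local windows."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# An image compared with itself has SSIM = 1
img = np.arange(100, dtype=np.float64).reshape(10, 10)
print(round(ssim_global(img, img), 6))  # -> 1.0
```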

Results and Discussion
Once training was complete in the first experiment, we calculated the performance of our trained models using test datasets compressed with the same compression parameters as the training images. The obtained results are reported in Table 2. All results in this section are presented in the format (precision, recall, F1 score). The results showed that all models maintained approximately the same precision and recall for all used compression qualities. In fact, a CNN compares images fragment by fragment in order to identify approximate features that are similar. By finding similar features that contain the most common aspects of the images, the CNN successfully identifies the appropriate class for each image. This explains why image compression did not impact the classification performance of the CNNs when the training and test images were compressed with the same parameters, since this compression resulted in images with a high degree of similarity.

The classification performance of models that were trained and tested on images with different compression qualities in the second experiment is displayed in Table 3. For each table, the "M-Q" rows show the classification performance of a model trained on images compressed with quality Q and tested on data compressed with the quality indicated in the column header. The obtained results clearly showed that the classification efficiency was affected when the training and test datasets were compressed using different parameters. This impact was more obvious when there was a large difference between these compression parameters. In addition, when we compared the performance of the three investigated models, we saw that the performance deterioration was much more significant for models that reached a very high accuracy during training. Indeed, very efficient models extracted similar features from images with high precision. This explains why MobileNet and Vgg16 were more affected than our simple three-layer CNN.
The precision graphs for the different compression qualities shown in Figure 8 highlight this difference in the degree of impact between the three trained models. The MobileNet graph shows a large difference in the classification precision between the different compression qualities, ranging from 61% to 100%, while the difference in precision is less significant in the two other graphs.

In order to analyze the behavior of the metrics used to evaluate the compression encoders in relation to the obtained performance of the models in this experiment, we calculated the PSNR and SSIM between the different compression qualities. The results reported in Table 4, in the format (PSNR/SSIM), show that the classification accuracy degraded as the SSIM values decreased, indicating an obvious difference in visual quality between the training and test images. These results demonstrated that the classification performed by the CNN was related to the structural evaluation of the images. However, the measurements did not present a clear correlation between PSNR values and classification performance. Therefore, SSIM was more appropriate for evaluating the classification performance of CNNs with respect to the image degradation induced by lossy compression. This makes a lot of sense, since SSIM was introduced to mimic the subjective assessment of image quality by human vision systems, and CNNs are based on a connection architecture inspired by the visual cortex of mammals. The comparison between the SSIM graph curves, as shown in Figure 9, and the precision graph curves of models trained and tested on different compression qualities, as shown in Figure 8, clearly shows this correlation: curves with the same color followed approximately the same behavior.
Given that models are trained and evaluated on a wider variety of image types, and data storage guidelines regarding image compression vary across the various use cases, we investigated the impact of training classifiers using data that was compressed with a mixture of compression qualities. Instead of using the ImageDataGenerator class of the TensorFlow API to generate an additional training set by applying many geometric transformations for data augmentation (rescale, rotation, flip, zoom, etc.), we increased the training dataset by adding all the compressed images with all the compression qualities used in this work. Consequently, we trained our models on a dataset containing 1440 × 7 = 10,080 images, with 1440 images for each compression quality. Table 5 summarizes the classification performance of the models trained with compression-based data augmentation. The obtained results proved that compression-based data augmentation dramatically increased the classification efficiency of our models, even when evaluated with the different compression qualities. This indicated that using compression for data augmentation improved the generalization of the models when tested on different compression qualities. This can be explained by the fact that CNNs successfully identify the appropriate class for each image by finding similar features that contain the most common aspects of images. When we increased the training dataset using compression, we actually multiplied the number of training images by 7, which improved the feature extraction by including features from images compressed with the different parameters used in the experiment. Afterwards, when a compressed image with a certain quality was introduced, the network identified similar features and successfully classified the image. Figure 10 shows the training accuracy, validation accuracy, precision, and recall curves per epoch for the models trained with compression-based data augmentation.
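The compression-based augmentation described above can be sketched with Pillow's JPEG encoder. The quality values below stand in for the compression parameters of Table 1, which are not reproduced here, so they are illustrative assumptions; the structure (originals plus one recompressed copy per quality, a 7× dataset) matches the experiment.

```python
import io
import numpy as np
from PIL import Image

def augment_with_jpeg_qualities(images, qualities=(90, 70, 50, 30, 10, 5)):
    """Return the original images plus one JPEG-recompressed copy per
    quality, multiplying the dataset size by len(qualities) + 1."""
    augmented = list(images)
    for img in images:
        for q in qualities:
            buf = io.BytesIO()
            Image.fromarray(img).save(buf, format="JPEG", quality=q)
            buf.seek(0)
            augmented.append(np.asarray(Image.open(buf)))
    return augmented

# 3 grayscale images and 6 extra qualities -> 3 * 7 = 21 images;
# with the full 1440-image training set this gives 1440 * 7 = 10,080.
batch = [np.random.randint(0, 256, (200, 200), dtype=np.uint8)
         for _ in range(3)]
aug = augment_with_jpeg_qualities(batch)
print(len(aug))  # -> 21
```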

Conclusions
This paper evaluated the impact of quality degradation resulting from image compression on the classification performance of steel surface defects with a CNN. The obtained results showed that models trained and tested on compressed images with the same parameters maintained approximately the same classification performance for all used compression grades. Furthermore, the outcomes indicated that the classification efficiency was affected when the training and test datasets were compressed using different parameters. This impact was more evident when there was a large difference between these compression parameters, and for models that achieved very high accuracy. In addition, the findings revealed that compression-based data augmentation dramatically increased the classification performance, and thus improved the generalization of models when tested on different compression qualities. The experiments also demonstrated that the classification performance of models was correlated with the image quality, as evaluated by the SSIM metric. The importance of this work lies in the exploitation of the obtained results to successfully integrate image compression into machine vision systems, and as appropriately as possible. Using only one compression method (JPEG) in the experiments was the major limitation of this work. In the future, we would like to extend our research by studying the impact of other compression encoders on the classification performance of CNNs.

Conflicts of Interest:
The authors declare no conflict of interest.