1. Introduction
Hot-rolled steel strips play an important role in the national economy and serve as an essential material in various industries, such as automotive, military, machinery manufacturing, and energy sectors. However, unintended surface defects may occur during the manufacturing process. These defects can alter material properties, including corrosion resistance and fatigue strength, ultimately leading to a significant reduction in the quality of the final product [
Steel surface inspection is performed to predict defect categories, identify causative factors, and provide important information for corrective actions. Traditionally, human experts manually assess the surface quality of steel strips; this method is highly subjective, labor-intensive, and inefficient for real-time inspection tasks [
2,
3]. Thus, the development of a new inspection method is essential to replace traditional approaches.
Recently, deep learning techniques, especially Convolutional Neural Networks (CNNs), have been used for the detection and classification of steel surface defects. Khanam et al. [
4] demonstrated the practical applications of CNN models across industries. Wen et al. [
5] presented a comprehensive study of recognition algorithms, including CNN-based approaches. Yi et al. [
6] proposed a system that includes a preprocessing step; the processed images were fed to a novel CNN. Zhu et al. [
7] designed a lightweight model using ShuffleNet V2, while Mao and Gong [
8] employed a MobileNet-V3 module. Liu et al. [
9] surveyed deep and lightweight CNN architectures for industrial scenarios. Bouguettaya and Zarzour [
10] conducted a comparative study on different pre-trained CNN models and proposed a custom classifier. Wi et al. [
11] developed a cost-efficient segmentation labeling framework that combines label enhancement and deep-learning-based anomaly detection in industrial applications. Ashrafi et al. [
12] proposed various deep learning-based vision approaches, including both semantic segmentation and object detection. Ibrahim and Tapamo [
13] integrated VGG16 components into a new CNN for industrial quality inspection. Jeong et al. [
14] presented Hybrid-DC, a hybrid CNN–Transformer architecture for real-time quality control in high-precision manufacturing.
In real production environments, images acquired by image acquisition equipment often suffer from various types of noise, motion blur, and non-uniform illumination, which significantly degrade the performance of surface defect inspection methods. Therefore, once acquired, an image typically undergoes preprocessing (e.g., enhancement) before being input to a CNN. This step improves perceptibility and enhances image characteristics to meet the requirements of subsequent analysis (i.e., defect detection and defect classification) [
9,
15,
16]. In the literature, various preprocessing techniques have been employed to enhance image quality, such as histogram equalization [
17], Gaussian filtering [
18], Sobel and Prewitt operators [
19], and CNN [
20], among others. However, a single filter may smooth the image while also obscuring important details. To address this issue, researchers have attempted to combine multiple filters to suppress noise [
16]. This approach, however, increases processing time, and image enhancement should be efficient in terms of time and resources, as it serves as a preprocessing step for the defect classification task [
16].
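For illustration, a minimal sketch of such a combined enhancement step is given below, using OpenCV with histogram equalization followed by Gaussian filtering; the kernel size and sigma are illustrative assumptions, not the settings of the cited works.

```python
import cv2
import numpy as np

def enhance(gray: np.ndarray) -> np.ndarray:
    """Illustrative two-filter enhancement: histogram equalization to
    correct contrast, then Gaussian filtering to suppress noise."""
    equalized = cv2.equalizeHist(gray)                       # contrast correction
    return cv2.GaussianBlur(equalized, (5, 5), sigmaX=1.0)   # noise suppression

# Example usage on a 200x200 grayscale image (the NEU-CLS image size).
image = np.random.randint(0, 256, (200, 200), dtype=np.uint8)  # stand-in input
enhanced = enhance(image)
```

Even this two-filter pipeline adds per-image cost before classification, which is exactly the overhead this study seeks to avoid by testing on degraded images without any preprocessing.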
Studies in the literature on steel surface defect inspection have been conducted with both deep and lightweight CNN models. In this work, lightweight CNN models are analyzed. Because lightweight CNN models have fewer parameters and lower computational cost, they run faster and minimize latency, making them particularly suitable for time-sensitive real-time applications. As noted earlier, images acquired in real-world production settings are often degraded by noise, motion blur, and non-uniform illumination. In the literature, model performance has typically been evaluated either on original images alone or on a mixture of original and degraded images. In this study, only degraded images are used for testing, since real-world inspection must tolerate degraded inputs with low latency; accordingly, robustness to degradation is the primary focus. This matters most for lightweight CNN models in real-time systems, because the enhancement preprocessing that degraded images would otherwise require adds latency that time-sensitive applications cannot afford. The better lightweight CNN models perform on degraded images, the less need there is for image-enhancement preprocessing. Thus, this study aims to assist industries and researchers by analyzing lightweight CNN models for surface defect recognition in terms of time and performance on degraded images that have not undergone any preprocessing.
The publicly available Northeastern University (NEU-CLS) dataset [
1] is used for steel surface defect images in this study. Original images are used for training and validation, whereas degraded images are used for testing without any preprocessing. Six pre-trained lightweight CNN models are analyzed. The base of each CNN is kept frozen and only the classifier is trained. Both accuracy-related metrics and normalized inference time are reported to assess model performance.
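As a minimal sketch of this frozen-base setup, assuming a Keras-style workflow (the input size, head architecture, and optimizer below are illustrative assumptions rather than the exact experimental configuration):

```python
import tensorflow as tf

# NEU-CLS defect classes: Crazing, Inclusion, Patch, Pitted Surface,
# Rolled-in Scale, and Scratch.
NUM_CLASSES = 6

# Pre-trained base with frozen weights; only the classifier head learns.
base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # keep the base frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # trainable classifier
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds)  # original images only;
# degraded images are reserved for testing.
```

ShuffleNet V2, which is not bundled with Keras, would follow the same pattern with an externally loaded base.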
In the literature, models tested on original images generally exhibit a low misclassification rate, and these misclassifications tend to be concentrated in a specific class. This allows a clearer analysis of the relationship between misclassified classes. Models tested on degraded images, however, are expected to have a higher error rate, which increases the likelihood that misclassified instances are distributed across multiple classes rather than concentrated in a single class, making the analysis of the relationship between misclassified classes more challenging. In this study, a new performance metric, the Adjusted False Classification Index (AFCI), is proposed to analyze the distribution of misclassifications. AFCI is designed to assess how homogeneously misclassifications are distributed. Unlike other misclassification-analysis techniques, AFCI quantifies class-wise error dispersion via the Gini coefficient (GC) and conditions the analysis on the False Negative Rate (FNR). It therefore takes into account both the proportion of errors and the homogeneity of their distribution across the other predicted classes (i.e., clustered vs. homogeneous). By analyzing the distribution of misclassifications through this index, the improvement needs of the model can be identified.
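Equation (9) (not reproduced here) gives the formal definition; the sketch below is one minimal reading of it that is consistent with the worked scenarios reported later in Table 8, measuring dispersion with a normalized Gini-style impurity over the off-diagonal counts and reporting a value only when the FNR reaches a user-chosen threshold. The function name and signature are illustrative.

```python
import numpy as np

def afci(confusion_row: np.ndarray, true_class: int, threshold: float = 0.5):
    """Sketch of the Adjusted False Classification Index for one class.

    Returns None (the paper's ∅) when the FNR is below `threshold`;
    otherwise a value in [0, 1], where 1 means misclassifications are
    spread homogeneously over the other classes and 0 means they are
    concentrated entirely in a single class.
    """
    total = confusion_row.sum()
    errors = np.delete(confusion_row, true_class)  # off-diagonal counts
    fnr = errors.sum() / total                     # False Negative Rate
    if fnr < threshold:
        return None                                # dispersion not evaluated
    p = errors / errors.sum()                      # error distribution
    n = len(errors)                                # number of other classes
    gini = 1.0 - np.sum(p ** 2)                    # Gini-style impurity
    return gini / (1.0 - 1.0 / n)                  # normalize to [0, 1]

# 30 instances of a class, all misclassified, split over two classes:
print(afci(np.array([0, 15, 15, 0, 0, 0]), true_class=0))  # 0.625 (Table 8, 1B)
```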
3. Results and Discussion
To evaluate the performance of the aforementioned pre-trained lightweight CNN models, I first report accuracy and normalized inference time for all degradation conditions, as shown in
Table 1 and
Table 2, respectively.
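The inference-time normalization scheme is defined in Section 2; as a hedged sketch, the snippet below measures mean per-image latency and normalizes by the fastest model, which is one common convention and an assumption here, not necessarily the protocol used.

```python
import time

def mean_latency(model, images, warmup=5, repeats=50):
    """Mean per-image inference time in seconds, after warm-up runs."""
    for _ in range(warmup):
        model.predict(images, verbose=0)   # discard warm-up timings
    start = time.perf_counter()
    for _ in range(repeats):
        model.predict(images, verbose=0)
    return (time.perf_counter() - start) / (repeats * len(images))

# Hypothetical normalization: divide each model's latency by the fastest.
latencies = {"MobileNet-V1": 0.0040, "ShuffleNet V2": 0.0020}  # placeholder values
fastest = min(latencies.values())
normalized = {name: t / fastest for name, t in latencies.items()}
```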
Among all models, based on
Table 1 and
Table 2, EfficientNet-B0 exhibits the lowest accuracy and the highest normalized inference time; therefore, it is excluded from further analysis. In general, MobileNet-V1 achieves the best results; only under high-noise conditions does NasNetMobile outperform MobileNet-V1. When comparing the normalized inference times of these two models, MobileNet-V1 completes inference in a significantly shorter time. Although MobileNet-V1 and MobileNet-V2 have approximately the same normalized inference times, MobileNet-V1 achieves significantly better results, particularly under high-intensity non-uniform illumination and high motion blur conditions. Among the models, MobileNet-V3 and ShuffleNet V2 have significantly shorter normalized inference times. As shown in
Table 1, ShuffleNet V2 achieves better results than MobileNet-V3 for all degradations except high noise. Although ShuffleNet V2 has a better normalized inference time than MobileNet-V2, MobileNet-V1, and NasNetMobile, it performs worse than these models across all types of degradations. In this study, the ASD is computed for each model across the six degradation conditions. The resulting ASDs for MobileNet-V3, MobileNet-V2, MobileNet-V1, NasNetMobile, and ShuffleNet V2 are 0.0551, 0.0346, 0.0178, 0.0197, and 0.0801, respectively. Among these models, MobileNet-V1 has the lowest ASD, indicating that it exhibits the most consistent performance across different types of degradation. As a result, if inference time is the only priority, ShuffleNet V2 can be preferred. However, when accuracy, ASD, and normalized inference time are evaluated together, MobileNet-V1 emerges as the best option among the models used in this study.
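These ASD figures are straightforward to check: assuming ASD denotes the sample standard deviation of a model's accuracy over the six degradation conditions, the per-condition MobileNet-V1 accuracies quoted later in this section reproduce the reported 0.0178.

```python
import numpy as np

# MobileNet-V1 accuracies for low/high noise, low/high non-uniform
# illumination, and low/high motion blur (quoted later in this section).
accuracies = np.array([0.9993, 0.9646, 0.9987, 0.9789, 0.9972, 0.9602])

# Sample standard deviation across the six conditions; a smaller ASD
# indicates more consistent behavior under different degradations.
asd = np.std(accuracies, ddof=1)
print(round(float(asd), 4))  # 0.0178
```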
The classification performances of MobileNet-V1 and ShuffleNet V2 are presented in
Table 3 and
Table 4, respectively. The confusion matrices of MobileNet-V1 and ShuffleNet V2 are shown in
Figure 6 and
Figure 7, respectively.
Consistent with
Figure 4, each confusion matrix in
Figure 6 and
Figure 7 follows the same axis representation, with the horizontal axis for predicted labels and the vertical axis for actual labels. To reduce visual clutter, zero-valued off-diagonal cells are left blank in the confusion matrices in
Figure 6 and
Figure 7; diagonal values are always displayed, so empty cells indicate zero values. As observed in
Table 1, MobileNet-V1 demonstrates a high accuracy rate under low-level degradations. According to
Table 3, when examining the F1-score for MobileNet-V1 under all three types of low-level degradation, only the Inclusion and Scratch classes remain below 100% across all low-level degradation levels. For all three types of low-level degradation, the recall value for the Inclusion class remains low, indicating that some Inclusion instances are misclassified in each degradation type. In contrast, the precision value for the Scratch class is low, indicating that some instances classified as Scratch do not actually belong to this class. Upon examining
Figure 6, it is observed that some instances identified as Scratch under low-level degradation actually belong to the Inclusion class. The reason for this is that, as stated in [
10], there is an inter-class similarity between the Inclusion and Scratch classes. For the original images, MobileNet-V1 exhibits misclassification between these two classes [
10], and this misclassification continues under low-level degradations. When
Table 4 and
Figure 7 are examined for low-level degradations, ShuffleNet V2 exhibits low performance across all classes. In conclusion, although degradations alter the grayscale values of the defect image as stated in [
1], MobileNet-V1 exhibits good performance under low-level degradations. When
Table 1 is examined for high-level degradations, the performance of both MobileNet-V1 and ShuffleNet V2 declines as expected. ShuffleNet V2 exhibits the largest degradation: accuracy drops by 19.1 percentage points (pp) from low to high noise (0.9561→0.7652; −20.0% relative) and by 10.6 pp from low to high motion blur (0.9144→0.8085; −11.6% relative). When
Table 3 and
Table 4, as well as
Figure 6 and
Figure 7, are examined for high-level degradations, no class achieves 100% performance across all degradations. For all high-level degradations, only MobileNet-V1 achieves 100% performance in terms of recall for the Scratch class. This implies that even under high-level degradations, MobileNet-V1 correctly classifies all instances of the Scratch class. Moreover,
Table 1 shows that high noise and high motion blur reduce MobileNet-V1’s accuracy by only 3.5–3.7 pp (low→high noise: 0.9993→0.9646; low→high blur: 0.9972→0.9602), and high-intensity non-uniform illumination by ~2.0 pp (0.9987→0.9789). These drops are far smaller than those for ShuffleNet V2 (noise: 19.1 pp; blur: 10.6 pp). In conclusion, MobileNet-V1 still demonstrates strong performance even under high-level degradations.
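For readers comparing the two drop measures, the short snippet below computes both the percentage-point and relative drops from the accuracies just quoted.

```python
def drops(low_acc: float, high_acc: float):
    """Accuracy drop in percentage points (pp) and relative percent."""
    pp = (low_acc - high_acc) * 100
    rel = (low_acc - high_acc) / low_acc * 100
    return round(pp, 1), round(rel, 1)

print(drops(0.9561, 0.7652))  # ShuffleNet V2, noise: (19.1, 20.0)
print(drops(0.9993, 0.9646))  # MobileNet-V1, noise: (3.5, 3.5)
```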
Finally, the MobileNet-V1 and ShuffleNet V2 models are analyzed using the proposed AFCI. The AFCI values are computed with the FNR threshold defined in Equation (9) set to 0.5 and are presented in
Table 5 (for MobileNet-V1) and
Table 6 (for ShuffleNet V2).
As shown in
Table 5, AFCI is reported only in two cases (Cr–High Motion Blur and In–High Noise). In all other cases, FNR is below the threshold; therefore, misclassification dispersion is not evaluated for MobileNet-V1. Under high noise degradation, misclassifications for the Inclusion class are partially concentrated in a specific class. An examination of
Figure 6 reveals that, under high noise conditions, misclassified instances for the Inclusion class are distributed among three classes; however, they are primarily concentrated in the Pitted Surface class, with approximately 12 misclassified instances. Additionally, as shown in
Table 5, the Crazing class exhibits another imbalanced distribution under the high motion blur degradation. Its AFCI value of 0.226 is higher than the previously mentioned value of 0.147 for the Inclusion class, indicating a more homogeneous distribution. Similarly, an examination of
Figure 6 reveals that under high motion blur conditions, misclassified instances for the Crazing class are distributed among three classes; however, they are primarily concentrated in the Pitted Surface class, with approximately 15 misclassified instances. While an imbalance is observed in the Crazing class under high motion blur degradation (AFCI = 0.226), a more pronounced concentration tendency is present in the Inclusion class under high noise degradation (AFCI = 0.147). An examination of
Table 6 reveals that, except for low noise, all other degradations exhibit a high FNR for at least one class. Similarly to MobileNet-V1, ShuffleNet V2 also exhibits the most significant misclassification imbalance under high noise degradation. As shown in
Table 6, under high noise conditions, misclassified instances for the Inclusion, Pitted Surface, and Rolled-in Scale classes are entirely concentrated in a single class.
Figure 7 illustrates that, under high noise conditions, all three of these classes are predicted as the Crazing class. To enable the model to learn these classes more effectively and to improve its performance under high noise conditions, analyzing the relationships between these classes and the Crazing class can provide insights into the confusion patterns. Similarly, for other classes with low AFCI values (such as the Patch class under high motion blur conditions), the same analysis can be applied to enhance the model's performance. Conversely, under low motion blur conditions,
Table 6 shows that the Crazing class has a high AFCI value, indicating that misclassified instances are homogeneously distributed across multiple classes. In such cases, class-specific methods, such as data augmentation and feature enhancement techniques tailored to that particular class, can be applied to improve its classification performance. In conclusion, the ShuffleNet V2 model has defined AFCI values for many classes. In contrast, the MobileNet-V1 model generally exhibits low FNR values under degradations, so only a few classes have a defined AFCI value.
In this study, the FNR threshold is set to 0.5; adjusting it allows coarser or finer granularity. The choice of threshold is entirely at the discretion of the user. In particular, selecting a smaller threshold yields AFCI values more frequently (i.e., higher sensitivity). By contrast, a larger threshold yields fewer defined AFCI values. To analyze the effect of the threshold, I compute AFCI values for the Crazing class using ShuffleNet V2 across different threshold values, as shown in
Table 7.
As
Figure 7 shows, the Crazing class exhibits zero misclassifications at low and high noise levels; therefore, the AFCI is ∅ for any threshold value in
Table 7 for these degradations.
Table 7 shows that with a very small threshold (e.g., 0.001), multiple defined AFCI values are observed across degradation types (0.556, 0.593, 0.826, and 0.388), whereas with a large threshold (e.g., 0.9) only a single defined AFCI value remains (0.388). As previously noted, this choice is user-dependent: if the analysis is intended to focus only on very high FNR regimes, a larger threshold should be selected; for a more fine-grained analysis, the threshold should be decreased. If the threshold is set to 0, the AFCI depends entirely on the GC_N parameter, meaning that the FNR value has no significance.
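This gating behavior can be illustrated with the afci() sketch from the Introduction (the confusion-matrix row below is hypothetical, chosen so that the FNR lies between the thresholds being compared):

```python
import numpy as np

# Hypothetical row for a class at index 0: 29 of 50 instances are
# misclassified, concentrated mostly in one other class (FNR = 0.58).
row = np.array([21, 24, 3, 2, 0, 0])

for threshold in (0.001, 0.5, 0.9):
    print(threshold, afci(row, true_class=0, threshold=threshold))
# The same AFCI value is printed at thresholds 0.001 and 0.5, while 0.9
# yields None (the paper's ∅): the threshold only gates whether the
# dispersion is reported; it never changes a defined AFCI value.
```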
Finally, the AFCI is compared with the standard deviation, which is commonly used for homogeneity analysis. Consider two scenarios: Scenario 1 contains 30 instances of the Crazing class, and Scenario 2 contains 24. For each scenario, three alternative classification outcomes are listed (six rows in total), with the corresponding AFCI and standard deviation reported in the last two columns (see
Table 8).
Comparing Scenarios 1A and 2A in
Table 8, the misclassified instances are homogeneously distributed across the classes in both cases; consequently, AFCI = 1 and the standard deviation = 0. Analyzing Scenarios 1B and 2B, the misclassified instances are concentrated in two classes in both cases. As expected, AFCI decreases from 1 toward 0 as the standard deviation increases (a lower standard deviation indicates a more homogeneous spread across the misclassified classes). However, although AFCI is the same for both cases (0.625), the standard deviation differs (7.348 for 1B vs. 5.878 for 2B). This occurs because the standard deviation is sensitive to the total count of misclassified instances. These patterns are even more pronounced in Scenarios 1C and 2C. AFCI is 0 in both cases because the misclassified instances are concentrated entirely in a single class; however, the standard deviation differs between the two cases (12 for 1C vs. 9.6 for 2C). In conclusion, while standard deviation has a minimum of 0 and a maximum that scales with the number of misclassified instances, AFCI is normalized to [0, 1], better capturing the continuum from homogeneity to concentration in the error distribution.
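Both columns of Table 8 can be reproduced directly under the assumptions stated earlier, namely that AFCI (when defined) is the normalized Gini-style dispersion of the off-diagonal counts and that the standard deviation is taken in population form over the five candidate error classes; the exactly even split in Scenario 2A implies fractional counts of 4.8.

```python
import numpy as np

# Off-diagonal misclassification counts for the six rows of Table 8
# (30 misclassified instances in Scenario 1, 24 in Scenario 2).
scenarios = {
    "1A": [6, 6, 6, 6, 6],   "2A": [4.8, 4.8, 4.8, 4.8, 4.8],
    "1B": [15, 15, 0, 0, 0], "2B": [12, 12, 0, 0, 0],
    "1C": [30, 0, 0, 0, 0],  "2C": [24, 0, 0, 0, 0],
}

for name, counts in scenarios.items():
    x = np.array(counts, dtype=float)
    p = x / x.sum()                                      # error distribution
    afci_val = (1 - np.sum(p ** 2)) / (1 - 1 / len(x))   # assumed AFCI form
    std = np.std(x)                  # population std over the five classes
    print(name, round(float(afci_val), 3), round(float(std), 3))
# Output matches the AFCI and standard deviation columns of Table 8
# (e.g., 1B -> 0.625 and 7.348; 2C -> 0.0 and 9.6), up to rounding.
```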
For comparison with AFCI,
Table 9 lists the standard deviation values for ShuffleNet V2.
Both metrics evaluate the distribution of misclassifications. Examining
Table 6 and
Table 9 for the Crazing and Inclusion classes: first, for Crazing, the least homogeneous (most concentrated) distribution occurs under the high motion blur condition (
Table 6: AFCI = 0.388;
Table 9: standard deviation = 9.046). Second, for Inclusion, the least homogeneous distribution occurs under the high noise condition (
Table 6: AFCI = 0;
Table 9: standard deviation = 12). However, comparing
Table 6 and
Table 9 shows that the standard deviation is harder to interpret than AFCI, because AFCI takes into account the FNR and the tunable threshold parameter, which yields clearer, more concise interpretations, especially in complex scenarios.
4. Conclusions
In this study, I analyze pre-trained lightweight CNN models for classifying surface defects in hot-rolled steel strips under three degradation types (noise, non-uniform illumination, and motion blur), using the NEU-CLS dataset and the degradation protocols defined in this work. In all experiments, the base of each CNN model is kept frozen and only the classifier is trained. Sources of randomness are fixed throughout the training process. Additionally, each architecture undergoes ten training runs (frozen base; trainable classifier), and the average results are reported. Within this experimental setting, the findings indicate that if inference time is the sole priority, ShuffleNet V2 is the preferred choice. However, when both inference time and performance are considered, MobileNet-V1 emerges as the most suitable option on NEU-CLS across both low- and high-degradation levels among the models evaluated. Additionally, this study proposes AFCI, a new index that evaluates whether misclassifications are concentrated in a few classes or uniformly distributed across alternatives. To improve the classification performance of a model, when the AFCI value is low, the relationships between the relevant classes can be analyzed, whereas when the AFCI value is high, class-specific methods can be considered. On the NEU-CLS dataset and under my degradation protocols, ShuffleNet V2 exhibits defined AFCI values for numerous classes, whereas MobileNet-V1 has defined AFCI values for only a few classes. Specifically, for MobileNet-V1, under high noise degradation, misclassifications in the Inclusion class are partially concentrated in a specific class, whereas the Crazing class demonstrates another imbalanced distribution under high motion blur degradation. Comparing these two misclassification patterns, the Crazing class exhibits a more homogeneous distribution of misclassified instances. Considering the AFCI values obtained across all other degraded conditions, the results indicate that MobileNet-V1 demonstrates strong robustness to misclassification. Beyond reporting AFCI, this study provides a quantitative comparison with the standard deviation. Across the NEU-CLS dataset and the applied degradation protocols, both metrics show broadly consistent trends; however, AFCI offers tunable sensitivity via its threshold parameter, which aids interpretability, and it better captures the continuum between homogeneity and concentration in the error distribution.