4. Results
In the first part of the thesis, the various methodological approaches were evaluated based on their confusion matrix and accuracy. The confusion matrix shows how the predicted and true values differ. In the simplest case of two classes, a distinction is made between True Positive (TP) and True Negative (TN) for correct predictions and False Positive (FP) and False Negative (FN) for incorrect predictions. Large deviations in the confusion matrix are particularly critical, for example, if a standstill mark rated as class 2 is rated as class 5 by the AI. Deviations of one class are still relatively uncritical for the present question and can also stem from the subjective assessment by the expert.
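For illustration, a minimal sketch of how such an evaluation could be computed with scikit-learn is shown below; the label vectors are placeholders and not the actual thesis data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Placeholder labels for illustration only (not the actual SSM data).
y_true = np.array([1, 2, 2, 3, 4, 5, 5, 6])
y_pred = np.array([1, 2, 3, 3, 4, 5, 2, 6])

cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5, 6])
acc = accuracy_score(y_true, y_pred)

# Critical cases: predictions that deviate by more than one class,
# e.g., a standstill mark rated as class 2 but predicted as class 5.
critical = np.abs(y_true - y_pred) > 1

print(cm)
print(f"accuracy = {acc:.2f}, critical misclassifications = {critical.sum()}")
```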
All images with deviations greater than one class were output after the first training and checked again by the human experts. In the process, images were found that had obviously been inadvertently misclassified; these comprised approximately 3% of all images (30 of 1032). Four of these 30 images had clearly been misclassified by the expert. For 22 images, the classification was not clear; these images were shifted by one class. For the remaining four images, the expert classification was confirmed during the follow-up inspection and no changes were made. This new dataset was labelled as ‘re-sorted’ and used for all further investigations. In order to provide better insight into the CNN’s decision criteria, an investigation is currently being carried out using LIME (Local Interpretable Model-Agnostic Explanations) [19,20].
BasicNet was the first CNN to be analyzed. Figure 5 shows the architecture of this simple CNN as a flowchart. In a first approach, the number of classes was varied: in the first step, only clearly different classes were analyzed; in the second step, classes were analyzed in pairs; and in the last step, all six classes were analyzed. Table 3 shows the results.
Figure 5.
BasicNet with its convolutional part on the left side and its fully connected part on the right side. The depth value in the convolutional and max pooling layers denotes the number of channels. In each layer, the input and output sizes are listed. The input to the overall CNN is a single image with three channels, a height of 45 pixels and a width of 61 pixels. The output is a tensor with three values as this is a BasicNet example for the classification of three classes.
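Since the exact layer configuration of BasicNet is only given in Figure 5 and not reproduced in the text, the following PyTorch sketch is purely illustrative: a small CNN with a convolutional part followed by a fully connected part, matching the 3 × 45 × 61 input and three-class output from the caption. The number of layers and the channel widths are assumptions.

```python
import torch
from torch import nn

# Illustrative stand-in for BasicNet (layer counts and channel widths assumed,
# not taken from Figure 5): a convolutional part followed by a fully
# connected part, for a 3 x 45 x 61 input and three output classes.
basic_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3x45x61  -> 16x45x61
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x45x61 -> 16x22x30
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 16x22x30 -> 32x22x30
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x22x30 -> 32x11x15
    nn.Flatten(),
    nn.Linear(32 * 11 * 15, 64),
    nn.ReLU(),
    nn.Linear(64, 3),                             # three-class output
)

print(basic_net(torch.randn(1, 3, 45, 61)).shape)  # torch.Size([1, 3])
```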
Table 3.
Validation and training accuracies for the classification of different class configurations with BasicNet. The hyperparameters used are LR = 1 × 10⁻⁴, BS = 8, H = 90 and W = 122.
| Dataset | Number of Classes | Classes | ACCtrain [%] | ACCval [%] |
|---|---|---|---|---|
| Original | 2 | 1, 5 | 99.52 | 96.67 |
| Re-Sorted | 2 | 1, 5 | 99.52 | 98.33 |
| Original | 3 | 1, 3, 5 | 92.31 | 88.43 |
| Re-Sorted | 3 | 1, 3, 5 | 91.04 | 91.80 |
| Original | 3 | 2, 4, 6 | 91.55 | 86.05 |
| Re-Sorted | 3 | 2, 4, 6 | 95.27 | 87.21 |
| Original | 3 | (1, 2), (3, 4), (5, 6) | 83.19 | 84.06 |
| Re-Sorted | 3 | (1, 2), (3, 4), (5, 6) | 90.83 | 81.64 |
| Original | 6 | 1, 2, 3, 4, 5, 6 | 85.56 | 63.94 |
| Re-Sorted | 6 | 1, 2, 3, 4, 5, 6 | 86.67 | 69.08 |
To find suitable hyperparameters, a grid search was conducted with BasicNet in which all six classes were classified. The grid spans the dimensions learning rate, which is the most important hyperparameter to tune [16], and batch size. For the batch size, the values 8 and 16 were chosen, as smaller batch sizes can improve the classification performance [21]. For the learning rate, which is usually small and lies between 0 and 1, the values 0.01, 0.005, 0.001, 0.0005 and 0.0001 were chosen. Generally, it is recommended to set the learning rate values on a logarithmic scale when a grid search is performed [16]. The learning rate values in the middle of the respective intervals were selected to refine the grid. For every grid combination, BasicNet was trained for 50 epochs. The best validation accuracy, 69.57%, was achieved with a learning rate of 0.0005 and a batch size of 16.
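A minimal sketch of such a grid search is given below; train_and_validate is a hypothetical placeholder for the actual training routine, which is not reproduced here.

```python
import itertools
import random

def train_and_validate(lr: float, batch_size: int, epochs: int = 50) -> float:
    """Hypothetical stand-in: train BasicNet on all six classes and return
    the best validation accuracy. A random value is used so the sketch runs."""
    return random.random()

learning_rates = [0.01, 0.005, 0.001, 0.0005, 0.0001]
batch_sizes = [8, 16]

results = {
    (lr, bs): train_and_validate(lr, bs)
    for lr, bs in itertools.product(learning_rates, batch_sizes)
}
best_lr, best_bs = max(results, key=results.get)
print(f"best combination: lr={best_lr}, batch size={best_bs}")
```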
Because training with images at their original size led to the GPU memory overflowing and took quite a long time, various smaller image sizes were examined to determine how they affect the classification accuracy [25]. The largest number of pixels evaluated was 43,920 px and the smallest was 2745 px. The best validation accuracy, 91.80%, was achieved with a width of 122 pixels and a height of 90 pixels (Table 4).
Table 4.
Validation and training accuracies for the classification of classes 1, 3 and 5 with BasicNet for different image sizes. The hyperparameters used are LR = 1 × 10⁻⁴ and BS = 8. The best accuracy is highlighted in bold.
| H [px] | W [px] | Number of Pixels | ACCtrain [%] | ACCval [%] |
|---|---|---|---|---|
| 180 | 244 | 43,920 | 98.80 | 91.74 |
| 90 | 122 | 10,980 | 91.04 | **91.80** |
| 60 | 81 | 4860 | 91.59 | 89.26 |
| 60 | 60 | 3600 | 91.35 | 90.91 |
| 45 | 61 | 2745 | 85.10 | 85.95 |
In a next step, different types of augmentation were tested, first separately and later in combination (Table 5).
Table 5.
Validation and training accuracies for the classification of classes 1, 3 and 5 with BasicNet for different photometric augmentation types. The hyperparameters used are LR = 1 × 10⁻⁴, BS = 8, H = 45 and W = 61. The best accuracy is highlighted in bold.
| Augmentation Type | ACCtrain [%] | ACCval [%] |
|---|---|---|
| None | 85.10 | 85.95 |
| Color | 84.86 | 89.26 |
| Brightness | 79.81 | 84.30 |
| Contrast | 85.10 | 89.26 |
| Overexposure | 84.13 | 81.82 |
| Spot Reflections | 88.46 | 86.78 |
| Contrast and Color | 87.02 | 87.60 |
| Color and Spot Reflection | 86.78 | 87.60 |
| Contrast and Spot Reflection | 86.54 | **90.08** |
| Contrast, Spot Reflection and Color | 86.06 | 88.43 |
Compared to the basic run, three photometric augmentation types improved the validation accuracy, namely Color, Contrast and Spot Reflections. Two augmentation types, Overexposure and Brightness, worsened the validation accuracy. In the second run of photometric augmentation testing, the augmentation types that improved the validation accuracy in the first run were combined. All of these augmentation combinations increased the validation and training accuracy compared to the basic run.
The second CNN analyzed in detail was ZhouNet by Zhou et al. [26]. ZhouNet consists of three two-dimensional convolutional layers and two fully connected layers. The first convolutional layer has a kernel size of 5 × 5, a stride of 1 and 20 channels. The subsequent two-dimensional max pooling layer has a kernel size of 2 × 2, a stride of 2 and 20 channels. The second convolutional layer also has a kernel size of 5 × 5 and a stride of 1; it has 50 channels. Likewise, its following max pooling layer, with a kernel size of 2 × 2 and a stride of 2, has 50 channels. The third convolutional layer has a kernel size of 4 × 4, a stride of 1 and 400 channels. The third and last convolutional layer is connected to a fully connected layer with 6400 neurons. ZhouNet’s output layer is adapted from its original eight neurons to six neurons. A depiction of the original ZhouNet can be found in [26].
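Based on the layer description above, a hedged PyTorch reconstruction of ZhouNet could look as follows. The activation functions are not specified in the text, and the interpretation of the 6400-neuron fully connected layer (here: mapping the 6400 flattened convolutional features to 6400 neurons) is an assumption.

```python
import torch
from torch import nn

# Hedged reconstruction of ZhouNet for monochrome 40 x 40 px input; the ReLU
# activations and the exact wiring of the 6400-neuron layer are assumptions.
zhou_net = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5, stride=1),    # 1x40x40  -> 20x36x36
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),        # 20x36x36 -> 20x18x18
    nn.Conv2d(20, 50, kernel_size=5, stride=1),   # 20x18x18 -> 50x14x14
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),        # 50x14x14 -> 50x7x7
    nn.Conv2d(50, 400, kernel_size=4, stride=1),  # 50x7x7   -> 400x4x4
    nn.ReLU(),
    nn.Flatten(),                                 # 400 * 4 * 4 = 6400 features
    nn.Linear(6400, 6400),                        # fully connected layer with 6400 neurons
    nn.ReLU(),
    nn.Linear(6400, 6),                           # output adapted to six SSM classes
)

print(zhou_net(torch.randn(8, 1, 40, 40)).shape)  # torch.Size([8, 6])
```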
Three experiment runs with all six SSM classes were conducted for different combinations of image width, height and number of color channels. The achieved training and validation accuracies are shown in Table 6. The learning rate of 1 × 10⁻³ and the momentum of 0.9 were taken from the publication. Only the batch size was lowered to eight, since Zhou et al. state that a smaller batch size may increase accuracy [26]. ZhouNet was tested for the classification of SSMs because it achieved an accuracy of 99.4% in the classification of eight different surface defect classes that could be similar to SSMs.
Table 6.
Validation and training accuracies for the classification of all six SSM classes with ZhouNet on the re-sorted dataset.
| Color Channels | H [px] | W [px] | ACCtrain [%] | ACCval [%] |
|---|---|---|---|---|
| 3 | 40 | 40 | 99.72 | 64.25 |
| 1 | 40 | 40 | 98.89 | 62.32 |
| 1 | 80 | 80 | 99.58 | 65.70 |
In the experiments with ZhouNet, validation accuracies between 62.32% and 65.70% were achieved. The training accuracies are much higher, all lying at or above 98.89%.
Finally, transfer learning (TL) was used to test further CNNs. Four different pre-trained CNNs were chosen for TL based on their testing accuracy on the ImageNet-1K dataset. The CNNs are ConvNeXt_Base, ResNeXt101_64X4D, RegNet_Y_16GF and EfficientNet_V2_S. All four are used in training as fixed feature extractors (FFEs) and for fine-tuning (FT). In both training modes, a grid search is conducted with all six SSM classes for each of these CNNs. The fine-tuned CNNs achieve higher classification accuracies. When fine-tuned, all four CNNs achieve validation accuracies of at least 78.26% and thus better results than BasicNet. EfficientNet_V2_S achieves the best validation accuracy, which is 82.13%. For RegNet_Y_16GF and EfficientNet_V2_S, there seems to be no correlation between learning rate and validation accuracy; both achieve higher validation accuracies for small batch sizes at large learning rates and for large batch sizes at small learning rates. For ResNeXt101_64X4D, the validation accuracies are higher at larger learning rates. For ConvNeXt_Base, the validation accuracies are higher at smaller learning rates, and the validation accuracy does not seem to depend strongly on the batch size.
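A minimal sketch of the two training modes is given below, using EfficientNet_V2_S from torchvision as an example; the head replacement differs slightly for the other three CNNs, and whether these exact torchvision weights were used in the thesis is an assumption.

```python
from torch import nn
from torchvision import models

def build_efficientnet(mode: str, num_classes: int = 6) -> nn.Module:
    """Sketch of transfer learning with EfficientNet_V2_S: 'ffe' freezes the
    pre-trained backbone, 'ft' fine-tunes all weights."""
    model = models.efficientnet_v2_s(
        weights=models.EfficientNet_V2_S_Weights.IMAGENET1K_V1
    )
    if mode == "ffe":
        # Fixed feature extractor: freeze every pre-trained parameter.
        for param in model.parameters():
            param.requires_grad = False
    # In both modes, replace the ImageNet head with a trainable layer
    # for the six SSM classes.
    model.classifier[1] = nn.Linear(model.classifier[1].in_features, num_classes)
    return model

ffe_model = build_efficientnet("ffe")  # only the new head is trained
ft_model = build_efficientnet("ft")    # all weights are updated during training
```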
Because the validation accuracies for a fine-tuned ConvNeXt_Base and a ConvNeXt_Base used as an FFE are similar, it was investigated whether a partially frozen ConvNeXt_Base achieves better classification results. The idea behind partial freezing is that layers close to the input layer are more likely to extract unspecific features, whereas layers close to the output layer are more likely to extract dataset-specific features. Therefore, if the front layers are frozen, the back layers can be fine-tuned to the new dataset.
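A sketch of partial freezing for ConvNeXt_Base is shown below, assuming that "layers" refers to the top-level blocks of the torchvision feature extractor; the thesis' exact definition of a frozen layer may differ.

```python
from torch import nn
from torchvision import models

def partially_frozen_convnext(num_frozen: int, num_classes: int = 6) -> nn.Module:
    """Freeze the first num_frozen blocks (close to the input) and leave the
    remaining blocks and the new classification head trainable."""
    model = models.convnext_base(weights=models.ConvNeXt_Base_Weights.IMAGENET1K_V1)
    for idx, block in enumerate(model.features):
        if idx < num_frozen:
            for param in block.parameters():
                param.requires_grad = False
    # Replace the ImageNet head with a trainable layer for the six SSM classes.
    model.classifier[2] = nn.Linear(model.classifier[2].in_features, num_classes)
    return model

model = partially_frozen_convnext(num_frozen=2)
```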
The classification results can be seen in Figure 6. It is striking that validation accuracies of less than 50% were achieved in five attempts: four times with a learning rate of 0.01 and once with a learning rate of 0.005, i.e., overall at high learning rates. In general, a trend can be observed that with many frozen layers, a high learning rate leads to a better validation accuracy, and with few frozen layers, a low learning rate leads to a better validation accuracy.
The final classification within this work was performed on the re-sorted dataset for pairwise classes and for all six classes. Augmentations that had proven to increase the classification accuracy were applied in groups: no augmentations; geometric augmentations, which include Transposing, Striding Crops, Rotation by 180° and their combinations (Striding Crops and Transposing; Striding Crops and Rotation by 180°; Transposing and Rotation by 180°; Striding Crops, Transposing and Rotation by 180°); and photometric augmentations, which include Color, Contrast, Spot Reflections and their combinations (Color and Spot Reflection; Contrast and Spot Reflection; Contrast, Spot Reflection and Color). Both augmentation groups were also applied at the same time.
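The following torchvision-based sketch illustrates how such augmentation groups could be composed for tensor images; the application probabilities and jitter strengths are assumptions, and the custom Striding Crops and Spot Reflections transforms are not reproduced here.

```python
import torch
from torchvision import transforms

# Illustrative composition of the augmentation groups for CHW tensor images;
# probabilities and jitter strengths are assumptions, and the custom
# "Striding Crops" and "Spot Reflections" transforms are omitted.
def transpose_image(img: torch.Tensor) -> torch.Tensor:
    # Transposing: swap the height and width axes.
    return img.transpose(-2, -1)

geometric = transforms.Compose([
    transforms.RandomApply([transforms.Lambda(transpose_image)], p=0.5),
    transforms.RandomApply([transforms.RandomRotation((180, 180))], p=0.5),  # Rotation by 180°
])

photometric = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(hue=0.1)], p=0.5),        # Color
    transforms.RandomApply([transforms.ColorJitter(contrast=0.3)], p=0.5),   # Contrast
])

both = transforms.Compose([geometric, photometric])
augmented = both(torch.rand(3, 90, 122))  # example application to a random image
```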
The final classification was performed with EfficientNet_V2_S with a learning rate of 0.005 and a batch size of 16, ConvNeXt_Base with a learning rate of 0.0005, a batch size of 16 and two frozen layers, and BasicNet with a learning rate of 0.0005. These hyperparameters were selected because they produced the best results in the previous grid searches. The test accuracy of each run was calculated using the CNN weights of the epoch with the best validation accuracy. The achieved final accuracies for the classification of six classes can be seen in Table 7 (subdivided according to accuracy on the validation dataset and on the test dataset).
Table 7.
Validation and test accuracies for the final classification of all six SSM classes depending on CNN and augmentation type. The highest achieved test accuracy is highlighted in bold.
| Model | ACC | None [%] | Geom. [%] | Photom. [%] | Both [%] |
|---|---|---|---|---|---|
| EfficientNet_V2_S | ACCval | 82.13 | 80.68 | 75.36 | 82.61 |
| | ACCtest | 73.08 | 82.69 | 76.92 | 75.96 |
| ConvNeXt_Base | ACCval | 81.16 | 79.71 | 78.26 | 79.23 |
| | ACCtest | 79.81 | 80.77 | 81.73 | **83.65** |
| BasicNet | ACCval | 69.57 | 70.53 | 67.15 | 73.43 |
| | ACCtest | 67.31 | 70.19 | 68.27 | 70.19 |
The best test accuracy for the classification of six classes, 83.65%, was achieved with ConvNeXt_Base and both augmentation types. A mean AUC of 0.927 shows that all classes are well distinguishable; class 4 is the least and class 1 the most distinguishable. In the confusion matrix in Figure 7, it can be seen that, of the misclassified images, all except one were placed only one class away.
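A sketch of how a macro-averaged one-vs-rest AUC and the confusion matrix could be computed with scikit-learn is shown below; the predictions are random placeholders, and whether the thesis' mean AUC was computed exactly this way is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

# Placeholder predictions for illustration only (not the actual test set).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 6, size=104)              # six SSM classes, labelled 0-5
y_score = rng.random((104, 6))
y_score /= y_score.sum(axis=1, keepdims=True)      # pseudo-softmax probabilities

# Macro-averaged one-vs-rest AUC as one way to obtain a "mean AUC".
mean_auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")
cm = confusion_matrix(y_true, y_score.argmax(axis=1))

print(f"mean AUC = {mean_auc:.3f}")
print(cm)
```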
The best final test accuracy for pairwise classification, 91.35%, was achieved with ConvNeXt_Base and photometric augmentations. The mean AUC is 0.919, which indicates good differentiability between the classes. Class (5, 6) is the least and class (1, 2) the most distinguishable. All misclassified images were placed only one class away.
The test accuracies depending on the model and augmentation group are shown in Figure 8. It appears that the six-class classification benefits from augmentations, as the score without augmentations is the lowest for each model. In pairwise classification, no trend regarding the augmentation group is recognizable.
5. Discussion
The successful classification of two SSM classes using BasicNet, where a validation accuracy of 98.33% was achieved, has shown that the classification of SSMs with CNNs is possible in principle. However, the drop in accuracy to 69.08% when classifying all six classes showed that BasicNet is not suitable for distinguishing between all classes. Since the training accuracy also dropped, it was investigated whether better results can be achieved with other hyperparameters, whether data augmentation can improve the classification results, and whether more complex CNNs and TL can improve the classification performance.
In order to find out whether better classification results can be achieved with different hyperparameters, a grid search was carried out with two different batch sizes and five different learning rates. Indeed, a hyperparameter combination was found with which the validation accuracy could be improved to 69.57%. As expected, the accuracies for larger learning rates are worse than those for lower learning rates. With a learning rate that is too high, the SGD optimizer may update the weights and biases too much, which can lead to the optimal values being skipped over [27]. It was also found that the optimal learning rate must lie between 0.001 and 0.0001, as the achieved accuracy values are similar in this hyperparameter range. A further grid search with more grid points in this range might have been able to find the optimal hyperparameter combination. However, the grid search method is very computationally expensive. For further work, less computationally expensive methods should be used, such as random search.
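A minimal sketch of the suggested random search is given below; the log-uniform sampling range is based on the promising interval identified above, and the number of trials is an arbitrary assumption.

```python
import math
import random

def sample_configuration() -> dict:
    """Sample a learning rate log-uniformly from the promising range
    [1e-4, 1e-3] and a batch size from {8, 16}."""
    log_lr = random.uniform(math.log10(1e-4), math.log10(1e-3))
    return {"lr": 10 ** log_lr, "batch_size": random.choice([8, 16])}

trials = [sample_configuration() for _ in range(20)]  # 20 random trials
```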
The image size, which can also be regarded as a hyperparameter, was examined as well, mainly because the GPU memory was overloaded at full image size [28].
For the largest examined image size, the difference between the validation accuracy and the training accuracy was found to be high, which indicates overfitting (Table 4). BasicNet may be overfitting here because larger images are more detailed and, therefore, contain more specific features. BasicNet could thus memorize the training images based on their details and, therefore, distinguish less accurately between the classes at larger image sizes. The accuracies for the middle three examined image sizes are very close to each other. In addition, the difference between the respective validation and training accuracy is small, which indicates that BasicNet does not overfit to the training data. In other words, it can generalize well, which means that it distinguishes classes based on their characteristic features. However, the training accuracy is only about 91% in each case; therefore, it is very unlikely that a better validation accuracy can be achieved with these image sizes. The optimal image size probably lies between the largest and the second-largest tested image size, because the best validation accuracies were achieved at these two sizes. No further studies were conducted in this regard. A sharp drop in accuracy can be seen at the smallest selected image size. Looking at Figure 9, it is noticeable that details are omitted at this image size. In addition, artifacts appear in the image; they are recognizable by the new pixelated pattern that replaces the previously existing grooves.
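The down-scaling step could, for example, be implemented with torchvision as sketched below; whether the thesis used this library and interpolation mode is an assumption, but anti-aliased interpolation is one way to reduce the pixelated artifacts that become visible at very small image sizes.

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

# Example down-scaling of the SSM images (originally 960 x 1280 px) to the
# best-performing size of 90 x 122 px; bilinear interpolation with
# anti-aliasing reduces aliasing artifacts at small target sizes.
resize = transforms.Resize(
    (90, 122), interpolation=InterpolationMode.BILINEAR, antialias=True
)
```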
Image augmentation was identified as a further possibility to increase the classification accuracy. Five different geometric and photometric augmentation types were investigated. Of the geometric augmentation types, Transposing, Striding Crops and Rotation by 180° improved the validation accuracy. That is why combinations of them were examined in a second experiment run, in which the validation accuracy could be increased in comparison to the non-augmented classification, but not in comparison to the classification with the non-combined augmentation types.
When Transposing and Rotation by 180° are applied, the training accuracy is lower than the validation accuracy, which is usually not the case. This may indicate that the validation dataset is less complex than the training dataset and, therefore, easier to classify. The lower complexity may also be a consequence of the fact that the images of the validation dataset are not transformed by augmentations. Accordingly, all SSMs of the validation dataset have the same orientation. The lower validation accuracy can also be an indication that the validation set is too small and, therefore, too homogeneous. It may be that the validation dataset, in contrast to the training dataset, does not sufficiently represent the diversity of all SSMs. Another indication that the validation data set is too small is the fact that three validation accuracies have exactly the same value. The training accuracy increased slightly and the validation accuracy remained the same, in comparison to the non-augmented classification, when Rotation by Small Angles was applied. This may be related to the shift invariance, which is ensured by the pooling layer. If patterns from one area of the image differ only slightly from each other, which is the case for images transformed with the Rotation by Small Angles augmentation type, the pooling layer produces the same output for them. Therefore, this augmentation type has only minor effects. In the Flips augmentation experiment, the training accuracy improved slightly, but the validation accuracy declined to 50.41%. This is due to the fact that the SSM images are flipped with a probability of 50% in the training run and, therefore, half of the images have a different orientation compared to the validation set images. BasicNet may not be able to find the flipped patterns in the validation set images.
Of the photometric augmentation types, Color, Contrast and Spot Reflections improved the validation accuracy compared to the non-augmented classification. With three photometric types of augmentation, it should, again, be noted that the validation accuracy is higher than the training accuracy. This may, again, be due to the fact that the validation set is too homogeneous compared to the augmented training set. The application of the Overexposure augmentation decreased the training and validation accuracy in comparison to the non-augmented run. This could be due to the fact that with this type of augmentation, an image is overlaid with a brightened image. This overlay might introduce image artifacts. In addition, a change in brightness has also led to a reduction in validation and training accuracy, when the Brightness augmentation type was applied. It is surprising that a simple change in image brightness does not improve the validation accuracy, unlike other simple modifications such as contrast or color changes. In a second run, combinations of classification-improving augmentations were tested again. With all photometric augmentation types, the validation accuracy was improved in comparison to the non-augmented classification. Compared to the runs with non-combined augmentation types, the validation accuracy appears to deteriorate for the combination Contrast and Color. The combination of Contrast and Spot Reflection has clearly improved the validation accuracy. For the remaining two combinations, the validation accuracy lies between the validation accuracies of the respective basic runs. For photometric augmentation combinations, the validation accuracies are, again, higher than the training accuracies, which again indicates that the validation set is too small or homogeneous.
In the final classification experiments, image augmentations were tested in a geometric group, a photometric group and a group with both augmentation types. It was found that augmentations clearly improve the accuracy for the classification of six classes. In the case of pairwise classification, non-augmented classification in some cases achieves higher accuracies than augmented classification.
Further work could investigate whether there are more augmentation types that improve the SSM classification. In addition, a grid search or other hyperparameter tuning methods could be used to search for the optimal augmentation parameters, such as the probability with which they are applied.
As BasicNet did not perform well in classifying all six classes, a CNN was tested that has already proven itself in classifying surface defect classes. This CNN was developed by Zhou et al. [26] and is called ZhouNet in this work. Compared to the accuracy of 99.4% achieved by Zhou et al. in the classification of eight classes, this work achieved only an accuracy of 65.70% in the classification of six classes.
Three different combinations of image width, height and number of color channels were examined with ZhouNet. In its original configuration, in which monochromatic images with a height and width of 40 pixels each are classified, ZhouNet achieved a validation accuracy of only 62.32%. Because colored and larger images contain more details, ZhouNet was adapted to process them by adjusting the fully connected layer. Better validation accuracies were achieved both by adapting the number of color channels and by adapting the image size. The training accuracy is above 98% for all runs and, therefore, far above the validation accuracy, which shows that ZhouNet heavily overfits and cannot generalize. This is probably due to the fact that the SSM images are heavily compressed by resizing them to such a small size. As a result, they probably lose details that would be important for distinguishing the classes. Zhou et al. resize images of 200 by 200 pixels to a size of 40 by 40 pixels; the SSM images, with an original size of 960 by 1280 pixels, are therefore compressed much more strongly when they are resized. The validation accuracy in the classification of SSMs is far below the test accuracy mentioned in the paper. This may be due to the fact that Zhou et al. classify images of various defect classes, which appear to be easier to distinguish than classes that contain images of only a single surface defect with varying severity.
Overall, the maximum validation accuracy of 65.70% that was achieved with ZhouNet is lower than BasicNet’s maximum validation accuracy of 69.57% for the classification of all six SSM classes. ZhouNet was not examined further because the GPU memory was almost overloaded when images with a width and height of 80 pixels were processed.
As with BasicNet, a grid search was carried out with pre-trained CNNs, which were expected to produce better classification results. These CNNs, namely ConvNeXt_Base, ResNeXt101_64X4D, RegNet_Y_16GF and EfficientNet_V2_S, were each used both for fine-tuning and as FFEs.
It has been found that all examined pre-trained CNNs work better when they are fine-tuned. Only ConvNeXt_Base achieves similar performance both as an FFE and when fine-tuned. Overall, only ConvNeXt_Base as an FFE exceeds the validation accuracy of BasicNet. The other three CNNs perform worse than BasicNet, which is related to the fact that all layers except the last classification layer are frozen during their use as FFEs and, therefore, their weights and biases cannot be trained on the new image data.
As expected, the CNNs were able to achieve higher validation accuracies after being fine-tuned. All of them achieve a validation accuracy of at least 78.26% and thus perform better than BasicNet. With 82.13%, the fine-tuned EfficientNet_V2_S achieves the highest validation accuracy.
Because it achieved similar validation accuracies when fine-tuned and as an FFE, ConvNeXt_Base was further investigated by examining its classification accuracy for different numbers of frozen layers. With two frozen layers, the classification accuracy could indeed be increased to 81.2%.
Final classifications were performed with EfficientNet_V2_S and with ConvNeXt_Base, because these CNNs were the only two that achieved classification accuracies of over 80%. BasicNet was added for comparison. All three CNNs were trained with hyperparameters that were found to be beneficial for them in the process of this work.
In the end, an accuracy of 83.65% was achieved for the classification of all six classes and an accuracy of 91.35% for the classification of pairwise classes. Both of these highest accuracies were achieved with a partially frozen ConvNeXt_Base and image augmentations. BasicNet, as a non-pre-trained CNN, only reached an accuracy of 70.19% for the classification of all six classes and an accuracy of 83.65% for the classification of pairwise classes. Mean AUC values of 0.927 for the classification of six classes and 0.919 for the classification of pairwise classes indicate that the six SSM classes can be distinguished reasonably well from each other.