Article

TIANet: A Defect Classification Structure Based on the Combination of CNN and ViT

School of Intelligent Equipment, Shandong University of Science and Technology, Tai’an 271000, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(8), 1502; https://doi.org/10.3390/electronics14081502
Submission received: 10 March 2025 / Revised: 29 March 2025 / Accepted: 3 April 2025 / Published: 9 April 2025

Abstract

Defect detection plays a crucial role in ensuring product quality. However, accurate and effective defect detection remains challenging because of the characteristics inherent in defect images, including variations in scale and shape. We propose a new defect classification structure, TIANet, which comprises a local feature extraction module (ISD), a global feature extraction module (DSViT), and an atrous spatial pyramid pooling (ASPP) module. The ISD module combines an inverted residual structure with the Squeeze-and-Excitation attention mechanism and a path drop (DropPath) mechanism to extract local features and learn complex patterns. The DSViT module combines a Vision Transformer with depthwise separable convolution to extract global features and fuse them with the local features, ensuring accurate feature representation of defects against similar backgrounds. The ASPP module enhances the model's multi-scale feature extraction and contextual information capture, allowing it to effectively perceive defects of different shapes and scales. Experimental verification on a glass bottle dataset shows that TIANet performs well in the defect classification of national-standard white glass bottles, reaching an accuracy of 95.714%. Compared with the typical network models Vision Transformer and MobileNetV3, TIANet shows significant advantages, which verifies its effectiveness and superiority in glass bottle defect classification.

1. Introduction

1.1. Research Background

Recent advancements in computer vision have greatly contributed to industrial intelligent defect detection, offering robust technical support. Image processing techniques are now widely used to analyze surface images of objects, enabling the automatic detection and classification of defects like pores, fractures, cracks, and wear, making it a key area of research. Conventional image processing methods primarily depend on techniques such as edge detection and morphological operations [1,2,3,4]. While these methods perform well in certain specific scenarios, they have limitations in robustness and generalization, especially when applied in complex and ever-changing real-world production environments, posing significant challenges. As deep learning technology has advanced [5], particularly the widespread use of convolutional neural networks (CNNs), defect detection methods have gradually shifted to deep learning techniques, greatly improving detection performance and adaptability.

1.2. Related Work

Recently, models like Faster R-CNN [6], YOLO [7], and SSD [8] have been proposed for object detection, achieving strong results on public datasets and in real-world applications. However, Faster R-CNN may miss small defects during candidate region generation, leading to reduced performance in detecting these defects. YOLO struggles with multi-scale defects, having difficulty detecting both large and small defects at once. SSD, on the other hand, tends to produce false positives in complex backgrounds, particularly when surface texture or lighting variations interfere, affecting model robustness.
To overcome these shortcomings, the Transformer [9], which relies entirely on the attention mechanism, has become an alternative. Chen et al. [10] proposed a dual-branch transformer that improves image classification performance through multi-scale feature representation and a cross-attention module. Bazi et al. [11] generated input sequences for multi-head attention layers through patching, flattening, embedding, and positional embedding, used the multi-head attention mechanism to derive long-range contextual relationships in the images, and then performed classification through a Softmax layer. However, although the Transformer can effectively model global context, it lacks spatial inductive bias, and it falls short in capturing the local features of industrial defect images.
To address this issue, this paper introduces a surface defect classification approach that combines CNN and ViT [12]: TIANet. TIANet comprises three modules that work together to perform defect classification. First, a local feature extraction module (ISD) fuses the Squeeze-and-Excitation attention mechanism (SE) [13] and the path drop (DropPath) mechanism [14] with an Inverted Residual structure (IR) [15], greatly improving the model's local feature extraction and channel representation abilities. Second, a global feature extraction module (DSViT), composed of a ViT combined with depthwise separable convolution [16], extracts global defect features and fuses them with the local features extracted by ISD. Finally, the Atrous Spatial Pyramid Pooling (ASPP) module [17] captures multi-scale features, using dilated convolutions with varying dilation rates to gather contextual information at different scales and detect defects of various shapes and sizes.
In Section 2, we present the specific methods we propose. Section 3 includes an introduction to the dataset, experiments, and analysis. Finally, Section 4 and Section 5 provide a discussion and conclusion of this work, along with suggestions for future research.

2. Methodology

This section introduces the model proposed in this article. Section 2.1 presents the overall structure of the model, Section 2.2 the ISD module, Section 2.3 the DSViT module, and Section 2.4 the ASPP module.

2.1. TIANet Structure Design

The architecture of TIANet, introduced in this paper, is illustrated in Figure 1; it mainly includes ISD blocks, DSViT blocks, and ASPP blocks. The ISD blocks and DSViT blocks extract local and global features, respectively, and the ASPP blocks perform multi-scale feature extraction to further capture contextual information. ViT treats the input as a one-dimensional sequence and focuses on modeling global information, but it lacks the low-level features that carry detailed positional information, so fine spatial detail cannot be effectively recovered and classification performance suffers. The CNN structure is good at extracting low-level features: it focuses on small areas around each pixel and captures local textures and edge features, which compensates for this disadvantage of ViT. Concretely, the progressively increasing receptive field of the CNN is first used to obtain low-level local defect features from the input defect image. Then, ViT captures global contextual features and models long-range dependencies. Finally, depthwise convolution fuses these features to refine the defect information.
The TIANet architecture begins with the input image, which is initially processed by convolutional layers. The result then enters the feature extraction stage, composed of the ISD, DSViT, and ASPP blocks, which further extract and fuse local and global features and perform multi-scale feature extraction through the ASPP module to capture contextual information. Finally, the aggregated features are passed through a convolutional layer to adjust the channel count, followed by a global pooling layer and a classifier to produce the final predictions. By leveraging the strengths of both CNN and ViT, the model effectively extracts local features while capturing global context, enhancing its performance and accuracy.

2.2. ISD Block

ResNet [18] introduced the residual structure with skip connections to mitigate the vanishing gradient issue in deep networks, enhancing training performance. However, the computational and memory consumption of residual structures is high, especially in deep networks, resulting in slower training and inference. The IR uses depthwise separable convolution, significantly boosting feature extraction while maintaining computational efficiency, but its feature extraction ability is slightly insufficient for complex surface defect classification tasks. This paper proposes the ISD block, which integrates the SE mechanism and the DropPath mechanism into the inverted residual structure. Through the SE mechanism, the block captures the differences between channels at specific locations in the image while maintaining computational efficiency: it selectively focuses on relevant information, disregards irrelevant parts, and enhances feature expression, which helps it cope with complex local feature extraction tasks. Through the DropPath mechanism, the block randomly discards paths so that the model learns more diverse feature representations, avoids relying on specific paths for feature extraction, and reduces overfitting.
Figure 2 illustrates the detailed structure of the ISD block. Initially, the input tensor has its channel count increased via a pointwise convolution (PW Conv2D), followed by batch normalization (BN) and the Gaussian Error Linear Unit (GELU) activation function. This step increases the channel dimension of the input feature map, enhancing its representational capacity. In this high-dimensional space, a depthwise convolution is performed with the number of channels kept constant, again followed by batch normalization and GELU activation. Subsequently, the SE module multiplies the attention weights element-wise with the feature map to complete the channel-wise feature recalibration. A pointwise convolution then reduces the channel count back to the original dimension. To mitigate overfitting, the DropPath mechanism randomly drops the block's computational path and computes the output by multiplying with a randomly generated retention mask and rescaling, enhancing the generalization ability of the model. Finally, the ISD block uses a residual connection to merge inputs and outputs, enabling the model to capture higher-level features while preserving the original feature information. Throughout the process, the integration of the IR, SE, and DropPath mechanisms enhances local feature extraction and channel characterization, while ensuring computational efficiency and reducing overfitting.
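To make this flow concrete, the following is a minimal PyTorch sketch of an inverted residual block with SE recalibration and stochastic path dropping. The expansion ratio, reduction ratio, and the use of torchvision's SqueezeExcitation and StochasticDepth utilities are simplifying assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import SqueezeExcitation, StochasticDepth  # torchvision >= 0.11

class ISDBlock(nn.Module):
    """Sketch of an ISD-style block: inverted residual + SE + DropPath.
    Channel widths, expansion ratio, and reduction ratio are assumptions."""
    def __init__(self, channels, expand_ratio=4, drop_prob=0.1):
        super().__init__()
        hidden = channels * expand_ratio
        self.block = nn.Sequential(
            # 1x1 pointwise expansion + BN + GELU
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.GELU(),
            # 3x3 depthwise convolution in the high-dimensional space
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.GELU(),
            # channel recalibration; the paper swaps in ReLU6 / H-Sigmoid
            SqueezeExcitation(hidden, hidden // 4,
                              activation=nn.ReLU6,
                              scale_activation=nn.Hardsigmoid),
            # 1x1 pointwise projection back to the input width
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # Randomly drop the whole residual branch per sample during training
        self.drop_path = StochasticDepth(drop_prob, mode="row")

    def forward(self, x):
        return x + self.drop_path(self.block(x))

# Example: a 64-channel feature map passes through with unchanged shape
y = ISDBlock(64)(torch.randn(2, 64, 32, 32))   # -> (2, 64, 32, 32)
```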

2.2.1. Optimized Squeeze-and-Excitation Attention

The core concept behind the SE attention mechanism is to enhance the network's representation by explicitly capturing channel dependencies. SE boosts network performance by reassigning weights to each channel, amplifying crucial features and diminishing irrelevant ones. However, because defect features vary greatly in scale and shape, and especially when feature values are small, the traditional Rectified Linear Unit (ReLU) activation function fails to transmit some important details effectively, which degrades feature extraction. In this paper, the ReLU6 and H-Sigmoid activation functions are introduced to improve numerical stability and computational efficiency while making the mechanism more sensitive and robust when dealing with small feature values. The SE mechanism consists of two main parts: squeeze and excitation. The flow of the SE algorithm is shown in Figure 3.
In the squeeze stage (F_sq(·)), the model applies global average pooling to the input feature map, compressing it to a 1 × 1 × C vector that gives a global description of each channel. Each value represents the global information of the corresponding channel, as given in Equation (1):
Z_c = F_{\mathrm{sq}}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)    (1)
where Z_c is the result of global average pooling, representing the global information of channel c; H and W are the height and width of the input feature map, respectively; and u_c(i, j) is the value of the input feature map at spatial position (i, j) on channel c.
In the excitation stage (F_ex(·, W)), the 1 × 1 × C vector generated in the squeeze stage passes through two fully connected layers (each implemented as a 1 × 1 convolution). The first fully connected layer FC_1 reduces the number of channels to C/r; after the ReLU6 activation function is applied, the second fully connected layer FC_2 restores the number of channels to C, and the H-Sigmoid activation function produces the attention weight S for each channel. The computed weights S are then used to reweight the individual channels of the original feature map. The formula is shown in Equation (2):
S = \text{H-Sigmoid}\big(\mathrm{FC}_2\big(\mathrm{ReLU6}\big(\mathrm{FC}_1\big(F_{\mathrm{sq}}(u_c)\big)\big)\big)\big)    (2)
where S is the generated attention weight; u_c is the input feature; and FC_1 and FC_2 are the two fully connected layers responsible for mapping and processing the features, respectively.
Finally, in the F_scale(·, ·) stage, the attention weight S and the original feature map U are multiplied element by element to complete the channel-wise feature recalibration, yielding the recalibrated feature map X̃, which has exactly the same size as the original feature map. The formula is shown in Equation (3):
\tilde{X}_c = F_{\mathrm{scale}}(u_c, s_c) = s_c \cdot u_c    (3)
where u_c is the input feature of channel c; s_c is the scale factor generated for that channel; and F_scale rescales the input feature according to its corresponding scale factor.
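A minimal PyTorch sketch of this squeeze/excitation/scale computation (Equations (1)-(3)) is given below; the reduction ratio r = 4 is an assumption, and 1 × 1 convolutions stand in for the two fully connected layers, as described above.

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Sketch of the modified SE mechanism: squeeze by global average
    pooling, excite via two 1x1-conv FC layers with ReLU6 and H-Sigmoid,
    then rescale each channel. Reduction ratio r is an assumption."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // r, kernel_size=1)  # FC_1
        self.fc2 = nn.Conv2d(channels // r, channels, kernel_size=1)  # FC_2
        self.act = nn.ReLU6()
        self.gate = nn.Hardsigmoid()   # H-Sigmoid

    def forward(self, u):
        # Squeeze (Eq. 1): z_c = (1 / HW) * sum_{i,j} u_c(i, j) -> 1 x 1 x C
        z = u.mean(dim=(2, 3), keepdim=True)
        # Excitation (Eq. 2): s = H-Sigmoid(FC_2(ReLU6(FC_1(z))))
        s = self.gate(self.fc2(self.act(self.fc1(z))))
        # Scale (Eq. 3): x~_c = s_c * u_c
        return s * u

u = torch.randn(1, 64, 32, 32)
assert SEModule(64)(u).shape == u.shape
```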

2.2.2. Path Drop Mechanism

The dropout mechanism is a regularization technique used to prevent overfitting in neural networks. Its core idea is to reduce co-adaptation between neurons by randomly selecting a subset of neurons during training and temporarily "discarding" them (i.e., setting their output to zero). In the IR and other complex network structures, however, the path drop mechanism performs better: it randomly discards entire computational paths during training instead of single neurons, so the network uses a different sub-network for each forward pass, which reduces the network's dependence on particular paths and increases the generalization ability of the model.
The path drop mechanism can be expressed as follows: for the input x_i of layer i, the output after path dropping is obtained by multiplying with a randomly generated retention mask and rescaling, as shown in Equation (4):
y_i = x_i \cdot m_i    (4)
where m_i is a random mask of the same dimension as the input, with elements 0 or 1 representing dropping and retaining the path, respectively. This paper sets the random path drop probability to 0.1, which reduces reliance on specific paths. In the IR in particular, the path drop mechanism effectively mitigates overfitting, enhances model robustness to noise and shifts in data distribution, and helps the model cope better with different types of defects.
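The following is a minimal sketch of Equation (4), including the keep-probability rescaling commonly applied alongside it; the per-sample mask shape is an assumption consistent with dropping whole residual paths.

```python
import torch

def drop_path(x, drop_prob=0.1, training=True):
    """Sketch of path dropping (Eq. 4): y_i = x_i * m_i, where m_i is a
    per-sample binary mask; surviving paths are rescaled by the keep
    probability so the expected activation is unchanged."""
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1.0 - drop_prob
    # One mask entry per sample: shape (N, 1, 1, ..., 1)
    mask_shape = (x.shape[0],) + (1,) * (x.dim() - 1)
    mask = torch.bernoulli(torch.full(mask_shape, keep_prob, device=x.device))
    return x * mask / keep_prob

y = drop_path(torch.randn(4, 64, 32, 32), drop_prob=0.1)
```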

2.3. DSViT Block

CNNs excel at extracting local features, but because industrial defect images exhibit highly variable features, CNN performance falls short on tasks that require global features, and the limited receptive field of the convolution kernel hinders global feature extraction. The Transformer module can extract features over the global scope through its self-attention mechanism, compensating for this shortcoming of CNNs. To this end, this paper uses the DSViT module, which combines a Vision Transformer with depthwise separable convolution to achieve global feature extraction of defects. The structure of the DSViT block is shown in Figure 4; efficient representation learning is achieved through the combination of three submodules: feature preprocessing, global feature encoding, and feature fusion.
(1)
Feature preprocessing: This stage consists of a depthwise convolution followed by a pointwise convolution, and its purpose is to convert the input tensor X ∈ ℝ^{H×W×C} into X_D ∈ ℝ^{H×W×d} to match the input requirements of the Transformer module.
First, the depthwise convolutional layer extracts local features from the input feature map X. Depthwise convolution preserves spatial information while extracting key local features by applying a 3 × 3 convolution kernel channel by channel, as shown in Equation (5):
X_0 = \mathrm{DWConv}_{3 \times 3}(X)    (5)
where X_0 is the output feature map and X is the input feature map.
Then, the pointwise convolution uses a 1 × 1 convolution kernel, mainly to adjust the depth of the feature map and merge the multi-channel data into a deeper representation that satisfies the dimensional requirements of the subsequent Transformer module. The formula is shown in Equation (6):
X_D = \mathrm{PWConv}_{1 \times 1}(X_0)    (6)
where X_D is the output feature map and X_0 is the feature map after depthwise convolution.
(2)
Global representation: This stage includes an unfolding operation, Transformer encoding, and a folding operation; its purpose is to capture global information by modeling long-range, non-local dependencies while retaining spatial inductive bias.
First, the unfolding operation converts the processed feature map X_D into multiple non-overlapping flattened patches X_u. Each patch is cut contiguously from the feature map, and its rows and columns are rearranged to form a vector. The formula is shown in Equation (7):
X_u = \mathrm{Unfold}(X_D) = \mathrm{reshape}(X_D, P, d)    (7)
where X_D is the input feature map, P = w × h is the number of patches, and d is the dimension of each flattened patch (determined by the patch height, width, and number of channels).
Second, the Transformer encoder accepts the unfolded patch sequence as input and uses multi-head self-attention (MSA) and a feed-forward network (FFN) to encode the interrelationships between the patches. The formulas are shown in Equations (8) and (9):
X_{\mathrm{temp}} = \mathrm{MSA}(\mathrm{LN}(X_u)) + X_u    (8)
X_T = \mathrm{FFN}(\mathrm{LN}(X_{\mathrm{temp}})) + X_{\mathrm{temp}}    (9)
where LN stands for layer normalization, MSA for the multi-head self-attention mechanism, and FFN for the feed-forward network. Each patch X_u learns and updates its representation through the self-attention mechanism to capture dependencies between patches.
Finally, the folding operation converts the Transformer-encoded output X_T ∈ ℝ^{P×N×d} back to X_F ∈ ℝ^{H×W×d}, which has the same spatial layout as the original feature map X_D. The formula is shown in Equation (10):
X_F = \mathrm{Fold}(X_T) = \mathrm{reshape}(X_T, H, W, d)    (10)
where H, W, and d are the height, width, and number of channels of the feature map, respectively, matching X_D. This step ensures that the global representation is consistent with the original spatial dimensions, facilitating subsequent processing and feature fusion.
(3)
Feature fusion: This stage includes low-dimensional projection, a concatenation operation, and fusion convolution. Its aim is to combine the global features processed by the Transformer with the local features extracted by the ISD block.
Firstly, a 1 × 1 pointwise convolutional layer is used to project the high-dimensional features output by the global feature coding module into a low-dimensional space to reduce computational complexity.
Secondly, the dimensionality-reduced global feature X_1 is concatenated with the local feature X to achieve feature fusion and enrich the feature expression, yielding the fused feature X_2. The formula is shown in Equation (11):
X_2 = \mathrm{Concat}(X_1, X)    (11)
where Concat(·) denotes the concatenation operation.
Finally, the concatenated features are processed by another 3 × 3 depthwise separable convolutional layer, which further extracts and fuses the information to obtain the final output tensor Y ∈ ℝ^{H×W×C}.
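To make the three stages concrete, here is a minimal PyTorch sketch of a DSViT-style block. The embedding dimension, number of encoder layers and heads, and the simplification of treating each spatial position as one "patch" (i.e., a 1 × 1 patch size) are our own assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DSViTBlock(nn.Module):
    """Sketch of a DSViT-style block: depthwise + pointwise preprocessing,
    Transformer-based global encoding, and fusion with the local input."""
    def __init__(self, channels, dim=96, depth=2, heads=4):
        super().__init__()
        # (1) Feature preprocessing: 3x3 depthwise conv, then 1x1 pointwise conv
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pw_in = nn.Conv2d(channels, dim, 1)
        # (2) Global representation: unfold -> Transformer encoder -> fold
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=2 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # (3) Feature fusion: project back, concatenate with the input, fuse
        self.pw_out = nn.Conv2d(dim, channels, 1)
        self.fuse = nn.Sequential(   # 3x3 depthwise separable convolution
            nn.Conv2d(2 * channels, 2 * channels, 3, padding=1, groups=2 * channels),
            nn.Conv2d(2 * channels, channels, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        x_d = self.pw_in(self.dw(x))                        # B x d x H x W
        tokens = x_d.flatten(2).transpose(1, 2)             # unfold: B x (HW) x d
        tokens = self.encoder(tokens)                       # MSA + FFN blocks
        x_f = tokens.transpose(1, 2).reshape(b, -1, h, w)   # fold back to B x d x H x W
        x_1 = self.pw_out(x_f)                              # low-dimensional projection
        return self.fuse(torch.cat([x_1, x], dim=1))        # concat + fuse -> B x C x H x W

y = DSViTBlock(64)(torch.randn(1, 64, 32, 32))              # -> (1, 64, 32, 32)
```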

2.4. ASPP Block

The Atrous Spatial Pyramid Pooling (ASPP) module introduced in this paper captures contextual information at different scales by applying dilated convolutions with different sampling (dilation) rates [19], so that the network can learn and focus on defect features of different shapes and sizes. The dilated convolution operation is given in Equation (12):
y(i) = \sum_{k=1}^{K} x(i + r \cdot k) \cdot w(k)    (12)
where y(i) is the value of the output feature map at position i, x is the input, w(k) is the convolutional filter weight, r is the dilation rate, and K is the filter size. The dilation rate r controls the spacing between the sampled filter positions.
The ASPP structure, shown in Figure 5, processes the input feature map through several 3   ×   3 convolutional layers with varying dilation rates, each followed by batch normalization and ReLU activation to capture multi-scale features. Simultaneously, the global average pooling layer compresses the feature map into a global vector, which is processed through a 1   ×   1 convolution, batch normalization, and ReLU activation before being resized to match the input dimensions. The output feature maps from each convolutional layer and the global pooling layer are concatenated along the channel axis to create a new feature map. Finally, channel fusion and dimensionality reduction are carried out via a 1   ×   1 convolution, producing the desired output feature map after batch normalization, ReLU activation, and Dropout processing. The formula corresponding to this process is as follows (13):
y_{\mathrm{ASPP}} = \mathrm{Conv}_1\big(\big[\mathrm{Conv}_1(x),\ \mathrm{Conv}_3^{r}(x),\ \mathrm{Conv}_3^{2r}(x),\ \mathrm{Conv}_3^{3r}(x),\ \mathrm{Conv}_1(\mathrm{GAP}(x))\big]\big)    (13)
where Conv_3^r denotes a 3 × 3 convolution with dilation rate r, GAP denotes global average pooling, and [·] denotes concatenation of feature maps along the channel dimension. After concatenation, a 1 × 1 convolution Conv_1 fuses the features. The ASPP module leverages multi-scale feature extraction to enhance the model's ability to capture both image details and global context, thereby improving classification performance.
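The following is a minimal PyTorch sketch of such an ASPP block; the dilation rates (6, 12, 18), output width, and dropout probability are assumptions, and the structure follows the description above (parallel dilated 3 × 3 branches, a 1 × 1 branch, an image-level pooling branch, concatenation, and 1 × 1 fusion).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Sketch of an ASPP block with assumed dilation rates and dropout."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18), p_drop=0.1):
        super().__init__()
        def branch(k, d):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=d if k == 3 else 0,
                          dilation=d, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList(
            [branch(1, 1)] + [branch(3, r) for r in rates])   # 1x1 + dilated 3x3 branches
        self.pool = nn.Sequential(                            # image-level context
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.project = nn.Sequential(                         # channel fusion + dropout
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True), nn.Dropout(p_drop))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        g = F.interpolate(self.pool(x), size=(h, w),
                          mode="bilinear", align_corners=False)  # resize global vector
        return self.project(torch.cat(feats + [g], dim=1))

y = ASPP(96, 128)(torch.randn(2, 96, 32, 32))                    # -> (2, 128, 32, 32)
```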
Figure 5. Structure of ASPP.

3. Data and Experiments

This section presents a comprehensive analysis of the experiment, covering the dataset, experimental setup, the detailed process and results, as well as the ablation study to further demonstrate the effectiveness of the TIANet architecture.

3.1. Datasets

The dataset used in this paper is provided by Shandong Mingjia Technology Co., a cooperative enterprise of the project. The data acquisition equipment is a DALSA Genie Nano-1GigE industrial camera with a resolution of 1280 × 1024 and a frame rate of 35 fps. Backlighting is used for illumination, and multiple sets of mirrors reflect the bottle at different angles to obtain the widest possible view of the bottle. Each bottle is inspected twice (front and rear), with one camera capturing four viewing angles each time, giving eight viewing angles in total so that the full 360-degree body of each bottle is covered.
The different categories of bottle wall images are shown in Figure 6. There are 282 qualified images of the front side, 608 unqualified images of the front side, 253 qualified images of the back side, and 429 unqualified images of the back side, for a total of 1572 glass bottle wall images. The defect categories in the unqualified images are shown in Figure 7 and include the following: (a) Plaque: foreign stone particles embedded in the glass, which reduce the quality of the bottle; (b) Crack: deformation or cracking of the bottle surface, which may affect the structural integrity of the bottle; (c) Contamination: dirt remaining in the bottle from use or cleaning; (d) Scratches: scratches on the bottle surface, which can affect the strength and appearance of the bottle.
The imbalanced sample distribution in the dataset would reduce the accuracy and robustness of the model, so data augmentation is used to ensure that each category has at least 450 images, for a total of 1800 images. Of the dataset, 60% (1080 images) is used for training, 20% (360 images) for validation, and the remaining 20% (360 images) for testing. The input image size is 512 × 512 × 3, and data augmentation techniques such as horizontal flipping and random cropping are applied to improve diversity.
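As an illustration of this split and augmentation setup, the following torchvision sketch could be used; the folder path `glass_bottles/`, the exact transform parameters, and the use of a single ImageFolder followed by a random split are assumptions (the offline class-balancing augmentation described above is not shown).

```python
import torch
from torchvision import datasets, transforms

# Sketch of the input pipeline; paths, resize strategy, and crop scale are assumptions.
train_tf = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.RandomHorizontalFlip(),                         # horizontal flipping
    transforms.RandomResizedCrop(512, scale=(0.8, 1.0)),       # random cropping
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("glass_bottles/", transform=train_tf)  # hypothetical path
n = len(dataset)                                               # 1800 images after balancing
n_train, n_val = int(0.6 * n), int(0.2 * n)
train_set, val_set, test_set = torch.utils.data.random_split(
    dataset, [n_train, n_val, n - n_train - n_val],
    generator=torch.Generator().manual_seed(0))                # 60 / 20 / 20 split
```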
The glass bottle defect dataset used in this study has diverse defect types and real-world image samples, making it suitable for training TIANet models. The dataset contains defect images of different shapes, sizes, and lighting conditions, covering various common types of defects such as cracks, debris, scratches, etc., providing rich training samples for the model and helping it learn the features of various defects.

3.2. Experimental Environment

The TIANet model is implemented with Python 3.7.1 and the PyTorch 1.10.1 framework. Training was conducted on a single high-performance NVIDIA GeForce RTX 4070 GPU. The AdamW optimizer [20] is adopted with cross-entropy loss [21]; training runs for 500 epochs with a batch size of 4 and a learning rate of 0.0002.
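A minimal sketch of this training configuration is given below; the model and dataset are passed in from the earlier sketches (or the authors' own implementation), and details such as the absence of a learning-rate schedule are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_set, epochs=500, batch_size=4, lr=2e-4):
    """Sketch of the stated training setup: AdamW, cross-entropy loss,
    500 epochs, batch size 4, learning rate 2e-4."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```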

3.3. Analysis of Experimental Results

This section assesses the classification of glass bottle defects based on accuracy, precision, recall, and F1-score. To compare the model’s performance in identifying defective glass bottles, a confusion matrix and comparative experiments were performed to evaluate the proposed method’s effectiveness.

3.3.1. Evaluation Metrics

Accuracy measures the proportion of correctly predicted samples relative to the total number of samples, reflecting the overall correctness of the classifier. Precision calculates the proportion of true positive samples among those predicted as positive, indicating the accuracy of positive predictions. Recall evaluates the proportion of actual positive samples correctly identified by the classifier, demonstrating its ability to detect positive instances. The F1-score, the harmonic mean of precision and recall, offers a balanced assessment of both metrics. It is particularly useful in cases of class imbalance, where a higher F1-score signifies a better balance between precision and recall in the model.
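These four metrics can be computed directly from a confusion matrix; the short sketch below shows one way to do so with macro averaging over the four classes (the example matrix values are purely illustrative, not the paper's results).

```python
import numpy as np

def classification_metrics(cm):
    """Compute accuracy and macro-averaged precision, recall, and F1-score
    from a confusion matrix cm, where cm[i, j] counts samples predicted as
    class i whose actual class is j (the row = predicted convention used in
    Section 3.3.3). A generic sketch, not the authors' evaluation script."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / np.clip(cm.sum(axis=1), 1e-12, None)  # per predicted class
    recall = tp / np.clip(cm.sum(axis=0), 1e-12, None)     # per actual class
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return {"accuracy": tp.sum() / cm.sum(),
            "precision": precision.mean(),
            "recall": recall.mean(),
            "f1": f1.mean()}

# Illustrative 4-class confusion matrix (made-up counts)
cm = [[88, 1, 1, 0],
      [2, 85, 2, 1],
      [0, 1, 87, 2],
      [1, 0, 2, 87]]
print(classification_metrics(cm))
```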

3.3.2. Comparative Test

To validate the effectiveness of the proposed model for visual classification of glass bottles, we compare the results of the proposed method on the glass bottle dataset with those of the typical networks EfficientNetV2 [22], Vision Transformer, ShuffleNet [23], MobileNetV3 [24], and MobileViT [25]. The proposed method and the comparison networks, such as EfficientNetV2, all use images with a resolution of 512 × 512 as input, and all comparison experiments are performed on the same machine and environment and run on the same dataset with default settings. Table 1 presents the comparison results.
The comparative results in Table 1 indicate significant differences in classification performance across the models. The EfficientNetV2 and Vision Transformer models showed only average performance in classification accuracy, F1-score, and recall, with accuracies of 60.357% and 63.571% and F1-scores of 59.682% and 62.104%, respectively. Because ViT and EfficientNetV2 require large amounts of training data and the dataset used in this article is small, these models may not fully learn useful features, resulting in low performance. The ShuffleNet and MobileViT models performed well in the classification task, with accuracies of 93.929% and 92.186% and F1-scores of 93.975% and 92.066%, respectively. Among all models, the proposed TIANet performs best, with a classification accuracy of 95.714% and an F1-score of 95.882%. TIANet achieves high classification accuracy and precision for all categories thanks to the optimization of its structure and feature extraction mechanism, which better captures and distinguishes features from different categories, thereby improving classification performance.

3.3.3. Confusion Matrix

To assess the performance of the TIANet model, the best-performing model from training was saved for testing. In addressing the “black box” issue, the model’s interpretability was analyzed through the confusion matrix. The confusion matrix, a key tool for evaluating classification model performance, shows a square matrix where rows represent predicted categories, and columns represent actual categories. Analyzing this matrix helps in understanding how the model classifies each category. To evaluate the classification performance of TIANet, the confusion matrix was plotted for a visual comparison with other models on the glass bottle dataset, as shown in Figure 8.
As can be seen from Figure 8, EfficientNetV2 does not perform well when dealing with bottle sorting tasks, especially in the “Front unqualified” category, and the overall performance of the model is low. In contrast, the Vision Transformer model has improved in all indicators, but its performance in handling complex classification tasks still needs to be improved. The ShuffleNet and MobileViT models perform well in classification tasks, especially when dealing with the “Front unqualified” category, with less misclassification. The MobileNetV3 model also performed well, especially in the “Rear qualified” and “Rear unqualified” categories. The proposed model TIANet has the best classification performance in all categories, especially in the “Front qualified” and “Front unqualified” categories. Compared with other models, TIANet has higher classification accuracy in all four categories, showing its superior feature extraction ability and excellent classification performance.
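For reference, a confusion matrix like those in Figure 8 could be accumulated over the test set as in the following sketch; the four-class setup and the row = predicted / column = actual convention follow the description above, while the function itself is an assumption rather than the authors' code.

```python
import torch

@torch.no_grad()
def confusion_matrix(model, loader, num_classes=4, device="cpu"):
    """Accumulate a confusion matrix over a test loader, with rows as
    predicted classes and columns as actual classes."""
    cm = torch.zeros(num_classes, num_classes, dtype=torch.long)
    model = model.to(device)
    model.eval()
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        for p, t in zip(preds, labels):
            cm[p, t] += 1
    return cm
```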

3.4. Ablation Experiments

3.4.1. Recommended Module Ablation

The method in this paper consists of three parts: the ISD module, the DSViT module, and the ASPP module; the ISD module is built by adding the SE mechanism and the DropPath mechanism to the inverted residual structure (IR). To evaluate the contribution of each part, ablation experiments were carried out on the glass bottle dataset: the SE mechanism, DropPath mechanism, and ASPP module were removed, leaving a network composed of the DSViT module fused with the inverted residual structure (IR) as the baseline, and the components were then added back to the baseline network one by one.
The experimental results are shown in Figure 9. When only the DSViT fused with the inverted residual structure is used as the baseline network, the accuracy is relatively low, at 91.786%. When the SE mechanism is integrated, the network gains the ability to capture differential features between channels at specific locations, improving the accuracy to 94.107%. After further integration of the ASPP module, the network can effectively perceive defects of different shapes and sizes, with an accuracy of 95.005%. Finally, after adding the DropPath mechanism, the accuracy of the network increases to 95.714%. The ablation experiment demonstrates that each component contributes to feature extraction, multi-scale fusion, and overall classification performance.

3.4.2. Probabilistic Analysis of the Path Drop Mechanism

The path drop mechanism prevents overfitting and accelerates training by setting a drop probability, randomly dropping entire paths in certain layers, and not updating the parameters of those paths during backpropagation. In this part, various drop probabilities are applied to the network and experiments are conducted to find the optimal setting. The results are illustrated in Figure 10, showing noticeable changes across the different path drop probabilities. Specifically, when the drop probability is set to 0.1, the model attains its maximum accuracy of 95.714%. However, as the drop probability increases to 0.2 and 0.3, all performance metrics start to decline; at a drop probability of 0.3, they reach their lowest values. This indicates that higher drop probabilities can hinder the network's learning ability, leading to poorer performance.

3.4.3. ASPP Module Insert Position Analysis

Based on practical experience, this study assumes that the ASPP module will achieve the best results when applied to the middle layers of the baseline network. To test this hypothesis, the ASPP module is inserted at different positions after each layer stage, and a series of experiments is conducted. The results are shown in Figure 11. Initially, ASPP modules were inserted separately after each stage to evaluate the classification performance of each resulting network. According to Figure 11, performance is poor when the ASPP module is inserted after Layer 1 or Layer 2 and improves when it is inserted after the later stages. The degradation after Layer 1 and Layer 2 may occur because the shallow layers of the network have not yet fully learned the basic features of the image, so adding ASPP there may introduce noise that affects overall performance. Further tests on combinations of different stages showed that inserting the ASPP module after the Layer 5 stage works best, with a recognition accuracy of 95.089%.

4. Discussion

To achieve accurate detection of defective glass bottles, this paper proposes TIANet, a defect classification structure based on the combination of CNN and ViT, and verifies its effectiveness on a self-built dataset. First, to enhance the model's ability to extract local features on the surface of glass bottles, we propose the ISD module, which consists of an IR structure fused with the SE mechanism and the path drop mechanism, improving the model's local feature extraction and channel representation. Second, we propose the DSViT module, which combines a ViT with depthwise separable convolution to achieve global feature extraction of defects. Finally, the ASPP module is introduced to extract multi-scale features and capture contextual information at different scales by using dilated convolutions with different sampling (dilation) rates, in order to perceive defects of different shapes and sizes. On the self-built glass bottle dataset, TIANet demonstrated excellent performance, achieving an accuracy of 95.714%, precision of 95.903%, recall of 95.903%, and an F1-score of 95.882%, outperforming the other classification models. The experimental results show that TIANet effectively captures both local and global features, enabling it to accurately detect defects of various categories and sizes. It maintains high accuracy even under challenging conditions such as surface texture or lighting variations, highlighting its robustness and practicality in real-world defect detection scenarios.

5. Conclusions

Overall, TIANet surpasses current mainstream models on several key performance indicators for the glass bottle classification task, achieving an accuracy of 95.714%, which is 3.928% higher than the baseline model. Owing to its integration of ISD, DSViT, and ASPP, it performs well on small defects and complex background scenes, greatly improving feature extraction capability. However, there are still areas that need improvement. Specifically, the detection of rare defects in industrial datasets and the imbalance of defect types may lead to performance degradation. Future research should focus on data augmentation and synthetic data generation to enhance model learning. In addition, lightweight multi-scale feature fusion techniques are needed to reduce computational costs and meet real-time industrial production needs. In summary, TIANet has demonstrated great potential in glass bottle defect detection, providing practical solutions for improving quality control, reducing defective products, and increasing manufacturing efficiency. It paves the way for efficient, intelligent manufacturing and defect detection, supporting the advancement of high-precision, real-time industrial inspection systems.

Author Contributions

H.W.: Data curation; Formal analysis; Methodology. F.Z.: Data organization; Investigation; Methodology; Writing—initial draft. X.A.: Data curation; Formal analysis; Funding acquisition. K.L.: Writing—original draft. Y.Z.: Formal analysis; Project management. Q.G.: Data preprocessing; Writing—initial draft. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
TIANet: A defect classification structure based on the combination of CNN and ViT
ViT: Vision Transformer
ISD: Local feature extraction module
DSViT: Global feature extraction module
ASPP: Atrous Spatial Pyramid Pooling
SE: Squeeze-and-Excitation attention
BN: Batch normalization

References

  1. Xian, R.; Xiong, X.; Peng, H.; Wang, J.; de Arellano Marrero, A.R.; Yang, Q. Feature fusion method based on spiking neural convolutional network for edge detection. Pattern Recognit. 2024, 147, 110112. [Google Scholar] [CrossRef]
  2. Lu, Y.; Duanmu, L.; Zhai, Z.J.; Wang, Z. Application and improvement of Canny edge-detection algorithm for exterior wall hollowing detection using infrared thermal images. Energy Build. 2022, 274, 112421. [Google Scholar]
  3. Wang, S.; Li, L.; Wen, S.; Liang, R.; Liu, Y.; Zhao, F.; Yang, Y. Metalens for accelerated optoelectronic edge detection under ambient illumination. Nano Lett. 2023, 24, 356–361. [Google Scholar] [CrossRef] [PubMed]
  4. Bhateja, V.; Nigam, M.; Bhadauria, A.S.; Arya, A.; Zhang, E.Y.-D. Human visual system based optimized mathematical morphology approach for enhancement of brain MR images. J. Ambient Intell. Humaniz. Comput. 2024, 15, 799–807. [Google Scholar] [CrossRef]
  5. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  6. Qin, H.; Wang, J.; Mao, X.; Zhao, Z.; Gao, X.; Lu, W. An improved faster R-CNN method for landslide detection in remote sensing images. J. Geovis. Spat. Anal. 2024, 8, 2. [Google Scholar]
  7. Yang, D.; Solihin, M.I.; Ardiyanto, I.; Zhao, Y.; Li, W.; Cai, B.; Chen, C. A streamlined approach for intelligent ship object detection using EL-YOLO algorithm. Sci. Rep. 2024, 14, 15254. [Google Scholar] [CrossRef] [PubMed]
  8. Li, Y.; He, L.; Zhang, M.; Cheng, Z.; Liu, W.; Wu, Z. Improving the Performance of the Single Shot Multibox Detector for Steel Surface Defects with Context Fusion and Feature Refinement. Electronics 2023, 12, 2440. [Google Scholar] [CrossRef]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–12. [Google Scholar]
  10. Chen, C.F.R.; Fan, Q.; Panda, R. CrossViT: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 357–366. [Google Scholar]
  11. Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision transformers for remote sensing image classification. Remote Sens. 2021, 13, 516. [Google Scholar] [CrossRef]
  12. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  13. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  14. Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; Weinberger, K.Q. Deep networks with stochastic depth. In Proceedings of the Computer Vision—ECCV 2016, 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 646–661. [Google Scholar]
  15. Dong, K.; Zhou, C.; Ruan, Y.; Li, Y. MobileNetV2 model for image classification. In Proceedings of the 2nd International Conference on Information Technology and Computer Application, Guangzhou, China, 18–20 December 2020; pp. 476–480. [Google Scholar]
  16. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  17. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  18. Targ, S.; Almeida, D.; Lyman, K. ResNet in ResNet: Generalizing residual architectures. arXiv 2016, arXiv:1603.08029. [Google Scholar]
  19. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  20. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, Louisiana, 21–24 June 2022; pp. 11976–11986. [Google Scholar]
  21. Mao, A.; Mohri, M.; Zhong, Y. Cross-entropy loss functions: Theoretical analysis and applications. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23 July 2023; pp. 23803–23828. [Google Scholar]
  22. Tan, M.; Le, Q. EfficientNetV2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
  23. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  24. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  25. Mehta, S.; Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
Figure 1. Structure of TIANet.
Figure 2. Structure of ISD.
Figure 3. SE algorithmic flow.
Figure 4. Structure of DSViT.
Figure 6. Types of glass bottles. (A) Front qualified; (B) Front unqualified; (C) Rear qualified; (D) Rear unqualified (the defects are marked in sub-figures (B,D)).
Figure 7. Defect classes of glass bottles. (a) Plaque; (b) Crack; (c) Contamination; (d) Scratches.
Figure 8. Confusion matrices of defective glass bottle classification results for various network structures. (a) EfficientNetV2; (b) Vision Transformer; (c) MobileNetV3; (d) ShuffleNet; (e) MobileViT; (f) TIANet.
Figure 9. Module ablation experiment results.
Figure 10. Analysis of drop probability in the path drop mechanism.
Figure 11. Analysis of ASPP module insertion position.
Table 1. Comparative experimental results of different models.

No. | Model              | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
1   | EfficientNetV2     | 60.357       | 60.985        | 60.341     | 59.682
2   | Vision Transformer | 63.571       | 66.051        | 63.553     | 62.104
3   | ShuffleNet         | 93.929       | 94.025        | 93.976     | 93.975
4   | MobileNetV3        | 91.857       | 93.163        | 92.125     | 92.014
5   | MobileViT          | 92.186       | 93.067        | 92.067     | 92.066
6   | Ours (TIANet)      | 95.714       | 95.903        | 95.903     | 95.882