ATNet: A Defect Detection Framework for X-ray Images of DIP Chip Lead Bonding

In order to improve the production quality and qualification rate of chips, X-ray nondestructive imaging technology has been widely used in the detection of chip defects, which represents an important part of the quality inspection of products after packaging. However, current traditional defect detection algorithms cannot meet the demands of high-accuracy, fast, real-time chip defect detection in industrial production. Therefore, this paper proposes a new multi-scale feature fusion module (ATSPPF) based on convolutional neural networks, which can more fully extract semantic information at different scales. In addition, based on this module, we design a deep learning model (ATNet) for detecting lead defects in chips. The experimental results show that, at 8.2 giga floating point operations (GFLOPs) and 146 frames per second (FPS), ATNet achieves mAP0.5 of 99.4% and mAP0.5–0.95 of 69.3%, while its detection speed is nearly 50% faster than the baseline yolov5s.


Introduction
There are three main stages in the production of chips: design, manufacturing, and packaging and testing [1,2]. Among them, testing is an important part that determines whether the chip can be put on the market or not. The frequent starting and stopping of electronic devices can aggravate the thermal cycling stress of the chip [3], which can lead to structural deformation, lead breakage, and even solder joint failure [4]. Defects such as missing and redundant chip solder joints, interconnection faults, and lead bonding fractures are the main targets of the testing stage. Detecting defects in chips is of great importance for improving the quality of chips and reducing production costs.
In order to better ensure the quality of the chip, a number of researchers and scholars have conducted extensive research on lead bonding defect detection. Pecht et al. [5] used an electromagnetic resonance technique instead of the conventional lead bond tension test to detect the quality of lead bonding. Luo et al. [6] obtained the vibration of the wire by micro force sensors and interpreted the vibration signal in the time, frequency, and phase domains to determine the integrity of the lead bonding. Feng et al. [7] evaluated bond quality using a time-frequency analysis of the electrical signal at the bond obtained from the ultrasonic generator. Kannan [8] proposed a technique based on the forced-resonance principle for detecting multilead bonding. However, these physical inspection methods have limitations. The main contributions of this paper are summarized as follows:
1. A novel ATSPPF module is introduced, which enables comprehensive feature extraction. This module effectively combines features from various scales and employs adaptive weighting in both the spatial and channel domains to enhance the expression of features;
2. Based on the ATSPPF module, an accurate and fast chip wire bonding defect detection framework, ATNet, was specifically designed to achieve the automated, rapid, and highly accurate detection of lead bonding defects;
3. On our dataset, the average detection accuracy (mAP0.5) achieved was an impressive 99.4%, while the detection speed reached 146 frames per second (FPS), outperforming other state-of-the-art networks, such as yolov5s and yolox.

Data Acquisition and Augmentation
Due to the special nature of chip inspection, CT equipment is used to obtain the internal defects of the chip. As shown in Figure 1, the chip is placed on a rotating platform, and X-rays pass through the chip; the X-ray intensity will be attenuated to different degrees, and the detection panel will receive the attenuated X-ray intensity. Finally, by leveraging the intensity information of the rays, an X-ray image is reconstructed to depict the internal structure of the chip accurately. Pre-processing the obtained images is an important step to improve the correct rate of detecting lead bonding defects. In order to obtain better image quality and reduce the effect of noise on subsequent detection, image denoising is performed on the acquired images. In this paper, we first implement a typical method based on a median filter to remove the pepper noise from the image. With this nonlinear filtering process, the noise caused by various signal transmission errors can be well removed, while the edge information of the lead is well preserved. The wire part of the image is subsequently cropped to obtain the region of interest (ROI), which can largely reduce the influence of the background on the recognition results.
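The median filtering step can be sketched in a few lines of pure Python; this is a didactic illustration of the idea, not the pre-processing code actually used in the paper.

```python
def median_filter_3x3(img):
    """Apply a 3x3 median filter to a 2D grayscale image (list of lists).

    Border pixels are copied unchanged; each interior pixel is replaced by
    the median of its 3x3 neighbourhood, which suppresses isolated
    salt-and-pepper noise while preserving edges.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(
                img[yy][xx]
                for yy in (y - 1, y, y + 1)
                for xx in (x - 1, x, x + 1)
            )
            out[y][x] = window[4]  # median of the 9 neighbourhood values
    return out

# A uniform region with one isolated "pepper" pixel: the filter removes it.
noisy = [[100] * 5 for _ in range(5)]
noisy[2][2] = 0  # noise pixel
clean = median_filter_3x3(noisy)
```

Because the median is taken over the neighbourhood rather than an average, a single outlier cannot pull the result away from the surrounding intensity, which is why edges survive this filter better than with a mean filter.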

As shown in Table 1, there are five types of lead chip bonding defects in this paper, namely: high loop, low loop, broken wire, sagged wire, and missing wire. Sample images of these defects and the process of extracting ROI regions are shown in Figure 2. The amount of data has a great impact on network training. Data expansion can give the network better generalization and robustness and can prevent over-fitting. Generally, image data can be expanded by random scaling, stretching, shearing, rotating, changing transparency and brightness, and other operations. In addition, some other scholars have proposed more unique data-enhancement methods, such as Mixup [20], Mosaic [21], and CutMix [22]. Therefore, to enhance the network's generalization and robustness, a random combination of random scaling, cutting, rotating, and changing transparency and brightness is used to expand the obtained images, and a lead defect dataset (Dataset-WBD) is constructed for training the network.
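The random combination of augmentations described above can be sketched as follows; the transforms here (brightness scaling, horizontal flip, 90° rotation) are simplified stand-ins for the paper's scaling/cutting/rotation/transparency operations.

```python
import random

def adjust_brightness(img, factor):
    """Scale pixel intensities, clipping to the 0-255 range."""
    return [[min(255, max(0, int(p * factor))) for p in row] for row in img]

def hflip(img):
    """Mirror the image left-to-right."""
    return [list(reversed(row)) for row in img]

def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def random_augment(img, rng):
    """Apply a random subset of transforms, mimicking the random
    combination of geometric and photometric augmentations in the text."""
    ops = [
        lambda im: adjust_brightness(im, rng.uniform(0.7, 1.3)),
        hflip,
        rotate90,
    ]
    for op in ops:
        if rng.random() < 0.5:  # each transform is applied with prob. 0.5
            img = op(img)
    return img

rng = random.Random(0)
img = [[10, 20], [30, 40]]
aug = random_augment(img, rng)
```

In practice each training image would be passed through `random_augment` several times with different seeds to multiply the dataset size.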

Object Detection
Currently, the majority of target detection networks can be categorized into three components: the backbone, neck, and detection head. The backbone part is used to extract the semantic features of the image; the neck part is to operate the features extracted at different stages of the backbone to make better use of the feature map, and the detection head is to detect the predicted object and the specific location of the object.
Backbone: The selection of an appropriate backbone network plays a crucial role in determining the overall performance of the network, so the selection of a backbone should conform to the characteristics of the dataset. At present, the main backbone networks include ResNet [23], MobileNet [24], Darknet 53 [25], Swin Transformer [26], VGG [27], etc. These backbone networks have fully verified their advantages in feature extraction in many experiments. On the basis of these backbone networks, researchers have fine-tuned them as their own backbone networks so as to make them more suitable for their own datasets.
Neck: Early networks, such as VGG, AlexNet [28], and Resnet, only used simple convolution operations to predict the final layer of the feature extraction network. However, the deep network is not sensitive to small target objects, so it often has no good effect on detecting small target tasks. In order to effectively avoid this problem, Lin et al. [29] developed an architecture with a horizontal connection and a top-down channel (FPN structure), which is used to fuse adjacent feature maps, make information flow between adjacent layers, form multi-scale feature maps, and enhance feature extraction. Szegedy et al. [30] pointed out that more network branches would reduce the parallelism of the model. Therefore, the recognition speed of the network will be reduced to a certain extent after the introduction of the FPN structure, but the detection accuracy will be greatly improved. Liu et al. [31] introduced a new bottom-up aggregation path based on the FPN structure, which enhances and shortens the information path between the top and bottom layers. In addition to the several different neck layer structures and methods mentioned above, there are also methods of path aggregation, such as BiFPN [32] and ASFF [33].
Head: This part recognizes the feature map of the target extracted by the Neck part. At present, there are two main types of head detector: one-stage and two-stage. The one-stage detector predicts directly on the feature map, while the two-stage detector utilizes a region proposal network (RPN) to select candidate regions on the feature map and then identifies and locates those candidate regions. The two-stage detector can generally obtain more accurate results, but the detection process requires the RPN to select specific areas, which leads to a slow detection speed. The one-stage detector offers a faster detection speed than the two-stage detector, but it may result in relatively lower detection accuracy.

The Position of the Target Object
In yolov5, each grid cell in the feature maps of the prediction head generates three prediction boxes of different scales to predict the exact location of the target, so an object in the original image is covered by multiple prediction boxes at the same time. To select the most suitable prediction box, the box with the largest intersection over union (IoU) with the ground truth is chosen as the predicted target location. However, when there is no intersection between the real box and the prediction box, the IoU loss is 0, the gradient cannot be backpropagated, and training is not possible. Moreover, since the real box may have a certain tilt angle, boxes with the same IoU cannot be distinguished, making it impossible to select the most suitable prediction box. Rezatofighi [34] proposed GIoU on the basis of IoU: it computes the smallest region enclosing both the real box and the predicted box, calculates the proportion of that region belonging to neither box, and subtracts this proportion from the IoU. GIoU not only reflects the overlap between the real and predicted boxes but also pays attention to the non-overlapping region, so it better characterizes the relationship between the two. Zheng et al. [35] proposed DIoU, which converges faster because it directly minimizes the distance between the centers of the two boxes and provides a direction for the movement of the prediction box. Zheng further introduced a weight parameter to extend DIoU to CIoU, which additionally penalizes the difference in width and height between the predicted box and the real box, thereby optimizing overall performance. To optimize the training results, CIoU is used as the position loss of the prediction box in this paper. CIoU is computed as follows:

CIoU = IoU − ρ²(box₁, box₂)/c² − αυ

where ρ²(box₁, box₂) denotes the squared Euclidean distance between the centroids of box₁ and box₂, and c denotes the diagonal distance of the smallest closed region that can contain both box₁ and box₂. The aspect-ratio term υ, its weight α, and the loss LOSS_CIoU are defined as:

υ = (4/π²)(arctan(ω_gt/h_gt) − arctan(ω/h))²
α = υ/((1 − IoU) + υ)
LOSS_CIoU = 1 − CIoU

where ω_gt, h_gt, ω, and h represent the width and height of the real box and the width and height of the prediction box, respectively.
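The CIoU loss described above can be written as a short Python function; this is a didactic implementation following the standard CIoU definition, not the authors' code.

```python
import math

def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss for axis-aligned boxes given as (x1, y1, x2, y2):
    LOSS_CIoU = 1 - IoU + rho^2 / c^2 + alpha * v."""
    x1, y1, x2, y2 = pred
    X1, Y1, X2, Y2 = target
    # Intersection and union areas
    iw = max(0.0, min(x2, X2) - max(x1, X1))
    ih = max(0.0, min(y2, Y2) - max(y1, Y1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (X2 - X1) * (Y2 - Y1) - inter
    iou = inter / (union + eps)
    # Squared distance between box centroids (rho^2)
    rho2 = ((x1 + x2 - X1 - X2) ** 2 + (y1 + y2 - Y1 - Y2) ** 2) / 4
    # Squared diagonal of the smallest enclosing box (c^2)
    c2 = (max(x2, X2) - min(x1, X1)) ** 2 + (max(y2, Y2) - min(y1, Y1)) ** 2
    # Aspect-ratio consistency term v and its weight alpha
    v = (4 / math.pi ** 2) * (
        math.atan((X2 - X1) / (Y2 - Y1)) - math.atan((x2 - x1) / (y2 - y1))
    ) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / (c2 + eps) + alpha * v
```

Note that for two disjoint boxes the loss stays above 1 and still carries a gradient through the center-distance term, which is exactly the property that plain IoU loss lacks.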

Activation Functions
In artificial neural networks, the activation function plays a very important role, similar to the neuron model of the human brain, where the activation function ultimately decides what to send to the next neuron. Its main role is to add a nonlinear operation to all hidden and output layers, enhancing the intricacy and expressiveness of the neural network's output. Commonly used activation functions are Sigmoid, Tanh, ReLU, LeakyReLU, SELU, and SiLU. In this paper, we mainly use two activation functions, LeakyReLU and SiLU, whose expressions and corresponding first-order derivatives f′(x) are as follows.
The SiLU activation function, where σ(x) is the sigmoid activation function and x is the input:

SiLU(x) = x·σ(x),  SiLU′(x) = σ(x)·(1 + x·(1 − σ(x)))

The LeakyReLU activation function, where α is a constant with a value of 0.01:

LeakyReLU(x) = x (x > 0), αx (x ≤ 0);  LeakyReLU′(x) = 1 (x > 0), α (x ≤ 0)
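These definitions can be checked numerically with a minimal Python sketch:

```python
import math

def sigmoid(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    """SiLU(x) = x * sigmoid(x)."""
    return x * sigmoid(x)

def silu_grad(x):
    """First derivative: sigma(x) * (1 + x * (1 - sigma(x)))."""
    s = sigmoid(x)
    return s * (1 + x * (1 - s))

def leaky_relu(x, alpha=0.01):
    """LeakyReLU with slope alpha on the negative side."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Derivative: 1 for x > 0, alpha otherwise."""
    return 1.0 if x > 0 else alpha
```

A finite-difference check of `silu_grad` against `silu` confirms the closed-form derivative.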

Methodology
The Yolov5 framework utilizes C3 as the backbone layer, which enables it to recognize more intricate features. In this study, we propose the CGC module, building upon the structure of C3. Additionally, we introduce a novel feature fusion method called ATSPPF. This method adaptively assigns weights based on the contributions from spatial and channel dimensions, enhancing the network's sensitivity to valuable spatial and channel information. By incorporating these advancements, the network becomes more proficient at identifying lead bonding defects across multiple scales. Yolov3-tiny, known for its fast detection speed due to its low computational requirements and simple network structure, excels in swiftly detecting simple features in industrial settings. Hence, we combine the network architecture of yolov3-tiny with the CGC module and ATSPPF module to propose ATNet, a network specifically designed for lead bonding defect detection. Figure 3 illustrates the structure of ATNet. Detailed descriptions of the CGC module and ATSPPF module proposed in this study will follow.

CGC Module
The C3 structure consists of three convolutional operations followed by a bottleneck layer. The CGC module, by contrast, replaces the bottleneck layer in C3 with Ghostconv. Additionally, one of the convolution operations in the branch is eliminated and replaced with a residual connection. Experiments on Dataset-WBD showed that the CGC module improves mAP0.5 and mAP0.5–0.95 by 0.5% and 1.5%, respectively, compared to the C3 module.
Han et al. [36] pointed out that, during feature extraction, a significant portion of redundant and duplicated features can occur if the number of channels is too large. Therefore, to achieve a better balance between extracting enough features and computational cost, the Ghostconv module was proposed. Compared with standard convolution, ghost convolution results in a substantial decrease in both computational workload and parameter quantity, while the extracted features are not much different from those of normal convolution. In this paper, we construct the CGC module based on a similar idea to Han's. As illustrated in Figure 4, the CGC module uses a residual structure. Firstly, the number of channels is reduced to c/2 using a standard 3 × 3 convolution. The result is input into Ghostconv, and then a 1 × 1 convolution expands the number of channels to align with the dimensions of the original input feature map. This enables an addition operation between the original input feature map and the resultant feature map. The inclusion of the residual connection guarantees the stability of the network.

In order to address the problems of gradient vanishing, gradient explosion, and overfitting that may be caused by an overly deep network, a residual connection is introduced into the CGC module, which also enhances the network's capacity for learning.
The residual connection can be represented by the following equation:

H_(l+1) = H_l + F(H_l, w)

where H_l denotes the input, H_(l+1) denotes the output, F(x, w) is the nonlinear transformation of the input, and w denotes the weight parameters of the function F.
By incorporating the Ghostconv module and residual structure into the CGC module, several benefits are achieved. These include significant reductions in computational requirements, the acquisition of ample features, and ensuring network stability.
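Assuming standard 3 × 3 and 1 × 1 convolutions with BatchNorm and SiLU (the normalization and activation choices here are our assumptions, since the text does not spell them out), a PyTorch sketch of the CGC block might look as follows:

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost convolution (after Han et al.): generate half the output
    channels with a standard conv and the other half with a cheap
    depthwise conv applied to the first half."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.cheap = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, padding=2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class CGC(nn.Module):
    """Sketch of the CGC block described in the text: 3x3 conv to c/2
    channels -> GhostConv -> 1x1 conv back to c channels -> residual add."""
    def __init__(self, c):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(c, c // 2, 3, padding=1, bias=False),
            nn.BatchNorm2d(c // 2), nn.SiLU())
        self.ghost = GhostConv(c // 2, c // 2)
        self.expand = nn.Conv2d(c // 2, c, 1, bias=False)

    def forward(self, x):
        # Residual connection: H_{l+1} = H_l + F(H_l, w)
        return x + self.expand(self.ghost(self.reduce(x)))

x = torch.randn(1, 64, 32, 32)
y = CGC(64)(x)
```

The residual add requires the output of `expand` to match the input channel count, which is why the 1 × 1 convolution restores the original c channels.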

ATSPPF Module
VoVNet [37] incorporates the one-shot aggregation (OSA) module as its backbone, which enhances both the recognition speed and accuracy of the network when compared to the baseline DenseNet. Lee et al. [38] introduced residual connectivity and an effective squeeze-excitation attention module (eSE) based on OSA, which avoids performance degradation due to excessive network depth while using eSE to further enhance the features. Inspired by this, a new spatial scale fusion module (ATSPPF) is proposed in this paper, and its specific structure is shown in Figure 5. After performing standard convolution operations on the input features, we utilize three max pooling operations with kernel sizes of 5 × 5, 9 × 9, and 13 × 13 to extract feature information at various scales. These pooling operations ensure that features across different scales are preserved. The resulting feature maps from convolution and pooling are then stacked along the channel dimension to avoid any loss of feature information. Finally, a standard 1 × 1 convolution is employed to reduce the number of channels from 4C to C.

Compared with the classification task, the sizes of the features to be recognized in the target detection task vary, so using feature maps of different scales is more beneficial for detection.
In order for the network to enhance its focus on valuable channel and spatial information and selectively ignore less important feature channels, the CBAM attention mechanism module is introduced in parallel at the end. Then, the resulting 2C channels are again reduced to C for the subsequent residual connection, and finally, the residual connection is used to ensure the stability of the network. Table 2 presents the detailed structure of the ATSPPF module.
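A simplified PyTorch sketch of the ATSPPF pooling-and-fusion path follows; the CBAM attention branch and the residual connection are omitted here for brevity, and the normalization and activation choices are our assumptions.

```python
import torch
import torch.nn as nn

class ATSPPFCore(nn.Module):
    """Pooling-and-fusion path of ATSPPF as described in the text:
    a conv, three parallel max-pools (5/9/13) that preserve spatial size,
    channel-wise concatenation to 4C, and a 1x1 conv back to C channels."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.SiLU())
        # stride=1 with padding k//2 keeps H and W unchanged for odd k
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in (5, 9, 13))
        self.fuse = nn.Conv2d(4 * c, c, 1, bias=False)

    def forward(self, x):
        x = self.conv(x)
        feats = [x] + [p(x) for p in self.pools]  # 4 maps of C channels each
        return self.fuse(torch.cat(feats, dim=1))  # 4C -> C

x = torch.randn(1, 32, 20, 20)
y = ATSPPFCore(32)(x)
```

Because the pools use stride 1 and same-padding, all four branches stay spatially aligned, so the concatenation loses no positional information.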

Feature Map
Different feature maps have different receptive fields and different sensitivities to lead bonding defects of the same size. Yolov5s eventually predicts feature maps of 19 × 19, 38 × 38, and 76 × 76 in size. Among them, the 76 × 76 feature map is more suitable for detecting small targets, while for medium and large targets, feature maps of 19 × 19 and 38 × 38 are more effective. As illustrated in Figure 6, the feature maps are 76 × 76, 38 × 38, and 19 × 19 from left to right, respectively. For the 76 × 76 feature map, the resulting anchor box is too small to completely contain the defective parts, while the 19 × 19 and 38 × 38 feature maps give better recognition. Therefore, in order to reduce the complexity and computational effort of the network, we use feature maps of 13 × 13 and 26 × 26 in size based on yolov3-tiny. Figure 6 shows the three feature scales and corresponding anchor sizes of yolov5s; the anchors in the figure are not drawn in proportion to the real anchor box sizes and are only intended to show that anchors of different sizes affect whether objects can be detected.
Architecture: Based on the CGC and ATSPPF modules, this paper proposes an ATNet network with a strong feature-extraction ability and detection speed. The ATSPPF module is integrated at the end of the backbone network to facilitate the processing of significant information and bring the enhanced features closer to the output layer, resulting in improved accuracy for recognition outcomes.


Experimental Setup
We implement our network using the PyTorch deep learning framework and train it on an NVIDIA GeForce RTX 3050 (4 GB) graphics card and an Intel 2.30 GHz i7-11800H CPU. Dataset-WBD is divided into a training set and a test set in proportions of 90% and 10%. The training set is employed to optimize network parameters by minimizing the loss function, while the test set is utilized to assess the trained network's accuracy in detecting wire bonding defects. The network is trained with the Stochastic Gradient Descent (SGD) optimizer and a linear decay learning rate scheduling strategy, in which the learning rate is initially set to 0.01 and gradually reduced to 0.001. We set the batch size to 8, the momentum parameter to 0.937, and the weight decay rate to 0.0005. The input image is uniformly resized to 640 × 640 and normalized. The anchor dimensions were set to [10,14], [23,27], and [37,58].
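The optimizer and linear learning rate decay described above can be reproduced in PyTorch roughly as follows; the `nn.Linear` model is a hypothetical stand-in for ATNet, and the loop body elides the actual forward/backward pass.

```python
import torch

# Hypothetical stand-in model; only the optimiser/scheduler wiring is shown.
model = torch.nn.Linear(10, 5)

epochs = 100
opt = torch.optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.937, weight_decay=0.0005)
# Linear decay of the learning rate from 0.01 down to 0.001 over training:
# the multiplicative factor goes from 1.0 at epoch 0 to 0.1 at the last epoch.
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda e: 1.0 - 0.9 * e / (epochs - 1))

lrs = []
for _ in range(epochs):
    opt.step()  # the per-epoch training loop would go here
    lrs.append(opt.param_groups[0]["lr"])
    sched.step()
```

With this schedule the first recorded learning rate is 0.01 and the last is 0.001, matching the setup described in the text.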

Evaluation Criterion
In the task of chip defect target detection, the Intersection over Union (IOU) is employed to assess whether a detected result corresponds to a true defect. If the IOU value surpasses the predetermined threshold (usually set at 0.5), it is classified as a positive sample; otherwise, it is regarded as a negative sample. In the target detection task, precision and recall are crucial metrics for evaluating the performance of network recognition, which are defined as follows: where TP (true positive) refers to the number of true defects detected; FP (false positive) is the number of incorrectly predicted defects, which are not defects; and FN (false negative) indicates the number of actual defects missed. Note that (TN) (true negative) indicates the number of samples correctly predicted to be negative. For the prediction frames detected by the network, the discrimination between TP or FP is determined by the ratio of the overlap area between the prediction box and the true box, which is defined as follows.
γ = (A ∩ B) / (A ∪ B)

where γ is the overlap ratio, A ∩ B is the intersection area of the ground-truth box and the predicted box, and A ∪ B is the area of their union. The threshold is set to 0.5 in this experiment. mAP0.5 is the mean average precision at an IoU threshold of 0.5, and mAP0.5-0.95 is the mean average precision averaged over IoU thresholds from 0.5 to 0.95 in increments of 0.05. In this paper, mAP0.5 and mAP0.5-0.95 are used to judge the comprehensive ability of the model over all categories, while GFLOPs (time complexity) and the number of parameters (space complexity) are used to represent the differences between methods.
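The metrics above can be computed directly; a small sketch (the corner-format box representation (x1, y1, x2, y2) is an assumption):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

A prediction box is counted as TP when `iou(pred, gt) > 0.5` under the threshold used here, and as FP otherwise.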

Ablation Studies
We conducted ablation experiments to assess the benefits of the CGC module and the ATSPPF module introduced in this paper on the ATNet network. The results are presented in Table 3. Experiment 2 is 0.1% and 1% higher than Experiment 1 in mAP0.5 and mAP0.5-0.95, respectively, while the detection speed also improves by nine FPS. In Experiment 3, adding the ATSPPF module raises mAP0.5-0.95 by 1.8% compared to Experiment 1. In Experiment 4, introducing both the CGC and ATSPPF modules brings mAP0.5 and mAP0.5-0.95 to 99.4% and 69.4%, respectively, at a detection speed of 146 FPS. For real-time and accurate detection of wire bonding defects in production, a comprehensive comparison shows that the combination of Experiment 4 best meets the requirements. An attention mechanism is incorporated into the ATSPPF module. Three attention modules are in common use: SE, CA, and CBAM. As shown in Table 4, after introducing each of the three attention modules into the ATSPPF module, CBAM yields the highest detection result. Therefore, we use CBAM as the attention module in ATSPPF.
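For reference, CBAM's two-step attention (channel attention followed by spatial attention) can be sketched in PyTorch as follows; this is a generic reconstruction of the published CBAM design, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Reweight channels using a shared MLP over avg- and max-pooled features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    """Reweight spatial positions using channel-wise avg and max maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class CBAM(nn.Module):
    """Channel attention followed by spatial attention; shape-preserving."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)
    def forward(self, x):
        return self.sa(self.ca(x))
```

Because CBAM preserves the feature-map shape, it can be dropped into the ATSPPF module without changing the surrounding layer dimensions.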

Comparison of State-of-the-Art Models
Here, we benchmark our approach against state-of-the-art defect detection methods. Table 5 lists the different performance indicators (mAP0.5, FPS, GFLOPs). It can be seen from Table 5 that the method proposed in this paper has clear advantages in precision and detection speed over the other methods: mAP0.5 reaches 99.4% at a detection speed of 146 FPS. Faster R-CNN-ResNet50, Dynamic RCNN-ResNet50, and RetinaNet-ResNet50 all run at less than 30 FPS, far from the requirement of real-time wire bonding detection, and their large GFLOPs also impose high hardware requirements. Compared with YOLOv3-tiny, which has a similar network structure, the two methods have similar detection accuracy and speed, but our method is much smaller in both model size and computational complexity (GFLOPs), and its detection speed is nearly 50% higher than the baseline yolov5s. The effectiveness of our method in detecting defects is illustrated in Figure 7, which depicts the detection results of our approach on the test set of Dataset-WBD. In addition, to validate the effectiveness of the ATSPPF module, the module was introduced into the backbone tail of other networks for training, and the results are shown in Table 6. From the table, it is apparent that introducing the ATSPPF module improves the Recall and mAP0.5-0.95 of the yolov3 network by 0.8% and 1.1%, respectively. In Faster-RCNN, although there is a small decrease in Recall, the precision is improved by 0.3%, and both mAP0.5 and mAP0.5-0.95 show larger improvements. For the different SSD backbone networks, almost all four indicators improve to varying degrees.
In addition, for the three networks YoloR [39], LFyolo [40], and Scaled_yolov4, the overall detection performance is also improved by introducing the ATSPPF module at the end of the feature extraction network. This shows that ATSPPF uses a pooling operation to fuse receptive fields of different scales and introduces an attention mechanism, which extracts multi-scale information from the image and gives the feature map a richer representational ability, thus effectively improving the detection performance of the model. In the network architectures of YOLOv5 and YOLOv3, both models incorporate three detection heads. This requires the network to spend additional time on the positional regression of targets detected by the different prediction heads. Moreover, an object may be detected simultaneously by two detection heads, significantly slowing the convergence of the network.
Therefore, as depicted in Figure 8, ATNet, which incorporates two detection heads, achieves faster convergence than YOLOv5 and YOLOv3 across three metrics: mAP0.5, recall, and precision. This reflects, from another perspective, the effectiveness of the ATSPPF module and of the network structure we have designed. To validate the generalization performance of our method, we trained on the publicly available NEU-DET dataset [41] and compared the experimental results, shown in Table 7. The NEU-DET dataset has more complex defect types and is more difficult to detect. Although Yolov3-tiny has a similar structure to our method, Table 7 shows that our method is 19.4% and 15% higher than Yolov3-tiny in mAP0.5 and mAP0.5-0.95, respectively. It also outperforms the other networks on this dataset, demonstrating the superior performance of our approach.
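The ATSPPF behavior described above, pooling-based fusion of receptive fields at different scales followed by an attention step, can be sketched in PyTorch. The paper does not reproduce the module's exact layer layout here, so the following is a hypothetical reconstruction: SPPF-style cascaded 5 × 5 max-pools plus a simple channel gate standing in for CBAM; all names and channel widths are assumptions.

```python
import torch
import torch.nn as nn

class ATSPPFSketch(nn.Module):
    """Hypothetical sketch of an ATSPPF-like block: cascaded 5x5 max-pools
    (effective receptive fields 5/9/13) fused by concatenation, then a
    channel-attention gate (the paper uses CBAM at this step)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_mid, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.fuse = nn.Conv2d(c_mid * 4, c_out, kernel_size=1)
        self.gate = nn.Sequential(          # simple stand-in for CBAM
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c_out, c_out, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = self.reduce(x)
        p1 = self.pool(x)    # receptive field 5x5
        p2 = self.pool(p1)   # effective 9x9
        p3 = self.pool(p2)   # effective 13x13
        y = self.fuse(torch.cat([x, p1, p2, p3], dim=1))
        return y * self.gate(y)  # attention-weighted fusion result
```

Because stride-1 pooling with matching padding preserves spatial size, the block can sit at the backbone tail of other detectors, as in the Table 6 experiments, without altering downstream feature-map shapes.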

Conclusions
Wire bonding defect detection after chip packaging is essential to reduce production costs, improve chip life, and ensure the normal operation of equipment. Therefore, this paper proposes an automatic real-time chip defect detection method based on a convolutional neural network. We designed two modules, CGC and ATSPPF, and, based on the highly efficient network structure of yolov3-tiny, proposed the ATNet chip defect detection network using these two modules. The experimental results indicate that the mAP0.5 and detection speed of ATNet reach 99.4% and 146 FPS, respectively, well above our baseline yolov5s. The GFLOPs and model size are also reduced by 48.75% and 35.82%, respectively, and the detection speed is improved by nearly 50%. At the same time, in terms of mAP0.5, recall, and precision, ATNet converges faster than yolov5s. In the future, we will continue to optimize the algorithm to achieve higher accuracy, faster detection speed, and lower model complexity.

Data Availability Statement:
The raw/processed data and modeling codes required to reproduce these findings cannot be shared at this time as the data also forms part of an ongoing study.