A Novel Electronic Chip Detection Method Using Deep Neural Networks

Abstract: Electronic chip detection is widely used in electronic industries. However, most existing detection methods cannot handle chip images that contain multiple classes of chips or complex backgrounds, both of which are common in real applications. To address these problems, a novel chip detection method that combines attentional feature fusion (AFF) and cosine nonlocal attention (CNLA) is proposed. It consists of three parts: a feature extraction module, a region proposal module, and a detection module. The feature extraction module combines an AFF-embedded CNLA module and a pyramid feature module to extract features from chip images. The detection module enhances feature maps with a region intermediate feature map through a spatial attentional block, fuses multiple feature maps with a multiscale region-of-interest fusion block, and classifies and regresses objects in images with two branches of fully connected layers. Experimental results on a medium-scale dataset comprising 367 images show that the proposed method achieved $mAP_{0.5} = 0.98745$ and outperformed the benchmark method.


Introduction
Electronic chip assembly is a key link in electronic manufacturing, and its task is to place and solder chips onto printed circuit boards (PCBs). After assembly, the chip and PCB are combined into an electronic product. In this process, the placement error, that is, the distance between the real and ideal positions of a chip, can cause functional defects in electronic products. With chip detection methods, electronic products with large placement errors can be found as early as possible. Machine vision techniques can detect chip positions without damaging electronic products. Therefore, chip detection methods based on machine vision play an important role in electronic industries.
Our electronic chip detection method was designed to estimate the class and location of chips on a PCB. It is implemented by an electronic chip detection system that integrates various electronic parts. As shown in Figure 1, the detection system consists of four parts: a PCB conveyor, an image capture module (including camera, lens, and light source), an x-y moving module (not shown in Figure 1 for simplicity), and an industrial PC. The PCB to be inspected is first transferred to the center of the chip detection system by the conveyor; the image capture module is then moved to several predefined positions and takes pictures of the PCB; lastly, these pictures are processed by the industrial PC to classify and locate chips. Three examples of PCB images are illustrated in Figure 2, and their characteristics are as follows: (1) there are multiple chips in one picture; (2) the background of a PCB image is complex, including pins, pads, flame retardant layer, and silk screen; (3) the size, color, and other characteristics of chips vary greatly.

Related Work
Many machine-vision-based electronic chip detection methods have been proposed to classify and locate chips in images. Crispin [1] proposed a normalized cross-correlation (NCC) template-matching approach that reduces computational cost by constraining the search space and optimizes the search strategy for template positions using a genetic algorithm. A three-step algorithm for light-emitting diode (LED) chip localization was proposed by Zhong [2]. First, the positions of potential chips were extracted by applying an image segmentation and blob analysis method; then, orientations of potential chips were predicted on the basis of dominant orientations; lastly, chips were precisely located using gradient orientation features according to the predicted positions and orientations. Gao [3] proposed a novel algorithm to inspect ball grid array (BGA) component defects: first, a grayscale image of solder balls was extracted with an adaptive thresholding algorithm with modified ( , δ)-component segmentation; then, the ball array was generated with a line-based-clustering method; lastly, the precise position and orientation of the BGA were estimated from the recognition results. Wang [4] analyzed the main cause of errors in chip detection by applying a two-step calibration algorithm. Zhong [5] proposed a three-step algorithm to exclude polycrystalline and fragmentary LED chips. First, blobs were obtained from an image with a simple but efficient image segmentation algorithm; second, abnormal blobs were excluded, and the position and orientation of a potential object were predicted on the basis of the pose of the minimal enclosing rectangle of each candidate blob; lastly, precise LED chips in the originally captured image were located on the basis of gradient orientation features. Bai [6] studied an online component positioning problem based on corner points that incorporated preprocessing, coarse positioning, and fine positioning stages. In the preprocessing stage, Harris corners and subpixel corner points were extracted from images of real components. The coarse positioning step used distance and shape feature matching methods to compute correct correspondences between key points and Harris corner points. Lastly, the coarse and fine positioning problems were formulated as least-squares error problems.
With the development of deep-learning theory, object detection methods have recently achieved great success [7][8][9]. Object detection methods based on deep learning can generally be divided into two categories: one- and two-stage methods. One-stage object detection methods directly embed location and classification subnetworks into the front of a main backbone network. Typical one-stage object detection methods include YOLO [10][11][12], SSD [13], and DSSD [14]. Two-stage object detection methods introduce region proposal networks to predict candidate bounding boxes, and then estimate class and location from these bounding boxes with multilayer fully connected networks. Typical two-stage object detection methods include R-CNN [15][16][17] and ThunderNet [18]. In general, one-stage methods are faster, whereas two-stage methods are more precise.
Deep-learning-based detection methods are obtained by training a multilayer convolutional neural network on training data; thus, the design of neural network architectures that learn features from training data is a research hotspot. Lin [19] designed a two-pathway architecture that contained a top-down and a bottom-up pathway to extract multiscale hierarchical feature maps. Qin [18] proposed a lightweight architecture to realize real-time object detection. To enrich feature representation, several blocks were introduced in that network, such as the context enhancement module (CEM) and spatial attention module (SAM). Liu [20] proposed a context embedding object detection network to detect concealed objects in millimeter wave images. In the context embedding object detection network, backbone features are attached to three parallel branches with dilation sizes of 3, 6, and 12 to form the context embedding module and to incorporate surrounding information. Fang [21] fused the semantic object feature extraction module (Conv2dNet), the spatiotemporal feature extraction module (Conv3DNet), and the saliency feature-sharing module to generate the final saliency map for real-time video processing. Wang [22] combined dual-branch feature extraction and a gradually refined cross-fusion module in a network for camouflaged object detection. Gu [23] assembled an X-ray proposal network that applies data augmentation to enlarge input image datasets, and an X-ray discriminative network that fuses region of interest (ROI) feature maps from several levels for baggage inspection. A bidirectional attention feature pyramid network with cosine similarity was proposed for photovoltaic cell defect detection [24]. Table 1 lists the advantages of existing methods. For chip detection, however, these methods have two disadvantages: (1) they cannot detect multiple chips at the same time, and (2) they cannot process chip images with a complex background. These drawbacks render them unsuitable for real applications.
To solve these problems, we propose a novel chip detection method motivated by [18,23,25].

Table 1. Advantages of existing methods.

| Category | Method | Reference | Advantages |
|---|---|---|---|
| Computer-vision-based methods | FTM | [2] | Fast template-matching method applied to LED chip localization. |
| Computer-vision-based methods | LBC | [3] | Line-based clustering approach applied to BGA component localization. |
| Computer-vision-based methods | VF | [4] | Main cause of errors in chip detection was analyzed. |
| Deep-learning-based methods | FPN | [19] | Feature-pyramid-based feature extraction introduced into object detection. |
| Deep-learning-based methods | ThunderNet | [18] | Context-enhancement and spatial-attention modules introduced into object detection. |
| Deep-learning-based methods | COB | [20] | Context-embedding module introduced into concealed object detection from millimeter wave images. |
| Deep-learning-based methods | D2C-Net | [22] | Dual-branch feature extraction and gradually refined cross-fusion module fused for camouflaged object detection. |
| Deep-learning-based methods | XRBI | [23] | X-ray proposal and X-ray discriminative networks assembled for baggage inspection. |
| Deep-learning-based methods | PCDD | [24] | Bidirectional attention feature pyramid network introduced for photovoltaic-cell defect detection. |

Proposed Methodology
The main aim of the proposed electronic chip detection method is to classify and locate chips in images. To overcome the challenges described above, namely multiple chip classes and complex backgrounds, a novel chip detection method is proposed in this work. Its methodology is composed of three steps: (1) AFF-embedded CNLA and pyramid-feature modules are combined to extract multiscale pyramid feature maps from chip images; (2) candidate bounding boxes are proposed in the region proposal module (RPM); (3) region intermediate feature maps are fused into enhanced feature maps by the spatial attentional block, and chip class and location are estimated by two branches of fully connected layers. The overall structure of the proposed chip detection method is shown in Figure 3.

Feature Extraction Module
In traditional deep-learning-based object detection methods, a multilayer framework that contains a series of convolutional layers is used to extract high-level features from images. In this framework, each layer takes the output of the layer below as input and passes its own output to the layer above. The input of the lowest layer is the raw image, and the output of the highest layer is used as the final feature for the detection module. To reduce memory usage, convolutional layers in the feature extraction framework apply strides to shrink the feature map. This multilayer structure is able to learn feature extraction from large-scale training datasets, and its performance exceeds that of handcrafted feature extraction methods. However, several research works [23,26] revealed that multilayer feature extraction cannot capture semantic and location information at the same time: semantic information is concentrated in the upper layers rather than the lower layers, and the opposite holds for location information. As shown in Figure 3, an improved feature pyramid framework was applied to extract image features in our work. Similar to feature pyramid networks (FPNs) [19], the feature extraction module (FEM) consists of two pathways: the bottom-up and top-down pathways. In the proposed improved feature pyramid framework, the bottom-up pathway was designed to extract hierarchical features; hence, the conventional multilayer convolutional structure ResNet was employed. It is composed of five stages, and every layer of the same stage has the same output size. To save memory, the output of the first stage was ignored, and the outputs of the remaining stages $\{C_2, C_3, C_4, C_5\}$ were chosen to form the reference set.
Both semantic and location information are essential for object detection, but they are distributed across different layers of the bottom-up pathway. To combine this information, features from different layers must be fused in the top-down pathway. In the highest layer of the top-down pathway, $p_5$, the highest layer of the bottom-up pathway, $C_5$, was attached to a 1 × 1 convolutional layer. In the other layers of the top-down pathway, $\{p_i, i = 2, \ldots, 4\}$, a building block was applied. The building block is illustrated in Figure 4: the feature from the same layer of the bottom-up pathway, $C_i$, was attached to a 1 × 1 convolutional layer, the feature from the higher layer of the top-down pathway, $p_{i+1}$, was attached to a 2× upsampling layer, and these two features were then fused into feature $p_i$ with the AFF-embedded CNLA block.

The AFF block [25] was designed to combine two input features $F_1$ and $F_2$. To improve detection performance, global feature context (GFC) and local channel context (LCC) were both taken into account in the AFF block.
Consider a feature $F \in \mathbb{R}^{C \times H \times W}$ with $C$ channels, height $H$, and width $W$. The GFC is defined as follows:

$$GFC(F) = BN(W_2 \cdot ReLU(BN(W_1 \cdot gap(F)))) \tag{1}$$

where $W_1$ and $W_2$ are two learnable parameters. BN in Equation (1) denotes batch normalization, which for a minibatch of $m$ activations $x_1, \ldots, x_m$ is computed as:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i \tag{2}$$

$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2 \tag{3}$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \tag{4}$$

$$BN(x_i) = \gamma \hat{x}_i + \beta \tag{5}$$

where $\gamma$ and $\beta$ are two learnable parameters, and $\epsilon$ is a small positive value. ReLU in Equation (1) denotes the rectified linear unit (ReLU) [28], an activation function for convolutional networks. ReLU is a nonlinear function that directly outputs the input value if it is positive; otherwise, it outputs zero. Mathematically, the ReLU is defined as follows:

$$ReLU(x) = \max(0, x) \tag{6}$$

$gap(F)$ in Equation (1) is the global average pooling of feature $F$:

$$gap(F) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{[:, i, j]} \tag{7}$$

The LCC of $F$ is defined as:

$$LCC(F) = BN(PWConv_2(ReLU(BN(PWConv_1(F))))) \tag{8}$$

where PWConv is pointwise convolution that uses a 1 × 1 kernel to aggregate channel context for each spatial position.
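As a small illustration of the three primitives named in this passage (batch normalization, ReLU, and global average pooling), the following is a minimal sketch in plain Python rather than the authors' PyTorch implementation. The function names, the nested-list tensor layout, and the default values of `gamma`, `beta`, and `eps` are assumptions made for the example.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of activations, then scale by gamma and shift by beta.

    gamma, beta, and eps are illustrative defaults, not values from the paper.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def relu(x):
    """Return x if it is positive, otherwise zero."""
    return max(0.0, x)


def gap(feature):
    """Global average pooling of a C x H x W nested list: one mean per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]


# Toy feature with C = 2 channels and H = W = 2.
feature = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[-1.0, 0.0], [0.0, 1.0]],
]
pooled = gap(feature)
normalized = batch_norm([1.0, 2.0, 3.0])
```

In this toy case, `gap` collapses each 2 × 2 channel to a single value, which is the per-channel statistic that a global context branch would operate on.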
With $GFC(F)$ and $LCC(F)$ in Equations (1) and (8), the attentional weight (AW) of feature $F$ in Figure 5 is defined as:

$$AW(F) = sigmoid(GFC(F) \oplus LCC(F)) \tag{9}$$

where $\oplus$ is the broadcasting addition that adds scalars to higher-dimensional tensors.
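The broadcasting addition can be made concrete with a short sketch. It assumes, following the AFF design in [25], that the GFC contributes one scalar per channel, the LCC contributes a full C × H × W map, and a sigmoid squashes their broadcast sum into (0, 1). The function names and the toy values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attentional_weight(gfc, lcc):
    """Broadcast-add per-channel scalars (gfc) to a C x H x W map (lcc).

    Assumption: a sigmoid is applied to the sum, as in the AFF design of [25].
    """
    return [
        [[sigmoid(g + v) for v in row] for row in channel]
        for g, channel in zip(gfc, lcc)
    ]


gfc = [0.0, 1.0]  # one scalar per channel
lcc = [
    [[0.0, 2.0], [-2.0, 0.0]],
    [[-1.0, 0.0], [1.0, -1.0]],
]
aw = attentional_weight(gfc, lcc)
```

Each scalar in `gfc` is added to every spatial position of the matching channel in `lcc`, so the result keeps the C × H × W shape of the local context.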
Lastly, as shown in the upper part of Figure 5, AFF is defined as:

$$F_{fused} = AW(F_1 \oplus F_2) \otimes F_1 + (1 - AW(F_1 \oplus F_2)) \otimes F_2 \tag{10}$$

where $\oplus$ is the same broadcasting addition as that in Equation (9), and $\otimes$ is the element-wise multiplication that multiplies corresponding elements of two tensors.

The second part of the AFF-embedded CNLA block is the CNLA block, which is calculated from the output of the AFF block, $F_{fused}$. The CNLA block is based on an improved nonlocal (NL) block [29]. In [29], the NL operation was defined as:

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j) g(x_j) \tag{11}$$

where $i$ is the position index of the output, and $j$ enumerates all possible positions. $x$ and $y$ are the input and output signals, respectively, with the same size. The function $f(x_i, x_j)$ generates a scalar value that relates position $i$ to each position $j$; its form is discussed below. The function $g(x_j)$ computes a representation of the input at position $j$, and it can be considered in the form of a linear embedding: $g(x_j) = W_g x_j$, where $W_g$ is a learnable parameter. $C(x)$ is the normalization coefficient.
With the NL operation that is defined in Equation (11), the nonlocal block is defined as:

$$z_i = W_z y_i + x_i \tag{12}$$

where $y_i$ is defined in Equation (11), $+ x_i$ denotes a residual connection, and $W_z$ is a learnable parameter.
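The function $f(x_i, x_j)$ in Equation (11) has multiple potential options. In [24], cosine similarity was introduced as the $f(x_i, x_j)$ function into the CNLA block:

$$f(x_i, x_j) = \frac{x_i^{T} x_j}{\|x_i\| \, \|x_j\|} \tag{13}$$

Lastly, output $y_i$ of CNLA is defined as:

$$y_i = \sum_{\forall j} s_{i,j} \, g(x_j) \tag{14}$$

where $s_{i,j}$ is the softmax operation performed on a row of the similarity map:

$$s_{i,j} = \frac{\exp(f(x_i, x_j))}{\sum_{\forall k} \exp(f(x_i, x_k))} \tag{15}$$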
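The two parts of the AFF-embedded CNLA block can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the attentional weights are passed in precomputed, the soft-selection form of the fusion follows [25], the embeddings $g$ and $W_z$ are taken to be identities, and all function names are hypothetical.

```python
import math


def aff_fuse(f1, f2, weight):
    """Soft selection between two flat feature lists (form assumed from [25]).

    weight holds attentional weights in (0, 1), one per element.
    """
    return [w * a + (1.0 - w) * b for a, b, w in zip(f1, f2, weight)]


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors; eps guards against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + eps)


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cnla(x):
    """Cosine nonlocal attention over N position vectors of length C.

    Assumption: g and W_z are identity mappings, so each output is the
    softmax-weighted sum of all positions plus the residual input.
    """
    out = []
    for xi in x:
        weights = softmax([cosine(xi, xj) for xj in x])
        yi = [sum(w * xj[c] for w, xj in zip(weights, x)) for c in range(len(xi))]
        out.append([y + v for y, v in zip(yi, xi)])  # residual connection
    return out


fused = aff_fuse([1.0, 2.0], [3.0, 4.0], [0.5, 1.0])
attended = cnla([[1.0, 2.0], [1.0, 2.0]])
```

With two identical position vectors, every similarity is equal, the softmax weights are uniform, and each output is the input plus itself, which makes the residual term easy to see.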

Region Proposal Module
Following the FEM described in Section 3.1, the region proposal module (RPM) [17] is applied to estimate the rough location of objects. As shown in Figure 6, in the RPM, a feature map is attached to a 3 × 3 convolutional layer to generate the intermediate feature map (IFM) $F_{IFM}$, which is designed to collect information from neighboring regions of the feature map. Following [17], $k$ reference boxes, namely anchors, are predefined with different aspect ratios at every location of the intermediate feature map. To obtain the rough position of chips, $F_{IFM}$ is input into two 1 × 1 convolutional layers to obtain the scoring and regression layers. The scoring layer has $k$ channels, and the regression layer has $4k$ channels. For each location in the feature map, anchors with a score higher than a predefined threshold are chosen as candidate ROIs, and their positions can be further refined with the regression layer.
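The channel counts and the threshold-based selection of candidate ROIs can be illustrated with a short sketch. The score layout, the threshold of 0.5, and the function names are assumptions chosen for the example; the paper does not state the threshold value.

```python
def head_channels(k):
    """Channel counts of the two 1 x 1 heads for k anchors per location."""
    return {"score": k, "regression": 4 * k}


def select_candidates(scores, threshold):
    """Keep anchors whose score exceeds the threshold, best first.

    scores[y][x][a] is the objectness score of anchor a at location (y, x).
    Returns tuples (y, x, anchor_index, score).
    """
    picked = []
    for y, row in enumerate(scores):
        for x, anchor_scores in enumerate(row):
            for a, s in enumerate(anchor_scores):
                if s > threshold:
                    picked.append((y, x, a, s))
    return sorted(picked, key=lambda c: c[3], reverse=True)


# One row, two locations, k = 2 anchors per location (toy values).
scores = [[[0.9, 0.2], [0.4, 0.7]]]
candidates = select_candidates(scores, threshold=0.5)  # 0.5 is a hypothetical threshold
```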

Detection Module
The detection module was designed to estimate the precise location and class of chips. Feature maps from the FEM, $F_{feature}$, could be used in these tasks, but they do not provide the feature distribution needed by the detection module. To solve this problem, the spatial attention block (SAB) [18] was applied to reweight $F_{feature}$ over the spatial dimensions with information from the RPM. As shown in Figure 7, the intermediate feature map of the RPM, $F_{IFM}$, was attached to a 1 × 1 convolutional layer, followed by a BN layer and a sigmoid layer; then, it was multiplied by feature map $F_{feature}$ to generate the final result $F_{SAB}$. The SAB is defined as:

$$F_{SAB} = F_{feature} \otimes sigmoid(BN(Conv_{1 \times 1}(F_{IFM}))) \tag{16}$$

where BN denotes the BN layer described in Equation (5), and sigmoid is defined as:

$$sigmoid(x) = \frac{1}{1 + e^{-x}} \tag{17}$$

As shown in Figure 3, all four feature maps output by the FEM were attached to the SAB to generate spatial attentional feature maps $\{s_i, i = 2, \ldots, 5\}$.

Despite the fusion performed in the previous sections, different feature maps still contain different information. To combine these feature maps, as shown in Figure 8, the multiscale ROI fusion block [23] was introduced in our work. In the RPM, ROIs are estimated in every feature map. Then, the ROI information is input into the ROI align pooling (ROIAlign) layer [30] to extract ROI features of the same size. In ROIAlign, ROIs are subdivided into spatial bins; the exact values at sampling points in these bins are computed with bilinear interpolation and then aggregated to generate the ROI feature. To obtain multiscale ROI features, these ROI features are fused with an element-wise max operation.

Lastly, to detect the class and precise location of electronic chips, the fused feature is attached to a sequence of fully connected layers, followed by two branches of fully connected layers: one produces scores for the $K$ object classes, and the other generates four values for each of the $K$ classes that encode refined bounding-box information.
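Three operations from this section lend themselves to a compact sketch: the spatial reweighting of a feature map, the bilinear sampling used inside ROIAlign bins, and the element-wise max fusion across pyramid levels. The code below is a simplified plain-Python illustration under stated assumptions: the 1 × 1 convolution and BN applied to the intermediate feature map are treated as already computed, features are nested lists, and all names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_attention(feature, response):
    """Reweight a C x H x W feature by sigmoid(response), an H x W map.

    Assumption: response is the output of the 1 x 1 convolution and BN layer
    applied to the intermediate feature map, computed elsewhere.
    """
    return [
        [[v * sigmoid(r) for v, r in zip(frow, rrow)] for frow, rrow in zip(channel, response)]
        for channel in feature
    ]


def bilinear(fmap, y, x):
    """Sample an H x W map at a real-valued (y, x) inside the map."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(fmap) - 1)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def fuse_max(roi_features):
    """Element-wise max over same-sized ROI features from different levels."""
    return [max(values) for values in zip(*roi_features)]


reweighted = spatial_attention([[[2.0, 4.0]]], [[0.0, 0.0]])
sample = bilinear([[0.0, 1.0], [2.0, 3.0]], 0.5, 0.5)
fused = fuse_max([[1.0, 5.0, 2.0], [3.0, 4.0, 0.0]])
```

The max fusion keeps, for each element, the strongest response found at any pyramid level, so an object that is salient at only one scale still contributes to the fused ROI feature.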

Multitask Loss
Our detection network was assigned two tasks, classification and bounding-box regression, which correspond to the two branches in the detection module. The loss function of our network is defined as follows [17]:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \tag{18}$$

where $i$ is the index of anchors, and $p_i$ is the predicted probability that anchor $i$ contains an object. The two terms are normalized by $N_{cls}$ and $N_{reg}$, which are the minibatch size and the number of anchor locations, respectively, and they are weighted by a balancing parameter $\lambda$. $L_{cls}(p_i, p_i^*)$ and $L_{reg}(t_i, t_i^*)$ in Equation (18) are described as follows. In Equation (18), $p_i^*$ is the ground truth of $p_i$, defined as:

$$p_i^* = \begin{cases} 1, & \text{if anchor } i \text{ is positive} \\ 0, & \text{if anchor } i \text{ is negative} \end{cases}$$

Classification loss $L_{cls}$ is the log loss over two classes (object versus not object).
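$t_i$ is the regression vector with four elements that represents the predicted bounding box:

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}$$

$t_i^*$ is the ground truth of $t_i$:

$$t_x^* = \frac{x^* - x_a}{w_a}, \quad t_y^* = \frac{y^* - y_a}{h_a}, \quad t_w^* = \log\frac{w^*}{w_a}, \quad t_h^* = \log\frac{h^*}{h_a}$$

where $x$, $y$, $w$, and $h$ denote the central coordinates, width, and height of the predicted bounding box; following [17], the subscript $a$ denotes the anchor box and the superscript $*$ denotes the ground-truth box. To improve robustness, $L_{reg}$ with the smooth L1 function is defined as:

$$L_{reg}(t_i, t_i^*) = \sum_{j \in \{x, y, w, h\}} smooth_{L1}(t_{i,j} - t_{i,j}^*), \quad smooth_{L1}(d) = \begin{cases} 0.5 d^2, & |d| < 1 \\ |d| - 0.5, & \text{otherwise} \end{cases}$$

With loss function $L$ defined in Equation (18), learnable parameters can be trained with the stochastic gradient descent method:

$$\theta \leftarrow \theta - \gamma \frac{\partial L}{\partial \theta}$$

where $\theta$ is one of the learnable parameters, and $\gamma$ is the learning rate.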
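A worked example may help make the multitask loss concrete. The sketch below assumes the box parameterization and the smooth L1 form of [17], with the regression term counted only for positive anchors; the function names, the argument layout, and the toy numbers are hypothetical.

```python
import math


def encode_box(box, anchor):
    """Encode a (x, y, w, h) box relative to an anchor, as parameterized in [17]."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return [(x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha)]


def smooth_l1(d):
    """Quadratic near zero, linear for large residuals."""
    return 0.5 * d * d if abs(d) < 1.0 else abs(d) - 0.5


def log_loss(p, label):
    """Binary log loss for an object probability p and a 0/1 label."""
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


def multitask_loss(probs, labels, t_pred, t_true, n_cls, n_reg, lam):
    """Classification term plus lam times the regression term.

    The regression term is multiplied by the label, so negative anchors
    contribute nothing to it (assumption taken from [17]).
    """
    cls = sum(log_loss(p, l) for p, l in zip(probs, labels)) / n_cls
    reg = sum(
        l * sum(smooth_l1(a - b) for a, b in zip(tp, tt))
        for l, tp, tt in zip(labels, t_pred, t_true)
    ) / n_reg
    return cls + lam * reg


# One positive anchor with a small error in x and a large error in h.
loss = multitask_loss(
    probs=[0.5],
    labels=[1],
    t_pred=[[0.0, 0.0, 0.0, 0.0]],
    t_true=[[0.5, 0.0, 0.0, 2.0]],
    n_cls=1,
    n_reg=1,
    lam=1.0,
)
```

In this example the small residual of 0.5 is penalized quadratically (0.125), while the large residual of 2.0 is penalized linearly (1.5), which is what makes the regression term less sensitive to outliers.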

Dataset
In this section, the proposed chip detection method is evaluated on our chip image dataset. In the electronic industry, the appearance of PCBs and chips varies widely, so it is impossible to build a unified dataset that meets the requirements of all chip detection applications. As a result, chip detection applications generally rely on small-scale datasets. Our image dataset had 367 images and was divided into two subsets: the training dataset contained 330 images, and the evaluation dataset contained 37 images. Images of the dataset were captured by the electronic chip detection system illustrated in Figure 1. On the basis of the image capture strategy, chip images were randomly cropped, and their width was between 300 and 700 pixels. Each image contained at least five chips belonging to at least two classes. The distribution of instances in the chip dataset is shown in Table 2.

Implementation Details
The proposed chip detection method was evaluated on a workstation with an Intel Xeon Gold 6278C CPU and an NVIDIA Tesla V100 GPU. The network was implemented in the Python programming language on the basis of PyTorch [31] and its extension library Detectron2 [32]. The pretrained checkpoint of ResNet from Detectron2 was used to initialize the backbone of our method.
As discussed in Section 4.1, electronic chip detection methods are usually trained with small-scale datasets and are prone to overfitting. To solve this problem, data augmentation was applied in our method. By expanding the training dataset, data augmentation improves generalization and robustness against changes in the input image, such as image density, object position, and object orientation. In this paper, three augmentations are used: (1) random crop augmentation: cropping a region with random size from raw images; (2) random flip augmentation: randomly flipping the image; (3) small object augmentation [33]: copying small objects from the original position and pasting them to different positions.
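For detection tasks, an augmentation must transform the bounding-box annotations together with the image. The sketch below illustrates this for the random flip. It assumes a horizontal flip (the text does not state the direction), boxes given as pixel-edge coordinates `(x_min, y_min, x_max, y_max)`, and a flip probability of 0.5; the function names are hypothetical.

```python
import random


def hflip(image, boxes):
    """Flip an H x W nested-list image left to right and mirror its boxes.

    Boxes are (x_min, y_min, x_max, y_max) in pixel-edge coordinates.
    """
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    new_boxes = [
        (width - x_max, y_min, width - x_min, y_max)
        for x_min, y_min, x_max, y_max in boxes
    ]
    return flipped, new_boxes


def random_flip(image, boxes, prob=0.5, rng=random):
    """Apply hflip with probability prob (0.5 is an illustrative default)."""
    if rng.random() < prob:
        return hflip(image, boxes)
    return image, boxes


image = [[1, 2, 3, 4]]
boxes = [(0, 0, 1, 1)]  # covers the pixel holding the value 1
flipped_image, flipped_boxes = hflip(image, boxes)
```

After the flip, the value 1 sits in the last column, and the mirrored box covers exactly that pixel.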
To learn the parameters of our method, a back-propagation-based optimization method is applied to minimize the loss defined in Equation (18), which is a function of the weight parameters. First, the derivative of the loss function with respect to each weight was calculated; then, a stochastic gradient descent method with momentum was applied to update the weights in the direction of steepest descent until the maximal iteration was reached. The previous momentum was used to accelerate the current gradient: the update direction was defined by the previous update direction and the gradient of the current batch. In other words, if the current and previous gradient directions are the same, the update speed is higher; otherwise, the update speed is lower. In our work, the learning rate was set to 0.0001, momentum was set to 0.9, and the maximal iteration was set to 40,000.
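The effect of momentum described in this paragraph can be seen in a few lines of code. The sketch uses the common formulation in which the velocity accumulates past gradients before the learning rate is applied; the function name is hypothetical, the defaults are the learning rate and momentum reported here, and the toy gradients and the larger learning rate in the example are chosen only for readability.

```python
def sgd_momentum_step(theta, grad, velocity, lr=0.0001, momentum=0.9):
    """One SGD-with-momentum update over lists of parameters.

    velocity <- momentum * velocity + grad
    theta    <- theta - lr * velocity
    """
    new_velocity = [momentum * v + g for v, g in zip(velocity, grad)]
    new_theta = [t - lr * v for t, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


# Two steps with the same gradient direction: the step size grows.
theta, velocity = [1.0], [0.0]
theta, velocity = sgd_momentum_step(theta, [1.0], velocity, lr=0.1)
theta, velocity = sgd_momentum_step(theta, [1.0], velocity, lr=0.1)

# A step with the opposite gradient direction: the velocity nearly cancels.
_, opposed_velocity = sgd_momentum_step([1.0], [-1.0], [1.0], lr=0.1)
```

When successive gradients agree, the velocity grows from 1.0 to 1.9, so the second step is larger; when the gradient reverses, the velocity shrinks to about -0.1 and the update slows down.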
In the RPM, the areas of the anchors for the pyramid feature maps were set to $\{32^2, 64^2, 128^2, 256^2, 512^2\}$, and the aspect ratios of all anchors were $[\frac{1}{3}, \frac{1}{2}, 1, 2, 3]$. The maximal iteration was set to 8000, and the loss curve in training our method is shown in Figure 9. Experimental results of our method are demonstrated in Figure 10.
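The anchor shapes implied by these areas and aspect ratios can be enumerated with a short sketch. It assumes that the aspect ratio is height divided by width and that every area is combined with every ratio; both conventions, and the names used, are assumptions rather than details given in the text.

```python
import math

AREAS = [32 ** 2, 64 ** 2, 128 ** 2, 256 ** 2, 512 ** 2]
RATIOS = [1 / 3, 1 / 2, 1, 2, 3]


def anchor_shape(area, ratio):
    """Return (width, height) with the given area and ratio = height / width."""
    width = math.sqrt(area / ratio)
    height = width * ratio
    return width, height


# Five shapes per area, one for each aspect ratio.
anchors = {area: [anchor_shape(area, r) for r in RATIOS] for area in AREAS}
```

Under these assumptions, each location carries five anchors per area, for example a 32 × 32 square for ratio 1 and progressively taller or wider boxes of the same area for the other ratios.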

Evaluation Metrics
The performance of our method was evaluated with the PASCAL criteria [34]. First, detection results were sorted by their confidence scores; then, the intersection over union (IoU) was calculated for these results:

$$IoU = \frac{area(DetectionResult \cap GroundTruth)}{area(DetectionResult \cup GroundTruth)}$$

where DetectionResult is the bounding box of the detection result, and GroundTruth is the annotation box. Then, $l_i^t$ is defined as:

$$l_i^t = \begin{cases} 1, & a_i \geq t \\ 0, & \text{otherwise} \end{cases}$$

where $a_i$ is the IoU of the $i$-th detection result, and $t$ is the threshold. Precision $p$ and recall $r$ are defined as follows:

$$p_i^t = \frac{tp_i^t}{tp_i^t + fp_i^t}, \quad r_i^t = \frac{tp_i^t}{n_p}$$

where $n_p$ is the number of positive samples, and $tp_i^t$ and $fp_i^t$ are the numbers of true positives and false positives among the first $i$ detection results, respectively:

$$tp_i^t = \sum_{k=1}^{i} l_k^t, \quad fp_i^t = \sum_{k=1}^{i} (1 - l_k^t)$$

On the basis of the area under the precision-recall curve, the AP was calculated as follows:

$$AP^t = \int_0^1 p^t(r) \, dr$$

The final mAP was calculated as the average value of AP over $N$ classes:

$$mAP^t = \frac{1}{N} \sum_{n=1}^{N} AP_n^t$$

On the basis of the mAP defined above, the alternative criterion $mAP_{COCO}$ was calculated as the average value of AP with $t$ = 0.50:0.05:0.95 [35,36]. Table 3 shows the accuracy gap between our method and the Faster R-CNN method, and Table 4 shows the difference between these two methods by category. As shown in Table 3, our method achieved promising results, outperforming the benchmark Faster R-CNN method.
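The metrics in this section can be sketched in plain Python as follows. This is an illustrative sketch, not the evaluation code used for the reported results: boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, the matching of detections to ground truth is assumed to have been done already, and the area under the precision-recall curve is computed with all-point interpolation, which is one common convention. All names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, n_positive):
    """Area under the precision-recall curve for one class.

    detections: list of (confidence, is_true_positive) pairs, where the flag
    is assumed to come from comparing the IoU with the threshold t.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, hit in ranked:
        tp += 1 if hit else 0
        fp += 0 if hit else 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_positive)
    # All-point interpolation: precision becomes non-increasing in recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Average of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], n_positive=2)
```

Averaging `mean_average_precision` over the thresholds 0.50, 0.55, and so on up to 0.95 would give the COCO-style criterion described above.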

Evaluation Results
As shown in Table 3, the $mAP_{0.5}$ and $mAP_{0.75}$ of our method were 0.98745 and 0.95142, respectively, while $mAP_{COCO}$ was only 0.81130. By definition, $mAP_{0.5}$ evaluates detection results with an IoU threshold of 50% between the bounding boxes of the detection results and the ground truth; $mAP_{0.75}$ evaluates detection results with a threshold of 75%; and $mAP_{COCO}$ averages over IoU thresholds from 50% to 95% in intervals of 5%, that is, its requirement on IoU is stricter. Therefore, our method could locate electronic chips in the image approximately, but, as shown in Figure 10, there were errors in the central coordinates and sizes of the predicted boxes. Hence, the accuracy of the bounding box could be further improved.
As shown in Table 4, the $mAP_{COCO}$ results of our method in detecting capacitors, transistors, ICs, and inductors were 0.80324, 0.77550, 0.88759, and 0.86782, respectively, whereas its result in detecting resistors was lower than that of the Faster R-CNN method. According to our analysis, this is because the surface of resistors carries text indicating the resistance value, so their surface texture is relatively complex. Compared with the Faster R-CNN method, our method extracts more object features and is therefore more prone to overfitting when the amount of data is not large enough. Therefore, it is necessary to improve the accuracy of our method in detecting complex objects with a small amount of training data.

Conclusions
This paper proposed a novel electronic chip detection method that was trained with a small-scale chip dataset. Three aspects distinguish our work from previous works: first, our method was designed to detect chips that belong to different classes in complex backgrounds; second, an AFF-embedded CNLA module and a pyramid feature module were combined to extract features from chip images; third, pyramid feature maps were enhanced with the region intermediate feature map to classify and locate chips. The experiment showed that our work outperformed the benchmark method. There are two challenges for our work: (1) the accuracy of the bounding box needs to be further improved; (2) the detection accuracy of objects with complex textures needs to be further improved. In future work, we will focus on improving the precision of the bounding boxes of electronic chips and the performance of the few-shot electronic chip detection method.

Data Availability Statement: The source codes and datasets used to support the findings of this study are available from the corresponding author upon request via email: sunhao2021@hit.edu.cn.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: