2.1. Literature Review
Due to the limited hardware resources of embedded platforms, such as insufficient computational power, restricted memory capacity, and power consumption constraints, applications originally designed to perform die bond detection and recognition on high-performance GPU servers cannot achieve their expected performance in embedded environments. Accelerating die bond inference through model lightweighting techniques under limited hardware resources, while maintaining existing accuracy, has therefore become an important research topic, ensuring that the model can complete its tasks effectively and efficiently.
Boustedt et al. [9] discussed methods for improving chip interconnection in die bonding to enhance yield. The flip-chip method is the most reliable chip interconnection technique, capable of achieving very high yield at low cost. The bumps on the chip, the chip carrier, and the interconnection method between the chip and the carrier are the fundamental elements of flip-chip interconnection. These elements are interdependent; therefore, each component must be considered comprehensively when selecting the optimal flip-chip system for a specific application.
On the other hand, Liao et al. [10] noted that as electronic devices advance, they become thinner, smaller, faster, and characterized by higher I/O density. The bump diameter and pitch must therefore shrink to meet bump density requirements. This trend introduces new challenges to semiconductor packaging processes, such as non-wetting and bridging issues during die bonding. Furthermore, X/Y/Z-axis bump misalignment and warpage mismatch can affect the assembly process of the Flip Chip Chip Scale Package (FCCSP).
Tsao et al. [11] further explained that residual stress is generated within the die attach assembly during the chip mounting process: because the coefficients of thermal expansion (CTE) of the material components are mismatched, the components experience stress, resulting in out-of-plane displacement of the chip caused by the die attach process.
Gu et al. [12] proposed an improved YOLOv7-tiny architecture to meet edge devices’ real-time object detection requirements. They replaced the standard convolution in the ELAN structure with depthwise separable convolution (DWConv) to reduce the number of model parameters. To improve the neck network, they integrated the Coordinate Attention (CA) mechanism into the convolution to establish the Coordinate Attention Convolution (CAConv), which replaced the standard convolution. This approach demonstrated the feasibility and practicality of structural optimization for embedded platform applications.
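As a concrete point of reference, the following PyTorch sketch shows how a standard 3 × 3 convolution is commonly replaced by a depthwise separable one; it illustrates the general technique rather than the exact module of [12]. For C_in input and C_out output channels, the weight count drops from roughly 9 · C_in · C_out to 9 · C_in + C_in · C_out.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) 3x3
    convolution followed by a 1x1 pointwise convolution that mixes channels."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution to combine channels and set output width.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)  # YOLOv7-tiny's default activation

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```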
In addition to depthwise separable convolution, introducing the Ghost module has become an essential technique for model lightweighting. Wang et al. [13] first integrated the Ghost module into the YOLOv7-tiny architecture. Following this, they introduced the CA module into the feature extraction network to enhance the model’s ability to learn defect location features. Then, they adopted the lightweight convolution module—Ghost Shuffle Convolution (GSConv)—in the feature fusion network, effectively reducing model parameters while maintaining satisfactory detection accuracy.
In the study by Gong et al. [14], GhostNet was combined with Dynamic Region Convolution (DRConv) and a cross-layer feature sharing network (CotNet Transformer) to enhance YOLOv7-tiny’s feature extraction and fusion capabilities. Implementing the Gaussian Error Linear Unit (GELU) to replace the conventional ReLU activation function improved the model’s nonlinear representation and classification performance. The enhanced model outperformed YOLOv7-tiny in accuracy, inference speed, and model size.
Tang et al. [15] also demonstrated the potential of combining the Ghost module with novel activation functions. The model improvements involved several key modifications: integrating the GhostNet V2 module into the YOLOv7-tiny backbone for parameter reduction; replacing LeakyReLU with FReLU in convolutional layers to enhance spatial feature modeling; and incorporating the SimAM, C3, and ODConv attention/convolution modules. Experimental results showed that the number of floating-point operations (FLOPs) decreased by 59%, the mAP dropped by only 0.64%, and the inference speed reached 47 fps, demonstrating outstanding performance improvement and application potential.
In summary, effectively lightweighting detection and recognition models by integrating techniques such as depthwise separable convolution, Ghost modules, attention mechanisms, and novel activation functions has become a significant research direction in edge computing for embedded deep learning models.
2.2. Data Collection and Preprocessing
Based on our previous publication [3], the image data utilized in this study originated from a major semiconductor manufacturer in southern Taiwan.
Figure A2 displays these images, all taken from a top-down perspective of the machine. The research team manually annotated these images for object detection and classification, defining four categories for the die’s bonding status: side_good, side_bad, corner_good, and corner_bad. Initially saved in XML format, the resulting label information required conversion to the YOLO label format to be compatible with YOLO-based model input.
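For illustration, the snippet below sketches one way to convert a VOC-style XML annotation into YOLO’s normalized label format, assuming the XML follows the common size/object/bndbox layout; the field names and class ordering here are assumptions, not the exact conversion script used in this study.

```python
import xml.etree.ElementTree as ET

# Hypothetical class ordering for the four die bonding categories.
CLASSES = ["side_good", "side_bad", "corner_good", "corner_bad"]

def voc_xml_to_yolo(xml_path):
    """Convert one VOC-style XML annotation to YOLO lines:
    '<class_id> <x_center> <y_center> <width> <height>' (all normalized)."""
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls_id = CLASSES.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO format stores the box center and size, normalized by image size.
        xc = (xmin + xmax) / 2.0 / img_w
        yc = (ymin + ymax) / 2.0 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    return lines
```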
For data collection, this study acquired 3145 images and split them into a training set (70%, 2204 images), a validation set (15%, 467 images), and a test set (15%, 474 images). This 70%:15%:15% split provides sufficient data support for all phases of model development.
To ensure effective learning of the complete die bonding status and accurate classification of side and corner quality, four model variations—CGSE-YOLOv7-tiny, Mobile-YOLOv7-tiny, ReG-YOLOv7-tiny, and DSGReG-YOLOv7-tiny—were initially trained. The experiment set hyperparameters as follows to optimize performance: 300 epochs (complete iterations over the dataset); a batch size of 32 (images per weight update); and an input image size of 1024 × 1024 to facilitate both computational efficiency and high accuracy in die bond object detection.
After training and validating the above models, this study further applied lightweight strategies to the DSGβSI-YOLOv7-tiny, DSGβSI-SE-YOLOv7-tiny, and DSGβSI-SECS-YOLOv7-tiny versions. These models integrate DSG convolution, ModifiedSiLU, AdaptiveSiLU activation functions, and attention mechanisms in their structure, aiming to significantly improve inference speed and computational efficiency while maintaining detection accuracy, thereby achieving a lightweight design suitable for embedded platform deployment. Through a systematic training process and data partitioning, this study ensures that the models deliver stable and reliable predictive performance across different datasets, providing an efficient and precise intelligent solution for die bond inspection.
2.5. CGSE-YOLOv7-Tiny Model
Introducing Ghost Convolution (GhostConv) [17] significantly improves inference efficiency, enabling deep learning models to maintain high accuracy and stability while achieving lightweight characteristics.
Figure 1 illustrates the CSPGhostConv architecture [18], which integrates the Cross Stage Partial (CSP) structure [19] with the GhostConv module. This architecture is a lightweight yet practical feature fusion module commonly employed to replace conventional convolutional layers, thereby accelerating inference and reducing the number of parameters.
The CSPGhostConv module can be regarded as a lightweight variant inspired by the CSP structure, further incorporating GhostConv to enhance computational efficiency and feature fusion capability. This design makes it particularly suitable for deep learning architectures that are sensitive to inference speed.
The workflow of this architecture consists of two main paths. First, the direct path allows a portion of the input feature maps, typically 50% of the input channels, to bypass the main computation module through a skip connection and be directly concatenated at the final stage. This design preserves the original feature information and minimizes information loss. Second, the processing path routes the remaining input feature maps, again typically 50% of the input channels, through the main computational module (e.g., GhostConv), which performs complex feature transformations to learn new feature representations.
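A minimal PyTorch sketch of this dual-path idea follows; the exact channel splits, kernel sizes, and normalization of the CSPGhostConv module in Figure 1 may differ.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost convolution: a primary 1x1 conv produces half of the output
    channels; a cheap depthwise conv generates the remaining 'ghost' maps."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        half = out_ch // 2
        self.primary = nn.Conv2d(in_ch, half, 1, bias=False)
        self.cheap = nn.Conv2d(half, half, 5, padding=2, groups=half, bias=False)

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class CSPGhostConv(nn.Module):
    """CSP-style wrapper: half of the channels skip computation (direct path),
    the other half pass through GhostConv (processing path); the two halves
    are concatenated and compressed by a final 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.half = channels // 2
        self.ghost = GhostConv(self.half, self.half)
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)  # compression

    def forward(self, x):
        direct, process = x[:, :self.half], x[:, self.half:]
        return self.fuse(torch.cat([direct, self.ghost(process)], dim=1))
```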
The Squeeze-and-Excitation (SE) Layer [20] constitutes a lightweight component engineered to enhance the performance of convolutional neural networks (CNNs). It enhances important features and suppresses less important ones by allowing the network to learn inter-channel relationships and dynamically recalibrate each channel, thereby improving model accuracy.
The SE Layer comprises two key steps: Squeeze and Excitation. The Squeeze phase compresses the global spatial information of every channel into a representative value, most commonly via Global Average Pooling (GAP): for an input feature map of size H × W × C, GAP produces a 1 × 1 × C feature vector whose entries are the average responses of the corresponding channels across the spatial dimensions, capturing each channel’s global contextual information. The Excitation phase then learns a weight for each channel. The 1 × 1 × C vector obtained from the Squeeze step is fed into a fully connected layer with ReLU activation that reduces the number of channels to C/r, where r is the reduction ratio that lowers the parameter count; a second fully connected layer restores the channel number to C, and a Sigmoid function constrains the output between 0 and 1. Essentially, these values represent each channel’s “excitation”, or weighting factor.
As the final step, the learned channel weights are applied to the original input feature map through channel-wise multiplication, dynamically recalibrating the features, as depicted in Figure 2.
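The following is a minimal sketch of the SE Layer as just described (GAP squeeze, C → C/r → C excitation with Sigmoid, channel-wise rescaling); r = 16 is a typical default from the original paper.

```python
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-Excitation: GAP squeezes H x W x C to 1 x 1 x C, two FC
    layers (C -> C/r -> C) learn channel weights in (0, 1), and the input is
    rescaled channel-wise."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)        # Squeeze: global average pooling
        self.excite = nn.Sequential(                  # Excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                             # weights constrained to (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                  # channel-wise recalibration
```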
Considering that GhostConv simulates high-cost convolutions through cheap operations, saving parameters and computations, and that CSP divides the input into two parts to reduce computation, these two modules were combined. Although this CSPGhostConv does not fully follow the CSPNet approach of splitting paths and merging, it mimics the spirit of “dual paths → concatenation → compression”. Combining the two outputs and then compressing helps maintain a balance between representational capability and lightweight design. Furthermore, this model can also utilize a Squeeze-and-Excitation (SE) Layer to enhance model performance, as shown in Figure 3.
2.6. Mobile-YOLOv7-Tiny Model
MobileNetV3 [21,22] is the next-generation MobileNet series model developed by Google, designed as an efficient convolutional neural network (CNN) specifically for mobile devices. It combines complementary search techniques and novel architectural designs aimed at optimizing for mobile CPUs, achieving a better balance between accuracy, latency, and model size. The development of MobileNetV3 utilized a hardware-aware neural architecture search (NAS) method, which considers specific hardware constraints, such as CPU performance, to reduce inference latency. In addition, it integrates the NetAdapt algorithm to optimize the model further, ensuring high accuracy under specific latency constraints. Based on NAS and NetAdapt, manual architectural improvements were made to enhance model efficiency, such as adjustments to bottleneck structures and nonlinear activation functions.
NAS is an automated search method for neural network architectures that simultaneously considers hardware performance constraints (e.g., latency, power consumption, memory usage). NAS strategies generally include reinforcement learning-based NAS, evolutionary NAS, gradient-based NAS (e.g., DARTS), and others. MobileNetV3 uses platform-aware NAS (as in MnasNet) to design the block combination sequence, aiming for accurate and fast networks on specific hardware. A recurrent neural network (RNN)-based controller and a factorized hierarchical search space were employed, with the reinforcement learning controller determining each block’s expansion ratio (e.g., 3, 6), kernel size (e.g., 3 × 3, 5 × 5), whether to use the SE module, and the activation function (ReLU or h-swish).
In MnasNet (Mobile Neural Architecture Search Network), the factorized hierarchical search space divides the model into blocks and layers, addressing insufficient layer diversity. During the search, for each block, it determines which operation each layer performs and how many times to repeat it; the composition of layers can differ completely between blocks. Figure 4 shows that block 2 consists of standard convolution layers, while block 4 comprises bottleneck layers combined with squeeze-and-excitation layers.
The objective function proposed by MnasNet considers real-world application scenarios. Since MnasNet incorporates the target platform into the objective function, it is called platform-aware NAS. MnasNet aims to maximize this objective function.
Equation (1) represents the objective function, where ACC(m) and LAT(m) denote accuracy and latency, respectively, m represents the sampled model, T is the target latency, and w is an application-specific constant (treated as a weight factor). In the original paper [21], the authors set w to −0.07. In MobileNetV3, the authors observed that the accuracy of lightweight models changes more sharply with latency, so they adjusted w to −0.15. The model obtained using this method turned out similar to the results reported in the MnasNet paper; therefore, the authors directly adopted MnasNet-A1 as the initial model for subsequent improvements, as shown in Figure 5.
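For clarity, the objective in Equation (1) can be written as ACC(m) × (LAT(m)/T)^w; the small sketch below evaluates it, with the numbers purely illustrative.

```python
def mnasnet_objective(acc, latency_ms, target_ms, w=-0.07):
    """Platform-aware NAS reward: ACC(m) * (LAT(m) / T) ** w.
    With w < 0, models slower than the target T are penalized;
    MobileNetV3 uses w = -0.15 for its lightweight models."""
    return acc * (latency_ms / target_ms) ** w

# Example: a model that is 75.2% accurate at 81 ms against an 80 ms target.
reward = mnasnet_objective(acc=0.752, latency_ms=81.0, target_ms=80.0)
```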
The NetAdapt algorithm first generates a set of proposals, each of which modifies the model to reduce its computation time by a specific amount. It applies each proposal to the pre-trained model and fine-tunes it briefly to obtain a rough accuracy estimate. The algorithm then selects the best proposal according to specific criteria, and the process repeats until the target computation time is reached, as shown in Figure 6.
MobileNetV3 considers two types of proposals: uniformly reducing the size of the expansion layers, and uniformly reducing the size of the bottlenecks across all blocks. It uses the ratio of the change in accuracy to the change in latency (ΔAcc/|Δlatency|) as the selection criterion, choosing the proposal that sacrifices the least accuracy while meeting the latency-reduction target. After identifying the final proposal, NetAdapt re-trains the model from scratch to obtain the final model.
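A toy sketch of this selection criterion, with hypothetical proposal records and illustrative numbers:

```python
def select_proposal(proposals):
    """Pick the proposal with the best accuracy-change-to-latency-change
    ratio (delta_acc / |delta_latency|); the dict fields are hypothetical."""
    return max(proposals, key=lambda p: p["delta_acc"] / abs(p["delta_latency"]))

# Each candidate records how fine-tuned accuracy and latency changed
# relative to the current model (illustrative numbers only).
proposals = [
    {"name": "shrink_expansion",  "delta_acc": -0.004, "delta_latency": -3.0},
    {"name": "shrink_bottleneck", "delta_acc": -0.002, "delta_latency": -2.5},
]
best = select_proposal(proposals)  # -> the "shrink_bottleneck" proposal
```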
In the previous generation, MobileNetV2, the final 1 × 1 convolution expanded the activation tensor into a higher-dimensional space to facilitate subsequent predictions; however, this also required a large amount of computation. MobileNetV3 moves this 1 × 1 convolution after the global average pooling, reducing its input size from 7 × 7 to 1 × 1 and thereby decreasing the computational load. Because this change already removes a substantial amount of computation, the bottleneck blocks previously used for dimensionality reduction and information propagation can also be removed, as shown in Figure 7.
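The reordering can be illustrated as follows; the channel widths are illustrative, and the point is simply that pooling first shrinks the work of the 1 × 1 expansion by a factor of 49 (7 × 7 → 1 × 1 spatial positions).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 160, 7, 7)       # feature map entering the last stage
expand = nn.Conv2d(160, 960, kernel_size=1)

# MobileNetV2-style order: expand on the 7x7 map, then pool.
# The 1x1 conv runs on 49 spatial positions.
v2_out = nn.AdaptiveAvgPool2d(1)(expand(x))

# MobileNetV3 order: pool first, then expand on a 1x1 map.
# The same 1x1 conv now runs on a single spatial position.
v3_out = expand(nn.AdaptiveAvgPool2d(1)(x))

print(v2_out.shape, v3_out.shape)   # both torch.Size([1, 960, 1, 1])
```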
Equation (2) represents the Swish function, a smooth and differentiable nonlinear activation function, where x is the input variable and σ(x) is the sigmoid function. Swish possesses a self-gated property, allowing it to adaptively adjust its output based on the input value, enhancing gradient flow and feature representation capability.
MobileNetV3 uses ReLU6 to replace the sigmoid function, naming this function hard swish (h-swish). Equation (3) defines the function as x · ReLU6(x + 3)/6, a piecewise function with three segments: when the input x ≤ −3, the output is 0; when −3 < x < 3, the function follows the quadratic form x(x + 3)/6; when x ≥ 3, the output equals the input itself. It is computationally simple, and ReLU6 can implement it efficiently. Within the range (−3, 3), it approximates Swish. Since it avoids the expensive sigmoid, it is beneficial in low-latency models, offering performance similar to Swish while enjoying broader platform support for ReLU6.
Equation (4) defines the ReLU6 function, ReLU6(x) = min(max(0, x), 6), an improved version of the Rectified Linear Unit (ReLU). ReLU6 retains ReLU’s sparsity property while capping the output at 6, preventing numerical overflow during low-precision computations or quantization. This design enhances model stability and quantizability on mobile and embedded devices. By replacing sigmoid with ReLU6, h-swish avoids precision loss during quantization and can be represented as a piecewise function, reducing memory requirements on hardware. Combining linear and sigmoid-like characteristics, h-swish approaches zero as x → −∞ and approaches x as x → +∞, remaining continuous and differentiable, making it suitable for deep networks.
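The h-swish and ReLU6 pieces in Equations (3) and (4) translate directly to code; PyTorch’s built-in hardswish implements the same function.

```python
import torch
import torch.nn.functional as F

def relu6(x):
    # Equation (4): clamp the ReLU output at 6.
    return torch.clamp(x, min=0.0, max=6.0)

def h_swish(x):
    # Equation (3): x * ReLU6(x + 3) / 6.
    # x <= -3 -> 0;  -3 < x < 3 -> x * (x + 3) / 6;  x >= 3 -> x.
    return x * relu6(x + 3.0) / 6.0

x = torch.linspace(-5, 5, steps=11)
print(h_swish(x))        # smooth, Swish-like, sigmoid-free
print(F.hardswish(x))    # PyTorch's built-in equivalent
```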
Compared with MnasNet, which sets the number of SE channels based on the bottleneck size, MobileNetV3 fixes the number of SE channels to one-fourth of the expansion layer channels; the authors found this choice of intermediate channels for the SE block to achieve the best performance. Figure 8 shows the overall architecture.
2.7. ReG-YOLOv7-Tiny Model
The RepGhost module [23] is a lightweight convolutional design that combines the Ghost module with reparameterization techniques, as shown in Figure 9: (a) a block from the GhostNet network; (b) a block of the RepGhost network during training; (c) a block of the RepGhost network during inference.
The core idea is that the input channels are first compressed within the bottleneck structure through a 1 × 1 convolution, reducing the input dimension; the intermediate channels keep the computational cost low. Another 1 × 1 convolution then restores the feature channels to the output dimension, forming a typical “shrink–expand” structure. Since depthwise convolution (DWConv) can only maintain a one-to-one correspondence between input and output channels without changing the channel number, the module performs channel compression and expansion within the bottleneck using standard convolutions.
In the Ghost Bottleneck, intermediate feature extraction uses DWConv to generate redundant features, which are then concatenated with the primary convolution results to achieve efficient feature representation. Furthermore, in the RG-bneck module, a re-parameterization structure is introduced into the compressed DWConv during training to enhance feature representation; during inference, this structure can be folded into a single convolution, avoiding extra inference overhead, thus achieving a “training-enhanced, inference-efficient” design.
Additionally, the module incorporates a Shortcut Block (SBlock) that provides a residual connection path. When enabled, the input features can be directly added to the output, further improving gradient flow and convergence stability. Compared with the original Ghost module, RepGhost maintains efficient redundant feature generation while further strengthening model expressiveness and reducing structural complexity during inference, making it particularly suitable for deployment in resource-constrained scenarios, as shown in Figure 10.
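At the heart of this “training-enhanced, inference-efficient” design is the standard conv-BN folding sketched below; RepGhost additionally merges its parallel training-time branches into the cheap DWConv in the same spirit. This is a generic sketch of the folding step, not the full RepGhost fusion procedure.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm into the preceding convolution so that the
    conv + BN pair becomes a single conv with identical inference output."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      groups=conv.groups, bias=True)
    # BN at inference: y = gamma * (x - mean) / sqrt(var + eps) + beta
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    base_bias = conv.bias.data if conv.bias is not None \
        else torch.zeros_like(bn.running_mean)
    fused.bias.data = (base_bias - bn.running_mean) * scale + bn.bias.data
    return fused
```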
2.9. DSGβSI-YOLOv7-Tiny Model
As noted previously, while Leaky ReLU partially mitigates ReLU’s limitations by allowing small negative outputs, its fixed negative slope often fails to adapt optimally to diverse data distributions or network layer characteristics. In practical applications, the negative outputs introduced by Leaky ReLU may sometimes affect the network’s convergence behavior and learning capacity, resulting in performance degradation. Previous studies [15,26] have indicated that properly improving the design of activation functions can further enhance inference accuracy and model stability. Based on this insight, this study replaces the original Leaky ReLU activation function in the DSG-YOLOv7-tiny architecture with the Sigmoid Linear Unit (SiLU).
SiLU is an activation function with smooth characteristics, effectively alleviating the jitter problem during training and enabling more stable gradient updates. Compared with ReLU and Leaky ReLU, SiLU exhibits a soft activation behavior between linear and nonlinear transformations. This activation allows it to capture subtle variations in the input data more flexibly while maintaining better gradient stability throughout the training process. Such properties enhance the model’s ability to represent complex features and improve its generalization capability. As validated by the subsequent experimental results, integrating SiLU into DSG-YOLOv7-tiny yields a slight yet consistent improvement in precision and accuracy, demonstrating its practical value in improving model performance.
Like Leaky ReLU, SiLU produces negative outputs for negative inputs, thereby retaining a non-zero gradient in the negative domain. Equation (5) defines the Sigmoid function, σ(x) = 1/(1 + e^−x), where x denotes the input and σ(x) represents the output. When x approaches positive infinity, e^−x approaches zero, and σ(x) approaches 1. Conversely, when x approaches negative infinity, e^−x grows very large, and σ(x) approaches 0. Thus, the Sigmoid function maps any real-valued input to the interval (0, 1), making it widely applicable in probability interpretation and nonlinear modeling.
Building on this, Equation (6) defines the SiLU function, SiLU(x) = x · σ(x), where x is the input and σ(x) is the Sigmoid function described above. By combining a linear term with the smooth nonlinear modulation of the Sigmoid, SiLU maintains an approximately linear response in the positive domain while smoothly preserving a small portion of negative outputs in the negative domain. This design avoids the complete inactivation problem of traditional ReLU in the negative region. Owing to its balance between smoothness and nonlinearity, SiLU ensures more stable gradient propagation, facilitating faster model convergence and improving overall prediction accuracy.
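Equations (5) and (6) correspond directly to the following few lines; torch.nn.SiLU provides the same function natively.

```python
import torch

def sigmoid(x):
    # Equation (5): maps any real input into (0, 1).
    return 1.0 / (1.0 + torch.exp(-x))

def silu(x):
    # Equation (6): a linear term modulated by the smooth sigmoid gate.
    return x * sigmoid(x)

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
print(silu(x))              # small negative outputs survive for x < 0
print(torch.nn.SiLU()(x))   # PyTorch's built-in equivalent
```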
In some cases, using SiLU has resulted in only marginal improvements in accuracy. Consequently, two SiLU-based variants—Modified SiLU and Adaptive SiLU—have been proposed to address this limitation. The motivation behind adopting these modified activation functions is to maintain a high level of accuracy while pursuing improved inference speed.
Equation (7) defines the Modified SiLU function, MoSiLU(x) = x · σ(βx), where x represents the input, MoSiLU(x) denotes the output, and β is a learnable parameter that controls the steepness of the activation curve. When the input βx is large, σ(βx) approaches 1; conversely, when βx is small, σ(βx) approaches 0. Initially, β is set to 1.0, resulting in a fixed curve slope and limited adaptability to different feature levels across the network.
To address this limitation, β is progressively adjusted layer by layer. Specifically, it is first increased to 1.5 to enhance the expression capability of nonlinear features, then gradually reduced to 1.25 to create a smoother curve. This gradual modulation facilitates more stable gradient flow, improves the learning efficiency of different layers, and ultimately enhances inference accuracy.
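A minimal sketch of this variant, assuming Equation (7) takes the x · σ(βx) form implied by the steepness description, with the 1.5 → 1.25 layer-wise schedule applied at construction time:

```python
import torch
import torch.nn as nn

class ModifiedSiLU(nn.Module):
    """SiLU with a learnable steepness parameter: x * sigmoid(beta * x).
    beta is trainable, so each layer can adapt its activation curve."""
    def __init__(self, beta_init=1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(float(beta_init)))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

# Layer-wise schedule from the text: early layers start at beta = 1.5,
# then drop to 1.25 after two layers for a smoother curve.
early_acts = [ModifiedSiLU(1.5), ModifiedSiLU(1.5),
              ModifiedSiLU(1.25), ModifiedSiLU(1.25)]
```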
Equation (8) describes the formulation of MoSiLU with an adaptive scaling factor β(x) computed by a small neural network f. The weight matrix W1 of the first layer and the weight matrix W2 of the output layer of f are both randomly initialized, while the biases b1 and b2 are initialized to zero. This structure functions similarly to the learning behavior of a fixed β but provides greater flexibility by dynamically adjusting the activation strength according to the input signal.
Furthermore, Equation (9) presents the Adaptive SiLU function, where x denotes the input and b is a bias initialized to 0. When b = 0, the function behaves similarly to ReLU around x = 0, yielding an approximately SiLU-shaped response that preserves nonlinear characteristics while adapting to scale variations of different inputs. This design allows the activation function to be input-sensitive and to capture diverse feature representations across different network layers more effectively.
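Combining the pieces described for Equations (8) and (9), one plausible implementation looks like the sketch below: a small two-layer network f produces the adaptive scaling factor β(x), and a bias b (initialized to 0) shifts the gate. The pooling step, hidden width, and exact gating composition are assumptions, since the displayed equations are not reproduced here.

```python
import torch
import torch.nn as nn

class AdaptiveSiLU(nn.Module):
    """Input-adaptive SiLU sketch: a small two-layer network f computes a
    per-input scaling factor beta(x); a bias b (initialized to 0) shifts the
    gate. The exact composition in the paper's equations may differ."""
    def __init__(self, channels, hidden=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # summarize the input signal
        self.w1 = nn.Linear(channels, hidden)      # W1, b1 (b1 zero-initialized)
        self.w2 = nn.Linear(hidden, 1)             # W2, b2 -> scalar beta(x)
        nn.init.xavier_uniform_(self.w1.weight)    # Equation (10) initialization
        nn.init.xavier_uniform_(self.w2.weight)
        nn.init.zeros_(self.w1.bias)
        nn.init.zeros_(self.w2.bias)
        self.b = nn.Parameter(torch.zeros(1))      # bias b, initialized to 0

    def forward(self, x):
        s = self.pool(x).flatten(1)                   # (N, C) channel summary
        beta = self.w2(torch.relu(self.w1(s)))        # adaptive scaling factor
        beta = beta.view(-1, 1, 1, 1)
        return x * torch.sigmoid(beta * x + self.b)   # gated, input-sensitive
```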
According to the Xavier initialization method [27,28], Equation (10) defines the initialization of the weight matrices W1 and W2: the weight values are drawn randomly from a uniform distribution U[−√6/√(n_in + n_out), √6/√(n_in + n_out)]. Here, n_in represents the input dimension of the layer (in this study, the input image size is 1024 × 1024), and n_out stands for the output dimension (corresponding to the number of classes, which is 4 in this case). The coefficient √6 serves as the Xavier initialization factor, which helps maintain stable weight variance during forward propagation.
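Equation (10) amounts to the following initialization routine; the layer sizes in the example are illustrative.

```python
import math
import torch

def xavier_uniform(n_in, n_out):
    """Equation (10): draw weights from U[-sqrt(6)/sqrt(n_in + n_out),
    +sqrt(6)/sqrt(n_in + n_out)] to keep activation variance stable."""
    bound = math.sqrt(6.0) / math.sqrt(n_in + n_out)
    return torch.empty(n_out, n_in).uniform_(-bound, bound)

W1 = xavier_uniform(n_in=256, n_out=16)   # hidden-layer weights (sizes illustrative)
W2 = xavier_uniform(n_in=16, n_out=1)     # output layer -> scalar beta(x)
```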
After initialization, the weights are updated through backpropagation during training. The network performs a forward pass to generate predictions and calculate the loss; it then computes the gradient of the loss with respect to the weights. Finally, an optimizer (such as gradient descent) updates W1 and W2 in the direction that minimizes the error. This process iterates continuously, allowing the model to improve its performance gradually.
In this structure, the input dimension of W1 corresponds to the previous layer’s output, and its output dimension corresponds to the number of neurons in the hidden layer. The input dimension of W2 corresponds to the hidden layer output, and its output dimension is 1, representing the adaptive scaling factor β(x).
As illustrated in Figure 12, the DSGβSI-YOLOv7-tiny architecture first introduces modifications to the backbone. When β is fixed at 1.0, the slope of the activation curve is inflexible, limiting the function’s ability to adapt. To address this issue, Modified SiLU is employed in the early layers (Layers 0, 1, 5, and 10), where β is initially set to 1.5 and then reduced to 1.25 after two layers. This adjustment improves gradient flow in the early stages of the network.
In the later layers of the backbone, the original SiLU activation is retained to ensure gradient stability and maintain a balance between feature extraction and information propagation. Subsequently, the neck adopts Adaptive SiLU, enabling the network to better handle the complex feature combinations of the intermediate stages. Finally, Adaptive SiLU is also applied in the head, allowing the model to dynamically select the most suitable activation behavior for accurate prediction.
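To summarize the placement just described, a configuration-style sketch follows; the grouping names are for exposition only, and the layer indices come from the text above.

```python
# Illustrative mapping of activation placement in DSGβSI-YOLOv7-tiny.
activation_plan = {
    "backbone_early": {"layers": [0, 1],  "act": "ModifiedSiLU(beta=1.5)"},
    "backbone_mid":   {"layers": [5, 10], "act": "ModifiedSiLU(beta=1.25)"},
    "backbone_late":  {"act": "SiLU"},           # gradient stability
    "neck":           {"act": "AdaptiveSiLU"},   # complex feature fusion
    "head":           {"act": "AdaptiveSiLU"},   # input-sensitive prediction
}
```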