1. Introduction
In industrial quality control, semantic segmentation plays a crucial role in automating processes, detecting anomalies, and ensuring product integrity. In semiconductor manufacturing, precise analysis of functional regions is essential for quality control, especially in complex components such as integrated circuits and Thin Film Transistor (TFT) backplanes. These regions must be carefully analyzed during the manufacturing process to ensure proper alignment and functionality. Currently, this process relies heavily on manual image selection and cropping, which is both time-consuming and labor-intensive. Creating a single Automated Optical Inspection (AOI) recipe can take several minutes to hours, significantly hampering manufacturing efficiency and scalability. This project aims to address these limitations by applying deep learning-based semantic segmentation to automate the selection and cropping of functional regions in semiconductor images, including TFT backplanes and other semiconductor components.
Deep learning models have proven effective in various sectors of the semiconductor industry, particularly in defect detection and quality control. For instance, ref. [1] employed deep convolutional neural networks (CNNs) to improve defect inspection in semiconductor wafer manufacturing. Similarly, prototype learning has been shown to efficiently handle defect segmentation in scenarios where background patterns vary [2]. These applications demonstrate the potential of deep learning to generalize well to unseen samples, providing more consistent and accurate results than traditional manual methods. Most of the existing literature on defect inspection in the semiconductor industry focuses on detecting small defects or performing image defect classification [3,4,5]. For instance, CNN-based methods have been successfully used for defect detection in printed circuit boards (PCBs) [6] and bubble segmentation in TR-PCBs (transmitter and receiver printed circuit boards) [7]. However, convolutional models, with their focus on local feature extraction, face limitations when applied to the segmentation of large-scale images typical of semiconductor applications. Our attempts to segment such high-resolution images using models like U-Net [8] and DeepLabV3+ [9] yielded unsatisfactory segmentation results, as shown in Section 5. While these CNN-based methods have been valuable, they struggle with large-scale semantic segmentation tasks due to their limited receptive field and inability to model long-range dependencies. These shortcomings have motivated the exploration of transformer-based architectures, which use attention mechanisms to capture both local and global context more effectively.
Transformer-based models such as Vision Transformer (ViT) [10] and DETR [11] have since emerged as powerful alternatives in computer vision. By leveraging self-attention, they are better suited for tasks like image classification and object detection compared to convolution-based models [12,13]. Although ViT and DETR were initially applied to classification and detection, their strong performance demonstrated that transformers—originally developed for natural language processing—could also excel in vision tasks. This paved the way for their adoption in semantic segmentation, with SETR [14] being the first to use a pure transformer encoder, and TransUNet [15] introducing a hybrid architecture that combined transformers with convolutional decoders for medical image segmentation. These developments marked the beginning of the transformer era in segmentation, which has since become the dominant paradigm in the 2020s.
Building on this momentum, several advanced transformer-based segmentation models have been proposed. The Segment Anything Model (SAM) [16] introduces generalized zero-shot, interactive segmentation but lacks class-specific automation, making it less suitable for industrial AOI workflows. Swin-Unet [17], which integrates U-Net with a Swin Transformer [18] backbone, achieves excellent segmentation performance in medical images but suffers from slower inference due to its deep decoder. Other unified models, such as OneFormer [19] and Mask2Former [20], combine semantic, instance, and panoptic segmentation into a single architecture. While flexible, their complexity and computational demands make them impractical for high-throughput AOI systems that require only semantic segmentation.
Among existing models, SegFormer [21] offers an appealing balance between performance and efficiency. It uses a hierarchical transformer encoder with a lightweight decoder, enabling effective semantic segmentation with fewer parameters. When tested on our dataset, SegFormer outperformed CNN-based and other transformer models. However, SegFormer still has limitations. Its transformer layers are computationally demanding on high-resolution images, and it emphasizes global features at the expense of fine-grained local details—an important consideration in semiconductor inspection tasks that require precise boundary delineation.
To address these shortcomings, we propose a hybrid model that enhances SegFormer by integrating atrous convolutions into its encoder. This hybrid design leverages the global context modeling of transformers and the local feature extraction of convolutions, resulting in sharper segmentation masks and improved performance on fine structures. This modification not only enhances segmentation accuracy but also improves inference efficiency. Atrous convolutions are less computationally expensive than the transformer blocks they partially replace, reducing latency—an essential requirement for real-time industrial applications. Additionally, our method supports image downscaling during training and parallel processing of image patches, both of which contribute to substantial speed-ups during deployment. We also employ post-processing techniques to refine segmentation masks, extract contours, and identify points of interest (POIs), facilitating the automatic creation of AOI recipes. By automating the segmentation and cropping of functional regions in semiconductor images, our method drastically reduces manual effort and enhances throughput in high-volume manufacturing environments.
To the best of our knowledge, this work presents the first fully automated deep learning pipeline specifically designed for functional region segmentation in high-resolution semiconductor images, such as TFT backplanes. Unlike previous studies focused on defect classification or localized anomaly detection, our region segmentation approach enables a seamless integration into AOI automation, especially in cases where defects cannot be easily classified. The key contributions of this work are:
A novel hybrid deep learning model whose encoder combines the hierarchical transformer of SegFormer with atrous convolutional networks. This design enhances local feature extraction while maintaining spatial resolution, resulting in sharper segmentation boundaries and improved overall segmentation accuracy for high-resolution semiconductor images.
An optimized inference pipeline for large-scale, high-resolution images. The model supports flexible image downscaling during inference to accelerate processing while maintaining accuracy. It also uses patch-based batch inference, allowing multiple patches to be processed in parallel by leveraging available computational resources, significantly improving inference speed for large images.
A post-processing algorithm that refines the segmentation masks to support AOI recipe creation. This algorithm removes false positives and false negatives from the masks and simplifies the extracted contours using polygonal approximation. This refinement enables the accurate extraction of regions of interest (ROIs) and points of interest (POIs), which are critical for downstream AOI tasks.
The remainder of this paper is structured as follows: Section 2 describes the AOI recipe creation process and explains how the segmentation model is integrated into the system. Section 3 presents the model architecture and the post-processing algorithms. Section 4 details the implementation on our dataset and the inference optimization strategies. Section 5 discusses the experimental results, and Section 6 concludes the paper.
3. Methodology
Our approach to effective application of deep learning to the creation of AOI recipes involves five key steps: preparing the training data, defining the model architecture, training the model on the dataset, testing the model on unseen images, and post-processing the segmentation masks to extract the contours of the ROIs and the coordinates of the POIs. For the segmentation, we propose a deep learning-based segmentation model that combines transformers and atrous convolution with a Multi-Layer Perceptron (MLP) decoder. This architecture is designed to effectively segment the ROIs, and the POIs are subsequently determined based on the segmented ROIs, which are crucial in the AOI recipe creation process. In this section, we define the model’s architecture and describe the post-processing techniques used to extract the ROIs and POIs. The steps related to preparing the data, training the model, and testing the model on unseen images will be covered in the Implementation (Section 4).
3.1. Segmentation Model
3.1.1. Encoder Design
The encoder, as shown in Figure 4, features a hybrid design with two parallel paths: a transformer path similar to SegFormer’s encoder and a convolutional network path utilizing dilated convolutions. This hybrid architecture is designed to combine the strengths of both transformers and convolutions to improve both local and global feature extraction capabilities.
This approach was considered after observing that, when training both DeepLabV3+ and SegFormer-B0 on our dataset, the models performed differently across classes and had varying effects on the edges of the predicted segmentation maps. The goal was to maintain the high overall segmentation accuracy that SegFormer provided while improving the accuracy of specific classes, particularly the edge definition of the under-represented classes that SegFormer struggled with. Additionally, we aimed to enhance the definition of edges in the segmentation maps, which DeepLabV3+ delineated better than SegFormer.
The transformer block used here is the efficient self-attention block [21]. It reduces the computational requirement of transformers and aligns with our aim to develop a model for fast inference. The efficient self-attention block is a computationally optimized version of the original scaled dot-product attention from [22], estimated as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{\mathrm{head}}}}\right)V$$

where $Q$ is the query matrix, $K$ the key matrix, $V$ the value matrix, and $\frac{1}{\sqrt{d_{\mathrm{head}}}}$ is a scaling factor. In the original transformer paper [22], for heads of dimension $d_{\mathrm{head}}$ and an input sequence of length $N$, where $N = H \times W$, the computational complexity is $O(N^{2})$. The efficient self-attention from [23] reduces the complexity to $O\left(\frac{N^{2}}{R}\right)$ by decreasing the spatial dimensions of the input sequence. Specifically, it performs the following transformation:

$$\hat{K} = \mathrm{Reshape}\left(\frac{N}{R},\, C \cdot R\right)(K), \qquad K = \mathrm{Linear}(C \cdot R,\, C)(\hat{K})$$

which reduces the dimension of $K$ to $\frac{N}{R} \times C$, where $R$ is a reduction ratio, $\mathrm{Reshape}(\frac{N}{R}, C \cdot R)(K)$ means reshaping $K$ to have the shape $\frac{N}{R} \times (C \cdot R)$, and $\mathrm{Linear}(C_{\mathrm{in}}, C_{\mathrm{out}})(\cdot)$ refers to a linear layer taking a $C_{\mathrm{in}}$-dimensional tensor as input and generating a $C_{\mathrm{out}}$-dimensional tensor as output. $d_{\mathrm{head}}$ is the dimension of each of the attention heads, $N = H \times W$ is the length of the sequence, with $H$ and $W$ the height and width of the image or patch, and $C$ is the channel dimension.
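To make the sequence-reduction step concrete, the following PyTorch sketch implements an efficient self-attention block under stated assumptions: the channel size, head count, and reduction ratio are illustrative, and the spatial reduction is performed with a strided convolution (as in SegFormer's reference implementation, which shortens the key/value sequence by $R$ in each spatial dimension) rather than the mathematically equivalent reshape-plus-linear formulation above. It is a minimal sketch, not the exact block used in our model.

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Minimal sketch of SegFormer-style efficient self-attention.

    The key/value sequence of length N = H*W is shortened by a strided
    convolution before attention, so the attention cost drops from
    O(N^2) to roughly O(N^2 / R^2). Hyperparameters are illustrative.
    """

    def __init__(self, dim: int, num_heads: int = 1, reduction_ratio: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Spatial reduction: an R x R strided conv shrinks H and W by R.
        self.reduce = nn.Conv2d(dim, dim, kernel_size=reduction_ratio,
                                stride=reduction_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) with N = h * w
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)      # back to a feature map
        kv = self.reduce(kv)                            # (B, C, h/R, w/R)
        kv = kv.flatten(2).transpose(1, 2)              # (B, N/R^2, C)
        kv = self.norm(kv)
        out, _ = self.attn(query=x, key=kv, value=kv)   # queries keep full length
        return out

# Usage: a 32x32 feature map with 64 channels.
x = torch.randn(2, 32 * 32, 64)
attn = EfficientSelfAttention(dim=64, reduction_ratio=4)
print(attn(x, 32, 32).shape)  # torch.Size([2, 1024, 64])
```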
The transformer path in the proposed model closely follows the structure of SegFormer’s encoder, which processes the input image through four stages. Each stage progressively reduces the spatial resolution while increasing the semantic richness of the feature maps. However, in our design, only the outputs of the last two stages (with channel dimensions $C_3$ and $C_4$) are passed to the decoder. The first two stages are replaced by convolution blocks, which are more efficient at extracting local features than the transformer blocks.
The convolution blocks used here are dilated convolutions. The choice of dilated convolution blocks was inspired by their use in DeepLabV3+ and the general ability of convolutional networks to better extract local features. We opted for atrous convolution instead of standard convolution to slightly expand the receptive field, facilitating a smooth transition from convolution to transformer layers without increasing the kernel size, which would add more parameters to the model.
The output $y[i]$ of a dilated convolution for a one-dimensional input $x$ with a filter $w$ of length $K$ is defined as:

$$y[i] = \sum_{k=0}^{K-1} x[i + r \cdot k]\, w[k]$$

where $r$ is the dilation rate. In this context, $x[i]$ represents a data point in the input sequence $x$, with $i$ being the index of each element of $x$. $w$ is the convolution kernel (or filter) of length $K$, and it contains the weights applied to the input sequence $x$ to compute the output $y[i]$. The dilation rate $r$ determines the spacing between the elements of the filter as it convolves over the input sequence. When $r = 1$, the operation reduces to a standard convolution, where each filter element interacts with consecutive input elements. For higher values of $r$, the filter elements are spaced apart by $r$ positions in the input sequence, effectively enlarging the receptive field without increasing the kernel size. To compute the output $y[i]$, the weighted sum of the input values is calculated, with the indices determined by the dilation rate. For example, consider a kernel $w = [w_0, w_1, w_2]$ of length $K = 3$ and a dilation rate $r = 2$. The output is given by $y[i] = x[i]\,w_0 + x[i+2]\,w_1 + x[i+4]\,w_2$; the kernel elements are spaced by 2 positions, meaning the corresponding input elements are selected with gaps in between. When the indices exceed the range of the input sequence, or when early input elements are not involved in the computation, the input sequence is zero-padded at the boundaries to handle these cases.
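A minimal PyTorch check of this definition follows; the input values and all-ones filter weights are illustrative. With $K = 3$ and $r = 2$, each output element is the sum of three inputs spaced two positions apart.

```python
import torch
import torch.nn as nn

# 1-D dilated convolution demo: y[i] = x[i]*w0 + x[i+2]*w1 + x[i+4]*w2
x = torch.arange(8, dtype=torch.float32).reshape(1, 1, 8)  # [0, 1, ..., 7]
conv = nn.Conv1d(1, 1, kernel_size=3, dilation=2, bias=False)
conv.weight.data = torch.ones(1, 1, 3)                     # w = [1, 1, 1]

# No padding here, so the output is shorter: 8 - (3 - 1) * 2 = 4 elements.
# y[0] = x[0] + x[2] + x[4] = 0 + 2 + 4 = 6, and so on.
print(conv(x).detach())  # tensor([[[ 6.,  9., 12., 15.]]])
```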
The convolution path consists of two dilated convolution blocks, designed to produce feature maps with channel dimensions matching those of the first two transformer stages ($C_1$ and $C_2$, respectively). The dilated convolution blocks preserve spatial resolution while expanding the receptive field, allowing them to extract features across multiple scales without losing geometric integrity. The outputs of the convolutional path replace the outputs of the first two transformer stages in the encoder’s final output.
The outputs of the convolutional path (representing the first two stages) are concatenated with the outputs of the last two transformer stages. This combined output, with total channel dimension $C_1 + C_2 + C_3 + C_4$ from all four stages, is passed to the decoder for feature upsampling and segmentation mask prediction.
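The structural idea can be sketched as follows. This is a simplified, hedged illustration rather than our exact implementation: the channel sizes are placeholders, and plain strided convolutions stand in for the two retained efficient-attention transformer stages.

```python
import torch
import torch.nn as nn

def dilated_block(c_in, c_out, dilation):
    # padding = dilation keeps H and W constant for a 3x3 kernel.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class HybridEncoder(nn.Module):
    """Structural sketch of the hybrid encoder. Channel sizes and the
    stand-ins for the two retained transformer stages are illustrative,
    not the paper's exact configuration."""

    def __init__(self, channels=(32, 64, 160, 256)):
        super().__init__()
        c1, c2, c3, c4 = channels
        # Convolutional path: replaces transformer stages 1-2 and
        # preserves spatial resolution (a different dilation per block).
        self.conv1 = dilated_block(3, c1, dilation=2)
        self.conv2 = dilated_block(c1, c2, dilation=4)
        # Transformer path: strided convs stand in for the efficient
        # self-attention stages that produce the stage-3/4 features.
        self.stage3 = nn.Conv2d(c2, c3, 3, stride=16, padding=1)
        self.stage4 = nn.Conv2d(c3, c4, 3, stride=2, padding=1)

    def forward(self, x):
        f1 = self.conv1(x)        # (B, C1, H,    W)
        f2 = self.conv2(f1)       # (B, C2, H,    W)
        f3 = self.stage3(f2)      # (B, C3, H/16, W/16)
        f4 = self.stage4(f3)      # (B, C4, H/32, W/32)
        return [f1, f2, f3, f4]   # all four maps go to the MLP decoder

feats = HybridEncoder()(torch.randn(1, 3, 128, 128))
print([f.shape for f in feats])
```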
To summarize, the hybrid encoder design addresses specific challenges observed with existing models when applied to our dataset:
Boundary irregularities in segmentation maps: While SegFormer achieves high segmentation accuracy, its segmentation maps suffer from boundary irregularities. In particular, edges that should be sharp and straight, such as those in the alignment crosses in Figure 5, instead exhibit watershed-like effects where edges deviate from expected straight-line structures. As shown in the figure, the lack of sharpness in the segmentation map makes the output unsuitable for practical applications. Even post-processing techniques fail to correct these irregularities, further limiting the usefulness of the model for automated region selection. This problem is not limited to alignment references but is also observed across other segmentation classes.
Transformers’ Weakness in Fine-Grained Details: Although the hierarchical transformer encoder excels at extracting local and global features, the self-attention mechanism in its early stages tends to emphasize global relationships across the feature map. This can blur small, precise structures, such as sharp edges, as seen in the segmentation maps in Figure 5.
Convolutions for Spatial Preservation: Convolutions, by contrast, inherently encode spatial locality, making them better at detecting and preserving fine-grained structures such as edges. The dilated convolutions used in our convolution path further enhance the ability to capture multi-scale features without losing geometric integrity by assigning different dilation rates to the two convolution blocks. To ensure that the spatial dimensions remain constant throughout the convolutional path, we carefully adjust the dilation rates and padding values in each dilated convolution block. This design allows the convolution path to extract features across scales while preserving the original spatial resolution, ensuring that no information is lost at the edges or in smaller regions. By maintaining constant spatial dimensions, the convolutional path eliminates the need for upsampling operations, which are required in the transformer path to restore reduced spatial dimensions. This not only simplifies the decoder design but also reduces computational overhead during inference.
Efficiency: The convolution blocks are computationally lighter than the transformer blocks they replace, simplifying the model for faster inference. Their ability to retain spatial resolution without requiring additional upsampling further contributes to the model’s efficiency.
3.1.2. Decoder Design
The decoder, as shown in Figure 6, consists of two main components: the MLP Layer and the Post-fusion Block.
The MLP Layer processes the feature outputs from the four encoder stages. In this layer, each feature map is first passed through a convolutional projection to unify the channel dimensions to a consistent size $C$. The projected feature maps are then upsampled to match the spatial resolution of the highest-resolution feature map. Once aligned and upsampled, the feature maps are concatenated along the channel dimension to form a single tensor with $4C$ channels. This concatenation allows the decoder to effectively combine multi-scale information from the different encoder stages.
In the post-fusion block, the concatenated features undergo a series of refinement operations. First, the features are passed through batch normalization, which normalizes the activations across the batch. Then, a ReLU activation function introduces non-linearity, followed by dropout to prevent overfitting. After these operations, the number of channels is reduced from $4C$ to $N_{\mathrm{cls}}$, where $N_{\mathrm{cls}}$ is the number of segmentation classes, using a $1 \times 1$ convolution. This final convolution generates the pixel-level classification map (logits), which corresponds to the predicted segmentation mask.
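A hedged PyTorch sketch of this decoder follows; the channel sizes, dropout rate, and number of classes are illustrative placeholders, not our trained configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPDecoder(nn.Module):
    """Sketch of the decoder: per-stage 1x1 projections to C channels,
    upsampling to a common resolution, concatenation, then the
    post-fusion block (BN -> ReLU -> dropout -> 1x1 conv to logits)."""

    def __init__(self, in_channels=(32, 64, 160, 256), C=256, num_classes=5):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(ci, C, 1) for ci in in_channels)
        self.post_fusion = nn.Sequential(
            nn.BatchNorm2d(4 * C),              # normalize concatenated features
            nn.ReLU(inplace=True),              # non-linearity
            nn.Dropout2d(0.1),                  # regularization
            nn.Conv2d(4 * C, num_classes, 1),   # 4C -> N_cls logits
        )

    def forward(self, feats):
        target = feats[0].shape[2:]             # highest-resolution map
        ups = [F.interpolate(p(f), size=target, mode='bilinear',
                             align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.post_fusion(torch.cat(ups, dim=1))  # (B, N_cls, H', W')

# Example with the (illustrative) stage shapes from the encoder sketch:
feats = [torch.randn(1, 32, 128, 128), torch.randn(1, 64, 128, 128),
         torch.randn(1, 160, 8, 8), torch.randn(1, 256, 4, 4)]
print(MLPDecoder()(feats).shape)  # torch.Size([1, 5, 128, 128])
```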
3.2. Post-Processing of Segmentation Masks
After obtaining the results of the segmentation process, the segmentation masks are analyzed using computer vision and digital image processing techniques to extract the ROIs and POIs necessary for the AOI recipe creation. The focus of these techniques is on detecting the contours of each class in the segmentation mask and approximating them with polygons, which simplifies the extraction of ROIs and POIs.
The steps involved in contour simplification and POI extraction are outlined in Algorithm 1. In step 1, the segmentation masks are converted from RGB to HSV (Hue, Saturation, Value) color space, which allows for better separation of the regions based on color. Separating color (hue) from brightness (value) improves the algorithm’s ability to identify distinct classes. In steps 2 to 4, each segmentation class is identified based on its unique HSV values, and individual class contours are detected. The number of contours for each class is calculated, and false positives—regions incorrectly assigned to a class—are eliminated. A false positive occurs when a region is mistakenly classified as part of a class to which it does not belong, such as when the AA or background class is incorrectly identified as part of the IC class.
In step 5, false negatives are addressed using contour hierarchy to detect nested contours. False negatives occur when regions that should belong to a class are missed, often because they are incorrectly labeled or not detected at all. Using contour hierarchy, child contours fully enclosed by parent contours are discarded, ensuring only valid, larger contours are retained.
To prevent contours from different classes from merging, the algorithm processes each class individually, repeating Steps 3 to 5 for each class. This guarantees that contours are correctly associated with the appropriate class.
After eliminating false positives and false negatives, the contours are simplified in step 7 using the Ramer–Douglas–Peucker (RDP) algorithm, which reduces the number of vertices while preserving the overall shape of the contours, making them easier to edit when misalignment occurs between the polygon and the area of interest. The POIs are then detected and stored along with the vertices of the polygons.
Algorithm 1: Pseudocode of post-processing of segmentation maps to extract ROIs and POIs

Step 1: Convert the segmentation map from RGB to HSV color space
    Use OpenCV to convert the image from RGB to HSV
Step 2: Identify classes and their assigned HSV colors in the segmentation map
    Extract unique HSV color values
    Assign each color to a corresponding class
Step 3: For each class in the segmentation map:
    Isolate segments of the class using its HSV color
    Find contours of the class instances
    Count the number of contours as the number of predicted instances
Step 4: Eliminate false positives:
    For each contour, calculate the enclosed area $A$
    Calculate the threshold area using $A_{\mathrm{thresh}} = \rho_{\min} \cdot A_{\max}$,
    where $\rho_{\min}$ is the minimum area ratio and $A_{\max}$ is the largest area among the contours
    If $A < A_{\mathrm{thresh}}$, discard the contour
Step 5: Eliminate false negatives:
    Use contour hierarchy to detect nested contours
    Discard child contours that are fully enclosed by parent contours
    Retain only parent contours for valid instances
    Also discard false positives from other classes if not removed in Step 4
Step 6: To prevent contour merging of different classes:
    For each class, repeat Steps 3–5, analyzing one class at a time
Step 7: Simplify the contours
    Use the Ramer–Douglas–Peucker algorithm to approximate the contours with polygons having fewer vertices
    Detect the POIs and save them along with the coordinates of the vertices
Step 8: Draw the polygons on the original image
    Fill the polygons using alpha blending for transparency, such that the structures of the image can be seen through the mask
    The blended image is computed as $I_{\mathrm{blend}} = \alpha \cdot I_{\mathrm{mask}} + (1 - \alpha) \cdot I_{\mathrm{orig}}$,
    where $\alpha$ controls the transparency level of the blended image
Step 9: Display the blended image for user editing when needed
    Make the vertices of the polygons more pronounced; the enlarged vertices are easy to spot and edit if needed
Step 10: Send the vertices and POIs to the inspection stage
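A condensed OpenCV sketch of Steps 1–7 is given below. The helper name, the minimum-area ratio, and the fixed RDP tolerance factor are illustrative assumptions; the full algorithm additionally handles POI detection, alpha-blended display, and user editing (Steps 8–10).

```python
import cv2
import numpy as np

def extract_rois(seg_mask_rgb, class_colors_hsv, min_area_ratio=0.05):
    """Hedged sketch of Algorithm 1, Steps 1-7. `class_colors_hsv` maps
    class name -> exact HSV color; `min_area_ratio` (rho_min) and the
    RDP epsilon factor are illustrative values, not the paper's."""
    hsv = cv2.cvtColor(seg_mask_rgb, cv2.COLOR_RGB2HSV)          # Step 1
    rois = {}
    for name, color in class_colors_hsv.items():                # one class at a time (Step 6)
        # Exact-color match isolates this class's segments (Step 3).
        mask = cv2.inRange(hsv, np.array(color), np.array(color))
        # RETR_CCOMP yields a two-level hierarchy: parents and holes.
        contours, hierarchy = cv2.findContours(mask, cv2.RETR_CCOMP,
                                               cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            continue
        areas = [cv2.contourArea(c) for c in contours]
        a_thresh = min_area_ratio * max(areas)                   # Step 4
        polys = []
        for c, a, h in zip(contours, areas, hierarchy[0]):
            if a < a_thresh:          # false positive: area too small
                continue
            if h[3] != -1:            # Step 5: nested child contour, discard
                continue
            eps = 0.01 * cv2.arcLength(c, True)                  # Step 7: RDP
            polys.append(cv2.approxPolyDP(c, eps, True))
        rois[name] = polys
    return rois
```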
The RDP algorithm simplifies polygons by reducing the number of vertices while retaining the overall shape. This method is particularly useful in contour simplification for segmentation masks. The algorithm works by recursively eliminating points that lie within a defined tolerance from the line segment joining adjacent points.
To improve the quality of the contours for user editing, the tolerance value $\epsilon$ is dynamically adjusted based on the polygon’s area: smaller areas are simplified with a higher $\epsilon$, while larger areas are simplified with a lower $\epsilon$, preserving more of the original shape. This behavior can be modeled by a bounded decay of the form:

$$\epsilon(A) = \epsilon_{\min} + (\epsilon_{\max} - \epsilon_{\min})\, e^{-kA}$$

where $\epsilon_{\min}$, $\epsilon_{\max}$, and $k$ are experimentally determined, and $A$ is the polygon’s area.
The values of $\epsilon_{\min}$, $\epsilon_{\max}$, and $k$ were determined empirically, taking into account the surface areas of the various ROIs in the dataset, as well as the approximate number of vertices needed to accurately represent the contours of each ROI. These estimates were based on observations made during both the image annotation phase and model testing. Since most ROIs exhibit relatively well-defined geometric shapes, we were able to approximate the number of vertices required to describe their contours using a few representative examples.
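As an illustration, a minimal sketch of the area-dependent tolerance under the assumptions above; the exponential form and the constants here are placeholders, not our empirical values.

```python
import math
import cv2

EPS_MIN, EPS_MAX, K = 1.0, 8.0, 1e-5  # illustrative constants

def simplify_contour(contour):
    # Small areas get a tolerance near EPS_MAX (aggressive simplification);
    # large areas decay toward EPS_MIN, preserving more of the shape.
    area = cv2.contourArea(contour)
    eps = EPS_MIN + (EPS_MAX - EPS_MIN) * math.exp(-K * area)
    return cv2.approxPolyDP(contour, eps, True)
```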
Once simplified, the contours are drawn on the original image in step 8, with the polygons filled using alpha blending for transparency. This allows the underlying structures of the image to remain visible through the mask, providing a clear view of the segmentation result. The blended image is then displayed in step 9 for user editing. The vertices of the polygons are highlighted to make them easier to spot and adjust if necessary. Finally, in step 10, the vertices and POIs are sent to the inspection stage, where they are used to guide the inspection process and identify defects in the corresponding ROIs.
4. Implementation
We implemented the proposed model by training it on a custom dataset created from TFT backplane images. We then applied techniques to reduce the inference time and compared our segmentation results with those of other models, such as U-Net, DeepLabV3+, and SegFormer-B0.
4.1. Image Acquisition System
The images used for training and evaluation were acquired using a commercial Automated Optical Inspection (AOI) system, model HIO-FPD-A1, developed by HIO Technology (Huzhou) Ltd. (Huzhou, China). This system is purpose-built for high-resolution imaging of TFT backplanes in display screens. It supports a range of magnifications and employs line-scan imaging to generate ultra-high-resolution images (up to 16K) with a spatial resolution of 3 μm. The AOI system is capable of inspecting flat panel displays ranging from 2 to 6 inches in size.
Image acquisition was carried out in a controlled cleanroom environment compliant with ISO Class 7 standards to minimize particulate contamination. The ambient temperature was maintained at 22 ± 1 °C and the relative humidity was controlled between 40 and 50%, effectively replicating the environmental conditions of typical industrial TFT inspection lines.
4.2. Model Training
To create a multi-class dataset, annotations were meticulously done using LabelMe [24], an interactive annotation tool that enables precise pixel-level labeling. Post-annotation, JSON files were converted into PNG masks for compatibility with the training pipeline and divided into training and validation sets. We annotated a total of 703 images for training and 180 images for validation, all of them high-resolution with varying sizes. We then partitioned the annotated images and their corresponding masks into fixed-size patches to reduce the computational bottleneck of training the model directly on large images, forming the final training and validation patch sets. Even though the images are very large, when they are divided into patches, some areas of interest end up in only one or a few of these patches. This results in a limited number of samples for certain classes, leading to a class imbalance in the dataset, which made it challenging for the models we trained to effectively predict all classes.
During training, we closely monitored overall metrics such as accuracy, mIoU, F1 score, and the losses. However, due to the highly imbalanced nature of the dataset, these overall metrics, averaged across all classes, proved to be poor indicators of the model’s true performance. Specifically, the metrics were dominated by the more frequent classes, to the extent that the training and validation results did not align with the actual segmentation performance during inference. High training accuracy was primarily driven by the dominant classes, while the less frequent classes showed poorer results during inference. To better reflect the model’s performance and ensure consistency between training metrics and inference results, we evaluated the metrics for each class individually. This approach allowed us to assess the model’s performance on specific classes during training, revealing that abundant classes in the dataset had better metrics, while the model struggled with less frequent classes. These class-specific metrics were more accurately reflected in the inference results. This prompted us to use the focal loss function during training, which effectively mitigated the class imbalance by giving more weight to the underrepresented classes. This adjustment resulted in more balanced performance across both frequent and less frequent classes. The training process was performed on a system equipped with an NVIDIA GeForce RTX 3090 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA), using a batch size of 10 and a fixed learning rate.
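For reference, a minimal multi-class focal loss sketch follows; the $\alpha$ and $\gamma$ values shown are common defaults, not necessarily our settings.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t):
    down-weights easy, well-classified pixels so that rare classes
    contribute more to the gradient."""
    ce = F.cross_entropy(logits, targets, reduction='none')  # -log(p_t) per pixel
    p_t = torch.exp(-ce)                                     # recover p_t
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()

# logits: (B, num_classes, H, W); targets: (B, H, W) integer class map
loss = focal_loss(torch.randn(2, 5, 64, 64), torch.randint(0, 5, (2, 64, 64)))
```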
4.3. Prediction and Model Inference
After successfully training the model, we evaluated its performance on a test set to predict segmentation masks for unseen TFT backplane images. This step validated the model’s ability to generalize and demonstrated its effectiveness in accurately segmenting high-resolution TFT images. Given the high segmentation accuracies, we then shifted our focus to optimizing inference speed, utilizing techniques such as image downscaling and data parallelism, all while ensuring that accuracy was not compromised.
4.3.1. Image Downscaling for Inference
When dealing with large images, it is often beneficial to resize them to expedite processing. In our task, large TFT backplane images, which can consist of 60 to over 100 patches, require significant processing time during segmentation, and we aim to minimize the segmentation time for these large images.
While we allow the model’s image processor to automatically resize all input images to its fixed input resolution during training, we found that, during inference, pre-resizing larger patches externally to that resolution before passing them to the model significantly reduced inference time. This approach effectively skips part of the resizing overhead within the image processor, reducing inference time by a factor greater than 3 without affecting segmentation accuracy. By processing fewer large patches, externally downscaled to the model’s input resolution, rather than numerous smaller patches, the computational burden is significantly reduced. For example, a pre-processed TFT image produces only 66 patches at the larger patch size, compared to 881 patches at the original patch size. Processing the smaller number of resized larger patches is computationally much faster while still retaining the high segmentation accuracy of the model.
Our experiments revealed that the patch size used during training significantly impacts the model’s ability to generalize. TFT backplane images contain intricate patterns and several variations, as can be seen in Figure 7, that challenge semantic segmentation models, especially when the patch sizes used in training and testing differ. When trained on patches of a single size, the model learns features and contextual information specific to that resolution, performing well only on test patches of the same size despite the use of data augmentation during training. Similarly, models trained on other patch sizes achieve high accuracy only when tested on patches matching their training resolution. This behavior suggests that, despite the image processor resizing all input images before passing them to the model, the learned features remain resolution-dependent, making it difficult for the model to generalize effectively across different patch sizes. This pattern was observed in all the models we trained on the dataset.
To address this limitation and ensure that the model generalizes across multiple patch sizes, we trained it on a hybrid dataset containing patches of several different resolutions. Because the model’s image processor automatically resizes all inputs to a fixed resolution, the model can seamlessly process this hybrid dataset without additional adjustments. Training on patches of different resolutions increased the model’s multi-resolution feature learning, improving its generalization performance across patches of varying sizes during inference.
To ensure minimal information loss during downscaling while preserving high segmentation accuracy, we employed the INTER_AREA interpolation method from OpenCV [25] for resizing. This method, optimized for efficient downscaling, averages pixel areas to minimize artifacts such as moiré patterns and aliasing effects, which are common with simpler interpolation methods. After inference, the predicted segmentation masks are re-upscaled to the original patch resolution using bilinear or bicubic interpolation. This ensures that the reconstructed large masks match the original image size and maintain high segmentation quality.
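The rescaling step can be sketched with OpenCV as follows; the target size is a placeholder for the model’s input resolution.

```python
import cv2

def downscale_for_inference(patch, size=(512, 512)):
    # INTER_AREA averages source-pixel areas, which suppresses moire
    # patterns and aliasing when shrinking large patches.
    return cv2.resize(patch, size, interpolation=cv2.INTER_AREA)

def upscale_mask(mask, original_hw):
    # Restore the predicted mask to the original patch size.
    h, w = original_hw
    return cv2.resize(mask, (w, h), interpolation=cv2.INTER_LINEAR)
```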
4.3.2. Data Parallelism for Inference
To further reduce the inference time of large image segmentation tasks, we effectively used parallel computing. By dividing the image into patches and processing them independently, we harnessed the parallel processing capabilities of both the CPU and the GPU to achieve substantial speedup. In the inference process shown in Figure 8, we found that part of the bottleneck in the task flow lies in the sequential feeding of image patches to the model. To overcome this limitation, we parallelize this step, allowing multiple patches to be sent simultaneously to the model for prediction. This allows the GPU, which inherently supports parallel processing, to work on multiple patches concurrently, resulting in a significant reduction in overall inference time.
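A hedged sketch of the batched patch feeding follows; the batch size and function names are illustrative.

```python
import torch

@torch.no_grad()
def predict_patches(model, patches, batch_size=16, device='cuda'):
    """Feed patches to the model in batches so the GPU processes many
    patches concurrently instead of one at a time."""
    model.eval().to(device)
    masks = []
    for i in range(0, len(patches), batch_size):
        batch = torch.stack(patches[i:i + batch_size]).to(device)  # (B, 3, H, W)
        logits = model(batch)                                      # (B, N_cls, H, W)
        masks.append(logits.argmax(dim=1).cpu())                   # per-pixel labels
    return torch.cat(masks)
```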
Our approach aligns with Gustafson’s law (5) [26], which offers insights into the potential speed-up achievable through parallelization:

$$S = (1 - p) + p \cdot N \tag{5}$$

where $p$ is the portion of the code that is parallelizable, $N$ is the number of processors, and $S$ is the speedup gained through parallelization.
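For illustration, a minimal calculation follows; the parallel fraction used here is an assumed value, not a measurement from our system.

```python
def gustafson_speedup(p: float, n: int) -> float:
    """S = (1 - p) + p * N, Gustafson's law (Equation (5))."""
    return (1.0 - p) + p * n

# Illustrative: 90% parallelizable code on 6 and 24 processors.
print(gustafson_speedup(0.9, 6))   # 5.5  (6-core inference machine)
print(gustafson_speedup(0.9, 24))  # ~21.7 (24-core training machine)
```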
We estimated the anticipated speedup gained from parallelizing the patch feeding process on a computer equipped with an i5-12400F CPU with 6 cores. The results underscore the tangible improvement achieved through parallelizing part of the segmentation process. According to Gustafson’s law, we anticipated a further reduction in inference time on systems with more than 6 processors, which was confirmed when we tested the model on the 24-core computer used for training.