1. Introduction
Lithography is the core step in the manufacturing of very-large-scale integration (VLSI) circuits. The purpose of lithography is to transfer a designed pattern layout from a photomask to a silicon wafer through an optical system. With extreme ultraviolet (EUV) light used as the source, the critical dimension (CD) of the mask pattern continues to shrink, and the imaging quality of lithography is severely affected by optical proximity effects (OPEs) [1] and other effects, such as shadowing [2,3,4,5]. Hence, resolution enhancement technologies (RETs), including source mask optimization (SMO), have been introduced to address the image distortion problem in lithography for 22 nm nodes and above [6].
Since Rosenbluth et al. [7] introduced the concept of SMO in 2001, many SMO methods based on different optimization algorithms and strategies have been developed. These optimization methods can be categorized into two types: gradient-based and heuristic. The former includes algorithms such as gradient descent (GD) [8,9,10], conjugate gradient descent (CGD) [11,12], deep learning [13], and level-set-based methods [14]. These methods have high optimization efficiency, but they cannot guarantee finding the global optimum. Heuristic algorithms include the genetic algorithm (GA) [15,16] and particle swarm optimization (PSO) [17]. Their main characteristic is the ability to search for optimal points globally through randomized exploration.
Although existing SMO methods satisfy the needs of many scenarios, for 32 nm nodes and above, the tape-out process needs multiple repetitions of SMO. Regardless of the SMO method, the optimization process is computationally intensive and time consuming when applied to every pattern on a full chip. There are billions of patterns at the full-chip scale, the majority of which have almost no effect on the final pixelated-source result. To balance time consumption against the quality of the optimization results, Tsai et al. [18] proposed that SMO should be applied only to critical patterns. The representative pattern set should be small, yet it should include most of the features required for SMO; therefore, the effectiveness and efficiency of the pattern selection method affect the final full-chip SMO result [18].
The pattern selection method involves clustering similar patterns and selecting the most representative pattern from each group to process through SMO. After pattern selection, the number of patterns to be optimized decreases by at least one order of magnitude, while the optimization result should still have an acceptable process window performance [18].
Several pattern selection methods based on different ideas have been proposed. IBM Co. Ltd. developed a pattern selection method based on image clustering [19,20]. The central concept of this technique is to transfer pattern images to a new domain through specific transforms, such as the Fourier transform, thus enabling existing clustering methods to cluster the images. However, conventional image-processing-based methods cannot effectively extract the underlying features of patterns, resulting in redundant and biased critical pattern selection results. Additionally, the number of clusters must be set manually before the pattern clustering process starts. In contrast, the pattern selection technique of ASML Co. Ltd. (Veldhoven, The Netherlands) is based on diffraction signatures [18,21]. This method has already been integrated into the commercial computational lithography software Tachyon (Denver, Colorado). The method primarily involves three steps: diffraction signature extraction, checking cover relationships, and finalizing the critical pattern selection. The algorithm uses the diffraction information of different patterns to extract a diffraction signature. After the diffraction signatures of different patterns are compared through cover relationships, the patterns are clustered, and the critical pattern of each cluster can be selected. Liao et al. [22] proposed a novel method based on ASML’s, with a more precise diffraction signature description and widths in eight selected directions. Compared to IBM’s image-clustering approach, the selection results of the diffraction signature algorithm have a better process window and fewer redundancies. However, its limitation is that the computation time grows rapidly as the number of input pattern images increases, which makes efficient selection difficult for large pattern sets.
Machine- and deep-learning methods perform well in terms of feature extraction and image clustering. In particular, convolutional neural networks (CNNs) can model nonlinear and complex relationships. More importantly, a trained CNN model can be used to predict unseen data with good generalization ability. Many CNN architectures, such as AlexNet [23], Visual Geometry Group (VGG) [24], GoogLeNet [25], and residual neural network (ResNet) [26,27], have been proposed for classification and clustering tasks. Pattern images are binary, and the dataset is not large-scale. Under these conditions, the choice of network is vital to avoid overfitting. Zhang et al. [28] proposed using a graph convolutional network (GCN) to tackle the pattern selection problem by formulating it as a classification problem. However, the limitation of this formulation is that an input pattern may not belong to any of the existing categories.
As neural networks (NNs) have proven to have many advantages, they have gradually been adopted in engineering, as in [29,30]. In this study, a critical pattern selection method based on CNN embeddings is proposed to increase the efficiency of full-chip optimization. The idea of transferring a pattern image to a new domain is inspired by IBM’s pattern selection architecture; the proposed method introduces, for the first time, a CNN as the transfer function that maps a two-dimensional original pattern to an embedding in a new high-dimensional feature space. The aim of this method is to balance computation cost and accuracy. It also serves as a paradigm that provides a structure for applying CNNs to the critical pattern selection problem, and the model can be updated without changing the algorithm structure. Unlike in existing methods, the transformation is learned: the CNN model was trained to obtain an accurate mapping. Before training the model, a dataset was obtained from a public pattern layout library and labeled using the diffraction signature method. Regarding the model architecture, VGG is a mature network for extracting image features, and its moderate depth reduces overfitting to a certain extent. The triplet loss, first introduced by Schroff et al. [31] for face recognition and clustering, was chosen as the loss function; its main idea is to minimize the Euclidean distance between images from the same group and enforce a margin between the distances of images from different groups. In the model application stage after training, the density-based spatial clustering of applications with noise (DBSCAN) algorithm [32] was applied to build the corresponding groups using the margin set in the training stage. After the pattern embeddings were calculated, critical patterns were selected from the different clusters based on their relative positions in the feature space. To verify the advantages of the chosen loss function and model, the embedding distributions of both the training and test datasets were visualized, and the optimization process was demonstrated. Finally, the elapsed time and pattern selection results were compared with those obtained using the diffraction signature method.
2. Methodology
Figure 1 shows a schematic diagram of the optical lithography system. The light from the source illuminates the mask through the condenser. After transmission through the mask, the light is diffracted, and only low-frequency light passes through the projection lens owing to the limited numerical aperture (NA). The light then reaches the substrate coated with the resist and exposes it, changing its solubility. In this process, SMO optimizes the source and mask jointly to make the feature after development close to the designed pattern.
Lithography is a partially coherent imaging technique. From Abbe’s theory [33], the intensity of an aerial image is the sum of the imaging results of all coherent systems. Each of the coherent systems is based on a source point within the condenser numerical aperture. Abbe’s theory can be formulated as

$$I(x, y) = \iint J(f, g) \left| \iint H(f + f', g + g')\, M(f', g')\, e^{-i 2\pi (f' x + g' y)}\, \mathrm{d}f'\, \mathrm{d}g' \right|^{2} \mathrm{d}f\, \mathrm{d}g, \tag{1}$$

where $(x, y)$ represents the coordinates in the image plane, $(f, g)$ represents the coordinates in the pupil plane, and $J(f, g)$ represents the source distribution. $H(f, g)$ is the optical transfer function of the projection objective and $M(f, g)$ is the spectrum of the mask pattern in the frequency domain. From the formula, the frequency domain information, such as the distribution and magnitude of diffraction orders, determines the intensity of the aerial image and, consequently, the distribution of the source after SMO. In real tape-out, particularly at 22 nm and above, the number of patterned samples can easily be on the order of billions [20]. Most areas of the full chip are identical, mirror-invariant, and noncritical. Among these, only critical patterns with representative features are useful for performing SMO. To cluster different patterns and choose the most representative pattern in each cluster, patterns can, to some extent, be imagined to have “distances”, which refer to the degree of difference in SMO; the larger the “distance” between two patterns, the more different their SMO results. Conversely, if the “distance” is small, the two patterns may make similar contributions to full-chip SMO, and one of them can be regarded as redundant.
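The Abbe summation described above can be illustrated numerically. The following is a minimal sketch, not the imaging model used in this study: it assumes a scalar thin-mask model, an ideal circular pupil, and frequencies expressed in cycles per pixel. The function name `abbe_aerial_image`, the parameter `na_cutoff`, and the sign convention of the source shift are illustrative assumptions.

```python
import numpy as np

def abbe_aerial_image(mask, source_points, na_cutoff):
    """Sum coherent images over discrete source points (Abbe's method).

    mask          : 2-D array of mask transmission (1 = clear, 0 = opaque)
    source_points : iterable of (fx, fy, weight) source samples
    na_cutoff     : pupil cutoff frequency, in cycles per pixel (assumed units)
    """
    ny, nx = mask.shape
    fx = np.fft.fftfreq(nx)                 # frequency grid, cycles per pixel
    fy = np.fft.fftfreq(ny)
    FX, FY = np.meshgrid(fx, fy)
    spectrum = np.fft.fft2(mask)            # mask spectrum

    image = np.zeros((ny, nx))
    total_weight = 0.0
    for sx, sy, w in source_points:
        # An off-axis source point shifts the pupil relative to the spectrum.
        pupil = ((FX + sx) ** 2 + (FY + sy) ** 2) <= na_cutoff ** 2
        field = np.fft.ifft2(spectrum * pupil)   # one coherent image
        image += w * np.abs(field) ** 2          # intensities add incoherently
        total_weight += w
    return image / total_weight

# Hypothetical example: one wide line imaged with a three-point source.
mask = np.zeros((64, 64))
mask[:, 16:48] = 1.0
source = [(0.0, 0.0, 1.0), (0.05, 0.0, 0.5), (-0.05, 0.0, 0.5)]
image = abbe_aerial_image(mask, source, na_cutoff=0.1)
```

Because only the diffraction orders that fall inside the shifted pupil contribute to each coherent image, two masks with similar diffraction order positions respond similarly to a given source, which is the intuition behind the pattern “distance”.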
2.1. Dataset Preparation
Before training the pattern selection model, the dataset was converted to the input format of the model and then labeled. Datasets with accurate labels are crucial for training CNN models to describe precise transformation relationships. However, there are no publicly available labeled layout datasets. Our layout sets were generated from design layouts (GDS files) obtained from freepdk45, a free public package [34]. The sample labels were assigned based on the diffraction signatures and coverage rules of the diffraction signature method.
Given the location of the diffraction orders in the mask spectrum, the approximate source distribution for SMO can be estimated [35]. Therefore, when multiple mask patterns are used for SMO, patterns with similar spectra, particularly those with similar diffraction order positions, contain redundant information. In other words, only a subset of the patterns that together include all the diffraction information is required in the SMO process. Based on this principle, different patterns with close diffraction order positions that conform to the coverage relationship can be clustered together.
Figure 2 illustrates the process of determining whether two patterns belong to the same group using the diffraction signature method. Patterns 1 and 2 are different patterns from the same group in the dataset. First, the spectra of the two patterns were calculated by applying the Fourier transform. Subsequently, the zero orders, as well as the strongest and middle orders, were removed. The other orders were then extracted from the spectra as the diffraction signature, without considering their harmonics. Every diffraction signature includes five features. The signatures of the two patterns were then compared to check whether all the orders of one pattern were covered by those of the other. If so, as shown in Figure 2, the two patterns were labeled as the same group. The source optimization (SO) results in Figure 2 are close, which verifies this conclusion.
The freepdk45 dataset contained 249 patterns, which is relatively small for model training. To extend the dataset, data augmentation was applied, including cropping, scaling, and adding. The dataset was thereby extended to 1047 patterns, which were labeled into 37 groups according to the above diffraction signature algorithm. Among these patterns, 900 were chosen randomly as the training set, 100 as the validation set, and 47 as the test set in the simulation experiment.
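The random 900/100/47 split can be reproduced with a seeded permutation. This is a minimal sketch under the assumption that patterns are addressed by integer index; the helper name `split_indices` and the seed value are hypothetical and not taken from the study.

```python
import numpy as np

def split_indices(n_samples, n_train, n_val, seed=0):
    """Randomly partition sample indices into train, validation, and test sets."""
    rng = np.random.default_rng(seed)      # fixed seed for a repeatable split
    order = rng.permutation(n_samples)
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]         # whatever remains
    return train, val, test

# 1047 augmented patterns: 900 for training, 100 for validation, 47 for testing.
train_idx, val_idx, test_idx = split_indices(1047, 900, 100)
```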
2.2. VGG Network
After the dataset preparation, a CNN model was built using PyTorch in Python. As previously mentioned, the pattern images are binary, and most patterns are not complicated. In this study, VGG-16 was chosen as the clustering model because it has a moderate depth compared with other classic networks, such as GoogLeNet and ResNet, whose deeper architectures may cause overfitting, particularly when the size of the dataset is limited. The detailed structure of VGG-16 is shown in Figure 3.
The VGG-16 network has 13 convolution layers and three fully connected layers. Adjacent layers are connected by the rectified linear unit (ReLU) activation function, and adjacent groups of convolution layers are connected by max pooling. The main idea of the design is to achieve end-to-end learning so that the model can be treated as a black box. The model transforms a pattern image $x$ into a vector in a K-dimensional feature space, that is, $f(x) \in \mathbb{R}^{K}$. This vector is the “embedding” of the CNN model. The embedding layer is a hidden layer in the CNN that allows the network to learn more about the relationships between inputs and to process data efficiently. Before being fed into the network, the mask pattern images were arranged into triplets by a triplet generator. Every triplet contained three members: an anchor $x_i^{a}$, a positive $x_i^{p}$ (same group as the anchor), and a negative $x_i^{n}$ (different group from the anchor). The embeddings had to be L2-normalized, $\|f(x)\|_2 = 1$, where $f(x)$ is the output embedding of the CNN. Training is important because the relationship among triplets in the feature space should be optimized such that the distance between images in the same group is small, whereas the distance between images from different groups is large.
2.3. Triplet Loss and Optimization
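To optimize the CNN model, triplet loss was introduced as the loss function during the training process. This loss ensured that $x_i^{a}$ (anchor) of a triplet was closer to $x_i^{p}$ (positive) than to $x_i^{n}$ (negative). The designed optimization goal of triplet loss is to make the embeddings satisfy the inequality shown in Figure 4. This can also be written as follows:

$$\left\| f(x_i^{a}) - f(x_i^{p}) \right\|_2^{2} + \alpha < \left\| f(x_i^{a}) - f(x_i^{n}) \right\|_2^{2}, \quad \forall \left( x_i^{a}, x_i^{p}, x_i^{n} \right) \in T, \tag{2}$$

where $f(\cdot)$ is the embedding output of the CNN, α is the margin that is enforced between positive and negative pairs, T represents all possible triplets in the pattern training dataset, and i is the triplet index. The triplet loss in Equation (3) evolves from the inequality above.

$$L = \sum_{i} \max\left( \left\| f(x_i^{a}) - f(x_i^{p}) \right\|_2^{2} - \left\| f(x_i^{a}) - f(x_i^{n}) \right\|_2^{2} + \alpha,\ 0 \right). \tag{3}$$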
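To make the triplet loss concrete, the following minimal NumPy sketch evaluates it for batches of anchor, positive, and negative embeddings. It assumes the squared Euclidean distance and the hinge form commonly used with this loss; the function names `l2_normalize` and `triplet_loss` are illustrative and are not taken from the study.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale each embedding (last axis) to unit Euclidean length."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.maximum(norm, eps)

def triplet_loss(anchor, positive, negative, margin):
    """Return the summed triplet loss and the per-triplet losses.

    anchor, positive, negative : arrays of shape (n_triplets, K)
    margin                     : the margin alpha between the two distances
    """
    a, p, n = (l2_normalize(np.asarray(v, dtype=float))
               for v in (anchor, positive, negative))
    d_ap = np.sum((a - p) ** 2, axis=-1)   # squared anchor-positive distance
    d_an = np.sum((a - n) ** 2, axis=-1)   # squared anchor-negative distance
    per_triplet = np.maximum(d_ap - d_an + margin, 0.0)
    return per_triplet.sum(), per_triplet

# Hypothetical 2-D embeddings: the first triplet already satisfies the margin,
# the second violates it and therefore contributes to the loss.
anchors = [[1.0, 0.0], [1.0, 0.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [1.0, 0.0]]
total, each = triplet_loss(anchors, positives, negatives, margin=0.2)
```

A triplet whose negative is already farther from the anchor than the positive by more than the margin yields zero loss and therefore no gradient.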
The loss function uses the Euclidean distance to express the distance between different embeddings. Minimizing it also implies low within-cluster scatter (WCS) and high between-cluster scatter (BCS), which are the ideal results for the pattern clustering task. For the model with the initial parameters, only a portion of all triplets satisfied Equation (2). As the triplet loss was minimized, the distance between the anchor and positive decreased, whereas the distance between the anchor and negative increased. Another problem to be addressed was the triplet selection strategy of the triplet generator. Theoretically, an anchor sample has many positive and negative matches. If the triplets are generated randomly in every iteration, many of them already satisfy Equation (2), which means that their loss is zero and they do not contribute to the optimization. To avoid this problem, the hard-triplet update strategy was applied: for each anchor, only the closest negative and farthest positive samples were selected to build the “hardest” triplet instead of an “easy” triplet. Every few epochs, the triplet set was updated using the hardest triplets. The optimization algorithm used in the training stage was stochastic gradient descent (SGD), whose update formula is

$$w_{jk}^{l}(t + 1) = w_{jk}^{l}(t) - \eta \frac{\partial L}{\partial w_{jk}^{l}(t)}, \tag{4}$$

where $w_{jk}^{l}$ is the $jk$th element in the lth layer of the weight matrix, $t$ is the iteration index, and η is the learning rate. The algorithm allows the model weights in each layer to be updated to minimize the triplet loss.
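The hard-triplet strategy and the SGD update can be sketched as follows. This is an illustrative NumPy version under simplifying assumptions: all embeddings fit in memory, distances are squared Euclidean, and gradients are supplied by the caller. The helper names `hardest_triplets` and `sgd_step` are hypothetical.

```python
import numpy as np

def hardest_triplets(embeddings, labels):
    """For each anchor, pick the farthest positive and the closest negative."""
    emb = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)          # pairwise squared distances

    triplets = []
    for a in range(len(emb)):
        same = labels == labels[a]
        same[a] = False                        # the anchor is not its own positive
        other = labels != labels[a]
        if not same.any() or not other.any():
            continue                           # no valid triplet for this anchor
        pos = np.where(same)[0][np.argmax(dist[a, same])]
        neg = np.where(other)[0][np.argmin(dist[a, other])]
        triplets.append((a, int(pos), int(neg)))
    return triplets

def sgd_step(weights, grads, lr):
    """Apply one SGD update, w <- w - lr * dL/dw, to every layer."""
    return [w - lr * g for w, g in zip(weights, grads)]

# Hypothetical 1-D embeddings belonging to two groups.
emb = [[0.0], [1.0], [3.0], [1.5], [10.0]]
groups = [0, 0, 0, 1, 1]
triplets = hardest_triplets(emb, groups)
new_weights = sgd_step([np.array([1.0, 2.0])], [np.array([0.5, -1.0])], lr=0.1)
```

In a PyTorch training loop the weight update would normally be delegated to an optimizer; the explicit `sgd_step` here only mirrors the update rule for clarity.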
2.4. DBSCAN Cluster Algorithm
The trained model mapped each image to an embedding in the feature space. Unlike in the training stage, the application stage did not require the input images to be arranged as triplets. When pattern images were input, their embeddings were obtained after a forward pass through the model. A clustering algorithm can then group these embeddings in the K-dimensional space. Every embedding is a vector, that is, a point in the feature space, and the distance between two points represents the difference between the SMO results of the corresponding patterns. Based on this idea, DBSCAN was used as the clustering algorithm. Unlike traditional clustering algorithms such as k-means, it does not require the number of clusters to be specified; it requires only the neighborhood radius ε, which corresponds to the margin α in the triplet loss formula.
The DBSCAN algorithm used here relies on the following definitions. The relationship between the outliers and core points is shown in Figure 5.
(1) An embedding point is a core point if at least minPts (set manually) points (including itself) are within a distance ε of it.
(2) An embedding point is reachable from a core point if it is within a distance ε of the core point. If embedding B is reachable from embedding A and embedding C is reachable from embedding B, then embedding C is reachable from embedding A.
(3) Embedding points that are not reachable from any other point are outliers.
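The three definitions above can be turned into a compact clustering routine. The following is a minimal NumPy sketch of a standard DBSCAN, written for illustration rather than as the implementation used in this study; it assumes Euclidean distance, assigns the label -1 to outliers, and expands clusters only through core points. The function name `dbscan` and its arguments are illustrative.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Cluster points; return one label per point (-1 marks an outlier)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    # Neighborhood of each point, including the point itself.
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue                        # start new clusters only at unvisited core points
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()                 # j is always a core point
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster     # k is reachable from j
                    if is_core[k]:
                        stack.append(k)     # keep expanding through core points
        cluster += 1
    return labels

# Hypothetical 2-D embeddings: two tight groups and one isolated point.
pts = [[0, 0], [0.1, 0], [0.2, 0], [5, 5], [5.1, 5], [20, 20]]
labels = dbscan(pts, eps=0.15, min_pts=2)
```

The number of clusters is never specified; it follows from ε and minPts alone, which is why the method suits a setting where the number of pattern groups is unknown in advance.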
The detailed workflow is shown in Figure 6. After the pattern images were mapped onto the feature space, the set of pattern embeddings was stored and clustered. Each embedding point was classified as a core point or an outlier in turn until all points had been checked, as shown in Figure 6. All core points were stored, and the outliers were classified into a single group. For pattern selection, ε was set to the margin α used in the triplet loss. For each core point, the reachable points were determined individually. Points that were mutually reachable were classified into the same group and deleted from the core point set. When no members remained in the set, the algorithm terminated. Subsequently, the centroid of every group in the K-dimensional space was computed, and the critical pattern in each cluster could be determined by Equation (5).
$$x_n^{*} = \arg\min_{x_i} \left\| f(x_i) - \frac{1}{N} \sum_{j=1}^{N} f(x_j) \right\|_2, \tag{5}$$

where $x_n^{*}$ is the selected critical pattern, $i$ and $j$ are member indices in the nth cluster, N is the number of members in the cluster, and $f(\cdot)$ is the embedding output of the CNN model. The closer the embedding point is to the center, the more representative it is of the cluster. In other words, the pattern embedded nearest to the center of the cluster is the critical pattern.
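The nearest-to-centroid rule can be sketched in a few lines of NumPy. This is an illustrative version: the function name `select_critical` is hypothetical, and skipping the outlier label (-1) is an assumption, since the text does not state how outliers are represented in the final selection.

```python
import numpy as np

def select_critical(embeddings, labels, noise_label=-1):
    """Return {cluster label: index of the member nearest to the cluster centroid}."""
    emb = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    critical = {}
    for c in np.unique(labels):
        if c == noise_label:
            continue                               # assumption: outliers handled separately
        members = np.where(labels == c)[0]
        centroid = emb[members].mean(axis=0)       # cluster center in feature space
        d = np.linalg.norm(emb[members] - centroid, axis=1)
        critical[int(c)] = int(members[np.argmin(d)])
    return critical

# Hypothetical 2-D embeddings in two clusters.
emb = [[0, 0], [1, 0], [2, 0], [10, 10], [10, 12], [10, 11.2]]
labels = [0, 0, 0, 1, 1, 1]
critical = select_critical(emb, labels)
```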
2.5. Overall Workflow of the Method
The pattern selection workflow was composed of several substeps after the patterns were cut from the full-chip area: pattern transformation, pattern clustering, and critical member selection. This study introduced a CNN in the pattern transformation step and used the CNN embeddings to cluster the patterns and select the critical ones in the later steps. A CNN with updatable parameters can perform nonlinear mapping; thus, it can be trained to obtain the most appropriate mapping from pattern graphics into the abstract space relevant to SMO. The workflow of the proposed critical pattern selection process, shown in Figure 7, can be divided into two stages: training and application. In the training stage, before the patterns were fed into the CNN model, triplets were generated through the strategy described in Section 2.3. The embeddings and their losses were then calculated after a forward pass through the CNN. With the triplet loss, the model was updated by SGD. In the application stage, the trained model was applied to transform pattern images into embedding vectors. The DBSCAN algorithm clustered the embeddings, and the critical member of each cluster was determined based on the distribution of the embeddings.