2.1. Tilted Moiré Fringe Alignment Principle
During mask–wafer alignment, the Moiré fringe changes whenever the wafer tilts relative to the mask. When the grating mark on the mask rotates in-plane relative to the grating mark on the wafer, the Moiré fringe formed by the two gratings becomes inclined. Similarly, when the two gratings rotate and tilt in space, the resulting Moiré fringe exhibits both inclination and frequency variation. These distortions prevent phase information from being extracted from the Moiré fringe, thereby affecting the final alignment accuracy. Zhou et al. [11,12] elucidated the basic principles of Moiré fringe phase analysis. Building on this foundation, this paper provides a brief analysis of the tilting process between the wafer and the mask.
As shown in Figure 1a, when illuminated by a 365 nm ultraviolet light source, a Moiré fringe is formed by the displacement between the alignment marks on the mask and the wafer. The mask alignment marks are line gratings with a period of 4 μm in the upper quadrant and 4.4 μm in the lower quadrant, and the alignment marks on the mask and wafer are in a parallel, misaligned state.
The light field distribution of the Moiré fringe can be expressed as
$$I_{1}(x)=I_{01}+\cos\!\left[2\pi\left(\frac{1}{P_{1}}-\frac{1}{P_{2}}\right)x\right],\qquad I_{2}(x)=I_{02}+\cos\!\left[2\pi\left(\frac{1}{P_{2}}-\frac{1}{P_{1}}\right)x\right]$$
Here, $I_{1}$ and $I_{2}$ represent the interference intensities of the upper and lower parts of the Moiré fringe, respectively; $I_{01}$ and $I_{02}$ represent the intensity of the background light; and $P_{1}$ and $P_{2}$ represent the periods of the mask alignment mark and wafer alignment mark, respectively.
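As a numerical illustration, the following NumPy sketch evaluates the reconstructed upper- and lower-region intensities for the 4 μm and 4.4 μm mark periods quoted above; the unit cosine amplitude and the background levels are assumptions, not values from the paper.

```python
import numpy as np

P1, P2 = 4.0, 4.4        # mask / wafer alignment mark periods (micrometres)
I01 = I02 = 1.0          # assumed background intensities

x = np.linspace(0.0, 200.0, 2001)   # lateral coordinate (micrometres)

# Beat (moire) term between the two grating frequencies 1/P1 and 1/P2
I_upper = I01 + np.cos(2 * np.pi * (1 / P1 - 1 / P2) * x)
I_lower = I02 + np.cos(2 * np.pi * (1 / P2 - 1 / P1) * x)

# Resulting fringe period P1*P2/|P1 - P2| = 44 um
print("Moire fringe period:", P1 * P2 / abs(P1 - P2), "um")
```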
As shown in Figure 1b, when the wafer is tilted within the horizontal plane (the x–y section), the light field distribution of the Moiré fringe can be expressed as
$$I_{1}'(x,y)=I_{01}+\cos\!\left[2\pi\left(\frac{1}{P_{1}}-\frac{1}{P_{2}}\right)x+\frac{2\pi\sin\theta}{P_{2}}\,y\right],\qquad I_{2}'(x,y)=I_{02}+\cos\!\left[2\pi\left(\frac{1}{P_{2}}-\frac{1}{P_{1}}\right)x-\frac{2\pi\sin\theta}{P_{1}}\,y\right]$$
Here, $I_{1}'$ and $I_{2}'$ represent the interference intensities of the upper and lower parts of the horizontally tilted Moiré fringe, respectively, and $\theta$ represents the angular deviation between the wafer and the mask within the horizontal plane. The negative sign reflects the opposite tilt directions of the upper and lower fringes. The angular deviation can be determined from other misalignments, such as the offset between the centers of the mask and wafer alignment marks, and can be expressed as
$$\theta=\arcsin\!\left(\frac{\Delta d}{L}\right)$$
Here, $\arcsin$ denotes the arcsine function, $\Delta d$ represents the offset between the centers of the mask and wafer alignment marks, and $L$ represents the distance between the centers of the horizontally aligned line gratings on a mask mark or wafer mark.
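A direct numerical reading of this relation is shown below; the offset and center distance used are hypothetical values chosen only to illustrate the calculation.

```python
import numpy as np

def angular_deviation(delta_d, L):
    """In-plane angular deviation theta = arcsin(delta_d / L), returned in degrees.

    delta_d: offset between the centres of the mask and wafer alignment marks
    L:       distance between the centres of the horizontally aligned line gratings
    (both in the same length unit).
    """
    return np.degrees(np.arcsin(delta_d / L))

# Hypothetical numbers purely for illustration
print(angular_deviation(delta_d=10.0, L=200.0))  # about 2.87 degrees
```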
As shown in Figure 1c, when the wafer is tilted in the x–z section, i.e., in the vertical plane, the light field distribution of the Moiré fringe can be expressed as
$$I_{1}''(x,y)=I_{01}+\cos\!\left[2\pi f'\left(x\cos\varphi+y\sin\varphi\right)\right],\qquad I_{2}''(x,y)=I_{02}+\cos\!\left[2\pi f'\left(x\cos\varphi-y\sin\varphi\right)\right]$$
Here, $I_{1}''$ and $I_{2}''$ represent the interference intensities of the upper and lower parts of the vertically tilted Moiré fringe, respectively; $f'$ represents the fringe frequency change caused by the tilt of the wafer relative to the mask in the vertical plane; and $\varphi$ represents the corresponding change in the stripe angle. The negative sign reflects the opposite tilt directions of the upper and lower fringes. The stripe angle $\varphi$ and the fringe frequency $f'$ are determined by the rotation angle induced by the tilt of the wafer relative to the mask in the vertical plane, the diffraction angles $\theta_{1}$ and $\theta_{2}$ of the mask alignment mark and the wafer alignment mark, respectively, the wavelength $\lambda$ of the light source, the tilt angle $\alpha$ in the vertical plane, and the original fringe frequency $f_{0}$.
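To visualize what the frequency and angle changes mean for the fringe pattern, the following sketch renders the upper and lower regions with a placeholder changed frequency f' and stripe angle φ; the specific numbers are illustrative, since the paper's explicit expressions for f' and φ are not reproduced here.

```python
import numpy as np

# Grid over the mark region (micrometres)
x = np.linspace(0.0, 100.0, 400)
y = np.linspace(0.0, 50.0, 200)
X, Y = np.meshgrid(x, y)

f0 = 1.0 / 44.0            # original fringe frequency (1/um)
f_tilt = 1.1 * f0          # placeholder changed frequency f'
phi = np.radians(5.0)      # placeholder stripe angle

# Upper and lower regions tilt in opposite directions (the +/- sign in the text)
I_upper = 1.0 + np.cos(2 * np.pi * f_tilt * (X * np.cos(phi) + Y * np.sin(phi)))
I_lower = 1.0 + np.cos(2 * np.pi * f_tilt * (X * np.cos(phi) - Y * np.sin(phi)))
```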
Changes in the relative position of the mask and wafer are directly reflected in the Moiré fringe. Therefore, the relative position of the mask and wafer is simply divided into two parts: the horizontal plane (x–y direction) and the vertical plane (x–z direction). On this basis, a simulated tilt model is established, consisting of a set of mask and wafer tilt images. These images serve as the input to the convolutional neural network and are incorporated into the DQN algorithm for training.
2.3. Network Architecture and Training Methodology
As shown in Figure 3, the CNN-Behavior network structure is based on a convolutional neural network. The network comprises three convolutional layers, one pooling layer, three activation layers, and two fully connected layers. Each convolution is followed by batch normalization, and no zero padding is used. The activation layers use the ReLU function, which introduces nonlinearity into the matrix operations. In the figure, for the first convolutional layer, 8 × 8 indicates that each filter is 8 pixels by 8 pixels, 4 indicates that the input data being convolved have four channels, and 32 indicates that the layer contains 32 filters.
The three convolutional layers contain 32, 64, and 64 filters, respectively. A max pooling layer (2 × 2, stride 2) is used to reduce the number of parameters, avoid overfitting, and increase the processing speed of the model. The last convolutional layer is followed by two fully connected layers with 512 and 3 units, respectively. The network takes four-channel 80 × 80 wafer tilt images as the input and outputs the Q-value of each action.
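For reference, the following PyTorch sketch assembles a network matching this description. The kernel sizes and strides of the second and third convolutional layers and the position of the pooling layer are not stated in the text and are assumed here (borrowed from the classic DQN architecture); the sketch is illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CNNBehavior(nn.Module):
    """Sketch of the CNN-Behavior/CNN-Target architecture described in the text."""

    def __init__(self, n_actions: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4x80x80 -> 32x19x19 (8x8 filters, from the text)
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> 64x8x8 (kernel/stride assumed)
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> 64x6x6 (kernel/stride assumed)
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),       # -> 64x3x3 (pool position assumed)
        )
        # No activation between the FC layers, matching the three activation layers stated in the text.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 3 * 3, 512),
            nn.Linear(512, n_actions),                   # one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

# Example: one four-frame 80x80 tilt state -> three Q-values
q_values = CNNBehavior()(torch.zeros(1, 4, 80, 80))
print(q_values.shape)  # torch.Size([1, 3])
```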
To train the network, the first step is to initialize the experience replay pool as a queue D with a capacity of N. As shown in Figure 4, an action is selected using the ε-greedy strategy [16]. After the action is executed, the system obtains the reward and the next state from the environment, which are combined with the current state and the action. The latest transitions are stored as training samples, and a batch of samples is then drawn at random so that training proceeds on shuffled, decorrelated data. The ε-greedy strategy can be expressed as
$$a=\begin{cases}\text{a random action}, & \text{with probability } \varepsilon\\ \arg\max_{a}Q(s,a), & \text{with probability } 1-\varepsilon\end{cases}$$
where $\max_{a}Q(s,a)$ represents the Q-value of the best action, $s$ represents a state, and $a$ represents an action.
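A minimal implementation of this selection rule might look as follows; the function name and the use of the CNNBehavior sketch above are illustrative, not taken from the paper.

```python
import random
import torch

def epsilon_greedy(q_net, state, epsilon, n_actions=3):
    """Select a random action with probability epsilon, otherwise the greedy action argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)          # explore
    with torch.no_grad():                           # exploit: action with the highest Q-value
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())
```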
The second step is to randomly initialize the weights $\theta$ of the CNN-Behavior network $Q$ and the weights $\theta^{-}$ of the CNN-Target network $\hat{Q}$. The functions of these two networks are different: the CNN-Behavior network is responsible for controlling the leveling of the mask and wafer and for collecting experience, while the CNN-Target network is used to calculate the TD target. Data randomly taken from the experience replay pool are then used to calculate the target value of the CNN-Target network. The calculation formula can be expressed as
$$y=r+\gamma \max_{a'}\hat{Q}\left(s',a';\theta^{-}\right)$$
where $r$ represents the reward actually observed in the current state; $\gamma$ represents the discount factor, which weights the long-term expected return of taking an action in state $s$; $s'$ represents the next state; and $\hat{Q}(s',a';\theta^{-})$ represents the estimate made by the CNN-Target network at $(s',a')$. The Q-value is maximized each time an action is selected.
During the update process, only the weights of the CNN-Behavior network are updated, while the weights of the CNN-Target network remain unchanged. After a certain number of updates, the updated weights of the CNN-Behavior network are copied to the CNN-Target network for the next batch of updates, at which point the CNN-Target network completes its update. Since the target values remain relatively fixed during the period in which the CNN-Target network does not change, the introduction of the target network enhances learning stability.
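The target computation and the periodic weight copy can be sketched as follows; the discount factor value and the terminal-state masking are assumptions, since the text does not specify them.

```python
import torch

def td_targets(target_net, rewards, next_states, dones, gamma=0.99):
    """TD target y = r + gamma * max_a' Q_hat(s', a'; theta-), with no gradient through the target network."""
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

def sync_target(behavior_net, target_net):
    """Copy the CNN-Behavior weights into the CNN-Target network after a fixed number of updates."""
    target_net.load_state_dict(behavior_net.state_dict())
```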
As shown in Figure 5, the loss function is calculated next. By continuously calculating the residual between the predicted value and the actual value, the parameters of the training model are constantly updated. This process reduces the residual, which gradually converges to a stable value, yielding the best set of training parameters.
The loss function can be expressed as
$$L(\theta)=\mathbb{E}\left[\left(y-Q(s,a;\theta)\right)^{2}\right]$$
The goal is to minimize the loss function using the gradient descent method, with backpropagation through the CNN-Behavior network used to update its weights $\theta$. The gradient descent step is expressed as
$$\theta'=\theta-\eta\,\nabla_{\theta}L(\theta)$$
Here, $\theta'$ represents the updated weights, $\eta$ represents the learning rate, and $\nabla_{\theta}L(\theta)$ represents the derivative of $L(\theta)$ with respect to $\theta$, which gives the update direction. From the steps outlined above, a four-tuple transition can be obtained:
$$(s,\,a,\,r,\,s')$$
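Putting the loss and the gradient step together, one update of the CNN-Behavior network could be sketched as below; the mean-squared TD error and the optimizer step shown here are standard DQN practice and are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def dqn_update(behavior_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient-descent step on the squared TD error L(theta)."""
    states, actions, rewards, next_states, dones = batch
    # Q(s, a; theta) for the actions actually taken
    q_sa = behavior_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # y = r + gamma * max_a' Q_hat(s', a'; theta-) from the CNN-Target network
    with torch.no_grad():
        targets = rewards + gamma * (1.0 - dones) * target_net(next_states).max(dim=1).values
    loss = F.mse_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()       # backpropagation through the CNN-Behavior network
    optimizer.step()      # theta' = theta - eta * grad_theta L(theta)
    return loss.item()
```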
Each four-tuple transition is stored in the experience replay pool; sample data are then randomly extracted from the pool, and the update process described above is repeated. The entire process is mainly based on the Deep Q-Network (DQN) algorithm outlined in Algorithm 1.
Algorithm 1: DQN algorithm flow in mask–wafer leveling simulation.
Step 1: Initialize experience replay pool D to capacity N;
Step 2: Initialize the CNN-Behavior network $Q$ with random weights $\theta$; initialize the CNN-Target network $\hat{Q}$ with random weights $\theta^{-}=\theta$;
Step 3: For episode = 1, M do
  Initialize sequence $s_{1}$ and preprocessed sequence $\phi_{1}=\phi(s_{1})$;
  For t = 1, T do
    ➢ With probability $\varepsilon$ select a random action $a_{t}$; otherwise select $a_{t}=\arg\max_{a}Q(\phi(s_{t}),a;\theta)$;
    ➢ Execute action $a_{t}$ in the emulator and observe reward $r_{t}$ and image $x_{t+1}$;
    ➢ Set $s_{t+1}=s_{t},a_{t},x_{t+1}$ and preprocess $\phi_{t+1}=\phi(s_{t+1})$;
    ➢ Store transition $(\phi_{t},a_{t},r_{t},\phi_{t+1})$ in D;
    ➢ Sample a random minibatch of transitions $(\phi_{j},a_{j},r_{j},\phi_{j+1})$ from D;
    ➢ Set $y_{j}=r_{j}+\gamma\max_{a'}\hat{Q}(\phi_{j+1},a';\theta^{-})$;
    ➢ Perform a gradient descent step on $\left(y_{j}-Q(\phi_{j},a_{j};\theta)\right)^{2}$ with respect to the network parameters $\theta$;
    ➢ Every C steps reset $\hat{Q}=Q$;
  End For
End For
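A condensed Python version of Algorithm 1, reusing the CNNBehavior, epsilon_greedy, and dqn_update sketches above, is given below. The episode count, step limit, batch size, and learning rate follow the values given later in this section; the environment object with its reset()/step() interface, the exploration rate, the target-sync period C, and the replay capacity are assumptions made for illustration.

```python
import random
from collections import deque

import torch

def train(env, behavior, target, episodes=1200, max_steps=400, batch_size=32,
          epsilon=0.1, sync_every=1000, capacity=100_000, lr=0.001):
    """Condensed Algorithm 1. `env` is assumed to return 4x80x80 float tensors as
    states and to expose gym-like reset()/step() methods."""
    replay = deque(maxlen=capacity)                        # Step 1: replay pool D
    target.load_state_dict(behavior.state_dict())          # Step 2: theta- = theta
    optimizer = torch.optim.Adam(behavior.parameters(), lr=lr)
    step = 0
    for _ in range(episodes):                              # Step 3
        state = env.reset()
        for _ in range(max_steps):
            action = epsilon_greedy(behavior, state, epsilon)
            next_state, reward, done = env.step(action)
            replay.append((state, action, reward, next_state, float(done)))
            if len(replay) >= batch_size:
                s, a, r, s2, d = zip(*random.sample(replay, batch_size))
                batch = (torch.stack(s), torch.tensor(a),
                         torch.tensor(r, dtype=torch.float32),
                         torch.stack(s2), torch.tensor(d, dtype=torch.float32))
                dqn_update(behavior, target, optimizer, batch)
            step += 1
            if step % sync_every == 0:                     # every C steps reset Q_hat = Q
                target.load_state_dict(behavior.state_dict())
            state = next_state
            if done:
                break
```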
The CNN-Behavior network and CNN-Target network were implemented on a hardware setup comprising a 32-core AMD Ryzen Threadripper PRO 5975WX (3.6 GHz), 512 GB of RAM, and 12 NVIDIA RTX A5000 GPUs.
Simulation conditions: data are extracted from the queue of the experience replay pool and organized into batches of 32 samples, each batch including the current state, the action, the reward corresponding to the action, and the subsequent state. The current state value is calculated from the Bellman equation [16] and stored in the experience replay pool. The batch data (current state, action, reward, and next state) are sent to the CNN-Behavior network for training, and the weight parameter matrix is updated on the GPU; the learning rate of the CNN-Behavior network was set to 0.001. When the CNN-Behavior network loads the mask and wafer tilt model, the model is divided into the horizontal plane tilt environment model (x–y) and the vertical plane tilt environment model (x–z). The weight parameters are updated, the current state is input, and the action is output. The action with the highest Q-value is identified as the optimal action and executed in the environment, and the mask and wafer receive the relevant information, such as the next image frame and the reward. This information is assembled into the next state, which then becomes the current state. Each loop iteration produces the optimal behavior, and the model is saved every 5000 iterations. When interacting with the environment, the wafer and mask take optimal actions; by connecting these actions in series, the wafer and mask are automatically leveled.
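For convenience, the hyperparameters explicitly stated in this subsection can be gathered in one place; values the text does not give (such as the discount factor or the exploration schedule) are deliberately omitted.

```python
# Hyperparameters stated in the text for the mask-wafer leveling simulation.
TRAIN_CONFIG = {
    "batch_size": 32,          # samples drawn from the replay pool per update
    "learning_rate": 0.001,    # CNN-Behavior network learning rate
    "checkpoint_every": 5000,  # iterations between saved models
    "environments": ("x-y horizontal-plane tilt", "x-z vertical-plane tilt"),
}
```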
As shown in Figure 6, rlDQNAGENTS refers to the horizontal plane tilt environment model (x–y) and the vertical plane tilt environment model (x–z). The training data for the entire DQN algorithm came from the experience replay pool, which consisted of interactions between the initial rlDQNAGENTS and the CNN-Behavior network, together with the data samples stored in the experience replay pool. The model was run for 1200 training episodes, with a maximum of 400 iteration steps per episode, and training was set to halt if the target reward of −700 was reached in a given round. This target reward was determined based on the leveling of the mask and wafer. If the cumulative reward of a simulation exceeded −400, the model at that time was retained, and the results were preserved. The horizontal axis represents the reward value obtained for each learning session, and the vertical axis represents the training rounds. The average reward was calculated as the average return of the most recent episodes, with the window length specified by the training options (trainOpts). Episode Q0 refers to the estimated Q-value of the initial state of the CNN-Behavior network.
As training progressed, the algorithm stabilized: both the episode reward and the average reward leveled off. Because Episode Q0 is only the CNN-Behavior network's estimate of the action's Q-value, a certain difference remained between Episode Q0 and Episode Reward.
As shown in Figure 7, in order to test the stability of the algorithm and its adaptability to different models, a comparative analysis of the horizontal plane tilt environment model (x–y) and the vertical plane tilt environment model (x–z) was conducted from two perspectives: the learning curve and Episode Q0. As shown in Figure 7c, the horizontal plane tilt environment model included the mask and wafer in the horizontal plane, as well as the generated Moiré-based phase image. The mask was in a balanced state, while the wafer was tilted within the plane at an angle of 30 degrees. In the environmental input, this paper considers angles from 0.1 to 60 degrees, scaled proportionally; the 30-degree case is used here only as an example simulation condition to assess the leveling accuracy of the strategy at large angles. The mask alignment mark was a line grating with a period of 4 μm and a duty cycle of 1:1, and the wafer alignment mark was a line grating with a period of 4.2 μm and a duty cycle of 1:1. Because the wafer was tilted in the horizontal plane, an inclined Moiré fringe was produced in the horizontal plane. As shown in Figure 7d, the vertical plane tilt environment model (x–z) included the mask and wafer in the x–z direction and the generated Moiré-based phase image. The mask was in a balanced state, while the wafer was tilted in space, with a tilt angle of 30 degrees in the z direction. The mask alignment mark was a line grating with a period of 4 μm and a duty ratio of 1:1, and the wafer alignment mark was a line grating with a period of 4.2 μm and a duty ratio of 1:1. The tilt of the wafer in space affected the frequency and angle of the fringes, generating an inclined Moiré fringe in the vertical plane, and the widths of the upper and lower stripes became inconsistent.
As shown in Figure 7a, the simulation results indicate that, in the comparison of the learning curves, the vertical plane tilt environment model (x–z) had to adjust the angle in the vertical direction while also leveling the angle in the x–y direction. Therefore, within the first 200 iterations, its fluctuations and errors were larger than those of the horizontal plane tilt environment model (x–y), and its converged Episode Reward was 200 higher than that of the horizontal plane model. As shown in Figure 7b, in the Episode Q0 comparison, the convergence curves of the two models do not differ by much. Therefore, the algorithm has good robustness in terms of model adaptability.