1. Introduction
Silicon single crystal (SSC) is a foundational material for the modern semiconductor industry and remains an irreplaceable base material for manufacturing integrated circuit chips. In particular, the technology for pulling large, high-quality ingots occupies a pivotal front-end position in the industry chain and directly determines the performance ceiling of subsequent chip manufacturing processes [1,2]. As chip linewidths continue to shrink and silicon wafer sizes continue to grow, the quality requirements for SSC production become more stringent [3]. An online, real-time prediction model for crystal quality during SSC growth is therefore of considerable practical engineering value. Such a model can continuously monitor the dynamic growth conditions of the crystal, enabling process personnel to assess the growth status promptly and make optimized adjustments. The predictions it provides also offer a decision-making basis and data support for keeping the growth control system stable and reliable.
The Czochralski (Cz) crystal growth process is the most widely adopted technology for preparing high-quality, large-sized SSC [4]. The process is fundamentally based on precise control of the solid–liquid phase transition, and its internal mechanism is extremely complex, involving multi-physical field coupling and interaction among the gas phase, melt (liquid) phase, and crystal (solid) phase [5,6]. The evaluation of crystal quality in Cz-SSC growth revolves around two core aspects: geometric dimensional accuracy at the macroscopic level and the density of crystal defects at the microscopic level. The latter are predominantly intrinsic point defects (vacancies and self-interstitials) generated as the crystal solidifies from the melt. Their gradual accumulation in the crystal may lead to uneven resistance distribution in integrated circuit chips, increased leakage current, and reduced photoelectric conversion efficiency, so that key technical indicators fail to meet the design requirements and chip performance is directly affected [7,8]. The current theory of point defect generation in crystals can be traced back to the V/G criterion first proposed by Voronkov [9] in the 1980s, where “V” represents the growth rate of the crystal and “G” denotes the temperature gradient near the solid–liquid interface. Adjusting the V/G ratio is a key method for controlling the type of point defects within crystals during Cz growth. By regulating this ratio, the dominant point defect type can be shifted from one form to the other, which promotes rapid mutual recombination between interstitial atoms and vacancies [7]. This mechanism reduces residual defects and their aggregation, and thus provides an important process control approach for growing high-quality silicon single crystals with low defect density. Real-time monitoring of the V/G value during Cz-SSC growth is therefore of great significance, because fluctuations of this ratio directly affect the type and density of defects within the crystal. Such monitoring gives on-site operators clear guidance for adjusting process parameters and enables dynamic optimization of the growth procedure, which helps to improve the consistency of single crystal quality and supports efficient, stable operation of the control system.
The V/G criterion is widely regarded as a key indicator for evaluating the quality of SSC wafers [10,11,12]. However, traditional hard sensors cannot directly obtain this ratio during actual crystal growth, which poses a significant challenge for real-time monitoring and control. Adopting an artificial intelligence-based soft sensor to predict the V/G criterion in real time from indirectly related, measurable process parameters has therefore become an effective solution. Unlike traditional hard sensors, which rely on physical probes for direct measurement, a soft sensor is a mathematical model that estimates or predicts in real time process variables that cannot be measured directly. Using modeling and computing techniques such as machine learning algorithms, it relates the measurable process variables of the system (such as temperature, pressure, and rotational speed) to the target variables that are difficult to measure [13,14,15]. Soft sensor modeling mainly follows two paradigms: modeling based on mechanism knowledge and data-driven modeling. Although soft sensor models based on mechanistic knowledge are explicitly interpretable, they suffer from unmodeled dynamics and are prone to low prediction accuracy. In contrast, a data-driven soft sensor model relies solely on the abundant operational data generated during production to establish a predictive mapping from measurable variables to target variables. The core advantage of such models lies in their ability to learn and capture complex dynamic characteristics from data automatically, so they play an increasingly important role in predicting key variables across a wide range of complex industrial processes [16,17].
The core principle of data-driven soft sensor modeling is to first select, from the production process, auxiliary variables that are highly correlated with the primary variable (i.e., the variable to be predicted) as inputs, and then to construct a mathematical mapping between them from historical or real-time data using machine learning algorithms. The resulting model can describe complex, nonlinear dynamic relationships and enables real-time online prediction of the key variables, which completes the soft sensing task. However, the massive data accumulated in industrial processes often exhibit strong redundancy and significant process noise, so the useful information is overwhelmed by a large amount of irrelevant or repetitive data. Consequently, the key challenge and research focus in current soft sensor modeling lies in accurately and efficiently extracting feature information that reflects the intrinsic nature of the process from multidimensional, highly noisy raw data. Traditional shallow modeling methods (e.g., least squares regression and principal component analysis) have relatively simple structures and often fail to capture complex, nonlinear deep features; their representational capacity is therefore limited when confronted with industrial process data that exhibit strongly coupled, high-dimensional dynamic characteristics. In contrast, deep learning has been widely applied in soft sensor modeling because of its ability to extract deep features and learn representations. Various deep learning models, such as stacked autoencoders [18], convolutional neural networks [19], long short-term memory networks [20], and deep belief networks [21], have been introduced and have significantly improved the prediction performance for complex process variables. For instance, a stacked autoencoder (SAE) offers clear advantages in soft sensor modeling because it can extract the underlying deep features and key information from the data while reducing the dimensionality of high-dimensional data. In this context, Gao et al. proposed a generative modeling framework that combines the deep feature learning and representation capabilities of stacked variational autoencoders (VAEs) with the stable generation advantages of Wasserstein generative adversarial networks (WGANs) to construct a hybrid generative model. Their experimental results show that this model significantly improves the prediction accuracy and generalization performance of soft sensors under complex conditions [22]. Wang et al. proposed a hybrid modeling framework that integrates an improved support vector machine with a sparse autoencoder network, which enhances the classification accuracy in pipeline leakage detection [23]. To address the generalization failure of prediction models caused by changes in working conditions in complex industrial processes, Ren et al. proposed a stack-enhanced autoencoder transfer learning algorithm based on variational mode decomposition, which addresses the domain-adaptive prediction problem and provides a feasible solution for adaptive prediction in actual industrial applications [24]. At present, the application of deep learning to soft sensor modeling of the Czochralski silicon single crystal (Cz-SSC) growth process is still at an initial exploration stage, and high-quality research results are relatively limited. Numerous open questions therefore remain in areas such as model architecture design, training strategy optimization, and integration of physical mechanisms.
Inspired by the aforementioned soft sensor modeling approaches, this paper presents a soft sensor modeling method driven by the fusion of mechanism and data. In this method, an enhanced stacked autoencoder network incorporating an attention mechanism is designed to improve the prediction performance of the data-driven module and its ability to capture the historical dynamics of the process. Furthermore, an adaptive weight adjustment strategy is designed for the stage in which the data-driven sub-model and the mechanistic sub-model are fused. This strategy adjusts the contribution of each sub-model according to the real-time operating conditions and the local prediction performance of each sub-model. Through this adaptive weighted fusion, the model can flexibly combine the generalizability of mechanistic knowledge with the compensatory strengths of data-driven approaches, ultimately achieving stable and accurate online prediction of the key crystal quality variable, the V/G criterion. Specifically, the main contributions of this paper are as follows:
- (1) Because the V/G criterion is difficult to detect directly with hard sensors during crystal growth, a mechanism and data fusion-driven soft sensor model (referred to as M-AD-SEAE) is designed. It integrates the prior knowledge of the mechanism with the adaptive learning ability of the data-driven method to overcome the limitations of a single modeling approach.
- (2) During the establishment of the data-driven sub-model (i.e., AD-SEAE), an attention mechanism is introduced to dynamically calculate the weights of different pieces of historical information and thereby focus on the key dynamic features. The weighted historical information is combined with the current sample and fed into the stack-enhanced autoencoder (SEAE) network for feature extraction and modeling. This design enhances the model’s ability to capture temporal dependencies and helps preserve key dynamic process information, which improves prediction accuracy and reduces information loss.
- (3) An adaptive weight adjustment strategy based on the entropy weight method is proposed to achieve dynamic fusion of the mechanistic sub-model and the data-driven sub-model. Rather than employing a fixed proportion, the method objectively calculates and allocates the fusion weights from the prediction performance of each sub-model. This data-driven weight allocation enhances the overall prediction accuracy and robustness of the soft sensor model for the V/G criterion of crystal quality.
The rest of the paper is organized as follows: Section 2 briefly describes the Cz-SSC growth process and the need for V/G criterion prediction in crystal quality. Section 3 elaborates on the theoretical foundations and modeling methodology of the proposed soft sensor model. Section 4 conducts comparative experiments to evaluate the proposed approach qualitatively and quantitatively against existing mainstream modeling methods, analyzing its performance advantages and scope of applicability. Finally, Section 5 summarizes the paper, outlines the core contributions and limitations of the methodology, and offers prospects for future work.
2. Cz Process and Problem Description
The growth of Cz-SSC is essentially a liquid–solid phase change process carried out under precise control [25]. It relies on the coordinated regulation of multiple parameters, such as the temperature field, pulling speed, and crystal rotation, to control the morphology and dynamics of the crystal growth interface. The objective of process operation is to produce high-quality, large-sized SSCs efficiently and at low cost while ensuring long-term stable operation of the single crystal furnace, so as to meet the strict requirements of semiconductor manufacturing for material perfection and consistency. The growth process is shown in Figure 1.
The crystal growth environment inside the Cz-SSC furnace involves the coupling of multiple physical fields, such as the temperature, flow, and stress fields, as well as complex interactions among the gas, liquid, and solid phases, making it a typical strongly nonlinear, dynamic, and time-varying process [26], as depicted in Figure 2. The process also takes place under extreme conditions, including high temperature, high vacuum, and strong magnetic fields. These complex working conditions make it difficult to monitor the quality indicators of crystal growth (such as defect density and micro-unevenness) directly and in real time with traditional hard sensors, which limits precise and optimal control of the SSC growth process. Fortunately, a series of process-operating variables can be monitored directly or indirectly during Cz-SSC growth, such as the crystal rotation rate, crystal pulling rate, crucible rotation rate, crucible lifting rate, crystal diameter, thermal field temperature distribution, and heater power. Most of these variables can be obtained reliably through the corresponding sensors, providing a multi-source data foundation for indirect perception, assessment, and prediction of the internal quality state of the crystal.
Therefore, establishing an accurate mapping model between measurable process variables and crystal quality indicators is of great practical significance for guiding the process personnel at the Cz-SSC growth site to accurately assess the crystal growth status and achieve dynamic control of key point defects. This is precisely the core starting point and motivation for this paper to conduct in-depth research on this issue.
3. Mechanism and Data Fusion-Driven Soft Sensor Model
3.1. Mechanism Sub-Model
Figure 3 illustrates the physics-based (mechanistic) sub-model for the Cz-SSC growth process. It consists of two coupled parts: (i) an energy-transfer block that characterizes heat transport to the growth interface, and (ii) a hydrodynamic/geometric pulling block that links meniscus shape and diameter evolution [27,28]. The measurable time series of the crucible rising rate, the pulling rate, the heater power, and the crystal diameter serve as inputs to the mechanistic block.
In this study, the energy transfer model, which in full involves a complete thermal field with complex radiation, conduction, and convection, is simplified so that it can be used for soft sensor modeling. Following the theoretical derivation of the energy transfer model in reference [27], this paper retains only the physical relationships most relevant to V/G (Equations (1), (3) and (4)). The complex heat transfer process can thus be represented by a computable thermal state estimate. Specifically, given the four measurable inputs, a proxy for the melt-side gradient is calculated first. The solid-side gradient is then deduced from the conservation of energy at the interface, and the V/G output of the mechanism sub-model is finally obtained. The calculation proceeds as follows:
Firstly, it is assumed that the process near the meniscus is axially symmetric and quasi-steady at each sampling interval. The thermophysical properties change slowly and are taken as constants. The solid–liquid interface temperature is fixed at the melting point of silicon. The axial temperature gradient at the meniscus is approximated by a finite difference along the meniscus height.
Secondly, the melt temperature and meniscus height are calculated. The melt-side axial temperature gradient at the meniscus is approximated as

$$G_l \approx \frac{T_b - T_m}{h} \tag{1}$$

where $T_m$ is the melting point temperature of silicon, $T_b$ is the melt temperature at the bottom of the meniscus (provided/estimated by the energy-transfer block), and $h$ is the meniscus height. The meniscus height $h$ is modeled by Equation (2) in terms of the capillary constant, the surface tension, the melt density, the gravitational acceleration, the growth angle (which usually takes a fixed default value), the crystal slope angle, and the crystal radius.
Then, conservation of energy at the interface links the gradients on the two sides:

$$k_s G_s - k_l G_l = \rho_s L V \tag{3}$$

where $k_l$ and $k_s$ are the thermal conductivities of the melt and crystal phases, respectively, $\rho_s$ is the crystal density, $L$ is the latent heat of crystallization, $V$ is the growth (solidification) rate, $G_l$ is the melt-side axial temperature gradient at the meniscus, and $G_s$ denotes the axial temperature gradient on the crystal (solid) side at the growth interface. Since $G_s$ is not directly measurable in our industrial setup, we estimate it by using Equations (1)–(3) as:

$$G_s = \frac{k_l G_l + \rho_s L V}{k_s} \tag{4}$$

Here, the growth rate $V$ is approximated from the relative motion. The crystal radius evolution follows the curved-interface geometry.
According to Voronkov’s criterion, the mechanistic V/G is defined as

$$\left(\frac{V}{G}\right)_{\mathrm{mech}} = \frac{V}{G_s}$$

In summary, given the four measurable inputs, the mechanistic block sequentially computes the intermediate quantities above and the resulting V/G value, providing a physics-consistent prior for the subsequent hybrid fusion.
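The calculation chain of the mechanistic block (melt-side gradient by finite difference, solid-side gradient from the interface energy balance, then the V/G ratio) can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name, argument names, and all default material constants are hypothetical placeholders rather than values from this paper, and the meniscus height and growth rate are taken as already available inputs instead of being computed from the process variables.

```python
def mechanistic_v_over_g(t_melt_bottom, meniscus_height, growth_rate,
                         t_melting=1685.0,      # K, placeholder melting point
                         k_melt=64.0,           # W/(m K), placeholder
                         k_crystal=22.0,        # W/(m K), placeholder
                         rho_crystal=2330.0,    # kg/m^3, placeholder
                         latent_heat=1.8e6):    # J/kg, placeholder
    """Return (melt-side gradient, solid-side gradient, V/G) in SI units."""
    if meniscus_height <= 0:
        raise ValueError("meniscus height must be positive")

    # Melt-side gradient: finite difference along the meniscus height.
    g_melt = (t_melt_bottom - t_melting) / meniscus_height

    # Interface energy balance: k_s * G_s = k_l * G_l + rho_s * L * V.
    g_crystal = (k_melt * g_melt
                 + rho_crystal * latent_heat * growth_rate) / k_crystal
    if g_crystal <= 0:
        raise ValueError("solid-side gradient must be positive")

    # V/G ratio with G taken on the crystal side of the interface.
    return g_melt, g_crystal, growth_rate / g_crystal


# Example call with illustrative numbers (5 K excess, 5 mm meniscus, 0.6 mm/min).
g_l, g_s, v_over_g = mechanistic_v_over_g(1690.0, 0.005, 1.0e-5)
```

Keeping the material constants as keyword arguments makes it easy to replace the placeholders with plant-specific values.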
3.2. Data-Driven Sub-Model
3.2.1. Attention Mechanisms
In deep learning, the design of the attention mechanism draws inspiration from the human brain’s ability to selectively focus on key parts when processing information. Its core principle is that the model can automatically assess the importance of different parts of the input information and assign higher weights to more critical elements, thereby concentrating the limited computing resources on the features that are more relevant to the current task objective.
Figure 4 shows a schematic of the attention mechanism [29].
Figure 4 illustrates the core framework of the attention mechanism, in which the source data are modeled as a sequence of <Key, Value> pairs. For a given Query in the target domain, the weight of each Value is obtained by calculating the similarity between the Query and the corresponding Key. The final attention output is the weighted sum of all Values:

$$\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} \mathrm{Similarity}(\mathrm{Query}, \mathrm{Key}_i)\,\mathrm{Value}_i$$

where $L_x$ describes the length of Source. The core objective of attention mechanisms is to selectively extract the components most critical to the current task from redundant or complex inputs. This suppresses the influence of secondary or distracting information and enhances the model’s ability to represent and learn from key features. The focusing process essentially consists of calculating a weight coefficient for each piece of input information. The magnitude of a weight directly reflects the model’s degree of attention to the corresponding Value: the higher the weight, the more the model focuses on that Value. In other words, the weight represents the importance of the information, and the Value is the information itself.
The computational process of the attention mechanism can be divided into three stages, whose logical relationships are illustrated in Figure 5.
In the first stage, the similarity or correlation is calculated based on the Query and $\mathrm{Key}_i$. In this paper, the vector dot product is used, i.e.,

$$\mathrm{Sim}_i = \mathrm{Query} \cdot \mathrm{Key}_i$$

In the second stage, the raw scores from the first stage are normalized with a softmax-type function. This transforms the scores into a probability distribution whose elements sum to one and, through the inherent behavior of the softmax function, accentuates the weights of the important elements. The calculation form is as follows:

$$a_i = \mathrm{softmax}(\mathrm{Sim}_i) = \frac{e^{\mathrm{Sim}_i}}{\sum_{j=1}^{L_x} e^{\mathrm{Sim}_j}}$$

where $a_i$ is the weighting factor corresponding to $\mathrm{Value}_i$ and $L_x$ is the number of <Key, Value> pairs.
In the third stage, the Values are weighted by these coefficients and summed to obtain the Attention value of the Query, i.e.,

$$\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} a_i\,\mathrm{Value}_i$$
3.2.2. Stacked Autoencoder
SAE is a neural network widely employed for deep feature extraction. Its architecture consists of multiple autoencoders (AEs) stacked sequentially, which enables layer-by-layer abstraction of the input data. As shown in Figure 6, each basic autoencoder unit comprises an input layer, a hidden layer (encoding layer), and an output layer (decoding layer), and learns an effective encoding of the data through unsupervised training. The input layer and hidden layer form the encoder, while the hidden layer and output layer constitute the decoder [18]. The computations involved in encoding and decoding are given below:

$$h = f(Wx + b)$$

$$\hat{x} = \tilde{f}(\tilde{W}h + \tilde{b})$$

where $x$ denotes the network input and $\hat{x}$ denotes the reconstructed output. $f$ and $\tilde{f}$ denote the activation functions of the encoding and decoding processes, respectively; the sigmoid function is employed throughout this paper. Here, $h$ denotes the features of the hidden layer, $W$ and $\tilde{W}$ are the weights of the encoding and decoding processes, respectively, and $b$ and $\tilde{b}$ are the corresponding biases.
The training of the AE network is accomplished by minimizing the reconstruction error between its input and output, with the specific form of this optimization objective shown in Equation (13).
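The forward pass and reconstruction objective of a single autoencoder can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names are hypothetical, and the mean squared error used here is an assumed form of the reconstruction loss, since the exact normalization of the objective is not restated in this sketch.

```python
import math


def sigmoid(z):
    """Logistic activation, used for both encoder and decoder."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Compute weights @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def autoencoder_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into hidden features h, then decode h into x_hat."""
    h = [sigmoid(z) for z in affine(w_enc, b_enc, x)]
    x_hat = [sigmoid(z) for z in affine(w_dec, b_dec, h)]
    return h, x_hat


def reconstruction_loss(x, x_hat):
    """Mean squared reconstruction error (assumed form of the objective)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# Two inputs, one hidden neuron, all parameters zero (untrained network).
h, x_hat = autoencoder_forward([0.0, 1.0],
                               w_enc=[[0.0, 0.0]], b_enc=[0.0],
                               w_dec=[[0.0], [0.0]], b_dec=[0.0, 0.0])
loss = reconstruction_loss([0.0, 1.0], x_hat)
```

Training would adjust the four parameter sets to reduce this loss; the optimizer is deliberately left out of the sketch.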
3.2.3. Stack-Enhanced Autoencoder
Traditional SAE networks achieve deep feature extraction from input data by minimizing the AE reconstruction error layer by layer. However, since no AE layer can be guaranteed to reconstruct its input accurately and without loss, information loss inevitably accumulates as feature learning proceeds layer by layer. To address this problem, this paper uses a stack-enhanced autoencoder (SEAE) [24], whose structure is shown schematically in Figure 7.
The SEAE network shown in Figure 7 is built by stacking several enhanced autoencoders (EAEs). The structure of an EAE is shown in Figure 8. The key mechanism of the SEAE network lies in its recursive input reconstruction process. Specifically, at each layer, the network reconstructs not only the features fed into the current layer but also the original input information, which acts as a constraint or reference within the reconstruction process. This reconstruction strategy enables the network to retain the integrity of the source data to the greatest extent during deep feature extraction, thereby compensating for the information loss caused by layer-by-layer abstraction in the traditional SAE and enhancing the robustness and physical interpretability of the feature representation.
In the actual network training process, the first EAE is only required to reconstruct the input information, so its loss function is simply the reconstruction error between the original input $x$ and the reconstructed output $\hat{x}$; its definition also involves the number of neurons in the hidden layer. After the training of EAE 1 is completed, its output layer is removed and its hidden layer is used as the input of EAE 2. For EAE 2, the quantities to be reconstructed are the hidden-layer features $h_1$ of EAE 1 and the original input $x$. The output layer of EAE 2 is then removed, and its hidden-layer output is used as the input of EAE 3, and so on.
For an SEAE with several hidden layers, the loss function of each subsequent EAE is defined in the same way, following [24].
It should be emphasized that compared with the traditional SAE network, the SEAE network simultaneously reconstructs the shallow data features and the original data information during the network training process to ensure that the deep features of the data are obtained while minimizing unnecessary information loss.
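The difference between an SAE layer and an SEAE layer lies in what the layer is asked to reconstruct. The sketch below, with hypothetical function names, builds the reconstruction target of the second and later EAEs by concatenating the previous hidden features with the original input. The equal weighting of the two parts and the mean squared error are assumptions made for illustration, not the exact loss of [24].

```python
def seae_target(prev_hidden, original_input):
    """Reconstruction target of EAE k (k >= 2): previous features plus raw input."""
    return list(prev_hidden) + list(original_input)


def seae_layer_loss(prev_hidden, original_input, reconstruction):
    """Mean squared error between the EAE output and the joint target."""
    target = seae_target(prev_hidden, original_input)
    if len(reconstruction) != len(target):
        raise ValueError("reconstruction must cover hidden features and input")
    return sum((t - r) ** 2 for t, r in zip(target, reconstruction)) / len(target)


# A plain SAE layer would only be scored on prev_hidden; the SEAE layer is
# also penalized for failing to recover the original input.
loss = seae_layer_loss([0.2, 0.4], [1.0], [0.2, 0.4, 0.0])
```

Because the raw input appears in every layer's target, errors made by shallow layers cannot silently propagate into deeper ones.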
3.2.4. SEAE Network Based on Attention Mechanism
The traditional SAE network is a static model and therefore ignores the influence of historical information on the current results. In actual industrial production, however, historical information is of great significance to the current state, and neglecting it reduces the feature extraction ability and final prediction accuracy of the network. To take historical information into account, this paper proposes a dynamic stack-enhanced autoencoder network based on an attention mechanism (AD-SEAE; see Figure 9). Figure 10 shows the calculation process of the attention sample.
In actual industrial production, the historical information closely related to the current state is mainly concentrated in the recent past, while information from the distant past has a very limited effect. Therefore, to reduce the computational cost, this paper introduces a sliding window to limit the number of historical samples used as input. In Figure 9, the input of the AD-SEAE network consists of two parts: the historical information processed by the attention mechanism and the current information. For the historical information, the calculation process is as follows:
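1. The historical samples in the window are mapped to the Key space and the Value space. The mapping process is as follows:

$$\mathrm{Key}_i = W^{K} x_i + b^{K}, \qquad \mathrm{Value}_i = W^{V} x_i + b^{V}$$

where $x_i$ is the $i$th historical sample in the window; $\mathrm{Key}_i$ and $\mathrm{Value}_i$ denote the $i$th Key and Value, respectively; $W^{K}$ and $b^{K}$ denote the Key mapping weight matrix and bias, respectively; and $W^{V}$ and $b^{V}$ denote the Value mapping weight matrix and bias, respectively. Here, the weight matrices and the biases are taken as the unit matrix and the all-zero matrix, respectively.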
2. The attention score is calculated in two steps: a dot product to measure similarity, followed by the softmax function to normalize the scores.
3. The Attention value is then obtained as the weighted sum of the Values.
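The three steps above can be sketched for one sliding window as follows. The sketch assumes that the current sample plays the role of the Query and that the Key and Value mappings are the identity with zero bias, so Keys and Values coincide with the raw historical samples; the function name is hypothetical.

```python
import math


def attention_sample(query, history):
    """Return (weights, attended sample) for one window of historical samples.

    query   -- current sample, a list of floats (assumed to act as the Query)
    history -- list of past samples inside the sliding window
    """
    keys = values = history  # identity mapping with zero bias (assumption)

    # Step 2a: dot-product similarity between the Query and each Key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]

    # Step 2b: softmax, shifted by the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Step 3: weighted sum of the Values, dimension by dimension.
    dim = len(values[0])
    attended = [sum(w * v[d] for w, v in zip(weights, values))
                for d in range(dim)]
    return weights, attended


weights, attended = attention_sample([1.0, 0.0],
                                     [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

In the example, the two historical samples that match the current sample receive the larger weights, so the attended sample is pulled toward them.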
The difference between AD-SEAE and SEAE lies mainly in the first EAE. The inputs of the first EAE network include both the current input and the historical input obtained by combination through the attention mechanism. Its parameters therefore comprise the encoding and decoding weights of the first EAE network, the corresponding bias, and the weight and bias associated with the combined historical input information. The loss function of the first EAE network in the AD-SEAE network contains not only the reconstruction of the original information but also the reconstruction of the attention-combined historical information, with the contribution of the latter set by a historical information weight adjustment parameter. This ensures that the encoder extracts both the current features and the relevant historical features.
The basic process of the AD-SEAE network contains three steps. Firstly, the required historical samples are determined by choosing an appropriate window length, and the combined historical information is obtained through the attention mechanism. Then, the attention samples and the current samples are combined and fed into the AD-SEAE model to extract the deep combined features. Finally, the mapping between the deep combined features and the target variables is modeled to predict the target variables. Table 1 summarizes the modeling algorithm of the AD-SEAE model.
3.3. Output Fusion of Mechanisms Model and Data Model
To make full use of the prediction advantages of the mechanism sub-model and the data-driven sub-model, a model fusion weight adjustment method based on entropy weight is proposed. The entropy weight method is an objective weighting method that determines the weight of each index according to its variability [30]. Specifically, the smaller the degree of variation of an index, the less information it reflects, and the lower its weight. The specific fusion steps are as follows:
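1. In this paper, the root mean square error (RMSE) and the mean absolute error (MAE) are used as the basis for solving the weights, that is

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the actual value, and $N$ is the number of samples.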
2. A prediction accuracy value is defined for each sub-model on the basis of these error measures, and a likelihood measure parameter is then derived from this accuracy value (Equation (24)). The larger the likelihood measure, the higher the prediction accuracy of the sub-model. The accuracy values of the mechanism sub-model and the data-driven sub-model over a window of length $L$ are collected into a prediction accuracy data set (Equation (26)), whose $i$th row contains the $i$th prediction output accuracy of the mechanism sub-model and of the data-driven sub-model within the window.
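3. By standardizing Equation (26), a standardized matrix can be obtained.
Let $p_{ij}$ denote the share of the $i$th standardized accuracy value of sub-model $j$ in the column sum over the window of length $L$; then the entropy can be expressed as:

$$E_j = -\frac{1}{\ln L}\sum_{i=1}^{L} p_{ij}\ln p_{ij}$$

When $p_{ij} = 0$, there exists $p_{ij}\ln p_{ij} = 0$.
4. The information entropy of the mechanism sub-model and data-driven sub-model can be expressed as $E_1$ and $E_2$, respectively. Therefore, the fusion weight of each sub-model follows the standard entropy weight form:

$$w_j = \frac{1 - E_j}{\sum_{k=1}^{2}\left(1 - E_k\right)}, \qquad j = 1, 2$$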
Therefore, the final prediction output can be expressed as follows:

$$\hat{y} = w_1\,\hat{y}_{\mathrm{m}} + w_2\,\hat{y}_{\mathrm{d}}$$

where $w_1$ and $w_2$ are the fusion weights of the two sub-models, $\hat{y}_{\mathrm{m}}$ represents the predicted output of the mechanism sub-model, and $\hat{y}_{\mathrm{d}}$ denotes the predicted output of the data-driven sub-model. It is worth mentioning that the RMSE/MAE-driven weight update described above is an offline calibration, which determines the fusion weights on historical data (including offline reconstructed labels). During online operation, only forward inference and fusion are performed, without relying on any online labels.
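A minimal sketch of entropy-weight fusion is given below. It implements the textbook entropy weight method on a window of accuracy values (one row per window step, one column per sub-model) and then blends two predictions. How the accuracy values themselves are derived from RMSE and MAE is not reproduced here, so the input matrix, the function names, and the equal-weight fallback for the degenerate case are all assumptions of this sketch.

```python
import math


def entropy_weights(accuracy_rows, tol=1e-12):
    """Entropy-weight method over a window of positive accuracy values."""
    n = len(accuracy_rows)          # window length
    m = len(accuracy_rows[0])       # number of sub-models
    if n < 2:
        raise ValueError("window must contain at least two rows")

    divergences = []
    for j in range(m):
        column = [row[j] for row in accuracy_rows]
        total = sum(column)
        probs = [c / total for c in column]
        # Terms with p == 0 contribute nothing (0 * ln 0 is taken as 0).
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergences.append(max(0.0, 1.0 - entropy))

    spread = sum(divergences)
    if spread < tol:                # all columns uniform: fall back to equal weights
        return [1.0 / m] * m
    return [d / spread for d in divergences]


def fuse(y_mechanism, y_data, weights):
    """Weighted combination of the two sub-model predictions."""
    return weights[0] * y_mechanism + weights[1] * y_data


window = [[0.9, 0.5], [0.9, 0.7], [0.9, 0.9]]
w = entropy_weights(window)
y_fused = fuse(1.0, 3.0, w)
```

In line with the offline calibration described above, `entropy_weights` would be run on historical data, and only `fuse` would be evaluated online.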
3.4. M-AD-SEAE-Based Soft Sensor
The main steps of the M-AD-SEAE soft sensor modeling method proposed in this paper are as follows:
1. Data Preprocessing
- (1) Outlier processing: In actual industrial production, some data points inevitably deviate from their expected values because of noise and other factors. If these outliers are modeled together with normal data, model accuracy degrades considerably. Therefore, this paper uses the $3\sigma$ criterion to eliminate outliers from the data set. Let the measured value be $x_i$, the average value be $\bar{x}$, the absolute error be $v_i = |x_i - \bar{x}|$, and the standard deviation be $\sigma$. If the absolute error of a measured value satisfies $v_i > 3\sigma$, $x_i$ is considered abnormal and is eliminated.
- (2) Standardization: The auxiliary variables required by the data-driven sub-model differ in magnitude. If the raw values are used directly, the convergence time of the network becomes longer. Therefore, the input data and output data must be standardized before modeling.
2. Auxiliary feature variable selection:
Based on the standardized data, auxiliary variables are selected according to a mixed correlation with the target variable that combines the Pearson and Spearman coefficients. That is
where
denotes the correlation adjustment parameter, and
denotes the difference in rank order between
and
. Based on the descending order of the mixed correlation values, the more relevant auxiliary variables can be selected.
3. AD-SEAE network modeling:
First, the modeling samples are selected. Then, the attention scores are calculated and the attention samples are obtained. Next, the attention samples and the current samples are used as inputs to the AD-SEAE network. Finally, the network model is obtained by pre-training the network layer by layer and then fine-tuning it through backpropagation.
4. Establishment of mechanism sub-model:
According to Equations (1)–(4) as shown above, the corresponding mechanism sub-model is established.
5. Prediction output fusion:
First, the prediction accuracy of the mechanism sub-model and the data-driven sub-model is computed. Then, the information entropy and fusion weight of each sub-model are calculated according to the model prediction accuracy. Finally, the predicted output results are fused according to Equation (28).
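The preprocessing and variable-selection steps (steps 1 and 2 above) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: z-score scaling is assumed for the standardization, the mixed correlation is assumed to be a convex combination of the absolute Pearson and Spearman coefficients weighted by a parameter `lam`, the Spearman coefficient uses the rank-difference formula (valid without ties), and all function names are hypothetical.

```python
import numpy as np

def three_sigma_mask(x):
    """True for samples kept by the 3-sigma criterion (|x - mean| <= 3*std)."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean()) <= 3.0 * x.std()

def zscore(data):
    """Column-wise z-score (assumed standardization; no constant columns)."""
    data = np.asarray(data, dtype=float)
    return (data - data.mean(axis=0)) / data.std(axis=0)

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman coefficient from rank differences d_i (assumes no ties)."""
    n = len(x)
    d = np.argsort(np.argsort(x)) - np.argsort(np.argsort(y))
    return 1.0 - 6.0 * float(np.sum(d.astype(float) ** 2)) / (n * (n ** 2 - 1))

def mixed_correlation(x, y, lam=0.5):
    """Assumed mixing rule: lam * |Pearson| + (1 - lam) * |Spearman|."""
    return lam * abs(pearson(x, y)) + (1.0 - lam) * abs(spearman(x, y))

def rank_auxiliary_variables(X, y, lam=0.5):
    """Return column indices sorted by descending mixed correlation, and scores."""
    X = np.asarray(X, dtype=float)
    scores = np.array([mixed_correlation(X[:, j], y, lam) for j in range(X.shape[1])])
    return np.argsort(-scores), scores

# Hypothetical data: column 0 is monotone in y, column 1 is weakly related.
y = np.arange(10.0)
X = np.column_stack([y ** 3, [3, 1, 4, 1.5, 5, 9, 2, 6, 5.5, 3.5]])
order, scores = rank_auxiliary_variables(zscore(X), y)
```

Combining the two coefficients lets a monotone but nonlinear relationship (high Spearman, lower Pearson) still rank highly, which a purely linear measure would understate.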
4. Case Study
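To validate the performance of the algorithm proposed in this paper, simulation experiments were conducted on two systems: a numerical case and an actual Cz-SSC production process. The model prediction performance metrics consisted of three indicators: MAE, MAPE, and RMSE, where
$$\mathrm{MAPE} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{\hat{y}_i - y_i}{y_i}\right|$$
with $\hat{y}_i$ the predicted value, $y_i$ the actual value, and $N$ the number of samples.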
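For reference, the three indicators can be computed with a few lines of NumPy. The sketch below uses the standard definitions; the function names are illustrative, and MAPE is expressed in percent and assumes no actual value is zero.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_pred - y_true)))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (y_true must be non-zero)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(100.0 * np.mean(np.abs((y_pred - y_true) / y_true)))

def rmse(y_true, y_pred):
    """Root mean square error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Hypothetical values, for illustration only.
actual = [2.0, 4.0]
predicted = [1.0, 6.0]
results = (mae(actual, predicted), mape(actual, predicted), rmse(actual, predicted))
```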
4.1. Case 1: Generalized Numerical Case
To verify the applicability and effectiveness of the proposed M-AD-SEAE, this paper employed the following discrete-time nonlinear system to characterize the batch process of a specific group of industrial objects.
where
,
, and
denote the input, state, and output of the system at the time
. To mimic six batch runs (six operating conditions), we generated six different input trajectories
,
. The specific form was as follows:
For each case, the corresponding input trajectory was injected into Equation (34) to generate the corresponding output. In our soft-sensing setting, the first five outputs were treated as auxiliary variables (inputs), while the sixth output was treated as the target variable to be estimated.
The initial conditions were randomly set as
and
, respectively. To emulate measurement noise under different operating conditions, additive noise with different amplitudes was imposed on each output sequence
, which yielded the noisy data shown in
Figure 11. To verify the effectiveness of the M-AD-SEAE soft sensor modeling method proposed in this paper, the SAE, SEAE, and AD-SEAE models were used for comparison.
Before the soft sensor model was established on the simulated data, the data were preprocessed. Given that the network parameters (i.e., weights and biases) significantly affect model performance, this paper employed particle swarm optimization to optimize them. The resulting network structure was [1, 3, 6, 7]. The initial learning rate was set to 0.25 with a reduction factor of 0.5, and the activation function was ‘sigm’.
Figure 12 and Figure 13 display the prediction outputs on the test set and the corresponding error curves for the four soft sensor models. The figures reveal significant differences among the predictions of these models. For the original SAE model, the maximum absolute error was 16.3277. Because information can be lost during layer-by-layer feature extraction in the SAE network, the SEAE model was introduced, which reduced the maximum absolute error to 12.8334. Building upon SEAE, AD-SEAE further considered the influence of system history information and produced predictions that fluctuated closely around the true value. However, its maximum absolute error was relatively large, at 17.2917.
The primary reason was that introducing historical information also brought in historical noise, which led to significant errors in the model output when the true value changed suddenly. This problem can be mitigated by incorporating a mechanistic model, so that the combined strengths of both approaches further enhance the prediction accuracy. Compared to AD-SEAE, the introduction of the mechanistic model in M-AD-SEAE reduced the maximum absolute error to 15.1239, demonstrating an improvement in performance.
The prediction performance indices of the four models are presented in Figure 14 and Table 2, which show that prediction performance improved markedly with each step of model enhancement. Specifically, compared with the original SAE model, the MAE of the other models (SEAE, AD-SEAE, and M-AD-SEAE) was reduced by 18.36%, 35.55%, and 62.76%, respectively. Similarly, the MAPE decreased by 18.37%, 35.57%, and 62.78%, while the RMSE decreased by 16.89%, 28.46%, and 56.12%, respectively. These findings further demonstrate the suitability of the proposed modeling method for soft sensor modeling in numerical cases.
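The percentage reductions quoted above are relative to a baseline model. A minimal sketch of that calculation is given below; the helper name and the numbers are hypothetical and are not taken from Table 2.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Hypothetical error values, for illustration only.
example = relative_reduction(2.0, 1.5)
```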
4.2. Case 2: Cz-SSC Growth Process
The historical database of the Cz-SSC furnace recorded a large amount of process data reflecting the state of crystal growth. In actual single-furnace production, the equal-diameter growth stage of Cz-SSC takes a long time. Therefore, this paper selected 20,000 sets of sample data from historical industrial growth logs, with a sampling interval of 2 s. The first 85% of the data were used for model training and validation, and the last 15% were used to test the prediction performance of the model. All 20,000 samples came from the same furnace.
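Because the samples form a time series from a single furnace, the 85%/15% split is chronological rather than shuffled. A minimal sketch of such a split, with a hypothetical helper name and placeholder data, is:

```python
import numpy as np

def chronological_split(data, train_fraction=0.85):
    """Split time-ordered samples into an early part and a late part."""
    n_train = int(round(len(data) * train_fraction))
    return data[:n_train], data[n_train:]

# Placeholder array standing in for the 20,000 time-ordered samples.
samples = np.arange(20000)
train_part, test_part = chronological_split(samples)
```

With 20,000 samples this yields 17,000 samples for training and validation and 3,000 for testing, and the test samples all come later in time than the training samples.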
According to expert experience, the auxiliary variables that affect the V/G value can be preliminarily determined. As listed in Table 3, these auxiliary variables include crystal diameter, crystal pulling rate, main heater power, heating element temperature, crucible rise speed, liquid level temperature, crystal rotation speed, and crucible rotation speed. Here, the Pearson and Spearman coefficients were used to measure the correlation. The auxiliary variables and the target variable after data preprocessing are shown in Figure 15.
To verify the effectiveness of M-AD-SEAE in the Cz-SSC process, the SAE, SEAE, and AD-SEAE models were used for comparison. Here, the network structure of M-AD-SEAE was determined using the particle swarm optimization algorithm as [1, 3, 8, 17]. Among them, “17” denoted the input dimension, “1” represented the output dimension of the network, and [3, 8] indicated that the network comprised two hidden layers. The initial learning rate was set to 0.05, the learning rate reduction factor was 0.9, and the activation function was ‘sigm’. To ensure fairness, the parameters of the SAE, SEAE, and AD-SEAE networks were kept consistent with those of M-AD-SEAE.
To compare the prediction performance of each model intuitively, Figure 16 shows the V/G index prediction outputs of SAE, SEAE, AD-SEAE, and M-AD-SEAE. The prediction accuracy of the SEAE model was higher than that of the SAE model because the SEAE network added an operation that reconstructs the original data during feature extraction, which reduced the cumulative loss of original information. Further, the prediction accuracy of the AD-SEAE model was better than that of the SEAE network, mainly because the self-attention module increased the model’s focus on key information, which enabled the predictions to track the actual values consistently and to match the trend well at inflection points. In short, the V/G indexes predicted by the proposed M-AD-SEAE model agreed well with the actual curves in both values and trends, which further illustrated the validity and prediction accuracy of the method proposed in this paper. It is worth mentioning that the “actual value” in Figure 16 is not the V/G directly measured by an online sensor, but rather the V/G reference label value obtained through offline reconstruction and calibration.
The prediction error curves of the different models are shown in Figure 17. The prediction errors of the SAE and SEAE models were significantly higher than those of AD-SEAE and M-AD-SEAE, which is a clear disadvantage for the semiconductor SSC growth process, where crystal quality control demands high precision. Although AD-SEAE and the proposed M-AD-SEAE showed a small abrupt change in error at t = 2800~2830, their errors remained stable for the rest of the time. Overall, the M-AD-SEAE model proposed in this paper had high prediction accuracy and could meet the monitoring requirements of the Cz-SSC growth process.
To compare the prediction performance of each model more clearly, this paper analysed the error histogram shown in Figure 18 and the prediction performance indices of each model listed in Table 4. Figure 18 shows that incorporating the attention module markedly enhanced AD-SEAE relative to SEAE. Moreover, by integrating the mechanistic sub-model, the proposed M-AD-SEAE further outperformed AD-SEAE. As reported in Table 4, introducing the original-information pathway already improved SEAE over the baseline SAE. With attention enabled, AD-SEAE achieved reductions of 91.4% (MAE), 91.3% (MAPE), and 71.0% (RMSE) compared with SEAE. In addition, M-AD-SEAE delivered further decreases of 52.8%, 52.0%, and 53.2% in MAE, MAPE, and RMSE, respectively, over AD-SEAE, giving the smallest prediction error and the highest accuracy. Overall, M-AD-SEAE provided the best predictive performance among the compared methods and was well suited for online estimation of the V/G quality indicator in Cz-SSC growth.