Dimension Reduction of Multivariable Optical Emission Spectrometer Datasets for Industrial Plasma Processes

A new data dimension-reduction method, called Internal Information Redundancy Reduction (IIRR), is proposed for application to Optical Emission Spectroscopy (OES) datasets obtained from industrial plasma processes. In a semiconductor manufacturing environment, for example, real-time spectral emission data is potentially very useful for inferring information about critical process parameters such as wafer etch rates. However, the relationship between the spectral sensor data gathered over the duration of an etching process step and the target process output parameters is complex. OES sensor data has high dimensionality (fine wavelength resolution is required in spectral emission measurements in order to capture data on all chemical species involved in plasma reactions), and full spectrum samples are taken at frequent time points so that dynamic process changes can be captured. To maximise the utility of the gathered dataset, it is essential that information redundancy is minimised, with the important requirement that the resulting reduced dataset remains in a form that is amenable to direct interpretation of the physical process. To meet this requirement and to achieve a high reduction in dimension with little information loss, the IIRR method proposed in this paper operates directly in the original variable space, identifying peak wavelength emissions and the correlative relationships between them. A new statistic, Mean Determination Ratio (MDR), is proposed to quantify the information loss after dimension reduction, and the effectiveness of IIRR is demonstrated using an actual semiconductor manufacturing dataset. As an example of the application of IIRR in process monitoring/control, we also show how etch rates can be accurately predicted from IIRR dimension-reduced spectral data.


Introduction
As indicated in recent International Technology Roadmap for Semiconductors reports [1], the future fabrication cost per unit area of integrated circuits (IC) will be decreased by moving to larger-diameter semiconductor wafers in the fabrication process. However, this move will require more sophisticated and precise process control mechanisms to ensure that process yields are preserved. Hence, driven by practical future manufacturing requirements, the design of process control mechanisms continues to be an active research topic in the IC manufacturing domain.
Plasma etching is a key processing method employed in IC fabrication steps. By first masking areas of the silicon wafer being processed, subsequent exposure to plasma yields the required etched features on the surface of the wafer. The process is fundamentally complex from a physical and engineering control perspective and sensitive to an array of process parameters [2]. As the current understanding of the underlying physics and chemistry is too incomplete to allow for pre-determined process control, etching processes are often developed empirically [3]. Critical to empirical control (and to the development of further fundamental understanding of the process) is the development of mechanisms for plasma monitoring by sensor data collection and analysis.
Generally, there are two types of plasma diagnostic sensors: intrusive sensors and non-intrusive sensors. One popular intrusive technology is the Langmuir probe [4], which is immersed directly into the plasma. Although direct measurements of targeted plasma parameters may be made, the direct immersion of the probe into the process environment results in changes in the temperature, density, and potential of the plasma and ultimately affects etching process results. Non-intrusive plasma process monitoring technologies include impedance monitoring [5], reflectometry sensing [6] and OES [7]. Due to the abundant information that can be extracted from the data and the direct (although complex) relationship of the data to the etching process, OES is widely applied to IC fabrication [7]. The richness of OES data is also a potential hindrance to effective interpretation and utility of the data. Of particular concern is data dimensionality. For example, a miniature Ocean Optics USB4000 fibre optic spectrometer, as used in the present work, provides intensity measurements of 2,048 wavelengths from 178.31 nm to 874.27 nm [8]. Full spectrum samples are typically taken every 0.7 s over typically 40 s of a dynamically changing process and datasets from hundreds of such process runs are taken for statistical analysis.
Previous research on OES measurements of plasma etching processes has largely focused on the use of OES data for particular target applications, for example, virtual metrology methods [9,10], automatic process end-point detection strategies [11,12], and system condition monitoring [13]. In this paper the focus is more directly on the fundamental dimensionality problem of OES data, so that such applications can be better facilitated. In the next section, our general approach to an appropriate dimension-reduction for the specific data type in question is introduced and our approach is distinguished from existing dimension-reduction methods. Section 3 describes our proposed Internal Information Redundancy Reduction (IIRR) method in detail. Section 4 demonstrates that little information content is lost when the method is applied to a dataset from a real semiconductor manufacturing environment. Additionally, practical problems relating to the particular spectroscopy data are addressed, namely data pre-processing steps to deal with sensor output saturation and data time-stamp uncertainty. As an example of the application of IIRR in process monitoring/control, we also show how etch rates can be accurately predicted from IIRR dimension-reduced spectral data. Finally, Section 5 gives our conclusions and future work ideas. Abbreviations used in the remainder of this paper are listed in Table 1.

Motivation for Approach to OES Dimension Reduction
Our overall approach to the design of an effective dimension-reduction method for OES data is guided by the following factors: (i) at a fundamental level, emission spectra from chemical species in a plasma are composed of emissions at discrete wavelengths only. Thus, we wish to isolate and work with only peak wavelength intensities in our spectral data, the assumption being that non-peak intensities represent only noise; (ii) as emission lines from each chemical species are highly correlated we expect considerable data redundancy within spectra; (iii) to maximize the utility of the dimension-reduced data, we wish to avoid transforming the data to an abstract variable space (as is common in many dimension-reduction methods), instead working directly with wavelength variables; (iv) as plasma processing is a dynamic process, it is important to preserve time domain information, that is, our focus is on dimension reduction in the wavelength domain only.
From a plasma-etching viewpoint, there has been little focus on dimension and redundancy reduction of the OES dataset per se. Most previous research has been focused on application of the dataset (e.g., for process fault detection) where dimension reduction is used as a data pre-processing step but is not the focus itself. In [14], principal component analysis (PCA) (in conjunction with a hidden Markov model) is used for process end-point detection in plasma etching processes. In [15], a weighted PCA method is proposed for fault detection and classification in plasma etching. Besides OES data, other plasma diagnostic datasets were also used such as chamber impedance measurements and gas flow measurements. In order to reduce data dimensions, the original data is replaced by several summary statistics, such as averages, standard deviations, maxima, minima. In [16], Sparse PCA (SPCA) is used to select signature OES variables. In [17], Partial Least Squares (PLS), support vector machines, and rules ensemble methods are compared with each other for process yield prediction. Dimensionality of the input data is reduced using PLS and rules ensemble within the prediction process.
A general feature of these previous applications of dimension reduction of OES data is that generic methods (e.g., PCA, SPCA, or use of summary statistics) are applied directly to the full set of input wavelength variables, without regard to the specific nature of the dataset and these methods can have difficulty in finally isolating important variables in the original variable space. For example, it is not possible to trace back to individual wavelength measurements at a certain time point when only summary statistics are the output of the method [15]. In PCA-based methods, every Principal Component (PC) is a linear combination of all original variables. This is a problem if quantification of the contribution by each variable to certain PCs is required [18]. SPCA is a possible solution to this problem [19], but the grouping effect (equal weights tend to be given to highly correlated variables) is a weakness, leading to difficulty in final variable selection [16].
Other general dimension-reduction methods also have disadvantages for direct application to the problem at hand. Ensemble methods have been shown to be successful in identifying important variables in the original space [20]. However, ensemble learning methods (e.g., boosting, bagging [21], rules ensembles [20]) need to be supervised by knowledge of output variables, which in our case would be actual etch-rate measurements, and these are normally not available. Other supervised learning methods are similarly unsuitable in the current context. Factor Analysis (FA) [22], projection pursuit [23], Artificial Neural Networks (ANN), and Independent Component Analysis (ICA) all have their own particular issues. In [24], a number of problems are highlighted for the FA method: it is often possible to extract too few or too many factors, and factor stability can be a concern. For projection pursuit [23], high computational intensity is a disadvantage [25]. Compared with PCA, ANN gave a better dimension reduction result in [26]; however, ANN can suffer from relatively high computational load, over-fitting, and the empirical nature of model development [27]. ICA has a similar information transformation strategy to PCA and FA but can have difficulty in determining component number and component order [28]. Both problems lead to a high computational cost and make further interpretation of components difficult.
Based on the above mentioned difficulties in directly applying general dimension-reduction methods in our specific domain, we propose in this paper a new method, called Internal Information Redundancy Reduction (IIRR). A central feature of the method is the importance of peak values of the wavelength variables comprising the dataset, and consequently the opportunity to remove variables which do not exhibit significant peaking. Secondly, high correlation between certain groups of peak variables is known to exist, given the nature of the physical emission process. This helps us in designing our method to maximise redundancy reduction. We develop our method in the following sections, showing that a large reduction in the number of variables is possible with little information loss.

Dataset Description and Method Overview
A complete dataset of OES data comprises time-stamped spectral scans collected over multiple etching process runs. There are $N$ process runs, and a spectral scan is taken at each time instance $t_j$, $j = 1, 2, \ldots, M$, during each run. Each spectral scan yields a set of wavelength intensity measurements $\{I_k : k = 1, 2, \ldots, K\}$. The $K$ wavelengths measured, $\{\lambda_k : k = 1, 2, \ldots, K\}$, are equally spaced, typically across a range from $\lambda_1 = 178.31$ nm to $\lambda_K = 874.27$ nm, with $K = 2{,}048$ total wavelength values. Process run durations typically span approximately 40 s with spectral scans taken approximately every 0.7 s. Thus the complete dataset is the set of $N \times M \times K$ data points, which we denote by the vector: $d = \{d_{i,t_j} : i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, M\}$, where $d_{i,t_j}$ is the set of $K$ wavelength intensities measured at time $t_j$ during process run $i$. We note here that the sensor employed in OES measurement can saturate its output value at certain wavelengths that are prominent in the process. A method to deal with de-saturation is described in Appendix 1. We additionally note that, in the raw OES input data, time stamps from one process run to another may not always exactly align (the time between scans can vary during each process run). Also, the number of time points recorded for each process run is not necessarily the same. Thus, we must re-align/normalize the data in time before it is input to the IIRR procedure. This is done using the method described in Appendix 2 to yield data with an equal number of time points $M$ for all process runs, with time points fixed to standard time instances $t_j$, $j = 1, 2, \ldots, M$ (with $t_j - t_{j-1}$ constant) for all process runs. We assume de-saturated and time-synchronized data in the above and further definitions and descriptions in this section.
The IIRR method consists of three stages applied in sequence: Absolute Peak Selection (APS), the Iterative Ranking Process (IRP), and Optimal Peak Selection (OPS). Together they comprise the Internal Information Redundancy Reduction (IIRR) workflow, depicted in Figure 1. Firstly, APS identifies peak wavelength variables in the (de-saturated and normalized) input OES dataset $d$. The output of APS for a time point $t_j$ is the set of wavelength indices for which peaks in wavelength intensity occur, at time point $t_j$, during any of the $N$ process runs. Formally, the output of APS is the collection of sets of peak wavelength indices, denoted: $P = \{P_{t_j} : j = 1, 2, \ldots, M\}$ (2) where $P_{t_j} = \{k \in \{1, 2, \ldots, K\} : I_k$ is a peak wavelength intensity at time $t_j$ in at least one process run$\}$.
Next, the IRP algorithm takes each set of peak wavelengths $P_{t_j}$ and ranks each peak wavelength variable in the set based on how well it can be predicted from other peak intensity values in $P_{t_j}$. Peaks that are poorly predicted by others are assumed to hold more significant information and are ranked more highly. Repeating the procedure for each time point, the output of IRP is a dataset of the same dimensions as $P$ containing wavelength indices in order of ranking, i.e., a dataset $Q = \{Q_{t_j} : j = 1, 2, \ldots, M\}$, where: $Q_{t_j} = \{k_1, k_2, k_3, \ldots\}$ is the set $P_{t_j}$ ordered from highest to lowest rank. (3) Finally, the OPS algorithm calculates a measure of how well the first $r$ top-ranked peak wavelength variables can predict the full set of original wavelength variables at each time point $t_j$. An aggregate statistical measure, the Mean Determination Ratio (MDR), is used to summarize the prediction quality for different $r$, and a minimal value of $r$ is determined under a constraint on the MDR value. The procedure is repeated at each time point and the final output of OPS is a time series of the minimized peak wavelength sets, denoted $S = \{S_{t_j} : j = 1, 2, \ldots, M\}$. The details of these three stages of the IIRR procedure are given in the subsections below, accompanied by basic results illustrating the properties of each stage. Results in Section 4 then show that the whole procedure results in a high level of data reduction for a sample OES dataset, without significant loss of predictive power of the reduced set of variables.

Absolute Peak Selection
Absolute Peak Selection (APS) is a simple method to identify wavelength intensity variables that are relatively higher in value than neighboring wavelength intensities, while accounting for noise in wavelength intensity measurements. The noise accompanying each wavelength intensity measure is represented as a mean bias value $B$, which is derived from the specifications of the spectrometer. In our case, the USB4000 Spectrometer signal-to-noise ratio is quoted as 300:1 at full signal [8] and the appropriate value for $B$ is derived from this in Appendix 1. APS operates independently at each time point $t_j$ and, for each process sample, identifies a wavelength variable as a peak wavelength if its intensity is greater than the intensities of neighboring wavelengths plus the bias value. Let $d_{i,t_j}$ be the set of wavelength intensity values $\{I_k : k = 1, 2, \ldots, K\}$ measured at time point $t_j$ during process run $i$. Then peak wavelengths within this set are identified by the set: $P_{i,t_j} = \{k : I_k > I_{k-1} + B \ \text{and}\ I_k > I_{k+1} + B\}$.
Having found $P_{i,t_j}$ for all process runs $i = 1, 2, \ldots, N$, all peak wavelength variables are then aggregated as: $P_{t_j} = \bigcup_{i=1}^{N} P_{i,t_j}$. Finally, iterating this procedure at each time point, the complete output of APS is $P = \{P_{t_j} : j = 1, 2, \ldots, M\}$. We note that the rationale for this aggregation is that the data gathered from each process run is an independent sample of an (ostensibly) fixed etching procedure; thus the process state should be similar for all samples at the same time point, and so aggregation can achieve data reduction without significant loss of information content. Based on our OES dataset from a semiconductor etching process, we have found that APS reduces the original 2,048 wavelength variables to a relatively small number of peaks at each time point, ranging from 22 to 113 peaks (averaging ~47.7 peaks). Over all time points, 178 distinct peak wavelengths are detected.
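The APS step for a single time point can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: "neighboring wavelengths" is taken to mean the two immediately adjacent samples, the function names `aps_peaks` and `aps_time_point` are hypothetical, and indices are 0-based.

```python
import numpy as np

def aps_peaks(scan, bias):
    """Peak indices of one spectral scan (one run, one time point).

    Assumption: a sample is a peak if it exceeds BOTH adjacent samples
    by more than the noise bias. End points are never reported as peaks.
    """
    scan = np.asarray(scan, dtype=float)
    centre = scan[1:-1]
    is_peak = (centre > scan[:-2] + bias) & (centre > scan[2:] + bias)
    return set((np.nonzero(is_peak)[0] + 1).tolist())

def aps_time_point(scans, bias):
    """Union of per-run peak indices; scans has shape (N runs, K wavelengths)."""
    peaks = set()
    for scan in scans:
        peaks |= aps_peaks(scan, bias)
    return sorted(peaks)

# Toy data: three runs, six wavelengths (illustrative values only).
scans = np.array([
    [0.0, 5.0, 0.0, 0.0, 0.0, 0.0],   # peak at index 1
    [0.0, 0.0, 0.0, 4.0, 0.0, 0.0],   # peak at index 3
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.0],   # bump smaller than the bias: ignored
])
peak_indices = aps_time_point(scans, bias=1.0)
print(peak_indices)
```

Taking the union over runs mirrors the "at least one process run" condition in the definition of the aggregated peak set.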

Iterative Ranking Process
The ultimate goal of the Iterative Ranking Process (IRP) and Optimal Peak Selection (OPS) methods is to identify which subset of the peak wavelength variables can best represent the remaining variables, so that the set of peak wavelengths can be reduced with minimal information loss. This is done by the ranking procedure of IRP followed by a selection from top-ranked variables performed by the OPS procedure (Section 3.3).
Each set of peak wavelength intensity samples $P_{t_j}$ is treated separately by IRP, as follows. For each wavelength intensity variable $I_k$, $k \in P_{t_j}$, an ordinary least squares linear regression on the other peak variables in the set is performed: $I_k = \beta_{k,0} + \sum_{l \in P_{t_j},\, l \neq k} \beta_{k,l} I_l + \varepsilon_k$, where $\beta_{k,0}$ and $\beta_{k,l}$ are the regression coefficients and $\varepsilon_k$ the error term. IRP then calculates the coefficient of determination (the $R^2$ value) of the prediction of $I_k$, which we denote by $R_k$. The lowest ranked wavelength in $P_{t_j}$ is then identified as having the largest $R^2$ value: $k^* = \arg\max_{k \in P_{t_j}} R_k$, that is, the ranking number assigned to wavelength $k^*$ is equal to the total number of peaks in $P_{t_j}$, written $|P_{t_j}|$. This wavelength $k^*$ is then removed from the pool of peak wavelengths and the process is repeated on the new set $P_{t_j} \setminus \{k^*\}$ to yield the next ranked wavelength (rank number $|P_{t_j}| - 1$). The process repeats until only one wavelength remains in the pool, which is assigned the highest ranking (rank number 1). The complete output of the IRP process (for the time point $t_j$) is then the ordered set of wavelength indices: $Q_{t_j} = \{k_1, k_2, \ldots, k_{|P_{t_j}|}\}$, where $k_r$ is the wavelength index assigned rank $r$. This IRP process is repeated for each time point to yield the final output: $Q = \{Q_{t_j} : j = 1, 2, \ldots, M\}$. The rationale for the method is that peaks that are removed from the pool early (low ranked peaks) can be well predicted by the remaining pool of peaks and so hold relatively less information. Peaks that remain in the pool are less correlated with others and are ranked higher. We note that in our IRP method, removing peaks and reiterating the evaluation with a decreasing pool size should improve the sensitivity of the ranking between peaks. In particular, in very highly redundant datasets, a simpler method of ranking based only on a single evaluation of $R^2$ over the full pool can yield all $R^2$ values very close to 1, giving only a weak distinction between peaks.
We apply IRP to the APS output of the OES test dataset mentioned in Section 3.1. Figure 2 shows samples of the IRP output for four example time points during the etching process. The curves show the maximal coefficients of determination $R_k$ as the procedure iterates, that is, the $R^2$ value of the peak to be removed from the pool at each iteration. At the start of the procedure the $R^2$ values are very close to 1; towards the end of the procedure, only the highest ranked peak variables, with lower $R^2$ values, remain. In terms of identifying an opportunity for data reduction, it can be seen that only relatively few high-ranked peaks have lower $R^2$ values. This general pattern was observed at all time points in the IRP output.
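The backward-elimination ranking described in this subsection can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes an intercept term in each regression, assumes more process runs than peak variables (otherwise every fit is perfect), and the names `r_squared` and `irp_rank` are hypothetical. The returned values are column positions within the peak matrix, which would still need mapping back to wavelength indices.

```python
import numpy as np

def r_squared(y, X):
    """R^2 of an ordinary least squares fit (with intercept) of y on the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - np.sum(resid ** 2) / ss_tot

def irp_rank(data):
    """Rank peak variables at one time point.

    data: array of shape (N runs, P peaks).
    Returns column indices ordered from rank 1 (kept longest) to rank P
    (removed first, i.e. best predicted by the others).
    """
    pool = list(range(data.shape[1]))
    removed = []
    while len(pool) > 1:
        scores = []
        for k in pool:
            others = [c for c in pool if c != k]
            scores.append(r_squared(data[:, k], data[:, others]))
        best_predicted = pool[int(np.argmax(scores))]  # largest R^2 leaves the pool
        removed.append(best_predicted)
        pool.remove(best_predicted)
    return pool + removed[::-1]

# Toy data: column 2 is almost a multiple of column 0; column 1 is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
c = 2.0 * a + 0.01 * rng.normal(size=50)
data = np.column_stack([a, b, c])
ranking = irp_rank(data)
print(ranking)
```

In the toy data, one of the two nearly collinear columns is eliminated first and so receives the lowest rank. Note that when only two variables remain, their two $R^2$ values coincide (both equal the squared correlation), so the order of the top two ranks is decided by floating-point rounding in this sketch.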

Optimal Peak Selection
Having ranked peak wavelength variables using IRP, the Optimal Peak Selection (OPS) procedure selects a top-ranked subset of the peak variables. The number of peaks in this subset is minimized under the constraint that the prediction of all wavelength variables by the subset of peak variables meets a specified prediction quality target. More formally, let $Q_{t_j}$ be the ranked set of peak wavelength indices $k_1, k_2, k_3, \ldots$, for time point $t_j$, and let $K'$ be the size of this set. For each $r \in \{1, 2, \ldots, K'\}$, the OPS procedure regresses each wavelength intensity variable $I_k$, $k = 1, 2, \ldots, K$, on the set of peak wavelength variables $I_{k_1}, I_{k_2}, \ldots, I_{k_r}$, to yield prediction $\hat{I}_k$, and calculates how well $\hat{I}_k$ predicts $I_k$ by way of the $R^2$ value denoted $R_{r,k}$. (As in the IRP procedure, an ordinary least squares linear regression is used.) An aggregate measure of how well the set of peak variables $I_{k_1}, I_{k_2}, \ldots, I_{k_r}$ predicts the full set of wavelength variables is then given by the Mean Determination Ratio: $\mathrm{MDR}_r = \frac{1}{K} \sum_{k=1}^{K} R_{r,k}$. (10) Finally, an optimal value of $r$, denoted $r^*_{t_j}$, is determined (as explained below), which selects the final reduced set of peak wavelength variables for this time point $t_j$. Empirically, we have found that as $r$ is initially decreased from its maximum value of $K'$, the prediction quality (MDR value) remains at a high value and decreases very slowly. Eventually, as $r$ approaches 0, the MDR begins to drop off quickly. (Figure 3 illustrates this pattern for our test OES dataset.) Thus, we have chosen to use a threshold on the slope of $\mathrm{MDR}_r$ to determine the optimal value that gives a small variable set but with still high MDR value, that is, in the OPS procedure $r$ is decreased from its maximum value until: $\mathrm{MDR}_r - \mathrm{MDR}_{r-1} > m_{\mathrm{threshold}}$ (11) where $m_{\mathrm{threshold}}$ is chosen to achieve a desired trade-off between prediction quality and the number of remaining peak variables. The above process is repeated for each time point to yield the final output of the IIRR procedure as: $S = \{S_{t_j} : j = 1, 2, \ldots, M\}$, where $S_{t_j} = \{k_1, k_2, \ldots, k_{r^*_{t_j}}\}$.
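A compact sketch of the MDR statistic and the threshold-based selection is given below. It rests on two assumptions that are not confirmed by the text: MDR is computed as the plain mean of the per-wavelength $R^2$ values, and the slope test is the one-step difference $\mathrm{MDR}_r - \mathrm{MDR}_{r-1}$. The function names `mdr` and `ops_select` and the toy data are hypothetical, and constant columns (zero variance) are assumed to have been removed beforehand.

```python
import numpy as np

def mdr(full, ranked_peaks, r):
    """Mean R^2 over all K wavelength variables, each regressed (OLS with
    intercept) on the r top-ranked peak variables.

    full: array of shape (N runs, K wavelengths) at one time point.
    """
    n_runs = full.shape[0]
    A = np.column_stack([np.ones(n_runs), full[:, ranked_peaks[:r]]])
    coef, *_ = np.linalg.lstsq(A, full, rcond=None)   # all K regressions at once
    ss_res = np.sum((full - A @ coef) ** 2, axis=0)
    ss_tot = np.sum((full - full.mean(axis=0)) ** 2, axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))

def ops_select(full, ranked_peaks, m_threshold):
    """Shrink r from its maximum until dropping one more peak would reduce
    MDR by more than m_threshold (assumed reading of the slope criterion)."""
    r = len(ranked_peaks)
    current = mdr(full, ranked_peaks, r)
    while r > 1:
        smaller = mdr(full, ranked_peaks, r - 1)
        if current - smaller > m_threshold:
            break
        r, current = r - 1, smaller
    return ranked_peaks[:r]

# Toy data: six wavelength variables driven by two independent sources.
rng = np.random.default_rng(1)
s1 = rng.normal(size=40)
s2 = rng.normal(size=40)
base = np.column_stack([s1, s2, 2 * s1, s1 + s2, 3 * s2, s1 - s2])
full = base + 0.01 * rng.normal(size=base.shape)
ranked = [0, 1, 2, 4]            # hypothetical IRP ranking (column indices)
selected = ops_select(full, ranked, m_threshold=0.05)
print(selected, mdr(full, ranked, len(selected)))
```

In the toy example two peaks suffice to explain nearly all variance, and removing either causes a sharp MDR drop, which is the knee behaviour the threshold is meant to detect.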

Experimental Results and Discussion
In the previous section we have shown that, when applied to our test data set, the IIRR method can reduce the number of input variables by a large degree without a significant loss in prediction accuracy from the remaining variables to the full set of original input variables. To further validate the method, in this section we quantify the prediction quality of the reduced set of variables produced by IIRR when predicting an independent output variable, the etch rate. Although this measurement is not normally available from plasma etch process monitoring, for our particular test dataset of spectral data from a real semiconductor etching process, we have a corresponding final etch rate measurement for each process run. Our validation procedure is as follows.
We have 900 process samples (process runs) which we split equally into a training group and a testing group. A process sample contains the time series OES data for the process run plus one final etch rate measurement. The distribution of all etch rate samples in each group is shown in Figure 4.
The IIRR procedure is used to find a reduced set of wavelength measurements using only the training OES sample group. We note that the etch rate variable is not part of the IIRR training input, only the OES training samples. A sample of the OES measurements before and after the IIRR process is shown in Figure 5, for a typical process run. It can be seen that data dimension is reduced significantly. Across all samples, less than 0.27% of the original 2,048 wavelength measurements remain after IIRR.
We next compare the prediction accuracy of the IIRR reduced dataset to the prediction accuracy when using the full set of OES data. (We note that the full data set is first de-saturated and time normalized, see Appendixes 1 and 2, resulting in input data with 1,807 wavelength variables, of a possible 2,048, over 41 time points.) In either case, wavelengths (or peak wavelengths) at each time point are treated as independent parameters for prediction. Three popular prediction methods are tested to provide a regression-independent result: multiple linear regression (MLR), PLS, and ANN. We can see very good $R^2$ and mean absolute percentage error (MAPE) scores for the predictions. Interestingly, for the prediction of the testing dataset, there is better prediction accuracy (for the MLR and PLS cases) when using the IIRR reduced dataset compared to using the full dataset. We attribute this improvement to the noise reduction effect of IIRR. We additionally note that PLS achieves the best result. It has been noted previously, in [28], that PLS is a more suitable method than MLR when the data dimension is large and there is high redundancy. In addition to good overall $R^2$ and MAPE values, there is very good prediction accuracy across all individual samples, as shown in Figure 6.
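For reference, the two scores used in this comparison can be computed as below. This is a generic sketch assuming the usual definitions (MAPE as the mean absolute percentage error, expressed in percent, and $R^2$ as one minus the ratio of residual to total sum of squares); the function names and the numbers are illustrative only and are not taken from the paper's dataset.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be non-zero)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

def r2_score(y_true, y_pred):
    """Coefficient of determination of predictions against measurements."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up etch-rate values for illustration.
measured = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(mape(measured, predicted), r2_score(measured, predicted))
```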

Conclusions and Future Works
We have presented a new Internal Information Redundancy Reduction (IIRR) method for reducing the dimension of time series samples of OES measurements and, by use of real sample data from a plasma semiconductor process, have shown that the method can effectively reduce the number of wavelength intensity measurements required to accurately represent the data. As validation, we showed that prediction of an independent output variable (etch rate) can be done very effectively with the reduced set of variables, which comprise less than 0.27% of the original variables. In fact, prediction accuracy was slightly improved, compared to prediction with the full set of input variables.
We note that our IIRR operates in the original variable space, rather than a transformed variable space, which would make the method useful for OES analysis methods whose goal relates to physical interpretation of the data and process, for example in virtual metrology methods. We would also expect the method to be effective for application to high-dimensional spectral data from other processes, where the dataset represents a set of time series, each of which is an independent sample from the same fundamental process. Although the APS step of the algorithm is specific to OES datasets, the core method (IRP + OPS) could be expected to be effective for other (non-OES) high-dimensional time series datasets, where multiple independent samples of the same (repeatable) underlying process behaviour are available. However, we note a caveat here. As the IRP phase of the method ranks less correlated variables highly, there is a risk of biasing noise for inclusion in the final variable set. In our case, our interpretation of non-peak data as noise and its effective reduction/removal by APS avoids this scenario. For data from other processes, some similar insight to the nature of the noise and an effective noise reduction method would be required, so that a high level of data reduction can be achieved. On the other hand, as our IRP/OPS method is 'internal' in nature, not guided/biased by a chosen output variable(s), it is conservative in terms of attempting to distinguish unexplained variation from noise. As a stand-alone method of preparing a universal reduced OES dataset, that can be applied to prediction of multiple different output variables of interest, this may be useful.
Future work will investigate application of the method to other such data sets. Additionally, we will in future also consider how redundancy in the time domain can be reduced, which we have not considered in the present paper. In relation to our current OES plasma data, at least for certain periods of the process when it is less dynamic, the process is most likely over-sampled and there is an opportunity for further reduction without significant loss of important time domain information.

Appendix I

… (13) with an SNR of 300:1 at full signal for the spectrometer in [8]. Figure A1b shows the result of removal of the saturated intensity values. Over the full dataset, 241 of the 2,048 wavelengths are removed.

Appendix II. Time Series Normalization
Each etching process run outputs a time series of spectral intensity scans; however, the sequence of timestamps from one process run to another is not necessarily identical. As the IIRR method needs to group all samples at a given time point during its data processing stages, the timestamps need to be aligned to a normalized time scale. The time between samples averages approximately 0.7 s and, over all process samples, the minimum final time stamp is 40.14 s. We set the normalized time scale to have 1 s intervals, with the final timestamp at 40 s. Having set the time scale, the values in each time series (process run) are transformed by linearly interpolating the wavelength intensity values between the points either side of exact 1 s intervals. The process is illustrated in Figure A2 below with two representative samples from the data.
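The interpolation step can be sketched with `numpy.interp`, applied to each wavelength channel in turn. This is a minimal sketch under assumptions: the function name `normalise_run` is hypothetical, and the grid is assumed to start at 0 s (giving 41 points up to 40 s). Note that `numpy.interp` holds the end values constant for grid points outside the recorded time range.

```python
import numpy as np

def normalise_run(timestamps, scans, grid):
    """Linearly interpolate one process run onto a common time grid.

    timestamps: increasing scan times in seconds, shape (T,)
    scans:      intensities, shape (T, K wavelengths)
    grid:       target time points in seconds, shape (M,)
    Returns an array of shape (M, K).
    """
    timestamps = np.asarray(timestamps, dtype=float)
    scans = np.asarray(scans, dtype=float)
    columns = [np.interp(grid, timestamps, scans[:, k]) for k in range(scans.shape[1])]
    return np.column_stack(columns)

# Normalized time scale: 1 s spacing up to 40 s (start at 0 s is an assumption).
grid = np.arange(0.0, 41.0, 1.0)

# Toy run: scans every 0.7 s, one wavelength whose intensity is 10 * t.
t = np.arange(0, 60) * 0.7
run = (10.0 * t).reshape(-1, 1)
aligned = normalise_run(t, run, grid)
print(aligned.shape, aligned[:3, 0])
```

Because the toy intensity is linear in time, the interpolated values reproduce 10 times the grid time, which makes the result easy to check by eye.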