# Principal Component Analysis of Process Datasets with Missing Values

## Abstract

## 1. Introduction

## 2. Methods

#### 2.1. Introduction to PCA

#### 2.2. PCA Methods for Missing Data

## 3. Case Study

#### 3.1. Simulations of Gaussian Data

#### 3.2. Tennessee Eastman Problem

#### 3.3. Addition of Missing Data

#### 3.4. Results

## 4. Discussion

## Supplementary Materials

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## Abbreviations

| Abbreviation | Definition |
|---|---|
| ALM | Augmented Lagrange multipliers |
| BPCA | Bayesian PCA |
| EM | Expectation maximization |
| HLV | Heteroscedastic latent variable model |
| MAR | Missing at random |
| MCAR | Missing completely at random |
| MLPCA | Maximum likelihood PCA |
| NMAR | Not missing at random |
| PCA | Principal component analysis |
| PCADA | PCA-data augmentation |
| PPCA | Probabilistic PCA |
| RMSE | Root mean square error |
| SVD | Singular value decomposition |
| SVT | Singular value thresholding |
| TEP | Tennessee Eastman problem |

## Appendix A. Definition of the Subspace Angle

**Figure 1.** Possible realizations of the investigated missingness mechanisms: (**a**) random missingness; (**b**) sensor failure, which results in missingness that is correlated in time; (**c**) multi-rate data; and (**d**) censored data.
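The four mechanisms named in the Figure 1 caption can be illustrated with boolean masks over a data matrix. The following is a minimal sketch, not the authors' procedure: the function names, the failure length, the sampling period, and the censoring threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def random_mask(n, d, frac, rng):
    """True marks a missing entry; each entry is dropped independently."""
    return rng.random((n, d)) < frac

def sensor_drop_mask(n, d, n_failures, length, rng):
    """Each simulated failure blanks one variable for `length` consecutive samples."""
    mask = np.zeros((n, d), dtype=bool)
    for _ in range(n_failures):
        col = rng.integers(d)                 # which sensor fails (assumed uniform)
        start = rng.integers(n - length + 1)  # when the failure begins
        mask[start:start + length, col] = True
    return mask

def multirate_mask(n, d, slow_cols, period):
    """Slow variables are observed only at every `period`-th sample."""
    mask = np.zeros((n, d), dtype=bool)
    observed = np.arange(n) % period == 0
    for col in slow_cols:
        mask[~observed, col] = True
    return mask

def censor_mask(X, upper):
    """Values above an (assumed) upper detection limit are treated as missing."""
    return X > upper

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
masks = {
    "random": random_mask(200, 5, 0.1, rng),
    "drop": sensor_drop_mask(200, 5, n_failures=3, length=20, rng=rng),
    "multirate": multirate_mask(200, 5, slow_cols=[3, 4], period=4),
    "censor": censor_mask(X, upper=1.5),
}
# Replace masked entries with NaN to obtain the incomplete datasets.
X_missing = {name: np.where(m, np.nan, X) for name, m in masks.items()}
```

Only the censoring mask depends on the data values themselves; the other three depend on position alone, which is why censoring is the mechanism furthest from missing completely at random.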

**Figure 2.** Average RMSE of the missing data with standard deviation for the Gaussian cases. In the $d>n$ case, ALM never converged to a solution.

**Figure 3.** Average RMSE of the missing data with standard deviation for the Gaussian cases. In the $d>n$ case, ALM never converged to a solution.
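The RMSE of the missing data named in the captions above is evaluated only on entries that were removed. A minimal sketch follows, assuming the plain unnormalized form of the metric; any scaling used in the study is not stated here, and `rmse_missing` is a hypothetical helper name.

```python
import numpy as np

def rmse_missing(X_true, X_hat, mask):
    """Root mean square error over the entries flagged as missing (mask == True)."""
    diff = (X_hat - X_true)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy example: two of the four entries were missing and have been imputed.
X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_hat = np.array([[1.0, 2.5], [2.0, 4.0]])
mask = np.array([[False, True], [True, False]])
score = rmse_missing(X_true, X_hat, mask)
```

Observed entries are excluded on purpose: most methods reproduce them exactly or nearly so, and including them would dilute the error.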

**Figure 4.** Average subspace angle of learned vs. true subspace with standard deviation for the Gaussian cases.
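One common way to compute an angle between a learned and a true subspace is the largest principal angle, obtained from the singular values of the product of two orthonormal bases. The sketch below assumes that definition; the definition used in Appendix A may differ, and `subspace_angle` is a hypothetical helper name.

```python
import numpy as np

def subspace_angle(A, B):
    """Largest principal angle (radians) between the column spaces of A and B.

    Assumes both matrices have full column rank.
    """
    Qa, _ = np.linalg.qr(A)  # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis for span(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # The smallest cosine gives the largest angle; clip guards against round-off.
    return float(np.arccos(np.clip(cosines.min(), -1.0, 1.0)))

theta = 0.3
e1 = np.array([[1.0], [0.0], [0.0]])
tilted = np.array([[np.cos(theta)], [np.sin(theta)], [0.0]])
angle = subspace_angle(e1, tilted)  # approximately 0.3
```

The angle is invariant to the choice of basis within each subspace, so it compares loading matrices without requiring the individual principal components to be aligned or ordered identically.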

**Figure 5.** Average RMSE and standard deviation of the fully observed TEP test set. In all cases, ALM failed to converge.

**Table 1.** The minimum, average, and maximum number of PCs chosen using parallel analysis for each method over 20 realizations of the missing data. Each missingness type is combined with the naturally arising multi-rate missingness to total 25% missing data. ALM never converged; therefore, no results are reported.

| Missingness | Statistic | MI | ALS | Alt. | SVD. | PCADA | PPCA | PPCA-M | BPCA | SVT | ALM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | Min | 2 | 3 | 1 | 3 | 1 | 3 | 4 | 3 | 4 | – |
| Random | Avg | 2.95 | 3.2 | 4.15 | 3 | 2.55 | 3 | 4.3 | 3 | 4.95 | – |
| Random | Max | 3 | 4 | 7 | 3 | 4 | 3 | 5 | 3 | 5 | – |
| Drop | Min | 1 | 3 | 1 | 3 | 1 | 3 | 3 | 3 | 4 | – |
| Drop | Avg | 3.15 | 3.3 | 4.15 | 3 | 2.65 | 3 | 4.05 | 3 | 4.9 | – |
| Drop | Max | 4 | 4 | 6 | 3 | 5 | 3 | 5 | 3 | 5 | – |
| Censoring | Min | 1 | 3 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | – |
| Censoring | Avg | 3 | 3.5 | 3.65 | 2.9 | 2.6 | 2.85 | 3.3 | 2.9 | 1.65 | – |
| Censoring | Max | 4 | 5 | 7 | 3 | 7 | 3 | 5 | 3 | 4 | – |
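Parallel analysis, the selection rule named in the Table 1 caption, retains the leading components whose eigenvalues exceed what uncorrelated data of the same size would produce. The following is a minimal sketch on a complete data matrix, assuming a permutation-based null distribution and a 95% quantile threshold; how the rule was combined with each missing-data method in the study is not shown here, and the function name and defaults are hypothetical.

```python
import numpy as np

def parallel_analysis(X, n_draws=100, quantile=0.95, rng=None):
    """Number of leading PCs whose correlation-matrix eigenvalues exceed a null threshold."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    eig = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    null = np.empty((n_draws, d))
    for i in range(n_draws):
        # Permuting each column independently destroys cross-correlation
        # while preserving every variable's marginal distribution.
        Xp = np.column_stack([rng.permutation(X[:, j]) for j in range(d)])
        null[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(Xp, rowvar=False)))[::-1]
    thresh = np.quantile(null, quantile, axis=0)
    above = eig > thresh
    # Count components up to the first one that fails the test.
    return d if above.all() else int(np.argmin(above))

# Synthetic check: ten variables driven by three independent latent factors.
rng = np.random.default_rng(0)
Z = rng.standard_normal((500, 3))
loadings = np.zeros((3, 10))
loadings[0, :3] = 1.0
loadings[1, 3:6] = 1.0
loadings[2, 6:] = 1.0
X = Z @ loadings + 0.1 * rng.standard_normal((500, 10))
n_pcs = parallel_analysis(X, n_draws=50, rng=rng)
```

Because the thresholds are estimated from random permutations, the selected number of components can vary between runs on borderline data, which is one possible source of the spread between the minimum and maximum values in Table 1.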

**Table 2.** The mean detection times for each of the methods and missingness types. Cases are marked by “–” where every trial resulted in a false detection (e.g., a detection prior to $t=160$).

| Fault | Missingness | MI | ALS | Alt. | SVD. | PCADA | PPCA | PPCA-M | BPCA | SVT |
|---|---|---|---|---|---|---|---|---|---|---|
| Fault 1 | Random | 163.1 | 163 | 163 | 163 | – | 163.8 | 163.1 | – | 171.0 |
| Fault 1 | Drop | 163 | 163 | – | 163 | – | 163.7 | 163.4 | – | 170.5 |
| Fault 1 | Censor | 163.1 | 163.2 | 163 | 163.5 | – | 163.2 | 163.4 | – | – |
| Fault 13 | Random | 182 | 181.8 | 210 | 182 | – | 180.3 | 183.2 | – | 174 |
| Fault 13 | Drop | 182 | 181.4 | – | 181.3 | – | 182.3 | 179.3 | – | 174.5 |
| Fault 13 | Censor | 180.3 | 181.9 | 411 | 184.9 | – | 185 | 189.7 | – | – |

| Fault | Missingness | MI | ALS | Alt. | SVD. | PCADA | PPCA | PPCA-M | BPCA | SVT |
|---|---|---|---|---|---|---|---|---|---|---|
| Fault 1 | Random | 0 | 0 | 19 | 0 | 20 | 2 | 0 | 20 | 0 |
| Fault 1 | Drop | 9 | 0 | 20 | 0 | 20 | 1 | 1 | 20 | 1 |
| Fault 1 | Censor | 5 | 3 | 19 | 3 | 20 | 6 | 9 | 20 | 20 |
| Fault 13 | Random | 7 | 3 | 19 | 1 | 20 | 4 | 4 | 20 | 0 |
| Fault 13 | Drop | 11 | 4 | 20 | 5 | 20 | 5 | 4 | 20 | 0 |
| Fault 13 | Censor | 12 | 9 | 19 | 8 | 20 | 19 | 17 | 20 | 20 |

**Table 4.** The computational costs of each of the methods, where $d$ is the number of measurements, $n$ is the number of samples, $a$ is the latent dimension, and $k$ is the number of bootstrap samples.

| ALS/Alternating/PPCA/BPCA | SVDImpute/SVT/ALM | PCADA | PPCA-M |
|---|---|---|---|
| $O({a}^{2}dn+{a}^{3}n+{a}^{3}d)$ | $O(\mathrm{min}(n{d}^{2},{n}^{2}d))$ | $O(\mathrm{min}(kn{d}^{2},k{n}^{2}d))$ | $O(n{a}^{3}+nd{a}^{2})$ |

