# A Comparison of Clustering and Prediction Methods for Identifying Key Chemical–Biological Features Affecting Bioreactor Performance


## Abstract


## 1. Introduction

- Non-uniform and inconsistent sampling intervals.
- Order-of-magnitude differences in dimensionalities.
- Complex interactions between participating species in a bioreactor (e.g., protagonistic vs. antagonistic members, functionally-redundant members).
- Conditionally-dependent effects of features (i.e., some features only affect the outcome in the presence or absence of other features).

- During each operating stage, operators would only need to monitor a small set of variables, instead of hundreds or thousands. This simplifies the controller tuning and maintenance drastically, and undesirable multivariable effects (such as input coupling [14]) are reduced.
- By using data-based ML models, the process outcomes can be predicted ahead of time, such that unsatisfactory outcomes are prevented. Moreover, the models can be updated using new data collected from each new operating stage, eliminating the need for complete re-identification.
- The ranking of feature impacts can be performed using grey-box models, which are mostly empirical but are guided by a modest amount of system domain knowledge. This combination is exceptionally powerful if the domain knowledge is accurate, since it defines otherwise-unknown prior assumptions. This improves both prediction accuracy and feature analysis accuracy. The task of control and monitoring is also much more feasible, since the focus is only on a handful of variables (as opposed to hundreds).

## 2. Methods

#### 2.1. Process Flow Diagram and Description

#### 2.2. Data Pre-Treatment

#### 2.3. Unsupervised Learning Methods

#### 2.4. Network Analysis

#### 2.5. Supervised Learning Methods

#### 2.6. Feature Selection

- Hypothesis testing: A model is trained with all features left untouched. Then, features are either removed or permuted (scrambled), either individually or conditionally on other features. The model is re-trained, and its accuracy is compared to the base-case accuracy. The features whose removal or permutation causes the largest decrease in model accuracy are considered “most important,” and vice versa.
- Scoring: A metric or “score” based on information or cross-entropy is defined and calculated for all features. Features with the highest scores are identified as the “most relevant,” and vice versa.
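
The hypothesis-testing idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in the study: it evaluates a fixed model on permuted data instead of re-training it, and the toy data, the `predict` stand-in, and the function names are all hypothetical.

```python
import random

def accuracy(predict, X, y):
    """Fraction of samples for which predict(row) equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean decrease in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    base = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # scramble feature j across samples
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(base - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data (hypothetical): the label depends only on feature 0; feature 1 is noise.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]
predict = lambda row: int(row[0] > 0.5)  # stand-in for a trained model

importances = permutation_importance(predict, X, y)
```

Here the first entry of `importances` is large and the second is zero, because the stand-in model never reads feature 1.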

- Inability to recognize coupling effects between multiple features, such as correlations or redundancies [34].
- Inability to distinguish conditional effects between features, i.e., whether a feature is “relevant” given the presence of other feature(s).

## 3. Results

#### 3.1. Hierarchical Clustering of OTUs

- Unweighted pair-group method with arithmetic means (UPGMA)
- Ward’s minimum variance method (Ward)
- Nearest-neighbour method (Single-linkage)
- Farthest-neighbour method (Complete-linkage)

#### 3.2. Gaussian Mixture Analysis of OTUs

#### 3.3. Dirichlet Mixture Analysis of OTUs

#### 3.4. Prediction Results

- Base case: Water chemistry variables only.
- Hierarchical: Water chemistry variables plus representative OTUs obtained using hierarchical clustering.
- Gaussian: Water chemistry variables plus representative OTUs obtained using GMMs.
- Dirichlet: Water chemistry variables plus representative OTUs obtained using DMMs.

- Random forests (RFs)
- Support vector machines (SVMs)
- Artificial neural nets (ANNs)

#### 3.5. Feature Selection Results

#### 3.6. Summary and Critique of Results

## 4. Conclusions and Future Work

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
AIC | Akaike Information Criterion |
ANN | Artificial Neural Net |
ARIMA | Autoregressive with Integrated Moving Average |
ARMA | Autoregressive with Moving Average |
ARX | Autoregressive with Exogenous Inputs |
BIC | Bayesian Information Criterion |
C | Total number of discrete classes for a classification problem |
C-MDA | Conditional Mean Decrease in Accuracy |
COD | Chemical Oxygen Demand |
CP | Conditional Permutations |
d | Dimensionality of a dataset: number of variables or features |
DMM | Dirichlet Mixture Model |
DNA | Deoxyribonucleic Acid |
EBCT | Empty-Bed Contact Time |
DL | Deep Learning |
DR | Dimensionality Reduction |
FIR | Finite Impulse Response |
GAN | Generative Adversarial Network |
GMM | Gaussian Mixture Model |
IID | Independent and Identically Distributed |
K | Number of clusters or latent variables |
${\ell}_{p}$ | Lebesgue p-norm (value of p may vary) |
LIC | Laplace Information Criterion |
LTI | Linear Time Invariant |
MA | Moving Average |
MDA | Mean Decrease in Accuracy |
MGI | Mean Gini Impurity |
ML | Machine Learning |
MPC | Model Predictive Control |
MV | Manipulated Variable |
N | Total number of samples in a dataset |
NaN | Not a Number |
OTU | Operational Taxonomic Unit |
PID | Proportional-Integral-Derivative |
SCN | Stochastic Configuration Network |
SVM | Support Vector Machine |
rRNA | Ribosomal Ribonucleic Acid |
RF | Random Forest |
RL | Reinforcement Learning |
Se | Selenium |
SeD | Selenium Dissolved |
SeRR | Selenium Removal Rate |
UPGMA | Unweighted Pair-Group Method with Arithmetic Means |
ZOH | Zero-Order Hold |

## Appendix A. Machine Learning Nomenclature

## Appendix B. Model Training, Validation, and Testing

- Training: Samples used to obtain mathematical mappings (or models) between the input and output data.
- Validation (or development): Samples used to select optimal values of hyperparameters, for example model complexity (or order) and regularization constants. Systematic methods such as k-fold cross-validation are used.
- Testing: Samples restricted for assessing the performance (e.g., accuracy) of the selected model. This reflects its capability of generalizing to new, unseen samples.
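
The three-way partition and k-fold cross-validation can be sketched as index bookkeeping. This is a minimal sketch; the 60/20/20 split fractions, the fold count, and the function names are illustrative assumptions and are not taken from the study.

```python
import random

def split_indices(n_samples, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle sample indices and cut them into train/validation/test lists."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = round(frac_train * n_samples)
    n_val = round(frac_val * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def k_fold(indices, k):
    """Yield (fit, held_out) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, held_out

train, val, test = split_indices(100)
# Cross-validate over the pooled non-test samples; the test set stays untouched.
folds = list(k_fold(train + val, k=5))
```

Keeping `test` out of every fold is the point of the design: hyperparameters are chosen without ever seeing the samples used for the final accuracy estimate.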

## Appendix C. Standardization of Data

## Appendix D. Pretreatment of Data

- Uncalibrated, aging, or malfunctioning sensors
- Unexpected plant disruptions or shutdowns
- Human errors in data recording (either incorrect or missing values)
- Unmeasured, drifting disturbances (such as seasonal ambient temperatures)

- **Outlier removal based on human intuitions:** The elimination of spurious sensor values (e.g., negative flowrates recorded through a valve) using a priori knowledge. These values can be replaced either by NaN (missing) values or by estimates obtained via imputation.
- **Standardization:** The scaling of each feature to zero mean and unit variance, equalizing the effect of each individual feature. This prevents features with relatively large ranges (e.g., flowrate with range $\pm 1000$) from dominating model weights over features with relatively small ranges (e.g., pH with range $\pm 0.1$).
- **Imputation:** The estimation of missing values, using a priori knowledge if available, or using standard techniques such as interpolation, for example zero-order-hold (ZOH) or linear interpolation.
- **Smoothing:** The flattening of spiky measurements due to sensor noise, using techniques such as moving-average (MA) filters.
- **Common time-grid alignment:** The unification of sampling intervals for time-series data. For example, consider a variable measured every second, and another measured every 0.5 s. In order to model using both variables, each variable must contain the same number of samples. Therefore, the uniform time-grid can either be taken at every second (losing half the resolution of the second variable) or every 0.5 s (requiring interpolation of the first variable).
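
Several of these pre-treatment steps can be chained on a single signal. This is a minimal sketch with hypothetical flowrate values; the window length and the use of the population standard deviation are assumptions made for illustration.

```python
import math
import statistics

def zero_order_hold(values):
    """Replace NaN entries with the most recent valid value (ZOH imputation)."""
    filled, last = [], math.nan
    for v in values:
        if not math.isnan(v):
            last = v
        filled.append(last)  # a leading NaN stays NaN: nothing to hold yet
    return filled

def moving_average(values, window=3):
    """Trailing moving-average filter; early samples use a shorter window."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def standardize(values):
    """Scale to zero mean and unit variance (population standard deviation)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

flow = [1000.0, -5.0, 1010.0, math.nan, 990.0]    # hypothetical flowrates
flow = [math.nan if v < 0 else v for v in flow]   # negative flow is spurious
flow = zero_order_hold(flow)
smooth = moving_average(flow, window=3)
scaled = standardize(smooth)
```

The order matters in this sketch: outliers are marked as missing first, so that the imputation and smoothing steps never average a spurious value into its neighbours.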

## Appendix E. Details of the Random Forest (RF) Model

**Table A1.** Feature partitioning for a two-feature decision tree, with ${2}^{2}=4$ possible partitions. Each partition is labelled using a number between 0 and 3. The threshold values $\theta $ decide which partition each sample falls under.

| | ${x}_{1}<{\theta}_{1}$ | ${x}_{1}>{\theta}_{1}$ |
---|---|---|
${x}_{2}<{\theta}_{2}$ | 0 | 1 |
${x}_{2}>{\theta}_{2}$ | 2 | 3 |

**Figure A1.** Multiple decision trees constructed within a random forest for a binary-class problem. The outcomes (either Class 0 or 1) are decided by combining sequential splits of ${d}_{k}$ randomly selected features, from the original ${d}_{x}$-dimensional feature space. The final outcome is determined by a majority vote of individual outcomes from all trees.
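
The partition labelling of Table A1 and the majority vote of Figure A1 can be written compactly. This is a minimal sketch; the function names and the example thresholds are hypothetical.

```python
from collections import Counter

def partition_label(x1, x2, theta1, theta2):
    """Return the partition label (0 to 3) of Table A1 for one sample."""
    return int(x1 > theta1) + 2 * int(x2 > theta2)

def majority_vote(votes):
    """Return the class predicted by most trees (ties go to the first seen)."""
    return Counter(votes).most_common(1)[0][0]

label = partition_label(0.2, 0.9, theta1=0.5, theta2=0.5)  # x1 low, x2 high
outcome = majority_vote([0, 1, 1])                          # three tree outputs
```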

## Appendix F. Details of the Support Vector Machine (SVM) Model

**Figure A2.** Multi-class SVM for four classes. The hyperplanes (lines) in the 2-D space clearly separate the four distinct classes with acceptable misclassification rates. A smooth approximation can be made using a four-class softmax function.

**Figure A3.** A linearly-separable dataset (**left**), versus a non-linearly-separable dataset (**right**), adapted from [51].

## Appendix G. Details Behind Artificial Neural Networks (ANNs)

**Figure A4.** Visualization of the operation $\mathit{y}=A(\mathit{W}\mathit{X}+\mathit{b})$ in a single ANN node. The weighted sum of its inputs is added to a bias term; the final sum is transformed by a nonlinear activation function chosen by the user.

Activation | Abbreviation | Formula |
---|---|---|
Affine | $\mathrm{aff}(z)$ | $wz+b$ |
Step | $S(z)$ | $\begin{cases}0, & \text{if } z<0\\ 1, & \text{if } z\ge 0\end{cases}$ |
Sigmoid | $\mathrm{sig}(z)$ | $\frac{1}{1+{e}^{-z}}$ |
Hyperbolic Tangent | $\tanh(z)$ | $\frac{{e}^{z}-{e}^{-z}}{{e}^{z}+{e}^{-z}}$ |
Rectified Linear Unit | $\mathrm{ReLU}(z)$ | $\max(0,z)$ |
Leaky Rectified Linear Unit | $\mathrm{LReLU}(z)$ | $\max(\alpha z,z)$ |
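
The tabulated activations and the single-node operation can be sketched for scalar inputs as follows. This is a minimal illustration; the default slope `alpha=0.01` for the leaky ReLU and the example weights are assumptions, not values from the study.

```python
import math

def step(z):
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Matches max(alpha * z, z) in the table when 0 < alpha < 1.
    return max(alpha * z, z)

def node_output(weights, inputs, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

out_relu = node_output([1.0, -1.0], [2.0, 3.0], 0.5, relu)       # z = -0.5
out_tanh = node_output([1.0, -1.0], [2.0, 3.0], 0.5, math.tanh)  # same z
```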

## Appendix H. Details of Hierarchical Clustering

Type | $\mathit{S}({\mathit{x}}^{(i)},{\mathit{x}}^{(j)})$ |
---|---|
Euclidean Distance | ${\lVert {x}^{(i)}-{x}^{(j)}\rVert}_{2}$ |
Manhattan Distance | ${\lVert {x}^{(i)}-{x}^{(j)}\rVert}_{1}$ |
Cosine Similarity | $\frac{{x}^{(i)\top}{x}^{(j)}}{{\lVert {x}^{(i)}\rVert}_{2}\cdot {\lVert {x}^{(j)}\rVert}_{2}}$ |
Jaccard Similarity | $\frac{\mathbf{1}({x}^{(i)}=c\,\cap\,{x}^{(j)}=c)}{\mathbf{1}({x}^{(i)}=c\,\cup\,{x}^{(j)}=c)},\ c\in [1,\cdots ,C]$ |
Bray-Curtis Dissimilarity | $\frac{\sum \lvert {x}^{(i)}-{x}^{(j)}\rvert}{\sum \lvert {x}^{(i)}+{x}^{(j)}\rvert}$ |
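
Four of the tabulated measures can be computed directly from two abundance vectors. This is a minimal sketch using plain Python lists; the function names are illustrative, and the Jaccard measure is omitted because it operates on categorical labels.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def bray_curtis(a, b):
    """Sum of absolute differences divided by the sum of absolute sums."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(abs(x + y) for x, y in zip(a, b))

d_euc = euclidean([0.0, 0.0], [3.0, 4.0])
d_man = manhattan([0.0, 0.0], [3.0, 4.0])
s_cos = cosine_similarity([1.0, 0.0], [0.0, 1.0])
d_bc = bray_curtis([1.0, 2.0], [3.0, 0.0])
```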

- **Agglomerative (bottom-up):** Start with individual samples, then gradually merge them into clusters until one big cluster remains. This is the most common method.
- **Divisive (top-down):** Samples start as one big cluster, then gradually diverge into an increasing number of clusters, until one cluster is formed for each individual sample.

**Figure A6.** A dendrogram representation of hierarchical clustering. At the bottom, each individual sample belongs to its own cluster. Going up the dendrogram, samples are merged together based on the desired distance metric. At the top, all samples are merged into one giant cluster.

- **Single linkage (Nearest-neighbour)**: Also known as “nearest-neighbour” clustering. Initially, each sample is considered a centroid. The pair of samples with the smallest distance between them is merged together; subsequent clusters are merged according to the distances between their closest members. The linkage function is expressed as:
  $$D({C}_{p},{C}_{q})=\min_{{\mathit{x}}^{(i)}\in {C}_{p},\,{\mathit{x}}^{(j)}\in {C}_{q}} d({\mathit{x}}^{(i)},{\mathit{x}}^{(j)}).$$
- **Complete linkage (Farthest-neighbour)**: Also known as “farthest-neighbour” clustering. Identical to single linkage, except clusters are merged together according to distances between their farthest members. The linkage function is expressed as:
  $$D({C}_{p},{C}_{q})=\max_{{\mathit{x}}^{(i)}\in {C}_{p},\,{\mathit{x}}^{(j)}\in {C}_{q}} d({\mathit{x}}^{(i)},{\mathit{x}}^{(j)}).$$
- **Agglomerative averages**: Also known as “average” clustering. Identical to single linkage, except clusters are merged together according to average distances between their members. The linkage function is expressed as:
  $$D({C}_{p},{C}_{q})=\frac{1}{\left|{C}_{p}\right|\left|{C}_{q}\right|}\sum_{{\mathit{x}}^{(i)}\in {C}_{p}}\sum_{{\mathit{x}}^{(j)}\in {C}_{q}} d({\mathit{x}}^{(i)},{\mathit{x}}^{(j)}).$$
- **Ward’s method**: Also known as “minimum-variance” clustering. Instead of merging samples or clusters together based on distance, it starts by assigning “zero variance” to all clusters. Then, an Analysis of Variance (ANOVA) test is performed: two arbitrarily-selected clusters are merged together. The “increase in variance” is calculated as:
  $$\Delta ({C}_{p},{C}_{q})=\frac{\left|{C}_{p}\right|\cdot \left|{C}_{q}\right|}{\left|{C}_{p}\right|+\left|{C}_{q}\right|}{\lVert {\overline{C}}_{p}-{\overline{C}}_{q}\rVert}_{2}^{2}$$
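
The four linkage rules differ only in how they reduce the pairwise distances between two clusters. This is a minimal sketch with clusters stored as lists of coordinate tuples; the function names and the one-dimensional example points are hypothetical.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(Cp, Cq, d=euclid):
    return min(d(a, b) for a in Cp for b in Cq)

def complete_linkage(Cp, Cq, d=euclid):
    return max(d(a, b) for a in Cp for b in Cq)

def average_linkage(Cp, Cq, d=euclid):
    return sum(d(a, b) for a in Cp for b in Cq) / (len(Cp) * len(Cq))

def centroid(C):
    return tuple(sum(coords) / len(C) for coords in zip(*C))

def ward_increase(Cp, Cq):
    """Increase in within-cluster variance caused by merging Cp and Cq."""
    scale = len(Cp) * len(Cq) / (len(Cp) + len(Cq))
    return scale * euclid(centroid(Cp), centroid(Cq)) ** 2

Cp = [(0.0,), (1.0,)]   # hypothetical 1-D clusters
Cq = [(3.0,), (5.0,)]
results = (single_linkage(Cp, Cq), complete_linkage(Cp, Cq),
           average_linkage(Cp, Cq), ward_increase(Cp, Cq))
```

For these two clusters the rules give 2.0, 5.0, 3.5, and 12.25 respectively, which shows how strongly the choice of linkage can change which pair is merged next.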

- **Cophenetic correlations [53]:** Measures how well a specified clustering method preserves the original pairwise distances between samples. In other words, it measures how closely the inter-cluster distances at which pairs of points are joined match their actual distances. The formula is:
  $$\frac{\sum_{i\ne j}\left(d({x}^{(i)},{x}^{(j)})-\overline{d}\right)\left(cd({x}^{(i)},{x}^{(j)})-\overline{cd}\right)}{\sqrt{\left[\sum_{i\ne j}{\left(d({x}^{(i)},{x}^{(j)})-\overline{d}\right)}^{2}\right]\left[\sum_{i\ne j}{\left(cd({x}^{(i)},{x}^{(j)})-\overline{cd}\right)}^{2}\right]}}$$
  where $cd({x}^{(i)},{x}^{(j)})$ is the **cophenetic distance** between two points ${x}^{(i)}$ and ${x}^{(j)}$, defined as the distance from the base of the dendrogram to the first node joining ${x}^{(i)}$ and ${x}^{(j)}$.
- **Silhouette analysis [21]:** Identifies the optimal depth of a specified clustering method. Mathematically, it assesses how well each sample ${x}^{(i)}$ belongs to its assigned cluster ${C}_{p}$. Each individual silhouette number is evaluated as:
  $${s}^{(i)}=\frac{{\overline{x}}_{{C}_{q}}^{(i)}-{\overline{x}}_{{C}_{p}}^{(i)}}{\max\left({\overline{x}}_{{C}_{q}}^{(i)},{\overline{x}}_{{C}_{p}}^{(i)}\right)}$$
  where ${\overline{x}}_{{C}_{p}}^{(i)}$ is the mean distance from ${x}^{(i)}$ to the other members of its own cluster ${C}_{p}$, and ${\overline{x}}_{{C}_{q}}^{(i)}$ is the mean distance from ${x}^{(i)}$ to the members of the nearest neighbouring cluster ${C}_{q}$.

Using the **cophenetic** and **silhouette** analyses outlined above, the “most confident” clustering method (i.e., UPGMA vs. Ward vs. single-linkage vs. complete-linkage) and the optimal clustering depth, respectively, can both be selected.
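
Both diagnostics reduce to short formulas once the distances are available. The sketch below assumes the cophenetic correlation is the Pearson correlation between original and cophenetic pairwise distances, and that the silhouette value compares the mean distance to a sample's own cluster with the mean distance to the nearest other cluster; the input numbers are hypothetical.

```python
import math

def cophenetic_correlation(d, cd):
    """Pearson correlation between original (d) and cophenetic (cd) distances."""
    d_bar = sum(d) / len(d)
    cd_bar = sum(cd) / len(cd)
    num = sum((a - d_bar) * (b - cd_bar) for a, b in zip(d, cd))
    den = math.sqrt(sum((a - d_bar) ** 2 for a in d)
                    * sum((b - cd_bar) ** 2 for b in cd))
    return num / den

def silhouette(own, nearest):
    """own: mean distance to own cluster; nearest: to the nearest other cluster."""
    return (nearest - own) / max(own, nearest)

coph = cophenetic_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
s = silhouette(own=1.0, nearest=4.0)
```

A value of `coph` near 1 means the dendrogram preserves the original distances well, while `s` near 1 means the sample sits far closer to its own cluster than to any other.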

## Appendix I. Details Behind Probabilistic Mixtures

- **Gaussian Mixtures [47]:** $p(\mathit{x})=\sum_{k=1}^{K}{w}_{k}\mathcal{N}(\mathit{x}|{\mathit{\mu}}_{k},{\mathsf{\Sigma}}_{k})$; the underlying distribution is assumed to be a sum of K weighted multivariate Gaussians with individual means and covariances. The term ${w}_{k}$ represents the weighting factor for each Gaussian. Each Gaussian has the formula $\mathcal{N}(\mathit{x}|{\mathit{\mu}}_{k},{\mathsf{\Sigma}}_{k})=\frac{1}{\sqrt{{(2\pi)}^{{d}_{x}}\cdot \mathrm{det}({\mathsf{\Sigma}}_{k})}}\cdot \exp\left[-\frac{1}{2}{(\mathit{x}-{\mathit{\mu}}_{k})}^{\top}\cdot {\mathsf{\Sigma}}_{k}^{-1}\cdot (\mathit{x}-{\mathit{\mu}}_{k})\right]$.
- **Dirichlet Mixtures [24]:** Define ${\mathit{p}}^{(i)}$ as a vector containing the probabilities that sample ${\mathit{x}}^{(i)}$ belongs to each community species. The Dirichlet mixture prior over K distributions is $\mathit{P}({\mathit{p}}^{(i)})=\sum_{k=1}^{K}Dir({\mathit{p}}^{(i)}\,|\,{\alpha}_{k}){\pi}_{k}$, where ${\alpha}_{k}$ are the Dirichlet parameters and ${\pi}_{k}$ are the Dirichlet weights.
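
The Gaussian mixture density can be evaluated in a few lines. This is a minimal sketch restricted to the one-dimensional case (an assumption made to stay within the Python standard library); the multivariate form above replaces each standard deviation with a covariance matrix. The mixture parameters are hypothetical.

```python
from statistics import NormalDist

def gmm_pdf(x, weights, means, stdevs):
    """Density of a 1-D Gaussian mixture: sum of w_k * N(x | mu_k, sigma_k)."""
    return sum(w * NormalDist(m, s).pdf(x)
               for w, m, s in zip(weights, means, stdevs))

# Two equally weighted components centred at -2 and +2 (hypothetical values).
weights, means, stdevs = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
density_at_mode = gmm_pdf(2.0, weights, means, stdevs)
density_at_centre = gmm_pdf(0.0, weights, means, stdevs)
```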

## Appendix J. Details Behind the Dirichlet Mixture

- The likelihood of observing each sample ${\mathit{x}}^{(i)}$ is:$$\begin{array}{c}\hfill {L}^{(i)}({\mathit{x}}^{(i)}\,|\,{\mathit{p}}^{(i)})=\left[\sum_{j=1}^{{d}_{OTU}}{x}_{j}^{(i)}\right]!\prod_{j=1}^{{d}_{OTU}}\frac{{\left[{p}_{j}^{(i)}\right]}^{{x}_{j}^{(i)}}}{{x}_{j}^{(i)}!}.\end{array}$$
- The total likelihood across all samples is therefore:$$\begin{array}{c}\hfill L(\mathit{X}\phantom{\rule{0.166667em}{0ex}}|\phantom{\rule{0.166667em}{0ex}}{\mathit{p}}^{\left(1\right)},\cdots ,{\mathit{p}}^{\left(N\right)})={\displaystyle \prod _{i=1}^{N}}{L}^{\left(i\right)}({\mathit{x}}^{\left(i\right)}\phantom{\rule{0.166667em}{0ex}}\left|\phantom{\rule{0.166667em}{0ex}}{\mathit{p}}^{\left(i\right)}\right).\end{array}$$
- The Dirichlet distribution is modelled as:$$\begin{array}{c}\hfill Dir({\mathit{p}}^{\left(i\right)}\phantom{\rule{0.166667em}{0ex}}\left|\phantom{\rule{0.166667em}{0ex}}\theta \mathit{m}\right)=\Gamma \left(\theta \right){\displaystyle \prod _{j=1}^{{d}_{OTU}}}\frac{{\left[{p}_{j}^{\left(i\right)}\right]}^{\theta {\mathit{m}}_{j}-1}}{\Gamma \left(\theta {\mathit{m}}_{j}\right)}\delta \left({\displaystyle \sum _{j=1}^{{d}_{OTU}}}{p}_{j}^{\left(i\right)}-1\right)\end{array}$$
- The Dirichlet mixture prior over $\mathit{K}$ distributions is:$$\begin{array}{c}\hfill \mathit{P}({\mathit{p}}^{\left(i\right)}\phantom{\rule{0.166667em}{0ex}}\left|\phantom{\rule{0.166667em}{0ex}}Q\right)={\displaystyle \sum _{k=1}^{\mathit{K}}}Dir({\mathit{p}}^{\left(i\right)}\phantom{\rule{0.166667em}{0ex}}\left|\phantom{\rule{0.166667em}{0ex}}{\alpha}_{k}\right){\pi}_{k}\end{array}$$
- The Dirichlet mixture posterior over $\mathit{K}$ distributions is:$$\begin{array}{c}\hfill \mathit{P}({\mathit{p}}^{(i)}\,|\,{\mathit{x}}^{(i)},Q)=\frac{\sum_{k=1}^{\mathit{K}}{L}^{(i)}({\mathit{x}}^{(i)}\,|\,{\mathit{p}}^{(i)})\,Dir({\mathit{p}}^{(i)}\,|\,{\alpha}_{k})\,{\pi}_{k}}{\sum_{k=1}^{\mathit{K}}\mathit{P}({\mathit{x}}^{(i)}\,|\,{\alpha}_{k})\,{\pi}_{k}}.\end{array}$$
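
The per-sample likelihood and the Dirichlet density are easiest to evaluate on a log scale, where factorials and gamma functions become `lgamma` calls. This is a minimal sketch that assumes the multinomial form of the likelihood and strictly positive probabilities; the function names and the example counts are hypothetical.

```python
import math

def multinomial_log_likelihood(x, p):
    """log of (sum_j x_j)! * prod_j p_j**x_j / x_j! for counts x, probabilities p."""
    out = math.lgamma(sum(x) + 1)
    for xj, pj in zip(x, p):
        out += xj * math.log(pj) - math.lgamma(xj + 1)
    return out

def dirichlet_log_density(p, theta, m):
    """log Dir(p | theta * m) for p on the simplex, m a mean vector summing to 1."""
    out = math.lgamma(theta)
    for pj, mj in zip(p, m):
        out += (theta * mj - 1.0) * math.log(pj) - math.lgamma(theta * mj)
    return out

log_lik = multinomial_log_likelihood([1, 1], [0.5, 0.5])          # two OTUs, two reads
log_dir = dirichlet_log_density([0.3, 0.7], theta=2.0, m=[0.5, 0.5])
```

With `theta * m = [1, 1]` the Dirichlet is uniform on the simplex, so `log_dir` is zero; this makes a convenient sanity check.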

## Appendix K. Time Plots of Process Variables

#### Appendix K.1. Time-Plots from Reactor 1

#### Appendix K.2. Time-Plots from Reactor 2

## Appendix L. Feature Selection Results from C-MDA

## References

1. Breiman, L. Bagging predictors. Mach. Learn. **1996**, 24, 123–140.
2. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. **1995**, 20, 273–297.
3. Cutler, D.R.; Edwards, T.C., Jr.; Beard, K.H.; Cutler, A.; Hess, K.T.; Gibson, J.; Lawler, J.J. Random forests for classification in ecology. Ecology **2007**, 88, 2783–2792.
4. Campbell, W.M.; Campbell, J.P.; Reynolds, D.A.; Singer, E.; Torres-Carrasquillo, P.A. Support vector machines for speaker and language recognition. Comput. Speech Lang. **2006**, 20, 210–229.
5. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature **1986**, 323, 533–536.
6. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. **1989**, 2, 359–366.
7. Collobert, R.; Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 160–167.
8. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Neural Information Processing Systems 2012, Lake Tahoe, NV, USA, 1 June–3 September 2012; pp. 1097–1105.
9. Cheng, H.T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, Boston, MA, USA, 15 September 2016; pp. 7–10.
10. Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012; Volume 1.
11. Chen, W.; Zhang, C.K.; Cheng, Y.; Zhang, S.; Zhao, H. A comparison of methods for clustering 16S rRNA sequences into OTUs. PLoS ONE **2013**, 8, e70837.
12. Cernava, T.; Müller, H.; Aschenbrenner, I.A.; Grube, M.; Berg, G. Analyzing the antagonistic potential of the lichen microbiome against pathogens by bridging metagenomic with culture studies. Front. Microbiol. **2015**, 6, 620.
13. Legendre, P.; Legendre, L. Numerical Ecology, Volume 24, (Developments in Environmental Modelling); Elsevier: Amsterdam, The Netherlands, 1998.
14. Seborg, D.E.; Mellichamp, D.A.; Edgar, T.F.; Doyle, F.J., III. Process Dynamics and Control; John Wiley & Sons: Hoboken, NJ, USA, 2010.
15. CCME. Canadian Water Quality Guidelines for the Protection of Aquatic Life: NITRATE ION. Available online: http://ceqg-rcqe.ccme.ca/download/en/197 (accessed on 25 May 2019).
16. CCME. Soil Quality Guidelines: SELENIUM Environmental and Human Health Effects. Available online: https://www.ccme.ca/files/Resources/supporting_scientific_documents/soqg_se_scd_1438.pdf (accessed on 24 May 2019).
17. Lemly, A.D. Aquatic selenium pollution is a global environmental safety issue. Ecotoxicol. Environ. Saf. **2004**, 59, 44–56.
18. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Soc. Ser. C **1979**, 28, 100–108.
19. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD **1996**, 96, 226–231.
20. Reynolds, A.P.; Richards, G.; de la Iglesia, B.; Rayward-Smith, V.J. Clustering rules: A comparison of partitioning and hierarchical clustering algorithms. J. Math. Modell. Algorithms **2006**, 5, 475–504.
21. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. **1987**, 20, 53–65.
22. Rasmussen, C.E. The infinite Gaussian mixture model. In Proceedings of the Neural Information Processing Systems 1999, Denver, CO, USA, 29 November–4 December 1999; pp. 554–560.
23. La Rosa, P.S.; Brooks, J.P.; Deych, E.; Boone, E.L.; Edwards, D.J.; Wang, Q.; Sodergren, E.; Weinstock, G.; Shannon, W.D. Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLoS ONE **2012**, 7, e52078.
24. Holmes, I.; Harris, K.; Quince, C. Dirichlet multinomial mixtures: Generative models for microbial metagenomics. PLoS ONE **2012**, 7, e30126.
25. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B **1977**, 39, 1–22.
26. Matsuda, H.; Ogita, N.; Sasaki, A.; Satō, K. Statistical mechanics of population: The lattice Lotka-Volterra model. Prog. Theor. Phys. **1992**, 88, 1035–1049.
27. Yasuhiro, T. Global Dynamical Properties of Lotka-Volterra Systems; World Scientific: Singapore, 1996.
28. Faust, K.; Raes, J. Microbial interactions: From networks to models. Nat. Rev. Microbiol. **2012**, 10, 538–550.
29. Gonze, D.; Lahti, L.; Raes, J.; Faust, K. Multi-stability and the origin of microbial community types. ISME J. **2017**, 11, 2159–2166.
30. Morueta-Holme, N.; Blonder, B.; Sandel, B.; McGill, B.J.; Peet, R.K.; Ott, J.E.; Violle, C.; Enquist, B.J.; Jørgensen, P.M.; Svenning, J.C. A network approach for inferring species associations from co-occurrence data. Ecography **2016**, 39, 1139–1150.
31. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann Publishers: San Francisco, CA, USA, 2014.
32. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
33. Han, H.; Guo, X.; Yu, H. Variable selection using mean decrease accuracy and mean decrease gini based on random forest. In Proceedings of the 2016 7th IEEE International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 26–28 August 2016; pp. 219–224.
34. Saeys, Y.; Inza, I.; Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics **2007**, 23, 2507–2517.
35. Strobl, C.; Boulesteix, A.L.; Kneib, T.; Augustin, T.; Zeileis, A. Conditional variable importance for random forests. BMC Bioinform. **2008**, 9, 307.
36. Sanderson, S.C.; Ott, J.E.; McArthur, E.D.; Harper, K.T. RCLUS, a new program for clustering associated species: A demonstration using a Mojave Desert plant community dataset. West. N. Am. Nat. **2006**, 66, 285–297.
37. Morgan, M. Dirichlet Multinomial: Dirichlet-Multinomial Mixture Model Machine Learning for Microbiome Data; R package; R Foundation for Statistical Computing: Vienna, Austria, 2014.
38. Xu, Z.; Dai, X.; Chai, X. Effect of different carbon sources on denitrification performance, microbial community structure and denitrification genes. Sci. Total Environ. **2018**, 634, 195–204.
39. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
40. Wang, D.; Li, M. Stochastic configuration networks: Fundamentals and algorithms. IEEE Trans. Cybern. **2017**, 47, 3466–3479.
41. Han, H.G.; Zhang, L.; Qiao, J.F. Data-based predictive control for wastewater treatment process. IEEE Access **2017**, 6, 1498–1512.
42. Qiao, J.F.; Hou, Y.; Zhang, L.; Han, H.G. Adaptive fuzzy neural network control of wastewater treatment process with multiobjective operation. Neurocomputing **2018**, 275, 383–393.
43. Han, H.G.; Zhang, L.; Liu, H.X.; Qiao, J.F. Multiobjective design of fuzzy neural network controller for wastewater treatment process. Appl. Soft Comput. **2018**, 67, 467–478.
44. Runge, J.; Nowack, P.; Kretschmer, M.; Flaxman, S.; Sejdinovic, D. Detecting causal associations in large nonlinear time series datasets. arXiv **2017**, arXiv:1702.07007.
45. Izadi, I.; Shah, S.L.; Shook, D.S.; Chen, T. An introduction to alarm analysis and design. IFAC Proc. Vol. **2009**, 42, 645–650.
46. Wang, J.; Yang, F.; Chen, T.; Shah, S.L. An overview of industrial alarm systems: Main causes for alarm overloading, research status, and open problems. IEEE Trans. Autom. Sci. Eng. **2015**, 13, 1045–1061.
47. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Cambridge, UK, 2006.
48. Breiman, L. Classification and Regression Trees; Routledge: Abingdon, UK, 2017.
49. Strobl, C.; Boulesteix, A.L.; Zeileis, A.; Hothorn, T. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinform. **2007**, 8, 25.
50. Vapnik, V.N.; Vapnik, V. Statistical Learning Theory; Wiley: New York, USA, 1998; Volume 1.
51. Lemm, S.; Blankertz, B.; Dickhaus, T.; Müller, K. Introduction to machine learning for brain imaging. Neuroimage **2011**, 56, 387–399.
52. Kaufman, L.; Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis; Wiley: Hoboken, NJ, USA, 2009; Volume 344.
53. Sokal, R.R.; Rohlf, F.J. The comparison of dendrograms by objective methods. Taxon **1962**, 11, 33–40.

**Figure 1.** Machine learning (ML)-guided process control and decision-making. Manipulated variables (MVs) selected from the original process features may be high-dimensional and full of confounding effects. Instead, the small subset of MVs most responsible for causing observed process changes is identified using ML algorithms. The key MVs may change from one operating stage to another, but they can be re-identified given the corresponding new data.

**Figure 2.** A simple bioreactor schematic, with wastewater and biological nutrients as inlets, and treated effluent as outlet. The system contains directly-measurable macro variables related to water chemistry (such as contact time $\tau $), and difficult-to-measure micro variables reflecting the metabolism of micro-organisms.

**Figure 3.** Workflow of the pre-processing, dimensionality reduction, modeling, and feature selection steps. The final goal is to transform the input data into predicted outcomes, as well as key variables responsible for said outcomes.

**Figure 6.** Mean decrease in accuracy (MDA) applied on a dataset with six features. During each outer iteration, the values of a single feature are scrambled or permuted sample-wise. The model accuracy with the scrambled feature is compared against the base-case model accuracy. If the accuracy decreases significantly, then the feature is considered “important.” On the other hand, if the accuracy decreases negligibly, then the feature is “irrelevant” to the model.

**Figure 7.** In the conditional mean decrease in accuracy (C-MDA) approach, the permutation is only performed on the values of a feature given the presence of another feature falling within a range of values. By contrast, permutation in traditional MDA (as shown in Figure 6) is performed on all values of a feature, with no consideration of other features.
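
The conditional permutation step of C-MDA can be sketched as a shuffle restricted to bins of a second feature. This is a minimal illustration, not the study's implementation; the bin edges, the function name, and the toy data are hypothetical.

```python
import random

def conditional_permute(X, j, cond, edges, seed=0):
    """Shuffle column j only among rows whose column `cond` falls in the same bin."""
    rng = random.Random(seed)
    bin_of = lambda v: sum(v >= e for e in edges)  # index of the bin containing v
    groups = {}
    for i, row in enumerate(X):
        groups.setdefault(bin_of(row[cond]), []).append(i)
    X_perm = [list(row) for row in X]
    for rows in groups.values():
        values = [X[i][j] for i in rows]
        rng.shuffle(values)                         # permute within this bin only
        for i, v in zip(rows, values):
            X_perm[i][j] = v
    return X_perm

# Toy data (hypothetical): column 1 is the conditioning feature, split at 0.5.
X = [[1, 0.1], [2, 0.2], [3, 0.8], [4, 0.9]]
X_perm = conditional_permute(X, j=0, cond=1, edges=[0.5])
```

Values of column 0 never cross the 0.5 boundary of column 1, so any accuracy drop measured on `X_perm` reflects the feature's effect beyond what the conditioning feature already explains.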

**Figure 8.** UPGMA dendrogram (**top**) and heatmap (**bottom**) showing log-transformed OTU abundances. The rows of the heatmap represent individual samples, while the columns represent individual OTUs. Dark colours on the heatmap represent distances close to zero and hence similar OTUs, while light colours represent large distances and hence dissimilar OTUs.

**Figure 10.**Silhouette values as a function of distance cut-off in UPGMA clustering. The optimal cutoff value is the one corresponding to the maximum silhouette value.

**Figure 11.**Dendrogram of the UPGMA hierarchy with optimal distance cut-off, at a depth of $K=37$ groups. Each branch is labelled with the dominant OTU, and the number of its followers.

**Figure 14.**Akaike information criterion (AIC) values for Gaussian mixture models (GMMs) with cluster sizes $1<K<30$. The minimum occurs at $K=21$, which is selected as the desired number of groups.

**Figure 15.** Bayesian information criterion (BIC) values for GMMs with $1\le K<30$ clusters. The minimum occurs at $K=1$, indicating that a single group should be considered. This is an impractical result and is therefore discarded.
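The AIC and BIC sweeps in Figures 14 and 15 follow a common pattern: fit one GMM per candidate $K$ and keep the $K$ with the lowest criterion value. The sketch below uses scikit-learn's `GaussianMixture`, whose `aic` and `bic` methods return the two criteria. The synthetic three-blob data, the range of $K$, and the covariance type are assumptions for illustration, so the selected values will not match those reported for the bioreactor data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (assumption): three well-separated 2-D Gaussian blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [8, 0], [0, 8]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

ks = list(range(1, 8))   # candidate numbers of mixture components
aic, bic = [], []
for k in ks:
    gmm = GaussianMixture(
        n_components=k, covariance_type="full", n_init=3, random_state=0
    ).fit(X)
    aic.append(gmm.aic(X))   # lower is better
    bic.append(gmm.bic(X))   # lower is better, heavier complexity penalty

k_aic = ks[int(np.argmin(aic))]
k_bic = ks[int(np.argmin(bic))]
print("AIC-optimal K:", k_aic, "BIC-optimal K:", k_bic)
```

Because BIC penalizes each extra parameter more heavily than AIC for all but very small samples, it tends to favour fewer components, which is consistent with the two criteria disagreeing in Figures 14 and 15.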

**Figure 18.** Heatmap of the BIC-optimal DMM model, with respect to the 20 highest-weighting OTUs. Colours are coded according to log-transformed OTU abundances; a dark colour indicates high OTU abundance, and vice versa.

Variable | Description |
---|---|
$\tau$ or EBCT | Empty bed contact time $= \frac{\mathrm{volume}}{\mathrm{flowrate}}$ (min) |
$Ammonia_{out}$ | Concentration of $\mathrm{NH}_{3}$ in effluent (mg/L) |
$Nitrate_{in}$ | Concentration of $\mathrm{NO}_{3}^{-}$ in influent (mg/L) |
$Nitrite_{out}$ | Concentration of $\mathrm{NO}_{2}^{-}$ in effluent (mg/L) |
$Se_{D,in}$ | Concentration of total dissolved Se in influent (µg/L) |
$COD_{in}$ | Chemical oxygen demand in influent (mg/L) |
$MicroC$ | Equal to 1 if MicroC is added as carbon source, otherwise 0 |
$Acetate$ | Equal to 1 if acetate is added as carbon source, otherwise 0 |
Reactor 1 | Equal to 1 if Reactor 1 is the relevant bioreactor, otherwise 0 |
Reactor 2 | Equal to 1 if Reactor 2 is the relevant bioreactor, otherwise 0 |

**Table 2.** Cophenetic (coph.) correlations for each linkage method. UPGMA: unweighted pair-group method with arithmetic means.

Method | Coph. Correlation |
---|---|
UPGMA | 0.51 |
Ward | 0.41 |
Single-linkage | 0.08 |
Complete-linkage | 0.22 |
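A comparison like Table 2 can be computed with SciPy's `cophenet`, which returns the correlation between the original pairwise distances and the distances implied by the dendrogram. The following is a sketch on random synthetic data with a Euclidean metric (both assumptions), so the printed values are unrelated to those in Table 2.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Synthetic data (assumption): 40 samples with 5 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
D = pdist(X, metric="euclidean")   # condensed pairwise distances

# Map the table's method names to SciPy linkage names.
methods = {
    "UPGMA": "average",
    "Ward": "ward",
    "Single-linkage": "single",
    "Complete-linkage": "complete",
}

coph = {}
for name, method in methods.items():
    Z = linkage(D, method=method)
    c, _ = cophenet(Z, D)          # c is the cophenetic correlation
    coph[name] = float(c)

for name, c in coph.items():
    print(f"{name}: {c:.2f}")
```

A value closer to 1 means the hierarchy preserves the original pairwise distances more faithfully, which is the basis for preferring one linkage method over another.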

Group Number | Dominant OTU | Group Size |
---|---|---|
1 | OTU200 | 27 |
2 | OTU46 | 13 |
3 | OTU11 | 4 |
4 | OTU112 | 8 |
5 | OTU3313 | 37 |
6 | OTU470 | 22 |
7 | OTU6 | 4 |
8 | OTU2756 | 10 |
9 | OTU157 | 5 |
10 | OTU48 | 6 |
11 | OTU3057 | 20 |
12 | OTU185 | 66 |
13 | OTU559 | 25 |
14 | OTU778 | 26 |
15 | OTU8968 | 7 |
16 | OTU14 | 1 |
17 | OTU105 | 15 |
18 | OTU8 | 1 |
19 | OTU77 | 3 |
20 | OTU93 | 4 |
21 | OTU1 | 1 |

**Table 4.** Prediction results for each model type: random forests (RFs), support vector machines (SVMs), and artificial neural networks (ANNs).

Model | Base Case | Hierarchical | Gaussian | Dirichlet |
---|---|---|---|---|
RF | 96.3 | 90.6 | 93.4 | 92.2 |
SVM | 91.8 | 87.2 | 91.3 | 90.6 |
ANN | 81.7 | 78.6 | 83.4 | 80.2 |

Feature | MDA (%) |
---|---|
$Se_{D,in}$ | 6.3 |
$Ammonia_{out}$ | 0.3 |
$EBCT$ | 0.2 |
$Nitrite_{out}$ | 0.2 |
OTU215 | 1.5 |
OTU2637 | 0.6 |
OTU1579 | 0.6 |
OTU49 | 0.6 |
OTU3945 | 0.5 |

Feature | MDA (%) |
---|---|
$Se_{D,in}$ | 7.1 |
$EBCT$ | 1.2 |
$Ammonia_{out}$ | 0.7 |
$Nitrite_{out}$ | 0.7 |
OTU57 | 1.5 |
OTU7347 | 1.1 |
OTU2765 | 0.9 |
OTU48 | 0.9 |
OTU7 | 0.8 |

Feature | MDA (%) |
---|---|
$Se_{D,in}$ | 5.3 |
$EBCT$ | 1.9 |
$Nitrite_{out}$ | 1.1 |
$COD_{in}$ | 0.7 |
OTU35 | 1.4 |
OTU8 | 1.0 |
OTU7 | 1.0 |
OTU1 | 0.6 |
OTU9 | 0.5 |

Rank | Feature |
---|---|
1 | $Se_{D,in}$ |
2 | $EBCT$ |
3 | $Nitrite_{out}$ |
4 | $COD_{in}$ |
5 | $Nitrate_{in}$ |
6 | $Ammonia_{out}$ |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Tsai, Y.; Baldwin, S.A.; Siang, L.C.; Gopaluni, B. A Comparison of Clustering and Prediction Methods for Identifying Key Chemical–Biological Features Affecting Bioreactor Performance. *Processes* **2019**, *7*, 614.
https://doi.org/10.3390/pr7090614
