This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

As a key regulatory mechanism of gene expression, DNA methylation patterns are widely altered in many complex genetic diseases, including cancer. DNA methylation is naturally quantified by bounded support data; therefore, it is non-Gaussian distributed. In order to capture such properties, we introduce several non-Gaussian statistical models to perform dimension reduction on DNA methylation data. Afterwards, non-Gaussian statistical model-based unsupervised clustering strategies are applied to cluster the data. Comparisons and analysis of different dimension reduction strategies and unsupervised clustering methods are presented. Experimental results show that the non-Gaussian statistical model-based methods are superior to the conventional Gaussian distribution-based method and are meaningful tools for DNA methylation analysis. Moreover, among the non-Gaussian methods, the one that captures the bounded nature of DNA methylation data achieves the best clustering performance.

DNA methylation is a covalent modification of DNA, which can regulate the expression of genes [

Relatively little is known about the taxonomy of cancers at the DNA methylation level. Hence, there is currently a strong interest in performing unsupervised clustering of large-scale DNA methylation data sets in order to identify novel cancer subtypes. DNA methylation data is quantified naturally in terms of a beta-distribution. The DNA methylation beta-value, _{2}

However, there remains a significant shortage of methods, especially for the dimension reduction of large DNA methylation data sets. For instance, blind source separation (BSS) [

Gaussian distribution is the ubiquitous probability distribution used in statistics [

In this paper, we introduce and compare several machine learning methods, based on non-Gaussian statistical models, for DNA methylation data analysis. The analysis consists of two parts. (1) Dimension reduction: DNA methylation array data are high-dimensional, typically involving on the order of 25 k up to 500 k dimensions (and even higher), whereas, as with other omics data, the number of samples is typically on the order of 100. However, most of the salient variability in the data, e.g., variation distinguishing cancer from normal samples, or distinguishing different cancer phenotypes, is typically captured by a much lower-dimensional space. Hence, some form of dimension reduction is needed; (2) Unsupervised clustering: Cancers especially are known to be highly heterogeneous [

The DNA methylation data is obtained from Gene Expression Omnibus (GEO) website [

We considered a DNA methylation data matrix over 5000 dimensions (specifically, CpG dinucleotides). The 5000 CpGs were selected as those with the highest variance across the 136 samples. A data matrix
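The variance-based selection of CpGs described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the function name `top_variance_rows` and the toy matrix are hypothetical, and in practice the input would be the full CpG-by-sample matrix of beta-values with the number of kept rows set to 5000.

```python
from statistics import pvariance


def top_variance_rows(matrix, k):
    """Return the (sorted) indices of the k rows with the largest variance.

    Rows are CpGs, columns are samples; variance is taken across samples.
    """
    variances = [pvariance(row) for row in matrix]
    order = sorted(range(len(matrix)), key=lambda i: variances[i], reverse=True)
    return sorted(order[:k])


# Hypothetical toy beta-value matrix: 4 CpGs (rows) x 5 samples (columns).
toy = [
    [0.10, 0.12, 0.11, 0.09, 0.10],  # nearly constant across samples
    [0.05, 0.90, 0.10, 0.85, 0.95],  # highly variable
    [0.50, 0.52, 0.48, 0.51, 0.49],  # nearly constant across samples
    [0.20, 0.80, 0.25, 0.75, 0.30],  # variable
]

selected = top_variance_rows(toy, 2)
reduced = [toy[i] for i in selected]  # keeps only the most variable CpGs
```

Keeping the row indices, rather than only the reduced matrix, makes it possible to map the selected features back to their CpG identifiers later.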

In each dimension reduction method, we need to specify the number of dimensions for which to search. The random matrix theory (RMT) [

Comparisons of DNA methylation data before and after dimension reduction. (

The level of DNA methylation can discriminate normal and cancer samples [

When implementing dimension reduction with Gaussian assumptions, PCA [

Illustration of the clustering result via PCA (principal component analysis) + the variational Bayesian Gaussian mixture model (VBGMM). The clusters are color coded. The normal data are marked with dots, and the cancer data are marked with crosses. Misclustered samples are drawn at a larger size.

We applied the BG-NMF method (presented in earlier work and used here as the benchmark for non-Gaussian methods) to the above-mentioned 5000 × 136 matrix, with the number of basis vectors set to 14.

Setting the number of basis vectors equal to 14 and applying BG-NMF to the transposed (136 × 5000) data matrix resulted in a 136 × 14 pseudo-basis matrix and a 14 × 5000 excitation matrix. The hypothesis is that the dimensionally-reduced pseudo-basis matrix, whose elements remain bounded and are assumed to be beta distributed, captures the salient patterns of variation.

The benchmarked RPBMM algorithm was applied to estimate the final clusters of the reduced 136 × 14 matrix, which is illustrated in

Illustration of the clustering result via beta-gamma (BG)-nonnegative matrix factorization (NMF) + VBBMM (variational Bayesian estimation framework for BMM). The clusters are color coded. The normal data are marked with dots, and the cancer data are marked with crosses. Misclustered samples are drawn at a larger size.

Recursive partitioning beta mixture model (RPBMM) clustering details for all of the 136 samples over 14 BG-NMF pseudo-basis vectors.

Cluster | Normal | Cancer |
---|---|---|
rLLLL | 16 | 4 |
rLLLR | 6 | 0 |
rLLR | 1 | 32 |
rLR | 0 | 31 |
rRLLL | 0 | 4 |
rRLLR | 0 | 3 |
rRLRL | 0 | 10 |
rRLRR | 0 | 5 |
rRR | 0 | 34 |

One disadvantage of the RPBMM method is that it estimates the number of mixture components in a recursive manner, employing the wtdBIC to decide whether or not to split one mixture component further into two. The VBBMM, in contrast, can potentially estimate the model complexity automatically. After convergence, each mixture component corresponds to one cluster. Hence, we applied the VBBMM method to the 136 × 14 pseudo-basis matrix to cluster the samples. The initial number of mixture components was set to 15. Eventually, the VBBMM method estimated nine clusters (after removing components whose mixture weights are smaller than 0.01), which is the same number as that inferred by RPBMM. Five samples are misclustered: four cancer samples are classified as normal, and only one normal sample is classified as cancer. The overall computational time in this scenario is about 124 seconds, which is faster than BG-NMF + RPBMM. This is the main advantage of applying the BG-NMF + VBBMM method. The clustering results are illustrated in
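The pruning step mentioned above (removing components whose mixture weights are smaller than 0.01) can be illustrated with a short sketch. The function name `prune_components` and the example weights are hypothetical; renormalizing the surviving weights is an added convenience that is assumed here, not something described in the text.

```python
def prune_components(weights, threshold=0.01):
    """Drop mixture components whose weight is below `threshold`.

    Returns the indices of the kept components and their weights,
    renormalized to sum to one (the renormalization is an assumption).
    """
    kept = [k for k, w in enumerate(weights) if w >= threshold]
    total = sum(weights[k] for k in kept)
    return kept, [weights[k] / total for k in kept]


# Hypothetical converged weights for 15 initial components: nine carry
# almost all of the probability mass, six have collapsed towards zero.
weights = [0.11] * 9 + [0.01 / 6] * 6

kept, renormalized = prune_components(weights)
n_clusters = len(kept)  # each surviving component is read as one cluster
```

Starting from more components than expected and letting the small ones vanish is what allows the number of clusters to be estimated rather than fixed in advance.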

An alternative approach to dimension reduction is to apply the SC method. As the RMT estimated _{ij}_{i}_{j}^{2}. This choice is due to the fact that the affinity matrix ^{2}. The choice of the optimal _{2} norm of each feature equals one. To model this property efficiently, the VBvMM is used for the unsupervised clustering. After convergence, VBvMM determined 14 clusters. Seven cancer samples are clustered as normal ones, while no normal samples are clustered as cancer ones. The overall number of misclustered samples is seven.

Assuming that what we observe from the SC features is only one of each mirrored pair (with respect to the origin), the VBWMM can be applied to model the axially symmetric data. Following the same approach as for SC + VBvMM, six cancer samples are clustered as normal and no normal sample is clustered as cancer, giving six misclustered samples overall. The VBWMM inferred seven clusters in the end. The clustering results are shown in

Illustration of the clustering result via spectral clustering (SC) + variational inference framework-based Bayesian analysis of the vMF mixture model (VBvMM) and SC + VBWMM (Watson mixture model (WMM)). The clusters are color coded. The normal data are marked with dots, and the cancer data are marked with crosses. Misclustered samples are drawn at a larger size.

The comparisons of the above-mentioned four methods are listed in

Comparisons of the clustering performance of different methods.

Method | Error Rate | Cancer→Normal | Normal→Cancer |
---|---|---|---|
PCA + VBGMM | 6.62% | 9 | 0 |
BG-NMF + RPBMM | 3.68% | 4 | 1 |
BG-NMF + VBBMM | 3.68% | 4 | 1 |
SC + VBvMM | 5.15% | 7 | 0 |
SC + VBWMM | 4.41% | 6 | 0 |
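The error rates in the table follow directly from the misclustering counts and the 136 samples. The following sketch recomputes them; the function name `error_rate` is hypothetical, and the count pairs are simply the table rows in order.

```python
TOTAL_SAMPLES = 136  # number of samples stated in the text


def error_rate(cancer_to_normal, normal_to_cancer, total=TOTAL_SAMPLES):
    """Percentage of samples assigned to a cluster of the wrong status."""
    return 100.0 * (cancer_to_normal + normal_to_cancer) / total


# (Cancer→Normal, Normal→Cancer) counts, in the row order of the table.
rows = [(9, 0), (4, 1), (4, 1), (7, 0), (6, 0)]
rates = [f"{error_rate(c2n, n2c):.2f}%" for c2n, n2c in rows]
```

Note that the error rate alone hides the direction of the mistakes, which is why the table also reports the two counts separately.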

When looking at the misclustered samples, the BG-NMF-related clustering methods each miscluster four cancer samples as normal and one normal sample as cancer, while the SC-related methods miscluster six or seven cancer samples as normal, but no normal sample as cancer. Misclustering happens because the data are high-dimensional and highly correlated. Although we have reduced the dimensions to remove redundant features, it is still difficult to separate one type of data from the other. We speculate that the SC-related methods avoid misclustering normal samples as cancer because the SC method embeds the data in a tight manner, so that a relatively “clearer” positive/negative boundary can be obtained than with the BG-NMF method. In short, the BG-NMF-related methods have better overall clustering performance than the SC-related methods, but miscluster data in both directions, whereas the SC-related methods do not cluster any normal data as cancer, but have relatively worse overall accuracy. These observations motivate us to improve the unsupervised clustering method so that better clustering results can be obtained.

In summary, for DNA methylation analysis, the bounded nature of the data plays an important role. Thus, such a property should be retained in both the dimension reduction and clustering methods. Furthermore, an appropriate unsupervised learning method is required for revealing the heterogeneity more accurately.

The Gaussian distribution (both univariate and multivariate) has a symmetrical “bell” shape, and the variable’s definition is on the interval (−∞, ∞). Non-Gaussian statistical distributions refer to a set of distributions that have special properties that the Gaussian distribution cannot characterize. For example, the beta distribution is defined on the interval [0, 1] (in a general form, the beta distribution could have definition on any interval [a,b]; after linear scaling, it can be represented with the standard beta distribution [_{2} norm equals one, the von Mises–Fisher (vMF) distribution [

In the remaining part of this section, we will introduce some typical non-Gaussian distributions that can be applied in DNA methylation analysis.

The beta distribution is characterized by two positive shape parameters

where Γ(·) is the gamma function. The beta distribution has a flexible shape, which is shown in

Beta distributions for different pairs of parameters. (
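As a small numerical companion to the definition above, the standard beta density, Γ(u+v)/(Γ(u)Γ(v)) · x^(u−1) · (1−x)^(v−1) on (0, 1), can be evaluated as sketched below. The parameter names `u` and `v` are assumed here for the two positive shape parameters, and the computation is done in the log domain only as a precaution against overflow.

```python
import math


def beta_pdf(x, u, v):
    """Density of the standard beta distribution at x in (0, 1).

    u and v are the two positive shape parameters (names assumed here).
    """
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    if u <= 0.0 or v <= 0.0:
        raise ValueError("shape parameters must be positive")
    # log of the normalizing constant Gamma(u+v) / (Gamma(u) Gamma(v))
    log_norm = math.lgamma(u + v) - math.lgamma(u) - math.lgamma(v)
    return math.exp(log_norm + (u - 1.0) * math.log(x) + (v - 1.0) * math.log(1.0 - x))


# Different shape pairs give very different shapes on the same support.
flat = beta_pdf(0.5, 1.0, 1.0)      # u = v = 1 is the uniform density
peaked = beta_pdf(0.5, 2.0, 2.0)    # symmetric and unimodal
u_shaped = beta_pdf(0.5, 0.5, 0.5)  # mass pushed towards 0 and 1
```

The U-shaped case is the one a Gaussian cannot imitate, and it is typical of methylation values that concentrate near fully unmethylated or fully methylated.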

The vMF distribution is considered a popular distribution in the family of directional distributions [_{2} norm equals one, _{2} = 1. The vMF distribution contains two parameters, namely the mean direction

where ||_{2} = 1, _{K}

where ℐ_{ν}

Scatter plot of samples from a single von Mises–Fisher (vMF) distribution on the sphere for different concentration parameters, ^{T}
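To make the role of the concentration parameter concrete, the sketch below evaluates the vMF density in the three-dimensional special case only, where the Bessel-function normalizer reduces to the closed form κ / (4π sinh κ). The names `mu` and `kappa` are assumed for the mean direction and the concentration parameter, inputs are assumed to be unit vectors, and the general p-dimensional case (which needs the modified Bessel function) is deliberately not covered.

```python
import math


def vmf_pdf_3d(x, mu, kappa):
    """vMF density on the unit sphere in 3-D (special case p = 3).

    x and mu are assumed to be unit vectors; kappa >= 0 is the
    concentration. Uses kappa / (4 pi sinh(kappa)) * exp(kappa * mu.x),
    rewritten to avoid overflow for large kappa.
    """
    if kappa < 0.0:
        raise ValueError("kappa must be non-negative")
    if kappa == 0.0:
        return 1.0 / (4.0 * math.pi)  # uniform density on the sphere
    dot = sum(a * b for a, b in zip(x, mu))
    norm = kappa / (2.0 * math.pi * (1.0 - math.exp(-2.0 * kappa)))
    return norm * math.exp(kappa * (dot - 1.0))


mean_direction = (0.0, 0.0, 1.0)
at_mode = vmf_pdf_3d((0.0, 0.0, 1.0), mean_direction, 2.0)
at_antipode = vmf_pdf_3d((0.0, 0.0, -1.0), mean_direction, 2.0)
```

Larger values of `kappa` concentrate the mass around the mean direction, while `kappa = 0` recovers the uniform distribution on the sphere.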

Observations on the sphere might have an additional structure, such that the unit vectors ^{p}^{−1}, which is obtained by identifying opposite points on the sphere

One of the simplest distributions for axial data, with a rotational symmetry property, is the (Dimroth–Scheidegger–) Watson distribution. The Watson distribution is a special case of the Bingham distribution [

A random vector ^{p}^{−1}, or equivalently ±_{p}

where _{2} = 1, and _{1}_{1} is Kummer’s (confluent hypergeometric) function (e.g., [

where
_{p}_{p}

Scatter plot of samples from a single distribution,
_{p}^{T} [
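The axial symmetry of the Watson distribution can be checked numerically with the sketch below. It assumes the standard form of the density, Γ(p/2) / (2π^(p/2) · M(1/2, p/2, κ)) · exp(κ (μᵀx)²), where M is Kummer's function; the names `mu` and `kappa` are assumed, and the plain power series used for M is only intended for moderate values of κ.

```python
import math


def kummer_m(a, b, z, tol=1e-15, max_terms=10000):
    """Kummer's confluent hypergeometric function M(a, b, z) by its
    power series; adequate for moderate |z| only."""
    term = 1.0
    total = 1.0
    for n in range(max_terms):
        term *= (a + n) / (b + n) * z / (n + 1)
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total


def watson_pdf(x, mu, kappa):
    """Watson density on the unit hypersphere; x and mu are assumed to
    be unit vectors of the same dimension p."""
    p = len(x)
    dot = sum(a * b for a, b in zip(x, mu))
    norm = math.gamma(p / 2.0) / (
        2.0 * math.pi ** (p / 2.0) * kummer_m(0.5, p / 2.0, kappa)
    )
    return norm * math.exp(kappa * dot * dot)


axis = (0.0, 0.0, 1.0)
point = (0.6, 0.0, 0.8)
mirrored = tuple(-c for c in point)
same_density = watson_pdf(point, axis, 3.0) == watson_pdf(mirrored, axis, 3.0)
```

Because the density depends on the data only through the squared inner product, a point and its mirror image through the origin always receive the same density, which is exactly the axial property described above.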

When analyzing DNA methylation data, the high-dimensional property presents mathematical challenges, as well as opportunities. The main purpose of applying dimension reduction methods on microarray data is to extract the core features driving interesting biological variability [

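Unlike PCA or ICA, NMF preserves the nonnegativity of the data during dimension reduction. Traditional NMF decomposes the data matrix into a product of two nonnegative matrices: a P × T data matrix is approximated by the product of a P × K matrix and a K × T matrix, and the elements of all three matrices (indexed by pt, pk and kt, respectively) are nonnegative.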
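For orientation, the traditional NMF decomposition can be sketched with the classical multiplicative updates of Lee and Seung for the squared-error objective. This is a generic textbook algorithm written in plain Python, not the BG-NMF method of this paper; the names `x`, `w` and `v` for the P × T data matrix and its P × K and K × T factors are assumptions, as is the toy data.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, v):
    """Squared Frobenius norm of x - w v."""
    approx = matmul(w, v)
    return sum((x[i][j] - approx[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))


def nmf(x, k, iterations=200, seed=0, eps=1e-12):
    """Factorize nonnegative x (P x T) into w (P x K) and v (K x T)."""
    rng = random.Random(seed)
    p, t = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(p)]
    v = [[rng.random() + 0.1 for _ in range(t)] for _ in range(k)]
    for _ in range(iterations):
        # v <- v * (w^T x) / (w^T w v), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), v)
        v = [[v[i][j] * num[i][j] / (den[i][j] + eps) for j in range(t)]
             for i in range(k)]
        # w <- w * (x v^T) / (w v v^T), element-wise
        vt = transpose(v)
        num, den = matmul(x, vt), matmul(w, matmul(v, vt))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(p)]
    return w, v


# Hypothetical toy data in [0, 1] with two obvious row patterns.
x = [[0.1, 0.8, 0.2, 0.9, 0.7],
     [0.2, 0.7, 0.1, 0.8, 0.9],
     [0.9, 0.1, 0.8, 0.2, 0.1],
     [0.8, 0.2, 0.9, 0.1, 0.2]]
w, v = nmf(x, k=2)
```

Because every update multiplies a nonnegative entry by a nonnegative ratio, the factors stay nonnegative throughout; nothing in these updates, however, keeps the reconstruction inside [0, 1], which is the gap a model for bounded support data is meant to close.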

The DNA methylation data are naturally bounded on interval [0, 1]. Conventional NMF strategies do not take such a nature into account. In order to capture such a bounded feature explicitly, we proposed an NMF for bounded support data [_{pt}_{pt}_{pt}_{P×T}

With the above description, we assume that the matrix _{pt}

where Gamma(

As the data is assumed to be beta distributed and the parameters of the beta distribution are assumed to be gamma distributed, this model is named BG-NMF.

For BG-NMF, the variational inference (VI) method [_{pt}_{pk}_{pk}_{kt}_{pt}

which can be expressed in matrix form as:

where ⊘ means element-wise division. When placing sparsity constraints on the columns in

Hence, the resulting pseudo-basis matrix

Recently, spectral clustering (SC) has become one of the most popular clustering algorithms [^{L}^{K}^{K}

Spectral clustering.

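Input: data points indexed 1, 2, …, N.

1. Create the affinity matrix.
2. Construct the intermediate matrix.
3. Apply eigenvalue analysis on the intermediate matrix.
4. Form a matrix from the resulting eigenvectors and normalize each row n to unit length.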
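The steps listed above can be sketched in plain Python as follows. Several details are assumptions rather than the paper's exact choices: a Gaussian kernel with a zero diagonal is used for the affinity matrix, the intermediate matrix is taken to be the symmetrically normalized affinity matrix (as in the algorithm of Ng, Jordan and Weiss), and a simple orthogonal iteration stands in for a proper eigenvalue solver. All function names are hypothetical.

```python
import math
import random


def gaussian_affinity(points, sigma):
    """A_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for i != j, A_ii = 0."""
    n = len(points)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((u - v) ** 2 for u, v in zip(points[i], points[j]))
            a[i][j] = a[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return a


def normalized_affinity(a):
    """D^{-1/2} A D^{-1/2}, with D the diagonal matrix of row sums."""
    inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a]
    n = len(a)
    return [[inv_sqrt[i] * a[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]


def top_eigenvectors(sym, k, iterations=300, seed=0):
    """Approximate the k leading eigenvectors of a symmetric matrix with
    eigenvalues in [-1, 1] by orthogonal iteration on sym + I."""
    rng = random.Random(seed)
    n = len(sym)
    basis = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # Multiply every basis vector by (sym + I).
        basis = [[sum(sym[i][j] * vec[j] for j in range(n)) + vec[i]
                  for i in range(n)] for vec in basis]
        # Gram-Schmidt orthonormalization.
        for c in range(k):
            for prev in range(c):
                proj = sum(p * q for p, q in zip(basis[c], basis[prev]))
                basis[c] = [p - proj * q for p, q in zip(basis[c], basis[prev])]
            norm = math.sqrt(sum(p * p for p in basis[c]))
            basis[c] = [p / norm for p in basis[c]]
    return basis


def spectral_embedding(points, k, sigma=1.0):
    """Map each data point to a unit-length k-dimensional feature."""
    vectors = top_eigenvectors(normalized_affinity(gaussian_affinity(points, sigma)), k)
    rows = [[vectors[c][n] for c in range(k)] for n in range(len(points))]
    return [[e / math.sqrt(sum(r * r for r in row)) for e in row] for row in rows]


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
embedding = spectral_embedding(points, k=2, sigma=1.0)
```

Since every embedded feature has unit length, the output lives on a hypersphere, which is why directional mixture models are a natural choice for the subsequent clustering step.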

The ultimate goal of dimension reduction is to benefit the clustering of DNA methylation data. The dimension reduction methods introduced above yield features with special properties. The BG-NMF method provides a basis matrix (see

For the beta-valued DNA methylation data, it is natural to consider the beta distribution as a candidate to model the underlying distribution. Since the DNA methylation data that show an obvious normal/cancer status are multi-modal, a beta mixture model (BMM) can be applied for modeling. One mixture component represents one cluster. In unsupervised clustering, selecting the optimal number of clusters is a big challenge. One popular method designed for such purpose is the recursive partitioning beta mixture model (RPBMM) [

Another way of carrying out model selection is to employ the variational Bayesian estimation framework for BMM (VBBMM). Under this circumstance, the joint posterior distribution of the weighting factors is modeled by a sparse Dirichlet distribution, so that the component with a very small weight will be pruned from the mixture model. In [

Data whose ℓ_{2} norm equals one have a directional property. The von Mises–Fisher (vMF) distribution is suitable for this type of data [

The Watson distribution is a simple distribution for modeling axially symmetric data on the unit hypersphere ([_{2} norm equals one) and its axial mirror by the Watson distribution. Similarly, when such data are multi-modally distributed, a Watson mixture model (WMM) can be applied. With a variational inference framework, Taghia

Cancer is characterized by alterations at the DNA methylation level. A Gaussian distribution, in general, cannot describe the DNA methylation data appropriately. Hence, the Gaussian distribution-based unsupervised clustering does not provide convincing performance.

To cluster DNA methylation data efficiently, we proposed several dimension reduction methods and subsequent unsupervised learning methods, all based on non-Gaussian distributions. They all perform better than the Gaussian distribution-based method. In the dimension reduction step, both the BG-NMF and the SC methods can remove the redundant dimensions efficiently. In unsupervised clustering, the VBBMM, VBvMM and VBWMM methods can all reveal the heterogeneity of the DNA methylation data appropriately. The clustering performance demonstrates that the proposed non-Gaussian distribution-based methods are meaningful tools for analyzing DNA methylation data. Experimental results also show that the BG-NMF + VBBMM method performs the best among all of the proposed methods and is faster than the benchmarked BG-NMF + RPBMM method. Furthermore, for the reduced features inferred from both the BG-NMF method and the SC method, the subsequent unsupervised clustering method needs to be improved, so that better clustering accuracy can be obtained.

Moreover, the methodology introduced in this paper can be easily extended to analyze other DNA methylation data sets. Some other non-Gaussian statistical models can also be applied for such purposes.

The authors would like to thank the editor for organizing the review process and thank the anonymous reviewers for their efforts in reviewing this manuscript and providing fruitful suggestions.

This work is partly supported by the “Fundamental Research Funds for the Central Universities” No. 2013XZ11, NSFC Grant No. 61273217, Chinese 111 program of Advanced Intelligence and Network Service under Grant No. B08004 and EU FP7 IRSESMobileCloud Project (Grant No. 612212).

Z.M. provided the non-Gaussian statistical models and the BG-NMF code, carried out the VBBMM experiment, analyzed the results and wrote the manuscript. A.E.T. provided the DNA methylation data, conducted the RPBMM experiment and helped in revising the manuscript and analyzing the results. H.Y. provided the data visualization figures. J.T. implemented the VBvMM and VBWMM experiments. J.G. revised the manuscript.

The authors declare no conflict of interest.
