#### 3.1. Non-Gaussian Statistical Distributions

The Gaussian distribution (both univariate and multivariate) has a symmetric “bell” shape, and the variable is defined on the interval (−∞, ∞). Non-Gaussian statistical distributions refer to a set of distributions with special properties that the Gaussian distribution cannot characterize. For example, the beta distribution is defined on the interval [0, 1] (in its general form, the beta distribution can be defined on any interval [a, b]; after linear scaling, it can be represented by the standard beta distribution [44]) and can have a symmetric or an asymmetric shape [27]. The Dirichlet distribution, which is a multivariate generalization of the beta distribution, has a pdf with respect to the Lebesgue measure on Euclidean space [45]. The gamma distribution is defined on the interval (0, ∞), and its shape cannot be symmetric [46]. To model data whose l_{2} norm equals one, the von Mises–Fisher (vMF) distribution [47] and the Watson distribution [48] are usually applied. These distributions show characteristics that are significantly different from those of a Gaussian distribution.

In the remainder of this section, we introduce some typical non-Gaussian distributions that can be applied in DNA methylation analysis.

#### 3.1.1. Beta Distribution

The beta distribution is characterized by two positive shape parameters, u and v. The pdf of the beta distribution is:

$$\text{Beta}(x; u, v) = \frac{\Gamma(u + v)}{\Gamma(u)\,\Gamma(v)}\, x^{u-1} (1 - x)^{v-1}, \quad x \in [0, 1],$$

where Γ(·) is the gamma function. The beta distribution has a flexible shape, as shown in Figure 5. In real applications, the beta distribution can be applied to model the distribution of gray-image pixels [49], to describe the probability of human immunodeficiency virus (HIV) transmission [50], and to capture the bounded property of the DNA methylation level [8,15].

**Figure 5.**
Beta distributions for different pairs of parameters. (**a**) u = 5, v = 5; (**b**) u = 2, v = 5; (**c**) u = 0.1, v = 2; (**d**) u = 0.2, v = 0.8.

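The flexibility of the beta shape can be checked numerically. The sketch below (a minimal illustration assuming SciPy is available; `scipy.stats.beta` uses the same two shape parameters, here named u and v) evaluates the pdf for the four parameter pairs of Figure 5.

```python
import numpy as np
from scipy.stats import beta

# (u, v) pairs from Figure 5, panels (a)-(d).
pairs = [(5, 5), (2, 5), (0.1, 2), (0.2, 0.8)]
x = np.linspace(0.01, 0.99, 99)

for u, v in pairs:
    # Beta(x; u, v) = Gamma(u+v)/(Gamma(u)Gamma(v)) * x^(u-1) * (1-x)^(v-1)
    pdf = beta.pdf(x, u, v)
    print(f"u={u}, v={v}: max pdf value on the grid is {pdf.max():.3f}")
```

Note that u = v yields a symmetric shape (panel (a)), while u ≠ v (panels (b)–(d)) yields asymmetric or U-shaped densities.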

#### 3.1.2. vMF (von Mises-Fisher) Distribution

The vMF distribution is a popular distribution in the family of directional distributions [47,51]. Data following a vMF distribution are located on the unit hypersphere; hence, the vMF variable’s l_{2} norm equals one, i.e., ||**x**||_{2} = 1. The vMF distribution contains two parameters, namely the mean direction **μ** and the concentration parameter λ. The pdf of a K-dimensional vMF distribution can be expressed as:

$$\text{vMF}(\mathbf{x}; \boldsymbol{\mu}, \lambda) = c_K(\lambda)\, e^{\lambda \boldsymbol{\mu}^{\mathrm{T}} \mathbf{x}},$$

where ||**μ**||_{2} = 1, λ ≥ 0, and K ≥ 2 [51]. The normalizing constant c_{K}(λ) is given by:

$$c_K(\lambda) = \frac{\lambda^{K/2 - 1}}{(2\pi)^{K/2}\, \mathcal{I}_{K/2 - 1}(\lambda)},$$

where $\mathcal{I}_{\nu}(\cdot)$ represents the modified Bessel function of the first kind of order ν [52]. The pdf of the vMF distribution is illustrated in Figure 6. In information retrieval applications, the vMF distribution can be applied to model the cosine similarity for the clustering of text documents [53,54]. It can also be applied to model gene expression data, which have been shown to have directional characteristics [55].

**Figure 6.**
Scatter plot of samples from a single von Mises–Fisher (vMF) distribution on the sphere for different concentration parameters, λ = {4, 40, 400}, and around the same mean direction **μ** = [0, 0, 1]^{T}. Samples generated from vMF(**μ**, 400) (shown in green) are highly concentrated around the mean direction, while for samples generated from vMF(**μ**, 4) (shown in blue), the distribution of samples on the sphere is more uniform around the mean direction.


#### 3.1.3. Watson Distribution

Observations on the sphere might have an additional structure, such that the unit vectors **x** and −**x** are equivalent; in other words, it is ±**x** that is observed. Here, we need probability density functions for **x** on the sphere $\mathbb{S}^{p-1}$ which are axially symmetric, that is, f(−**x**) = f(**x**). In such cases, the p-dimensional observation ±**x** can be regarded as being on the projective space ℙ^{p−1}, which is obtained by identifying opposite points on the sphere $\mathbb{S}^{p-1}$.

One of the simplest distributions for axial data with a rotational symmetry property is the (Dimroth–Scheidegger–) Watson distribution. The Watson distribution is a special case of the Bingham distribution [56], which is developed for axial data with no rotational symmetry property.

A random vector **x** ∈ ℙ^{p−1}, or equivalently ±**x** ∈ $\mathbb{S}^{p-1}$, has the (p − 1)-dimensional Watson distribution W_{p}(**μ**, κ), with mean direction **μ** and concentration parameter κ, if its probability density function is:

$$W_p(\mathbf{x}; \boldsymbol{\mu}, \kappa) = \frac{1}{{}_1F_1\!\left(\frac{1}{2}, \frac{p}{2}, \kappa\right)}\, e^{\kappa (\boldsymbol{\mu}^{\mathrm{T}} \mathbf{x})^2},$$

where κ ∈ ℝ, ||**μ**||_{2} = 1, and _{1}F_{1} is Kummer’s (confluent hypergeometric) function (e.g., [57] (Formula (2.1.2)) or [58] (Chapter 13)), defined as:

$${}_1F_1(r, s, \kappa) = \sum_{j=0}^{\infty} \frac{r^{(j)}}{s^{(j)}} \frac{\kappa^j}{j!},$$

where ${r}^{(j)}\equiv \frac{\mathrm{\Gamma}(r+j)}{\mathrm{\Gamma}(r)}$ is the rising factorial. Similar to the case of vMF distributions, for κ > 0, as κ → 0, W_{p}(**μ**, κ) reduces to the uniform density, and as κ → ∞, W_{p}(**μ**, κ) tends to a point density. For κ < 0, as κ → −∞, the density concentrates around the great circle orthogonal to the mean direction ([59] (Chapter 9.4)). Samples generated from the Watson distribution are shown in Figure 7.

**Figure 7.**
Scatter plot of samples from a single Watson distribution, W_{p}(**μ**, κ), on the sphere for positive and negative concentration parameters κ, around the same mean direction, **μ** = [0, 1]^{T} [60]. For larger concentration parameters, i.e., κ = 40 or κ = −40, samples are more concentrated around the mean direction (shown in red). For smaller concentration parameters, i.e., κ = 4 or κ = −4, samples are more uniformly distributed around the mean direction (shown in blue). (**a**) κ > 0, κ ∈ {+4, +40}; (**b**) κ < 0, κ ∈ {−4, −40}.

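The series definition of Kummer’s function and its role as the Watson normalizer can be verified numerically. The sketch below (a minimal illustration assuming SciPy; the series is truncated at 50 terms, and all function names are ours) evaluates _{1}F_{1} both from the rising-factorial series and via `scipy.special.hyp1f1`.

```python
import math
import numpy as np
from scipy.special import hyp1f1, gamma

def kummer_series(r, s, kappa, terms=50):
    """Kummer's function via its series: sum_j (r^(j)/s^(j)) * kappa^j / j!,
    where r^(j) = Gamma(r + j) / Gamma(r) is the rising factorial."""
    total = 0.0
    for j in range(terms):
        rising_r = gamma(r + j) / gamma(r)
        rising_s = gamma(s + j) / gamma(s)
        total += (rising_r / rising_s) * kappa ** j / math.factorial(j)
    return total

def watson_pdf(x, mu, kappa, p):
    """Watson density w.r.t. the uniform measure on the unit sphere:
    W_p(x; mu, kappa) = exp(kappa * (mu^T x)^2) / 1F1(1/2, p/2, kappa)."""
    return np.exp(kappa * np.dot(mu, x) ** 2) / hyp1f1(0.5, p / 2, kappa)

# The truncated series and SciPy's implementation should agree closely.
print(kummer_series(0.5, 1.0, 4.0))
print(hyp1f1(0.5, 1.0, 4.0))
```

Setting κ = 0 gives _{1}F_{1}(1/2, p/2, 0) = 1, so the density collapses to the uniform density, matching the limiting behavior described above.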

#### 3.3. Nonnegative Matrix Factorization for Bounded Support Data

Unlike PCA or ICA, NMF preserves the nonnegativity of the data during dimension reduction. Traditional NMF decomposes the data matrix into a product of two nonnegative matrices as:

$$\mathbf{X}_{P \times T} \approx \mathbf{W}_{P \times K}\, \mathbf{V}_{K \times T},$$

where **X**_{P×T}, **W**_{P×K}, and **V**_{K×T} contain nonnegative values X_{pt}, W_{pk}, and V_{kt}, respectively, with p = 1, …, P, t = 1, …, T, k = 1, …, K, and K ≪ T.
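As a concrete reference point for this factorization, the sketch below (a minimal NumPy illustration of traditional NMF with the standard multiplicative updates for the Frobenius-norm objective, not the bounded-support method proposed here) fits X ≈ WV on a random nonnegative matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
P, T, K = 20, 30, 4

# Nonnegative data matrix X (P x T) and random nonnegative initializations.
X = rng.random((P, T))
W = rng.random((P, K))
V = rng.random((K, T))

eps = 1e-9  # guards against division by zero
for _ in range(200):
    # Multiplicative updates (Lee & Seung) keep W and V nonnegative.
    V *= (W.T @ X) / (W.T @ W @ V + eps)
    W *= (X @ V.T) / (W @ V @ V.T + eps)

error = np.linalg.norm(X - W @ V, "fro")
print(error)
```

Because the updates only ever multiply by nonnegative ratios, nonnegativity of **W** and **V** is maintained throughout, which is the defining property contrasted with PCA/ICA above.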

The DNA methylation data are naturally bounded on the interval [0, 1]. Conventional NMF strategies do not take this nature into account. In order to capture the bounded feature explicitly, we proposed an NMF for bounded support data [65]. Each bounded support element X_{pt} is assumed to be generated from a beta distribution with parameters a_{pt} and b_{pt}. With an observation matrix **X**_{P×T}, two parameter matrices **a** and **b**, each of size P × T, are obtained. Each parameter matrix, rather than the observation matrix, is decomposed into a product of a basis matrix and an excitation matrix as:

$$\mathbf{a} \approx \mathbf{A}_{P \times K}\, \mathbf{H}_{K \times T}, \qquad \mathbf{b} \approx \mathbf{B}_{P \times K}\, \mathbf{H}_{K \times T}.$$

With the above description, we assume that the matrix **X** (with elements X_{pt} ∈ [0, 1]) is drawn according to the following generative model:

$$X_{pt} \sim \text{Beta}\Big(x; \sum_{k=1}^{K} A_{pk} H_{kt},\; \sum_{k=1}^{K} B_{pk} H_{kt}\Big),$$

with a gamma prior over each element of **A**, **B**, and **H**, where Gamma(x; k, θ) is the gamma density with parameters k, θ, defined as:

$$\text{Gamma}(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\Gamma(k)\, \theta^{k}}.$$

As the data are assumed to be beta distributed and the parameters of the beta distribution are assumed to be gamma distributed, this model is named BG-NMF.

For BG-NMF, the variational inference (VI) method [66] is applied to estimate the posterior distributions. The expected value of X_{pt} is ${\overline{X}}_{pt}=\frac{{a}_{pt}}{{a}_{pt}+{b}_{pt}}$. If we take point estimates of A_{pk}, B_{pk}, and H_{kt}, then the expected value of X_{pt} can be approximated as:

$$\overline{X}_{pt} \approx \frac{\sum_{k=1}^{K} A_{pk} H_{kt}}{\sum_{k=1}^{K} (A_{pk} + B_{pk}) H_{kt}},$$

which can be expressed in matrix form as:

$$\overline{\mathbf{X}} \approx (\mathbf{A}\mathbf{H}) \oslash \big((\mathbf{A} + \mathbf{B})\mathbf{H}\big),$$

where ⊘ denotes element-wise division. When placing sparsity constraints on the columns of **H**, the reconstruction in Equation (11) can be approximated as:

$$\overline{\mathbf{X}} \approx \big(\mathbf{A} \oslash (\mathbf{A} + \mathbf{B})\big)\, \mathbf{H} = \overline{\mathbf{W}}\, \mathbf{H}.$$

Hence, the resulting pseudo-basis matrix **W̄** is low-dimensional while retaining the bounded support constraint.
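The point-estimate reconstruction can be sketched in a few lines of NumPy (an illustration only: random gamma draws stand in for the VI point estimates of **A**, **B**, and **H**). The element-wise division guarantees that every reconstructed entry stays in the bounded interval [0, 1].

```python
import numpy as np

rng = np.random.default_rng(1)
P, T, K = 10, 15, 3

# Stand-ins for the (nonnegative) point estimates obtained via VI.
A = rng.gamma(shape=2.0, scale=1.0, size=(P, K))
B = rng.gamma(shape=2.0, scale=1.0, size=(P, K))
H = rng.gamma(shape=2.0, scale=1.0, size=(K, T))

# X_bar = (A H) / ((A + B) H), i.e., element-wise division of two products.
X_bar = (A @ H) / ((A + B) @ H)

# Pseudo-basis W_bar = A / (A + B); each entry also lies in [0, 1].
W_bar = A / (A + B)
print(X_bar.min(), X_bar.max())
```

Since **A**, **B**, and **H** are all strictly positive, (**AH**)_{pt} < ((**A** + **B**)**H**)_{pt} holds element-wise, which is exactly the bounded support property the model was designed to retain.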

#### 3.3.1. Spectral Clustering for Non-Gaussian Reduced Features

Recently, spectral clustering (SC) has become one of the most popular clustering algorithms [67]. It is an alternative to the K-means algorithm. When the natural clusters in ℝ^{L} do not correspond to convex regions, the K-means algorithm cannot provide satisfactory clustering results. However, when the data points are mapped to the ℝ^{K} space via SC, they may form tight clusters [68]. SC analyzes the affinity matrix of the data. Assuming that the data are likely to be clustered in a K-dimensional space, the reduced features, each of which is K-dimensional, are extracted by eigenvalue analysis of an intermediate matrix **M**. The reduced features are then used for clustering with conventional methods such as K-means. The feature extraction procedure via the SC method [68] is summarized in Algorithm 1. With the above extracted features, the task of data clustering can be carried out in the reduced ℝ^{K} space.

**Algorithm 1.**
Spectral clustering.

**Input:** Original data matrix **X** = {**x**_{1}, **x**_{2}, …, **x**_{N}}, where each column is an L-dimensional vector.
1. Create the affinity matrix **A**_{N×N}, where ${A}_{ij}=\begin{cases}{e}^{-{\Vert {\mathbf{x}}_{i}-{\mathbf{x}}_{j}\Vert}^{2}/(2{\sigma}^{2})} & i\ne j\\ 0 & i=j\end{cases}$.
2. Construct the intermediate matrix $\mathbf{M}={\mathbf{D}}^{-\frac{1}{2}}\mathbf{A}{\mathbf{D}}^{-\frac{1}{2}}$, where **D** is a diagonal matrix whose (i, i)-th element is the sum of the i-th row of **A**.
3. Apply eigenvalue analysis on **M** and create a matrix **Y**_{K×N}, which contains the K eigenvectors corresponding to the K largest eigenvalues.
4. Form a matrix **Z** from **Y** by normalizing each column of **Y**, so that each column of **Z** has unit length.

**Output:** Reduced K-dimensional feature **z**_{n} for each data point **x**_{n}.
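The steps of Algorithm 1 can be sketched directly in NumPy (a minimal illustration; the kernel width σ and the dimension K are chosen by hand here, and, as above, data points are stored column-wise):

```python
import numpy as np

def spectral_features(X, K, sigma=1.0):
    """Feature extraction of Algorithm 1: columns of X are L-dim points;
    returns the reduced K-dim features Z, one column per data point."""
    # Step 1: affinity matrix A with Gaussian kernel and zero diagonal.
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    A = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Step 2: M = D^{-1/2} A D^{-1/2}, with D the diagonal of row sums of A.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    M = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # Step 3: K eigenvectors of M for the K largest eigenvalues, as rows of Y.
    w, U = np.linalg.eigh(M)  # eigenvalues in ascending order
    Y = U[:, -K:].T           # shape (K, N)
    # Step 4: normalize each column of Y to unit length.
    return Y / np.linalg.norm(Y, axis=0, keepdims=True)

# Two well-separated 2-D blobs, stored column-wise.
rng = np.random.default_rng(2)
X = np.hstack([rng.normal(0, 0.1, (2, 20)), rng.normal(5, 0.1, (2, 20))])
Z = spectral_features(X, K=2)
print(Z.shape)
```

Points in the same blob map to nearly identical unit-length feature vectors in ℝ^{K}, while points in different blobs map to nearly orthogonal ones, so a subsequent K-means step on the columns of **Z** separates the clusters easily.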