Modelling Bimodal Data Using a Multivariate Triangular-Linked Distribution

: Bimodal distributions have rarely been studied although they appear frequently in datasets. We develop a novel bimodal distribution based on the triangular distribution and then expand it to the multivariate case using a Gaussian copula. To determine the goodness of ﬁt of the univariate model, we use the Kolmogorov–Smirnov (KS) and Cramér–von Mises (CVM) tests. The contributions of this work are that a simplistic yet robust distribution was developed to deal with bimodality in data, a multivariate distribution was developed as a generalisation of this univariate distribution using a Gaussian copula, a comparison between parametric and semi-parametric approaches to modelling bimodality is given, and an R package called btld is developed from the workings of this paper.


Introduction
Bimodality in data can be described as the presence of two distinct modes and many examples of bimodal data can be found in nature. Bimodality can occur in a variety of circumstances. Firstly, it can occur when there are two or more latent attributes of the data which might result in bimodality. For instance, if we consider the frequency of cars crossing a bridge in a 24 h period, two modes manifest because two peak traffic hours are latently reflected [1]. Moreover, as determined by [2], if we review the spread of tree cover in a desert, specific attributes about the geography in the area latently influence the spread of the trees. In both cases, the bimodality in the data is generated from a single population with multiple latent attributes.
Secondly, bimodality could occur if there are attributes of sub-populations present in the data. An example of this can be the spread of height amongst college students. If we ignore gender and just record the height of students, we can expect bimodality to reflect in histograms of the data.
A common approach to handle bimodality is to use mixture distributions, which are semi-parametric models that yield density estimations as well as clustering solutions [3]. This approach requires the estimation of five parameters: the means and standard deviations of the two normal distribution, and the mixing parameter. Alternatively, one could fit a bimodal parametric distribution to the data.
Both of these approaches pose benefits and challenges to the modeller. Bimodal parametric models result in a mathematical formulation for the cumulative distribution function (CDF) which is convenient for hypothesis testing and calculation of confidence intervals. The challenge is to find the exact estimates of the parameters. Mixture models come with relaxed parametric assumptions, but the choice of number of components is not always intuitive and the computational complexity increases as the number of components increases [4].
In this paper, we propose a mathematically concise and simplistic approach to modelling bimodal data. We use the well-known triangular distribution as the initial point of reference and then extend it to what we call the bimodal triangular-linked distribution (BTLD). This newly formulated distribution accommodates bimodality without considering mixture models.
We generalise this distribution to a multivariate context using a Gaussian copula. Copulas join or "couple" univariate distribution functions, say F(x) and G(y), to form a multivariate distribution function H(x, y) = C(F(x), G(x)). Modelling the joint distribution is done by mixing the marginal distributions using a bivariate function (known as a copula) C : (u, v) → C(u, v) which captures and reflects the dependence pattern of X and Y [5].
To determine the goodness of fit of the univariate model, we use the Kolmogorov-Smirnov (KS) and Cramér-von Mises (CVM) tests. The KS test is a nonparametric test that uses the maximum absolute distances between the empirical distribution and the proposed null distribution to test if the proposed null distribution truly fits the data in the test [3,6]. The CVM test is an alternative to the KS test and has been suggested to use the data more effectively [7,8]. The null hypothesis is the same for both tests and the discrepancy in the tests lies in the way that the test statistics, d for the KS and ω 2 for the CVM, are calculated [8]. We use both tests to ensure the consistency of our testing framework. The contributions of this paper are as follows:

1.
A simplistic yet robust distribution geared to handle bimodality in its parameters is developed from a fusion of uniform and triangular distributions which we called the bimodal triangular-linked (BTL) distribution, 2. A multivariate extension is developed using a copula to model the dependance structures of multiple variables. A simulation is shown as well as comparison to a multivariate Gaussian mixture model (GMM), 3. A comparison between parametric and semi-parametric approaches to dealing with bimodality in data is illustrated in the form of an application to gene expression data, 4. An R package is constructed from the workings of this paper called btld. An explanation of the functionality of this package is included in Appendix B. All generations and computations regarding the BTL are done using this package.
The rest of the paper is structured as follows: In Section 2, we review related work in the fields of bimodal distributions and mixture model. Section 3 covers important background theory on the triangular model. This section also motivates the use of Kolmogorov-Smirnov and Cramér-von Mises test as goodness of fit tests. The bimodal triangular-linked model with its statistical properties is introduced in Section 4. The generalisation to the multivariate context is enabled through copulas which are introduced in Section 3.3. The multivariate generalisation, the Multivariate triangular-linked distribution is introduced in Section 5 after which its application is illustrated in Section 6. Section 7 concludes with a discussion of results and future work.

Related Work
Bimodal data are very relevant as it occurs often in practice. The two main approaches used to to model such data are the use of mixture models or special parametric distributions which inherently capture bimodality. Mixture models are semi-parametric models that yield density estimations [3]. Although not strictly bimodal, mixture models may yield bimodality in the densities if the data provide for it. Mixture models are constructed by taking a weighted sum of two or more distributions. There is a vast number of distributions to choose from, thus when it comes to creating a mixture model, there are virtually endless possibilities and the final decision will typically depend on the data and the needs of the practitioner. Sheng et al. [9] demonstrated that mixture models can be used in pharmacodynamic studies in which bimodal count data arise. They found that the two generalized Poisson mixture model was the best fit for their bimodal dataset consisting of the number of times that a rodent licked an oral medication in a palatability study. Bimodal distributions have also been used in cancer studies. Irace and Batatia [10] demonstrated the effectiveness of the univariate and bivariate mixtures of Poisson distributions in automatic image segmentation of 4-D bimodal PET-CT images used for cancer diagnoses. A mixture of bivariate negative binomial-normal distributions has also been considered for the same purpose [11]. In genetics, Gaussian mixture models have been used to identify genes with bimodal expression patterns in tumors [12]. Other applications where mixture models have been used for bimodal data include modelling ratings from Tripadvisor.com [13], Over the last two decades, many researchers have proposed new distributions for bimodal data based on the skew normal distribution [14]. The skew normal distribution is an extension of the normal distribution which includes an additional parameter to induce asymmetry. Based on this idea of modifying a normal distribution, Elal-Olivero [15] developed a new skewed distribution called the alpha-skewed-normal (ASN) distribution by introducing a parameter which allows for the added flexibility of modelling both unimodal and bimodal data. A further extension to the ASN that allowed at most four modes, called the alpha-beta skewed normal (ABSN) distribution, was later proposed by Shafiei et al. [16]. This ABSN was used to model the bimodal acidity indices of lakes in Northeastern United States [16]. More recently, using a methodology advocated by Balakrishnan [17], the Balakrishnan-alpha-skewed-normal (BASN 2 ) distribution was proposed [18] as well as its extension, the Balakrishnan alpha-beta skewed normal (BABSN 2 ) [19] which is able to model up to four modes. A BASN 2 distribution was fit to a bimodal dataset consisting of 69 samples of N latitude degrees from world lakes and was found to outperform several other skewed distributions, including the ASN and ABSN, based on AIC and BIC [19]. This approach of using skewed distributions to model bimodal data is very common. Other examples of such distributions include the alpha-skew Laplace distribution [20], the alpha-skew-logistic distribution [21,22], bimodal skew-elliptical distributions [23], flexible generalized skew normal and t distributions [24], as well as some multivariate skewed distributions presented in [25][26][27]. See also [28][29][30][31][32][33][34][35][36][37] for other related work. Skewed distributions have also been used in the construction of mixture models. For instance, the skew normal mixture model (SNMIX) was shown to produce batter AIC and BIC scores than traditional normal mixture models on two real-world datasets [38]. To address the SNMIX's possible lack of robustness in the presence of outliers, Lin et al. [39] proposed the skew t mixture (STMIX).

Background Theory
We first provide background theory on the triangular distribution, goodness-of-fit tests, and copulas which are main points of reference for the rest of the paper.

Triangular Distribution
The Triangular distribution on the (0, 1) space has one parameter θ also lying in (0, 1). Taken from Kotz and van Dorp [40], the probability density function (PDF) for the triangular distribution is defined as The cumulative distribution function (CDF) is obtained by integrating Equation (1), The inverse CDF (ICDF) is, The ICDF transform method [41] is used to simulate random variables from the triangular distribution as illustrated in Figure 1. For clarity, we overlay a kernel density estimate and the PDF of the triangular distribution on the histogram of the random variables.

Kolmogorov-Smirnov Goodness-of-Fit
In many cases it is not possible to make assumptions about the underlying distribution of a given dataset and one needs to consider non-parametric or semi-parametric methods. A non-parametric model postulates that observations come from some distribution function F not constrained by any parameters. However, the interpretability of these models are not so clear and semi-parametric models might be considered as a compromise [3]. First, consider the estimate of a distribution function, i.e., the empirical cumulative distribution function, F * n (x). It is formally calculated, at some point x, by taking the proportion of sample observations less than or equal to that point, where I(·) is the indicator function. The indicator function is defined as It has been proven that the empirical CDF is an unbiased and consistent estimator of the true CDF. That is to say the following properties can be drawn about the estimator, is an unbiased and consistent estimator for F(x) The Kolmogorov-Smirnov (KS) test of goodness of fit is a non-parametric test that uses a proposed distribution function F 0 (x) versus the observed cumulative function S n (x), alternatively known as the empirical CDF (ECDF). Our null hypothesis is that the sample can be modelled using F 0 (x). The crux of this test is that we expect the observed CDF and the proposed CDF to be very close to each other for all N observations. If the distributions are too distinct from each other then we can conclude that the proposed distribution is not appropriate for modelling the sample. That is, we reject our null hypothesis [6]. Our hypotheses are generally constructed as follows, Formally, if F 0 (x) is the population cumulative distribution function, and S N (X) = k/N where k is the cumulative index of x, then let Kolmogorov's D be defined as follows, Large values of this statistic suggest with a high level of probability that we will reject the null hypothesis and the converse applies for small values [3]. Furthermore, note that Equation (6) To make conclusions about D we use tables that yield critical points of the distribution of D for differing sample sizes. We reject the null hypothesis if D > d α (N) where α is the level of significance for our test.

Copulas
Most standard probability distributions are generalised to the multivariate context such as the multivariate Gaussian distribution. The multivariate generalisation is important in order to model dependencies between variables. We run into the problem that multivariate generalisations do not exist for all distributions and furthermore, one might want to model the dependency between two different distributions for which the mathematical formulation does not exist either.
Copulas are mathematical functions that describe the dependency structure between a finite set of univariate marginal distributions [42]. The main idea is that marginal properties can be separated from correlation properties. Under the hood, copulas are multivariate distributions with uniform marginals. Using the inverse probability transform [41,42] any distribution can be generated from the uniform distribution.
The basis of taking a set of correlated uniform random variates (which is the copula itself) and transforming these into a multivariate distribution with arbitrary marginals is based on Sklar's theorem which proofs that we can extract the correlation structure from marginal distributions and used to create a set of marginals and a copula [42,43].
The dependence structure is independent of the marginals and is fully described by the copula. Furthermore, transformations that preserve the ranks will preserve the copula, since the copulas are linked to the ranks of random variables. The main purpose of a copula is to describe the dependence between the marginal distributions. Furthermore, the copula can be used to calculate conditional probabilities and predictions. At first sight, this might not seem to make a huge difference, but the effect is typically amplified in the tails of the dependency structure.
The most common copula is the Gaussian copula which forms part of the collection of copulas known as elliptical copulas from the shape the copula structure makes.Another elliptical copula is the Student's t copula [44]. Popular copulas in the Archimedian family of copulas include the Gumbel, Frank copula, the Cook-Johnson copula and the Clayton copula.
Alternative approaches to measuring the dependence structures of copulas may be taken instead of the widely known Pearson's linear correlation coefficient because the structure of the marginals can change the linear correlation coefficient, but not so for non-parametric dependence measures such as Kendall's tau and Spearman's rho. These measures capture the dependency inherent in the copula structure, not that just presented by the marginals. One may opt to use the Clayton copula if one is interested in having a high lower tail dependence structure. This is useful for modelling unlikely events, i.e., ones that tend to occur in the lower tail. Conversely, for a Gaussian copula, the dependence breaks down in the tails [45]. Figure 2 illustrates the conditional distributions that can be generated from copulas: Given a particular observation in one of the marginals, we can determine the probability and cumulative distributions for the other.

Bimodal Triangular-Linked Distribution
We derive the Bimodal triangular-linked (BT L) distribution as an extension of the triangular PDF defined in Equation (1). The mode of the triangular distribution is θ. Pulling the two triangle legs apart-left and right of the mode-results in a distribution with two modes, say θ 1 and θ 2 . The introduction of the two parameters α 1 and α 2 has the implication that the density is zero between the two modes θ 1 and θ 2 . If we allow the density between the modes, denoted as f (x) = α 0 , to follow a uniform distribution as part of the probability density over this area such that That is, the uniform density α 0 can be rewritten in terms of parameters of the distribution. Note that θ = [θ 1 , θ 2 ] is defined such that it satisfies the following, 0 < θ 1 ≤ θ 2 < 1.
Therefore, if we substitute these values of θ and introduce multiplicative scaling factors α 1 and α 2 into Equation (1), we find that the BTL distribution takes on form,

Triangular-Linked Statistics
Some statistics of the TL distribution-apart from being bimodal with modes θ 1 and θ 2 are the first and second moments. The mean is: and Also useful to note is that if θ 1 = 0 and θ 2 = 1, the distribution becomes a UNF(0, 1) with α 1 = 1, α 2 = 1.

Derivation of the CDF
Using Equation (7), we can derive the CDF for the BTL distribution.
Proof. For x < 0 : = 0 For 0 ≤ x < θ 1 : For θ 2 ≤ x < 1 : Note that the expression that gets reduced to 1 in the second last line of the above derivation is proven in Appendix A. For x ≥ 1 : = 1 Therefore the CDF of the BT L is,

Sample Estimates of Parameters
The CDF values of θ 1 and θ 2 can be determined using Equation (15). Thus, Note that the expression for F(θ 2 ) is derived from Equation (A2) in Appendix A. These values are pivotal in determining sample estimates of α i and θ i . If the empirical CDF of the function is known, the breakpoints on the plot will be indicative of the modal parameters θ i . Once the breakpoint valuesθ 1 andθ 2 , and their corresponding function values F(θ 1 ) and F(θ 2 ) are determined, it is trivial to determine the scaling parameters α i using the formulae derived above.

Derivation of the ICDF
Therefore the ICDF of the BTL distribution is, Using the ICDF, we can generate the random variates from the BT L distribution which is illustrated in Figure 3. To showcase the exact shape of the BTL, we overlay the PDF of the generated variates as well as a KDE estimate. This is done to illustrate the exact bimodal distribution as well as the approximate bimodal distribution.

Goodness of Fit Test
We continue with the simulated example to illustrate the goodness of fit tests for the BTL distribution. Given a simulated sample x 1 , . . . , x n of size n from the TL(θ, α) distribution, the estimation of the parameters is fairly easy using the empirical estimate of F(x) which is shown in Figure 4. It is almost identical to the true df. Testing the fit of the model was conducted in R for simulated data from the bimodal triangular-linked distribution. The null hypothesis and alternative hypotheses for both cases are presented as follows: • H Case 1 0 : the distribution, F 0 (x), follows a BTL distribution with parameters θ 1 , θ 2 , α 1 , α 2 • H Case 1 a : the distribution does not follow a BTL distribution The test can be conducted using p-values as well where a specified level of significance is used. The null hypothesis is rejected if p < α. Illustrated in Figure 3, the KS-test statistic is observed to be D = 0.02 which is quite small. Using goftest, we calculate the p-value = P(D ≥ q) = 0.6264 which is larger than even a lenient significance level of α = 0.1. Therefore, we do not reject the null.

Multivariate Triangular-Linked Distribution
Copulas equip us with the mathematics to generalise the Bimodal triangular-linked distribution to the multivariate case and we use the theory above to derive the multivariate triangular-linked (MVTL), MV T L, distribution. Let U be a random Gaussian copula, C Gauss P (u), with u i = φ i (X i ), by the probability integral transform, [46] being the marginals where X ∼ MV N d (0 d×1 , P d×d ) [42], N is the sample size we are considering and d represents the dimensionality of the random variables. By the standard inverse transform [46], let ]. Therefore, let the distribution of Y be defined as follows where Θ d×2 is a d-by-2 matrix containing the modal parameters; A d×2 is the d-by-2 matrix containing the scaling factors for the distribution (explained in previous sections) and; R d×d ) is the positive definite correlation matrix, with dimensions of d-by-d, of the Gaussian copula required to generate this multivariate distribution. Fitting this distribution as a proposed model implies we must determine sample estimates. We explain how to do this below.
• R d×d can be estimated by calculating the sample correlation matrixR = (Y). Pearson's correlation coefficient is calculated and used as the sample estimate. • To determine the modes, they can be read off the ECDF at the break pointsθ i1 andθ i2 . These points correspond to estimates of the CDF, F(θ i1 ) and F(θ i2 ). • The scale parameters estimatesα i1 andα i2 , can be determined using Equations (16) and (17), respectively. After some algebraic manipulation of these equations, formulae for these parameters can be determined.

Generated Example
Let the parameter and correlation matrices be defined as follows: A dataset of size 1000 is simulated and a matrix plot is constructed in Figure 5 for X ∼ MV T L(A, Θ, R) so as to showcase the marginal distributions of X, their shape and contours. In the lower triangular section we have scatterplots of all the marginals of X which we will denote as X 1 ∼ BT L(0.3, 0.7, 1, 5) , X 2 ∼ BT L(0.3, 0.7, 3, 3) and X 3 ∼ BT L(0.3, 0.7, 5, 1). The ECDFs of the marginals are plotted in Figure 6. This plot, which we call the H-plot, is used to read off the density values associated with the mode parameters. whereR is the Pearson correlation coefficient of the observed data. Finally, the scale parameters are determined using the expressions obtained in Equations (16) and (17). Therefore, the estimate of A, i.e.,Â, is as follows,  As an illustration, we plot the contours in 2D for two cases: one where we have a MVTL and the dependance structure is captured by a Gaussian copula, and; two, where the BTL marginals are independant. The distributions of these cases are plotted for all the variable X 1 , X 2 and X 3 shown in Figure 7. Note that these contours are not the same which means that by modelling the distribution using a copula, better captures any dependance structures inherent in the data.  (c) Figure 7. These subfigures are bivariate density plots for variables X 1 , X 2 and X 3 plotted against themselves in exhaustive pairs. The plots consider the cases when they independently generated (Red contours) and the case when they are generated using a copula (Blue contours). (a) Contour plot of X 1 against X 3 . (b) Contour plot of X 2 against X 3 . (c) Contour plot of X 1 against X 2 .

Gene Expression Data
Gene expression data have been directed towards a better understanding of a diverse range of biological processes which can then be used to determine associations between genetic information. If bimodality in the gene expressions are identified, then this can be used to extract interesting insights into biological attributes of a particular cancer associated with the tumor. The data we will use are extracted using developed software [12], which is written in R. These data consist of expression and clinical data from 25 different tumor types which is in turn harvested from the Cancer Genome Atlas (https://portal.gdc.cancer.gov/ accessed on 5 September 2021). In total the authors of [12], get expression values measured in Fragments by Exon Kilobase per Millions of Mapped Fragments values (FPKM), for just under 25,000 genes which yields more than 10 million observations. In the paper presented by Justino, Gaussian mixture models (GMMs) is used to model the overall density of the bimodal genes and provide a clustering solution. The GMM, as highlighted in chapter in modelling the bimodal data, is that it allows one to extract an overall density function for the bimodal data. We will consider the overall density modelled by a GMM against the overall density modelled by the BTL. Since GMMs yield clustering solutions as well as overall density functions, it could be a heavy handed approach if there is no meaningful interpretation in the clustering solution. Furthermore, one could quite easily reach an overfitted and overparameterized model by using a k-component mixture model [47].

BTL vs. GMM
In Figure 8, we show the comparative fit of three GMMs against our parametric model, the bimodal triangular-linked distribution. From the figure, it seems evident that a two component mixture fits well to the data with three and four components not add much difference to the overall density fit. Furthermore, the three component and four component models do not converge in 1000 iterations which is troublesome, however, the two component model just fits in 38 iterations. In Table 1, we have tabulated the estimates used forα andθ for the BTL. These are calculated using the formulae provided by Equations (16) and (17).  If we compare to the fit of the BTL, we find that it has a very good fit. Using the Cramér-von Mises and the Kolmogorov-Smirnov tests, we find that the we do not reject the null hypothesis comfortably, even with the penalty on the CVM for sample estimated parameters (shown in Table 2). In Table 3, it an be seen that we also do not reject the null hypothesis for the KS and CVM tests. To get to more distinctive quantitative results, we look a the AIC and BIC scores in Table 4, where according to these relativistic metrics, the GMM outperforms the BTL as a density model. Furthermore, we visually illustrate the goodness of fit comparisons with the plots in Figure 9, showing a PP-plot and an ECDF plot of the two models, BTL and GMM.

Conclusions
The curse of dimensionality is a problem for all models. Working in higher dimensions complicates the modelling process. This is even more true for bimodal data since very few classical (i.e., normal, t) distributions that can accurately model bimodality in one dimension, let alone multiple dimensions. Most often, researchers turn to finite mixture models to capture bimodality. The use of finite mixture models does not provide five number summary statisics directly, bare useful tools such as the CDF which can used for prediction and confidence intervals and can be computationally expensive with k components in d dimensions. Furthermore, the choice of mixtures is most often a heuristic and in scenarios where the modes are close to each other, convergence of the EM algorithm is not guaranteed. Goodness-of-fit cannot be determined for mixture models and the only way to measure the model is by using machine learning metrics such as RMSE, accuracy, BIC and AIC.
In this paper, we introduce a simple mathematical model that addresses the issues with existing models for bimodality. The approach is as follows: The modes can be determined by identifying structural breaks in the empirical CDF. This is the only empirical information needed in order to fully specify the model estimates. We will investigate alternative approaches to determining the modes such as using KDEs as a proxy for mode estimate, MLEs and Methods of moments estimates.
In order to generalise to the multivariate case, we use a Gaussian copula. Copulas have found wide application in the field of actuarial modelling and investment management. We only take advantage of the dependance modelling of copulas to build the multivariate distribution as a fully fledged mathematical form of a multivariate distribution can necessitate a PhD to develop. A simulation of the multivariate distribution was illustrated and a comparison to a multivariate Gaussian mixture model was shown. An application in gene expression data of the univariate BTL vs. a GMM is shown. Finally, a package called btld was constructed from the workings of this paper.
Further research includes invesitgating the MVTL as a substitute for the compound Dirichlet distribution in modelling word rates. Furthermore, we can look to alternate copula structures to see how the multivariate distribution could change.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: