Identification of Block-Structured Covariance Matrix on an Example of Metabolomic Data

Modern investigation techniques (e.g., metabolomic, proteomic, lipidomic, genomic, transcriptomic, phenotypic) allow the collection of high-dimensional data, where the number of observations is smaller than the number of features. In such cases, standard methods of statistical analysis cannot be applied, or they lead to ill-conditioned estimators of the covariance matrix. To analyze the data, we need an estimator of the covariance matrix with good properties (e.g., positive definiteness), and therefore covariance matrix identification is crucial. The paper presents an approach to determining a block-structured estimator of the covariance matrix, illustrated on metabolomic data on the drought resistance of barley. This method can be used in many fields of science, e.g., in agriculture, medicine, food and nutritional sciences, toxicology, functional genomics and nutrigenomics.


Introduction
In many experiments using high-throughput omics technologies, the entire spectrum of the observed features (so-called nontargeted analysis), e.g., chemical compounds, is considered. This leads to the collection of a huge amount of data in which, from the statistical point of view, there are too many parameters to estimate (e.g., in metabolomics: [1][2][3][4][5][6]; in proteomics: [7][8][9]). Moreover, methods of automated separation of co-eluted compounds widen the gap between the number of features and the number of analyzed samples. To carry out a more detailed statistical analysis of such data, a new approach should be adopted. A similar problem (in a simplified version) was considered in [10] by selecting an appropriate covariance structure for three subsets of data obtained in a study of metabolomic changes in barley (Hordeum vulgare) leaves under drought stress. The aim of this work is to show, on the example of the barley data, how to process metabolites, proteins, lipids, gene expression quantitative traits and phenotypic traits in order to obtain an appropriate covariance structure, which leads to an analysis that better reflects the data. The characterization of a covariance structure was investigated with methods based on the Frobenius norm and on the entropy loss function. In [10], to overcome the problem of singularity, we considered three selected nonsingular subsets of the barley data. In this paper, we analyze the whole dataset, working with a singular matrix. We indicate block-structured estimators and the most suitable block-structured covariance matrix by visualizing the correlation matrix using heatmaps. The specified block-structured covariance matrices are considered, e.g., in [11][12][13].
Analyzing the whole dataset, comprising 781 traits and 211 samples, we have to deal with the high-dimensionality of the data, in which the sample size is too small in comparison with the number of variables. All measurements are collected in a data matrix, from which the sample covariance matrix S = (s_ij) is computed; the correlation matrix is obtained as R = D^{-1} S D^{-1}, where

D^{-1} = diag(1/√s_11, 1/√s_22, ..., 1/√s_mm). (2)

All elements of the matrix R belong to the interval [−1, 1], and for the structure identification we visualize the correlation matrix using a heatmap (Figure 1). From Figure 1 we cannot recognize any structure, thus further analysis is necessary. To identify the covariance structure, in the first step we visualize the data using the hierarchical clustering method, which is described in Section 2.1. Based on the hierarchical clustering results, we describe the estimation procedure of the covariance matrix in Section 3, using the statistical methods from Section 2.2.
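The scaling in equation (2) can be sketched with numpy on toy data (random numbers standing in for the barley measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))                 # toy data: n = 20 samples, m = 6 traits

S = np.cov(X, rowvar=False)                  # sample covariance matrix S (m x m)
D_inv = np.diag(1.0 / np.sqrt(np.diag(S)))   # D^{-1} = diag(1/sqrt(s_ii))
R = D_inv @ S @ D_inv                        # correlation matrix R = D^{-1} S D^{-1}

assert np.allclose(np.diag(R), 1.0)          # unit diagonal
assert np.all(np.abs(R) <= 1 + 1e-12)        # all entries lie in [-1, 1]
```

The assertions confirm the two properties used in the text: unit diagonal and entries in [−1, 1].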

Materials and Methods
Our motivation came from an investigation of the effects of water shortage on the levels of primary metabolites in 9 varieties of spring barley, measured repeatedly during drought treatment and control. The data come from a pilot study for a larger systems biology project [14]. The pilot experiment is described in [15] for two varieties: Maresi and Cam/B1/CI 08887//CI 05761. Barley plants were cultivated under partially controlled greenhouse conditions. The primary metabolites were identified by gas chromatography coupled with mass spectrometry (GC-MS). In our paper, nine spring barley genotypes were used: the European varieties Georgia, Maresi, Lubuski, Sebastian and Stratus; Morex, bred in the USA; and Cam/B1/CI 08887//CI 05761, Harmal, and Maris Dingo/Deir Alla 106, lines bred in Syria. Data were obtained at 4 time points in 4 biological replications and 2 technical replications, giving a total of 422 samples. According to [10], after averaging the data over technical replications, 211 samples are obtained, and after summation over the mass-to-charge ratio, 781 traits are considered. After this step, each biological sample is represented by one total ion current (TIC) chromatogram. In order to ensure normality of the data, the observations were transformed by the logarithm with base 1.2.
To indicate the most suitable block-structured covariance matrix for our whole dataset, we visualize the data by heatmaps using the hierarchical clustering methods described below. Furthermore, in order to indicate block-structured estimators, we present methods for their determination.

Hierarchical Clustering
Because the order of random variables is not important in hierarchical clustering, we can change the sequence of variables to find groups of similar traits.
Hierarchical cluster analysis uses a measure of dissimilarity for some number of objects being clustered. Initially, each object is in its own cluster and then the most similar clusters (the smallest dissimilarity) are joined iteratively until there is just a single cluster. At each stage, distances between elements are computed using a selected distance function (described in Section 2.1.1) and the distances between clusters are obtained using the chosen linkage criterion (described in Section 2.1.2). The distances are stored in a distance matrix. The procedure of choosing a pair of clusters to merge at each step is based on finding the nearest clusters (the smallest value) from this distance matrix. We applied R package pvclust [16] to analyze the dissimilarities of the data.
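The agglomerative procedure described above (distance matrix, iterative merging of the nearest clusters) can be sketched with scipy instead of pvclust; the toy data and the choice of two target clusters are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# two well-separated groups of observations
X = np.vstack([rng.normal(0, 0.1, (5, 3)), rng.normal(5, 0.1, (5, 3))])

d = pdist(X, metric="euclidean")                  # condensed distance matrix
Z = linkage(d, method="complete")                 # iterative merging, complete linkage
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters

# each tight group ends up in its own cluster
assert len(set(labels[:5])) == 1 and len(set(labels[5:])) == 1
assert labels[0] != labels[5]
```

`linkage` performs exactly the loop described in the text: at each step the pair of clusters with the smallest entry in the distance matrix is merged.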

Distance Functions
The distance between two observations can be calculated by the following methods:
• Euclidean: d(x, y) = √(Σ_i (x_i − y_i)²), the straight-line distance between two observation vectors (the distance used in Figure 4),
• Binary (Jaccard index): J(A, B) = |A ∩ B| / |A ∪ B|, the ratio of the number of elements common to both sets to the number of all elements.
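The Jaccard index can be computed directly from set operations; a minimal example with two small illustrative sets:

```python
# Jaccard index J(A, B) = |A ∩ B| / |A ∪ B| for two binary feature sets
a = {1, 2, 3, 5}
b = {2, 3, 4}

j = len(a & b) / len(a | b)   # common elements / all elements
assert j == 2 / 5             # {2, 3} out of {1, 2, 3, 4, 5}
```

The corresponding dissimilarity used in clustering is 1 − J(A, B).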

Linkage Criteria
The following linkage criteria are implemented:
• Ward.D and Ward.D2: procedures applying analysis of variance to compute the cluster distances. The difference between the two Ward linkage algorithms is that Ward's clustering criterion is not implemented in "ward.D" (1963), whereas the option "ward.D2" implements that criterion; with the latter, the dissimilarities are squared before cluster updating; cf. [17].
• Single: the distance between two clusters is the minimum distance between an observation from one cluster and an observation from the other cluster.
• Complete: the distance between two clusters is the maximum distance between an observation from one cluster and an observation from the other cluster.
• Average (UPGMA, Unweighted Pair-Group Method using Arithmetic Averages): the distance between two clusters is the average distance between observations from one cluster and observations from the other cluster.
• Mcquitty (WPGMA, Weighted Pair-Group Method with Arithmetic Mean): a procedure based on UPGMA using the sizes of clusters (numbers of elements) as weights.
• Centroid (UPGMC, Unweighted Pair-Group Method using the Centroid Average): the distance between two clusters is the distance between the cluster centroids.
• Median (WPGMC, Weighted Pair-Group Method using the Centroid Average): a procedure based on UPGMC using the sizes of clusters as weights.
Ward's minimum variance method aims at finding compact spherical clusters. The complete linkage method yields similar clusters. The single linkage method adopts a "friends of friends" clustering strategy. The other methods can be regarded as aiming for clusters with characteristics somewhere between those of the single and complete linkage methods. Note that the median and centroid methods do not lead to a monotone distance measure; cf. [18].
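The way clustering reveals hidden blocks in a correlation matrix can be demonstrated on synthetic data: two groups of strongly correlated variables are interleaved, then the dendrogram leaf order makes them contiguous again (the factor model and noise level are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

rng = np.random.default_rng(2)
# two latent factors -> two blocks of correlated variables, deliberately
# interleaved so that no block is visible in the original variable order
z1, z2 = rng.normal(size=200), rng.normal(size=200)
cols = [z1, z2, z1, z2, z1, z2]
X = np.column_stack([c + 0.1 * rng.normal(size=200) for c in cols])

R = np.corrcoef(X, rowvar=False)
D = 1.0 - np.abs(R)                         # dissimilarity from correlations
np.fill_diagonal(D, 0.0)
Z = linkage(squareform(D, checks=False), method="complete")
order = leaves_list(Z)                      # dendrogram leaf order

# reordering rows/columns of R by `order` makes each block contiguous:
# the first three leaves belong to one group (even or odd columns)
first3 = set(order[:3])
assert first3 in ({0, 2, 4}, {1, 3, 5})
```

Plotting `R[np.ix_(order, order)]` as a heatmap would show the two diagonal blocks, mirroring the role of Figure 4 in the text.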
We can distinguish three diagonal blocks in the covariance matrix structure based on Figure 4. The next section presents the possible covariance structures and the methods that can be used to identify the structure of the relevant blocks of the covariance matrix. Figure 4. Visualization of the correlation matrix after hierarchical clustering using the complete linkage method with Euclidean distance.

Statistical Background
To deal with the situation where the number of features m is greater than the number of observations n, we can impose a restriction on the covariance matrix structure. In other words, we assume that the covariance matrix has one of the most common structures, described in the next subsection. Then, methods for determining estimators of the covariance matrix under a given structure are presented.

Covariance Structures
Let us assume that the variances of the characteristics are homogeneous and the covariances between elements can be homogeneous or heterogeneous. In the literature, different covariance structures with homogeneous or heterogeneous elements are considered. We choose the most common structures, for which there exist methods of obtaining positive definite estimators in the sense of the Frobenius norm. Therefore, we consider the following possible covariance structures:
• compound symmetry (CS): Ψ_CS = σ²[(1 − ϱ)I_m + ϱJ_m], where I_m is the identity matrix and J_m is the matrix of ones. To ensure the positive definiteness of the matrix Ψ_CS, we assume σ² > 0 and ϱ ∈ (−1/(m − 1), 1); cf. [19].
• banded symmetric Toeplitz structure (T_p, p < m): Ψ_{T_p} = σ²(I_m + Σ_{i=1}^{p} ϱ_i H_i), where H_i is an m × m symmetric matrix with the i-th superdiagonal and subdiagonal elements equal to 1 and all other elements equal to 0. In this paper, we consider the Toeplitz covariance matrix with p = 1 and p = 2. The matrix Ψ_{T_1} is p.d. when σ² > 0 and |ϱ_1| < 1/(2 cos(π/(m + 1))). The conditions for positive definiteness of the estimator of the matrix Ψ_{T_p} (p > 1) cannot be expressed in explicit form; cf. [20].
• autoregression of order one (AR(1)): Ψ_{AR(1)} = σ²(ϱ^{|i−j|})_{i,j=1,...,m}, which is p.d. when σ² > 0 and |ϱ| < 1.
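The structures above can be generated and checked for positive definiteness numerically; a minimal numpy sketch, assuming the CS parametrization σ²[(1 − ϱ)I + ϱJ] and with illustrative parameter values:

```python
import numpy as np

def h_matrix(m, i):
    """H_i: symmetric m x m matrix with ones on the i-th super-/subdiagonal."""
    return np.eye(m, k=i) + np.eye(m, k=-i)

def cs(m, sigma2, rho):
    """Compound symmetry: sigma^2 * ((1 - rho) I_m + rho J_m)."""
    return sigma2 * ((1 - rho) * np.eye(m) + rho * np.ones((m, m)))

def banded_toeplitz(m, sigma2, rhos):
    """T_p structure: sigma^2 * (I_m + sum_{i=1}^p rho_i H_i)."""
    return sigma2 * (np.eye(m) + sum(r * h_matrix(m, i)
                                     for i, r in enumerate(rhos, start=1)))

def ar1(m, sigma2, rho):
    """AR(1) structure: sigma^2 * rho^{|i - j|}."""
    idx = np.arange(m)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

m = 6
is_pd = lambda M: np.all(np.linalg.eigvalsh(M) > 0)
assert is_pd(cs(m, 2.0, 0.6))                      # rho inside (-1/(m-1), 1)
assert not is_pd(cs(m, 2.0, -0.5))                 # rho below -1/(m-1) = -0.2
assert is_pd(banded_toeplitz(m, 1.0, [0.3]))       # T_1
assert is_pd(banded_toeplitz(m, 1.0, [0.3, 0.2]))  # T_2
assert is_pd(ar1(m, 1.0, 0.8))                     # |rho| < 1
```

The CS check illustrates the admissible interval for ϱ: for m = 6, ϱ = −0.5 lies below −1/(m − 1) = −0.2 and the matrix becomes indefinite.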

Identification Methods
In the paper [10], where nonsingular matrices were considered, besides the Frobenius norm, the entropy loss function was used as an identification method. This discrepancy function was also considered in [19] for the standard multivariate model, and in [21,22] or [23] for the doubly multivariate model. However, the entropy loss function requires nonsingularity of the observation matrix, and it cannot be applied to the considered dataset.
To find the most suitable covariance structure, we use the Frobenius norm ||S − Ψ||_F = √(tr[(S − Ψ)(S − Ψ)^⊤]). The formulae for the p.d. estimator of a structured covariance matrix using the Frobenius norm are described in [24] for CS, T_1 and AR(1) and in [20] for T_p. These formulae are as follows:
• CS structure: the estimates are the mean diagonal and mean off-diagonal elements of S, i.e., σ̂² = tr(S)/m and σ̂²ϱ̂ = δ/(m(m − 1)), where δ = tr[S(J_m − I_m)]. (3) Ref. [20] observed that the matrix obtained from (3), given in [24], may be indefinite. Thus, they proposed an algorithm for determination of the minimum of the Frobenius norm; cf. [20].
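A least-squares (Frobenius) fit of the CS structure amounts to averaging the diagonal and off-diagonal elements of S; a small numpy sketch, assuming the CS parametrization σ²[(1 − ϱ)I + ϱJ]:

```python
import numpy as np

def cs_frobenius_fit(S):
    """Frobenius (least-squares) projection of S onto the CS structure:
    common variance  = tr(S)/m          (mean diagonal element),
    common covariance = tr(S(J - I))/(m(m - 1))  (mean off-diagonal element)."""
    m = S.shape[0]
    I, J = np.eye(m), np.ones((m, m))
    a = np.trace(S) / m
    delta = np.trace(S @ (J - I))
    b = delta / (m * (m - 1))
    return a * I + b * (J - I)

rng = np.random.default_rng(3)
A = rng.normal(size=(30, 4))
S = A.T @ A / 30                              # a sample covariance matrix
Psi = cs_frobenius_fit(S)

# the projection is never farther from S than any other CS-type matrix,
# e.g. the scaled identity (CS with rho = 0)
other = np.mean(np.diag(S)) * np.eye(4)
assert np.linalg.norm(S - Psi) <= np.linalg.norm(S - other) + 1e-12
```

The optimality follows because {I_m, J_m − I_m} is an orthogonal basis of the CS subspace under the trace inner product, so averaging the diagonal and off-diagonal entries gives the orthogonal projection.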
• T_p for p > 1: in this case, the formulae for the estimator of the T_p structure cannot be given in explicit form. To determine the estimator, the algorithm proposed in [20], p. 78, can be used.
• AR(1) structure: to determine the estimator of the AR(1) structure, the system of equations obtained by equating to zero the partial derivatives of ||S − σ² Σ_{i=0}^{m−1} ϱ^i H_i||²_F with respect to σ² and ϱ should be solved, with H_0 = I_m. This system of equations provides a local minimum of the discrepancy function; cf. [24].
For CS and AR(1), the formulae given in [25] for the separable structure Ψ ⊗ Σ (Ψ: p × p, Σ: q × q) with q = 1 can also be used.
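Since the AR(1) fit has no closed form, it can be obtained by direct numerical minimisation of the squared Frobenius distance; a sketch using `scipy.optimize.minimize` (the starting values and bounds are illustrative choices, not taken from [24]):

```python
import numpy as np
from scipy.optimize import minimize

def ar1(m, sigma2, rho):
    """AR(1) structure: sigma^2 * rho^{|i-j|}."""
    idx = np.arange(m)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

def ar1_frobenius_fit(S):
    """Minimise ||S - Psi_AR(1)(sigma^2, rho)||_F^2 over (sigma^2, rho)."""
    m = S.shape[0]
    obj = lambda t: np.sum((S - ar1(m, t[0], t[1])) ** 2)
    res = minimize(obj, x0=[np.trace(S) / m, 0.0],
                   bounds=[(1e-8, None), (-0.999, 0.999)])
    return res.x

# on a noiseless AR(1) matrix the fit recovers the true parameters
S_true = ar1(5, 2.0, 0.5)
s2_hat, rho_hat = ar1_frobenius_fit(S_true)
assert abs(s2_hat - 2.0) < 1e-3 and abs(rho_hat - 0.5) < 1e-3
```

As noted in the text, such a numerical solution is only guaranteed to be a local minimum of the discrepancy function.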

Results
The first step of the structure identification is choosing the number of diagonal blocks b based on the heatmaps (for example, three or four). For a considered structures, the number of possible combinations on the diagonal is a^b. We consider four structures and additionally the scaled identity matrix, so 5^b combinations of diagonal block structures are possible. Off-diagonal blocks are rectangular matrices consisting of zeros or of nonzero values. Because the covariance matrix and its estimator are symmetric, we have 5^b · 2^{b(b−1)/2} possible combinations to consider.
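The count can be checked with a one-line function:

```python
# number of candidate block structures for b diagonal blocks, with a = 5
# choices per diagonal block (CS, T_1, T_2, AR(1), scaled identity) and a
# zero/nonzero choice for each of the b(b-1)/2 off-diagonal blocks
def n_structures(b, a=5):
    return a ** b * 2 ** (b * (b - 1) // 2)

assert n_structures(3) == 1000    # 5^3 * 2^3, the case considered below
assert n_structures(4) == 40000   # 5^4 * 2^6
```

The growth is rapid in b, which is why the number of diagonal blocks is fixed first from the heatmap.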
In the next part of this section, we show a feasible way to obtain the closest estimator in the sense of the Frobenius norm. Of course, if several estimators give similar Frobenius norm distances, we should select the structure that best explains the nature of the phenomenon under investigation.
For the considered data, we chose three diagonal blocks (1000 possible structures to consider) based on Figure 4. This visualization is determined by one particular permutation of R. Therefore, we have to use the same permutation for S, and we denote the permuted matrix by S*.
Firstly, we assume a block-diagonal structure of S*, denoted Σ_1. For different structures of the diagonal blocks Ψ_1 and Ψ_3, we compute the Frobenius norm distances between S* and Σ_1. However, the Frobenius norm is not an upper-bounded distance, so we cannot conclude whether we are close to or far from the true structure. Thus, we use the adjusted Frobenius norm ||S* − Σ||_F / (||S*||_F + ||Σ||_F), which takes values in the interval [0, 1]; cf. [21,25]. A small value (near 0) of the adjusted distance means that the considered structure is close to the true structure, and a value close to 1, that the considered structure is far from the true structure. It is worth noting that ||S*||_F = ||S||_F in our case.
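A normalised Frobenius distance of this kind can be sketched as follows; the normalisation ||S − Σ||_F / (||S||_F + ||Σ||_F) is one variant bounded in [0, 1] by the triangle inequality, and the exact form used in [21,25] may differ:

```python
import numpy as np

def adjusted_frobenius(S, Sigma):
    """Normalised Frobenius distance in [0, 1]:
    ||S - Sigma||_F / (||S||_F + ||Sigma||_F); 0 = identical matrices."""
    return np.linalg.norm(S - Sigma) / (np.linalg.norm(S) + np.linalg.norm(Sigma))

A = np.diag([1.0, 2.0, 3.0])
assert adjusted_frobenius(A, A) == 0.0                 # identical matrices
assert adjusted_frobenius(A, -A) == 1.0                # maximally distant pair
assert 0.0 < adjusted_frobenius(A, np.eye(3)) < 1.0    # intermediate case
```

Unlike the raw Frobenius norm, this quantity allows statements such as "close to" or "far from" the true structure on an absolute scale.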
The adjusted Frobenius norm distances between S* and Σ_1 with different (optimal) diagonal blocks are presented in Table 1. The smallest distance between S* and Σ_1 is observed for Σ_11 and Σ_33 having CS structures, with parameters given in Table 2. It is worth noting that Σ_1 is a positive definite matrix, because all of its diagonal blocks are p.d.
It should be emphasized that the matrix Σ_3 can be indefinite, so we should check the positive definiteness of the obtained estimator. In the considered case, Σ_3 is p.d. If Σ_3 were not positive definite, we would modify the off-diagonal blocks to obtain a p.d. estimator. We propose to multiply the off-diagonal blocks by a number from the interval (0, 1) such that the estimator is p.d. and the Frobenius norm is as small as possible.
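The proposed repair step can be sketched as a grid search over the scaling factor on a toy 2-block example (the grid step and block sizes are illustrative choices):

```python
import numpy as np

def shrink_offdiag_to_pd(D1, D2, B, step=0.01):
    """Scale the off-diagonal block B by the largest c in (0, 1] (searched on
    a descending grid) such that the assembled matrix is positive definite."""
    for c in np.arange(1.0, 0.0, -step):
        Sigma = np.block([[D1, c * B], [c * B.T, D2]])
        if np.all(np.linalg.eigvalsh(Sigma) > 0):
            return Sigma, c
    raise ValueError("no scaling in (0, 1] gives a p.d. matrix")

D1, D2 = np.eye(2), np.eye(2)
B = np.full((2, 2), 0.9)      # off-diagonal block too strong: indefinite at c = 1
Sigma, c = shrink_offdiag_to_pd(D1, D2, B)

assert c < 1.0                                  # shrinking was necessary
assert np.all(np.linalg.eigvalsh(Sigma) > 0)    # result is p.d.
```

Taking the largest admissible c keeps the Frobenius distance to the unshrunk matrix as small as possible on the grid.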
Let us recall that we were looking for an estimator of S*, which is the permuted S. To get the estimator of the matrix S, we should invert the permutation, which in this case means using the same permutation again: S = (S*)*. The estimator of S has the same eigenvalues as the estimator of S*, since a permutation similarity transformation preserves the eigenvalues. Therefore, the estimator of S is also p.d.
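The invariance of the eigenvalues under permuting and un-permuting can be verified directly (toy matrix; in the general case the inverse permutation is obtained with `argsort`):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 5))
S = A @ A.T                        # a symmetric p.d. matrix

perm = np.array([3, 0, 4, 1, 2])
S_star = S[np.ix_(perm, perm)]     # permuted matrix S* = P S P^T

inv = np.argsort(perm)             # inverse permutation
S_back = S_star[np.ix_(inv, inv)]
assert np.allclose(S_back, S)      # inverting the permutation recovers S

# eigenvalues are invariant under the permutation similarity
assert np.allclose(np.linalg.eigvalsh(S), np.linalg.eigvalsh(S_star))
```

Hence positive definiteness established for the estimator of S* carries over to the estimator of S.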

Discussion
The problem was considered in [10] for three selected nonsingular subsets of the barley data. However, the whole dataset, which is usually singular in the high-dimensional case, is of greater interest. The results for the selected subsets are not comparable with the results for the whole dataset. For the first considered diagonal structure Σ_1, we obtained the smallest adjusted Frobenius norm equal to 0.81877. Adding one more diagonal structure, the distance is 0.79225. Taking into account nonzero off-diagonal blocks, the best result is 0.37952, which is a significant improvement.
The structure analysis of the estimator Σ_3 shows that some metabolomic groups are identically correlated, and the number of parameters was reduced from 781 × (781 + 1)/2 to 9. Obviously, using more diagonal blocks, we estimate more parameters of the covariance matrix and should obtain a closer estimator in the sense of the Frobenius norm.

Conclusions
When analyzing experiments in which many characteristics are measured, the existing dependencies between the variables should be taken into account. Covariance matrix identification allows us to expand our knowledge of feature dependencies and enables data analysis with a more precise statistical model with a smaller number of covariance parameters. It is particularly important in high-dimensional problems, where standard methods are not always available, since the matrix S is, in almost all cases, singular. An estimator of the covariance matrix with good properties is crucial in many methods of statistical analysis, e.g., principal component analysis (PCA), linear or quadratic discriminant analysis (LDA or QDA), regression analysis, and the analysis of independence and conditional independence relationships between components.
Visualization of correlation matrices using heatmaps is useful in finding block-structured estimators and facilitates the selection of an appropriate structure for the statistical model. In this paper, metabolomic data obtained by the GC-MS technique were used. However, our approach can be applied to any type of high-dimensional data, e.g., data obtained by different omics techniques, namely metabolomics, proteomics, lipidomics, genomics and transcriptomics, as well as common phenotypic data. Moreover, the obtained covariance matrix estimator can be used in the aforementioned methods of statistical analysis.
Author Contributions: All authors equally worked on manuscript and computations. All authors have read and agreed to the published version of the manuscript.