A New Ensemble Method for Detecting Anomalies in Gene Expression Matrices

: One of the main problems in the analysis of real data is often related to the presence of anomalies. Namely, anomalous cases can both spoil the resulting analysis and contain valuable information at the same time. In both cases, the ability to detect these occurrences is very important. In the biomedical ﬁeld, a correct identiﬁcation of outliers could allow the development of new biological hypotheses that are not considered when looking at experimental biological data. In this work, we address the problem of detecting outliers in gene expression data, focusing on microarray analysis. We propose an ensemble approach for detecting anomalies in gene expression matrices based on the use of Hierarchical Clustering and Robust Principal Component Analysis, which allows us to derive a novel pseudo-mathematical classiﬁcation of anomalies.


Introduction
Real datasets often contain observations that behave differently from the majority of the data. If an occurrence differs from the dominant part of the data, or if it is sufficiently unlikely under the assumed data probability model, it is considered an anomaly or outlier.
Outliers may be caused by errors, but they may also result from exceptional circumstances or belong to a different population of data. On the one hand, anomalies may adversely affect the conclusions drawn from data analysis; on the other hand, they may contain important information. Thus, outlier detection is about the interest in the outliers themselves or the fact that they may contaminate the subsequent statistical analysis. In statistics, an outlier is an observation that falls outside the overall pattern of a distribution [1]. However, it is difficult to determine how much a value must deviate to be called an outlier. Robust statistics are designed to detect outliers by first fitting the majority of the data and then flagging data points that deviate [2]. In the biomedical field, the correct identification of outliers is of great importance: depending on the type of analysis to be performed, biologists can decide whether or not these data should be removed.
In this work, we address the problem of detecting outliers in Gene Expression Profiling (GEP) data, focusing on microarray data containing gene expression values for a given number of samples labeled with a biological class (tumor type or experimental condition). In this type of data, there are generally two main types of outliers, which refer to the case where the instances are genes or samples, respectively [3]. The former is present when a gene has abnormal expression values in one or more samples from the same class, whereas the latter can be dually seen as samples that belong to a different class present in the data (often referred to as mislabeled samples) or as samples that do not belong to any class present in the data (called abnormal samples or outliers).
The origin of these outliers may be ambiguous; they may result from an undiscovered biological class, poor class definitions, experimental error, or extreme biological variability. Note that, when we say that an anomalous sample does not belong to its class, we are not necessarily disputing the validity of its designation. Indeed, a sample may still be a tumor, but at the same time it could have expression levels that are significantly different from those of other tumor samples. Using learning models to analyze data affected by outliers may even lead to incorrect conclusions. In the past, the impact of outliers was rarely considered when analyzing data from standard microarrays. According to the new current, outlier detection is used as preprocessing for data cleaning. However, it is essential to emphasize that, in many cases, outliers may simply be the result of natural variability in the data.
In this paper, we propose a novel ensemble approach that combines Hierarchical Clustering and Robust Principal Component Analysis to detect outliers in GEP data (these two techniques are generally not used for this purpose in this context), with an additional decision-making model (the anomaly detection tool) that can provide a pseudomathematical classification of outliers based on their biological nature.
The paper is organized as follows. Section 2 gives a brief overview of anomaly detection algorithms, also focusing on the microarray context, while Section 3 illustrates the proposed ensemble mechanism. Section 4 describes the experimental results obtained when applying the proposed methodology to six different datasets (two artificial and four real medical databases), discussing also some biological aspects and considerations. In Section 5, comparisons with the most commonly used anomaly detection techniques are reported with some discussions on the advantages of the proposed approach. Finally, conclusions and directions for future research are outlined in Section 6.

Methods for Outliers Detection
Detecting anomalies in real-world data is a difficult problem that can be addressed using a variety of mathematical techniques, ranging from standard univariate strategies to more comprehensive multivariate analysis. A comprehensive and complete review of the most commonly used methods can be found in References [4,5]. In the context of a multivariate approach, clustering and Low-Rank reduction mechanisms proved to be the most important. Clustering is an unsupervised learning mechanism capable of finding structures and similar patterns in collections of unlabeled data [6]. A cluster refers to a group of objects that are "similar" to each other (with respect to some measure) and "dissimilar" to objects belonging to other clusters.
Thus, the outliers are those samples that belong to a separate micro-cluster because they are far away from most of the other data. They are usually identified by increasing the number of clusters and for this reason it is necessary to define a measure of dissimilarity between clusters. In the analysis of gene expression data, the correlation-based measure (i.e., the one based on Pearson Correlation Coefficient) is considered the most appropriate dissimilarity measure when clusters of observations with the same overall profiles are obtained. Correlation works well for gene expression in clustering of samples and genes, although Pearson's correlation is quite sensitive to outliers. In samples clustering, this issue is irrelevant because the correlation is among thousands of genes, whereas when genes have to be clustered, it is important to be aware of the possible impact of outliers.
Assigning weight values to samples [7] is a possible way to improve the performance of the Pearson distance in handling outliers. On the other hand, distance is not the only choice to be made in clustering algorithms; the appropriate mechanism that defines how to separate two different clusters is also a task to be addressed. To tune these hyperparameters, either the Cophenetic Correlation Coefficient (CCC) and the Silhouette coefficient [8] could be adopted. The CCC expresses the correlation between the original dissimilarity matrix and the one derived based on the classification. Usually, a CCC ≥ 0.8 means a good match, while CCC < 0.8 indicates that the dendrogram is not a good representation of the relationships between the objects.
We give below the definition of the Silhouette coefficient, used to validate the quality of the clustering. Based on the clustering vector and the set of distances, the algorithm computes the dissimilarity of a point x i to its current class and the lowest dissimilarity of the point to other classes, through a(x i ) and b(x i ), respectively, defined as follows, for all Then, the Silhouette coefficient is set to zero when |C i | = 1 by definition and usually ranges in [−1, 1], in the other cases it can be defined as Its maximum value 1 indicates a better match with the current cluster, while −1 means that the point actually belongs to the other class or a so-called "neighboring cluster". Table 1 shows the range values of the Silhouette coefficients associated with the corresponding structures and usually adopted in literature panorama [9]. Table 1. Structures corresponding to particular Silhouette coefficient ranges.

SC Interpretation
0.71-1.0 A strong structure was found. 0.51-0.70 A reasonable structure was found. 0. 26-0.50 The structure is weak and may be artificial. <0. 25 No substantial structures have been found.
Among Low Rank reduction mechanisms, Robust Principal Component Analysis (ROBPCA) has been widely used in GEP data analysis to identify subgroups with specific biological characteristics that correlate with different clinical behaviors [10][11][12]. It combines the strengths of Projection-Pursuit techniques (PP) [13] and robust covariance estimation. The former is used to reduce initial dimensionality, while the latter, particularly the Minimum Covariance Determinant (MCD) estimator, is used to obtain a smaller data space. To our knowledge, however, this is the first time that ROBPCA has been applied to outlier detection in microarray data, although it has recently been used for the analysis of RNA-seq data, which requires a more complex sample preparation protocol and analysis compared to microarray data [14].
Formally, consider a data matrix X n,p ∈ R n×p , where n indicates the number of the observations, and p the original number of variables, and the ROBPCA method proceeds in three main steps: 1.
the data is preprocessed so that the transformed data lies in a subspace whose dimension is at most n − 1; 2.
a preliminary covariance matrix S 0 is constructed and used to select the number of components k that are subsequently retained, resulting in a k-dimensional subspace that is well fitted to the data; 3.
data points are projected onto this subspace, where their location and scatter matrix are robustly estimated, from which their k non-zero eigenvalues 1 , . . . k are computed. The corresponding eigenvectors are the k robust principal components.
Let P p,k be the p × k eigenvector matrix (with orthogonal column vectors); the location estimate is denoted by the p-variate column vectorν and called the robust center. The scores are the entries of the n × k matrix The k robust principal components generate a p × p robust scatter matrix S of rank k given by S = P p,k L k,k P p,k , where L k,k is the diagonal matrix with the eigenvalues 1 , . . . k . Similar to classical PCA, the ROBPCA method is locationally and orthogonally equivariant, properties that are non-trivial for other robust PCA estimators [2,15]. It should be emphasised that one of the advantages of using PCA-related mechanisms is a feasible outlier classification that takes into account the positions of outliers with respect to the projected subspace and also provides a useful interpretation of these positions.
An example of this behavior is shown in Figure 1, where four types of points can be distinguished, depending on the location of the observation. These points can be categorized as: Regular Observations (ROs), which form a homogeneous group near the PC subspace generated by the principal components; Good Leverage Points (GLPs), which are at the same plane as the PC subspace but distant from the ROs; Orthogonal Observations (OOs), which have a large orthogonal distance to the PC subspace, but whose projection is on the PC subspace; and, finally, Bad Leverage Points (BLPs), which have a large orthogonal distance such that the projection on the PC subspace is away from ROs. To understand and quantify how far an observation is from the center of the ellipse, defined by ROs (score = 0 was chosen as reference), two distances can be used: the Score Distance, SD i , and the Orthogonal Distances, OD i , defined as follows: for i = 1, . . . , n, with the eigenvalues of the dispersion matrix MCD, and t ij the robust scores for each j = 1, . . . , k, is a measure of the distance between an observation belonging to the PC k-dimensional subspace and the origin of this subspace; and where each t i is the i-th row of T n,k , andμ is the robust estimate of the center, which measures the deviation (i.e., lack of fit) of an observation from the PC k-dimensional subspace. Based on these measures, the diagnostic plot (DD-plot or outlier map) can be plotted with SD i on the horizontal axis and OD i on the vertical axis to distinguish between ROs and the three types of outliers. To classify all observations, two cutoff lines are drawn according to the data. Since it is known that the distance adopted to construct the tolerance ellipsoid (the squared Mahalnobis distances for normally distributed scores) follows a χ-squared distribution approximately, the cutoff value of the horizontal axis is obtained from the 0.975 quantile of this distribution with k degrees of freedom: On the other hand, even if the distribution of the OD i for the generic i, it is not known exactly, literature results allow to approximate them by a normal distribution with particular mean and variance estimated using the MCD [10]. Defining in this way the cutoff for the vertical axis as: whereμ andσ are the MCD estimates for the mean and standard deviation of the above normal distribution, and z 0.975 is the 97.5% quantile of the Gaussian distribution. Figure 2 illustrates an example of the outlier map described. As with standard PCA approaches, their robust variants require a criterion for selecting the number of principal components. We adopt the mechanism proposed in Reference [10], which selects k components according to the following empirical Rayleigh rule: where j for j = 1, . . . , k are the eigenvalues of S 0 , the robust covariance matrix of the data and r the rank of S 0 . However, there are some other criteria proposed in literature for selecting k, ranging from the use of the formula k 1 ≥ 10 −3 to the use of more complex information based criteria and cross-validation approaches [16][17][18].
In the context of gene expression array analysis, state-of-the-art standard methods for detecting anomalies in microarrays, which are considered reference methods, are implemented in the Bioconductor package arrayQuality [19,20]. These perform univariate analysis based on two independent methods that assign a rank to each sample. The first technique takes one sample at time and compares its probability distribution to that of the entire dataset using a Kolmogorov-Smirnov statistic, assuming the intrinsic similarity measure is that based on the statistical distribution. Instead, the other technique simply ranks each sample according to that sample's total information across all genes (by summing over expression levels). In both approaches, the outlier sample is selected at the end by performing the standard univariate detection method according to the rank score.

Proposed Approach
To detect anomalous samples in microarray datasets and provide a possible feasible outlier classification, we combined Hierarchical Clustering (HC) and Robust PCA performing them sequentially. The ensemble approach obtained uses the first technique to derive a preliminary view of the dataset, where the choice of the distance is determined by the CCC and the clustering is validated by the Silhouette coefficient. The second technique for characterizing the nature of the outlier is based on the position of the sample on the DD-plot. The workflow of the proposed approach detailed with the packages used to perform each block is depicted in Figure 3. The proposed framework identifies different categories of samples based on the fitting of the data by the two techniques performed sequentially and on the classification in the Section 2, that labels data samples as ROs, GLPs, BLPs, and OOs. In addition, an Alert Sample-type (AS) is included: this is identified as an observation with a lower silhouette (below a pre-selected threshold) and they should not indicate strictly anomalous samples, but samples not properly fitting the data. The Algorithm 1 describes the identification procedure that encodes the sample classification mentioned above.

Experimentation
This section is devoted to evaluating the performance of the proposed methodology on both artificial and real biological datasets. All numerical results were obtained testing the proposed approach in the R environment [21] run on a 16Gb RAM, I7 octa core machine.
Concerning the HC, as discussed in Section 2, we have used this general framework with the Pearson distance, while the CCC was adopted to select the most appropriate method. On the other hand, referring to the approach Low-Rank, the number of PCs was selected according to the empirical criterion in (10).

Synthetical Datasets
Two artificial datasets were constructed to investigate different key aspects of this study. The first dataset aims to simulate the typical structure of the cancer dataset [22][23][24] to estimate the biological aspects of the chosen techniques. The second artificial dataset was designed instead to investigate, from a mathematical point of view, the nature of the outliers and to try to mimic what happens in a biological context.
As a first step for our study, we simulated a typical cancer dataset with known outliers as proposed in Reference [24]. Each dataset contains two clearly distinguishable sample classes. The abnormal samples do not belong to either class or are simply mislabeled.
We have 1000 genes in the rows and 100 samples in the columns (50 for each class). The first 900 rows are drawn from the same normal distribution for both classes, and the remaining 100 were drawn from different distributions for samples of classes C 1 and C 2 , respectively. In addition, three samples of class C 1 were swapped with three samples of the second class C 2 . Finally, the last sample of each class was replaced by one with a different distribution (e.g., the Poisson distributions with λ 1 = 30 and λ 2 = 35, respectively). A graphical representation of this structure can be found in Figure 4.  9  11  13  15  17  19  21  23  25  27  29  31  33  35  37  39  41  43  45  47  49  51  53  55  57  59  61  63  65  67  69  71  73  75  77  79  81  83  85  87  89  91  93  95  97 Figure 5 give the CCC value and the Silhouette index, respectively, which are used to validate the quality of the clustering. In particular, in Table 2, the CCC values for different methods allow the selection of the average linkage as the best method for this clustering. The clustering associated with the highest CCC value was then validated by the Silhouette coefficient, which is equal to 0.35. As can be seen in Figure 5, the two techniques give the same results, in particular HC positions the mislabeled outliers in the right class, while the abnormal type outliers form a separate cluster. Subsequently, the ROBPCA method detects more qualitative information, as described in Section 2. In this case, the "mislabeled" samples are labeled as GLPs, as shown in Figure 6.
A second artificial dataset was created to check whether outliers detected by the proposed ensemble methodology can be endowed with some statistical properties.
From a statistical point of view, an outlier is seen as an observation that lies outside the overall pattern of a distribution, so the dataset was constructed so that most samples satisfy a normal distribution, with only a few anomalous samples generated with probability distributions that do not verify the Central Limit Theorem (CLT). Specifically, the data matrix X n,m , with n = 1000 and m = 300, s = m − 4 has normal X :j ∼ N (0, 1) (for j = 1, . . . , s) samples and only four anomalies.  We reproduce a bimodal trend of the sampling distributions by setting 100 rows in X n,m to a normal distribution N (3, 0.5). As for the four outliers, two of them retain the normal distribution, with different mean and variance (1 and 1.1 for the means and 1 and 0.9 for the values of the second pulse); while the remaining two are generated without central t-students with 2 and 2.1 degrees of freedom and 2 and 2.2 without central parameter, respectively.
The results obtained by applying the proposed ensemble method are described below. The Silhouette coefficient is equal to 0.44, confirming the fit of clustering to synthetic data; the CCC is equal to 0.97, confirming that the choice of Pearson distance with the Average method generates a clustering that fits to the data very well. Looking at the DD plot shown in Figure 7, one can notice that the two outliers with the t-student distribution are called BLPs, while the two samples with normal distributions different from the other samples are OOs. For a more detailed discussion of this behavior, see Section 5.

Real Dataset
To test the effectiveness of the proposed ensemble methods in the context of microarray outlier detection, four real cancer datasets were used. These cancer datasets were selected to verify the robustness of the approach in different loading situations where either solid and biologically similar samples are outliers.
We generated four different datasets (hereafter referred to as Datasets A, B, C, and D, respectively) from real tumor gene expression profiling data characterized by patients with the same cancer type. In each dataset, we added some "true outliers" derived from patients with different cancer diagnoses, and attempted to simulate a gradient of biological distance between the "true outliers" and all other samples. Specifically, Dataset A, C, and D were added with samples derived from different tumor types or cell types, while Dataset B was added with samples having the same cell of origin of the main dataset. Specifically, Dataset A consisted of 591 samples of diffuse large B-cell lymphoma (DLBCL), a particular type of aggressive Non-Hodgkin Lymphoma (NHL) derived from B cells, and two samples of ovarian cancer (OC) [12,[25][26][27][28][29][30][31].
Dataset B consisted of 591 DLBCL and 6 samples of other NHL subtypes also derived from B cells (two Follicular Lymphomas FL, two Mantle Cell Lymphomas MCL , and two Burkitt Lymphoma BL) [32][33][34]. In this dataset, the choice of the Pearson's distance is essential because the FL, MCL, and BL outliers differ little from the other samples as they come from the same type of tumor. Dataset C was created by grouping three types of hematological cancers originating from different types of hematological cells, namely 448 cases of Chronic Lymphocytic Leukemia (CLL) and three cases of T-Acute lymphoblastic leukemia (T-ALL) and three cases of Myelodysplastic Syndrome (MDS). Finally, Dataset D was created by adding three cases of breast cancer samples to 448 CLL [35]. Table 3 summarizes the used datasets and their main characteristics. Raw data (In particular, samples related to DLBCL are associated to GSE10846, GSE132929, GSE23501, GSE34171, GSE87371 and GSE98588; samples of ovarian cancer to GSE9891; whereas the other six samples in Dataset B were randomly chosen from GSE12195,GSE55267, GSE93261, GSE26673, and GSE21452. For the other two datasets, the baseline of 448 CLL with the particular six samples are associated to GSE13159, while the remaining three sample of breast cancer were from E-MTAB-2501 series.) were downloaded from Gene Expression Omnibus and Array Express databases and preprocessed removing background, normalizing and batch effect correction procedures (needed when raw data come from different series/laboratories). In detail, these operations allow to remove background from native files (such as CEL extension files), normalize arrays in order to have comparable samples, and correct batch effects due to systematic technical differences (such as Laboratory, time, day, or instrument used for the biological experiment [37][38][39]). Table 4 reports the quality metric values of clustering and the number of outliers identified in the numerical experiments, while Table 5 summarizes the cardinality of each cluster and the associated averaged Silhouette (which indicates the contribution of each cluster for every dataset). High value of the CCC, as well as the values of the Silhouette coefficient, confirm the presence of a reasonable clustering structure.
It should be emphasized that the detected samples correspond to the expected outliers in all datasets (i.e., the two samples with a solid ovarian tumor in the Dataset A, the six samples divided in FL, MCL, and BL in Dataset B, three in Dataset C, and the remaining three samples of Breast cancer in Dataset D) and that the majority of them are in the quadrant of the ROBPCA and then classified as BLPs with the exception of the three samples from Dataset C, which were classified as OOs. In addition, the approach identifies other "not expected" samples as outliers: these need further investigation from a biological point of view. Table 4. Results obtained for each dataset. We highlight that, in Dataset C, the second cluster has seven samples instead of the expected six mislabeled samples. In the first analysis, one could think of an error in the clustering technique; in fact, the study of the degradation of the samples suggests that this sample shows a similar behavior with respect to the Normalized Unscaled Standard Error (NUSE) graph as the mislabeled samples of the cluster in which it was inserted.  For readability, we discuss here only the results obtained when considering Dataset A, and refer the reader to the Appendices A-D for a detailed discussion of the full results for all datasets. Considering the distributions and positions on the DD-plot (see Figure 9) of the samples identified as novel outliers (i.e., expected a priori to be an outlier), they are correctly referred to as BLPs, ASs, and OOs. To assess the results of our approach, several density plots are performed. First, the density plot in Figure 10 illustrates the different distribution between samples labeled according to their cluster membership. Figure 11 illustrates the density of the individual outliers obtained accordingly to their classification. The mislabeled samples (in this case the two Ovarian Cancer) are depicted in light blue and blue; BLPs are the samples in dark green and pink, whereas OOs are reported in green color.

Biological Analysis
We used the four datasets to detect the outliers using both HC and Robust PCA. Interestingly, we observed that both methods were able to identify the "true outliers"; however, we also detected "unanticipated outliers" within the group of main samples. To analyze these results, we examined the array quality of each outlier by analyzing RNA degradation plots that analyze mean intensity in relation to probe numbers and help identify samples with low RNA quality; relative log expression (RLE) and Normalized Unscaled Standard Error (NUSE), implemented in the AffyRNAdeg R package. The former is calculated by subtracting the median gene expression estimate across arrays from each gene expression estimate, while the latter provides a measure of the precision of its expression estimate on a given array relative to other arrays in the batch [40]. In Dataset A, two DLBCL samples (GSM844275 and GSM2601431) were classified as BLPs along with the two "true" OC outliers and showed an irregular RNA degradation plot, NUSE and RLE, compared to the other outliers, suggesting poor array quality. Similarly, in Datasets B, the same DLBCLs were classified as BLPs, along with the FL, MCL, and BL "true outlier" samples, which instead did not show poor array quality according to RNA degradation plots, NUSE and RLE. Note that we also detected "non-expected outliers" in the CLL groups of datasets C and D, which also showed array parameters of poor quality. Interestingly, "true outliers" in Dataset C and D were characterized by good quality. We also detected OOs in Dataset A and B that were characterized by a single DLBCL case, while, in Dataset C, three MDS "true outliers" were identified as OOs outliers. All OOs showed array parameters of good quality. Finally, we identified another outlier category called "Alert" in all analyzed datasets that were displayed in the threshold range of BLPs in the DD plots, these samples were characterized by ambiguous quality array parameters. We want to specify that Alert samples are samples that may have a pre-degraded state. The algorithm signals them, and then it is up to the domain expert to verify the actual degradation state. Detailed figures regarding this point can be found in the Supplementary Materials.

Comparison with Existing Methods
To strengthen the study on the reliability of our method, we compared it with standard techniques proposed in the microarray literature (as explained in Section 2): (i) the method based on Kolmogorov-Smirnov distribution, hereafter referred to as KS, and (ii) SUM (Sum operation by column), the mechanism that assumes total weight among genes. As before, only the results for Datasets A are reported here (see Appendices A-D for the complete overview of all biological datasets). KS individuates 34 outliers: this is a high number compared to the nature of biological outliers, which are expected to be very small. The reason is probably that the KS procedure is based on a probability distribution distance. In contrast, SUM finds several outliers and only one outlier compared to our approach. Figure 12 shows the Euler-Venn diagram (drawn using an online tool for comparing and visualizing biological lists with area-proportional Venn diagrams) [41] for the sets of outliers detected by the three mechanisms on Dataset A, and, Figure 13 shows the study of the degradation of the additional samples obtained by the comparison techniques.
Based on these results, we can assume that our approach combines the concept of Bad Leverage points with the concept of degradation, an aspect that the other comparison techniques do not have. Indeed, it can be observed in the figure that the degraded samples are the ones identified by our approach, while the others are not degraded.

Discussion
An ensemble mechanism combining Robust PCA and Hierarchical Clustering with opportune distances has been proposed to search for anomalies in gene expression matrices in a more reasonable way. The strength of this method is to provide a pseudo-classification model of the outliers. Figure 14 shows a possible interpretation about the reason why each sample is in a particular quadrant of the PC plane.
The figure shows that from left to right we have the samples that have a distribution from most similar to least similar. From top to bottom, we have the samples that have a degradation status from highest to lowest. The threshold is plotted in red. According to the above discussion, we could assume that outliers that are of "low quality" are considered as very extremely bad outliers above a certain threshold. On the contrary, the "mislabeled" type outliers are on the borderline between the "orthogonal" type outliers and the bad leverage points. In general, the more the distribution of the anomalous samples deviates from the majority of the data, the more the outlier is on the right side of the plot. From a biological perspective, our results provide a reliable tool to better understand the outlier properties. It has been widely demonstrated that a small percentage of publicly available arrays are of poor quality and that these samples can affect downstream analysis. When an outlier is detected, it is likely to be associated with poor quality values. Beyond analysis of degradation and quality data, our combined approach to outlier detection is able to identify samples with putative biological significance that may be worth investigating in terms of putative biological significance and clinical characteristics. It may also be interesting to integrate these observations with other high-throughput analyses, such as genome sequencing, to investigate what changes might be responsible for such peculiar transcriptional patterns.
By the way, a very important aspect is to establish the threshold beyond which to consider a sample degraded or not. This problem is configured as a search and optimization problem of the optimal hyperparameter [42] for which our ensemble method can be considered a possible decision model for the search for anomalies in this type of data.

Conclusions and Future Works
In this work, we present a new ensemble approach to anomaly detection that combines HC and Low-Rank methods. It is configured as an additional tool and allows deriving a pseudo-mathematical classification of outlier samples in GEP data focusing on microarray.
Since recent work has focused only on RNA-seq data [14], we will extend our approach to be interchangeable between different platforms, such as RNA-seq and GEP from Nanostring Technologies. The preliminary experimental results performed using the proposed approach have shown that it is possible to pseudo-classify the outliers based on their nature. Future work should be carried out to identify the thresholds within which it is possible to associate the mathematically defined outlier with the biological outlier. The results obtained are quite promising and indicate the usefulness of the proposed mechanism as a preprocessing tool for the analysis of datasets that need further investigation. For example, the proposed mechanism shows that it is able to eliminate degraded samples, perform analysis on specific samples, and possibly reclassify mislabeled outlier types. Moreover, the proposed method was preliminary used in same recent prognostic studies related to LXR (Liver x receptor) protein [43], confirming its ability to identify some samples to be removed. This result helped to improve the discriminating prognosis ability of considered patient with respect to clinical outcome and the confirmed the possible applicability of the proposed procedure also in prognostic biological context. Future research should be devoted to the construction of a new decision model that incorporates the proposed ensemble mechanism as a data preprocessing method to identify the anomalies, and integrate the anomaly detection tool into the context of microarrays for searching and classifying samples that can generate new biological hypotheses.

Conflicts of Interest:
The authors declare no conflict of interest.
Sample Availability: Samples are available from the authors.

Abbreviations
The following abbreviations are used in this manuscript:  Figure A1 collects the figures illustrating the quality measures that have been previously detailed when Dataset A is used. Figure A1. Summary of quality array measures for Dataset A. Figure A2 collects the figures illustrating the quality measures used when Dataset B is analyzed. This dataset was derived from Dataset A only replacing the ovarian tumor samples with the samples with FL, MCL, and BL tumors. As can be seen from the Robust PCA, the MCL, FL, and BL tumor type samples are outliers. Specifically, these are Bad Leverage Outliers, as they are in the first quadrant. For the samples of FL, MCL, and BL tumors shown in blue, a different distribution is observed than for the other samples shown in green. The previously found outliers are shown in red. Degradation analysis was also performed in this case, giving the same results for the BAD samples, but, in this dataset, two of the three Alert samples found by our method are in a pre-degradation state. Figure A2. Summary of quality array measures for Dataset B. Figure A3 illustrates the degradation plot of outliers obtained by the KS and SUM techniques and the Euler-Venn diagrams comparing the results of the proposed method with those of standard techniques. The same consideration drawn for Dataset A can be done also for these pictures.  Figure A4 collects the figures illustrating the quality measures used when Dataset C is analyzed. As can be seen from the DD-plot, the samples show a more dispersed behavior than in the previous cases. We can observe that the ensemble method identifies 3 Bad Leverage points, depicted in red in the plot. It detects the mislabeled samples colored in blue and between the samples at the border between the Bad Leverage points and the Good Leverage points, where there are some Alert samples that result in a pre-degradation state. Figure A5 illustrates the degradation plot of outliers obtained by the KS and SUM techniques and the Euler-Venn diagrams comparing the results of the proposed method with those of standard techniques.   Figure A6 collects the figures illustrating the quality measures used when Dataset C is analyzed. Similar to Dataset C, the DD plot in this case also shows a dispersive scattering behavior of the samples. The ensemble method identifies 3 Bad Leverage points, depicted in red in the plot. It identifies the Mislabeled Samples in blue between the samples on the border between the Bad Leverage points and the Good Leverage points, where there are some Alert samples. We get the same reasoning as in the previous case.

Appendix D. Dataset D
Finally, Figure A7 illustrates the degradation plot of outliers obtained by the KS and SUM techniques and the Euler-Venn diagrams comparing the results of the proposed method with those of standard techniques.