1. Introduction
The multi-scale character of biological regulation typically produces one (or a few) main modes explaining a major part of the variance (size component), squashing potentially interesting (shape) components into the noise floor. These minor components can be erroneously discarded as pure noise by conventional selection methods, which are based on identifying the main order parameters that organize the system [1].
This phenomenon is very common in applications of principal component analysis (PCA) to biological data [2]. The component explaining most of a data set’s variance often reflects well-known regularities. A classic example is the first principal component (PC) in microarray data sets, which frequently captures the gene expression profile characteristic of a specific tissue [3]. In contrast, biologically relevant information—such as the effects of pathology or cell-fate transitions—may be hidden in minor components that account for only a small fraction of the total variance.
Moreover, systematic variations introduced by laboratory conditions, reagent batches, or personnel differences—commonly referred to as batch effects [4]—can generate artificial major components that obscure biologically meaningful signals. It is important to emphasize, however, that this is not always the case: the presence of one or a few components explaining nearly all of the system’s variance is not necessarily an artifact, but may instead reflect the existence of a dominant order parameter that organizes the system’s variability [5]. Similar considerations hold for biomedical applications such as imaging studies, where the main order parameter corresponds to anatomical organization, while diagnostically relevant information is embedded in subtle image details [6].
In a purely ‘supervised learning’ frame—such as medical diagnosis—this issue can be readily addressed by identifying biologically relevant minor components through their correlation with external variables of interest. These variables correspond to labels assigned to the statistical units on the basis of their biological or clinical nature, as determined by independent criteria (e.g., healthy/unhealthy, active/inactive). In such cases, the relevant components are those enabling a classification consistent with the a priori labels (taken as the gold standard), regardless of the amount of variance they explain [7].
In the case of exploratory, hypothesis-generating studies, this straightforward approach is not applicable. Alternative strategies are required to assign biological meaning to minor components and/or sparsely populated clusters.
In this study, we propose a novel strategy applied to a data set derived from single-cell RNA sequencing (scRNA-Seq). The innovation lies in the integration of a non-linear clustering procedure (Convex Hull-based) with a structural-semantic interpretation of minor components, and in the introduction of a new statistical index—termed ‘Lacunarity’—to determine the optimal partitioning of the data set.
2. Data Analysis Strategy
The higher the proportion of variance explained by principal components, the higher their clustering efficiency—an empirical observation familiar to any data analyst. This stems from the coherence between the K-means and PCA procedures: the continuous solutions of the discrete K-means cluster membership indicators correspond to the principal eigenvectors of the covariance matrix [8]. Similar considerations hold for other clustering procedures, both hierarchical and non-hierarchical.
This relationship, however, is purely structural and not related to the ‘biological meaning’ of clusters and components. It reflects intrinsic properties of the data set: higher variance implies a wider space for separating groups. This, combined with the mutual linear independence of components (which guarantees a proper Euclidean metric), explains why clustering procedures in unsupervised learning are commonly applied to the major components of a data set.
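A minimal sketch of this structural coherence, on hypothetical synthetic data (not part of the original study): when two well-separated groups dominate the variance, the sign of each point's score on the first principal component already recovers the K = 2 K-means partition.

```python
import numpy as np

# Two hypothetical, well-separated groups in 2-D (synthetic illustration).
rng = np.random.default_rng(1)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(30, 2))
group_b = rng.normal(loc=[4.0, 1.0], scale=0.3, size=(30, 2))
X = np.vstack([group_a, group_b])

Xc = X - X.mean(axis=0)                      # center the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = Xc @ Vt[0]                      # projection on the first PC

# The sign of the PC1 score splits the data into the two true groups.
labels = pc1_scores > 0
```

The split emerges without running K-means at all, because the between-group direction carries most of the variance and therefore aligns with the first eigenvector.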
Non-hierarchical clustering methods such as K-means require an a priori choice of the number of clusters K. To determine the optimal K, the procedure is repeated for different values of K, and each solution is evaluated using R² or pseudo-F statistics. The optimal K is the one that maximizes the ratio of between-cluster to within-cluster variance [9].
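The pseudo-F (Calinski–Harabasz) statistic mentioned above can be sketched as follows; the toy data and function name are illustrative assumptions, not part of the original study.

```python
import numpy as np

def pseudo_f(X, labels):
    """Pseudo-F (Calinski-Harabasz): ratio of between-cluster to
    within-cluster variance, each scaled by its degrees of freedom."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    grand_mean = X.mean(axis=0)
    ks = np.unique(labels)
    K = len(ks)
    ssb = sum(np.sum(labels == k)
              * np.sum((X[labels == k].mean(axis=0) - grand_mean) ** 2)
              for k in ks)
    ssw = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
              for k in ks)
    return (ssb / (K - 1)) / (ssw / (n - K))

# Hypothetical toy data: two tight, well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
good = np.array([0, 0, 0, 1, 1, 1])   # partition matching the true structure
bad = np.array([0, 1, 0, 1, 0, 1])    # partition mixing the two groups
```

A partition that respects the true structure yields a pseudo-F orders of magnitude larger than one that mixes the groups, which is why the optimal K is chosen at the maximum of this statistic.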
The convex-hull optimization strategy adopted in the K-volume clustering paradigm [10] replaces variance with ‘lacunarity’. The underlying idea is that, for a structurally meaningful partition of the data set, the sum of the areas occupied by the different clusters in the reference space should be smaller than the area occupied by the entire data set before clustering.
In the original version [10], the clustering algorithm was applied in a hierarchical mode. Here, we propose a non-hierarchical version of the algorithm, given that a natural hierarchical structure is not necessarily present in every data set.
We defined two different ‘lacunarity’ measures: one as the ratio between the total area and the sum of the cluster areas (see the two-dimensional case presented in [10]) and another based on volume (for our extension to the three-dimensional case). Let LC_A denote the two-dimensional lacunarity index, A_TOT the total area enclosed by the data points, and A_i the area of the i-th cluster for i = 1, …, K. The formula for the two-dimensional LC_A is:

LC_A = A_TOT / (A_1 + A_2 + … + A_K)

where B_A = A_TOT − (A_1 + A_2 + … + A_K) corresponds to the empty space in two dimensions, and the optimal number of clusters (K) corresponds to the maximal lacunarity.
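As an illustrative sketch (not the authors' implementation), the two-dimensional index can be computed with a pure-Python convex hull (Andrew's monotone chain) and the shoelace area formula; the example clusters are hypothetical.

```python
def _cross(o, a, b):
    # 2-D cross product of vectors oa and ob (orientation test)
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone-chain convex hull; returns hull vertices in order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def hull_area(points):
    """Shoelace area of the convex hull; collinear sets give zero area."""
    h = convex_hull(points)
    return 0.5 * abs(sum(h[i][0] * h[(i + 1) % len(h)][1]
                         - h[(i + 1) % len(h)][0] * h[i][1]
                         for i in range(len(h))))

def lacunarity_2d(clusters):
    """LC_A = A_TOT / sum_i A_i for a list of 2-D point clusters."""
    all_points = [p for cluster in clusters for p in cluster]
    return hull_area(all_points) / sum(hull_area(c) for c in clusters)

# Hypothetical example: two unit squares far apart.
square_a = [(0, 0), (1, 0), (1, 1), (0, 1)]
square_b = [(10, 0), (11, 0), (11, 1), (10, 1)]
lc_a = lacunarity_2d([square_a, square_b])   # A_TOT = 11, sum of A_i = 2
```

Note that a perfectly collinear point set receives zero hull area, anticipating the sensitivity to highly correlated subsets discussed below.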
Analogously, let LC_V denote the three-dimensional lacunarity index, V_TOT the total volume enclosed by the data points, and V_i the volume of the i-th cluster for i = 1, …, K. The formula for the three-dimensional LC_V is:

LC_V = V_TOT / (V_1 + V_2 + … + V_K)

where B_V = V_TOT − (V_1 + V_2 + … + V_K) corresponds to the empty space in three dimensions, and the optimal number of clusters (K) corresponds to the maximal lacunarity.
It is worth noting the resemblance between lacunarity and the presence of distinct ‘attractors’ drastically reducing the phase space accessible to the system at hand. In the case of scRNA-Seq data, these attractors may correspond to different cell types, each characterized by a unique ‘ideal’ gene expression profile.
In K-volume clustering, the clustering cost corresponds to the minimal convex volume enclosing all points within a cluster. Under this definition, a single data point has zero volume, and any set of collinear points also has zero volume [10]. This second property is very important in biological investigations: assigning zero volume to collinear points enhances the method’s sensitivity to subsets that exhibit high intra-class correlation (close to one), even when the overall data set shows no correlation—as in principal component spaces, which are spanned by axes that are linearly independent by construction.
Given that each cell type typically exhibits a highly specific gene expression pattern, independent samples of the same cell type tend to show near-unity correlation in terms of expression profile [3]. This makes K-volume clustering particularly effective in identifying small groups of cells that differ from the majority of the cell population. Such groups, composed of cells of a kind different from the majority of the population, are expected to show high intra-class correlation when embedded into principal component spaces mainly defined by the ‘dominant’ cell kind.
Let us call such cells ‘Type A’: when we analyze a homogeneous ‘Type A’ population by PCA (or by any other dimensionality reduction method), we generate a dominant principal component reflecting their unique expression profile [3]. Conversely, when these A-cells are sparsely distributed within a large ‘Type B’ population of a different cell kind, the relatively few ‘Type A’ cells are likely to contribute only to minor components, accounting for a small fraction of the explained variance.
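A synthetic sketch of this effect (the data-generating choices are illustrative assumptions): a dominant mode drives the majority population along one gene, while a small minority carries a modest offset on another gene; PCA then confines the minority signal to a minor component, on which those cells show extreme scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 100, 20
X = rng.normal(0.0, 0.1, size=(n_cells, n_genes))   # background noise
X[:, 0] += np.linspace(-5.0, 5.0, n_cells)          # dominant majority mode (gene 0)
X[95:, 1] += 3.0                                    # 5 minority cells offset on gene 1

Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)                     # explained-variance ratios

pc2_scores = Xc @ Vt[1]                             # scores on the minor (second) PC
minority_mean = np.abs(pc2_scores[95:]).mean()
majority_mean = np.abs(pc2_scores[:95]).mean()
```

Here the first component absorbs most of the variance, yet it says nothing about the minority cells; they surface only as extreme absolute scores on the second, low-variance component.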
This observation inspired the data analysis strategy we propose: after identifying sparsely populated clusters—particularly those with collinear distributions—in the space defined by the major principal components, we examine the elements of these clusters for an excess of extreme absolute values along the minor component(s) corresponding to their main order parameter (ideal profile). An enrichment of extreme values on a minor principal component could qualify such a component as the dominant (and thus high-variance) mode of a small group of cells coming from a different tissue kind.
When a statistically significant excess of extreme values of a minor component is observed for a specific cluster, we analyze its loading pattern with respect to RNA species. If this pattern aligns with a known cell type profile, we can assign a putative biological identity (cell type) to the cluster.
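One way to formalize the "statistically significant excess of extreme values" check is a one-sided binomial tail test on the fraction of cluster members exceeding a data-wide quantile of absolute scores; the function name, threshold choice, and toy scores below are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from math import comb

def extreme_enrichment_pvalue(scores, cluster_mask, q=0.95):
    """One-sided binomial test: is the cluster enriched for cells whose
    |score| on a given (minor) PC exceeds the data-wide q-quantile?"""
    threshold = np.quantile(np.abs(scores), q)
    n = int(np.sum(cluster_mask))                   # cluster size
    k = int(np.sum(np.abs(scores[cluster_mask]) > threshold))
    p = 1.0 - q                                     # expected extreme rate under the null
    # P(X >= k) for X ~ Binomial(n, p)
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Hypothetical minor-PC scores: 95 background cells near zero,
# 5 cells of a putative distinct type with large absolute scores.
scores = np.concatenate([np.linspace(-1.0, 1.0, 95),
                         np.array([5.0, 5.1, 5.2, 5.3, 5.4])])
candidate = np.zeros(100, dtype=bool)
candidate[95:] = True                               # the 5 outlying cells
background = np.zeros(100, dtype=bool)
background[:5] = True                               # 5 ordinary cells, for contrast

p_candidate = extreme_enrichment_pvalue(scores, candidate)
p_background = extreme_enrichment_pvalue(scores, background)
```

A cluster passing this test would then proceed to the loading-pattern inspection described above for putative cell-type assignment.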