1. Introduction
The multi-scale character of biological regulation typically produces one (or a few) main modes explaining a major part of the variance (size component), squashing potentially interesting (shape) components into the noise floor. These minor components can be erroneously discarded as pure noise by conventional selection methods, which are based on identifying the main order parameters that organize the system [1].
This phenomenon is very common in applications of principal component analysis (PCA) to biological data [2]. The component explaining most of a data set’s variance often reflects well-known regularities. A classic example is the first principal component (PC) in microarray data sets, which frequently captures the gene expression profile characteristic of a specific tissue [3]. In contrast, biologically relevant information—such as the effects of pathology or cell-fate transitions—may be hidden in minor components that account for only a small fraction of the total variance.
Moreover, systematic variations introduced by laboratory conditions, reagent batches, or personnel differences—commonly referred to as batch effects [4]—can generate artificial major components that obscure biologically meaningful signals. It is important to emphasize, however, that this is not always the case: the presence of one or a few components explaining nearly all of the system’s variance is not necessarily an artifact, but may instead reflect the existence of a dominant order parameter that organizes the system’s variability [5]. Similar considerations hold for biomedical applications such as imaging studies, where the main order parameter corresponds to anatomical organization, while diagnostically relevant information is embedded in subtle image details [6].
In a purely ‘supervised learning’ frame—such as medical diagnosis—this issue can be readily addressed by identifying biologically relevant minor components through their correlation with external variables of interest. These variables correspond to labels assigned to the statistical units on the basis of their biological or clinical nature, as determined by independent criteria (e.g., healthy/unhealthy, active/inactive). In such cases, the relevant components are those enabling a classification consistent with the a priori labels (taken as the gold standard), regardless of the amount of variance they explain [7].
In the case of exploratory, hypothesis-generating studies, this straightforward approach is not applicable. Alternative strategies are required to assign biological meaning to minor components and/or sparsely populated clusters.
In this study, we propose a novel strategy applied to a data set derived from single-cell RNA sequencing (scRNA-Seq). The innovation lies in the integration of a non-linear clustering procedure (Convex Hull-based) with a structural-semantic interpretation of minor components, and in the introduction of a new statistical index—termed ‘Lacunarity’—to determine the optimal partitioning of the data set.
2. Data Analysis Strategy
The higher the proportion of variance explained by principal components, the higher their clustering efficiency—an empirical observation familiar to any data analyst. This stems from the coherence between the K-means and PCA procedures: the continuous solutions of the discrete K-means cluster membership indicators correspond to the principal eigenvectors of the covariance matrix [8]. Similar considerations hold for other clustering procedures, both hierarchical and non-hierarchical.
This relationship, however, is purely structural and not related to the ‘biological meaning’ of clusters and components. It reflects intrinsic properties of the data set: higher variance implies a wider space for separating groups. This, combined with the mutual linear independence of components (which guarantees a proper Euclidean metric), explains why clustering procedures in unsupervised learning are commonly applied to the major components of a data set.
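A minimal sketch of this structural coherence, on hypothetical synthetic data (not part of the original study): when two well-separated groups dominate the variance, the sign of each point's score on the first principal component already recovers the K = 2 K-means partition.

```python
import numpy as np

# Two hypothetical, well-separated groups in 2-D (synthetic illustration).
rng = np.random.default_rng(1)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(30, 2))
group_b = rng.normal(loc=[4.0, 1.0], scale=0.3, size=(30, 2))
X = np.vstack([group_a, group_b])

Xc = X - X.mean(axis=0)                      # center the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = Xc @ Vt[0]                      # projection on the first PC

# The sign of the PC1 score splits the data into the two true groups.
labels = pc1_scores > 0
```

The split emerges without running K-means at all, because the between-group direction carries most of the variance and therefore aligns with the first eigenvector.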
Non-hierarchical clustering methods such as K-means require an a priori choice of the number of clusters K. To determine the optimal K, the procedure is repeated for different values of K, and each solution is evaluated using R² or pseudo-F statistics. The optimal K is the one that maximizes the ratio of between-cluster to within-cluster variance [9].
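The pseudo-F (Calinski–Harabasz) statistic mentioned above can be sketched as follows; the toy data and function name are illustrative assumptions, not part of the original study.

```python
import numpy as np

def pseudo_f(X, labels):
    """Pseudo-F (Calinski-Harabasz): ratio of between-cluster to
    within-cluster variance, each scaled by its degrees of freedom."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    grand_mean = X.mean(axis=0)
    ks = np.unique(labels)
    K = len(ks)
    ssb = sum(np.sum(labels == k)
              * np.sum((X[labels == k].mean(axis=0) - grand_mean) ** 2)
              for k in ks)
    ssw = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
              for k in ks)
    return (ssb / (K - 1)) / (ssw / (n - K))

# Hypothetical toy data: two tight, well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
good = np.array([0, 0, 0, 1, 1, 1])   # partition matching the true structure
bad = np.array([0, 1, 0, 1, 0, 1])    # partition mixing the two groups
```

A partition that respects the true structure yields a pseudo-F orders of magnitude larger than one that mixes the groups, which is why the optimal K is chosen at the maximum of this statistic.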
The convex-hull optimization strategy adopted in the K-volume clustering paradigm [10] replaces variance with ‘lacunarity’. The underlying idea is that, for a structurally meaningful partition of the data set, the sum of the areas occupied by the different clusters in the reference space should be smaller than the area occupied by the entire data set before clustering.
In the original version [10], the clustering algorithm was applied in a hierarchical mode. Here, we propose a non-hierarchical version of the algorithm, given that a natural hierarchical structure is not necessarily present in every data set.
We defined two different ‘lacunarity’ measures: one as the ratio between the total area and the sum of the cluster areas (see the two-dimensional case presented in [10]) and another based on volume (for our extension to the three-dimensional case). Let LC_A denote the two-dimensional lacunarity index, A_TOT the total area enclosed by the data points, and A_i the area of the i-th cluster for i = 1, …, K. The formula for the two-dimensional LC_A is:

LC_A = A_TOT / (A_1 + A_2 + … + A_K)

where B_A = A_TOT − (A_1 + A_2 + … + A_K) corresponds to the empty space in two dimensions, and the optimal number of clusters (K) corresponds to the maximal lacunarity.
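As an illustrative sketch (not the authors' implementation), the two-dimensional index can be computed with a pure-Python convex hull (Andrew's monotone chain) and the shoelace area formula; the example clusters are hypothetical.

```python
def _cross(o, a, b):
    # 2-D cross product of vectors oa and ob (orientation test)
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone-chain convex hull; returns hull vertices in order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def hull_area(points):
    """Shoelace area of the convex hull; collinear sets give zero area."""
    h = convex_hull(points)
    return 0.5 * abs(sum(h[i][0] * h[(i + 1) % len(h)][1]
                         - h[(i + 1) % len(h)][0] * h[i][1]
                         for i in range(len(h))))

def lacunarity_2d(clusters):
    """LC_A = A_TOT / sum_i A_i for a list of 2-D point clusters."""
    all_points = [p for cluster in clusters for p in cluster]
    return hull_area(all_points) / sum(hull_area(c) for c in clusters)

# Hypothetical example: two unit squares far apart.
square_a = [(0, 0), (1, 0), (1, 1), (0, 1)]
square_b = [(10, 0), (11, 0), (11, 1), (10, 1)]
lc_a = lacunarity_2d([square_a, square_b])   # A_TOT = 11, sum of A_i = 2
```

Note that a perfectly collinear point set receives zero hull area, anticipating the sensitivity to highly correlated subsets discussed below.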
Analogously, let LC_V denote the three-dimensional lacunarity index, V_TOT the total volume enclosed by the data points, and V_i the volume of the i-th cluster for i = 1, …, K. The formula for the three-dimensional LC_V is:

LC_V = V_TOT / (V_1 + V_2 + … + V_K)

where B_V = V_TOT − (V_1 + V_2 + … + V_K) corresponds to the empty space in three dimensions, and the optimal number of clusters (K) corresponds to the maximal lacunarity.
It is worth noting the resemblance between lacunarity and the presence of distinct ‘attractors’ drastically reducing the phase space accessible to the system at hand. In the case of scRNA-Seq data, these attractors may correspond to different cell types, each characterized by a unique ‘ideal’ gene expression profile.
In K-volume clustering, the clustering cost corresponds to the minimal convex volume enclosing all points within a cluster. Under this definition, a single data point has zero volume, and any set of collinear points also has zero volume [10]. This second property is very important in biological investigations: assigning zero volume to collinear points enhances the method’s sensitivity to subsets that exhibit high intra-class correlation (close to one), even when the overall data set shows no correlation—as in principal component spaces, which are spanned by axes that are linearly independent by construction.
Given that each cell type typically exhibits a highly specific gene expression pattern, independent samples of the same cell type tend to show near-unity correlation in terms of expression profile [3]. This makes K-volume clustering particularly effective in identifying small groups of cells that differ from the majority of the cell population. Such groups, composed of cells of a kind different from the majority of the population, are expected to show high intra-class correlation when embedded into principal component spaces mainly defined by the ‘dominant’ cell kind.
Let us call such cells ‘Type A’: when we analyze a homogeneous ‘Type A’ population by PCA (or by any other dimensionality reduction method), we generate a dominant principal component reflecting their unique expression profile [3]. Conversely, when these A-cells are sparsely distributed within a large ‘Type B’ population of a different cell kind, the relatively few ‘Type A’ cells are likely to contribute only to minor components, accounting for a small fraction of the explained variance.
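A synthetic sketch of this effect (the data-generating choices are illustrative assumptions): a dominant mode drives the majority population along one gene, while a small minority carries a modest offset on another gene; PCA then confines the minority signal to a minor component, on which those cells show extreme scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 100, 20
X = rng.normal(0.0, 0.1, size=(n_cells, n_genes))   # background noise
X[:, 0] += np.linspace(-5.0, 5.0, n_cells)          # dominant majority mode (gene 0)
X[95:, 1] += 3.0                                    # 5 minority cells offset on gene 1

Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)                     # explained-variance ratios

pc2_scores = Xc @ Vt[1]                             # scores on the minor (second) PC
minority_mean = np.abs(pc2_scores[95:]).mean()
majority_mean = np.abs(pc2_scores[:95]).mean()
```

Here the first component absorbs most of the variance, yet it says nothing about the minority cells; they surface only as extreme absolute scores on the second, low-variance component.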
This observation inspired the data analysis strategy we propose: after identifying sparsely populated clusters—particularly those with collinear distributions—in the space defined by the major principal components, we examine the elements of these clusters for an excess of extreme absolute values along the minor component(s) corresponding to their main order parameter (ideal profile). An enrichment of extreme values on a minor principal component could qualify such a component as the dominant (and thus high-variance) mode of a small group of cells coming from a different tissue kind.
When a statistically significant excess of extreme values of a minor component is observed for a specific cluster, we analyze its loading pattern with respect to RNA species. If this pattern aligns with a known cell type profile, we can assign a putative biological identity (cell type) to the cluster.
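One way to formalize the "statistically significant excess of extreme values" check is a one-sided binomial tail test on the fraction of cluster members exceeding a data-wide quantile of absolute scores; the function name, threshold choice, and toy scores below are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
from math import comb

def extreme_enrichment_pvalue(scores, cluster_mask, q=0.95):
    """One-sided binomial test: is the cluster enriched for cells whose
    |score| on a given (minor) PC exceeds the data-wide q-quantile?"""
    threshold = np.quantile(np.abs(scores), q)
    n = int(np.sum(cluster_mask))                   # cluster size
    k = int(np.sum(np.abs(scores[cluster_mask]) > threshold))
    p = 1.0 - q                                     # expected extreme rate under the null
    # P(X >= k) for X ~ Binomial(n, p)
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Hypothetical minor-PC scores: 95 background cells near zero,
# 5 cells of a putative distinct type with large absolute scores.
scores = np.concatenate([np.linspace(-1.0, 1.0, 95),
                         np.array([5.0, 5.1, 5.2, 5.3, 5.4])])
candidate = np.zeros(100, dtype=bool)
candidate[95:] = True                               # the 5 outlying cells
background = np.zeros(100, dtype=bool)
background[:5] = True                               # 5 ordinary cells, for contrast

p_candidate = extreme_enrichment_pvalue(scores, candidate)
p_background = extreme_enrichment_pvalue(scores, background)
```

A cluster passing this test would then proceed to the loading-pattern inspection described above for putative cell-type assignment.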