Asymmetric Distance in K-Means Clustering Enhances Quality of Cells Raman Imaging

Scopacasa, Bernadette; Candeloro, Patrizio

doi:10.3390/app15084461

by Bernadette Scopacasa and Patrizio Candeloro *

Nanotechnology Research Center, Department of Experimental and Clinical Medicine, University Magna Graecia of Catanzaro, 88100 Catanzaro, Italy

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(8), 4461; https://doi.org/10.3390/app15084461

Submission received: 28 February 2025 / Revised: 8 April 2025 / Accepted: 15 April 2025 / Published: 17 April 2025

(This article belongs to the Special Issue Biological Sample Analysis Techniques and Devices)

Abstract

Raman microspectroscopy is a powerful, label-free technique for the biochemical characterization of cells, but its complex spectral data require advanced computational methods for meaningful interpretation. Clustering analysis is widely used in spectroscopic imaging to extract meaningful biochemical information. Traditional methods, such as K-means clustering with Euclidean distance, often struggle to capture subtle spectral variations, leading to suboptimal segmentation. Alternative distance metrics, including cosine and Mahalanobis distances, have been explored to enhance cluster separability, yet challenges remain in distinguishing chemically relevant features while minimizing redundancy and noise. In this study, we introduce an asymmetric metric distance matrix with a tunable eccentricity parameter to improve clustering performance in Raman hyperspectral imaging. Our results demonstrate that suitable eccentricity values enhance the identification of subcellular structures while requiring fewer clusters than Euclidean-based approaches. Compared to polar metrics, the proposed asymmetric metric achieves better stability and reduced noise, leading to more accurate segmentation. Future research could explore its application in other clustering techniques and machine learning frameworks, as well as its application in broader spectral imaging techniques where the distance metric plays a fundamental role.

Keywords:

Raman imaging; biochemical microspectroscopy; asymmetric metric matrix; K-means clustering analysis; cellular spectral imaging

1. Introduction

Raman spectroscopy is a powerful, non-invasive analytical technique that provides detailed molecular information about biological samples, making it particularly valuable in biomedical research. By measuring the inelastic scattering of monochromatic light, Raman spectroscopy generates spectral fingerprints that reflect the biochemical composition of cells, tissues, and biofluids. This capability has led to its widespread application in, among others, cellular biology [1,2], disease diagnosis [3,4], and drug development [5,6].

However, the complexity and high dimensionality of Raman spectral data pose significant challenges in data interpretation. Spectra often contain overlapping peaks, autofluorescence, and background noise, along with an intrinsic variability due to the heterogeneity of biological samples. First, raw data need to be preprocessed with background reduction (such as polynomial subtraction or similar techniques) and normalized in intensity to reduce, as much as possible, non-Raman signals and signal differences caused by different optical parameters (laser power, integration time, variations in focusing on the sample). Even after efficient preprocessing, when clean Raman signals are available, extracting meaningful insights from Raman imaging datasets of complex biological samples requires computational approaches, such as machine learning and, more recently, deep learning techniques. Among machine learning approaches, K-means clustering analysis (KCA) remains an effective unsupervised method for analyzing Raman hyperspectral images.

K-means clustering is a widely used partitioning algorithm that groups data points into clusters based on spectral similarity. By minimizing intra-cluster variance, it allows for the identification of distinct biochemical regions within a Raman image, facilitating the differentiation of cellular components, disease states, or metabolic processes. Compared to other clustering techniques such as hierarchical clustering or Gaussian mixture models, K-means offers computational efficiency and scalability, making it suitable for large Raman imaging datasets.

However, the choice of distance metric in K-means clustering is crucial for achieving a meaningful segmentation of Raman data. Different metrics, such as Euclidean distance or polar distance (cosine function), can significantly impact the clustering outcome by influencing how spectral similarities and differences are measured. Selecting an appropriate metric ensures that biologically relevant spectral variations are effectively captured, leading to improved cluster separability and more accurate identification of biochemical structures within the sample.

In the last two decades, several efforts have been made to introduce non-Euclidean metrics into clustering analysis. Among angular-based metrics, proposals include polar coordinates for circular clustering [7] and for density peaks clustering [8], polar transforms with k-means segmentation [9], and clustering using polar self-organizing maps [10]. Asymmetries have also been explored and exploited, for example in asymmetric self-organizing maps [11], unsupervised anisotropic clustering [12], the asymmetric K-means algorithm [13], and K-means clustering on asymmetric data [14]. However, all of these works on asymmetric clustering make a non-metric use of data asymmetry by introducing Gaussian kernels or dissimilarity measures.

In this study, we develop an asymmetric metric distance to improve clustering performance. An ad hoc parameter, called eccentricity, moves the metric from Euclidean to polar distance through different degrees of asymmetry. By applying K-means clustering to the Raman imaging of cell samples, we explore the optimization of two clustering parameters, the eccentricity and the number of clusters. We show that an appropriate choice of the eccentricity parameter significantly enhances the quality of image segmentation compared to both Euclidean and polar metrics, and could improve the ability to distinguish subtle biochemical variations inside the cells. These results could contribute to the advancement of computational Raman microspectroscopy for biomedical applications.

2. Materials and Methods

2.1. Cell Culturing and Fixation

Two cell lines were used for this work, specifically human hepatocarcinoma cells HepG2 and human hepatic stellate cells LX-2.

The HepG2 cell lines were purchased from the American Type Culture Collection (ATCC, Manassas, VA, USA) and were grown in a humidified incubator (95% O₂, 5% CO₂) at 37 °C in Minimum Essential Medium (MEM, Corning 10-009-CV), supplemented with 10% FBS (SIAL) and 1% penicillin/streptomycin (100 μg/mL) (SIAL).

LX-2 cells were grown at 37 °C in a humidified atmosphere containing 5% CO₂ in complete Dulbecco’s Modified Eagle Medium (DMEM, 4.5 g/L glucose, phenol red, no L-glutamine, no sodium pyruvate) supplemented with 1% v/v penicillin/streptomycin mixture (penicillin: 10,000 U/mL, streptomycin: 10,000 µg/mL), 1% v/v of L-glutamine (200 nM), and 2% v/v fetal bovine serum (FBS) (LX-2 cells, DMEM, FBS penicillin, streptomycin, and L-glutamine were all from Merck Millipore, Darmstadt, Germany).

For all cell cultures, sterilized CaF₂ slides (from Crystran Ltd., Dorset, UK) were used as substrates inside the culturing wells because of the negligible Raman signal of CaF₂. Moreover, all cell media were replaced with serum-free and phenol-red-free DMEM (from HyClone, Logan, UT, USA) supplemented with 1% v/v penicillin/streptomycin and 1% v/v L-glutamine before the Raman experiments to reduce possible interfering signals from the media.

2.2. Raman Measurements

Raman microspectroscopy was carried out by means of an Alpha 300-R instrument (Witec GmbH, Ulm, Germany), using a 532 nm laser. A laser power of 10 mW/cm² was set over the sample, with a typical integration time of 1 s per single spectrum. During Raman experiments, the fixed cells were maintained in a PBS (1×) solution. A 60×/1.00 NA water immersion objective was used to focus the incident laser on the sample. According to diffraction laws, the minimum achievable spot size under optimal conditions with this optical setup was approximately 0.35 µm. Raman maps were acquired by scanning the sample under the laser focus according to a measurement grid with a pixel size of 0.40 µm and collecting one Raman spectrum per pixel. The software used for Raman measurements was Witec Control 1.60.

2.3. Raman Preprocessing

Raman images were obtained from hyperspectral datasets using multivariate analysis applied to preprocessed spectra. The same preprocessing steps were applied to all spectra. Initially, the water background signal was subtracted from all spectra, followed by polynomial baseline subtraction to account for potential autofluorescence effects. Subsequently, the spectra of each map were normalized to the maximum spectral area, ensuring comparability between Raman datasets acquired at different times. All data preprocessing as well as KCA was performed with Raman-Tool-Set software v.3.1, freely available [15].
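The normalization step can be illustrated with a minimal sketch. It assumes that "normalized to the maximum spectral area" means dividing every spectrum of a map by the largest area found in that map; this reading, the function names, and the toy values are assumptions made here, not a description of how Raman-Tool-Set implements the step.

```python
def spectral_area(wavenumbers, intensities):
    """Trapezoidal area under one spectrum."""
    area = 0.0
    for i in range(1, len(wavenumbers)):
        width = wavenumbers[i] - wavenumbers[i - 1]
        area += 0.5 * (intensities[i] + intensities[i - 1]) * width
    return area

def normalize_map_to_max_area(wavenumbers, spectra):
    """Divide every spectrum of a map by the largest spectral area in that map.

    Assumed interpretation of the normalization described in the text.
    """
    max_area = max(spectral_area(wavenumbers, s) for s in spectra)
    return [[v / max_area for v in s] for s in spectra]

# Toy map of two baseline-corrected spectra (illustrative values only).
wavenumbers = [1000.0, 1001.0, 1002.0]
spectra = [[0.0, 2.0, 0.0], [0.0, 4.0, 0.0]]
normalized = normalize_map_to_max_area(wavenumbers, spectra)
areas = [spectral_area(wavenumbers, s) for s in normalized]
```

Under this reading, relative intensities between pixels of the same map are preserved, which matters for any metric that is sensitive to vector magnitude.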

3. Anisotropic Distance Metric

In vibrational spectroscopy, each spectrum can be represented as a point in an N-dimensional space, where N is the number of acquired intensities. Thus, each spectrum can be expressed as an array, denoted as r. Since vibrational intensities are strictly positive, all components of r are positive.

When applying clustering or other machine learning techniques, a distance metric is required to compare spectra. Incorporating prior knowledge of the data structure can be beneficial in defining an appropriate metric. Two spectra, r₀ and r₁, convey the same chemical information if they exhibit the same relative peak ratios, even if their absolute intensities differ (Figure 1). Mathematically, such spectra can be expressed as:

r₁ = c⋅r₀ with c > 0,

where c is a positive scalar. In this case, both spectra originate from the same molecular species, with the only difference being concentration (higher for r₁ if c > 1), which affects the overall intensity. Representing r₀ and r₁ as vectors in a coordinate system, they share the same direction and lie on the same straight line originating from the origin. Conversely, a spectrum r₂ that deviates significantly in angle from r₀ represents a different chemical composition. Therefore, an ideal distance metric should reflect this data structure.
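A small numerical sketch may help to see why direction carries the chemical information. The three-channel "spectra" below are invented for illustration: r1 is a scaled copy of r0, while r2 has different peak ratios.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two spectra treated as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Ordinary (isotropic) distance between two spectra."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy three-channel spectra (illustrative values, not measured data).
r0 = [1.0, 2.0, 0.5]
r1 = [3.0 * v for v in r0]   # same peak ratios, three times the intensity
r2 = [2.0, 0.5, 1.0]         # different peak ratios

cos_01 = cosine_similarity(r0, r1)   # angle between r0 and its scaled copy
cos_02 = cosine_similarity(r0, r2)   # angle between r0 and a different composition
dist_01 = euclidean_distance(r0, r1)
dist_02 = euclidean_distance(r0, r2)
```

In this toy case the Euclidean distance places the chemically identical r1 farther from r0 than the chemically different r2, while the cosine similarity identifies r1 as collinear with r0.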

For simplicity, we consider a three-dimensional (3D) case. Given a reference spectrum r₀ in a 3D space, the distance to another spectrum r is computed using a metric matrix. The displacement vector is:

Δr = r − r₀,

The Euclidean distance d is typically given by:

d² = |Δr|² = Δx² + Δy² + Δz²,

which can be rewritten as the dot product:

d² = Δr·Δr,

Introducing a metric matrix M, the generalized distance is:

d² = Δr^TMΔr,

where the Euclidean metric matrix M_EU is simply the identity matrix:

M_EU = I,

This defines an isotropic distance, meaning all points equidistant from r₀ form a sphere centered at r₀ (Figure 2A). However, this standard distance metric does not account for the underlying data structure. To incorporate anisotropy, we define an asymmetric metric matrix M_AS, which produces ellipsoidal isodistance surfaces instead of spherical ones.
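As a minimal sketch, the quadratic form d² = Δrᵀ M Δr can be written for an arbitrary metric matrix M and checked against the Euclidean case M = I. The function name and the toy vectors are illustrative assumptions.

```python
def quadratic_distance_sq(r, r0, M):
    """Return d^2 = dr^T M dr for dr = r - r0, with M given as a list of rows."""
    dr = [a - b for a, b in zip(r, r0)]
    M_dr = [sum(m_ij * d_j for m_ij, d_j in zip(row, dr)) for row in M]
    return sum(d_i * v_i for d_i, v_i in zip(dr, M_dr))

# Euclidean metric matrix in 3D: the identity.
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

r0 = [1.0, 2.0, 0.5]
r = [2.0, 0.5, 1.0]
d_sq = quadratic_distance_sq(r, r0, identity)

# Same quantity computed as a plain sum of squared differences.
d_sq_direct = sum((a - b) ** 2 for a, b in zip(r, r0))
```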

The desired anisotropic distance metric forms a prolate ellipsoid with its major axis aligned with r₀. To control elongation, we introduce an eccentricity parameter, ecc, which defines the ratio between minor and major axes. Figure 2 illustrates the effect of varying ecc, with panels B, C, and D corresponding to ecc = 0.50, 0.25, and 0.10, respectively. The most elongated ellipsoid is represented by ecc = 0.10.

These surfaces indicate that all blue points in Figure 2 have the same computed distance d from r₀, based on the chosen metric. When using the Euclidean matrix (Figure 2A), the isodistance surface remains a sphere (indicating isotropic distance), whereas with the asymmetric matrix (Figure 2B–D), the sphere deforms into an ellipsoid (indicating an anisotropic distance). In this formulation, points aligned with r₀ along the major axis are assigned smaller distances, while points deviating from r₀ along the minor axes are assigned larger distances. This ensures that the metric prioritizes spectral similarity based on relative peak ratios rather than absolute intensity differences.

If r₀ lies on the x axis, the asymmetric metric matrix that produces the desired result in 3D is given by:

T = \begin{bmatrix} \frac{1}{\mathrm{ecc}_x} & 0 & 0 \\ 0 & \frac{1}{\mathrm{ecc}_y} & 0 \\ 0 & 0 & \frac{1}{\mathrm{ecc}_z} \end{bmatrix}

where 1/ecc_x determines the elongation along the x axis (such that a smaller ecc_x results in a larger elongation), and ecc_x corresponds to the eccentricity parameter ecc introduced earlier, i.e., ecc_x = ecc. For the other axes we set ecc_y = ecc_z to obtain a prolate ellipsoid shape. These parameters can be collectively denoted as ecc_⊥. Moreover, to prevent volume scaling when transitioning from spherical to ellipsoidal isodistance surfaces, we require the matrix T to have unit determinant by applying the following condition:

1 = \frac{1}{\mathrm{ecc}_x} \cdot \frac{1}{\mathrm{ecc}_y} \cdot \frac{1}{\mathrm{ecc}_z}

which simplifies to

1 = \frac{1}{\mathrm{ecc}} \cdot \left(\frac{1}{\mathrm{ecc}_\perp}\right)^{2}

Solving for ecc_⊥, we obtain:

\mathrm{ecc}_\perp = \left(\frac{1}{\mathrm{ecc}}\right)^{1/2}

This result implies that only a single eccentricity parameter, ecc, needs to be specified along the axis of r₀, while the others are uniquely determined.

To compute the distances from r₀ using the metric matrix T, we apply:

d² = Δr^TTΔr,

yielding isodistance surfaces in the form of prolate ellipsoids with the major axis aligned with the x axis. However, this formulation is only valid when r₀ lies on the x axis. To extend the approach to a generic r₀, we first apply a rotation matrix R, which aligns r₀ with the x axis:

r₀′ = Rr₀
Δr′ = RΔr = R(r − r₀)

Then, the asymmetric matrix T is used to compute the distances between the rotated vectors:

d² = Δr′^TTΔr′

Substituting Δr′ = RΔr yields:

d² = (RΔr)^T T (RΔr)
d² = Δr^T(R^TTR)Δr

Thus, the desired asymmetric metric matrix, M_AS, applicable to any arbitrary r₀, is given by:

M_AS = R^TTR

Since T is a positive definite matrix and the application of a rotation R preserves the positive definiteness of T, the resulting matrix M_AS remains positive definite and constitutes a valid metric matrix.
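A minimal sketch of this construction is given below. Because T has one entry on the r₀ axis and a common entry on every perpendicular axis, the product RᵀTR equals t_∥·uuᵀ + t_⊥·(I − uuᵀ) with u = r₀/|r₀|, so the distance can be evaluated from the projection of Δr onto r₀ without building R explicitly. This shortcut is an assumption made here for brevity and is not the rotation method of ref. [16]; the function name is hypothetical, and the weights t_∥ = 1/ecc and t_⊥ = 1/ecc_⊥ are taken literally from the diagonal of T as written above.

```python
import math

def asymmetric_distance_sq(r, r0, ecc):
    """Squared distance dr^T (R^T T R) dr of spectrum r from reference r0."""
    n = len(r0)
    norm0 = math.sqrt(sum(v * v for v in r0))
    u = [v / norm0 for v in r0]                     # unit vector along r0
    dr = [a - b for a, b in zip(r, r0)]             # displacement r - r0
    along = sum(d * ui for d, ui in zip(dr, u))     # component of dr along r0
    perp_sq = sum(d * d for d in dr) - along ** 2   # squared component across r0
    t_par = 1.0 / ecc                               # entry of T on the r0 axis
    ecc_perp = (1.0 / ecc) ** (1.0 / (n - 1))       # unit-determinant condition
    t_perp = 1.0 / ecc_perp                         # entry of T on the other axes
    return t_par * along ** 2 + t_perp * perp_sq

# r0 on the x axis: the result must agree with the diagonal form of T,
# here T = diag(4, 0.5, 0.5) applied to dr = (1, 1, 0).
d_sq_axis = asymmetric_distance_sq([3.0, 1.0, 0.0], [2.0, 0.0, 0.0], ecc=0.25)

# ecc = 1 makes T the identity, so the squared Euclidean distance is recovered.
d_sq_euclid = asymmetric_distance_sq([2.0, 0.5, 1.0], [1.0, 2.0, 0.5], ecc=1.0)
```

The projection form also avoids storing an N × N matrix for every reference spectrum, which can be convenient when N is the number of spectral channels.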

Extending this approach to N-dimensions is straightforward. The matrix T in N-dimensions is given by:

T = \begin{bmatrix} \frac{1}{\mathrm{ecc}} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{\mathrm{ecc}_\perp} \end{bmatrix}

where the unit-determinant condition results in:

\mathrm{ecc}_\perp = \left(\frac{1}{\mathrm{ecc}}\right)^{1/(N-1)}

The N-dimensional rotation matrix R is computed using the general method proposed in ref. [16].
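The condition on the determinant of T can be checked numerically. The sketch below computes ecc_⊥ for a given ecc and dimension N, builds the diagonal of T, and verifies that the product of the diagonal entries is 1; for N = 3 it recovers the exponent 1/2 of the three-dimensional case. Function names are illustrative, and N = 1000 is an arbitrary stand-in for the number of spectral channels.

```python
import math

def perpendicular_eccentricity(ecc, n_dims):
    """Solve (1/ecc) * (1/ecc_perp)**(n_dims - 1) = 1 for ecc_perp."""
    return (1.0 / ecc) ** (1.0 / (n_dims - 1))

def diagonal_of_T(ecc, n_dims):
    """Diagonal entries of T: 1/ecc on the r0 axis, 1/ecc_perp elsewhere."""
    ecc_perp = perpendicular_eccentricity(ecc, n_dims)
    return [1.0 / ecc] + [1.0 / ecc_perp] * (n_dims - 1)

# Three-dimensional case with ecc = 0.25.
diag_3d = diagonal_of_T(0.25, 3)
det_3d = math.prod(diag_3d)

# High-dimensional case; the log-determinant is used to avoid rounding
# problems when multiplying many factors.
diag_nd = diagonal_of_T(0.10, 1000)
log_det_nd = sum(math.log(v) for v in diag_nd)
```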

4. Results

4.1. Asymmetric Metric Matrix Benchmark, Compared with Euclidean and Polar Metrics

We performed K-means clustering analysis on the Raman hyperspectra of cells using the asymmetric metric matrix M_AS to compute the distances between spectra. First, we compared Euclidean and pure polar distances with asymmetric distances, varying the eccentricity parameter ecc. A Raman dataset recorded over a HepG2 cell was used for this test.
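How such a metric could enter a K-means iteration is sketched below. Several choices are assumptions made here and are not stated in the text: each centroid is used as the reference r₀ of its own metric, centroids are updated as plain means of their members, the loop runs for a fixed number of iterations, and the distance is evaluated through the projection onto r₀ with the entries of T taken literally as written in Section 3. It is not the Raman-Tool-Set implementation, and all names and data are illustrative.

```python
import math

def asymmetric_distance_sq(r, r0, ecc):
    """Squared distance of r from r0 with weights 1/ecc (along r0), 1/ecc_perp (across)."""
    n = len(r0)
    norm0 = math.sqrt(sum(v * v for v in r0))
    u = [v / norm0 for v in r0]
    dr = [a - b for a, b in zip(r, r0)]
    along = sum(d * ui for d, ui in zip(dr, u))
    perp_sq = sum(d * d for d in dr) - along ** 2
    ecc_perp = (1.0 / ecc) ** (1.0 / (n - 1))
    return along ** 2 / ecc + perp_sq / ecc_perp

def assign_to_clusters(spectra, centroids, ecc):
    """Label each spectrum with the index of the closest centroid."""
    labels = []
    for r in spectra:
        distances = [asymmetric_distance_sq(r, c, ecc) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

def update_centroids(spectra, labels, centroids):
    """Replace each centroid by the mean of its members (kept unchanged if empty)."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [r for r, lab in zip(spectra, labels) if lab == k]
        if not members:
            new_centroids.append(old)
            continue
        new_centroids.append([sum(col) / len(members) for col in zip(*members)])
    return new_centroids

def kmeans(spectra, centroids, ecc, n_iter=20):
    """Alternate assignment and centroid update for a fixed number of iterations."""
    for _ in range(n_iter):
        labels = assign_to_clusters(spectra, centroids, ecc)
        centroids = update_centroids(spectra, labels, centroids)
    return labels, centroids

# Toy two-channel "spectra" pointing in two different directions.
spectra = [[1.0, 0.1], [2.0, 0.2], [0.1, 1.0], [0.2, 2.0]]
initial_centroids = [[1.0, 0.1], [0.1, 1.0]]
labels, centroids = kmeans(spectra, initial_centroids, ecc=0.5)
```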

Figure 3 presents the results obtained with the different metrics for six, seven, and eight clusters. For all cluster numbers, we observe that Euclidean distance and the asymmetric distance with ecc = 0.50 (first two columns of Figure 3) produce remarkably similar results. This indicates that the KCA algorithm does not perform significantly differently when using these two metrics. The reason becomes clear when observing the isodistance surfaces (Figure 2A,B): ecc = 1.00 corresponds to Euclidean distance, and ecc = 0.50 still results in a near-spherical shape. Indeed, with ecc = 0.50 the difference between the major and minor axes of the prolate ellipsoid is modest, and the isodistance surface remains close to the sphere of the Euclidean metric.

The results change significantly when decreasing to ecc = 0.25 and even more to ecc = 0.10. Even with six clusters, the asymmetric metric with ecc = 0.25 highlights subcellular regions (yellow areas in the picture of Figure 3) that are not detected with Euclidean and ecc = 0.50 metrics, while preserving the general structure of other regions (blue, reddish, and green areas). It is important to note that all images in the first row of Figure 3 correspond to clustering with six clusters. But for Euclidean and ecc = 0.50 metrics, we have four different reddish colors (dark red, purple, red, and magenta), while in the case of ecc = 0.25, we have only three reddish colors (purple, red, and magenta) along with one additional yellow cluster. This means that ecc = 0.25 improves segmentation by reducing redundant clusters (the four reddish tones in the Euclidean and ecc = 0.50 metrics) and enhancing the detection of biochemical differences due to spectral vector orientation. This capability is even more pronounced at ecc = 0.10 (again with six clusters): the corresponding image shows well delimited and clean subcellular regions, where blue and reddish areas are better defined within a more extended yellow region (see below in the text for color assignment).

The pure polar metric can be regarded as an extreme case of the asymmetric metric, where the ellipsoid becomes highly elongated, approaching a straight-line form. This is just a point of view for comparison with the asymmetric metric. Mathematically, the pure polar distance is a cosine-based metric that considers only angular deviations between spectral vectors. In the last image of the first row (Figure 3), the pure polar result appears excessively noisy: while the reddish and blue areas are well defined, the yellow and green regions are mixed. Moreover, only two reddish colors (red and purple) remain, and a light green cluster is added. This evident deterioration in segmentation is likely due to the exclusive reliance of the polar metric on vector orientation, without consideration of spectral intensity (i.e., vector magnitude).

When moving to seven clusters (second row of Figure 3), we first note that, for the Euclidean and ecc = 0.50 asymmetric metrics, the additional cluster results in a yellow area like that mentioned above. For the ecc = 0.25 and ecc = 0.10 asymmetric metrics, instead, the additional cluster appears as a fourth reddish tone. The conclusion is that adding one cluster allows the Euclidean and ecc = 0.50 metrics to recover a piece of chemical information that was missing with six clusters (the yellow area), while the ecc = 0.25 and ecc = 0.10 metrics are forced to add a redundant cluster among the reddish tones. Further, all these metrics produce similar results with seven clusters (first four images of the second row), but again the ecc = 0.25 and ecc = 0.10 metrics lead to better-defined subcellular regions. The image of the pure polar metric with seven clusters is still noisy, and the blue area is divided into two sub-regions.

4.2. Biochemical Assignment of Clustered Regions

To elucidate the chemical meaning of the different clusters and to check redundant segmentation, Figure 4 shows the average spectra for the seven-cluster case. Since some clustering results are similar to each other, we examine the average curves resulting from Euclidean, asymmetric with ecc = 0.10, and polar metrics. Further, for clarity’s sake, we divide the lipid-originated spectra (reddish tones) from other biological contributions.

For all reddish spectra, we can see the presence of clear peaks at 2850 and 2880 cm⁻¹, which are well-known spectral markers for lipid molecules [17,18]. Moreover, the peaks at 1265 and 1300 cm⁻¹ are also related to lipid vibrations, and their intensity ratio I₁₂₆₅/I₁₃₀₀ correlates directly with the unsaturation degree [19]. Another indicator for the presence of C=C double bonds in lipids is the intensity of the 1660 cm⁻¹ peak compared with the intensity at 1445 cm⁻¹ [19]. However, this marker is more useful in the context of pure molecule analysis than in cellular datasets, because in cellular and tissue samples, protein vibrations also contribute to the 1660 cm⁻¹ intensity. Among the other average spectra, the yellow curve is characterized by peaks at 750, 1130, and 1582 cm⁻¹, which are typical vibrations of Cytochrome C and can be ascribed to mitochondria [20,21,22], both in perinuclear regions (endoplasmic reticulum) and in cytoplasmic regions (branched to cytoskeleton). The blue spectrum exhibits characteristic peaks of DNA bases at 785, 1340, and 1575 cm⁻¹ [23,24], and, therefore, blue clusters can be assigned to nuclear regions. Finally, the green curve resembles the average Raman signal of the overall cell, but with a significantly smaller intensity. These features, along with the outer location of green clusters, suggest that this signal originates from the thinner parts of the external cell membrane. The spectral features just described are shared among the average spectra obtained with the different metrics.
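The band ratios mentioned above can be read off an average spectrum with a few lines of code. The sketch below simply takes the intensity at the sampled wavenumber closest to each band position; the helper names and the numerical values are invented for illustration, and a real analysis might integrate over a band window instead.

```python
def intensity_at(wavenumbers, intensities, target):
    """Intensity at the sampled wavenumber closest to `target` (in cm^-1)."""
    index = min(range(len(wavenumbers)),
                key=lambda i: abs(wavenumbers[i] - target))
    return intensities[index]

def band_ratio(wavenumbers, intensities, numerator, denominator):
    """Ratio of the intensities at two band positions."""
    return (intensity_at(wavenumbers, intensities, numerator)
            / intensity_at(wavenumbers, intensities, denominator))

# Toy average spectrum sampled at a few wavenumbers (illustrative values only).
wavenumbers = [1265.0, 1300.0, 1445.0, 1660.0]
intensities = [0.30, 0.60, 1.00, 0.50]

unsaturation_ratio = band_ratio(wavenumbers, intensities, 1265.0, 1300.0)
double_bond_ratio = band_ratio(wavenumbers, intensities, 1660.0, 1445.0)
```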

At this point, we would like to mention that, although Raman microspectroscopy offers high chemical specificity and can, in principle, provide molecular fingerprints of proteins as well, in complex biological samples, such as whole cells and/or tissues, protein signals are generally less distinctive compared to those of lipids or nucleic acids. Raman bands associated with proteins are often more difficult to resolve due to overlap with other cellular components, and the most prominent protein-related Raman feature, the Amide I band (1600–1700 cm⁻¹), would require spectral deconvolution (e.g., using Gaussian or Pseudo-Voigt fits) to extract information about protein secondary structures [25]. However, such further spectral analysis would be beyond the scope of the present work.

In Figure 4, the lipid spectra for Euclidean and ecc = 0.10 asymmetric metrics can be considered redundant. As an example, if we consider the I₁₂₆₅/I₁₃₀₀ ratio and the 2850 and 2880 cm⁻¹ intensities, the red and magenta curves (the first two from the top) largely overlap, while the dark red curve (the bottom one) is significantly different from the others. The purple curve is a hybrid between the red-magenta curves and the dark-red one: it represents a transition curve, but it does not bring any new chemical information. Instead, the other curves (yellow, blue, and green) are strictly necessary to properly address nuclear, mitochondrial, and outer membrane regions. Conversely, the polar metric with seven clusters correctly addresses the lipidic compartment, assigning only two curves ascribable to different unsaturation degrees. But two redundant curves are introduced, one in the nuclear region and another in the outer membrane. The nuclear region is divided into two parts, but the corresponding average spectra (blue and dark cyan curves) do not exhibit significant differences. A similar argument holds for the division of the outer membrane and the corresponding green and light green curves.

4.3. Concluding Remarks on Asymmetric Metric Benchmark

In summary, seven clusters are needed for Euclidean and ecc = 0.50 asymmetric metrics to detect the presence of mitochondria signals, but at the expense of redundant segmentation in the lipidic regions, with four clusters assigned to lipids. On the other hand, ecc = 0.25 and even more ecc = 0.10 asymmetric metrics properly address mitochondria signals already with six clusters, and only a slight redundancy is present in the lipidic region with three clusters. Passing to seven clusters for ecc = 0.25 and 0.10 is not helpful and only increases the redundancy in the lipidic segmentation. The polar metric with six and seven clusters reveals the presence of mitochondria and works fine in the segmentation of lipidic signals, assigning only two clusters to lipids. But redundancy is produced in the other regions: in the case of seven clusters, nearly overlapping average spectra are assigned in both nuclear and outer membrane regions, while in the case of six clusters, this redundancy is present in the outer membrane. However, the main disadvantage of polar clustering is the excessive noise of the produced images, with mixed clusters and not well-defined subcellular regions.

The segmentation accuracy is assessed, in the case of six clusters, by projecting the clusters onto the first two principal components (PCs) computed by Principal Component Analysis (PCA). Unfortunately, there is no ground truth to compare with, and many internal metrics used for segmentation accuracy in the absence of ground truth, like the Dunn Index, the Silhouette method, the Davies-Bouldin Index, and the Calinski-Harabasz Index, are based on intra-cluster and inter-cluster characteristic distances. The effectiveness of these internal metrics with Euclidean distance reflects their fundamental reliance on symmetric distance metrics. In our case, an asymmetric distance metric is employed for clustering, which could impair the reliability of such indices (Dunn, DBI, CHI, and the Silhouette method).

Instead, the projection of clusters on PCs (Figure 5) provides a visual inspection of how the clustering behaves under different distance metrics. Since the PC1 axis is much more elongated than it appears in the figure (note the different scales for PC1 and PC2 scores), the Euclidean metric (top panel) fails to identify the yellow cluster dominated by mitochondrial signal, and merges it with the blue cluster dominated by nuclear signals. Since the Euclidean distance is isotropic and the PC1 axis dominates in terms of variance, clustering segmentation is primarily driven along the PC1 direction in the case of Euclidean distance. Conversely, the polar metric (bottom panel) is sensitive only to directional information (cosine similarities) and not to spectral vector magnitudes. It correctly addresses lipidic signals using only two clusters (red and purple symbols), but leads to significant overlap among mitochondria, cytosol, and membrane regions (yellow, green, and light green clusters). The proposed asymmetric distance offers a compromise between the Euclidean and polar cases by tuning the eccentricity parameter ecc. In the figure, clustering with ecc = 0.10 is shown (middle panel). Mitochondrial (yellow), nuclear (blue), and membrane (green) regions are clearly separated, while lipidic signals are assigned to three clusters (red, magenta, and purple symbols), which are less redundant than the four clusters of the Euclidean case. Hence, PCA projection suggests that the asymmetric distance improves the separation of relevant biochemical regions without producing excessive redundancy compared to the Euclidean and polar metrics.
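The projection onto the first two principal components can be sketched as follows. The example uses only the standard library, with power iteration and deflation standing in for the eigen-decomposition that a linear algebra package would normally provide; this choice, the function names, and the toy data are assumptions made here. The resulting score pairs are what one would plot and color by cluster label.

```python
import math

def center(data):
    """Subtract the mean of each column (spectral channel)."""
    means = [sum(col) / len(data) for col in zip(*data)]
    return [[v - m for v, m in zip(row, means)] for row in data]

def covariance(centered):
    """Sample covariance matrix of already centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(p)] for i in range(p)]

def leading_eigenpair(matrix, n_iter=500):
    """Power iteration; assumes a dominant eigenvalue exists."""
    p = len(matrix)
    v = [1.0 / (i + 1) for i in range(p)]      # arbitrary non-symmetric start
    for _ in range(n_iter):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Mv = [sum(m * x for m, x in zip(row, v)) for row in matrix]
    return sum(a * b for a, b in zip(v, Mv)), v

def first_two_pc_scores(data):
    """Return (PC1 score, PC2 score) for every row of `data`."""
    centered = center(data)
    cov = covariance(centered)
    val1, pc1 = leading_eigenpair(cov)
    # Deflation: remove the PC1 contribution, then repeat for PC2.
    p = len(cov)
    deflated = [[cov[i][j] - val1 * pc1[i] * pc1[j] for j in range(p)]
                for i in range(p)]
    _, pc2 = leading_eigenpair(deflated)
    return [(sum(a * b for a, b in zip(row, pc1)),
             sum(a * b for a, b in zip(row, pc2))) for row in centered]

# Toy two-channel dataset (illustrative values only).
data = [[0.0, 0.0], [1.0, 0.2], [2.0, -0.1], [3.0, 0.3], [4.0, 0.1]]
scores = first_two_pc_scores(data)
var_pc1 = sum(s[0] ** 2 for s in scores)
var_pc2 = sum(s[1] ** 2 for s in scores)
cross = sum(s[0] * s[1] for s in scores)
```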

Finally, the test performed with eight clusters (last row of Figure 3) produces redundancy with all metrics, but to different extents. Compared to the seven-cluster attempt, Euclidean and asymmetric metrics introduce a new redundancy in the outer membrane, where an unnecessary division into two regions (green and light green) can be observed. The polar metric, while preserving the lipidic segmentation to two clusters, produces an excessive redundancy in the outer membrane (where three clusters are assigned), and, consequently, a higher degree of noise affects the resulting cell image.

4.4. Further Validation by Analysing LX-2 Cells upon Treatment

Here, we propose a further example of K-means clustering with an asymmetric metric, still in the field of cellular spectral analysis. The cell under study is an LX-2 cell after proper treatment with Retinol and Palmitic acid (ROL+PA) [26]. LX-2 cells are human hepatic stellate cells (HSCs). HSCs are usually in a quiescent state, characterized by high expression of lipid droplets (LDs) for storing vitamin A. Upon liver injury, HSCs transdifferentiate towards a so-called active state, characterized by both high fibrogenesis and loss of LDs. In this active state, HSCs are a crucial player in originating and sustaining liver fibrosis. Upon specific treatments, or as the liver injury subsides, active HSCs could be reverted to a quiescent-like state, restoring a high expression of LDs. LX-2 cells are partly active HSCs, and when treated with a combination of Retinol and Palmitic acid are reverted to a quiescent-like state.

Figure 6 shows the results achieved by KCA with six clusters performed on the Raman hyperspectra recorded over an LX-2 cell after ROL+PA treatment. Euclidean, ecc = 0.10 asymmetric, and polar metrics have been used for KCA. As in the former case, by comparing the images obtained for Euclidean and asymmetric metrics, we can notice that subcellular regions are better defined in the case of asymmetric distance. The Euclidean metric leads to two subregions in the nuclear area, but with nearly overlapping average spectra (dark cyan and blue curves). Instead, the asymmetric metric correctly assigns a single cluster to the nuclear region, surrounded by a yellow area ascribable to the presence of CytC (mitochondria). The yellow average spectrum obtained with the asymmetric metric better resembles the pure CytC spectrum, while the yellow curve for the Euclidean metric is partly a mixture of CytC and lipid spectra, as denoted by large intensities at 2850 and 2880 cm⁻¹. This is likely due to an overlap of the yellow cluster with lipid regions. This issue is better resolved by the asymmetric distance, which introduces a transition region (the magenta cluster in the second row of Figure 6) between LDs and mitochondria.

The last row of the figure shows the results attained using the polar metric. We find again a noisy image, as observed above, with some mixing regions in the outer part of the cell where two redundant clusters (green and light green) are assigned. Correctly, a single cluster is assigned to the nuclear region, but its border is rough and irregular. A similar conclusion holds for the CytC area, whose average spectrum is as clean as that of asymmetric distance, but whose shape is fragmented and intermixed with the green clusters.

5. Discussion

Within the k-means clustering framework, the proposed asymmetric distance metric captures directional relationships in the data, a property particularly important for Raman signals characterized by gradual chemical or structural gradients. Although the core of this study is the direct comparison of clustering metrics within the KCA framework, in this section we briefly discuss how our approach relates to dimensionality reduction and clustering techniques commonly employed in Raman hyperspectral analysis.

Principal Component Analysis (PCA) is frequently used to reduce the dimensionality of Raman datasets prior to clustering [27,28,29]. However, PCA projects the data onto orthogonal components that maximize variance, which can result in the loss of fine spectral features critical for accurate biochemical discrimination. In contrast, the present approach operates directly in the original spectral space without dimensionality reduction, thereby preserving subtle but potentially important variations.

t-Distributed Stochastic Neighbor Embedding (t-SNE) [30,31] and Uniform Manifold Approximation and Projection (UMAP) [32,33,34] are powerful non-linear dimensionality reduction techniques that have been used to visualize high-dimensional spectral data. However, these methods are primarily designed for visualization rather than clustering. Moreover, t-SNE and UMAP embeddings can distort global structures, potentially obscuring underlying chemical relationships.

Hierarchical Clustering Analysis (HCA) has also been applied in Raman hyperspectral studies [35,36]. Compared to KCA, HCA does not require the number of clusters to be specified a priori, but it becomes computationally intensive for large datasets. Like KCA, HCA is highly sensitive to the choice of distance metric. However, KCA only requires local distance computations between data points and centroids, and can naturally incorporate a directional, asymmetric attraction toward centroids. In contrast, HCA relies heavily on a precomputed global distance matrix, and classical linkage criteria assume symmetric distances, such as Euclidean distance. As a result, applying an asymmetric distance in HCA could introduce ambiguities during the merging of clusters, making KCA a more appropriate framework for direct use of directional information in the data.
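The point about local, centroid-directed distances can be illustrated with a minimal sketch. It assumes a simplified anisotropic distance in which the component of the difference vector along the centroid direction is scaled by ecc, so that ecc = 1 recovers the Euclidean distance. This form and the names directional_distance and assign are hypothetical, chosen for illustration, and are not the metric matrix defined in the article.

```python
import math

def directional_distance(x, centroid, ecc):
    """Distance from spectrum x to a centroid (illustrative assumption):
    the component along the centroid direction is scaled by ecc."""
    norm = math.sqrt(sum(c * c for c in centroid))
    u = [c / norm for c in centroid]                    # unit vector along the centroid
    diff = [xi - ci for xi, ci in zip(x, centroid)]
    parallel = sum(d * ui for d, ui in zip(diff, u))    # component along u
    perp_sq = sum(d * d for d in diff) - parallel ** 2  # squared orthogonal component
    return math.sqrt((ecc * parallel) ** 2 + max(perp_sq, 0.0))

def assign(x, centroids, ecc):
    """K-means assignment step: index of the nearest centroid."""
    return min(range(len(centroids)),
               key=lambda k: directional_distance(x, centroids[k], ecc))

c0 = [1.0, 0.0, 0.0]
c1 = [3.0, 1.0, 0.0]
x = [3.0, 0.0, 0.0]  # same direction as c0, three times the intensity

print(assign(x, [c0, c1], ecc=1.0))  # 1: Euclidean picks the closer point c1
print(assign(x, [c0, c1], ecc=0.1))  # 0: the aligned centroid c0 wins

# The distance depends on which point defines the direction, so it is not symmetric.
print(directional_distance(c0, c1, 0.1), directional_distance(c1, c0, 0.1))
```

Each evaluation needs only one data point and one centroid, which is why the assignment step of KCA accommodates such a distance directly. A linkage criterion in HCA would instead have to decide which of the two unequal values to use for every pair of points.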

In summary, compared to commonly used dimensionality reduction and clustering techniques in Raman imaging, the proposed asymmetric clustering approach has two main advantages: (i) it preserves full spectral resolution without the need for dimensionality reduction; and (ii) it introduces directionality into the distance calculation, improving sensitivity to gradual biochemical changes. Future work could explore further applications of the asymmetric distance, such as its use after PCA-based dimensionality reduction or its combination with hierarchical methods adapted to asymmetric distances.

6. Conclusions

In this study, we demonstrated that the incorporation of an asymmetric metric matrix into K-means clustering significantly enhances the segmentation quality of Raman imaging data from cells. By systematically varying the eccentricity parameter (ecc), we evaluated the impact of this novel distance metric on the clustering performance compared to conventional Euclidean and polar metrics.

Our results indicate that the asymmetric metric provides a more refined and biologically meaningful segmentation of cellular regions, particularly at intermediate values of eccentricity (ecc = 0.25 and ecc = 0.10). At these settings, the algorithm effectively distinguishes subcellular structures, such as lipid compartments, mitochondria, and nuclear regions, while minimizing redundant segmentation. Furthermore, this improved segmentation is achieved with fewer clusters than the Euclidean metric, which requires additional clusters to resolve the same biochemical features. This highlights the efficiency of the asymmetric metric in capturing relevant spectral variations without excessive cluster fragmentation. In contrast, the polar metric, although effective in distinguishing lipid compartments, introduced excessive noise and instability in the clustering results.

Moreover, we observed that increasing the number of clusters beyond an optimal threshold does not necessarily improve segmentation accuracy. Instead, excessive clustering introduces redundancy, particularly in lipid-rich regions and the outer membrane. This finding suggests that the choice of both an appropriate distance metric and an optimal cluster number is crucial for obtaining biologically interpretable results in Raman-based cell imaging.

Overall, the implementation of asymmetric metric distances in K-means clustering represents a promising approach for cellular analysis through Raman microimaging. By providing a flexible and tunable metric, this method improves the biochemical differentiation of spectra while preserving a more regular shape of cellular subregions. Future work may explore the extension of this approach to other spectral imaging techniques and its potential integration with advanced machine learning techniques for automated biochemical characterization of cells. Additionally, further research could investigate the application of asymmetric metrics in other clustering techniques or mathematical frameworks where distance metrics play a critical role, such as hierarchical clustering, graph-based segmentation, or manifold learning.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app15084461/s1.

Author Contributions

Conceptualization, P.C.; Data curation, B.S.; Formal analysis, B.S. and P.C.; Investigation, B.S. and P.C.; Methodology, B.S. and P.C.; Software, P.C.; Supervision, P.C.; Validation, B.S. and P.C.; Visualization, B.S. and P.C.; Writing—original draft, P.C.; Writing—review and editing, B.S. and P.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article and Supplementary Material. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

KCA	K-means clustering analysis
3D	Three-dimensional
CytC	Cytochrome C
HSC	Hepatic stellate cell
ROL+PA	Retinol + Palmitic acid
PCA	Principal Component Analysis
PC	Principal Component
t-SNE	t-Distributed Stochastic Neighbor Embedding
UMAP	Uniform Manifold Approximation and Projection
HCA	Hierarchical Clustering Analysis

References

1. Palonpon, A.F.; Sodeoka, M.; Fujita, K. Molecular imaging of live cells by Raman microscopy. Curr. Opin. Chem. Biol. 2013, 17, 708–715.
2. Kallepitis, C.; Bergholt, M.S.; Mazo, M.M.; Leonardo, V.; Skaalure, S.C.; Maynard, S.A.; Stevens, M.M. Quantitative volumetric Raman imaging of three dimensional cell cultures. Nat. Commun. 2017, 8, 14843.
3. Abramczyk, H.; Brozek-Pluska, B. Raman imaging in biochemical and biomedical applications. Diagnosis and treatment of breast cancer. Chem. Rev. 2013, 113, 5766–5781.
4. Krafft, C.; Steiner, G.; Beleites, C.; Salzer, R. Disease recognition by infrared and Raman spectroscopy. J. Biophotonics 2009, 2, 13–28.
5. Ren, J.; Mao, S.; Lin, J.; Xu, Y.; Zhu, Q.; Xu, N. Research progress of Raman spectroscopy and Raman imaging in pharmaceutical analysis. Curr. Pharm. Des. 2022, 28, 1445–1456.
6. El-Mashtoly, S.F.; Petersen, D.; Yosef, H.K.; Mosig, A.; Reinacher-Schick, A.; Kötting, C.; Gerwert, K. Label-free imaging of drug distribution and metabolism in colon cancer cells by Raman microscopy. Analyst 2014, 139, 1155–1161.
7. Sun, X.; Sajda, P. Circular Clustering with Polar Coordinate Reconstruction. IEEE/ACM Trans. Comput. Biol. Bioinform. 2024, 21, 1591–1600.
8. Li, C.; Ding, S.; Xu, X.; Du, S.; Shi, T. Fast density peaks clustering algorithm in polar coordinate system. Appl. Intell. 2022, 52, 14478–14490.
9. Neghina, M.; Rasche, C.; Ciuc, M.; Sultana, A.; Tiganesteanu, C. Automatic detection of cervical cells in Pap-smear images using polar transform and k-means segmentation. In Proceedings of the 2016 Sixth International Conference on Image Processing Theory, Tools and Applications (IPTA), Oulu, Finland, 12–15 December 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–6.
10. Xu, L.; Chow, T.W.; Ma, E.W. Topology-based clustering using polar self-organizing map. IEEE Trans. Neural Netw. Learn. Syst. 2014, 26, 798–808.
11. Olszewski, D. Asymmetric k-Means clustering of the asymmetric self-organizing map. Neural Process Lett. 2016, 43, 231–253.
12. Hanwell, D.; Mirmehdi, M. QUAC: Quick unsupervised anisotropic clustering. Pattern Recognit. 2014, 47, 427–440.
13. Olszewski, D. Asymmetric k-means algorithm. In Adaptive and Natural Computing Algorithms, Proceedings of the 10th International Conference, ICANNGA 2011, Ljubljana, Slovenia, 14–16 April 2011; Part II 10; Springer: Berlin/Heidelberg, Germany, 2011; pp. 1–10.
14. Olszewski, D. K-means clustering of asymmetric data. In Hybrid Artificial Intelligent Systems, Proceedings of the 7th International Conference, HAIS 2012, Salamanca, Spain, 28–30 March 2012; Part I 7; Springer: Berlin/Heidelberg, Germany, 2012; pp. 243–254.
15. Candeloro, P.; Grande, E.; Raimondo, R.; Di Mascolo, D.; Gentile, F.; Coluccio, M.L.; Perozziello, G.; Malara, N.; Francardi, M.; Di Fabrizio, E. Raman database of amino acids solutions: A critical study of Extended Multiplicative Signal Correction. Analyst 2013, 138, 7331–7340.
16. Aguilera, A.; Pérez-Aguila, R. General n-dimensional rotations. In Proceedings of the WSCG '2004: Short Communications: The 12-th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, Plzeň, Czech Republic, 2–6 February 2004; pp. 1–8.
17. Krafft, C.; Knetschke, T.; Funk, R.H.; Salzer, R. Identification of organelles and vesicles in single cells by Raman microspectroscopic mapping. Vib. Spectrosc. 2005, 38, 85–93.
18. Czamara, K.; Majzner, K.; Pacia, M.Z.; Kochan, K.; Kaczor, A.; Baranska, M. Raman spectroscopy of lipids: A review. J. Raman Spectrosc. 2015, 46, 4–20.
19. Wu, H.; Volponi, J.V.; Oliver, A.E.; Parikh, A.N.; Simmons, B.A.; Singh, S. In vivo lipidomics using single-cell Raman spectroscopy. Proc. Natl. Acad. Sci. USA 2011, 108, 3809–3814.
20. Parrotta, E.; De Angelis, M.T.; Scalise, S.; Candeloro, P.; Santamaria, G.; Paonessa, M.; Coluccio, M.L.; Perozziello, G.; De Vitis, S.; Sgura, A. Two sides of the same coin? Unraveling subtle differences between human embryonic and induced pluripotent stem cells by Raman spectroscopy. Stem Cell Res. Ther. 2017, 8, 1–12.
21. Johannessen, C.; White, P.C.; Abdali, S. Resonance Raman optical activity and surface enhanced resonance Raman optical activity analysis of cytochrome c. J. Phys. Chem. A 2007, 111, 7771–7776.
22. Read, D.S.; Woodcock, D.J.; Strachan, N.J.; Forbes, K.J.; Colles, F.M.; Maiden, M.C.; Clifton-Hadley, F.; Ridley, A.; Vidal, A.; Rodgers, J. Evidence for phenotypic plasticity among multihost Campylobacter jejuni and C. coli lineages, obtained using ribosomal multilocus sequence typing and Raman spectroscopy. Appl. Environ. Microbiol. 2013, 79, 965–973.
23. van Manen, H.; Kraan, Y.M.; Roos, D.; Otto, C. Single-cell Raman and fluorescence microscopy reveal the association of lipid bodies with phagosomes in leukocytes. Proc. Natl. Acad. Sci. USA 2005, 102, 10159–10164.
24. Krafft, C.; Knetschke, T.; Siegner, A.; Funk, R.H.; Salzer, R. Mapping of single cells by near infrared Raman microspectroscopy. Vib. Spectrosc. 2003, 32, 75–83.
25. Zolea, F.; Biamonte, F.; Candeloro, P.; Di Sanzo, M.; Cozzi, A.; Di Vito, A.; Quaresima, B.; Lobello, N.; Trecroci, F.; Di Fabrizio, E. H ferritin silencing induces protein misfolding in K562 cells: A Raman analysis. Free Radic. Biol. Med. 2015, 89, 614–623.
26. Valentino, G.; Zivko, C.; Weber, F.; Brülisauer, L.; Luciani, P. Synergy of phospholipid—Drug formulations significantly deactivates profibrogenic human hepatic stellate cells. Pharmaceutics 2019, 11, 676.
27. Surmacki, J.; Brozek-Pluska, B.; Kordek, R.; Abramczyk, H. The lipid-reactive oxygen species phenotype of breast cancer. Raman spectroscopy and mapping, PCA and PLSDA for invasive ductal carcinoma and invasive lobular carcinoma. Molecular tumorigenic mechanisms beyond Warburg effect. Analyst 2015, 140, 2121–2133.
28. Kong, K.; Kendall, C.; Stone, N.; Notingher, I. Raman spectroscopy for medical diagnostics—From in-vitro biofluid assays to in-vivo cancer detection. Adv. Drug Deliv. Rev. 2015, 89, 121–134.
29. Meksiarun, P.; Ishigaki, M.; Huck-Pezzei, V.A.; Huck, C.W.; Wongravee, K.; Sato, H.; Ozaki, Y. Comparison of multivariate analysis methods for extracting the paraffin component from the paraffin-embedded cancer tissue spectra for Raman imaging. Sci. Rep. 2017, 7, 44890.
30. Kim, J.H.; Zhang, C.; Sperati, C.J.; Bagnasco, S.M.; Barman, I. Non-perturbative identification and subtyping of amyloidosis in human kidney tissue with Raman spectroscopy and machine learning. Biosensors 2023, 13, 466.
31. Stevens, F.; Carrasco, B.; Baeten, V.; Fernández Pierna, J.A. Use of t-distributed stochastic neighbour embedding in vibrational spectroscopy. J. Chemom. 2024, 38, e3544.
32. Sigle, M.; Rohlfing, A.; Kenny, M.; Scheuermann, S.; Sun, N.; Graeßner, U.; Haug, V.; Sudmann, J.; Seitz, C.M.; Heinzmann, D. Translating genomic tools to Raman spectroscopy analysis enables high-dimensional tissue characterization on molecular resolution. Nat. Commun. 2023, 14, 5799.
33. Vermeulen, M.; Smith, K.; Eremin, K.; Rayner, G.; Walton, M. Application of Uniform Manifold Approximation and Projection (UMAP) in spectral imaging of artworks. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2021, 252, 119547.
34. de Andrade Silva, T.; Dos Santos, G.F.S.; Prado, A.R.; Cavalieri, D.C.; Junior, A.G.L.; Pereira, F.G.; Díaz, C.A.; Guimarães, M.C.C.; Cassini, S.T.A.; de Oliveira, J.P. Surface-Enhanced Raman Scattering Combined with Machine Learning for Rapid and Sensitive Detection of Anti-SARS-CoV-2 IgG. Biosensors 2024, 14, 523.
35. Miljković, M.; Chernenko, T.; Romeo, M.J.; Bird, B.; Matthäus, C.; Diem, M. Label-free imaging of human cells: Algorithms for image reconstruction of Raman hyperspectral datasets. Analyst 2010, 135, 2002–2013.
36. Hedegaard, M.; Matthäus, C.; Hassing, S.; Krafft, C.; Diem, M.; Popp, J. Spectral unmixing and clustering algorithms for assessment of single cells by Raman microscopic imaging. Theor. Chem. Acc. 2011, 130, 1249–1260.

Figure 1. Spectrum similarities and dissimilarities. (A) The red curve represents a typical Raman spectrum recorded on cells, while the black curve corresponds to the same spectrum scaled by a factor c > 1. These spectra convey identical chemical information. (B) Representing the spectra as vectors in 3D space, r₀ and r₁ share the same direction, while r₂ deviates, indicating different chemical composition. This highlights that spectra aligned in the same direction should have smaller distances compared to those that are not aligned.

Figure 2. Isodistance surfaces calculated using an asymmetric metric matrix for different values of the eccentricity parameter ecc. The red circle indicates the point r₀, starting from which distances are computed; the blue dots delineate the shape of the isodistance surfaces, while the straight line represents the axis along which ellipsoids are expected to align. (A) Isotropic Euclidean distance corresponding to ecc = 1.00. (B–D) Anisotropic distances with ecc = 0.50, 0.25, and 0.10, respectively.

Figure 3. Evaluation of KCA using asymmetric metric distances on a Raman dataset acquired from a HepG2 cell. Different numbers of clusters (six, seven, and eight) and various distance metrics were tested. The first row presents images computed with six clusters using the Euclidean metric (first column), asymmetric metrics with eccentricities of 0.50 (second column), 0.25 (third column), and 0.10 (fourth column), as well as the polar metric (fifth column). The second and third rows display the corresponding images obtained with seven and eight clusters, respectively. The scale bar (black line in the first image) represents 10 µm for all images. The color legend on the right is intended to be beneficial to color-blind individuals.

Figure 4. Average spectra for the seven-cluster segmentation. The spectra on the left correspond to lipid-associated clusters, while those on the right are assigned to mitochondria (yellow), nuclei (blue), and the outer membrane (green). The average spectra are displayed for three different metrics: Euclidean (first row), asymmetric with ecc = 0.10 (second row), and polar (third row). The clustering images in each row correspond to those in Figure 3 and are included here for reference. The color legend on the left and color labels over the curves are intended to be beneficial to color-blind individuals.

Figure 5. Projection of clusters on the first two principal components for the six-cluster choice. In the absence of a ground truth for comparison, this projection helps to visualize and inspect how the clustering performs with the different distance metrics. The Euclidean metric (top panel) produces an overly redundant segmentation of lipids (red, magenta, purple, and dark red symbols) and fails to identify a mitochondrial cluster (absence of yellow symbols). On the other hand, the polar metric (bottom panel) describes the lipids with only two clusters (red and purple) but produces excessive overlap among the other clusters (yellow, green, and light green). The asymmetric distance (middle panel) behaves as a compromise between the Euclidean and polar metrics: the yellow, blue, and green clusters are well separated, while the three lipid clusters do not introduce excessive redundancy. The color legend on the right is intended to be beneficial to color-blind individuals.

Figure 6. K-means clustering analysis of an LX-2 cell after treatment with Retinol and Palmitic acid. Euclidean (top row), ecc = 0.10 asymmetric (middle row), and polar (bottom row) metrics are used with KCA. For each metric, the image segmentation is presented along with the cluster average spectra. The scale bar (black line in the top image) represents 10 μm. The color legend on the left and the color labels on the curves are intended to be beneficial to color-blind individuals.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Scopacasa, B.; Candeloro, P. Asymmetric Distance in K-Means Clustering Enhances Quality of Cells Raman Imaging. Appl. Sci. 2025, 15, 4461. https://doi.org/10.3390/app15084461

