1. Introduction
The tourism industry is a highly dynamic and financially relevant sector worldwide, especially in emerging regions such as Latin America, where it contributes significantly to employment and income generation. In this region, tourism generates one out of every ten jobs and attracts approximately 5% of total investment; moreover, in 2023 it contributed more than USD 629 billion to Latin American GDP and employed 24.6 million people, figures that underscore its economic significance (
IDB Invest, 2026). However, tourism companies face significant financial risks due to financial volatility and unforeseen events, such as financial or health crises, which directly affect the sector’s demand and stability (
Brida et al., 2021;
Inter-American Development Bank, 2024). In Colombia, particularly in cities such as Barranquilla, tourism has become a major financial catalyst, highlighting the need for analytical tools that enable a deeper understanding of firms’ financial behavior and support the strengthening of their sustainability and competitiveness in uncertain environments (
la Hoz et al., 2020;
MINCIT, 2024). In this regard, previous studies conducted in Barranquilla have successfully applied clustering techniques and neural networks to classify firms based on their financial behavior, demonstrating the usefulness of these methods in local business contexts (
la Hoz et al., 2020).
Financial statements offer a comprehensive assessment of a company’s financial health, delivering crucial indicators of liquidity, earnings, and financial stability (
Brealey et al., 2018;
Matias et al., 2024). However, traditional financial analysis approaches often prove insufficient to capture the complexity, heterogeneity, and high dispersion that characterize the tourism sector, particularly in volatile economic contexts and in local markets with diverse business structures (
Hair et al., 2019). These limitations hinder the identification of differentiated financial patterns and constrain the understanding of underlying structural behaviors within the data.
As an unsupervised learning technique, clustering groups companies with similar financial characteristics together. This enables hidden patterns to be identified and risk and profitability profiles to be segmented within the tourism sector (
Han et al., 2019;
James et al., 2021). Its application in financial analysis has proven effective for classifying firms according to their economic behavior, particularly in highly competitive and heterogeneous sectors such as tourism (
Griffin et al., 2023). Likewise, clustering techniques make it possible to capture inter-firm heterogeneity by avoiding the assumption that all companies follow the same strategy, which is especially relevant in local markets where microenterprises coexist alongside large hotel chains (
Adamska & Dąbrowski, 2021;
Celebi et al., 2013).
The application of clustering techniques and neural networks in the analysis of the financial statements and performance of tourism companies has gained popularity in recent years. Previous studies have employed these techniques to categorize firms based on financial indicators (
Vilas et al., 2022) as well as operational characteristics, whereas neural networks, often combined with clustering, have proven effective in analyzing the relations between variables and predicting outcomes, such as travel costs (
Herrera et al., 2022;
Tiwari & Tripathi, 2023). This approach relies on the use of normalized financial ratios to capture genuine differences in economic behavior, profitability, capital structure, liquidity, and operational efficiency, thereby providing a more accurate representation of firms’ financial dynamics (
Karahuta et al., 2017).
Several studies have analyzed the vulnerability of the tourism sector to financial risks, pointing out that economic instability in Latin America increases this risk (
ONU, 2022). Recent research suggests that combining clustering and neural networks improves the accuracy of financial profile analysis and forecasting. This enables a more proactive, strategic response to uncertainty (
Bravo et al., 2023;
Fontalvo Herrera & La Hoz-Granadillo, 2020). However, these studies have limitations, such as biases in the selection of financial data and the lack of sufficient time series in local markets, such as that of Barranquilla (
Fontalvo Herrera et al., 2023;
Kelliher et al., 2018).
In the context of Barranquilla’s tourism sector, behavior-based financial segmentation supported by clustering techniques enables a deeper understanding of the sector’s financial landscape. This approach facilitates the identification of financial profiles exhibiting similar behavioral patterns, which is essential for strategic analysis and informed decision-making (
De La Hoz & Polo, 2017). Based on this segmentation, it becomes possible to assess differences in financial performance and levels of risk exposure, thereby strengthening analytical capacity in the face of economic volatility and changing market conditions, phenomena that are recurrent in the tourism sector and further intensified by unforeseen events such as economic crises or pandemics (
Brida et al., 2021;
Inter-American Development Bank, 2024).
This study analyzes the financial behavior of tourism companies in Barranquilla through an approach based exclusively on unsupervised Machine Learning techniques, aimed at financial segmentation using key financial ratios. The analysis employs indicators such as return on assets (ROA), return on equity (ROE), the degree of financial leverage, the capital ratio, and a proxy for current liquidity, in order to identify differentiated and statistically supported financial archetypes. By focusing on financial statements corresponding to fiscal year 2024 (a period marked by post-pandemic recovery and a significant increase in tourism flows in Colombia), the study seeks to provide an up-to-date overview of the financial performance of Barranquilla’s tourism sector and the main challenges faced by firms within this context.
This study primarily aims to develop a financial segmentation approach based on clustering techniques that contributes to a better understanding of the financial behavior of tourism firms in Barranquilla. By classifying firms into financially homogeneous groups, the study seeks to identify relevant patterns in their financial statements, providing an analytical tool to strengthen efficiency, competitiveness, and enterprise risk analysis. Additionally, the results are intended to serve as input for investors and public policy makers, supporting the design of strategies aimed at strengthening and promoting the sustainable development of local tourism.
This research seeks to answer key questions for the financial analysis of the tourism sector in Barranquilla: What financial characteristics and dominant financial profiles distinguish tourism firms in Barranquilla when analyzed using normalized financial ratios? To what extent do unsupervised learning techniques enable the identification of robust and financially homogeneous groups of tourism firms based on their financial behavior? How can a clustering-based financial segmentation approach support the interpretation and strategic analysis of financial performance in the tourism sector?
Based on these questions, the following research objectives are established:
Characterize the financial information of tourism-sector firms in Barranquilla for fiscal year 2024 through the use of normalized financial ratios, in order to analyze their economic behavior.
Apply unsupervised Machine Learning techniques to segment tourism firms in Barranquilla into financially homogeneous clusters, identifying differentiated profiles based on capital structure, profitability, and liquidity.
Analyze and interpret the resulting financial profiles with the purpose of providing an analytical tool that supports strategic analysis and enhances the understanding of financial performance in the local tourism sector.
These objectives guide the research toward a structural analysis of firms’ financial behavior, providing relevant inputs for informed decision-making and for strengthening the tourism sector in Barranquilla.
2. State of the Art
To achieve the objectives of this study, the state of the art addresses the role of financial statements in business analysis, highlighting their relevance for the assessment of economic performance. In addition, clustering techniques are reviewed as unsupervised learning tools for the financial segmentation of firms and their application in identifying patterns of economic behavior within the tourism sector. Finally, previous studies employing Machine Learning–based approaches to support strategic analysis and decision-making in business contexts characterized by high heterogeneity and volatility are examined.
Financial statements constitute fundamental instruments for evaluating a firm’s economic and financial condition. Documents such as the balance sheet, income statement, and cash flow statement provide structured information on assets, liabilities, equity, revenues, and expenses, enabling a comprehensive understanding of an organization’s financial performance (
Brealey et al., 2018).
Financial analysis makes it possible to interpret these statements by means of specific indicators. For example, a liquidity analysis examines a company’s ability to cover its short-term obligations, while a profitability analysis assesses its ability to generate profit relative to its revenues, assets, or equity (
Matias et al., 2024). This type of analysis is especially relevant in the tourism sector, which is characterized by seasonal fluctuations along with a vulnerability to external shocks, such as economic or health crises (
IDB Invest, 2026).
Clustering is an unsupervised learning technique that identifies data patterns and groups them into homogeneous sets according to common characteristics. In the business world, clustering has been used to segment firms with similar financial profiles, facilitating the identification of patterns in variables such as profitability, capital structure, and liquidity that are not always evident through traditional analytical methods (
Hair et al., 2019).
In the context of the tourism sector, clustering has proven to be an effective tool for revealing underlying financial structures and heterogeneous behaviors among firms. A study by (
la Hoz et al., 2020) applied clustering techniques to segment tourism companies based on their financial performance, examining differences in profitability and solvency. Moreover, (
Herrera et al., 2022) applied this technique in Ecuador to classify tourism companies according to their financial health and capacity to adjust to economic fluctuations.
In particular, the combination of financial indicators and clustering techniques enables the identification of differentiated financial profiles without imposing prior assumptions on the data structure, which is especially valuable in sectors characterized by high volatility and heterogeneity (
De La Hoz & Polo, 2017).
Previous research has applied clustering-based and data analysis approaches across different geographical contexts, including the United States (
Dotson et al., 2014), Colombia (
Fontalvo Herrera & La Hoz-Granadillo, 2020), and Ecuador (
Herrera et al., 2022), demonstrating the versatility of these techniques in corporate financial analysis. However, a gap remains in the literature regarding the specific analysis of the tourism sector in the city of Barranquilla, particularly from a behavior-based financial segmentation perspective.
Although the application of clustering and Machine Learning techniques to financial analysis has expanded considerably in recent years, important gaps remain in the literature, particularly in the context of local tourism economies. Most existing studies focus on national or cross-country comparisons, while relatively little attention has been given to city-level markets characterized by high firm concentration, structural heterogeneity, and a predominance of microenterprises. In the case of Barranquilla, despite the economic relevance of tourism, there is still limited empirical evidence examining financial behavior from a segmentation perspective.
Moreover, much of the previous research combines clustering with predictive or forecasting models, emphasizing classification accuracy rather than the structural interpretation of financial patterns. As a result, fewer studies concentrate on identifying clear and statistically robust financial archetypes derived exclusively from unsupervised learning approaches. This limits the understanding of how firms differ in terms of capital structure, profitability dynamics, and liquidity conditions beyond traditional categorizations.
Another recurring limitation in the literature is the reliance on firm size as a primary classification criterion. While size-based groupings provide useful descriptive information, they may conceal substantial heterogeneity in financial behavior within the same scale category, particularly in sectors such as tourism, where microenterprises dominate but operate under very different economic conditions.
Against this backdrop, the present study seeks to advance the literature by offering a behavior-based financial segmentation of tourism firms in Barranquilla grounded in normalized financial ratios and robust unsupervised clustering techniques. Rather than presupposing structural differences based on scale, the analysis allows financial patterns to emerge directly from the data. The findings not only provide updated empirical evidence for a post-pandemic local context but also demonstrate that firm size does not explain the observed segmentation, reinforcing the relevance of structural financial behavior as a more meaningful differentiating factor. In doing so, the study contributes a replicable and interpretable framework for analyzing financial heterogeneity in tourism-oriented economies.
3. Materials and Methods
The present study adopts an observational design with a quantitative approach and is aimed at identifying patterns of financial behavior among firms in the tourism sector of Barranquilla through the application of unsupervised learning techniques. The analysis is based on a secondary financial dataset corresponding to fiscal year 2024, obtained from the Barranquilla Chamber of Commerce, the official institution responsible for the collection, validation, and administration of financial statements for companies registered within its jurisdiction. The dataset comprises firm-level administrative and accounting information for tourism-related companies domiciled in Barranquilla, Colombia, and the initial sample consists of 563 active firms with available financial statements for the reference year.
The methodological framework focuses on transforming raw accounting variables into normalized financial ratios and subsequently applying clustering algorithms to group firms into homogeneous profiles according to their financial structure, profitability, and liquidity. This approach allows the identification of non-linear relationships among the analyzed variables and facilitates a behavior-oriented segmentation of firms rather than one driven solely by size or scale (
Figure 1).
The workflow was implemented using the Python programming language (v3.10, Python Software Foundation, Wilmington, DE 19801, USA) within an interactive development environment (Jupyter Notebook, version 7.5.5, Project Jupyter, Berkeley, CA, USA), ensuring the reproducibility of the analysis. Data processing and manipulation were carried out using the Pandas and NumPy libraries; statistical and algorithmic modeling was conducted with Scikit-learn, UMAP-learn, and SciPy; and visual diagnostics and exploratory validation were supported by Matplotlib and Seaborn.
3.1. Data Understanding and Feature Engineering
The original variables comprise total assets, net equity, total liabilities, net income, current assets, subscribed capital, asset size classification, and renewal date. Since these raw accounting figures are strongly influenced by firm scale, they were not directly suitable for unsupervised segmentation. Therefore, classical financial ratios were first constructed to normalize firm performance, including Return on Assets (ROA), Return on Equity (ROE), financial leverage, equity ratio, and an approximated current liquidity ratio, enabling comparisons across heterogeneous firms based on financial behavior rather than size.
Subsequently, a feature engineering stage was applied to enrich the financial representation and capture multidimensional characteristics relevant for clustering. Six additional metrics were defined, covering financial structure (liability ratio and capital multiplier), liquidity (current asset ratio), solvency (debt coverage), stability (equity-to-current-assets ratio), and efficiency (return on current assets). In total, 11 financial metrics were retained and grouped into four dimensions: profitability (3 metrics), financial structure (4 metrics), liquidity (3 metrics), and solvency (1 metric). Due to the absence of operational revenue and expense variables, efficiency indicators based on income flows could not be computed; therefore, the analysis focuses on balance-sheet-driven dimensions. The preprocessing stage included exploratory data analysis, correlation analysis using a |r| > 0.7 threshold, detection of missing values, and outlier identification via the interquartile range (IQR) method. A summary of financial metrics used in the analysis is provided in
Table 1.
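As an illustration of the ratio construction described above, the following minimal sketch derives the five classical ratios from raw balance-sheet columns. The column names and the exact formulas (for instance, leverage as liabilities over equity and the liquidity proxy as current assets over total liabilities) are assumptions made for this example; the study does not state its precise definitions here.

```python
import numpy as np
import pandas as pd


def safe_div(num: pd.Series, den: pd.Series) -> pd.Series:
    """Element-wise division that yields NaN where the denominator is zero."""
    return num / den.replace(0, np.nan)


def build_ratios(df: pd.DataFrame) -> pd.DataFrame:
    """Build scale-free ratios from raw accounting columns (hypothetical names)."""
    out = pd.DataFrame(index=df.index)
    out["roa"] = safe_div(df["net_income"], df["total_assets"])
    out["roe"] = safe_div(df["net_income"], df["net_equity"])
    # Assumed definitions, not confirmed by the source:
    out["leverage"] = safe_div(df["total_liabilities"], df["net_equity"])
    out["equity_ratio"] = safe_div(df["net_equity"], df["total_assets"])
    out["liquidity_proxy"] = safe_div(df["current_assets"], df["total_liabilities"])
    return out


# Two made-up firms; the second has zero equity, so ROE and leverage are undefined.
firms = pd.DataFrame({
    "total_assets": [100.0, 200.0],
    "net_equity": [50.0, 0.0],
    "total_liabilities": [50.0, 200.0],
    "net_income": [10.0, -20.0],
    "current_assets": [25.0, 100.0],
})
ratios = build_ratios(firms)
```

Returning NaN for zero denominators, rather than zero or infinity, keeps undefined ratios visible as missing values, which matches the decision not to impute them.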
3.2. Treatment of Missing Values and Outliers
Missing values were not imputed, as they represent genuine absence of reported financial information and artificial imputation could introduce bias in clustering outcomes. Instead, the analysis prioritized metrics with full availability across firms in subsequent stages.
Outliers were identified using the interquartile range (IQR) method, defining lower and upper bounds as Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, respectively. Rather than removing entire observations, extreme values were retained to preserve potentially informative financial behaviors. To reduce the influence of extreme observations on distance-based algorithms, robust preprocessing techniques were applied in later stages.
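The IQR rule above can be sketched as follows. The function name and the sample values are illustrative only; the sketch flags extreme observations without removing them, in line with the retention strategy described.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; missing values are never flagged."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Made-up ROA values with one extreme observation.
roa = pd.Series([0.01, 0.02, 0.03, 0.04, 0.05, 0.90])
mask = iqr_outlier_mask(roa)
outlier_share = mask.mean()  # proportion of flagged observations
```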
3.3. Feature Selection
Feature selection was conducted to ensure statistical robustness, interpretability, and suitability for clustering.
First, a correlation matrix was constructed to detect multicollinearity among the derived financial ratios. Highly correlated variables were considered redundant, and only one representative metric was retained based on financial interpretability and prevalence in prior literature. Second, a variance threshold criterion was applied to remove features with near-zero variance, ensuring that all retained variables contributed meaningful discriminatory information.
Finally, the optimal subset of financial metrics was selected based on a combined criterion balancing: (i) low redundancy, (ii) sufficient variance, (iii) data completeness, and (iv) financial relevance. This subset constitutes the feature space used as input for the clustering procedures described in the following section.
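A minimal sketch of the first two selection steps (redundancy filtering followed by a variance check) might look like this. The priority ordering, the variance cutoff, and the synthetic data are assumptions for illustration; the study chooses the representative metric by financial interpretability rather than by a fixed list.

```python
import numpy as np
import pandas as pd


def select_features(df, priority, corr_threshold=0.7, min_variance=1e-8):
    """Keep features in priority order, skipping any that is highly correlated
    with an already kept feature, then drop near-constant features."""
    corr = df.corr().abs()
    kept = []
    for col in priority:
        if not any(corr.loc[col, other] > corr_threshold for other in kept):
            kept.append(col)
    variances = df[kept].var()
    return [col for col in kept if variances[col] > min_variance]


# Synthetic example: capital_multiplier is a linear function of leverage,
# and "constant" carries no discriminatory information.
rng = np.random.default_rng(0)
leverage = rng.normal(size=200)
demo = pd.DataFrame({
    "leverage": leverage,
    "capital_multiplier": 1.0 + leverage,
    "roa": rng.normal(size=200),
    "constant": np.ones(200),
})
selected = select_features(
    demo, ["leverage", "capital_multiplier", "roa", "constant"]
)
```

Listing the preferred metric first in `priority` is one simple way to encode the "retain the more standard indicator" rule.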
3.4. Optimal Number of Clusters
The optimal number of clusters was determined using a multi-criteria validation framework to ensure methodological rigor and reproducibility. Prior to clustering, the selected financial ratios were normalized to ensure scale comparability, evaluating both StandardScaler (zero mean and unit variance) and RobustScaler (median centering with interquartile range scaling); the latter was selected due to its robustness to outliers typically present in financial data. Clustering was performed using k-means as the base algorithm with k-means++ initialization, evaluating a range of cluster solutions from k = 2 to k = 10. Cluster validity was assessed through complementary internal and comparative techniques, including the Elbow Method based on within-cluster sum of squares (WCSS) to analyze inertia reduction, the Silhouette Score to evaluate intra-cluster cohesion and inter-cluster separation, the Davies–Bouldin Index to assess cluster compactness and overlap, and the Calinski–Harabasz Index to quantify the ratio of between- to within-cluster variance. Additionally, dendrogram analysis was employed to support structural inspection in hierarchical clustering.
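The validation loop described above can be sketched with scikit-learn as follows. This is an illustrative outline on synthetic data, not the study's code; the random seed, the number of restarts (`n_init`), and the reduced range of k in the demo are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import RobustScaler


def evaluate_k(X, k_values, seed=42):
    """Fit k-means for each k on robust-scaled data and collect validity metrics."""
    X_scaled = RobustScaler().fit_transform(X)
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=seed)
        labels = model.fit_predict(X_scaled)
        rows.append({
            "k": k,
            "wcss": model.inertia_,  # used for the elbow plot
            "silhouette": silhouette_score(X_scaled, labels),
            "davies_bouldin": davies_bouldin_score(X_scaled, labels),
            "calinski_harabasz": calinski_harabasz_score(X_scaled, labels),
        })
    return rows


# Synthetic data with three well-separated groups (not the study's dataset).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
scores = evaluate_k(X_demo, range(2, 7))
best_k = max(scores, key=lambda row: row["silhouette"])["k"]
```

Higher Silhouette and Calinski–Harabasz values and lower Davies–Bouldin values indicate better partitions, so the metrics should be read together rather than individually.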
3.5. Clustering Algorithm Selection
The selection of the clustering algorithm was performed through a systematic and quantitative comparison of multiple unsupervised learning techniques representing different clustering paradigms. The evaluated methods included partitional approaches (K-Means as baseline and K-Medoids using the PAM algorithm for increased robustness to outliers), hierarchical clustering with different linkage criteria (Agglomerative Clustering using Ward, Complete, and Average linkages), density-based algorithms (DBSCAN, OPTICS, and HDBSCAN), and probabilistic models (Gaussian Mixture Models) capable of capturing elliptical cluster structures. To ensure an objective comparison, all algorithms were evaluated using a consistent set of quantitative validation metrics, including the Silhouette Score to assess intra-cluster cohesion and inter-cluster separation, the Davies–Bouldin Index to evaluate cluster compactness and overlap, the Calinski–Harabasz Index to measure the ratio of between- to within-cluster variance, and a cluster balance metric designed to penalize extreme disparities in cluster sizes. These metrics were normalized to a common [0, 1] scale and aggregated into a composite score using balanced weights (25% per metric), which served as the primary optimization objective. Hyperparameter tuning was conducted independently for each algorithm using a randomized search strategy with up to 1000 iterations, subject to a strict validity criterion that retained only configurations producing meaningful cluster structures, including a minimum cluster size threshold of 5% of the total sample. The optimization process was constrained by a maximum execution time per algorithm to ensure computational feasibility. Algorithms that failed to converge or produced degenerate clustering solutions under these constraints were excluded from further consideration, while successfully tuned models were retained for further analysis.
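One possible reading of the composite score is sketched below. The balance metric (smallest cluster size divided by largest), the min-max normalization across candidates, and the inversion of the Davies–Bouldin Index are assumptions, since the text does not specify them; the metric values in the example are made up.

```python
import numpy as np


def balance(labels):
    """Assumed balance metric: smallest cluster size over largest (1 = balanced)."""
    _, counts = np.unique(labels, return_counts=True)
    return counts.min() / counts.max()


def meets_min_size(labels, min_share=0.05):
    """Validity check: every cluster holds at least min_share of the sample."""
    _, counts = np.unique(labels, return_counts=True)
    return bool(counts.min() / counts.sum() >= min_share)


def min_max(values, higher_is_better=True):
    """Scale candidate scores to [0, 1]; invert when lower raw values are better."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    scaled = np.full_like(v, 0.5) if span == 0 else (v - v.min()) / span
    return scaled if higher_is_better else 1.0 - scaled


def composite_scores(silhouette, davies_bouldin, calinski_harabasz, balance_values):
    """Equal-weight (25% each) aggregate of the four normalized metrics."""
    parts = [
        min_max(silhouette),
        min_max(davies_bouldin, higher_is_better=False),
        min_max(calinski_harabasz),
        min_max(balance_values),
    ]
    return 0.25 * np.sum(parts, axis=0)


# Three hypothetical candidate configurations.
scores = composite_scores(
    silhouette=[0.8, 0.5, 0.2],
    davies_bouldin=[0.4, 0.9, 1.4],
    calinski_harabasz=[300.0, 200.0, 100.0],
    balance_values=[0.2, 0.6, 1.0],
)
best_candidate = int(np.argmax(scores))
```

In this toy case the first candidate wins on three metrics but is penalized for imbalance, which shows how the balance term tempers purely geometric criteria.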
3.6. Cluster Characterization and Business Interpretation
Cluster characterization was conducted through a structured statistical and visual analysis aimed at identifying differentiated financial profiles while preserving methodological rigor and interpretability. For each cluster, descriptive statistics, including mean, median, and standard deviation, were computed across the selected financial metrics to summarize central tendency and dispersion. Given the non-Gaussian behavior commonly observed in financial ratios, non-parametric hypothesis testing was employed to assess the statistical significance of inter-cluster differences, using the Kruskal–Wallis test to evaluate global differences in medians across clusters and the Mann–Whitney U test for pairwise comparisons when applicable. To support intuitive and comparative analysis, multiple visualization techniques were applied, including radar charts constructed from normalized financial metrics to represent cluster-level profiles and heatmaps based on median values to highlight relative intensities across features. The analysis was performed on a dataset comprising companies with complete financial information, using a fixed set of financial features consistently across all clusters, while density-based outliers identified during clustering were treated separately to avoid distortion of cluster profiles. This combined statistical and visual framework enabled a robust characterization of clusters and provided a foundation for subsequent business-oriented interpretation.
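The non-parametric testing step can be sketched with SciPy as shown below. The data are synthetic and the function name is hypothetical; note that this sketch applies no multiple-comparison correction to the pairwise tests, which the text does not specify either way.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kruskal, mannwhitneyu


def compare_clusters(values, labels, alpha=0.05):
    """Kruskal-Wallis across all clusters, then pairwise Mann-Whitney U tests
    only if the global test is significant."""
    groups = {int(c): values[labels == c] for c in np.unique(labels)}
    h_stat, p_global = kruskal(*groups.values())
    pairwise = {}
    if p_global < alpha:
        for a, b in combinations(groups, 2):
            _, p_value = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
            pairwise[(a, b)] = p_value
    return h_stat, p_global, pairwise


# Synthetic metric for three clusters with clearly shifted locations.
rng = np.random.default_rng(0)
values = np.concatenate(
    [rng.normal(loc=m, scale=1.0, size=40) for m in (0.0, 3.0, 6.0)]
)
labels = np.repeat([0, 1, 2], 40)
h_stat, p_global, pairwise = compare_clusters(values, labels)
```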
3.7. Dimensionality Reduction for Visualization
Dimensionality reduction techniques were applied to project the high-dimensional financial feature space into two dimensions for visualization and cluster interpretation, with the objective of identifying the method that best preserves the underlying data structure. Three complementary approaches were evaluated: Principal Component Analysis (PCA) as a linear variance-maximization baseline, t-distributed Stochastic Neighbor Embedding (t-SNE) as a non-linear technique emphasizing local neighborhood structure, and Uniform Manifold Approximation and Projection (UMAP) to balance local and global structural preservation. To enable an objective comparison, the quality of the two-dimensional embeddings was assessed using quantitative metrics capturing different aspects of structural fidelity, including trustworthiness to measure local neighborhood preservation, continuity to evaluate the maintenance of global structural relationships, and Spearman rank correlation to quantify the correspondence between pairwise distances in the original and reduced spaces. These metrics were normalized and aggregated into a composite score using predefined weighted contributions to ensure balanced evaluation. PCA was evaluated using the full set of principal components for variance analysis prior to two-dimensional projection, while t-SNE was tested across multiple perplexity values to explore sensitivity to local versus global structure, and UMAP was evaluated using different neighborhood sizes to assess its ability to balance structural scales.
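The embedding-quality metrics can be illustrated for the PCA baseline as follows. Computing continuity as trustworthiness with the roles of the original and embedded spaces swapped is an assumption of this sketch, as are the neighborhood size and the synthetic data; t-SNE and UMAP embeddings could be passed to the same function.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness


def embedding_quality(X, embedding, n_neighbors=5):
    """Return trustworthiness, continuity, and distance rank correlation."""
    trust = trustworthiness(X, embedding, n_neighbors=n_neighbors)
    # Continuity: same measure with original and embedded spaces swapped.
    cont = trustworthiness(embedding, X, n_neighbors=n_neighbors)
    # Spearman correlation between pairwise distances in both spaces.
    rho, _ = spearmanr(pdist(X), pdist(embedding))
    return {"trustworthiness": trust, "continuity": cont, "spearman": rho}


# Synthetic 5-dimensional data lying close to a 2-dimensional plane.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
X_demo = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(100, 5))
embedding = PCA(n_components=2).fit_transform(X_demo)
quality = embedding_quality(X_demo, embedding)
```

Because the synthetic data are almost exactly two-dimensional, all three metrics should be close to 1 here; real financial ratios would typically score lower.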
4. Results
The results section presents the empirical findings derived from the application of the proposed unsupervised learning framework to the financial data of tourism firms in Barranquilla. First, the characteristics of the dataset and the construction of the financial feature space are described, including data completeness, variability, and redundancy patterns. Subsequently, the results of the feature selection process, cluster validation, algorithm comparison, and cluster characterization are reported using quantitative metrics and statistical tests. Finally, dimensionality reduction outcomes are presented to support the visualization and interpretation of the identified financial behavior profiles.
4.1. Dataset Overview and Feature Engineering Outcomes
The final dataset analyzed comprised 525 active firms from the tourism sector in Barranquilla, predominantly classified as micro and small enterprises. After constructing the complete set of financial metrics, only 125 firms (23.8%) presented non-missing values across all variables. This reduction reflects the uneven availability of financial information within the sector, particularly among firms with limited operational activity or recent market entry.
A total of 11 financial metrics were obtained for subsequent clustering analysis, including five classical financial ratios (ROA, ROE, leverage, equity ratio, and approximated current liquidity) and six additional metrics derived through feature engineering, capturing dimensions of financial structure, liquidity, solvency, stability, and efficiency. Data completeness varied across metrics. The current asset ratio showed full coverage, while profitability- and leverage-related indicators exhibited a high proportion of zero values, with 68–76% of firms reporting zero observations. The debt coverage ratio presented the lowest availability, with valid data for 168 firms.
Outlier detection using the interquartile range (IQR) method identified extreme values in all 11 financial metrics, with outlier proportions exceeding 5% in every case. The highest incidence of outliers was observed for the equity-to-current-assets ratio (34.3%), followed by ROA (27.6%) and the capital multiplier (26.9%), indicating substantial dispersion in financial profiles across firms. Correlation analysis revealed several strong linear relationships among the metrics (|r| > 0.7), as shown in
Table 2. Notably, near-perfect correlations were observed between leverage and capital multiplier (r = 0.998) and between ROA and return on current assets (r = 0.997), along with high correlations among profitability-related indicators and between debt coverage and profitability measures. These results highlight the presence of redundancy within the feature space prior to clustering.
4.2. Feature Selection Results
Feature selection was performed to define a statistically robust and interpretable feature space suitable for clustering. Correlation analysis identified two pairs of variables with near-perfect linear dependence (|r| > 0.99). Specifically, the capital multiplier exhibited an almost perfect correlation with leverage (r = 0.998), and return on current assets was nearly identical to ROA (r = 0.997). In both cases, the more widely used and standard financial indicators (leverage and ROA) were retained, while their redundant counterparts were excluded from further analysis.
Variance analysis of the remaining metrics confirmed that all retained variables exhibited sufficient dispersion, and no features were removed due to near-zero variance. Subsequently, a trade-off analysis was conducted to assess the impact of increasing the number of features on data availability (
Figure 2). The results show that data completeness remained stable when using up to six features, with approximately 500 firms (100%) retaining complete information. When expanding the feature set to eight metrics, the number of usable observations remained close to the full sample (approximately 500 firms; 97.1% completeness). In contrast, including a ninth feature led to a sharp reduction in the effective sample size, leaving only about 100 firms (31.8%) with complete data. Based on this balance between feature richness, interpretability, and data completeness, an eight-feature subset was selected for clustering.
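The trade-off between feature count and usable sample size can be sketched as a cumulative completeness curve. The feature ordering and the toy data are assumptions for illustration; the figures reported above come from the study's own dataset, not from this example.

```python
import numpy as np
import pandas as pd


def completeness_curve(df, ordered_features):
    """For each prefix of the feature list, count firms with no missing values."""
    rows = []
    for n in range(1, len(ordered_features) + 1):
        subset = ordered_features[:n]
        n_complete = int(df[subset].notna().all(axis=1).sum())
        rows.append((n, n_complete, n_complete / len(df)))
    return pd.DataFrame(rows, columns=["n_features", "n_firms", "share"])


# Toy data: the third feature is missing for seven of ten firms.
toy = pd.DataFrame({
    "current_asset_ratio": np.linspace(0.1, 1.0, 10),
    "equity_ratio": np.linspace(0.2, 0.8, 10),
    "debt_coverage": [0.5, 0.7, 0.9] + [np.nan] * 7,
})
curve = completeness_curve(
    toy, ["current_asset_ratio", "equity_ratio", "debt_coverage"]
)
```

A sharp drop in `n_firms` when a feature is added signals that the feature costs more in sample size than it contributes in information.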
The final feature space used for clustering comprises eight financial metrics, covering profitability (ROA and ROE), financial structure (leverage, equity ratio, and liability ratio), and liquidity (approximated current liquidity, current asset ratio, and equity-to-current-assets ratio) (
Table 3). Residual correlation analysis among the selected features revealed no problematic dependencies (|r| > 0.7), except for the expected correlation between ROA and ROE (r ≈ 0.77), which were retained due to their complementary perspectives on firm performance.
4.3. Determination of the Optimal Cluster Structure
The optimal number of clusters was evaluated using complementary internal validation metrics applied to k-means clustering solutions ranging from k = 2 to k = 10. The results obtained from each validation method are summarized in Table 4, while the comparative behavior of the metrics across different values of k is illustrated in
Figure 3. Additionally, the hierarchical structure of the data was inspected through dendrogram analysis, as shown in
Figure 4.
The Elbow Method, based on the within-cluster sum of squares (WCSS), showed a pronounced reduction in inertia from k = 2 to k = 3, followed by a substantially slower rate of decrease for larger values of k. This change in slope suggests a clear inflection point at k = 3, with only marginal improvements in cluster compactness for k ≥ 4, indicating diminishing returns from adding additional clusters beyond this point.
The Silhouette Score achieved its highest values for k = 2 and k = 3, with scores close to 1, indicating strong intra-cluster cohesion and well-separated clusters. As the number of clusters increased beyond k = 3, the silhouette score declined markedly, reflecting reduced clustering quality and increasing overlap between clusters.
The Davies–Bouldin Index exhibited its lowest values in the range of k = 2 to k = 4, indicating improved cluster compactness and minimal inter-cluster similarity within this interval. For larger values of k, the index increased steadily, suggesting a degradation in clustering performance as clusters became less distinct.
The Calinski–Harabasz Index increased monotonically with k, reaching its highest value at k = 10. However, rather than relying on the absolute maximum, the relative rate of increase was considered. The most substantial gains occurred between k = 2 and k = 4, after which improvements became progressively smaller, indicating limited benefit from further partitioning of the data.
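The four internal validation criteria discussed above can be computed in a single loop over candidate values of k. The following is a minimal sketch using scikit-learn on synthetic data; the helper name `evaluate_k_range`, the `n_init` and `random_state` settings, and the toy dataset are assumptions for illustration, not the configuration used in the study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler


def evaluate_k_range(X, k_values, random_state=0):
    """Fit k-means for each k and collect internal validation metrics."""
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        labels = model.labels_
        rows.append({
            "k": k,
            "wcss": model.inertia_,                                   # Elbow Method input
            "silhouette": silhouette_score(X, labels),                # higher is better
            "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
            "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        })
    return rows


# Synthetic illustration: three well-separated groups in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_raw = np.vstack([c + 0.5 * rng.normal(size=(100, 2)) for c in centers])
X = StandardScaler().fit_transform(X_raw)

results = evaluate_k_range(X, range(2, 7))
best_by_silhouette = max(results, key=lambda row: row["silhouette"])["k"]
for row in results:
    print(row)
```

Because the criteria have different orientations (the Davies–Bouldin Index is minimized, the others maximized) and the Calinski–Harabasz Index may keep growing with k, inspecting all four side by side is more informative than relying on any single maximum.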
The hierarchical clustering dendrogram constructed using Ward linkage revealed pronounced vertical gaps among the last fusion steps, suggesting natural partitioning structures between
and
. The largest separations were observed in the final 30 merges, supporting the presence of a limited number of well-defined clusters (
Figure 4).
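The gap-based reading of a Ward dendrogram can be approximated numerically by looking for the largest jump between consecutive merge heights. The sketch below uses SciPy on synthetic data; the helper `largest_gap_k` and the choice to inspect only the last ten merges are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def largest_gap_k(X, max_k=10):
    """Suggest a cluster count from the largest gap among the final Ward merges."""
    Z = linkage(X, method="ward")
    heights = Z[:, 2]            # merge distances, non-decreasing for Ward linkage
    last = heights[-max_k:]      # the final merges, ending with the 2 -> 1 merge
    gaps = np.diff(last)
    # Cutting the tree inside gap i leaves len(last) - i clusters.
    return int(len(last) - np.argmax(gaps))


# Synthetic illustration: three compact groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(100, 2)) for c in centers])

suggested_k = largest_gap_k(X)
print(suggested_k)
```

This heuristic only formalizes the visual inspection of vertical gaps; it should be read alongside the internal validation metrics rather than as a standalone criterion.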
Across all validation techniques, the consolidated results indicated the strongest consensus for a three-cluster solution (k = 3), which was supported by the majority of internal metrics.
4.4. Comparative Performance of Clustering Algorithms
The comparative performance of the evaluated clustering algorithms is summarized through multiple internal validation metrics, as illustrated in
Figure 5, while the aggregated composite scores and their metric-level contributions are presented in
Figure 6.
As shown in
Figure 5, substantial variability was observed across algorithms with respect to cluster cohesion, compactness, separation, and balance. In terms of Silhouette Score, OPTICS achieved the highest normalized value among all evaluated methods. Density-based approaches, including HDBSCAN and Mean Shift, also exhibited relatively high silhouette values, whereas partitional and probabilistic methods, such as MiniBatch K-Means, Spectral Clustering, and Gaussian Mixture Models, obtained noticeably lower scores.
Regarding cluster compactness, assessed through the Davies–Bouldin Index, OPTICS again yielded the lowest value among all algorithms, indicating reduced overlap between clusters. Other density-based methods, such as HDBSCAN and Mean Shift, showed intermediate values, while higher indices were observed for Gaussian Mixture Models and MiniBatch K-Means.
The Calinski–Harabasz Index, used to quantify the ratio of between-cluster to within-cluster variance, exhibited a pronounced peak for OPTICS, with values several orders of magnitude higher than those obtained by alternative algorithms. In contrast, hierarchical and partitional methods produced comparatively low Calinski–Harabasz scores, reflecting weaker separation structures in the resulting cluster configurations.
Cluster size balance, measured through the coefficient of variation of cluster sizes, revealed moderate variability across algorithms. Gaussian Mixture Models and MiniBatch K-Means achieved the lowest imbalance values, while Mean Shift and DBSCAN presented higher dispersion in cluster sizes. OPTICS exhibited an intermediate balance score, reflecting the coexistence of core clusters and a noise component.
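The balance measure can be written down in a few lines. The sketch below computes the coefficient of variation (population standard deviation divided by the mean) of cluster sizes; whether the noise label is counted as a group is not stated here, so it is exposed as a hypothetical `include_noise` switch, and the example labels simply reuse the cluster sizes reported later in this section.

```python
import numpy as np


def cluster_size_cv(labels, include_noise=False):
    """Coefficient of variation of cluster sizes (0 means perfectly balanced)."""
    labels = np.asarray(labels)
    if not include_noise:
        labels = labels[labels != -1]  # -1 is the conventional noise label
    _, counts = np.unique(labels, return_counts=True)
    return float(counts.std() / counts.mean())


# Label vector rebuilt from the reported sizes: 174 noise, 37 and 299 clustered.
labels = np.array([-1] * 174 + [0] * 37 + [1] * 299)

cv_core = cluster_size_cv(labels)
cv_with_noise = cluster_size_cv(labels, include_noise=True)
print(round(cv_core, 3), round(cv_with_noise, 3))
```

Lower values indicate more evenly sized clusters, so this metric has to be inverted before it is combined with higher-is-better criteria.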
The aggregation of all normalized metrics into a composite score is presented in
Figure 6. OPTICS obtained the highest composite score (0.944), substantially exceeding the scores of all other evaluated algorithms. HDBSCAN ranked second with a composite score slightly above 0.5, while the remaining methods achieved composite scores below 0.4.
The metric-level contribution to the composite score, also illustrated in
Figure 6, shows that OPTICS consistently contributed positively across all evaluated dimensions, with strong contributions from the Silhouette Score, Davies–Bouldin Index, and Calinski–Harabasz Index. In contrast, competing algorithms exhibited unbalanced contributions, typically dominated by one or two metrics while underperforming in others.
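One plausible way to build such a composite score is to min-max normalize each metric across algorithms, invert the metrics for which lower is better, and take a weighted average. The sketch below follows that scheme with equal weights; the normalization, the weighting, and all numeric values are assumptions for illustration and do not reproduce the scores reported above.

```python
import numpy as np


def min_max(values):
    """Scale values to [0, 1]; a constant column maps to all ones."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.ones_like(values)
    return (values - values.min()) / span


def composite_scores(table, lower_is_better, weights=None):
    """Weighted mean of normalized metrics per algorithm (equal weights by default)."""
    algorithms = list(table)
    metric_names = list(next(iter(table.values())))
    if weights is None:
        weights = {m: 1.0 / len(metric_names) for m in metric_names}
    total = np.zeros(len(algorithms))
    for m in metric_names:
        scaled = min_max([table[a][m] for a in algorithms])
        if m in lower_is_better:
            scaled = 1.0 - scaled  # flip so that 1 is always the best value
        total += weights[m] * scaled
    return dict(zip(algorithms, total.tolist()))


# Hypothetical metric values for three unnamed algorithms.
table = {
    "A": {"silhouette": 0.9, "davies_bouldin": 0.3, "calinski_harabasz": 5000.0, "size_cv": 0.8},
    "B": {"silhouette": 0.6, "davies_bouldin": 0.7, "calinski_harabasz": 900.0, "size_cv": 0.9},
    "C": {"silhouette": 0.3, "davies_bouldin": 1.2, "calinski_harabasz": 400.0, "size_cv": 0.2},
}
scores = composite_scores(table, lower_is_better={"davies_bouldin", "size_cv"})
print(scores)
```

Keeping the per-metric terms separate before summing also yields the metric-level contributions, which makes it visible when a high composite score is driven by only one or two criteria.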
Based on the quantitative comparison across individual metrics and the aggregated composite score, OPTICS was selected as the clustering algorithm for subsequent analysis. A direct comparison between the selected algorithm and the k-means baseline is reported in
Table 5. The optimal OPTICS configuration produced a clustering structure consisting of two core clusters and a noise component. Specifically, 174 firms (34.1%) were classified as outliers (Cluster −1), while 37 firms (7.3%) and 299 firms (58.6%) were assigned to Cluster 0 and Cluster 1, respectively, for a total of 510 firms.
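The structure of an OPTICS solution, with a noise label alongside the core clusters, can be summarized as shown in the following sketch. The hyperparameters (`min_samples`, `xi`, `min_cluster_size`) and the synthetic data are placeholders, since the optimal configuration is not specified in this section; the helper `summarize_labels` is hypothetical. The last lines re-derive the reported shares from the reported counts.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.preprocessing import StandardScaler


def summarize_labels(labels):
    """Map each label to (count, percentage of all observations)."""
    labels = np.asarray(labels)
    values, counts = np.unique(labels, return_counts=True)
    return {
        int(v): (int(c), float(round(100.0 * int(c) / labels.size, 1)))
        for v, c in zip(values, counts)
    }


# Synthetic illustration: two dense groups plus scattered points.
rng = np.random.default_rng(0)
dense_a = rng.normal(0.0, 0.2, size=(150, 2))
dense_b = rng.normal(5.0, 0.2, size=(150, 2))
scattered = rng.uniform(-10.0, 15.0, size=(60, 2))
X = StandardScaler().fit_transform(np.vstack([dense_a, dense_b, scattered]))

# Placeholder hyperparameters, not the study's configuration.
model = OPTICS(min_samples=10, xi=0.05, min_cluster_size=0.1).fit(X)
synthetic_summary = summarize_labels(model.labels_)  # label -1 marks noise

# Consistency check on the reported counts (174 + 37 + 299 = 510 firms).
reported_summary = summarize_labels([-1] * 174 + [0] * 37 + [1] * 299)
print(reported_summary)
```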
4.5. Cluster Characterization and Business-Oriented Interpretation
4.5.1. Dataset and Cluster Structure
The cluster characterization analysis was conducted on a dataset comprising 510 companies with complete financial information. Based on the density-based clustering results, two regular clusters (Cluster 0 and Cluster 1) were identified, while 174 companies (34.1%) were classified as outliers (Cluster −1) and analyzed separately to avoid distortion of cluster-level profiles.
All clusters were characterized using a consistent set of eight financial ratios, covering profitability, financial structure, and liquidity dimensions.
4.5.2. Statistical Differentiation Between Clusters
Non-parametric hypothesis testing was applied to assess inter-cluster differences in financial indicators. The results of the Kruskal–Wallis test are reported in
Table 6.
Six out of eight financial features (75%) showed statistically significant differences across clusters. The most discriminative variables corresponded to financial structure indicators (leverage, equity ratio, and liability ratio), all of which presented highly significant differences. Profitability indicators (ROA and ROE) also exhibited significant differences, whereas two liquidity-related ratios did not show statistically significant variation.
Pairwise comparisons between the two regular clusters using the Mann–Whitney U test (
Table 7) confirmed significant differences in financial structure and profitability metrics, with large effect sizes for leverage-related indicators and moderate effect sizes for profitability measures.
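The two-stage testing procedure (an omnibus Kruskal–Wallis test followed by a pairwise Mann–Whitney U test) can be sketched with SciPy as follows. The effect size shown is the magnitude of the rank-biserial correlation, which is one common choice; the effect-size measure actually used is not stated in this section, and the data are synthetic.

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu


def omnibus_test(groups):
    """Kruskal-Wallis H test across all groups of one financial ratio."""
    stat, p_value = kruskal(*groups)
    return float(stat), float(p_value)


def pairwise_test(x, y):
    """Two-sided Mann-Whitney U test with rank-biserial effect size magnitude."""
    u, p_value = mannwhitneyu(x, y, alternative="two-sided")
    # |2U / (n1 * n2) - 1| is identical whichever sample U refers to.
    effect = abs(2.0 * u / (len(x) * len(y)) - 1.0)
    return float(u), float(p_value), float(effect)


# Synthetic leverage values mimicking the group sizes: an all-zero group,
# a highly leveraged group, and a dispersed outlier group.
rng = np.random.default_rng(0)
cluster_0 = np.zeros(37)
cluster_1 = rng.uniform(2.0, 6.0, size=299)
outliers = rng.normal(0.0, 10.0, size=174)

h_stat, p_kw = omnibus_test([cluster_0, cluster_1, outliers])
u_stat, p_mwu, effect = pairwise_test(cluster_0, cluster_1)
print(p_kw, p_mwu, effect)
```

An effect size near 1 means the two distributions barely overlap, which corresponds to the pattern described for leverage-related indicators; values closer to 0 indicate substantial overlap.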
4.5.3. Financial Profile Characterization of Regular Clusters
Cluster 0 comprised 37 companies (7.3%), all classified as micro-sized firms. As illustrated in the financial profile visualization (
Figure 7), this cluster exhibited a nearly flat normalized profile across profitability and leverage dimensions.
Median ROA and ROE values were equal to zero, indicating the absence of recorded operating activity. From a capital structure perspective, Cluster 0 showed no financial leverage, with a median leverage value of zero and a median equity ratio of 100%. Liquidity indicators were uniformly low, with median values well below standard reference levels.
The heatmap of median financial characteristics (
Figure 8) further highlights the dominance of equity-based financing and the absence of debt in this cluster.
Cluster 1 represented the largest group, comprising 299 companies (58.6%), predominantly micro-sized enterprises (97.0%). The financial profile visualization (
Figure 7) shows a markedly unbalanced profile for this cluster, characterized by strong leverage intensity and reduced equity participation.
Median ROA and ROE values were negative, indicating operating losses. Financial structure metrics revealed high leverage, with a median leverage value exceeding four times equity and a median equity ratio below 20%. Liquidity indicators were low and comparable to those observed in Cluster 0.
These contrasts are clearly emphasized in the heatmap representation (
Figure 8), where Cluster 1 displays high intensity in debt-related ratios and reduced equity contribution.
The outlier group (Cluster −1) included 174 companies (34.1%) and exhibited heterogeneous financial behavior. Median values for profitability, leverage, and equity ratios deviated substantially from the interquartile ranges observed in the regular clusters, as summarized in
Table 8.
The size composition of this group was more diverse, including micro, small, and medium firms as well as one large firm, reinforcing its atypical and non-homogeneous nature.
4.5.4. Visual Validation of Cluster Profiles
Visual analysis supported the statistical findings. The radar charts of normalized financial metrics (
Figure 7) revealed clear structural differences between the two regular clusters: a flat, low-activity profile for Cluster 0 and a leveraged, imbalanced profile for Cluster 1.
The heatmap of median values (
Figure 8) highlighted strong contrasts in financial structure indicators, particularly equity ratio and debt intensity, while liquidity-related metrics showed similar levels across clusters.
4.5.5. Behavioral Financial Archetypes and Business Interpretation
The combined statistical and visual analyses reveal the presence of distinct financial behavior patterns across the identified clusters. These differences are consistently supported by non-parametric hypothesis testing, median-based comparisons, and cluster-level visualizations, confirming that the segmentation captures heterogeneous financial dynamics rather than superficial structural similarities.
Cluster 0 is characterized by a financial profile with null profitability, absence of financial leverage, and limited liquidity, indicating firms without observable operating activity during the analyzed period. In contrast, Cluster 1 exhibits active financial behavior, marked by high leverage, reduced equity participation, negative profitability, and constrained liquidity, reflecting firms engaged in ongoing operations under financial stress or restructuring conditions.
The robustness of this behavioral segmentation is further reinforced by the consistency between statistical significance tests and visual validation tools. Radar charts highlight contrasting structural patterns between clusters, while heatmap representations emphasize sharp differences in capital structure metrics and relatively homogeneous liquidity conditions. Box plot distributions corroborate these findings by showing minimal overlap in leverage-related indicators and partial overlap in profitability metrics.
The outlier group aggregates firms with atypical financial configurations that deviate substantially from the dominant patterns observed in regular clusters. These entities exhibit extreme values in profitability and capital structure indicators, suggesting the presence of non-standard financial behavior that cannot be adequately represented by the regular cluster archetypes.
Overall, the resulting segmentation defines clear financial archetypes that are interpretable, statistically supported, and suitable for business-oriented analysis. This structure provides a coherent framework for differentiating firms based on their financial behavior, enabling targeted analytical, managerial, or risk-oriented applications without relying on predefined firm attributes.
4.6. Evaluation of Dimensionality Reduction Techniques for Visualization
To facilitate the visual inspection of the clustering structure and the spatial distribution of firms in a two-dimensional space, three dimensionality reduction techniques (PCA, t-SNE, and UMAP) were evaluated and compared based on quantitative embedding quality metrics.
4.6.1. Principal Component Analysis (PCA)
The variance explained by each principal component is reported in the left image in
Figure 9. The first principal component (PC1) accounted for 27.14% of the total variance, while the second principal component (PC2) explained an additional 22.14%. Together, the two-dimensional PCA projection captured 49.28% of the total variance of the standardized financial feature space. The cumulative variance exceeded 75% with four components and 95% with six components, as illustrated in the right image in
Figure 9.
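The explained-variance figures follow directly from the singular values of the standardized feature matrix. The sketch below shows the calculation with NumPy on synthetic data with eight features; the helper names are hypothetical, and the final line only checks the arithmetic of the two reported component shares.

```python
import numpy as np


def explained_variance_ratio(X):
    """Share of total variance captured by each principal component."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def components_needed(ratios, threshold):
    """Smallest number of components whose cumulative share reaches threshold."""
    cumulative = np.cumsum(ratios)
    return int(np.searchsorted(cumulative, threshold) + 1)


# Synthetic illustration: eight observed features driven by three latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.3 * rng.normal(size=(300, 8))

ratios = explained_variance_ratio(X)
n_75 = components_needed(ratios, 0.75)
n_95 = components_needed(ratios, 0.95)

# Arithmetic check of the reported shares: PC1 + PC2.
two_component_total = round(27.14 + 22.14, 2)
print(n_75, n_95, two_component_total)
```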
4.6.2. Comparison of Dimensionality Reduction Techniques
The quality of the two-dimensional embeddings obtained with PCA, t-SNE, and UMAP was assessed using trustworthiness, continuity, and Spearman rank correlation. The quantitative comparison of these metrics is summarized in
Table 9 and visualized in
Figure 10.
PCA achieved a trustworthiness score of 0.650 and a continuity score of 0.817, while exhibiting the highest Spearman rank correlation (0.890), indicating strong preservation of pairwise distance rankings between the original and reduced spaces. t-SNE obtained the highest trustworthiness (0.830) and continuity (0.895), reflecting improved preservation of local neighborhood relationships. UMAP achieved intermediate values across all metrics, with trustworthiness and continuity scores of 0.785 and 0.834, respectively.
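The three embedding quality metrics can be computed as in the following sketch. Trustworthiness comes from scikit-learn; continuity is obtained here by swapping the roles of the original and embedded spaces in the same function, and the Spearman coefficient is taken over all pairwise distances. The neighborhood size (`n_neighbors=10`) and the synthetic data are assumptions, as the settings used in the study are not given in this section.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.manifold import trustworthiness


def embedding_quality(X_high, X_low, n_neighbors=10):
    """Local (trustworthiness, continuity) and global (Spearman) preservation."""
    trust = trustworthiness(X_high, X_low, n_neighbors=n_neighbors)
    # Continuity: trustworthiness with the two spaces exchanged.
    cont = trustworthiness(X_low, X_high, n_neighbors=n_neighbors)
    rho, _ = spearmanr(pdist(X_high), pdist(X_low))
    return {
        "trustworthiness": float(trust),
        "continuity": float(cont),
        "spearman": float(rho),
    }


# Synthetic illustration: project eight features onto the first two principal axes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_pca = X_centered @ Vt[:2].T

scores = embedding_quality(X, X_pca)
identity = embedding_quality(X, X)  # a lossless "embedding" scores 1 on every metric
print(scores)
```

Trustworthiness and continuity reward local neighborhood preservation, whereas the Spearman coefficient rewards global distance ordering, which is why a linear method such as PCA can lead on the latter while a neighbor-based method leads on the former.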
The normalized composite scores, integrating all evaluation metrics, ranked t-SNE as the highest-performing technique for two-dimensional visualization (composite score = 0.834), followed by PCA (0.786) and UMAP (0.766), as reported in
Table 9.
4.6.3. Two-Dimensional Cluster Visualization
The two-dimensional projections obtained with each technique are shown in
Figure 11, where clusters identified in the high-dimensional space are overlaid on the reduced representations. The PCA-based visualization shows partial overlap between clusters, whereas the t-SNE embedding presents a clearer spatial separation between the main clusters and outliers. The UMAP projection also reveals well-defined groupings, although with greater compactness and reduced inter-cluster spacing.
Based on the composite evaluation metrics and the visual separation observed in the two-dimensional embeddings, t-SNE was selected as the primary technique for cluster visualization in subsequent analyses. The final t-SNE representation, including cluster centroids, is presented in
Figure 12.
5. Discussion
This study aimed to characterize and segment tourism firms in Barranquilla based on their financial behavior using normalized financial ratios and unsupervised learning techniques. Unlike traditional firm classifications driven by size or scale, the proposed framework focuses on identifying structural patterns in profitability, capital structure, liquidity, and solvency. The results demonstrate that this approach enables the identification of robust and interpretable financial archetypes that reflect heterogeneous economic dynamics within the sector.
Beyond the identification of financial archetypes, it is essential to contrast these findings with the existing literature to evaluate their theoretical and methodological implications.
5.1. Financial Behavior Archetypes in the Tourism Sector
The clustering analysis revealed the presence of two regular clusters and a group of outliers, each representing distinct financial behavior archetypes. These profiles are not merely statistical groupings but reflect differentiated patterns of financial activity, capital allocation, and operational intensity.
Cluster 0 corresponds to firms exhibiting no observable financial activity during the analyzed fiscal year. These entities are characterized by null profitability indicators, absence of financial leverage, full reliance on equity financing, and low liquidity levels. The financial configuration of this group suggests firms that are formally registered but inactive, recently created, or undergoing temporary suspension of operations. The homogeneity of this profile across multiple financial dimensions supports its interpretation as a distinct archetype rather than a transitional or noisy group.
Cluster 1 represents firms with active operations but under financial stress. This group is characterized by high leverage, reduced equity participation, negative profitability, and constrained liquidity. The coexistence of operational activity with persistent losses and aggressive capital structures suggests firms facing structural challenges related to cost management, debt servicing, or unfavorable market conditions. These findings align with previous studies highlighting the relevance of capital structure and profitability as key differentiators in financial segmentation frameworks.
The outlier group aggregates firms with atypical financial configurations that deviate substantially from the dominant patterns observed in the regular clusters. The heterogeneity of this group, both in terms of financial ratios and firm size composition, suggests the presence of non-standard business models, exceptional financial events, or data irregularities. Treating these firms as a separate category rather than forcing their inclusion into regular clusters preserves the interpretability and robustness of the segmentation.
These results are consistent with prior research demonstrating the capacity of clustering techniques to reveal latent financial structures within heterogeneous sectors (
Griffin et al., 2023;
Han et al., 2019). As in those studies, the use of unsupervised learning enabled the identification of underlying patterns that traditional descriptive financial analysis may fail to detect. In this sense, the present findings reinforce the methodological relevance of clustering for uncovering hidden configurations of risk exposure and profitability in volatile environments such as tourism.
5.2. Behavior-Based Segmentation Versus Size-Based Classification
One of the most relevant findings of this study is that the identified clusters are not explained by firm size. Both regular clusters are overwhelmingly composed of micro-sized enterprises, yet they exhibit markedly different financial behaviors. This result provides empirical evidence supporting the hypothesis that financial performance and risk profiles in the tourism sector are better explained by behavioral and structural characteristics than by scale alone.
The statistically significant differences observed in financial structure and profitability indicators, contrasted with the limited discriminative power of liquidity metrics, reinforce the notion that capital allocation and operational efficiency, rather than firm size, are the primary drivers of financial outcomes. By demonstrating that microenterprises in Barranquilla follow substantially different financial trajectories despite their similar scale, this study provides localized validation for behavior-based segmentation. These results strengthen prior arguments regarding the inherent limitations of size-driven classifications in heterogeneous economic contexts (
Chen et al., 2024).
5.3. Interpretation of Discriminative Financial Dimensions
The analysis highlights financial structure indicators, such as leverage, equity ratio, and liability ratio, as the most discriminative variables across clusters. These metrics exhibited strong statistical significance and minimal distributional overlap, underscoring their relevance in distinguishing financial archetypes. Profitability indicators (ROA and ROE) also contributed to cluster differentiation, although with moderate overlap, reflecting shared challenges across firms within the sector.
The prominence of leverage and profitability indicators as primary differentiating factors is coherent with prior studies emphasizing the role of capital structure and return-based metrics in financial segmentation frameworks (
Karahuta et al., 2017;
Vilas et al., 2022). The identification of a financially stressed cluster characterized by high indebtedness and negative profitability further aligns with the literature describing the structural vulnerability of the tourism sector to external shocks and volatility (
Brida et al., 2021).
Although liquidity is traditionally regarded as a crucial indicator of financial health and short-term solvency (
Matias et al., 2024), the results obtained for Barranquilla in 2024 indicate that liquidity ratios did not significantly differentiate the main clusters. Rather than contradicting the literature, this finding suggests that constrained liquidity may represent a structural characteristic affecting the sector during the analyzed period. Consequently, liquidity appears as a shared contextual condition rather than a discriminating variable within the segmentation framework.
5.4. Role of Visualization and Outlier Treatment
The use of dimensionality reduction techniques supported the qualitative interpretation of the clustering results by providing intuitive two-dimensional representations of the high-dimensional financial space. While visual compactness varied across methods, the selected visualization approach facilitated the identification of cluster separation, internal dispersion, and the spatial positioning of outliers. Importantly, visualization was used as a complementary interpretative tool rather than as a criterion for defining clusters, preserving methodological rigor.
The explicit identification and separate treatment of outliers further strengthened the analytical framework. By acknowledging the presence of atypical financial behaviors, the study avoids oversimplification and recognizes the inherent complexity of financial data in real-world settings.
From a methodological standpoint, the use of the OPTICS algorithm also addresses limitations previously identified in local research contexts (
Fontalvo Herrera et al., 2023). Unlike simple partitional models such as k-means, density-based approaches are better suited to handle skewed financial distributions and extreme values. The substantial improvement observed in clustering quality, reflected in the marked increase in the Silhouette Score relative to the baseline model, demonstrates the added value of robust clustering techniques for real-world financial datasets characterized by zero inflation and outliers.
5.5. Contextual Scope and Temporal Considerations
The results of this study are conditioned by the temporal context of the data, which correspond to the 2024 fiscal year. Consequently, the identified financial behavior patterns reflect the structural and operational conditions of the tourism sector during this specific period. Given the sensitivity of the sector to economic cycles, demand fluctuations, and external shocks, the interpretation of the results should be framed within this temporal scope. Future studies incorporating longitudinal data may provide additional insights into the persistence or evolution of the identified financial archetypes.
5.6. Implications for Financial Analysis and Monitoring
Although this study does not aim to prescribe operational or policy actions, the identified financial archetypes may inform the development of analytical frameworks for financial monitoring, risk assessment, and firm segmentation. By relying exclusively on normalized financial ratios and unsupervised learning, the proposed approach offers a flexible and scalable tool that can be adapted to different institutional or analytical contexts without relying on predefined firm attributes.
6. Conclusions
This study developed and validated a behavior-based framework for the financial segmentation of tourism firms in Barranquilla using normalized financial ratios and unsupervised learning techniques. By avoiding firm size as a classification criterion, the proposed approach focused on identifying structural patterns in financial behavior related to profitability, capital structure, and liquidity.
The empirical results confirm that the framework is capable of identifying distinct and statistically robust financial profiles within the sector. Two regular clusters were consistently detected, representing firms without observable operational activity and firms with active operations but under significant financial pressure, respectively. These profiles were supported by statistically significant differences across most financial dimensions, particularly those related to capital structure and profitability.
A relevant contribution of this study is the empirical confirmation that firm size does not drive the observed segmentation. Despite the predominance of micro-sized enterprises in both regular clusters, the identified profiles exhibit substantially different financial configurations, reinforcing the value of behavior-based segmentation approaches for analyzing heterogeneous sectors.
From a methodological standpoint, the study highlights the importance of rigorous feature engineering based on financial ratios and the use of robust clustering techniques capable of handling skewed distributions, zero inflation, and extreme values. The explicit identification of outlier firms as a separate group further enhances the interpretability of the results and preserves relevant information that would otherwise be lost through aggressive data filtering.
The findings of this study are subject to several limitations that should be acknowledged. First, the analysis is based on financial data corresponding to a single fiscal year (2024), which constrains the temporal generalizability of the identified financial profiles. The observed patterns reflect the structural and economic conditions of the tourism sector during that specific period and may vary under different macroeconomic or sectoral contexts.
Second, the availability and quality of financial information limited the effective sample size for clustering, particularly due to the high prevalence of zero values in profitability indicators and missing data in certain ratios. Although these characteristics are inherent to real-world financial datasets, they may affect the stability of the identified clusters.
Third, while the outlier group was deliberately preserved to maintain analytical integrity, its heterogeneity prevents detailed characterization within the current framework and requires complementary qualitative or firm-level analysis.
Future research may extend this work in several directions. Incorporating longitudinal financial data would allow the analysis of cluster stability and transitions over time, providing insights into the persistence and evolution of financial behavior patterns. Additionally, integrating qualitative information or operational variables could support a deeper interpretation of outlier firms and atypical financial configurations.
Further studies may also explore the application of the proposed framework to other economic sectors or geographic contexts, assessing its generalizability and comparative performance. Finally, the development of monitoring or early-warning tools based on the identified financial archetypes represents a promising avenue for applied research, provided that such extensions are supported by additional empirical validation.
Overall, this study provides a robust and replicable foundation for behavior-based financial segmentation using unsupervised learning, contributing to a more nuanced understanding of financial heterogeneity in the tourism sector.
7. Patents
The authors declare that no patents resulted from the work reported in this manuscript.