Unsupervised Machine Learning for Financial Behavior Profiling of Tourism Firms in Barranquilla, Colombia

Coronell, Leidy Haidy Perez; Herrera, Tomás José Fontalvo; Africano, Gloria Naranjo; De-La-Hoz-Franco, Emiro; Escorcia-Gutierrez, José; Crissien Borrero, Tito José

doi:10.3390/jrfm19040281


by Leidy Haidy Perez Coronell ¹, Tomás José Fontalvo Herrera ², Gloria Naranjo Africano ³,*, Emiro De-La-Hoz-Franco ⁴, José Escorcia-Gutierrez ⁴,* and Tito José Crissien Borrero ⁵

¹ Faculty of Engineering, Corporación Universitaria Latinoamericana CUL, Barranquilla 080002, Colombia

² Faculty of Economics Sciences, University of Cartagena, Cartagena 130001, Colombia

³ Business Growth Center, MACONDOLAB, Faculty of Engineering, Universidad Simón Bolívar, Barranquilla 080002, Colombia

⁴ Department of Computational Science and Electronics, Universidad de la Costa (CUC), Barranquilla 080002, Colombia

⁵ Technological Development and Innovation (IDITEK), University Foundation for Research, Barranquilla 080002, Colombia

* Authors to whom correspondence should be addressed.

J. Risk Financial Manag. 2026, 19(4), 281; https://doi.org/10.3390/jrfm19040281

Submission received: 23 November 2025 / Revised: 25 February 2026 / Accepted: 27 February 2026 / Published: 13 April 2026

(This article belongs to the Section Financial Technology and Innovation)


Abstract

This study aims to identify and characterize the financial profiles of tourism-sector firms in Barranquilla through the application of unsupervised Machine Learning techniques, with the purpose of analyzing patterns of financial behavior based on profitability, capital structure, and liquidity. The research adopts a quantitative and descriptive design, using secondary financial data for fiscal year 2024 obtained from the Barranquilla Chamber of Commerce. The initial sample comprised 563 active tourism firms. Based on basic accounting variables, normalized financial indicators were constructed through a feature engineering process that included correlation analysis, variable selection, and robust scaling. A range of clustering algorithms representing different methodological paradigms (partitional, hierarchical, density-based, and probabilistic) was evaluated using a multicriteria validation framework combining internal cluster quality metrics and cluster size balance. The OPTICS algorithm was selected as the most suitable method for the final segmentation. The results revealed two regular financial clusters and a group of atypical firms. One cluster corresponds to firms with no observable financial activity, characterized by zero profitability, absence of leverage, and exclusive reliance on equity financing. The second cluster groups financially active firms exhibiting high indebtedness, low equity participation, negative profitability, and liquidity constraints, reflecting conditions of financial distress. Non-parametric statistical tests confirmed significant differences between clusters, primarily in indicators related to capital structure and profitability, while firm size did not account for the observed segmentation. Overall, the findings demonstrate that behavior-based financial segmentation supported by unsupervised Machine Learning and normalized financial ratios enables the identification of robust and interpretable financial archetypes, with capital structure and profitability emerging as the main differentiating factors.

Keywords:

financial profile; unsupervised machine learning; clustering; financial analysis; business segmentation

1. Introduction

The tourism industry is a highly dynamic and financially relevant sector worldwide, especially in emerging regions such as Latin America, where it contributes significantly to the generation of employment and income. In this region, tourism generates one out of every ten jobs and attracts approximately 5% of total investment; moreover, in 2023 it contributed more than USD 629 billion to Latin American GDP and employed 24.6 million people, figures that underscore its economic significance (IDB Invest, 2026). However, tourism companies face significant financial risks due to financial volatility and unforeseen events, such as financial or health crises, which directly affect the sector’s demand and stability (Brida et al., 2021; Inter-American Development Bank, 2024). In Colombia, particularly in cities such as Barranquilla, tourism has become a major financial catalyst, highlighting the need for analytical tools that enable a deeper understanding of firms’ financial behavior and support the strengthening of their sustainability and competitiveness in uncertain environments (la Hoz et al., 2020; MINCIT, 2024). In this regard, previous studies conducted in Barranquilla have successfully applied clustering techniques and neural networks to classify firms based on their financial behavior, demonstrating the usefulness of these methods in local business contexts (la Hoz et al., 2020).

Financial statements offer a comprehensive assessment of a company’s financial health, delivering crucial indicators of liquidity, earnings, and financial stability (Brealey et al., 2018; Matias et al., 2024). However, traditional financial analysis approaches often prove insufficient to capture the complexity, heterogeneity, and high dispersion that characterize the tourism sector, particularly in volatile economic contexts and in local markets with diverse business structures (Hair et al., 2019). These limitations hinder the identification of differentiated financial patterns and constrain the understanding of underlying structural behaviors within the data.

As an unsupervised learning technique, clustering groups companies with similar financial characteristics together. This enables hidden patterns to be identified and risk and profitability profiles to be segmented within the tourism sector (Han et al., 2019; James et al., 2021). Its application in financial analysis has proven effective for classifying firms according to their economic behavior, particularly in highly competitive and heterogeneous sectors such as tourism (Griffin et al., 2023). Likewise, clustering techniques make it possible to capture inter-firm heterogeneity by avoiding the assumption that all companies follow the same strategy, which is especially relevant in local markets where microenterprises coexist alongside large hotel chains (Adamska & Dąbrowski, 2021; Celebi et al., 2013).

The application of clustering techniques and neural networks in the analysis of the financial statements and performance of tourism companies has gained popularity in recent years. Previous studies have employed these techniques to categorize firms based on financial indicators (Vilas et al., 2022) as well as operational characteristics, whereas neural networks, often combined with clustering, have proven effective in analyzing the relations between variables and predicting outcomes, such as travel costs (Herrera et al., 2022; Tiwari & Tripathi, 2023). This approach relies on the use of normalized financial ratios to capture genuine differences in economic behavior, profitability, capital structure, liquidity, and operational efficiency, thereby providing a more accurate representation of firms’ financial dynamics (Karahuta et al., 2017).

Complementarily, the use of data analysis techniques and Machine Learning has enhanced the ability to understand complex financial contexts and to support decision-making grounded in empirical evidence (Karahuta et al., 2017). These techniques have been implemented in many geographical contexts, including the United States, Colombia, and Ecuador, thereby showing their versatility in tourism research (Dotson et al., 2014; Fontalvo Herrera & La Hoz-Granadillo, 2020; Herrera et al., 2022).

Several studies have analyzed the vulnerability of the tourism sector to financial risks, pointing out that economic instability in Latin America increases this risk (ONU, 2022). Recent research suggests that combining clustering and neural networks improves the accuracy of financial profile analysis and forecasting. This enables a more proactive, strategic response to uncertainty (Bravo et al., 2023; Fontalvo Herrera & La Hoz-Granadillo, 2020). However, these studies have limitations, such as biases in the selection of financial data and the lack of sufficient time series in local markets, such as that of Barranquilla (Fontalvo Herrera et al., 2023; Kelliher et al., 2018).

In the context of Barranquilla’s tourism sector, behavior-based financial segmentation supported by clustering techniques enables a deeper understanding of the sector’s financial landscape. This approach facilitates the identification of financial profiles exhibiting similar behavioral patterns, which is essential for strategic analysis and informed decision-making (De La Hoz & Polo, 2017). Based on this segmentation, it becomes possible to assess differences in financial performance and levels of risk exposure, thereby strengthening analytical capacity in the face of economic volatility and changing market conditions, phenomena that are recurrent in the tourism sector and further intensified by unforeseen events such as economic crises or pandemics (Brida et al., 2021; Inter-American Development Bank, 2024).

This study analyzes the financial behavior of tourism companies in Barranquilla through an approach based exclusively on unsupervised Machine Learning techniques, aimed at financial segmentation using key financial ratios. The analysis employs indicators such as return on assets (ROA), return on equity (ROE), the degree of financial leverage, the capital ratio, and a proxy for current liquidity, in order to identify differentiated and statistically supported financial archetypes. By focusing on financial statements corresponding to fiscal year 2024 (a period marked by post-pandemic recovery and a significant increase in tourism flows in Colombia), the study seeks to provide an up-to-date overview of the financial performance of Barranquilla’s tourism sector and the main challenges faced by firms within this context.

This study primarily aims to develop a financial segmentation approach based on clustering techniques that contributes to a better understanding of the financial behavior of tourism firms in Barranquilla. By classifying firms into financially homogeneous groups, the study seeks to identify relevant patterns in their financial statements, providing an analytical tool to strengthen efficiency, competitiveness, and enterprise risk analysis. Additionally, the results are intended to serve as input for investors and public policy makers, supporting the design of strategies aimed at strengthening and promoting the sustainable development of local tourism.

This research seeks to answer key questions for the financial analysis of the tourism sector in Barranquilla: What financial characteristics and dominant financial profiles distinguish tourism firms in Barranquilla when analyzed using normalized financial ratios? To what extent do unsupervised learning techniques enable the identification of robust and financially homogeneous groups of tourism firms based on their financial behavior? How can a clustering-based financial segmentation approach support the interpretation and strategic analysis of financial performance in the tourism sector?

Based on these questions, the following research objectives are established:

Characterize the financial information of tourism-sector firms in Barranquilla for fiscal year 2024 through the use of normalized financial ratios, in order to analyze their economic behavior.
Apply unsupervised Machine Learning techniques to segment tourism firms in Barranquilla into financially homogeneous clusters, identifying differentiated profiles based on capital structure, profitability, and liquidity.
Analyze and interpret the resulting financial profiles with the purpose of providing an analytical tool that supports strategic analysis and enhances the understanding of financial performance in the local tourism sector.

These objectives guide the research toward a structural analysis of firms’ financial behavior, providing relevant inputs for informed decision-making and for strengthening the tourism sector in Barranquilla.

2. State of the Art

To achieve the objectives of this study, the state of the art addresses the role of financial statements in business analysis, highlighting their relevance for the assessment of economic performance. In addition, clustering techniques are reviewed as unsupervised learning tools for the financial segmentation of firms and their application in identifying patterns of economic behavior within the tourism sector. Finally, previous studies employing Machine Learning–based approaches to support strategic analysis and decision-making in business contexts characterized by high heterogeneity and volatility are examined.

Financial statements constitute fundamental instruments for evaluating a firm’s economic and financial condition. Documents such as the balance sheet, income statement, and cash flow statement provide structured information on assets, liabilities, equity, revenues, and expenses, enabling a comprehensive understanding of an organization’s financial performance (Brealey et al., 2018).

Specifically, financial analysis makes it possible to interpret these statements by means of specific indicators. For example, a liquidity analysis studies a company’s ability to cover its short-term obligations, while a profitability analysis assesses its ability to generate profit according to its revenues, assets, or equity (Matias et al., 2024). This type of analysis is especially relevant in the tourism sector, which is characterized by seasonal fluctuations along with a vulnerability to external shocks, such as economic or health crises (IDB Invest, 2026).

Clustering is an unsupervised learning technique that identifies data patterns and groups them into homogeneous sets according to common characteristics. In the business world, clustering has been used to segment firms with similar financial profiles, facilitating the identification of patterns in variables such as profitability, capital structure, and liquidity that are not always evident through traditional analytical methods (Hair et al., 2019).

In the context of the tourism sector, clustering has proven to be an effective tool for revealing underlying financial structures and heterogeneous behaviors among firms. A study by la Hoz et al. (2020) applied clustering techniques to segment tourism companies based on their financial performance, examining differences in profitability and solvency. Moreover, Herrera et al. (2022) applied this technique in Ecuador to classify tourism companies according to their financial health and capacity to adjust to economic fluctuations.

In particular, the combination of financial indicators and clustering techniques enables the identification of differentiated financial profiles without imposing prior assumptions on the data structure, which is especially valuable in sectors characterized by high volatility and heterogeneity (De La Hoz & Polo, 2017).

Previous research has applied clustering-based and data analysis approaches across different geographical contexts, including the United States (Dotson et al., 2014), Colombia (Fontalvo Herrera & La Hoz-Granadillo, 2020), and Ecuador (Herrera et al., 2022), demonstrating the versatility of these techniques in corporate financial analysis. However, a gap remains in the literature regarding the specific analysis of the tourism sector in the city of Barranquilla, particularly from a behavior-based financial segmentation perspective.

Although the application of clustering and Machine Learning techniques to financial analysis has expanded considerably in recent years, important gaps remain in the literature, particularly in the context of local tourism economies. Most existing studies focus on national or cross-country comparisons, while relatively little attention has been given to city-level markets characterized by high firm concentration, structural heterogeneity, and a predominance of microenterprises. In the case of Barranquilla, despite the economic relevance of tourism, there is still limited empirical evidence examining financial behavior from a segmentation perspective.

Moreover, much of the previous research combines clustering with predictive or forecasting models, emphasizing classification accuracy rather than the structural interpretation of financial patterns. As a result, fewer studies concentrate on identifying clear and statistically robust financial archetypes derived exclusively from unsupervised learning approaches. This limits the understanding of how firms differ in terms of capital structure, profitability dynamics, and liquidity conditions beyond traditional categorizations.

Another recurring limitation in the literature is the reliance on firm size as a primary classification criterion. While size-based groupings provide useful descriptive information, they may conceal substantial heterogeneity in financial behavior within the same scale category, particularly in sectors such as tourism where microenterprises dominate but operate under very different economic conditions.

Against this backdrop, the present study seeks to advance the literature by offering a behavior-based financial segmentation of tourism firms in Barranquilla grounded in normalized financial ratios and robust unsupervised clustering techniques. Rather than presupposing structural differences based on scale, the analysis allows financial patterns to emerge directly from the data. The findings not only provide updated empirical evidence for a post-pandemic local context but also demonstrate that firm size does not explain the observed segmentation, reinforcing the relevance of structural financial behavior as a more meaningful differentiating factor. In doing so, the study contributes a replicable and interpretable framework for analyzing financial heterogeneity in tourism-oriented economies.

3. Materials and Methods

The present study adopts an observational design with a quantitative approach and is aimed at identifying patterns of financial behavior among firms in the tourism sector of Barranquilla through the application of unsupervised learning techniques. The analysis is based on a secondary financial dataset corresponding to fiscal year 2024, obtained from the Barranquilla Chamber of Commerce, the official institution responsible for the collection, validation, and administration of financial statements for companies registered within its jurisdiction. The dataset comprises firm-level administrative and accounting information for tourism-related companies domiciled in Barranquilla, Colombia, and the initial sample consists of 563 active firms with available financial statements for the reference year.

The methodological framework focuses on transforming raw accounting variables into normalized financial ratios and subsequently applying clustering algorithms to group firms into homogeneous profiles according to their financial structure, profitability, and liquidity. This approach allows the identification of non-linear relationships among the analyzed variables and facilitates a behavior-oriented segmentation of firms rather than one driven solely by size or scale (Figure 1).

The workflow was implemented using the Python programming language (v3.10, Python Software Foundation, Wilmington, DE 19801, USA) within an interactive development environment (Jupyter Notebook, version 7.5.5, Project Jupyter, Berkeley, CA, USA), ensuring the reproducibility of the analysis. Data processing and manipulation were carried out using the Pandas and NumPy libraries; statistical and algorithmic modeling was conducted with Scikit-learn, UMAP-learn, and SciPy; and visual diagnostics and exploratory validation were supported by Matplotlib and Seaborn.

3.1. Data Understanding and Feature Engineering

The original variables comprise total assets, net equity, total liabilities, net income, current assets, subscribed capital, asset size classification, and renewal date. Since these raw accounting figures are strongly influenced by firm scale, they were not directly suitable for unsupervised segmentation. Therefore, classical financial ratios were first constructed to normalize firm performance, including Return on Assets (ROA), Return on Equity (ROE), financial leverage, equity ratio, and an approximated current liquidity ratio, enabling comparisons across heterogeneous firms based on financial behavior rather than size.
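As an illustration of how such ratios can be derived from the raw accounting variables, the following minimal sketch uses Pandas. The column names and the sample values are hypothetical, and the formulas are common textbook definitions assumed here for illustration (in particular, the liquidity proxy is assumed to be current assets over total liabilities); the definitions actually used in the study are those summarized in its Table 1 and may differ.

```python
import numpy as np
import pandas as pd


def build_ratios(df: pd.DataFrame) -> pd.DataFrame:
    """Derive scale-free ratios from raw balance-sheet columns.

    Column names and formulas are illustrative assumptions, not the
    study's confirmed definitions.
    """
    # Zero denominators become NaN so undefined ratios are not reported as inf.
    assets = df["total_assets"].replace(0, np.nan)
    equity = df["net_equity"].replace(0, np.nan)
    liabilities = df["total_liabilities"].replace(0, np.nan)
    return pd.DataFrame(
        {
            "roa": df["net_income"] / assets,
            "roe": df["net_income"] / equity,
            "leverage": df["total_liabilities"] / equity,
            "equity_ratio": df["net_equity"] / assets,
            # Assumed proxy: current liabilities are not among the raw variables.
            "approx_current_liquidity": df["current_assets"] / liabilities,
        },
        index=df.index,
    )


# Two hypothetical firms: one leveraged, one fully equity-financed.
firms = pd.DataFrame(
    {
        "total_assets": [1000.0, 500.0],
        "net_equity": [400.0, 500.0],
        "total_liabilities": [600.0, 0.0],
        "net_income": [50.0, 0.0],
        "current_assets": [300.0, 200.0],
    }
)
ratios = build_ratios(firms)
print(ratios.round(3))
```

In this sketch the second firm has no liabilities, so its liquidity proxy is left undefined rather than set to an arbitrary value.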

Subsequently, a feature engineering stage was applied to enrich the financial representation and capture multidimensional characteristics relevant for clustering. Six additional metrics were defined, covering financial structure (liability ratio and capital multiplier), liquidity (current asset ratio), solvency (debt coverage), stability (equity-to-current-assets ratio), and efficiency (return on current assets). In total, 11 financial metrics were retained and grouped into four dimensions: profitability (3 metrics), financial structure (4 metrics), liquidity (3 metrics), and solvency (1 metric). Due to the absence of operational revenue and expense variables, efficiency indicators based on income flows could not be computed; therefore, the analysis focuses on balance-sheet-driven dimensions. The preprocessing stage included exploratory data analysis, correlation analysis using a |r| > 0.7 threshold, detection of missing values, and outlier identification via the interquartile range (IQR) method. A summary of financial metrics used in the analysis is provided in Table 1.

3.2. Treatment of Missing Values and Outliers

Missing values were not imputed, as they represent genuine absence of reported financial information and artificial imputation could introduce bias in clustering outcomes. Instead, the analysis prioritized metrics with full availability across firms in subsequent stages.

Outliers were identified using the interquartile range (IQR) method, defining lower and upper bounds as Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, respectively. Rather than removing entire observations, extreme values were retained to preserve potentially informative financial behaviors. To reduce the influence of extreme observations on distance-based algorithms, robust preprocessing techniques were applied in later stages.
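The IQR rule described above can be sketched as follows. This is an illustrative implementation with made-up ROA values, not the authors' code; it only flags extreme observations and does not remove them, in line with the retention strategy stated in the text.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; missing values are never flagged."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Hypothetical ROA values with one extreme observation.
roa = pd.Series([0.01, 0.02, 0.03, 0.04, 0.05, 0.90])
mask = iqr_outlier_mask(roa)
share = mask.mean()  # proportion of flagged observations
print(mask.tolist(), round(share, 3))
```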

3.3. Feature Selection

Feature selection was conducted to ensure statistical robustness, interpretability, and suitability for clustering.

First, a correlation matrix was constructed to detect multicollinearity among the derived financial ratios. Highly correlated variables were considered redundant, and only one representative metric was retained based on financial interpretability and prevalence in prior literature. Second, a variance threshold criterion was applied to remove features with near-zero variance, ensuring that all retained variables contributed meaningful discriminatory information.
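A simplified sketch of these two filters is shown below. The greedy rule, the preference order, the 0.7 cut-off, and the synthetic data are assumptions made for illustration; the study also weighs financial interpretability when choosing which metric to keep, which a purely mechanical threshold cannot capture.

```python
import numpy as np
import pandas as pd


def drop_redundant(df: pd.DataFrame, preferred: list, r_max: float = 0.7) -> list:
    """Keep features in `preferred` order, skipping any whose |r| with a kept one exceeds r_max."""
    # Constant columns give undefined correlations; treat them as 0 here
    # and let the variance filter deal with them.
    corr = df.corr().abs().fillna(0.0)
    kept = []
    for col in preferred:
        if all(corr.loc[col, k] <= r_max for k in kept):
            kept.append(col)
    return kept


def drop_low_variance(df: pd.DataFrame, cols: list, min_var: float = 1e-8) -> list:
    """Remove features whose sample variance is (near) zero."""
    return [c for c in cols if df[c].var() > min_var]


rng = np.random.default_rng(0)
leverage = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "leverage": leverage,
        "capital_multiplier": leverage + 1.0,  # perfectly collinear with leverage
        "roa": rng.normal(size=200),
        "constant": np.zeros(200),             # no discriminatory information
    }
)
# The preference order encodes which metric is considered more interpretable.
order = ["leverage", "roa", "capital_multiplier", "constant"]
selected = drop_low_variance(demo, drop_redundant(demo, order))
print(selected)
```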

Finally, the optimal subset of financial metrics was selected based on a combined criterion balancing: (i) low redundancy, (ii) sufficient variance, (iii) data completeness, and (iv) financial relevance. This subset constitutes the feature space used as input for the clustering procedures described in the following section.

3.4. Optimal Number of Clusters

The optimal number of clusters was determined using a multi-criteria validation framework to ensure methodological rigor and reproducibility. Prior to clustering, the selected financial ratios were normalized to ensure scale comparability, evaluating both StandardScaler (zero mean and unit variance) and RobustScaler (median centering with interquartile range scaling); the latter was selected due to its robustness to outliers typically present in financial data. Clustering was performed using k-means as the base algorithm with k-means++ initialization, evaluating a range of cluster solutions from k = 2 to k = 10. Cluster validity was assessed through complementary internal and comparative techniques, including the Elbow Method based on within-cluster sum of squares (WCSS) to analyze inertia reduction, the Silhouette Score to evaluate intra-cluster cohesion and inter-cluster separation, the Davies–Bouldin Index to assess cluster compactness and overlap, and the Calinski–Harabasz Index to quantify the ratio of between- to within-cluster variance. Additionally, dendrogram analysis was employed to support structural inspection in hierarchical clustering.
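The scan over k described above could look roughly like the following Scikit-learn sketch. The synthetic two-group data, the random seed, and the `n_init` value are assumptions for illustration; the snippet is not the authors' implementation and does not reproduce their results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import RobustScaler


def scan_k(X, k_values=range(2, 11), seed=0):
    """Fit k-means for each k on robustly scaled data and collect validity metrics."""
    X_scaled = RobustScaler().fit_transform(X)  # median centering, IQR scaling
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=seed)
        labels = model.fit_predict(X_scaled)
        rows.append(
            {
                "k": k,
                "wcss": model.inertia_,  # input to the elbow plot
                "silhouette": silhouette_score(X_scaled, labels),           # higher is better
                "davies_bouldin": davies_bouldin_score(X_scaled, labels),   # lower is better
                "calinski_harabasz": calinski_harabasz_score(X_scaled, labels),  # higher is better
            }
        )
    return rows


# Synthetic stand-in for the ratio matrix: two well-separated groups.
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 0.3, (60, 3)), rng.normal(5, 0.3, (60, 3))])
results = scan_k(X_demo)
best = max(results, key=lambda r: r["silhouette"])
print(best["k"], round(best["silhouette"], 3))
```

Because the indices point in different directions (Davies–Bouldin is minimized, the others maximized), they are best read together rather than through any single criterion.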

3.5. Clustering Algorithm Selection

The selection of the clustering algorithm was performed through a systematic and quantitative comparison of multiple unsupervised learning techniques representing different clustering paradigms. The evaluated methods included partitional approaches (K-Means as baseline and K-Medoids using the PAM algorithm for increased robustness to outliers), hierarchical clustering with different linkage criteria (Agglomerative Clustering using Ward, Complete, and Average linkages), density-based algorithms (DBSCAN, OPTICS, and HDBSCAN), and probabilistic models (Gaussian Mixture Models) capable of capturing elliptical cluster structures. To ensure an objective comparison, all algorithms were evaluated using a consistent set of quantitative validation metrics, including the Silhouette Score to assess intra-cluster cohesion and inter-cluster separation, the Davies–Bouldin Index to evaluate cluster compactness and overlap, the Calinski–Harabasz Index to measure the ratio of between- to within-cluster variance, and a cluster balance metric designed to penalize extreme disparities in cluster sizes. These metrics were normalized to a common [0, 1] scale and aggregated into a composite score using balanced weights (25% per metric), which served as the primary optimization objective. Hyperparameter tuning was conducted independently for each algorithm using a randomized search strategy with up to 1000 iterations, subject to a strict validity criterion that retained only configurations producing meaningful cluster structures, including a minimum cluster size threshold of 5% of the total sample. The optimization process was constrained by a maximum execution time per algorithm to ensure computational feasibility. Algorithms that failed to converge or produced degenerate clustering solutions under these constraints were excluded from further consideration, while successfully tuned models were retained for further analysis.
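The composite score and the validity rule can be illustrated with the sketch below. Several details are not specified in the text and are assumed here: min-max normalization across candidate configurations, inversion of the Davies–Bouldin Index (where lower is better), a balance metric defined as the ratio of the smallest to the largest cluster, and the requirement of at least two clusters. All function names and metric values are hypothetical.

```python
import numpy as np


def balance(labels):
    """Assumed balance metric: smallest / largest cluster size, ignoring noise (-1)."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels[labels >= 0], return_counts=True)
    return counts.min() / counts.max() if counts.size else 0.0


def is_valid(labels, n_total, min_share=0.05):
    """Validity rule: at least two clusters (assumed), each with >= 5% of the sample."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels[labels >= 0], return_counts=True)
    return counts.size >= 2 and bool((counts >= min_share * n_total).all())


def min_max(values, higher_is_better=True):
    """Rescale to [0, 1] across candidates; invert when lower raw values are better."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    if span == 0:
        return np.full_like(v, 0.5)  # all candidates tie on this metric
    scaled = (v - v.min()) / span
    return scaled if higher_is_better else 1.0 - scaled


def composite_scores(sil, dbi, chi, bal, weights=(0.25, 0.25, 0.25, 0.25)):
    """Equal-weight aggregate of the four normalized metrics."""
    parts = [min_max(sil), min_max(dbi, higher_is_better=False), min_max(chi), min_max(bal)]
    return sum(w * p for w, p in zip(weights, parts))


# Three hypothetical candidate configurations (values are made up).
scores = composite_scores(
    sil=[0.2, 0.6, 0.4],
    dbi=[1.5, 0.5, 1.0],
    chi=[100.0, 300.0, 200.0],
    bal=[0.9, 0.5, 0.1],
)
print(scores, int(np.argmax(scores)))
```

Under these assumptions the second candidate wins because it leads on three of the four metrics, even though the first is better balanced.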

3.6. Cluster Characterization and Business Interpretation

Cluster characterization was conducted through a structured statistical and visual analysis aimed at identifying differentiated financial profiles while preserving methodological rigor and interpretability. For each cluster, descriptive statistics, including mean, median, and standard deviation, were computed across the selected financial metrics to summarize central tendency and dispersion. Given the non-Gaussian behavior commonly observed in financial ratios, non-parametric hypothesis testing was employed to assess the statistical significance of inter-cluster differences, using the Kruskal–Wallis test to evaluate global differences in medians across clusters and the Mann–Whitney U test for pairwise comparisons when applicable. To support intuitive and comparative analysis, multiple visualization techniques were applied, including radar charts constructed from normalized financial metrics to represent cluster-level profiles and heatmaps based on median values to highlight relative intensities across features. The analysis was performed on a dataset comprising companies with complete financial information, using a fixed set of financial features consistently across all clusters, while density-based outliers identified during clustering were treated separately to avoid distortion of cluster profiles. This combined statistical and visual framework enabled a robust characterization of clusters and provided a foundation for subsequent business-oriented interpretation.
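The non-parametric tests named above are available in SciPy. The sketch below applies them to synthetic leverage values for two hypothetical clusters; the data, group sizes, and seed are assumptions for illustration and do not correspond to the study's sample.

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

# Synthetic leverage values for two hypothetical clusters.
rng = np.random.default_rng(1)
leverage_a = rng.normal(0.0, 0.05, 80)  # e.g., firms with almost no debt
leverage_b = rng.normal(3.0, 0.50, 40)  # e.g., highly indebted firms

# Global test across all groups (generalizes to three or more clusters).
h_stat, p_global = kruskal(leverage_a, leverage_b)

# Pairwise follow-up between two specific clusters.
u_stat, p_pair = mannwhitneyu(leverage_a, leverage_b, alternative="two-sided")

print(f"Kruskal-Wallis H={h_stat:.1f}, p={p_global:.2e}")
print(f"Mann-Whitney U={u_stat:.1f}, p={p_pair:.2e}")
```

When many metrics or cluster pairs are tested, a multiple-comparison correction is usually advisable; the text does not state whether one was applied.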

3.7. Dimensionality Reduction for Visualization

Dimensionality reduction techniques were applied to project the high-dimensional financial feature space into two dimensions for visualization and cluster interpretation, with the objective of identifying the method that best preserves the underlying data structure. Three complementary approaches were evaluated: Principal Component Analysis (PCA) as a linear variance-maximization baseline, t-distributed Stochastic Neighbor Embedding (t-SNE) as a non-linear technique emphasizing local neighborhood structure, and Uniform Manifold Approximation and Projection (UMAP) to balance local and global structural preservation. To enable an objective comparison, the quality of the two-dimensional embeddings was assessed using quantitative metrics capturing different aspects of structural fidelity, including trustworthiness to measure local neighborhood preservation, continuity to evaluate the maintenance of global structural relationships, and Spearman rank correlation to quantify the correspondence between pairwise distances in the original and reduced spaces. These metrics were normalized and aggregated into a composite score using predefined weighted contributions to ensure balanced evaluation. PCA was evaluated using the full set of principal components for variance analysis prior to two-dimensional projection, while t-SNE was tested across multiple perplexity values to explore sensitivity to local versus global structure, and UMAP was evaluated using different neighborhood sizes to assess its ability to balance structural scales.
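A possible way to compute the three fidelity metrics for a given embedding is sketched below, using PCA as the example projection. Computing continuity as trustworthiness with the original and embedded spaces swapped is a common convention assumed here, as are the neighborhood size and the synthetic data; the study's exact definitions and weights are not given in the text.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness


def embedding_quality(X, X_2d, n_neighbors=10):
    """Score a 2-D embedding on local and global structure preservation."""
    trust = trustworthiness(X, X_2d, n_neighbors=n_neighbors)
    # Assumed convention: continuity = trustworthiness with the spaces swapped.
    cont = trustworthiness(X_2d, X, n_neighbors=n_neighbors)
    # Rank correlation between pairwise distances before and after projection.
    rho, _ = spearmanr(pdist(X), pdist(X_2d))
    return {"trustworthiness": trust, "continuity": cont, "spearman": rho}


# Synthetic 8-feature data that lies close to a 2-D plane.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
X_demo = latent @ rng.normal(size=(2, 8)) + 0.01 * rng.normal(size=(100, 8))

X_pca = PCA(n_components=2).fit_transform(X_demo)
quality = embedding_quality(X_demo, X_pca)
print({name: round(value, 3) for name, value in quality.items()})
```

The same function can be applied to t-SNE or UMAP outputs, so that all three methods are compared on identical terms.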

4. Results

The results section presents the empirical findings derived from the application of the proposed unsupervised learning framework to the financial data of tourism firms in Barranquilla. First, the characteristics of the dataset and the construction of the financial feature space are described, including data completeness, variability, and redundancy patterns. Subsequently, the results of the feature selection process, cluster validation, algorithm comparison, and cluster characterization are reported using quantitative metrics and statistical tests. Finally, dimensionality reduction outcomes are presented to support the visualization and interpretation of the identified financial behavior profiles.

4.1. Dataset Overview and Feature Engineering Outcomes

The final dataset analyzed comprised 525 active firms from the tourism sector in Barranquilla, predominantly classified as micro and small enterprises. After constructing the complete set of financial metrics, only 125 firms (23.8%) presented non-missing values across all variables. This reduction reflects the uneven availability of financial information within the sector, particularly among firms with limited operational activity or recent market entry.

A total of 11 financial metrics were obtained for subsequent clustering analysis, including five classical financial ratios (ROA, ROE, leverage, equity ratio, and approximated current liquidity) and six additional metrics derived through feature engineering, capturing dimensions of financial structure, liquidity, solvency, stability, and efficiency. Data completeness varied across metrics. The current asset ratio showed full coverage, while profitability- and leverage-related indicators exhibited a high proportion of zero values, with 68–76% of firms reporting zero observations. The debt coverage ratio presented the lowest availability, with valid data for 168 firms.

Outlier detection using the interquartile range (IQR) method identified extreme values in all 11 financial metrics, with outlier proportions exceeding 5% in every case. The highest incidence of outliers was observed for the equity-to-current-assets ratio (34.3%), followed by ROA (27.6%) and the capital multiplier (26.9%), indicating substantial dispersion in financial profiles across firms. Correlation analysis revealed several strong linear relationships among the metrics (|r| > 0.7), as shown in Table 2. Notably, near-perfect correlations were observed between leverage and capital multiplier (r = 0.998) and between ROA and return on current assets (r = 0.997), along with high correlations among profitability-related indicators and between debt coverage and profitability measures. These results highlight the presence of redundancy within the feature space prior to clustering.

4.2. Feature Selection Results

Feature selection was performed to define a statistically robust and interpretable feature space suitable for clustering. Correlation analysis identified two variables with near-perfect linear dependence on other metrics (|r| > 0.99). Specifically, the capital multiplier exhibited an almost perfect correlation with leverage (r = 0.998), and return on current assets was nearly identical to ROA (r = 0.997). In both cases, the more widely used and standard financial indicators, leverage and ROA, were retained, while their redundant counterparts were excluded from further analysis.
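A minimal sketch of this redundancy filter is given below, assuming pandas is available. The function name `drop_redundant`, the column names, and the synthetic data are hypothetical. The toy capital multiplier is built as 1 + leverage, which holds only under the assumed definitions leverage = liabilities / equity and capital multiplier = assets / equity; the section itself does not restate these definitions.

```python
import numpy as np
import pandas as pd

def drop_redundant(df, preferred, threshold=0.99):
    """Drop columns whose |Pearson r| with a preferred column exceeds threshold."""
    corr = df.corr().abs()
    to_drop = sorted({
        col
        for keep in preferred
        for col in df.columns
        if col not in preferred and corr.loc[keep, col] > threshold
    })
    return df.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
leverage = rng.gamma(2.0, 1.5, size=200)          # synthetic, not the paper's data
frame = pd.DataFrame({
    "leverage": leverage,
    "capital_multiplier": 1.0 + leverage,         # assumed identity, see lead-in
    "roa": rng.normal(0.0, 0.1, size=200),
})
reduced, removed = drop_redundant(frame, preferred=["leverage", "roa"])
print(removed)
```

Under that assumed identity the two ratios are affine transformations of each other, which would explain a correlation this close to 1.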

Variance analysis of the remaining metrics confirmed that all retained variables exhibited sufficient dispersion, and no features were removed due to near-zero variance. Subsequently, a trade-off analysis was conducted to assess the impact of increasing the number of features on data availability (Figure 2). The results show that data completeness remained stable when using up to six features, with approximately 500 firms (100%) retaining complete information. When expanding the feature set to eight metrics, the number of usable observations remained close to the full sample (approximately 500 firms; 97.1% completeness). In contrast, including a ninth feature led to a sharp reduction in the effective sample size, leaving only about 100 firms (31.8%) with complete data. Based on this balance between feature richness, interpretability, and data completeness, an eight-feature subset was selected for clustering.
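The trade-off between feature count and usable sample size can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the helper `completeness_curve`, the feature ordering, and the five-row toy table are assumptions.

```python
import numpy as np
import pandas as pd

def completeness_curve(df, ordered_features):
    """Rows with no missing value when using the first n features, n = 1..len."""
    return {
        n: int(df[ordered_features[:n]].notna().all(axis=1).sum())
        for n in range(1, len(ordered_features) + 1)
    }

# Hypothetical data: the third ratio is missing for most firms.
toy = pd.DataFrame({
    "roa": [0.0, 0.1, -0.2, 0.0, 0.3],
    "leverage": [0.0, 4.2, 1.1, 0.0, 2.0],
    "debt_coverage": [np.nan, 1.5, np.nan, np.nan, 0.7],
})
curve = completeness_curve(toy, ["roa", "leverage", "debt_coverage"])
print(curve)
```

The curve stays flat while well-populated ratios are added and drops sharply once a sparsely reported ratio enters, which is the pattern the text describes for the ninth feature.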

The final feature space used for clustering comprises eight financial metrics, covering profitability (ROA and ROE), financial structure (leverage, equity ratio, and liability ratio), and liquidity (approximated current liquidity, current asset ratio, and equity-to-current-assets ratio) (Table 3). Residual correlation analysis among the selected features revealed no problematic dependencies (|r| > 0.7), except for the expected correlation between ROA and ROE (r ≈ 0.77), which were retained due to their complementary perspectives on firm performance.

4.3. Determination of the Optimal Cluster Structure

The optimal number of clusters was evaluated using complementary internal validation metrics applied to k-means clustering solutions ranging from k = 2 to k = 10. The results obtained from each validation method are summarized in Table 4, while the comparative behavior of the metrics across different values of k is illustrated in Figure 3. Additionally, the hierarchical structure of the data was inspected through dendrogram analysis, as shown in Figure 4.

The Elbow Method, based on the within-cluster sum of squares (WCSS), showed a pronounced reduction in inertia from k = 2 to k = 3, followed by a substantially slower rate of decrease for larger values of k. This change in slope suggests a clear inflection point at k = 3, with only marginal improvements in cluster compactness for k ≥ 4, indicating diminishing returns from adding additional clusters beyond this point.

The Silhouette Score achieved its highest values for k = 2 and k = 3, with scores close to 1, indicating strong intra-cluster cohesion and well-separated clusters. As the number of clusters increased beyond k = 3, the silhouette score declined markedly, reflecting reduced clustering quality and increasing overlap between clusters.

The Davies–Bouldin Index exhibited its lowest values in the range of k = 2 to k = 4, indicating improved cluster compactness and minimal inter-cluster similarity within this interval. For larger values of k, the index increased steadily, suggesting a degradation in clustering performance as clusters became less distinct.

The Calinski–Harabasz Index increased monotonically with k, reaching its highest value at k = 10. However, rather than relying on the absolute maximum, the relative rate of increase was considered. The most substantial gains occurred between k = 2 and k = 4, after which improvements became progressively smaller, indicating limited benefit from further partitioning of the data.
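The four criteria discussed above can be computed in a single scan over k. The sketch below uses scikit-learn on synthetic data with three well-separated groups; the helper `scan_k`, the random seed, and the reduced range k = 2 to 6 are illustrative assumptions rather than the configuration used in the study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

def scan_k(X, k_values, seed=0):
    """Fit k-means for each k and collect the internal validation metrics."""
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        rows.append({
            "k": k,
            "wcss": model.inertia_,                                   # elbow criterion
            "silhouette": silhouette_score(X, model.labels_),        # higher is better
            "davies_bouldin": davies_bouldin_score(X, model.labels_),  # lower is better
            "calinski_harabasz": calinski_harabasz_score(X, model.labels_),
        })
    return rows

# Synthetic stand-in for the scaled feature matrix: three separated groups.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(0.0, 0.5, size=(60, 2)) for c in centers])

results = scan_k(X, range(2, 7))
best_by_silhouette = max(results, key=lambda row: row["silhouette"])["k"]
print(best_by_silhouette)
```

Because the metrics point in different directions (higher silhouette and Calinski–Harabasz, lower Davies–Bouldin and WCSS), they are best read side by side rather than combined without normalization.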

The hierarchical clustering dendrogram constructed using Ward linkage revealed pronounced vertical gaps among the last fusion steps, suggesting natural partitioning structures between k = 2 and k = 4. The largest separations were observed in the final 30 merges, supporting the presence of a limited number of well-defined clusters (Figure 4).
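The idea of reading a cluster count from the largest vertical gap in a Ward dendrogram can be sketched with SciPy. The function `clusters_at_largest_gap` and the synthetic data are hypothetical; the study inspected the dendrogram visually, so this is only one possible way to automate that reading.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def clusters_at_largest_gap(X, last_merges=10):
    """Cluster count implied by the largest jump among the final merge heights."""
    merge_heights = linkage(X, method="ward")[:, 2]   # one height per merge
    tail = merge_heights[-last_merges:]
    gaps = np.diff(tail)
    # Cutting just below tail[i + 1] leaves len(tail) - i clusters.
    return len(tail) - int(np.argmax(gaps))

# Synthetic data with three separated groups (not the study's data).
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(0.0, 0.5, size=(60, 2)) for c in centers])
suggested_k = clusters_at_largest_gap(X_demo)
print(suggested_k)
```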

Across all validation techniques, the consolidated results indicated the strongest consensus for a three-cluster solution (k = 3), which was supported by the majority of internal metrics.

4.4. Comparative Performance of Clustering Algorithms

The comparative performance of the evaluated clustering algorithms is summarized through multiple internal validation metrics, as illustrated in Figure 5, while the aggregated composite scores and their metric-level contributions are presented in Figure 6.

As shown in Figure 5, substantial variability was observed across algorithms with respect to cluster cohesion, compactness, separation, and balance. In terms of Silhouette Score, OPTICS achieved the highest normalized value, reaching the maximum score among all evaluated methods. Density-based approaches, including HDBSCAN and Mean Shift, also exhibited relatively high silhouette values, whereas partitional and probabilistic methods, such as MiniBatch K-Means, Spectral Clustering, and Gaussian Mixture Models, obtained noticeably lower scores.

Regarding cluster compactness, assessed through the Davies–Bouldin Index, OPTICS again yielded the lowest value among all algorithms, indicating reduced overlap between clusters. Other density-based methods, such as HDBSCAN and Mean Shift, showed intermediate values, while higher indices were observed for Gaussian Mixture Models and MiniBatch K-Means.

The Calinski–Harabasz Index, used to quantify the ratio of between-cluster to within-cluster variance, exhibited a pronounced peak for OPTICS, with values several orders of magnitude higher than those obtained by alternative algorithms. In contrast, hierarchical and partitional methods produced comparatively low Calinski–Harabasz scores, reflecting weaker separation structures in the resulting cluster configurations.

Cluster size balance, measured through the coefficient of variation of cluster sizes, revealed moderate variability across algorithms. Gaussian Mixture Models and MiniBatch K-Means achieved the lowest imbalance values, while Mean Shift and DBSCAN presented higher dispersion in cluster sizes. OPTICS exhibited an intermediate balance score, reflecting the coexistence of core clusters and a noise component.
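A coefficient of variation of cluster sizes can be computed as in the sketch below. Two choices are assumptions, because the section does not specify them: noise points (label −1) are excluded, and the population standard deviation is used. The example uses the OPTICS group sizes reported in this section (37 and 299 firms, plus 174 noise points).

```python
import numpy as np

def size_cv(labels, ignore_noise=True):
    """Coefficient of variation (std / mean) of cluster sizes."""
    labels = np.asarray(labels)
    if ignore_noise:
        labels = labels[labels != -1]          # drop density-based noise points
    _, counts = np.unique(labels, return_counts=True)
    return counts.std() / counts.mean()        # population std (ddof=0)

optics_labels = [0] * 37 + [1] * 299 + [-1] * 174
cv_core = size_cv(optics_labels)
cv_with_noise = size_cv(optics_labels, ignore_noise=False)
print(round(cv_core, 2), round(cv_with_noise, 2))
```

Whether the noise component counts as a group changes the value noticeably, so the convention should be stated whenever this balance measure is compared across density-based and partitional algorithms.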

The aggregation of all normalized metrics into a composite score is presented in Figure 6. OPTICS obtained the highest composite score (0.944), substantially exceeding the scores of all other evaluated algorithms. HDBSCAN ranked second with a composite score slightly above 0.5, while the remaining methods achieved composite scores below 0.4.

The metric-level contribution to the composite score, also illustrated in Figure 6, shows that OPTICS consistently contributed positively across all evaluated dimensions, with strong contributions from the Silhouette Score, Davies–Bouldin Index, and Calinski–Harabasz Index. In contrast, competing algorithms exhibited unbalanced contributions, typically dominated by one or two metrics while underperforming in others.
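One way to build such a composite score is to min-max normalize each metric across algorithms, invert the metrics where lower is better, and average the results. The sketch below assumes equal weights, which this section does not confirm, and uses hypothetical raw values for three unnamed algorithms.

```python
import numpy as np

def composite_scores(metrics, lower_is_better):
    """Mean of min-max normalized metrics; one score per algorithm."""
    normalized = []
    for name, values in metrics.items():
        v = np.asarray(values, dtype=float)
        span = v.max() - v.min()
        # A constant metric carries no ranking information: use 0.5 for all.
        scaled = (v - v.min()) / span if span > 0 else np.full_like(v, 0.5)
        if name in lower_is_better and span > 0:
            scaled = 1.0 - scaled              # flip so that 1.0 is always best
        normalized.append(scaled)
    return np.mean(normalized, axis=0)         # equal weights (assumption)

# Hypothetical raw metric values for three algorithms, not taken from the paper.
toy = {
    "silhouette": [0.9, 0.5, 0.3],
    "davies_bouldin": [0.2, 0.8, 1.4],
    "calinski_harabasz": [5000.0, 400.0, 300.0],
    "size_cv": [0.8, 0.6, 0.4],
}
scores = composite_scores(toy, lower_is_better={"davies_bouldin", "size_cv"})
print(scores.round(3))
```

In the toy data the first algorithm wins three metrics but is the least balanced, so its composite is 0.75 rather than 1.0, which mirrors how a method with an intermediate balance score can still rank first overall.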

Based on the quantitative comparison across individual metrics and the aggregated composite score, OPTICS was selected as the clustering algorithm for subsequent analysis. A direct comparison between the selected algorithm and the k-means baseline is reported in Table 5. The optimal OPTICS configuration produced a clustering structure consisting of two core clusters and a noise component. Specifically, 174 firms (34.1%) were classified as outliers (Cluster −1), while 37 firms (7.3%) and 299 firms (58.6%) were assigned to Cluster 0 and Cluster 1, respectively, for a total of 510 firms.
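The sketch below shows how OPTICS labels, including the noise label −1, can be obtained with scikit-learn and summarized as counts and percentages. The `min_samples` value and the synthetic data are assumptions, since the optimal configuration is not given in this section. The helper `group_shares` is hypothetical and is also applied to the group sizes reported above.

```python
import numpy as np
from sklearn.cluster import OPTICS

def group_shares(labels):
    """Map each label to (count, percentage of all observations)."""
    values, counts = np.unique(labels, return_counts=True)
    total = counts.sum()
    return {
        int(v): (int(c), round(float(100.0 * c / total), 1))
        for v, c in zip(values, counts)
    }

# Synthetic data: two dense groups plus scattered points (not the study's data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.05, size=(40, 2)),
    rng.normal(5.0, 0.3, size=(150, 2)),
    rng.uniform(-20.0, 20.0, size=(30, 2)),
])
labels = OPTICS(min_samples=10).fit(X).labels_   # -1 marks noise points
print(group_shares(labels))

# Group sizes as reported in the text: noise, Cluster 0, Cluster 1.
reported = np.array([-1] * 174 + [0] * 37 + [1] * 299)
print(group_shares(reported))
```

Applied to the reported sizes, the helper returns 34.1%, 7.3%, and 58.6% of 510 firms, matching the percentages stated in the text.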

4.5. Cluster Characterization and Business-Oriented Interpretation

4.5.1. Dataset and Cluster Structure

The cluster characterization analysis was conducted on a dataset comprising 510 companies with complete financial information. Based on the density-based clustering results, two regular clusters (Cluster 0 and Cluster 1) were identified, while 174 companies (34.1%) were classified as outliers (Cluster −1) and analyzed separately to avoid distortion of cluster-level profiles.

All clusters were characterized using a consistent set of eight financial ratios, covering profitability, financial structure, and liquidity dimensions.

4.5.2. Statistical Differentiation Between Clusters

Non-parametric hypothesis testing was applied to assess inter-cluster differences in financial indicators. The results of the Kruskal–Wallis test are reported in Table 6.

Six out of eight financial features (75%) showed statistically significant differences across clusters (p < 0.05). The most discriminative variables corresponded to financial structure indicators (leverage, equity ratio, and debt ratio), all of which presented highly significant differences (p < 0.001). Profitability indicators (ROA and ROE) also exhibited significant differences, whereas two liquidity-related ratios did not show statistically significant variation.

Pairwise comparisons between the two regular clusters using the Mann–Whitney U test (Table 7) confirmed significant differences in financial structure and profitability metrics, with large effect sizes for leverage-related indicators and moderate effect sizes for profitability measures.
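The two tests can be run with SciPy as sketched below. The rank-biserial correlation is used as the effect size, which is one common choice for the Mann–Whitney U test; the section does not state which effect size measure was reported, so this is an assumption. The three samples are synthetic stand-ins for one indicator across the three groups.

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

def rank_biserial(x, y):
    """Rank-biserial correlation derived from the Mann-Whitney U statistic."""
    u_stat, p_value = mannwhitneyu(x, y, alternative="two-sided")
    # With recent SciPy versions U refers to x, so r < 0 means x tends to be smaller.
    r = 2.0 * u_stat / (len(x) * len(y)) - 1.0
    return r, p_value

# Synthetic leverage-like values (illustrative only).
rng = np.random.default_rng(7)
inactive = np.zeros(20)                          # no leverage at all
leveraged = rng.uniform(3.0, 5.0, size=40)       # consistently high leverage
atypical = rng.normal(0.0, 5.0, size=30)         # heterogeneous group

h_stat, p_kw = kruskal(inactive, leveraged, atypical)   # omnibus test, 3 groups
effect, p_mw = rank_biserial(inactive, leveraged)       # pairwise, regular clusters
print(p_kw < 0.001, p_mw < 0.001, abs(effect))
```

When the two samples do not overlap at all, as in this toy case, the absolute effect size reaches its maximum of 1.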

4.5.3. Financial Profile Characterization of Regular Clusters

Cluster 0 comprised 37 companies (7.3%), all classified as micro-sized firms. As illustrated in the financial profile visualization (Figure 7), this cluster exhibited a nearly flat normalized profile across profitability and leverage dimensions.

Median ROA and ROE values were equal to zero, indicating the absence of recorded operating activity. From a capital structure perspective, Cluster 0 showed no financial leverage, with a median leverage value of zero and a median equity ratio of 100%. Liquidity indicators were uniformly low, with median values well below standard reference levels.

The heatmap of median financial characteristics (Figure 8) further highlights the dominance of equity-based financing and the absence of debt in this cluster.

Cluster 1 represented the largest group, comprising 299 companies (58.6%), predominantly micro-sized enterprises (97.0%). The financial profile visualization (Figure 7) shows a markedly unbalanced profile for this cluster, characterized by strong leverage intensity and reduced equity participation.

Median ROA and ROE values were negative, indicating operating losses. Financial structure metrics revealed high leverage, with a median leverage value exceeding four times equity and a median equity ratio below 20%. Liquidity indicators were low and comparable to those observed in Cluster 0.
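The reported leverage and equity ratio values are mutually consistent under a simple accounting identity. The sketch assumes that leverage is defined as total liabilities divided by equity and that total assets equal liabilities plus equity; these definitions are not restated in this section, so they are labeled as assumptions.

```python
def equity_ratio_from_leverage(debt_to_equity):
    """Equity / assets implied by liabilities / equity, with assets = liabilities + equity."""
    return 1.0 / (1.0 + debt_to_equity)

# Cluster 0: zero leverage implies a fully equity-financed balance sheet.
cluster0_equity_ratio = equity_ratio_from_leverage(0.0)
# Cluster 1: leverage of four times equity marks the 20% equity ratio boundary.
cluster1_boundary = equity_ratio_from_leverage(4.0)
print(cluster0_equity_ratio, round(cluster1_boundary, 2))
```

Under these assumptions, any leverage above four times equity corresponds to an equity ratio below 20%, and zero leverage corresponds to an equity ratio of 100%, in line with the medians described for both clusters.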

These contrasts are clearly emphasized in the heatmap representation (Figure 8), where Cluster 1 displays high intensity in debt-related ratios and reduced equity contribution.

The outlier group (Cluster −1) included 174 companies (34.1%) and exhibited heterogeneous financial behavior. Median values for profitability, leverage, and equity ratios deviated substantially from the interquartile ranges observed in the regular clusters, as summarized in Table 8.

The size composition of this group was more diverse, including micro, small, medium, and one large firm, reinforcing its atypical and non-homogeneous nature.

4.5.4. Visual Validation of Cluster Profiles

Visual analysis supported the statistical findings. The radar charts of normalized financial metrics (Figure 7) revealed clear structural differences between the two regular clusters: a flat, low-activity profile for Cluster 0 and a leveraged, imbalanced profile for Cluster 1.

The heatmap of median values (Figure 8) highlighted strong contrasts in financial structure indicators, particularly equity ratio and debt intensity, while liquidity-related metrics showed similar levels across clusters.

4.5.5. Behavioral Financial Archetypes and Business Interpretation

The combined statistical and visual analyses reveal the presence of distinct financial behavior patterns across the identified clusters. These differences are consistently supported by non-parametric hypothesis testing, median-based comparisons, and cluster-level visualizations, confirming that the segmentation captures heterogeneous financial dynamics rather than superficial structural similarities.

Cluster 0 is characterized by a financial profile with null profitability, absence of financial leverage, and limited liquidity, indicating firms without observable operating activity during the analyzed period. In contrast, Cluster 1 exhibits active financial behavior, marked by high leverage, reduced equity participation, negative profitability, and constrained liquidity, reflecting firms engaged in ongoing operations under financial stress or restructuring conditions.

The robustness of this behavioral segmentation is further reinforced by the consistency between statistical significance tests and visual validation tools. Radar charts highlight contrasting structural patterns between clusters, while heatmap representations emphasize sharp differences in capital structure metrics and relatively homogeneous liquidity conditions. Box plot distributions corroborate these findings by showing minimal overlap in leverage-related indicators and partial overlap in profitability metrics.

The outlier group aggregates firms with atypical financial configurations that deviate substantially from the dominant patterns observed in regular clusters. These entities exhibit extreme values in profitability and capital structure indicators, suggesting the presence of non-standard financial behavior that cannot be adequately represented by the regular cluster archetypes.

Overall, the resulting segmentation defines clear financial archetypes that are interpretable, statistically supported, and suitable for business-oriented analysis. This structure provides a coherent framework for differentiating firms based on their financial behavior, enabling targeted analytical, managerial, or risk-oriented applications without relying on predefined firm attributes.

4.6. Evaluation of Dimensionality Reduction Techniques for Visualization

To facilitate the visual inspection of the clustering structure and the spatial distribution of firms in a two-dimensional space, three dimensionality reduction techniques (PCA, t-SNE, and UMAP) were evaluated and compared based on quantitative embedding quality metrics.

4.6.1. Principal Component Analysis (PCA)

The variance explained by each principal component is reported in the left image in Figure 9. The first principal component (PC1) accounted for 27.14% of the total variance, while the second principal component (PC2) explained an additional 22.14%. Together, the two-dimensional PCA projection captured 49.28% of the total variance of the standardized financial feature space. The cumulative variance exceeded 75% with four components and 95% with six components, as illustrated in the right image in Figure 9.
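Explained and cumulative variance can be obtained from a fitted PCA as in the following sketch, which uses scikit-learn on synthetic eight-feature data. The helper `components_for` and the data-generating process are hypothetical; only the final arithmetic check uses figures from the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def components_for(cumulative, target):
    """Smallest number of components whose cumulative variance reaches target."""
    return int(np.searchsorted(cumulative, target) + 1)

# Synthetic eight-feature matrix driven by three latent factors plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.5 * rng.normal(size=(300, 8))

X_std = StandardScaler().fit_transform(X)        # PCA on standardized features
ratios = PCA().fit(X_std).explained_variance_ratio_
cumulative = np.cumsum(ratios)
print(components_for(cumulative, 0.75), components_for(cumulative, 0.95))

# Arithmetic check on the shares reported in the text (percent).
two_component_share = round(27.14 + 22.14, 2)
print(two_component_share)
```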

4.6.2. Comparison of Dimensionality Reduction Techniques

The quality of the two-dimensional embeddings obtained with PCA, t-SNE, and UMAP was assessed using trustworthiness, continuity, and Spearman rank correlation. The quantitative comparison of these metrics is summarized in Table 9 and visualized in Figure 10.

PCA achieved a trustworthiness score of 0.650 and a continuity score of 0.817, while exhibiting the highest Spearman rank correlation (0.890), indicating strong preservation of pairwise distance rankings between the original and reduced spaces. t-SNE obtained the highest trustworthiness (0.830) and continuity (0.895), reflecting improved preservation of local neighborhood relationships. UMAP achieved intermediate values across all metrics, with trustworthiness and continuity scores of 0.785 and 0.834, respectively.
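The three embedding quality metrics can be sketched as follows. Trustworthiness is available in scikit-learn; continuity is computed here by swapping the roles of the original and embedded spaces in the same function, which follows from the symmetric definition of the two measures. The Spearman correlation is taken over all pairwise distances. The neighborhood size of 5 and the synthetic data are assumptions, as the study's settings are not given in this section.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

def embedding_quality(X_high, X_low, n_neighbors=5):
    """Local (trustworthiness, continuity) and global (Spearman) preservation."""
    return {
        # Penalizes points that become neighbors only in the embedding.
        "trustworthiness": trustworthiness(X_high, X_low, n_neighbors=n_neighbors),
        # Same formula with the two spaces swapped: penalizes lost neighbors.
        "continuity": trustworthiness(X_low, X_high, n_neighbors=n_neighbors),
        # Rank agreement between all pairwise distances in both spaces.
        "spearman": float(spearmanr(pdist(X_high), pdist(X_low))[0]),
    }

rng = np.random.default_rng(5)
X_high = rng.normal(size=(100, 8))               # synthetic eight-feature data
X_2d = PCA(n_components=2).fit_transform(X_high)
quality = embedding_quality(X_high, X_2d)
print({name: round(value, 3) for name, value in quality.items()})
```

The first two metrics describe local neighborhoods and the third describes global distance ordering, which is why a linear method such as PCA can lead on Spearman correlation while a neighbor-based method such as t-SNE leads on trustworthiness and continuity.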

The normalized composite scores, integrating all evaluation metrics, ranked t-SNE as the highest-performing technique for two-dimensional visualization (composite score = 0.834), followed by PCA (0.786) and UMAP (0.766), as reported in Table 9.
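As a quick arithmetic check, the PCA composite score is consistent with an unweighted mean of the three PCA metrics reported in this section. Whether the same rule produced the t-SNE and UMAP scores cannot be verified here, because their Spearman correlations are not stated in the text; the averaging rule is therefore an assumption.

```python
# PCA metrics as reported in the text.
pca_metrics = {"trustworthiness": 0.650, "continuity": 0.817, "spearman": 0.890}

# Unweighted mean (assumed aggregation rule).
pca_composite = sum(pca_metrics.values()) / len(pca_metrics)
print(round(pca_composite, 3))
```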

4.6.3. Two-Dimensional Cluster Visualization

The two-dimensional projections obtained with each technique are shown in Figure 11, where clusters identified in the high-dimensional space are overlaid on the reduced representations. The PCA-based visualization shows partial overlap between clusters, whereas the t-SNE embedding presents a clearer spatial separation between the main clusters and outliers. The UMAP projection also reveals well-defined groupings, although with greater compactness and reduced inter-cluster spacing.

Based on the composite evaluation metrics and the visual separation observed in the two-dimensional embeddings, t-SNE was selected as the primary technique for cluster visualization in subsequent analyses. The final t-SNE representation, including cluster centroids, is presented in Figure 12.

5. Discussion

This study aimed to characterize and segment tourism firms in Barranquilla based on their financial behavior using normalized financial ratios and unsupervised learning techniques. Unlike traditional firm classifications driven by size or scale, the proposed framework focuses on identifying structural patterns in profitability, capital structure, liquidity, and solvency. The results demonstrate that this approach enables the identification of robust and interpretable financial archetypes that reflect heterogeneous economic dynamics within the sector.

Beyond the identification of financial archetypes, it is essential to contrast these findings with the existing literature to evaluate their theoretical and methodological implications.

5.1. Financial Behavior Archetypes in the Tourism Sector

The clustering analysis revealed the presence of two regular clusters and a group of outliers, each representing distinct financial behavior archetypes. These profiles are not merely statistical groupings but reflect differentiated patterns of financial activity, capital allocation, and operational intensity.

Cluster 0 corresponds to firms exhibiting no observable financial activity during the analyzed fiscal year. These entities are characterized by null profitability indicators, absence of financial leverage, full reliance on equity financing, and low liquidity levels. The financial configuration of this group suggests firms that are formally registered but inactive, recently created, or undergoing temporary suspension of operations. The homogeneity of this profile across multiple financial dimensions supports its interpretation as a distinct archetype rather than a transitional or noisy group.

Cluster 1 represents firms with active operations but under financial stress. This group is characterized by high leverage, reduced equity participation, negative profitability, and constrained liquidity. The coexistence of operational activity with persistent losses and aggressive capital structures suggests firms facing structural challenges related to cost management, debt servicing, or unfavorable market conditions. These findings align with previous studies highlighting the relevance of capital structure and profitability as key differentiators in financial segmentation frameworks.

The outlier group aggregates firms with atypical financial configurations that deviate substantially from the dominant patterns observed in the regular clusters. The heterogeneity of this group, both in terms of financial ratios and firm size composition, suggests the presence of non-standard business models, exceptional financial events, or data irregularities. Treating these firms as a separate category rather than forcing their inclusion into regular clusters preserves the interpretability and robustness of the segmentation.

These results are consistent with prior research demonstrating the capacity of clustering techniques to reveal latent financial structures within heterogeneous sectors (Griffin et al., 2023; Han et al., 2019). As in those studies, the use of unsupervised learning enabled the identification of underlying patterns that traditional descriptive financial analysis may fail to detect. In this sense, the present findings reinforce the methodological relevance of clustering for uncovering hidden configurations of risk exposure and profitability in volatile environments such as tourism.

5.2. Behavior-Based Segmentation Versus Size-Based Classification

One of the most relevant findings of this study is that the identified clusters are not explained by firm size. Both regular clusters are overwhelmingly composed of micro-sized enterprises, yet they exhibit markedly different financial behaviors. This result provides empirical evidence supporting the hypothesis that financial performance and risk profiles in the tourism sector are better explained by behavioral and structural characteristics than by scale alone.

The statistically significant differences observed in financial structure and profitability indicators, contrasted with the limited discriminative power of liquidity metrics, reinforce the notion that capital allocation and operational efficiency, rather than firm size, are the primary drivers of financial outcomes. By demonstrating that microenterprises in Barranquilla follow substantially different financial trajectories despite their similar scale, this study provides localized validation for behavior-based segmentation. These results strengthen prior arguments regarding the inherent limitations of size-driven classifications in heterogeneous economic contexts (Chen et al., 2024).

5.3. Interpretation of Discriminative Financial Dimensions

The analysis highlights financial structure indicators, such as leverage, equity ratio, and liability ratio, as the most discriminative variables across clusters. These metrics exhibited strong statistical significance and minimal distributional overlap, underscoring their relevance in distinguishing financial archetypes. Profitability indicators (ROA and ROE) also contributed to cluster differentiation, although with moderate overlap, reflecting shared challenges across firms within the sector.

The prominence of leverage and profitability indicators as primary differentiating factors is coherent with prior studies emphasizing the role of capital structure and return-based metrics in financial segmentation frameworks (Karahuta et al., 2017; Vilas et al., 2022). The identification of a financially stressed cluster characterized by high indebtedness and negative profitability further aligns with the literature describing the structural vulnerability of the tourism sector to external shocks and volatility (Brida et al., 2021).

Although liquidity is traditionally regarded as a crucial indicator of financial health and short-term solvency (Matias et al., 2024), the results obtained for Barranquilla in 2024 indicate that liquidity ratios did not significantly differentiate the main clusters. Rather than contradicting the literature, this finding suggests that constrained liquidity may represent a structural characteristic affecting the sector during the analyzed period. Consequently, liquidity appears as a shared contextual condition rather than a discriminating variable within the segmentation framework.

5.4. Role of Visualization and Outlier Treatment

The use of dimensionality reduction techniques supported the qualitative interpretation of the clustering results by providing intuitive two-dimensional representations of the high-dimensional financial space. While visual compactness varied across methods, the selected visualization approach facilitated the identification of cluster separation, internal dispersion, and the spatial positioning of outliers. Importantly, visualization was used as a complementary interpretative tool rather than as a criterion for defining clusters, preserving methodological rigor.

The explicit identification and separate treatment of outliers further strengthened the analytical framework. By acknowledging the presence of atypical financial behaviors, the study avoids oversimplification and recognizes the inherent complexity of financial data in real-world settings.

From a methodological standpoint, the use of the OPTICS algorithm also addresses limitations previously identified in local research contexts (Fontalvo Herrera et al., 2023). Unlike simple partitional models such as k-means, density-based approaches are better suited to handle skewed financial distributions and extreme values. The substantial improvement observed in clustering quality, reflected in the marked increase in the Silhouette Score relative to the baseline model, demonstrates the added value of robust clustering techniques for real-world financial datasets characterized by zero inflation and outliers.

5.5. Contextual Scope and Temporal Considerations

The results of this study are conditioned by the temporal context of the data, which correspond to the 2024 fiscal year. Consequently, the identified financial behavior patterns reflect the structural and operational conditions of the tourism sector during this specific period. Given the sensitivity of the sector to economic cycles, demand fluctuations, and external shocks, the interpretation of the results should be framed within this temporal scope. Future studies incorporating longitudinal data may provide additional insights into the persistence or evolution of the identified financial archetypes.

5.6. Implications for Financial Analysis and Monitoring

Although this study does not aim to prescribe operational or policy actions, the identified financial archetypes may inform the development of analytical frameworks for financial monitoring, risk assessment, and firm segmentation. By relying exclusively on normalized financial ratios and unsupervised learning, the proposed approach offers a flexible and scalable tool that can be adapted to different institutional or analytical contexts without relying on predefined firm attributes.

6. Conclusions

This study developed and validated a behavior-based framework for the financial segmentation of tourism firms in Barranquilla using normalized financial ratios and unsupervised learning techniques. By avoiding firm size as a classification criterion, the proposed approach focused on identifying structural patterns in financial behavior related to profitability, capital structure, and liquidity.

The empirical results confirm that the framework is capable of identifying distinct and statistically robust financial profiles within the sector. Two regular clusters were consistently detected, representing firms without observable operational activity and firms with active operations but under significant financial pressure, respectively. These profiles were supported by statistically significant differences across most financial dimensions, particularly those related to capital structure and profitability.

A relevant contribution of this study is the empirical confirmation that firm size does not drive the observed segmentation. Despite the predominance of micro-sized enterprises in both regular clusters, the identified profiles exhibit substantially different financial configurations, reinforcing the value of behavior-based segmentation approaches for analyzing heterogeneous sectors.

From a methodological standpoint, the study highlights the importance of rigorous feature engineering based on financial ratios and the use of robust clustering techniques capable of handling skewed distributions, zero inflation, and extreme values. The explicit identification of outlier firms as a separate group further enhances the interpretability of the results and preserves relevant information that would otherwise be lost through aggressive data filtering.

The findings of this study are subject to several limitations that should be acknowledged. First, the analysis is based on financial data corresponding to a single fiscal year (2024), which constrains the temporal generalizability of the identified financial profiles. The observed patterns reflect the structural and economic conditions of the tourism sector during that specific period and may vary under different macroeconomic or sectoral contexts.

Second, the availability and quality of financial information limited the effective sample size for clustering, particularly due to the high prevalence of zero values in profitability indicators and missing data in certain ratios. Although these characteristics are inherent to real-world financial datasets, they may affect the stability of the identified clusters.

Third, while the outlier group was deliberately preserved to maintain analytical integrity, its heterogeneity prevents detailed characterization within the current framework and requires complementary qualitative or firm-level analysis.

Future research may extend this work in several directions. Incorporating longitudinal financial data would allow the analysis of cluster stability and transitions over time, providing insights into the persistence and evolution of financial behavior patterns. Additionally, integrating qualitative information or operational variables could support a deeper interpretation of outlier firms and atypical financial configurations.

Further studies may also explore the application of the proposed framework to other economic sectors or geographic contexts, assessing its generalizability and comparative performance. Finally, the development of monitoring or early-warning tools based on the identified financial archetypes represents a promising avenue for applied research, provided that such extensions are supported by additional empirical validation.

Overall, this study provides a robust and replicable foundation for behavior-based financial segmentation using unsupervised learning, contributing to a more nuanced understanding of financial heterogeneity in the tourism sector.

7. Patents

The authors declare that no patents resulted from the work reported in this manuscript.

Author Contributions

The authors have contributed to the development of the article as follows: L.H.P.C. worked on writing the original draft, validation, resources, methodology and conceptualization; T.J.F.H. worked on the editing, validation, resources, methodology and conceptualization; G.N.A. worked on the validation, resources, methodology and conceptualization; E.D.-L.-H.-F. worked on writing, reviewing, editing and data curation; T.J.C.B. worked on writing, reviewing, editing and data curation; and J.E.-G. worked on writing, reviewing, editing and data curation. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Database obtained from the Barranquilla Chamber of Commerce.

Conflicts of Interest

The authors declare no conflict of interest.

References

Adamska, A., & Dąbrowski, T. J. (2021). Investor reactions to sustainability index reconstitutions: Analysis in different institutional contexts. Journal of Cleaner Production, 297, 126715.
Bravo, J., Alarcón, R., Valdivia, C., & Serquén, O. (2023). Application of machine learning techniques to predict visitors to the tourist attractions of the moche route in Peru. Sustainability, 15(11), 8967.
Brealey, R. A., Myers, S. C., & Allen, F. (2018). Principles of corporate finance. McGraw-Hill.
Brida, J. G., Olivera, M., & Segarra, V. (2021). Crescimento econômico e turismo na América Latina e no Caribe. Revista Brasileira de Pesquisa em Turismo, 15(1), 2300.
Celebi, M. E., Kingravi, H. A., & Vela, P. A. (2013). A comparative study of efficient initialization methods for the K-means clustering algorithm. Expert Systems with Applications, 40(1), 200–210.
Chen, L., Fan, M., & Wang, J. (2024). A controlled data envelopment analysis clustering approach based on individual perspective. Information Sciences, 677, 120932.
De La Hoz, E., & Polo, L. L. (2017). Application of Cluster analysis techniques and artificial neural networks for the evaluation of the exporting capability of a company. Información Tecnológica, 28(4), 67–74.
Dotson, M., Dave, D., Clark, J. D., & Aggarwal, A. (2014). A neural network and cluster analytic approach in tourism research in the United States. International Journal of Management and Social Sciences, 4(1). Available online: https://journals.foundationspeak.com/index.php/ijmss/article/view/170 (accessed on 31 October 2025).
Fontalvo Herrera, T. J., & La Hoz-Granadillo, E. D. (2020). Método conglomerado-análisis discriminante-análisis envolvente de datos para clasificar y evaluar eficiencia empresarial. Entramado, 16(2), 46–55. [Google Scholar] [CrossRef]
Fontalvo Herrera, T. J., Vega Hernández, M. A., & Mejía Zambrano, F. (2023). Método de clustering e inteligencia artificial para clasificar y proyectar delitos violentos en Colombia. Revista Científica General José María Córdova, 21(42), 551–572. [Google Scholar] [CrossRef]
Griffin, E. C., Keskin, B. B., & Allaway, A. W. (2023). Clustering retail stores for inventory transshipment. European Journal of Operational Research, 311(2), 690–707. [Google Scholar] [CrossRef]
Hair, J. F., Babin, B. J., Anderson, R. E., & Black. (2019). Multivariate data analysis (8th ed.). Cengage Learning Asia Pte Ltd. Taiwan Branch. [Google Scholar]
Han, J., Kamber, M., & Pei, J. (2019). Data mining: Concepts and techniques. Morgan Kaufmann. [Google Scholar]
Herrera, A., Arroyo, Á., Jiménez, A., & Herrero, Á. (2022). Analysis of the Tourism industry in ecuador by means of soft computing techniques. In International workshop on soft computing models in industrial and environmental applications (pp. 811–820). Springer. [Google Scholar] [CrossRef]
IDB Invest. (2026). Tourism. Available online: https://idbinvest.org/en/sectors/tourism (accessed on 1 November 2025).
Inter-American Development Bank. (2024). Tourism. Inter-American Development Bank. [Google Scholar]
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning. Springer. [Google Scholar] [CrossRef]
Karahuta, M., Gallo, P., Matušková, D., Šenková, A., & Šambronská, K. (2017). Forecast of using neural networks in the tourism sector. CBU International Conference Proceedings, 5, 218–223. [Google Scholar] [CrossRef]
Kelliher, F., Reinl, L., Johnson, T. G., & Joppe, M. (2018). The role of trust in building rural tourism micro firm network engagement. Tourism Management, 68, 1–12. [Google Scholar] [CrossRef]
la Hoz, L. E., Iglesias, M. A., & Perez Coronell. (2020). Método cluster-RNA para clasificar, caracterizar y pronosticar perfiles competitivos del sector tiendas minoristas en la ciudad de barranquilla. Ingeniería y Competitividad (Inge CuC), 16(1), 234–240. [Google Scholar] [CrossRef]
Matias, F., Rebelo, S., Andraz, G., & Guerreiro, J. (2024). The influence of firm characteristics and macroeconomic factors on financial performance: Evidence from the Portuguese hotel industry. Economies, 12(11), 306. [Google Scholar] [CrossRef]
MINCIT. (2024). Impacto del turismo en la economía colombiana. MINCIT. Available online: https://www.mincit.gov.co/ (accessed on 1 November 2025).
ONU. (2022). Resumen del año | 2022 el año de repensar el turismo. ONU. Available online: https://www.unwto.org/es/omt-2022-resumen-ano (accessed on 31 October 2025).
Tiwari, M., & Tripathi, S. (2023). Application of clustering algorithms on tourism industry. International Journal of Research in Applied Science and Engineering Technology, 11(5), 2290–2294. [Google Scholar] [CrossRef]
Vilas, P., Andreu, L., & Sarto, J. L. (2022). Cluster analysis to validate the sustainability label of stock indices. Journal of Cleaner Production, 330, 129862. [Google Scholar] [CrossRef]

Figure 1. General workflow. Source: The authors.

Figure 2. Trade-off between feature count and data completeness. Source: The authors.

Figure 3. Comparative analysis of internal validation metrics for different numbers of clusters. Source: The authors. Curves correspond to the Elbow (WCSS), Silhouette, Davies–Bouldin, and Calinski–Harabasz metrics evaluated across k values; higher Silhouette and Calinski–Harabasz and lower Davies–Bouldin indicate better clustering quality.

Figure 4. Hierarchical clustering dendrogram using Ward linkage (last 30 merges). Source: The authors.

Figure 5. Comparative performance of clustering algorithms across individual validation metrics. Source: The authors.

Figure 6. Composite score comparison for evaluated clustering algorithms. Source: The authors.

Figure 7. Visualization of normalized financial profiles by cluster. Source: The authors.

Figure 8. Heatmap of median financial characteristics by cluster. Source: The authors.

Figure 9. Variance charts. Source: The authors.

Figure 10. Comparison of Dimension Reduction Techniques. Source: The authors.

Figure 11. 2D Cluster Visualization—Comparison of Techniques. Source: The authors. The yellow box highlights the t-SNE embedding, which was selected as the primary technique for cluster visualization due to its clearer separation between clusters and outliers.

Figure 12. Detailed visualization of t-SNE. Source: The authors. (Left) The star markers indicate the cluster centroids (i.e., the mean 2D position of each cluster in the t-SNE space) and are used as visual references for the central location of the clusters. (Right) The same t-SNE embedding colored by company size category.

Table 1. Summary of financial metrics.

Category	Variable/Metric	Description
Original variables of the dataset	Total assets	Total value of the company’s assets.
	Net equity	Value of net equity.
	Total liabilities	Total liabilities or financial obligations.
	Operating income	Income from ordinary operating activities.
	Net income	Final result after taxes.
	Current assets	Liquid or short-term assets.
	Subscribed capital	Formally subscribed share capital.
	Operating expenses	Total operating expenses of the fiscal year.
Base financial ratios	Return on assets (ROA)	Net income/Total assets.
	Return on equity (ROE)	Net income/Net equity.
	Financial leverage	Total liabilities/Net equity.
	Equity ratio	Net equity/Total assets.
	Approximated current liquidity	Current assets/Total liabilities.
Feature-engineered metrics	Liability ratio	Total liabilities/Total assets.
	Capital multiplier	Total assets/Net equity.
	Current asset ratio	Current assets/Total assets.
	Debt coverage	Net income/Total liabilities.
	Equity-to-current-assets ratio	Net equity/Current assets.
	Return on current assets	Net income/Current assets.

Source: The authors.
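
As a minimal sketch of how the ratios in Table 1 follow from the base accounting variables, the Python fragment below computes them for a single firm. The `Firm` class, the `safe_div` helper, the dictionary keys, and the sample figures are hypothetical and serve only as illustration; returning `None` for a zero denominator is an assumption and not necessarily the treatment applied in this study.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Firm:
    # Base accounting variables from Table 1 (hypothetical container).
    total_assets: float
    net_equity: float
    total_liabilities: float
    net_income: float
    current_assets: float


def safe_div(num: float, den: float) -> Optional[float]:
    """Return num / den, or None when the denominator is zero (assumed handling)."""
    return num / den if den != 0 else None


def financial_ratios(f: Firm) -> Dict[str, Optional[float]]:
    """Compute the base and feature-engineered ratios as defined in Table 1."""
    return {
        # Base financial ratios
        "roa": safe_div(f.net_income, f.total_assets),
        "roe": safe_div(f.net_income, f.net_equity),
        "leverage": safe_div(f.total_liabilities, f.net_equity),
        "equity_ratio": safe_div(f.net_equity, f.total_assets),
        "approx_current_liquidity": safe_div(f.current_assets, f.total_liabilities),
        # Feature-engineered metrics
        "liability_ratio": safe_div(f.total_liabilities, f.total_assets),
        "capital_multiplier": safe_div(f.total_assets, f.net_equity),
        "current_asset_ratio": safe_div(f.current_assets, f.total_assets),
        "debt_coverage": safe_div(f.net_income, f.total_liabilities),
        "equity_to_current_assets": safe_div(f.net_equity, f.current_assets),
        "return_on_current_assets": safe_div(f.net_income, f.current_assets),
    }


# Illustrative firm with made-up figures (not taken from the dataset).
example = Firm(total_assets=1000.0, net_equity=400.0,
               total_liabilities=600.0, net_income=50.0, current_assets=300.0)
ratios = financial_ratios(example)
```

A firm reporting zero equity or zero liabilities yields `None` for the affected ratios under this assumption, which makes such cases explicit before any missing-value treatment.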

Table 2. High Correlations (|r| > 0.7).

Metric 1	Metric 2	Correlation (r)
Financial leverage	Capital multiplier	0.9975
Return on assets (ROA)	Return on current assets	0.9968
Return on assets (ROA)	Return on equity (ROE)	0.7714
Return on equity (ROE)	Return on current assets	0.7697
Return on current assets	Debt coverage	0.7582
Liability ratio	Current asset ratio	0.7456
Return on assets (ROA)	Debt coverage	0.7006

Source: The authors.
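
The screening behind Table 2 can be illustrated with a short, self-contained sketch: compute the Pearson correlation for every pair of metrics and keep the pairs whose absolute value exceeds 0.7. The function names and the toy columns below are hypothetical, and the numbers are invented for illustration; they are not the study data.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def high_correlations(columns: Dict[str, List[float]],
                      threshold: float = 0.7) -> List[Tuple[str, str, float]]:
    """List metric pairs with |r| above the threshold, strongest first."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, r))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Toy columns (hypothetical values): "a" and "b" move together, "c" does not.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = high_correlations(toy)
```

From each flagged pair, one metric would then typically be dropped to limit redundancy among the clustering features; which member of the pair to keep is an analyst decision that the sketch does not automate.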

Table 3. Financial Features Selected for Clustering.

Financial Dimension	Metric	Description
Profitability	Return on assets (ROA)	Profitability relative to total assets
Profitability	Return on equity (ROE)	Profitability relative to shareholders’ equity
Financial Structure	Leverage	Financial leverage (liabilities-to-equity ratio)
	Equity ratio	Proportion of equity relative to total assets
	Liability ratio	Proportion of liabilities relative to total assets
Liquidity	Approximate current liquidity	Short-term payment capacity
	Current asset ratio	Proportion of liquid assets
	Equity-to-current-assets ratio	Equity strength relative to liquid assets

Source: The authors.

Table 4. Consensus of internal validation methods for determining the optimal number of clusters.

Method	Optimal k	Observation
Elbow Method (WCSS)	3	Clear inflection point; diminishing returns beyond k = 3
Silhouette Score	2–3	Highest values (≈0.98–0.99)
Davies–Bouldin Index	2–4	Lowest index values (local minima)
Calinski–Harabasz Index	3–4	Largest relative gains before saturation
Average (Consensus)	3	Best balance across all metrics

Source: The authors.
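
One simple reading of the consensus row in Table 4 is a vote count: each validation method votes for every k inside its reported optimal range, and the k with the most votes is retained. The sketch below implements that reading. The voting rule and the tie-break toward the smaller k are assumptions made for illustration, since the table does not specify how the consensus was formed.

```python
from collections import Counter
from typing import Dict, Tuple

# Optimal-k ranges per validation method, as reported in Table 4.
optimal_k_ranges: Dict[str, Tuple[int, int]] = {
    "elbow_wcss": (3, 3),
    "silhouette": (2, 3),
    "davies_bouldin": (2, 4),
    "calinski_harabasz": (3, 4),
}


def consensus_k(ranges: Dict[str, Tuple[int, int]]) -> Tuple[int, Dict[int, int]]:
    """Return the k supported by the most methods, plus the full vote tally."""
    votes: Counter = Counter()
    for low, high in ranges.values():
        for k in range(low, high + 1):
            votes[k] += 1
    top = max(votes.values())
    # Assumed tie-break: prefer the smallest k among equally supported values.
    best = min(k for k, count in votes.items() if count == top)
    return best, dict(votes)


best_k, votes = consensus_k(optimal_k_ranges)
```

Under this rule k = 3 is the only value supported by all four methods, while k = 2 and k = 4 each receive two votes.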

Table 5. Comparison between OPTICS and k-means clustering (baseline).

Metric	K-Means (k = 3)	OPTICS	Improvement
Silhouette score	~0.08	0.35	337%
Davies–Bouldin index	~2.6	1.50	42%
Calinski–Harabasz index	~185	10.93	5.81%
Composite score	~0.45	0.94	109%

Source: The authors.
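
The improvement column in Table 5 can be approximated with a direction-aware relative change: for metrics where higher is better the gain is measured upward from the k-means baseline, and for the Davies–Bouldin index, where lower is better, it is measured downward. The sketch below applies this to the Silhouette, Davies–Bouldin, and composite rows. The function name is hypothetical, the baseline values are the approximate figures printed in the table, and the Calinski–Harabasz row is left out because the convention behind its reported percentage is not stated.

```python
def relative_improvement(baseline: float, candidate: float,
                         higher_is_better: bool = True) -> float:
    """Percentage improvement of candidate over baseline, respecting metric direction."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    gain = candidate - baseline if higher_is_better else baseline - candidate
    return 100.0 * gain / abs(baseline)


# (k-means baseline, OPTICS value, higher_is_better), values as printed in Table 5.
rows = {
    "silhouette": (0.08, 0.35, True),
    "davies_bouldin": (2.6, 1.50, False),
    "composite": (0.45, 0.94, True),
}

improvements = {
    name: relative_improvement(base, cand, hib)
    for name, (base, cand, hib) in rows.items()
}

for name, pct in improvements.items():
    print(f"{name}: {pct:.1f}%")
```

Because the baseline figures are themselves rounded, the computed percentages agree with the reported ones only up to rounding.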

Table 6. Kruskal–Wallis test results for inter-cluster differences.

Feature	H-Statistic	p-Value	Significant
leverage	584.237	<0.001	✓
equity_ratio	584.237	<0.001	✓
ratio_pasivo	584.237	<0.001	✓
roa	128.945	<0.001	✓
roe	102.473	0.001	✓
liquidez_corriente_aproximada	58.234	0.016	✓
ratio_activo_corriente	21.456	0.143	✗
ratio_patrimonio_activo_corriente	18.923	0.169	✗

Source: The authors. ✓ indicates statistical significance at p < 0.05; ✗ indicates non-significant differences.

Table 7. Mann–Whitney U test results for pairwise comparisons.

Feature	Median C0	Median C1	Difference	% Change	p-Value	Effect Size
leverage	0.0000	4.2356	↑ +4.236	–	<0.001	0.89
equity_ratio	1.0000	0.1912	↓ −0.809	−80.9%	<0.001	0.89
ratio_pasivo	0.0000	0.8088	↑ +0.809	–	<0.001	0.89
roa	0.0000	−0.0021	↓ −0.002	–	<0.001	0.45
roe	0.0000	−0.0089	↓ −0.009	–	0.001	0.42
liquidez_corriente_aproximada	0.2847	0.2156	↓ −0.069	−24.3%	0.016	0.31

Source: The authors. Arrows indicate direction of change (↑ increase; ↓ decrease) relative to Cluster 0.
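
For readers who wish to reproduce a pairwise comparison of this kind, the sketch below computes the Mann–Whitney U statistic with average ranks for ties, together with a rank-biserial correlation as the effect size. Table 7 does not state which effect-size measure was used, so the rank-biserial choice is an assumption; the function names and the sample values are likewise hypothetical. A p-value would normally be obtained from a statistics library such as `scipy.stats.mannwhitneyu`, which is omitted here to keep the example dependency-free.

```python
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U_x, U_y); U_x counts pairs with x > y, ties counted as one half."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[: len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


def rank_biserial(x: Sequence[float], y: Sequence[float]) -> float:
    """Effect size in [-1, 1]; negative when x tends to be smaller than y."""
    u_x, u_y = mann_whitney_u(x, y)
    return (u_x - u_y) / (len(x) * len(y))


# Hypothetical leverage values for two clusters (not the study data).
cluster_0 = [0.0, 0.0, 0.0, 0.0]
cluster_1 = [3.1, 4.2, 5.0]
u0, u1 = mann_whitney_u(cluster_0, cluster_1)
effect = rank_biserial(cluster_0, cluster_1)
```

In this toy case every Cluster 1 value exceeds every Cluster 0 value, so the effect size reaches its extreme of −1; partially overlapping samples give values closer to zero.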

Table 8. Median financial indicators for regular clusters and outliers.

Metric	Outliers Median	Regular Median	% Difference
roa	0.0000	−0.0021	100%
roe	0.0000	−0.0089	100%
leverage	1.8234	4.2356	−57%
equity_ratio	0.3542	0.1912	85%
liquidez_corriente	0.2589	0.2156	20%

Source: The authors.

Table 9. Quantitative comparison of dimensionality reduction techniques for two-dimensional visualization.

Method	Trustworthiness	Continuity	Spearman Correlation	Composite Score
PCA	0.65	0.817	0.89	0.786
t-SNE	0.83	0.895	0.778	0.834
UMAP	0.785	0.834	0.678	0.766

Source: The authors.
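
The composite scores in Table 9 are numerically consistent with an unweighted mean of the three quality measures. The sketch below checks this using the tabulated values. Equal weighting is an inference from the reported numbers rather than a stated rule, and the variable names are hypothetical.

```python
from statistics import mean

# (trustworthiness, continuity, Spearman correlation) as reported in Table 9.
quality = {
    "PCA": (0.65, 0.817, 0.89),
    "t-SNE": (0.83, 0.895, 0.778),
    "UMAP": (0.785, 0.834, 0.678),
}

# Assumed aggregation: simple average of the three measures, rounded to 3 decimals.
composite = {method: round(mean(scores), 3) for method, scores in quality.items()}

# Method with the highest composite score under this assumption.
best_method = max(composite, key=composite.get)

for method, score in composite.items():
    print(f"{method}: {score:.3f}")
```

Under equal weighting t-SNE ranks first, in line with its selection as the primary visualization technique; a different weighting of local preservation (trustworthiness, continuity) against global preservation (Spearman correlation) could change the ordering.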

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
