1. Introduction
With advances in modern technology, data are increasingly collected at discrete time points or over continuous intervals, leading to the growing prominence of functional data in contemporary statistical analysis [1]. Functional data analysis (FDA) offers a range of statistical methods tailored to the unique features of this data type, with functional regression becoming a widely adopted approach for modeling relationships between responses and predictors. Extensive research in this field can be broadly classified into three main categories, depending on whether the responses and predictors are functional or scalar. The first category considers both responses and predictors as functional data, with notable contributions from [2,3]. The second category focuses on functional responses with scalar predictors, as studied by [4]. The third category addresses scalar responses with functional predictors, exemplified by the work of [5,6]. These classifications highlight the diverse applications of functional regression and underscore the continued growth of this research area.
As a specialized type of functional data, distributional or density function data have become increasingly prevalent across a variety of research fields. These data arise in contexts such as cross-sectional or intraday stock returns [7], mortality densities [8], and intra-hub connectivity distributions in neuroimaging [9]. Unlike conventional functional data, which typically involve time-ordered or sequential measurements, distributional data capture the entire underlying structure without reliance on sample ordering. This characteristic allows them to effectively reveal complex relationships that might be overlooked by standard functional data analysis methods. In recent years, with the growing use of distributional data, there has been rising interest in developing regression models where random distributions serve as either responses or predictors. These models provide a more nuanced understanding of variable relationships when scalar representations are insufficient. Specifically, this article focuses on a function-on-scalar regression framework, where density functions act as responses and scalar variables as predictors. This flexible and robust approach offers a powerful tool for tackling real-world problems involving complex distributional data and holds promise for uncovering deeper insights into the mechanisms underlying observed phenomena.
Density functions, when viewed as elements of a Hilbert space, do not form a linear subspace due to their inherent constraints of nonnegativity and unit integral. These constraints pose challenges for the direct application of traditional linear methods to density functions. To overcome these difficulties, several approaches have been developed. A notable strategy is to adopt a geometric viewpoint by choosing an appropriate metric. For example, Ref. [10] utilized an infinite-dimensional extension of Aitchison geometry to construct a density-on-scalar linear regression model within Bayes–Hilbert spaces. This framework respects the intrinsic structure of density functions while maintaining their essential properties. Similarly, Ref. [11] proposed a distribution-on-distribution regression approach based on the Wasserstein metric and the tangent bundle of the Wasserstein space. This method offers a powerful framework for modeling relationships between probability distributions and provides a more meaningful measure of distances between distributions from a probabilistic perspective.
An alternative approach to addressing the constraints of density functions is to map them into a Hilbert space via transformation methods. For instance, Ref. [12] proposed a continuous and invertible transformation, such as the log-quantile-density (LQD) transformation, that maps probability densities into an unconstrained space of square-integrable functions. This transformation effectively removes the restrictions imposed by nonnegativity and normalization, allowing density functions to be analyzed as elements of a standard Hilbert space. Building on this idea, Ref. [13] developed an additive regression model with densities as responses, enabling the integration of density functions into regression frameworks, which can be expressed as

$$\Psi(f_i)(u) = \beta_0(u) + \sum_{l=1}^{p} g_l(u, X_{il}) + \varepsilon_i(u), \quad i = 1, \dots, n. \quad (1)$$

Here, $f_i$ denotes the density for the i-th unit, and $\Psi$ represents the LQD transformation. In this model, the function $\beta_0$ captures the baseline effect, while $g_l(\cdot, X_{il})$ represents the additive effect of the l-th covariate $X_{il}$. The term $\varepsilon_i$ is the error function associated with the i-th unit, accounting for random variation in the model. By adopting this framework, the model enables a deeper understanding of the underlying structure and relationships within the data, providing a powerful tool for statistical inference in applications involving density functions.
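To make the transformation concrete, the following sketch (not from the paper; the grid, function names, and uniform-density check are illustrative) computes the LQD transform, psi(u) = -log f(Q(u)), of a density sampled on a grid, where Q is the quantile function of f:

```python
import numpy as np

def lqd_transform(f, grid, n_out=101):
    """Log-quantile-density transform of a density f sampled on `grid`.

    Returns (u, psi) with psi(u) = -log f(Q(u)) on an equispaced interior
    grid of u in (0, 1), where Q is the quantile function of f.
    """
    # CDF by trapezoidal integration, normalized to end exactly at 1
    F = np.concatenate([[0.0], np.cumsum(np.diff(grid) * 0.5 * (f[1:] + f[:-1]))])
    F /= F[-1]
    u = np.linspace(0.01, 0.99, n_out)   # stay away from the boundary
    Q = np.interp(u, F, grid)            # quantile function Q(u)
    f_at_Q = np.interp(Q, grid, f)       # density evaluated at Q(u)
    return u, -np.log(f_at_Q)

# Sanity check: for the uniform density on [0, 1], f = 1 everywhere,
# so the LQD transform is identically zero.
grid = np.linspace(0, 1, 201)
u, psi = lqd_transform(np.ones_like(grid), grid)
```

Because the transform only needs the density on a grid, the same routine applies equally to kernel-smoothed density estimates.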
Model (1) is well-suited for homogeneous data, where observations exhibit uniform characteristics. However, numerous empirical studies have shown that real-world data often display both intra-class homogeneity and inter-class heterogeneity. In such cases, treating the data as entirely homogeneous may overlook important group differences, potentially leading to inaccurate or inefficient statistical inferences. Intra-class homogeneity implies that observations within the same group share similar patterns, while inter-class heterogeneity acknowledges that different groups may behave distinctly. Ignoring either aspect can result in suboptimal modeling. To address this challenge, latent group-structured regression models can be employed. These models explicitly accommodate both intra-class homogeneity and inter-class heterogeneity, enabling more accurate estimation and improved prediction. By incorporating latent group structures, the model differentiates between groups while preserving shared characteristics within each group, thereby enhancing efficiency and the reliability of statistical conclusions. Latent group-structured regression thus offers a powerful framework for analyzing heterogeneous data and provides valuable insights into complex processes across various applications.
This phenomenon is particularly evident when analyzing COVID-19 data alongside various influencing factors. To evaluate the progression of the epidemic in individual countries relative to the global context, we consider the relative daily mortality rate over a 240-day period for each country as the response variable. The daily mortality rate is defined as the number of deaths per day normalized by the country's total population, and the relative rate is calculated as the ratio of each country's mortality rate to the global total. This relative daily mortality rate is treated as a density function, representing the distribution of mortality over time. To align the time scale across countries, we set the first day with at least 30 reported deaths as time zero for each country. The density of the relative daily mortality rate following this benchmark reflects how each country's epidemic trajectory contributes to the global situation. This study includes data from 149 countries, with the corresponding densities of relative daily mortality rates illustrated in Figure 1. A detailed examination reveals both homogeneity and heterogeneity in the shapes of these density functions. While some countries display similar patterns, suggesting homogeneity, others exhibit distinct trends, highlighting the heterogeneous impact of the pandemic across regions. This coexistence of shared and divergent characteristics underscores the complex nature of the COVID-19 crisis, where common responses and varying regional effects both play critical roles in shaping the global health emergency.
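The construction of the response can be sketched numerically. In this toy example (the counts, populations, and day-zero alignment are made up; only the normalization steps follow the text), the relative daily mortality rate is turned into a density over the 240-day window:

```python
import numpy as np

def relative_mortality_density(deaths, population, world_deaths, world_pop,
                               n_days=240):
    """Relative daily mortality rate as a density over the aligned window.

    `deaths` / `world_deaths` are daily death counts after each series'
    day zero (first day with at least 30 deaths), truncated to n_days.
    """
    rate = deaths[:n_days] / population             # country mortality rate
    world_rate = world_deaths[:n_days] / world_pop  # global mortality rate
    rel = rate / world_rate                         # relative daily rate
    return rel / rel.sum()                          # normalize to unit mass

# Toy counts standing in for one country and the global series
rng = np.random.default_rng(0)
d = rng.poisson(50, 240).astype(float)
w = rng.poisson(5000, 240).astype(float)
dens = relative_mortality_density(d, 1e7, w, 7.8e9)
```

The resulting vector is nonnegative and sums to one over the day grid, so it can be treated as a (discretized) density response.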
For further analysis, six predictors are considered to explain the variation in relative daily mortality rates: ‘aging’ (the percentage of the population aged 65 and over), ‘beds’ (the number of hospital beds per 1000 people), ‘physicians’ (the number of physicians per 1000 people), ‘nurses’ (the number of nurses per 1000 people), ‘GDP’ (gross domestic product per capita in US dollars), and ‘diabetes’ (the percentage of the population with diabetes). Given the global nature of the epidemic, it is reasonable to assume that the effects of these predictors on relative daily mortality rates are broadly consistent across countries. However, other unobserved factors, such as national epidemic prevention strategies and cultural practices, may contribute to both inter-group heterogeneity and intra-group homogeneity. For example, China's public health policies and social behaviors differ markedly from those in countries like the United Kingdom and the United States. These differences may manifest as heterogeneity in the intercept functions. As shown in Section 3.2.1, the United Kingdom and the United States are classified into the same group, while China falls into a different group. This observation highlights the necessity of adopting an additive model that incorporates subject-specific intercept functions and latent group structures, thereby capturing intra-group homogeneity and inter-group heterogeneity, i.e.,
$$\Psi(f_i)(u) = \beta_i(u) + \sum_{l=1}^{6} g_l(u, X_{il}) + \varepsilon_i(u), \quad i = 1, \dots, n.$$

Here, $f_i$ denotes the density of the relative daily mortality rate for country i, and $X_i = (X_{i1}, \dots, X_{i6})^\top$ represents the vector of covariates. The function $\Psi$ denotes the LQD transformation. Specifically, the intercept function is modeled as $\beta_i = \alpha_k$ if country i belongs to group k, where $\alpha_k$ is one of K group-specific functions.
A substantial body of literature has introduced various methods for identifying latent group structures in data situated within Euclidean spaces. For instance, Ref. [14] proposed a distance-based clustering algorithm applied to kernel estimates of nonparametric regression functions. Building on this, Ref. [15] developed an extension using a multiscale statistic, thereby avoiding the need to select a specific bandwidth. Ref. [16] introduced the classifier-Lasso (C-Lasso), a shrinkage method designed for linear panel data models with latent group structures. This approach was further extended by [17], who proposed a penalized sieve estimation-based C-Lasso method tailored for heterogeneous, time-varying panel data. Additionally, Ref. [18] presented a kernel-based hierarchical agglomerative clustering (HAC) algorithm that imposes fewer restrictive assumptions than earlier approaches, making it more flexible for complex data structures. Collectively, these contributions offer significant methodological advances for analyzing functional and panel data, providing robust tools for uncovering latent group structures in heterogeneous datasets.
In this study, we employ the hierarchical agglomerative clustering (HAC) method to identify latent group structures within the data. Specifically, we first apply HAC to the estimated individual intercept functions, enabling the classification of the density functions into four distinct groups, each reflecting a different epidemic pattern. The resulting clusters are presented in Figure 2, providing strong empirical support for the use of a functional additive model with latent group structures in the intercept function, underscoring its necessity for effectively capturing the heterogeneous dynamics of COVID-19 data.
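The clustering step can be sketched as follows. This is a generic illustration, not the paper's implementation: the simulated curves, the Ward linkage, and the discretized L2 distance are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical input: row i of beta_hat is the estimated intercept function
# of subject i evaluated on a common grid (n subjects x m grid points).
rng = np.random.default_rng(1)
grid = np.linspace(0, 1, 50)
group_means = [np.sin(2 * np.pi * grid), np.cos(2 * np.pi * grid)]
beta_hat = np.vstack([group_means[i % 2] + 0.05 * rng.standard_normal(50)
                      for i in range(20)])

# Pairwise L2 distances between curves, approximated on the grid,
# followed by agglomerative clustering with Ward linkage.
D = pdist(beta_hat) / np.sqrt(len(grid))   # discretized L2 metric
Z = linkage(D, method='ward')
labels = fcluster(Z, t=2, criterion='maxclust')
```

In practice the number of clusters would be chosen by an information criterion (as in the paper) rather than fixed in advance as here.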
To capture both the intra-group homogeneity and inter-group heterogeneity present in the data, we extend the additive functional regression model for density responses originally proposed by [13], allowing it to accommodate heterogeneity in the density functions:

$$\Psi(f_i)(u) = \beta_i(u) + \sum_{l=1}^{p} g_l(u, X_{il}) + \varepsilon_i(u), \quad i = 1, \dots, n, \quad (2)$$

and

$$\beta_i = \alpha_k \quad \text{if } i \in \mathcal{G}_k, \quad k = 1, \dots, K. \quad (3)$$

Here, $\{\mathcal{G}_1, \dots, \mathcal{G}_K\}$ denotes a partition of the index set $\{1, \dots, n\}$, such that $\bigcup_{k=1}^{K} \mathcal{G}_k = \{1, \dots, n\}$ and $\mathcal{G}_k \cap \mathcal{G}_{k'} = \emptyset$ for any $k \neq k'$. Moreover, we assume that $\alpha_k \neq \alpha_{k'}$ for all $k \neq k'$, indicating that the group-specific intercept functions are distinct across groups. The number of groups K, as well as the group membership of each individual, is assumed to be unknown and must be inferred from the data.

In the proposed model, $f_1, \dots, f_n$ represent random density functions, each associated with a p-dimensional covariate vector $X_i = (X_{i1}, \dots, X_{ip})^\top$, all defined on a common support $\mathcal{D}$. Without loss of generality, we assume $\mathcal{D} = [0, 1]$. Let $\Psi$ denote the LQD transformation, such that $Y_i = \Psi(f_i)$. The function $\beta_i$ captures the subject-specific intercept, while $g_1, \dots, g_p$ are the bivariate additive components. For identification purposes, we impose the constraint $E[g_l(u, X_{il})] = 0$ for all $u \in [0, 1]$ and $l = 1, \dots, p$. The error processes $\varepsilon_i$ are assumed independent with zero conditional mean and covariance function $\gamma(u, v)$.
Clearly, the proposed model (2) naturally extends the functional additive model framework. In particular, when the subject-specific intercept functions are homogeneous, that is, when $\beta_1 = \cdots = \beta_n = \beta_0$, model (2) simplifies to the additive functional regression model (1) for density responses introduced by [13].
While the LQD transformation introduced by [12] effectively facilitates the representation of density functions in a linear space, and the subsequent additive model proposed by [13] enables regression modeling with density-valued responses, both approaches assume that the data are homogeneous across observations. This assumption may be overly restrictive in real-world applications where population-level heterogeneity is prevalent. The key innovation of our work lies in the integration of the LQD transformation with latent group structure learning within the additive modeling framework. By simultaneously estimating subject-specific density functions and uncovering latent group memberships, our method captures both within-group similarity and between-group variation. This joint modeling approach not only enhances interpretability and predictive power but also expands the applicability of density regression methods to heterogeneous settings, marking a substantive advancement over the existing literature.
In practical applications, only random samples drawn from the underlying densities are typically observed. To handle this, we begin by estimating each density using the modified kernel density estimation method proposed by [12], combined with the LQD transformation. Next, we employ the hierarchical agglomerative clustering (HAC) method, which requires estimates of the subject-specific functions, to identify and estimate the latent group structure. To accomplish this, we introduce a three-step estimation procedure that leverages the advantages of both spline smoothing and local polynomial smoothing techniques. In the first step, for computational efficiency, we use a B-spline series approximation to estimate the subject-specific functions $\beta_i$ and the additive components $g_l$. Based on these initial estimates of $\beta_i$, the second step applies the HAC method to determine the group membership of each subject. While spline smoothing is computationally efficient, it poses challenges when establishing asymptotic properties. Therefore, in the third step, we use backfitted local linear regression to improve the estimation efficiency of the group-specific functions $\alpha_k$ and the additive components $g_l$. We further establish several theoretical results, including the uniform convergence rates of the estimators, consistency of both the estimated number of groups and their memberships, asymptotic normality of the group-specific functions, and properties of the post-clustering additive component estimators. These results provide a rigorous theoretical basis for the proposed approach.
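The first (spline) step can be sketched for a single subject. The cubic order, knot count, and target curve below are illustrative assumptions, not the paper's tuning choices; the point is simply a least-squares fit onto a clamped B-spline basis:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(u, n_interior, order=4):
    """Evaluate a clamped B-spline basis with equispaced interior knots on [0, 1]."""
    interior = np.linspace(0, 1, n_interior + 2)[1:-1]
    t = np.concatenate([np.zeros(order), interior, np.ones(order)])
    n_basis = n_interior + order
    # Identity coefficient matrix makes BSpline return the full design matrix
    return BSpline(t, np.eye(n_basis), order - 1)(u)

# Least-squares spline approximation of one subject's transformed curve
u = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * u)          # stand-in for Psi(f_i) net of covariate effects
B = bspline_basis(u, n_interior=10)
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
fitted = B @ coef
```

The same basis expansion, applied jointly over subjects and covariates, underlies the series approximation of the intercepts and additive components in step one.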
The remainder of this paper is organized as follows.
Section 2 describes the materials and methods, including preliminary work, the identification and estimation method, and theoretical results. Specifically, Section 2.1 introduces the modified kernel estimation method along with the LQD transformation for density functions, which serve as the foundational steps. Section 2.2 outlines the procedure for identifying and estimating the latent group structures and the additive components of the model. The theoretical results are presented in Section 2.3. In Section 3.1, Monte Carlo simulations are conducted to evaluate the performance of the proposed method. Section 3.2 demonstrates the application of our approach to COVID-19 and GDP data analysis. Finally, Section 4 offers a summary of the findings and Section 5 discusses potential directions for future research. Detailed proofs of the theoretical results and additional numerical results are provided in the supplementary materials.
3. Results
3.1. Numerical Study
In this section, we conduct a simulation study to evaluate the performance of the proposed estimation procedure under the model specified in (2). We consider a setting with K = 3 latent groups and p = 2 covariates. The regression model (2) with the latent group structure given in (3) is

$$\Psi(f_i)(u) = \beta_i(u) + g_1(u, X_{i1}) + g_2(u, X_{i2}) + \varepsilon_i(u), \quad i = 1, \dots, n,$$

where the conditional mean function is given by $m_i(u) = \beta_i(u) + g_1(u, X_{i1}) + g_2(u, X_{i2})$, and the group-specific baseline functions $\alpha_k$, $k = 1, 2, 3$, are specified together with the additive component functions $g_l$, $l = 1, 2$. The group memberships $\mathcal{G}_1$, $\mathcal{G}_2$, and $\mathcal{G}_3$ are fixed in advance, with group sizes $n_1$, $n_2$, and $n_3$.
The covariates are generated via the transformation $X_{il} = \Phi(Z_{il})$, $l = 1, 2$, where $\Phi$ is the cumulative distribution function of the standard normal distribution, and $Z_i = (Z_{i1}, Z_{i2})^\top$ are bivariate normal random vectors with mean zero and covariance matrix $\Sigma$. The random error term $\varepsilon_i(u)$ is constructed from independent mean-zero random components.
The conditional mean functions $m_i$ correspond to the LQD transformations of the conditional densities $f_i$. More specifically, the inverse of the LQD transformation yields the quantile function $Q_i(u) = \int_0^u \exp\{m_i(s)\}\, ds \big/ \int_0^1 \exp\{m_i(s)\}\, ds$, so that the conditional distribution function $F_i$ and quantile function $Q_i$ satisfy $F_i(Q_i(u)) = u$ for $u \in [0, 1]$.

To generate the response observations, for each i, let $U_{i1}, \dots, U_{iT}$ be independent uniform random variables on $[0, 1]$, independent of $X_i$. The observed responses at the T time points are then given by $W_{it} = Q_i(U_{it})$, such that $W_{it} \sim f_i$, where $f_i$ are the random response densities. Without loss of generality, we assume that the common support is $[0, 1]$. We consider several combinations of the sample size n and the number of observations T. Each scenario is replicated 200 times. For the initial estimation step, the spline basis functions are of a fixed order, with a suitable number of interior knots.
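The data-generating steps above can be sketched as follows. The correlation value, sample sizes, and the Beta(2, 2) quantile function standing in for $Q_i$ are illustrative assumptions (the paper's exact specifications were not reproduced here); the structure mirrors the text: Gaussian copula covariates and inverse-transform sampling of the responses.

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(42)
n, T = 100, 50
rho = 0.5                                  # assumed off-diagonal of Sigma
Sigma = np.array([[1.0, rho], [rho, 1.0]])

# Covariates: X_il = Phi(Z_il), so each X_il is marginally Uniform(0, 1)
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
X = norm.cdf(Z)

# Responses by inverse-transform sampling: W_it = Q_i(U_it), U_it ~ U(0, 1).
# A Beta(2, 2) quantile function stands in for every subject's Q_i.
U = rng.uniform(size=(n, T))
W = beta.ppf(U, 2, 2)
```

Each row of `W` is then a sample of size T from that subject's density, from which the modified kernel density estimate and its LQD transform are computed.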
Figure 3 and Figure 4 present the average performance of the pre- and post-clustering estimators for the group-specific baseline functions and the bivariate additive components, respectively, under a representative combination of n and T. In each figure, the true functions, as well as the pre- and post-clustering estimates, are shown sequentially from left to right, allowing for visual comparison of the estimation accuracy before and after clustering.
As shown in Figure 3, although the pre-clustering estimates roughly capture the overall shapes of the true density functions, notable deviations remain. In particular, the estimated curves exhibit discrepancies in capturing extreme values, despite aligning reasonably well with the true locations of minima and maxima. In contrast, the post-clustering estimates offer a more accurate approximation of the true density functions. These estimates not only reflect the general shape of each curve but also closely match the true values at critical points, including extrema and turning points. A similar pattern is observed in Figure 4, which presents the estimation results for the bivariate additive components. These findings collectively highlight the effectiveness of the proposed identification and estimation procedure.
Let $C = \{C_1, \dots, C_K\}$ denote the set of true groups and $\widehat{C} = \{\widehat{C}_1, \dots, \widehat{C}_{\widehat{K}}\}$ denote the estimated clustering. To evaluate the performance of the clustering algorithm, we consider two widely used evaluation metrics. The first is purity, a standard measure in clustering analysis, defined as

$$\mathrm{Purity}(\widehat{C}, C) = \frac{1}{n} \sum_{k=1}^{\widehat{K}} \max_{1 \le j \le K} |\widehat{C}_k \cap C_j|.$$

The second metric is the normalized mutual information (NMI), which quantifies the similarity between clusterings [24]. Here, we define the NMI between the estimated clusters $\widehat{C}$ and true clusters $C$ as

$$\mathrm{NMI}(\widehat{C}, C) = \frac{I(\widehat{C}; C)}{\sqrt{H(\widehat{C})\, H(C)}},$$

where the mutual information between $\widehat{C}$ and $C$ is given by

$$I(\widehat{C}; C) = \sum_{k} \sum_{j} \frac{|\widehat{C}_k \cap C_j|}{n} \log \frac{n\, |\widehat{C}_k \cap C_j|}{|\widehat{C}_k|\, |C_j|},$$

and the entropy of the estimated clustering is $H(\widehat{C}) = -\sum_{k} \frac{|\widehat{C}_k|}{n} \log \frac{|\widehat{C}_k|}{n}$, with $H(C)$ defined similarly for the true clustering. Both purity and NMI are invariant to permutations of cluster labels, making them suitable for evaluating the quality of clustering results. Values closer to 1 indicate better alignment between the estimated and true clusters, reflecting higher clustering accuracy.
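Both metrics are straightforward to implement; the following sketch uses the square-root normalization for NMI (normalization conventions vary in the literature, so the paper's exact variant may differ):

```python
import numpy as np
from collections import Counter

def purity(est, true):
    """Purity: each estimated cluster is credited with its majority true label."""
    n = len(true)
    hit = 0
    for c in set(est):
        members = [true[i] for i in range(n) if est[i] == c]
        hit += Counter(members).most_common(1)[0][1]
    return hit / n

def entropy(labels):
    n = len(labels)
    return -sum((m / n) * np.log(m / n) for m in Counter(labels).values())

def nmi(est, true):
    """Normalized mutual information I(est; true) / sqrt(H(est) H(true))."""
    n = len(true)
    I = 0.0
    for c in set(est):
        n_c = sum(1 for e in est if e == c)
        for k in set(true):
            n_k = sum(1 for t in true if t == k)
            n_ck = sum(1 for i in range(n) if est[i] == c and true[i] == k)
            if n_ck:
                I += (n_ck / n) * np.log(n * n_ck / (n_c * n_k))
    return I / np.sqrt(entropy(est) * entropy(true))

# A perfect clustering, up to label permutation, scores 1 on both metrics
est, true = [1, 1, 2, 2, 3, 3], [0, 0, 1, 1, 2, 2]
```

The label-permutation invariance is visible here: the estimated labels (1, 2, 3) never match the true labels (0, 1, 2) literally, yet both metrics equal one.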
To evaluate the effectiveness of the proposed procedure, we compare the performance of three estimators. The first is the pre-clustering estimator, computed without accounting for any group structure. The second is the oracle estimator, which assumes knowledge of the true number of groups K and the true memberships. The third is the post-clustering estimator, constructed from the estimated group memberships obtained from the data. To assess the performance of these estimators, we use the root mean squared error (RMSE). Specifically, for the pre-clustering estimator, the RMSE of the intercept estimates is the square root of the average squared deviation of the estimated functions from the truth over subjects and grid points, with an analogous definition for the additive components. The RMSEs for the oracle and post-clustering estimators are defined in the same way.
Table 1, Table 2 and Table 3 summarize the results obtained under various settings, including the estimated values of K, as well as the averages and standard deviations of NMI, purity, and RMSE for the estimators. First, with respect to the clustering algorithm's performance, the results under the GAIC and GBIC criteria were largely consistent. Under both criteria, the accuracy of the estimated number of groups was consistently high. Moreover, the NMI and purity values improved as the sample size n and the number of observations T increased. In particular, for the larger values of n and T considered, the true number of clusters K was correctly identified in 100% of the runs, with NMI and purity values approaching 1. Second, regarding the performance of the estimators, as shown in Table 2 and Table 3, the RMSEs of all estimators decreased as both the sample size n and the number of observations T increased. Across all scenarios, the oracle and post-clustering estimators consistently outperformed the pre-clustering estimators, and the difference in performance between them diminished with increasing n and T. Notably, at the largest n and T, the performances of the oracle and post-clustering estimators were nearly indistinguishable. Furthermore, the performance of the post-clustering estimators under the GAIC and GBIC criteria was almost identical.
3.2. Real Data Analysis
In this section, we demonstrate the proposed methodology through two case studies in the social sciences.
3.2.1. COVID-19 Data
As outlined in Section 1, our primary goal is to explore the relationships between epidemic trends across different countries and various socioeconomic factors. To this end, we compiled a comprehensive COVID-19 dataset containing the daily number of deaths from 22 January 2020 to 15 December 2020 for 190 countries and regions. This dataset is publicly accessible via the Coronavirus Resource Center at Johns Hopkins University, accessed on 15 January 2021 (https://coronavirus.jhu.edu/). Considering the different starting points of the pandemic across countries, we standardized the observation period to 240 days, beginning from the earliest date on which any country reported at least 30 deaths. The relative daily mortality rate was selected as the response variable for our analysis.
To ensure the validity of our analysis under the assumptions of the proposed methodology, we restricted the sample to countries with covariate values lying within a compact support, defined by the empirical minimum and maximum observed values for each predictor. Countries with socioeconomic covariate values outside these ranges were excluded to avoid extrapolation beyond the data support, which could lead to unreliable model estimates. This step resulted in the exclusion of 41 countries from the original 190, leaving a final sample of 149 countries. The exclusion criteria aimed to mitigate the influence of outliers or extreme covariate values that may distort the model fitting and inference. While this filtering may introduce some bias by omitting countries with unique socioeconomic characteristics, it is necessary to ensure comparability and stable estimation across units.
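A filtering step of this kind can be sketched generically. The trimmed quantile range below is a hypothetical stand-in for the paper's compact-support criterion, and the lognormal covariates are simulated, not the World Bank data:

```python
import numpy as np

# Hypothetical screening: keep countries whose six covariates all lie inside
# a trimmed empirical range (here the 2.5%-97.5% quantiles of each predictor).
rng = np.random.default_rng(3)
X = rng.lognormal(size=(190, 6))            # stand-in for the six predictors
lo, hi = np.quantile(X, [0.025, 0.975], axis=0)
keep = ((X >= lo) & (X <= hi)).all(axis=1)  # True where all predictors are in range
X_kept = X[keep]
```

Requiring all predictors to fall in range simultaneously is what excludes a nontrivial share of units even with mild per-predictor trimming.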
For the six socioeconomic predictors introduced in Section 1, we obtained the most recent data available from 2019, sourced from the World Bank, accessed on 27 March 2021 (https://data.worldbank.org/indicator). These predictors serve as key explanatory variables, enabling a detailed investigation of the factors shaping the progression of the COVID-19 pandemic in different national contexts.
Since the raw data consist of relative mortality rates aggregated over daily intervals, we first applied smoothing techniques to construct the functional density responses $f_i$, $i = 1, \dots, 149$, depicted over time (see Figure 1). The marked differences in the shapes of these density functions motivated the incorporation of a latent group fixed effect in the functional additive model (2).
We applied the HAC algorithm to classify the spline estimates of $\beta_i$, with the number of clusters selected based on an information criterion. Supplementary Figure S1 displays the GAIC and GBIC values for different cluster counts. As shown in the figure, the optimal number of clusters was identified as four. The group memberships are detailed in Supplementary Table S1.
Following the clustering step, we used a backfitted local linear regression method to refine the estimates of the group-specific functions $\alpha_k$ and the additive components $g_l$. The estimated latent group structures are presented in Figure 5, while the corresponding density functions are displayed in Figure 2. Notably, Group 4 exhibited a trend distinct from that of the other groups. In this cluster, the relative daily mortality rate increased over time, resulting in higher mortality rates compared to the global average during the study period. In contrast, Group 3 exhibited a high initial daily mortality rate, which declined sharply over time, suggesting that the epidemic was relatively well controlled in these countries. Group 1 also experienced a decline in mortality rates over time, although less pronounced than that in Group 3. Finally, Group 2 displayed relatively mild fluctuations in the relative daily mortality rate compared to the other groups. These findings highlight the heterogeneous progression of the epidemic across countries, as reflected in the differing trajectories of relative mortality rates within each latent group.
To quantify the contribution of each of the selected socioeconomic variables and the individual intercept function, we employed an empirical version of the fraction of variance explained (FVE) criterion [13]. Specifically, the empirical FVE of the l-th covariate is defined as the ratio of the variation explained by that covariate to the total variation of the transformed responses, with both variation terms computed empirically from the fitted model components and the overall mean function.
The model selection was conducted using a backward elimination procedure, sequentially removing the predictor with the smallest FVE among those included at each step. The process was terminated when the mean squared error (MSE), computed as the average squared distance between the observed and fitted densities, increased after removing a predictor. Here, $\widehat{f}_i^{(d)}$ denotes the fitted density for observation i at the d-th step, and the initial fit corresponds to the model containing all predictors.
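The backward-elimination loop can be sketched with an ordinary least-squares fit standing in for the density-valued model; the simulated design, the in-sample MSE, and the 5% stopping tolerance are illustrative assumptions rather than the paper's exact rule:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 150, 4
X = rng.standard_normal((n, p))
# Only the first two predictors matter in this toy design
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.standard_normal(n)

def mse_with(cols):
    """Least-squares fit on the listed columns; returns the in-sample MSE."""
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.mean((y - Xs @ coef) ** 2)

active = list(range(p))
while len(active) > 1:
    # Drop the predictor whose removal increases the MSE the least
    # (standing in for the smallest-FVE predictor in the text's procedure)
    cand = min(active, key=lambda j: mse_with([k for k in active if k != j]))
    if mse_with([k for k in active if k != cand]) > mse_with(active) * 1.05:
        break          # stop once a removal clearly degrades the fit
    active.remove(cand)
```

The two irrelevant predictors are eliminated first, and the loop stops once removing either informative predictor would inflate the error.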
To further validate the variable selection obtained from the backward elimination procedure, we conducted significance testing for the retained predictors by calculating their p-values using a bootstrap approach. The variables ‘aging’, ‘physicians’, and ‘GDP’ demonstrated strong statistical significance, with respective p-values of 0.011, 0.028, and 0.009. These results indicate that these predictors have meaningful effects on the response variable and justify their inclusion in the final model. In contrast, the excluded variables ‘beds’, ‘nurses’, and ‘diabetes’ exhibited p-values greater than 0.1, consistent with their removal due to lack of a significant contribution. Overall, the p-value analysis confirms the robustness of the backward elimination process and highlights the key role of included predictors in explaining the observed variation.
Figure 6 illustrates the effects of the three predictors, ‘aging’, ‘physicians’, and ‘GDP’, through heat maps, with corresponding FVE values of 45.52%, 59.42%, and 73.98%, respectively. The heat map for ‘physicians’ reveals that the influence of this predictor on the relative daily mortality rate varies over time. Specifically, countries with a high number of physicians per 1000 people exhibit an initial peak followed by a decline to a minimum, whereas countries with fewer physicians display the opposite trend. Similar or contrasting patterns can be observed in the heat maps for ‘aging’ and ‘GDP.’ These findings indicate that the effects of these socioeconomic factors on COVID-19 mortality dynamics are not constant, but rather evolve over time, with their impact differing according to the country’s specific characteristics.
To evaluate the overall performance of the proposed methodology, we computed the RMSEs of the pre- and post-clustering estimators for the fitted densities, defined as the square root of the average squared distance between the observed and fitted densities, with the pre- and post-clustering fits inserted in turn. For comparison, we also calculated the RMSE for the homogeneous additive model (1) proposed by [13]. The RMSEs for the pre- and post-clustering estimates were 0.6972 and 0.3751, respectively, while the homogeneous additive model yielded an RMSE of 0.8433. These results underscore the importance of accounting for heterogeneity in relative daily mortality rates across countries. The clustering-based approach substantially improves the model's effectiveness in analyzing COVID-19 data. Additionally,
Supplementary Figure S2 displays the observed and fitted density curves for three representative countries from each group. Overall, the estimated density functions closely matched the observed density curves, demonstrating the robustness of the model and its ability to accurately capture the underlying dynamics of the epidemic across diverse national contexts.
Building on the analysis, our study reveals significant heterogeneity in COVID-19 epidemic trajectories across countries, while quantitatively assessing the dynamic influence of key socioeconomic factors. By leveraging the LQD-based functional additive model combined with clustering, we identified four latent groups reflecting distinct patterns of relative daily mortality rates. The fraction of variance explained (FVE) analysis highlighted aging population, number of physicians, and GDP as the most influential predictors, whose effects vary meaningfully over time. This temporal variation underscores that the impact of these factors is not static but closely tied to country-specific healthcare capacity and economic conditions, offering valuable insights for designing timely and targeted public health interventions.
Despite demonstrating robust performance in capturing complex epidemic heterogeneity and time-varying covariate effects, the proposed methodology has some limitations. The approach relies heavily on accurate functional estimation and smoothing, which may be sensitive to data noise and missingness, especially in countries with incomplete reporting. Moreover, the model currently includes a limited set of socioeconomic variables; future work could enhance explanatory power by incorporating additional factors such as policy responses or population mobility. While the clustering procedure improves model fit substantially, the choice of the number of groups depends on information criteria, introducing some subjectivity. Overall, this study provides a novel framework for understanding epidemic dynamics through functional data analysis, but further efforts are needed to improve data quality and extend model complexity for broader applicability.
3.2.2. GDP Data
GDP per capita is widely recognized as a fundamental indicator for evaluating a country’s macroeconomic performance and overall level of economic development. It serves as a proxy for the standard of living, economic productivity, and general well-being of a nation’s population. In this empirical application, we investigate the relationship between per capita GDP and a set of key socioeconomic variables that are believed to influence a country’s economic trajectory. The proposed model is specified as follows:
\[
\psi_i(u) = \alpha_i(u) + \sum_{j=1}^{4} f_j(X_{ij}, u) + \varepsilon_i(u), \qquad u \in (0, 1),
\]
where \(\psi_i\) denotes the LQD transformation of the density of relative per capita GDP for country \(i\), \(\alpha_i\) is the individual-specific function, and \(\varepsilon_i\) is a mean-zero error process. The covariate \(X_{i1}\) denotes the literacy rate (i.e., the percentage of educated individuals aged 15 years and above) of country \(i\); \(X_{i2}\) is the total population; \(X_{i3}\) refers to the per capita GDP in the base year (‘oriGDP’); and \(X_{i4}\) represents the average per capita GDP over the 50-year period (‘average GDP’). By incorporating these covariates, the model facilitates an in-depth examination of how factors such as educational attainment, demographic scale, and historical economic performance jointly shape a country’s economic development. This framework enables a nuanced understanding of the mechanisms underlying economic growth, capturing both current and long-term socioeconomic influences.
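Under the LQD representation, each density response is mapped to an unconstrained function before fitting. The following is a minimal sketch of the transform, assuming only a sample of relative GDP values for one country; the kernel density estimator, grid size, and boundary trimming are illustrative choices rather than the paper’s exact implementation:

```python
import numpy as np
from scipy.stats import gaussian_kde

def lqd_transform(sample, grid_size=101):
    """Log quantile density transform: psi(u) = -log f(Q(u)),
    with Q the empirical quantile function and f a kernel density estimate."""
    u = np.linspace(0.01, 0.99, grid_size)  # trim boundaries to avoid log(0)
    q = np.quantile(sample, u)              # quantile function Q(u)
    f = gaussian_kde(sample)(q)             # density evaluated along Q(u)
    return u, -np.log(f)

# toy usage: a synthetic sample standing in for one country's relative GDP values
rng = np.random.default_rng(0)
u, psi = lqd_transform(rng.lognormal(0.0, 0.5, size=500))
```

The transform removes the nonnegativity and unit-integral constraints of densities, so standard functional regression machinery can be applied to the transformed curves.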
The data utilized in this analysis were sourced from the World Bank database (https://data.worldbank.org, accessed on 25 March 2021), covering the period from 1970 to 2019. After excluding countries with incomplete records, the final dataset comprised the remaining countries, each observed at 50 annual time points. To examine the relative economic standing of each country within a global context, we adopted the methodological framework described in Section 3.2.1 to estimate the relative per capita GDP density for each country. The resulting density estimates, presented in Figure 7, reveal substantial heterogeneity in economic trajectories across nations over time.
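The density estimation step can be sketched as follows, with synthetic data in place of the World Bank records; dividing each year’s values by the cross-country mean and using a Gaussian kernel density estimate are assumptions about the Section 3.2.1 pipeline, not its exact specification:

```python
import numpy as np
from scipy.stats import gaussian_kde

def relative_gdp_densities(gdp, grid):
    """gdp: (n_countries, n_years) matrix of per capita GDP.
    Each year's column is divided by its cross-country mean, and one
    kernel density is estimated per country from its yearly relative values."""
    rel = gdp / gdp.mean(axis=0, keepdims=True)  # relative to the global average
    return np.stack([gaussian_kde(row)(grid) for row in rel])

# synthetic example: 5 hypothetical countries observed over 50 years
rng = np.random.default_rng(1)
gdp = rng.lognormal(mean=9.0, sigma=0.8, size=(5, 50))
grid = np.linspace(0.0, 5.0, 200)
dens = relative_gdp_densities(gdp, grid)  # one density curve per country
```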
To uncover and classify latent group structures based on the estimated density functions, we implemented the proposed procedure and applied the hierarchical agglomerative clustering (HAC) algorithm. The clustering results, displayed in Supplementary Figure S3, were assessed using two model selection criteria, the Generalized Bayesian Information Criterion (GBIC) and the Generalized Akaike Information Criterion (GAIC), both of which consistently indicated that the individual-specific functions should be partitioned into three distinct clusters. The corresponding group memberships, capturing countries with similar patterns in the distributions of relative per capita GDP, are summarized in Supplementary Table S2.
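The clustering step can be sketched with standard tools; Ward linkage on Euclidean distances between discretized curves is an illustrative choice, and the number of clusters is fixed here rather than selected by GBIC/GAIC:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_curves(curves, n_groups):
    """Agglomerative clustering of discretized functions (rows of `curves`
    share a common grid), using Ward linkage on Euclidean distances."""
    Z = linkage(pdist(curves), method="ward")
    return fcluster(Z, t=n_groups, criterion="maxclust")

# synthetic curves forming three well-separated groups
rng = np.random.default_rng(2)
grid = np.linspace(0.0, 1.0, 50)
curves = np.vstack([m * grid + rng.normal(0.0, 0.05, (10, 50))
                    for m in (0.0, 1.0, 2.0)])
labels = cluster_curves(curves, n_groups=3)  # cluster labels in {1, 2, 3}
```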
Following the identification of latent group structures, post-clustering estimation was performed to refine the group-specific functions, yielding an updated estimate for each cluster. These refined functions, along with the corresponding density functions of relative per capita GDP for the three identified groups, are illustrated in Figure 8. The clustering results offer valuable insights into the heterogeneity of economic development trajectories across countries, emphasizing distinct patterns in their relative economic positions over time with respect to the global average.
The relative per capita GDP in Group 3 showed a clear upward trend over time, indicating that these countries have been gaining economic influence in the global landscape. Conversely, Group 1 experienced a downward trend, suggesting a decline in their relative position within the world economy. Group 2, meanwhile, displayed a more stable pattern, with steady growth and less pronounced fluctuations compared to the other groups. These distinct trajectories highlight the diverse paths of economic development among countries, reflecting both emerging economic powers and those facing stagnation or decreasing influence on the global stage.
A backward elimination procedure, guided by the fraction of variance explained (FVE) criterion, was employed to select the most significant predictors for the final model. At each step, the predictor contributing the least to explaining variance was removed, and the model was refitted. To validate the statistical significance of the retained variables, p-values were calculated for each predictor in the final model using a bootstrap procedure to account for the complexity of the functional additive framework. The resulting p-values for the key predictors were 0.021 for ‘education’, 0.015 for ‘population’, and 0.001 for ‘average GDP’, indicating strong evidence that all three variables significantly influence the response. The variable ‘oriGDP’ was excluded due to a comparatively high p-value (0.35) and low contribution to the explained variance, confirming its lack of significance. Incorporating p-values alongside the FVE criterion provides a robust validation of the selected model, ensuring that the retained covariates are not only important in terms of explained variance but also statistically significant, thereby strengthening the interpretability and reliability of the final additive model.
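The elimination loop can be sketched as below; ordinary least squares R² stands in for the FVE of the functional additive fit, the covariate names and stopping threshold are illustrative, and the toy data are arranged so that one uninformative predictor is dropped:

```python
import numpy as np

def backward_eliminate(X, y, names, threshold=0.01):
    """Backward elimination guided by a variance-explained criterion: at each
    step, drop the predictor whose removal costs the least explained variance,
    stopping once every removal would cost at least `threshold`."""
    def fve(cols):  # OLS R^2 as a stand-in for the functional model's FVE
        Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
        resid = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
        return 1.0 - resid.var() / y.var()

    active = list(range(X.shape[1]))
    while len(active) > 1:
        full = fve(active)
        loss, j = min((full - fve([c for c in active if c != j]), j)
                      for j in active)
        if loss >= threshold:
            break  # every remaining predictor matters
        active.remove(j)
    return [names[j] for j in active]

# toy data arranged so the third predictor carries no signal (names illustrative)
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + 3.0 * X[:, 3] + rng.normal(0.0, 0.1, 300)
kept = backward_eliminate(X, y, ["education", "population", "oriGDP", "avgGDP"])
```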
Figure 9 presents heat maps illustrating the effects of these predictors, with corresponding FVE values of 50.75%, 15.41%, and 30.42%, respectively. The heat map for ‘education’ reveals its dynamic influence on relative per capita GDP over time. In particular, countries with higher literacy rates experienced a stronger impact in earlier periods, which gradually diminished, whereas countries with lower literacy rates exhibited the opposite trend. The patterns associated with the remaining predictors display either similar or contrasting dynamics, underscoring the complex and evolving relationship between these socioeconomic factors and economic development across countries.
The RMSE values for the pre- and post-clustering estimations were 0.6385 and 0.2971, respectively, whereas the RMSE for the homogeneous additive model was 0.7962. These results highlight the necessity of incorporating heterogeneity into the analysis and demonstrate the superior performance of the proposed identification and clustering procedure when applied to the GDP data. Furthermore, three representative countries were selected from each identified group, and the corresponding observed and fitted density curves are illustrated in Supplementary Figure S4. Overall, the fitted densities closely match the observed curves, underscoring the robustness and accuracy of the proposed model in capturing the underlying economic dynamics across different countries.
In summary, this study highlights the critical role of accounting for heterogeneity in modeling the economic development of countries over time. By uncovering latent group structures through advanced clustering techniques, we are able to distinguish distinct trajectories in relative per capita GDP that reflect varying economic realities across nations. The inclusion of key socioeconomic covariates—education, population, and average GDP—provides a nuanced understanding of how these factors interact dynamically with economic outcomes. Our results demonstrate that these influences are neither static nor uniform, but rather evolve in complex ways depending on a country’s specific context and developmental stage. The superior predictive accuracy of the clustering-based approach, as evidenced by significantly reduced RMSE values compared to homogeneous models, underscores the value of embracing heterogeneity in such analyses.
While the methodology proves robust and insightful, it also has limitations that open avenues for future research. For instance, the assumption of fixed group memberships over time may not fully capture the fluidity of economic changes and transitions experienced by countries. Incorporating time-varying clustering or allowing for overlapping group structures could offer a more flexible and realistic modeling framework. Additionally, extending the model to integrate a broader set of covariates—such as technological innovation, institutional quality, or trade openness—would deepen our understanding of the multifaceted drivers behind economic growth. Overall, this work lays a strong foundation for more refined and comprehensive studies, aiming to better inform policymakers and economists about the diverse paths of economic development on the global stage.