Abstract
This paper develops copula-based modeling procedures that capture the dependence between a response and its explanatory variables. Building upon the work of Noh et al. (J. Am. Stat. Assoc. 2013, 108, 676–688), we extend copula-based regression to accommodate both continuous and discrete covariates. Specifically, we construct copulas to estimate the conditional mean of the response variable given the covariates, and we elucidate the relationship between copula structures and marginal distributions. We consider various estimation methods for copulas and distribution functions, which yield estimators of the conditional mean function ranging from non-parametric to semi-parametric and fully parametric, offering flexibility in modeling regression relationships. An adapted algorithm is applied to construct copulas, and simulations are carried out to replicate datasets, estimate prediction model parameters, and compare the results with the OLS method. The practicality and efficacy of the proposed methodology are substantiated through simulation studies.
Keywords:
least squares regression; copulas; copula-based regression; Archimedean copulas; Gaussian copula; IMSE
MSC:
60E05; 62H05; 62H20; 62J05; 62G05; 62G08
1. Introduction
Scientists are often led to study the relationships and dependencies between a response variable and several covariates. Regression analysis is the statistical tool for investigating such relationships, and it is one of the most commonly used statistical methods in many scientific fields, such as medicine, biology, agriculture, economics, engineering, and sociology. In medical research, econometrics, and other research fields, it is very common to use regression analysis to interpret the correlation between different variables. However, the basic form of regression analysis is not suitable in many cases, where the relationships are often non-linear and the distribution of the output variable may be non-normal.
For such dependence modeling problems, we attempt to provide a functional form that summarizes the relationship between the response and the explanatory variables. In several practical situations, a vector of covariates is used to explain, interpret, or predict the response variable Y. This is encountered in many fields, including medicine and social science. The type of functional relationship we attempt to identify could depend on the marginal behavior of the variables or on their joint behavior. In this paper, we consider the construction of dependence modeling procedures based on the separation of these two behaviors when the covariates are a mixture of continuous and discrete variables.
In this context, we consider procedures that allow the representation of a multivariate distribution as a function of its univariate marginals through a connection function called a copula. Copulas have become increasingly popular for modeling statistical dependence in multivariate data and have been applied in various areas, including medical research, environmental science, econometrics, actuarial science, agronomy, and others. A key feature of copulas is that they provide flexible representations of the multivariate distribution by allowing the dependence structure of the variables of interest to be modeled separately from the marginal structure; by specifying a copula, we summarize all the dependencies between margins (see Nelsen [1] for more about this subject).
The power of this approach lies principally in the practitioner's ability to model the dependence structure independently of the marginal behaviors. Further advantages of using copulas in modeling are that they can capture both linear and non-linear dependence, allow an arbitrary choice of marginal distributions, and are capable of modeling extreme endpoints. The principal advantage of copula regression, however, is that it places no restriction on the probability distributions that can be used.
Copula-based regression models offer significant advantages in capturing complex dependencies between variables, making them highly useful in various fields. In finance, they allow for better portfolio risk management by modeling non-linear dependencies between asset returns and macroeconomic factors, especially during market downturns. In insurance, copula-based regression can be applied to explain pricing in terms of different dependent types of claims, such as frequency and severity. In environmental studies, copula-based regression is useful for establishing the relationship between rainfall and river discharge, especially in the case of non-linear dependence. In healthcare, copula-based regression enables researchers to examine how lifestyle factors influence health outcomes, such as cholesterol levels, while capturing the potential interdependence among these health indicators.
In the literature, there exist many recent studies of regression based on copulas; as examples, we cite Sheikhi et al. [2] and Ali et al. [3] among others. As a new contribution to this domain, we consider in this paper the estimation problem of the mean regression function for a regression model, where is a random vector of dimension and Y is a random variable with cumulative distribution function (c.d.f.) and density function . Y is the response variable and is the set of covariates. We denote by the c.d.f. of the variables and we denote by its corresponding density. For a given , we will note by the shortcut for . From the inspiring work of Sklar [4], the c.d.f. of evaluated at can be expressed in terms of , where C is the copula distribution of , that is, the function from to defined by
Recently, Noh et al. [5] exploited the above decomposition to introduce a novel idea consisting of expressing the mean regression function , in terms of the copula and margins of as follows.
where is the copula density corresponding to C and is the copula density of . This shows that the mean regression function is the ratio of a numerator that only captures the mean dependence between Y and X and a denominator that captures the dependence within X. It is worth mentioning that the formula is only valid when the covariates are continuous. A new reformulation is needed when the covariates are not all continuous, which is the case for many real-world applications, especially in medicine.
Furthermore, Noh et al. [5] proposed a semi-parametric estimator for the regression function given in (1). Specifically, they utilized the inference function for margins (IFM) technique to estimate the copula-based regression curve. This method proceeds in two stages: first, it estimates the marginal parameters, and then it estimates the corresponding dependence parameter. These authors demonstrated, both theoretically and empirically, that the resulting estimates exhibit desirable properties when the parametric copula family is adequately chosen.
Noh et al. [5] stimulated extensive research on copula-based regression. Noh et al. [6] applied the method of Noh et al. [5] to quantile regression with i.i.d. or time series data that are completely observed. De Backer et al. [7] extended the method of Noh et al. [6] to quantile regression with censored data. Kraus and Czado [8] studied quantile regression with complete data, using D-vine copulas. Rémillard et al. [9] discussed the asymptotic connection between the estimators of Noh et al. [6] and Kraus and Czado [8]. Chang and Joe [10] proposed an algorithm for computing the conditional distribution function via the vine copula. Furthermore, Nagler and Vatter [11] unified various copula-based regressions by formulating a general loss function which may not be continuously differentiable. Their generalized regression model includes the conditional mean regression of Noh et al. [5], the conditional quantile regression of Noh et al. [6], and the asymmetric least squares of Newey and Powell [12] as special cases. The unified framework enhances the systematic interpretation of the different existing regressions. For additional discussion of similar methods, see [13,14,15,16,17] and the literature cited therein.
As an extension of the framework by Noh et al. [5], we incorporate discrete variables into the set of covariates . By establishing a connection with various classes of copulas through an alternative equation to (1), we calculate the conditional mean, , of . In this context, we develop the relationship between the copula and the marginals. Furthermore, we illustrate this relationship for specific families of copulas, such as Archimedean copulas and the Gaussian copula, highlighting their properties that are beneficial for our analysis.
The next step involves addressing the estimation problem. Here, we also adopt a semi-parametric approach along with the inference function for margins (IFM) method to estimate the proposed regression curve. First, we estimate the marginal distributions using their empirical distributions, and then we estimate the dependence parameter associated with the underlying copula. Simulation studies for different classes of copulas and different distributions for the output Y are conducted to illustrate the usefulness of the findings.
The rest of the paper is organized as follows. Section 2 discusses different copula concepts in the multivariate setting. Section 3 outlines the copula-based regression model proposed for the case where the set of covariates includes both discrete and continuous variables. Section 4 covers the estimation procedure of the proposed regression model. Section 5 is dedicated to a simulation study that assesses the performance of the suggested copula-based regression. Conclusions and remarks are given in Section 6.
2. Preliminaries
Copulas are a mathematical concept used in multivariate analysis to describe the dependence structure between the components of a multivariate random vector. They play a central role in various fields employing multivariate statistical analysis, such as risk management and finance. Copulas provide a framework for modeling the relationships between variables by describing their dependence independently of their marginal distributions.
This section provides a brief overview of the copula concept, which will be utilized in the development of the proposed model. According to Nelsen [1], a multivariate copula is defined as follows.
Definition 1.
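A d-dimensional copula is a function $C$ from $[0,1]^d$ to $[0,1]$ with the following properties:
- 1. For every $u = (u_1,\dots,u_d) \in [0,1]^d$,
  - (i) $C(u) = 0$ if at least one coordinate of $u$ is 0;
  - (ii) $C(u) = u_k$ if all coordinates of $u$ are 1 except $u_k$.
- 2. For every $a = (a_1,\dots,a_d)$ and $b = (b_1,\dots,b_d)$ in $[0,1]^d$ such that $a_j \le b_j$ for all $j$, $V_C([a,b]) \ge 0$, where
$$V_C([a,b]) = \sum_{i_1=1}^{2}\cdots\sum_{i_d=1}^{2} (-1)^{i_1+\cdots+i_d}\, C(u_{1 i_1},\dots,u_{d i_d}),$$
with $u_{j1} = a_j$ and $u_{j2} = b_j$ for $j = 1,\dots,d$.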
Sklar’s Theorem is a fundamental result in copula theory. It enables us to express the joint distribution of a multivariate random vector in terms of its marginal distributions and a copula function. It can be stated as follows (see Nelsen [1]).
Theorem 1.
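Let H be a d-dimensional distribution function with marginal distributions $F_1,\dots,F_d$. Then, there exists a d-copula C such that, for all $x = (x_1,\dots,x_d) \in \overline{\mathbb{R}}^d$,
$$H(x_1,\dots,x_d) = C\big(F_1(x_1),\dots,F_d(x_d)\big).$$
If $F_1,\dots,F_d$ are all continuous, then C is unique; otherwise, C is uniquely determined on $\operatorname{Ran}F_1 \times \cdots \times \operatorname{Ran}F_d$. Conversely, if C is a d-copula and $F_1,\dots,F_d$ are distribution functions, then the function H defined by $H(x_1,\dots,x_d) = C(F_1(x_1),\dots,F_d(x_d))$ is a d-dimensional distribution function with margins $F_1,\dots,F_d$.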
It is well-known that Sklar’s Theorem has numerous practical applications in various fields involving multivariate data analysis. For instance, Sklar’s Theorem is commonly employed to analyze dependencies among different financial assets. It enables us to understand how the dependence structure between the prices of different assets might affect the overall risk of a portfolio.
2.1. Archimedean Copulas
Archimedean copulas constitute an important class of parametric copulas. This type of copula describes the dependence structure between random variables with greater flexibility through a single function called the generator. The latter is often expressed in terms of dependence parameters that control the strength of dependence among the components of a given random vector.
The generator of a d-dimensional Archimedean copula is an increasing and continuous function defined from to such that and . Suppose that is differentiable up to the order with derivatives noted by for and let be its inverse, that is, . Hereafter is the definition of a multivariate Archimedean copula. For details on this subject, see McNeil and Nešlehová [18].
Definition 2.
The d-dimensional Archimedean copula is defined through its generator ϕ as follows:
where the generator ϕ is subject to the conditions that for , and is non-increasing and convex.
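Hereafter, we present the Clayton copula and the Frank copula, both considered among the most popular multivariate Archimedean copulas. The d-dimensional Clayton copula with parameter $\theta > 0$ is defined, for all $u = (u_1,\dots,u_d) \in [0,1]^d$, as follows:
$$C_\theta(u_1,\dots,u_d) = \Big(\sum_{i=1}^{d} u_i^{-\theta} - d + 1\Big)^{-1/\theta}. \tag{2}$$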
It is an Archimedean copula whose inverse generator function is defined, for all , by
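Likewise, the d-dimensional Frank family copula is expressed, for all $u = (u_1,\dots,u_d) \in [0,1]^d$, by:
$$C_\theta(u_1,\dots,u_d) = -\frac{1}{\theta}\,\log\!\Big(1 + \frac{\prod_{i=1}^{d}\big(e^{-\theta u_i} - 1\big)}{\big(e^{-\theta} - 1\big)^{d-1}}\Big). \tag{4}$$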
Its inverse generator function is given, for all , by
For $d \ge 3$, the Frank copula describes only positive dependence, whereas in the two-dimensional case, this copula models both positive and negative association.
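As a small numerical companion to the two families above, the following sketch evaluates the d-dimensional Clayton and Frank copulas in their usual parametrizations with $\theta > 0$. The function names `clayton_cdf` and `frank_cdf` are hypothetical and are not taken from the paper; the parametrization is assumed to be the standard one.

```python
import math
from typing import Sequence


def clayton_cdf(u: Sequence[float], theta: float) -> float:
    """d-dimensional Clayton copula, standard parametrization (theta > 0)."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    if any(ui <= 0.0 for ui in u):
        return 0.0  # a copula vanishes when any coordinate is 0
    s = sum(ui ** (-theta) for ui in u) - len(u) + 1
    return s ** (-1.0 / theta)


def frank_cdf(u: Sequence[float], theta: float) -> float:
    """d-dimensional Frank copula, standard parametrization (theta > 0)."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    d = len(u)
    prod = 1.0
    for ui in u:
        prod *= math.expm1(-theta * ui)  # e^{-theta*u_i} - 1, computed stably
    ratio = prod / math.expm1(-theta) ** (d - 1)
    return -math.log1p(ratio) / theta


if __name__ == "__main__":
    point = [0.4, 0.7, 0.9]
    print("Clayton:", clayton_cdf(point, 2.0))
    print("Frank:  ", frank_cdf(point, 2.0))
```

Both functions return $u_k$ when every other coordinate equals 1, which is property 1(ii) of Definition 1.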
2.2. Gaussian Copulas
The d-Gaussian copula is defined through the standardized d-multivariate normal distribution . The correlation matrix represents the dependence parameters of this copula. Specifically, is expressed by
where denotes the standard normal distribution. In other words, the multivariate Gaussian copula is explicitly given by,
where I is the identity matrix. The bivariate Gaussian copula reduces to
where represents the Pearson correlation coefficient, a parameter within the range , serving as the dependence parameter for this copula.
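The bivariate case can be explored numerically. The sketch below, which uses only the Python standard library, evaluates the standard bivariate Gaussian copula density and the conditional distribution $P(V \le v \mid U = u) = \partial C(u,v)/\partial u$. The function names are hypothetical, and the closed forms are the textbook ones, assumed to match the displays of this section.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def gaussian_copula_density(u: float, v: float, rho: float) -> float:
    """Density of the bivariate Gaussian copula with correlation rho in (-1, 1)."""
    a, b = _STD.inv_cdf(u), _STD.inv_cdf(v)
    r2 = rho * rho
    exponent = -(r2 * (a * a + b * b) - 2.0 * rho * a * b) / (2.0 * (1.0 - r2))
    return math.exp(exponent) / math.sqrt(1.0 - r2)


def gaussian_copula_conditional(v: float, u: float, rho: float) -> float:
    """P(V <= v | U = u): partial derivative of the copula with respect to u."""
    a, b = _STD.inv_cdf(u), _STD.inv_cdf(v)
    return _STD.cdf((b - rho * a) / math.sqrt(1.0 - rho * rho))


if __name__ == "__main__":
    print(gaussian_copula_density(0.3, 0.7, 0.6))
    print(gaussian_copula_conditional(0.7, 0.3, 0.6))
```

With `rho = 0` the density is identically 1 and the conditional distribution reduces to `v`, which is the independence case.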
3. Model Description
Starting from a random vector , where Y is a continuous random variable with cumulative distribution function , assume that , where the random vectors and are continuous and discrete, respectively, and suppose without loss of generality that, for any , . Denote by the distribution functions of , respectively, and, for all , let . Let C and be the copulas of and , respectively. Moreover, for , set,
For , set such that and , for . For , let be the forward difference operator defined by
and set .
Proposition 1.
For all , the conditional mean, of Y given is expressed by,
Proof.
For all , let be the conditional density function of Y given . Clearly, one has
□
Remark 1.
3.1. Archimedean Copula-Based Predicted Mean
Suppose that the dependence structure of is described by an Archimedean class of copulas C with generator . This means the copulas C and are expressed, for all , by
where the function represents the inverse of the generator . Therefore, the partial derivative of the copulas C and are given by
and
Hence, the regression curve is given by
To exemplify the above conditional mean, let us examine the scenario where and , implying that covariate is continuous, while covariate is discrete. In such instances, we have:
where
and
Example 1.
Illustrating Equation (11), let us assume that follows the Clayton copula as described in (2). Specifically, for all ,
The generator of this copula and its inverse given in (3) satisfy
Hence, standard calculations show that (11) reduces to
Likewise, let us express Equation (11) when follows the Frank copula expressed in (4), namely,
Calculations similar to those used previously lead to
where
and
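To make the mixed-covariate construction concrete, here is a numerical sketch for the Clayton case with one continuous covariate and one discrete covariate. It assumes that the conditional mean has the ratio form $\int_0^1 F_Y^{-1}(u_0)\,\Delta\,\partial^2_{u_0 u_1} C\,du_0 \big/ \Delta\,\partial_{u_1} C_X$, where $\Delta$ is the difference between the values at $F_2(x_2)$ and at the left limit $F_2(x_2^-)$. This form follows from standard copula arguments but is an assumption about the formulas of this section, not a transcription of them. All function names are hypothetical, and the response is taken to be standard normal purely for illustration.

```python
from statistics import NormalDist


def clayton_d01(u0: float, u1: float, u2: float, theta: float) -> float:
    """Mixed partial derivative d^2 C / (du0 du1) of the 3-d Clayton copula."""
    if min(u0, u1, u2) <= 0.0:
        return 0.0
    s = u0 ** -theta + u1 ** -theta + u2 ** -theta - 2.0
    return (1.0 + theta) * (u0 * u1) ** (-theta - 1.0) * s ** (-1.0 / theta - 2.0)


def clayton_x_d1(u1: float, u2: float, theta: float) -> float:
    """Partial derivative d C_X / du1 of the 2-d Clayton copula of the covariates."""
    if min(u1, u2) <= 0.0:
        return 0.0
    s = u1 ** -theta + u2 ** -theta - 1.0
    return u1 ** (-theta - 1.0) * s ** (-1.0 / theta - 1.0)


def conditional_mean(u1, u2_hi, u2_lo, theta,
                     quantile=NormalDist().inv_cdf, n_grid=4000):
    """E[Y | X1, X2] with u1 = F1(x1), u2_hi = F2(x2), u2_lo = F2(x2-).

    The integral over u0 is approximated by a midpoint rule; `quantile`
    is the quantile function of Y (standard normal here, an assumption).
    """
    den = clayton_x_d1(u1, u2_hi, theta) - clayton_x_d1(u1, u2_lo, theta)
    num = 0.0
    for k in range(n_grid):
        u0 = (k + 0.5) / n_grid
        weight = clayton_d01(u0, u1, u2_hi, theta) - clayton_d01(u0, u1, u2_lo, theta)
        num += quantile(u0) * weight
    return num / (n_grid * den)


if __name__ == "__main__":
    # Hypothetical binary X2 with P(X2 = 0) = 0.6, evaluated at F1(x1) = 0.3.
    print("x2 = 0:", conditional_mean(0.3, 0.6, 0.0, theta=2.0))
    print("x2 = 1:", conditional_mean(0.3, 1.0, 0.6, theta=2.0))
```

Replacing `quantile` by a constant function returns approximately 1, which checks that the implied conditional density of the response integrates to one.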
3.2. Gaussian Copula-Based Predicted Mean
This section presents the expression of the regression curve when the copula C of is Gaussian. This means that the copula C is expressed in terms of the standardized -multivariate normal distribution and the correlation matrix , which is assumed to be non-singular, as follows.
where denotes the standard normal distribution and where we note that
To derive and , let us first decompose the correlation matrix as follows.
where and represent the correlation matrices of the -continuous random vector and the -discrete random vector , respectively. Furthermore, denotes the correlation matrix between the random vectors and and .
Consider the -uniform random vector with distribution C, and set and . Therefore, one observes, for all ,
where is the copula density of . Let and . Since is normally distributed, then the conditional random vector is distributed as
with distribution function . It follows that
where is the copula density of the random vector . It remains to derive , where is the Gaussian copula with correlation matrix
where and represent the correlation matrices of the q-continuous random vector and the -discrete random vector , respectively. Similarly, denotes the correlation matrix between the random vectors and and . It follows that, for all ,
where is the copula density of the random vector . Using the fact that the conditional random vector is distributed as
with distribution function . It follows that
Therefore, the predicted mean is given by
Example 2.
Consider the case and . This means that the covariate is continuous and the covariate is discrete. Assume further that the copula of is Gaussian with the correlation matrix
In such a case, we have,
where . It remains to calculate the copula density of and the conditional normal distribution and . Since the copula of is Gaussian with correlation matrix, Then, the copula density is given by
Also, we have
Standard calculations show that, from this, is the distribution of , such that
and
Likewise, is the distribution of , such that
where
and
Example 3.
As a continuation of Example 2, in order to give a closed form for the conditional mean , we consider the case where the variables Y and are distributed as standard normal and where the correlation matrix of the Gaussian copula is determined by and . Thus, the conditional expectation is given by:
for any discrete random variables and for any .
4. Estimation
Consider a sample of n observations from the random vector . For , denote . To estimate the conditional mean described in (7), we need to estimate the marginal distributions as well as the partial derivatives of the copulas C and , namely, and , provided in (6). In this paper, we use a semi-parametric methodology that first consists of estimating margins through their rescaled empirical distributions given by
respectively, where stands for the indicator function for given event A. An alternative method for estimating these quantities is through the use of the kernel smoothing technique. Typically, this approach yields more accurate estimations compared to the method based on empirical distributions. The idea behind this method is to estimate the distributions using
Here, K represents a non-negative function known as the kernel, while h signifies the bandwidth. It is well known that the selection of the bandwidth is crucial and significantly influences the accuracy of the estimation.
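The two marginal estimators discussed above can be sketched as follows. The rescaling by $n+1$ (rather than $n$), the Gaussian kernel, and the rule-of-thumb bandwidth are assumptions made for illustration; the paper's exact choices may differ. The function names are hypothetical.

```python
import bisect
from statistics import NormalDist, stdev
from typing import Callable, Optional, Sequence


def rescaled_ecdf(sample: Sequence[float]) -> Callable[[float], float]:
    """Empirical c.d.f. rescaled by n + 1 so that it never reaches 1."""
    xs = sorted(sample)
    n = len(xs)

    def cdf(x: float) -> float:
        # bisect_right counts the observations that are <= x
        return bisect.bisect_right(xs, x) / (n + 1)

    return cdf


def kernel_cdf(sample: Sequence[float],
               h: Optional[float] = None) -> Callable[[float], float]:
    """Kernel-smoothed c.d.f. using the integrated Gaussian kernel."""
    n = len(sample)
    if h is None:
        # Rule-of-thumb bandwidth (an assumption, not the paper's choice)
        h = 1.06 * stdev(sample) * n ** (-0.2)
    std = NormalDist()

    def cdf(x: float) -> float:
        return sum(std.cdf((x - xi) / h) for xi in sample) / n

    return cdf


if __name__ == "__main__":
    data = [3.0, 1.0, 2.0]
    print(rescaled_ecdf(data)(2.0))
    print(kernel_cdf(data, h=0.5)(2.0))
```

The rescaled version keeps the pseudo-observations strictly inside the unit interval, which avoids evaluating a copula density on the boundary.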
The second step is to estimate parametrically the copula of . To this end, assume that the copula C is identified as a member of some parametric family, , where This means that there exists such that and . Therefore, the copula C is then estimated by , where is an estimator of . This estimator is typically obtained by maximizing, in terms of , the expressed pseudo-likelihood function,
where . In other words, the estimator of the is given by
Finally, the conditional mean is estimated using (7) as follows:
Example 4.
Let us examine the above estimation procedure in the scenario where and , so that is a continuous covariate while is discrete. Additionally, let us assume that the copula governing is the Clayton copula with parameter . The theoretical conditional mean is provided in (12). Its estimated counterpart can be derived from (19) as follows:
The estimators , , , and can be computed using a sample , , selected from the distribution of . Similarly, in the case where the dependence structure of is modeled by a Frank copula, the estimated conditional mean can be derived from (13) and (19) as follows:
and are given in (14) and (15), respectively.
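The second estimation step, the maximization of the pseudo-likelihood over the dependence parameter, can be illustrated in a simplified setting. The sketch below treats only a bivariate Clayton copula with two continuous variables, builds pseudo-observations from ranks, and maximizes the log pseudo-likelihood by golden-section search. The restriction to the continuous bivariate case, the search interval, and all function names are assumptions made for illustration.

```python
import math
import random


def pseudo_obs(x):
    """Rank-based pseudo-observations rank / (n + 1)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


def clayton_loglik(theta, u, v):
    """Log pseudo-likelihood of the bivariate Clayton copula density."""
    total = 0.0
    for a, b in zip(u, v):
        s = a ** -theta + b ** -theta - 1.0
        total += (math.log1p(theta)
                  - (theta + 1.0) * (math.log(a) + math.log(b))
                  - (1.0 / theta + 2.0) * math.log(s))
    return total


def fit_clayton(u, v, lo=0.01, hi=20.0, tol=1e-4):
    """Golden-section search for the maximizer of the log pseudo-likelihood."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = clayton_loglik(c, u, v), clayton_loglik(d, u, v)
    while b - a > tol:
        if fc > fd:  # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = clayton_loglik(c, u, v)
        else:        # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = clayton_loglik(d, u, v)
    return (a + b) / 2.0


def simulate_clayton_pairs(n, theta, rng):
    """Draw n pairs from the bivariate Clayton copula by conditional inversion."""
    xs, ys = [], []
    for _ in range(n):
        u = 1.0 - rng.random()  # in (0, 1]
        w = 1.0 - rng.random()
        v = (u ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        xs.append(u)
        ys.append(v)
    return xs, ys


if __name__ == "__main__":
    rng = random.Random(0)
    xs, ys = simulate_clayton_pairs(1500, 2.0, rng)
    print("estimated theta:", fit_clayton(pseudo_obs(xs), pseudo_obs(ys)))
```

Golden-section search is used here only because it needs no external optimizer; it relies on the log pseudo-likelihood being unimodal on the search interval.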
5. Simulation Study
The objective of this section is to conduct simulations to compare the proposed conditional mean estimator with some competitors. To achieve this, we focus on the case where with mixed covariates; specifically, is continuous, and is discrete. In this case, the proposed estimator is deduced from its general form expressed in (19) as follows,
where
As scenarios, we consider the most common cases to show the improvement of our estimator over the OLS estimator. However, for the copula of , we consider Clayton, Frank, and Gumbel with parameter and for the variables or , while and with distribution , and . The generalized inverse of is
or equivalently,
Simulation algorithm:
- Given , and .
- For .
- Generate from a copula .
- Set , and .
- Use the generated sample , to estimate and define the empirical distributions of , , and .
- Evaluate the estimator for belonging to the grid defined by
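The data-generating steps of the algorithm above can be sketched as follows for a three-dimensional Clayton copula. Sampling uses the gamma-frailty (Marshall-Olkin) construction; the standard normal response, the Uniform(0, 1) continuous covariate, and in particular the three-point distribution of the discrete covariate are illustrative assumptions, since the exact settings are not recoverable from the text. All function names are hypothetical.

```python
import random
from statistics import NormalDist

_STD = NormalDist()


def sample_clayton(d, theta, rng):
    """One draw from the d-dimensional Clayton copula (gamma-frailty method)."""
    v = max(rng.gammavariate(1.0 / theta, 1.0), 1e-300)  # guard against v == 0
    return [(1.0 + rng.expovariate(1.0) / v) ** (-1.0 / theta) for _ in range(d)]


def discrete_quantile(u, support, probs):
    """Generalized inverse of a discrete c.d.f.: smallest x with F(x) >= u."""
    acc = 0.0
    for x, p in zip(support, probs):
        acc += p
        if u <= acc:
            return x
    return support[-1]


def generate_sample(n, theta, seed=0,
                    support=(0, 1, 2), probs=(0.3, 0.4, 0.3)):
    """Return n triples (y, x1, x2) with Clayton dependence.

    The support and probabilities of x2 are hypothetical placeholders.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u0, u1, u2 = sample_clayton(3, theta, rng)
        u0 = min(max(u0, 1e-12), 1.0 - 1e-12)  # keep inv_cdf well defined
        y = _STD.inv_cdf(u0)                   # response: standard normal
        x1 = u1                                # continuous covariate: Uniform(0, 1)
        x2 = discrete_quantile(u2, support, probs)
        data.append((y, x1, x2))
    return data


def kendall_tau(x, y):
    """Sample Kendall's tau (O(n^2), ties count as zero)."""
    n, s = len(x), 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return 2.0 * s / (n * (n - 1))


if __name__ == "__main__":
    sample = generate_sample(500, theta=2.0, seed=1)
    ys, x1s = [r[0] for r in sample], [r[1] for r in sample]
    print("Kendall's tau between Y and X1:", kendall_tau(ys, x1s))
```

For the Clayton family, Kendall's tau equals $\theta/(\theta+2)$, so the printed value should be close to 0.5 when $\theta = 2$.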
For fixed , we first compute the theoretical value and then evaluate using J random samples of size n. We denote the corresponding estimates by , where . To assess the performance, we employ the empirical integrated mean squared error (IMSE), which is formulated as follows:
where denotes the cardinality number of the grid F. Notably, can be decomposed into the square of empirical bias, , and the empirical variance, , as follows:
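The decomposition can be checked numerically. The sketch below assumes the usual definitions: at each grid point, the squared error is averaged over the J replicates, and the bias and variance are taken with respect to the replicate mean. The function name and the toy numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def imse_decomposition(estimates: Sequence[Sequence[float]],
                       truth: Sequence[float]) -> Tuple[float, float, float]:
    """Return (IMSE, IBIAS^2, IVAR) averaged over the grid.

    estimates[j][k] is the estimate from replicate j at grid point k;
    truth[k] is the theoretical conditional mean at grid point k.
    """
    n_rep, n_grid = len(estimates), len(truth)
    imse = ibias2 = ivar = 0.0
    for k in range(n_grid):
        column: List[float] = [estimates[j][k] for j in range(n_rep)]
        mean = sum(column) / n_rep
        imse += sum((e - truth[k]) ** 2 for e in column) / n_rep
        ibias2 += (mean - truth[k]) ** 2
        ivar += sum((e - mean) ** 2 for e in column) / n_rep
    return imse / n_grid, ibias2 / n_grid, ivar / n_grid


if __name__ == "__main__":
    est = [[1.0, 2.0], [3.0, 2.0]]  # two replicates, two grid points
    true_values = [1.5, 2.5]
    print(imse_decomposition(est, true_values))
```

In this toy example the three returned values are 0.75, 0.25 and 0.5, so the first equals the sum of the other two.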
In this simulation study, different values of the parameters are considered, which represent different dependence scenarios ranging from weak to strong, with Kendall’s tau, , values lying in the interval . With a sample size , the response, Y, is generated from distribution and Student’s t-distribution with 3 degrees of freedom. Also, is generated from a Uniform(0, 1) and with distribution , and , where and . In this context, we report the integrated mean squared error (IMSE) and the integrated mean absolute error (IMAE) and compare them with the respective errors obtained from the least squares (LS) regression method. The values reported in Table 1 and Table 2, corresponding to the normal distribution and Student’s t-distribution, respectively, are averages calculated over a total of 100 realizations. Averaging in this way makes the comparison more reliable, since it accounts for the variability in outcomes across realizations.
The results show that the proposed method consistently outperformed the least squares regression method across all the scenarios. This dominance was evident in both metrics, IMSE and IMAE, and across all Kendall’s tau values and sample sizes. We also analyzed how the IMSE evolves with the sample size (see Table 3). Specifically, we considered and , which are relatively small. As n increases, the IMSE decreases markedly, indicating improved accuracy and stability of the estimator.
Table 1.
Simulation results for normal distribution with .
Table 2.
Simulation results for Student’s t with and .
Table 3.
Simulation results for normal distribution with and .
In particular, the proposed method is more accurate and robust, yielding a lower IMSE and IMAE between the estimated and actual values than the least squares method. This enhanced performance can be attributed to the ability of the proposed method to capture the underlying dependence structure in the data, summarized here by Kendall’s tau. Unlike the least squares method, which assumes a specific (linear) form of relationship, the proposed method offers a more flexible approach to analyzing data with varying degrees of dependence and complexity.
6. Conclusions
This paper extends the copula-based regression model introduced by Noh et al. [5] by addressing the scenario where covariates are mixed, encompassing both continuous and discrete explanatory variables. Unlike the original model, which dealt exclusively with continuous covariates, the proposed approach broadens the applicability of the copula-based regression framework. The parameter estimation has been performed using the inference function for margins (IFM), which first estimates the marginal parameters and then estimates the corresponding dependence parameter. Through detailed examples, we demonstrated the estimation of the proposed regression equation and conducted a comprehensive simulation study under various scenarios involving different types of copulas. The results of the simulation study indicate that the suggested model performs favorably compared to classical regression approaches, showcasing its potential to handle mixed-covariate data effectively. This extension provides a valuable contribution to the field of regression analysis, offering a new regression tool for researchers and practitioners dealing with diverse explanatory data types. An interesting potential research direction involves extending this concept to regression with multivariate responses using the same mixed covariates. This extension is particularly relevant in various practical applications where multiple outcomes need to be modeled simultaneously. For instance, in environmental studies, multivariate regression can be used to assess how industrial emissions simultaneously impact both air and water quality, accounting for the complex interactions between pollutants. In healthcare, it enables researchers to examine how lifestyle factors influence multiple health outcomes, such as blood pressure, cholesterol levels, and blood sugar levels.
Author Contributions
Writing—review & editing, S.A., O.K. and M.M. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported in part by Grant CARP2023 from the College of Business and Economics at UAE University, for the Promotion of research.
Data Availability Statement
No new data were created or analyzed in this study.
Acknowledgments
We would like to express our sincere thanks to the anonymous referees for their constructive comments and suggestions, which improved the earlier version of our paper. We are also very grateful to UAE University Research Affairs for funding the APC.
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Nelsen, R.B. An Introduction to Copulas; Springer Series in Statistics; Springer: New York, NY, USA, 2006.
2. Sheikhi, A.; Arad, F.; Mesiar, R. A heteroscedasticity diagnostic of a regression analysis with copula dependent random variables. Braz. J. Probab. Stat. 2022, 36, 408–419.
3. Ali, A.; Pathak, A.K.; Arshad, M.; Emura, T. Copula-based regression estimation in the presence of outliers. Commun. Stat. Simul. Comput. 2024, 1–26.
4. Sklar, M. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 1959, 8, 229–231.
5. Noh, H.; El Ghouch, A.; Bouezmarni, T. Copula-based regression estimation and inference. J. Am. Stat. Assoc. 2013, 108, 676–688.
6. Noh, H.; El Ghouch, A.; Van Keilegom, I. Semiparametric conditional quantile estimation through copula-based multivariate models. J. Bus. Econ. Stat. 2015, 33, 167–178.
7. De Backer, M.; El Ghouch, A.; Van Keilegom, I. Semiparametric copula quantile regression for complete or censored data. Electron. J. Stat. 2017, 11, 1660–1698.
8. Kraus, D.; Czado, C. D-vine copula based quantile regression. Comput. Stat. Data Anal. 2017, 110, 1–18.
9. Rémillard, B.; Nasri, B.; Bouezmarni, T. On copula-based conditional quantile estimators. Stat. Probab. Lett. 2017, 128, 14–20.
10. Chang, B.; Joe, H. Prediction based on conditional distributions of vine copulas. Comput. Stat. Data Anal. 2019, 139, 45–63.
11. Nagler, T.; Vatter, T. Solving estimating equations with copulas. J. Am. Stat. Assoc. 2023, 119, 1168–1180.
12. Newey, W.K.; Powell, J.L. Asymmetric least squares estimation and testing. Econometrica 1987, 55, 819–847.
13. Coia, V.; Joe, H.; Nolde, N. Copula-based conditional tail indices. J. Multivar. Anal. 2023, 201, 105268.
14. Mesfioui, M.; Bouezmarni, T.; Belalia, M. Copula-based link functions in binary regression models. Stat. Pap. 2023, 64, 557–585.
15. Smith, M.S. Implicit copulas: An overview. Econom. Stat. 2023, 28, 81–104.
16. Hans, N.; Klein, N.; Faschingbauer, F.; Schneider, M.; Mayr, A. Boosting distributional copula regression. Biometrics 2023, 79, 2298–2310.
17. Nazeri Tahroudi, M.; Ramezani, Y.; De Michele, C.; Mirabbasi, R. Application of copula-based approach as a new data-driven model for downscaling the mean daily temperature. Int. J. Climatol. 2023, 43, 240–254.
18. McNeil, A.J.; Nešlehová, J. Multivariate Archimedean copulas, d-monotone functions and ℓ1-norm symmetric distributions. Ann. Statist. 2009, 37, 3059–3097.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).