Copula-Based Regression with Mixed Covariates

Saeed Aldahmani; Othmane Kortbi; Mhamed Mesfioui

doi:10.3390/math12223525

¹ Department of Statistics and Business Analytics, UAE University, Al Ain P.O. Box 15551, United Arab Emirates

² Département de Mathématiques et d’Informatique, Université du Québec à Trois-Rivières, Trois-Rivières, QC G9A 5H7, Canada

* Author to whom correspondence should be addressed.

Mathematics 2024, 12(22), 3525; https://doi.org/10.3390/math12223525

This article belongs to the Section D1: Probability and Statistics

Abstract

In this paper, we focused on developing copula-based modeling procedures that effectively capture the dependence between response and explanatory variables. Building upon the work of Noh et al. (J. Am. Stat. Assoc. 2013, 108, 676–688) we extended copula-based regression to accommodate both continuous and discrete covariates. Specifically, we explored the construction of copulas to estimate the conditional mean of the response variable given the covariates, elucidating the relationship between copula structures and marginal distributions. We considered various estimation methods for copulas and distribution functions, presenting a diverse array of estimators for the conditional mean function. These estimators range from non-parametric to semi-parametric and fully parametric, offering flexibility in modeling regression relationships. An adapted algorithm is applied to construct copulas and simulations are carried out to replicate datasets, estimate prediction model parameters, and compare with the OLS method. The practicality and efficacy of our proposed methodologies, grounded in the principles of copula-based regression, are substantiated through methodical simulation studies.

Keywords:

least squares regression; copulas; copula-based regression; archimedean copulas; gaussian copula; IMSE

MSC:

60E05; 62H05; 62H20; 62J05; 62G05; 62G08

1. Introduction

Scientists often need to study the relationships and dependencies between a response variable and several covariates. Regression analysis is the standard statistical tool for investigating such relationships, and it is one of the most commonly used statistical methods in many scientific fields, such as medicine, biology, agriculture, economics, engineering, and sociology. In medical research, econometrics, and other fields, regression analysis is routinely used to interpret the association between different variables. However, the basic form of regression analysis is not suitable in many cases, where the relationships are often non-linear and the distribution of the response variable may be non-normal.

For such dependence modeling problems, we attempt to provide a functional form that will summarize the relationship between response and explanatory variables. In several practical situations, as an example, a vector of covariates

X = (X_{1}, \dots, X_{d})

is used to explain, interpret, or predict the response variable Y. This is encountered in many fields, including medicine and social science. The type of functional relationship we attempt to identify could depend on the marginal behavior of the variables or on their joint behavior. In this paper, we consider the construction of dependence modeling procedures based on the separation of these two behaviors when the covariates are a mixture of continuous and discrete variables.

In this context, we consider procedures that allow the representation of a multivariate distribution as a function of its univariate marginals through a connection function called a copula. Copulas have become increasingly popular for modeling statistical dependence in multivariate data and have been applied in various areas, including medical research, environmental science, econometrics, actuarial science, agronomy, and others. A key feature of copulas is that they provide flexible representations of the multivariate distribution by allowing the dependence structure of the variables of interest to be modeled separately from the marginal structure; by specifying a copula, we summarize all the dependencies between margins (see Nelsen [1] for more about this subject).

The power of this approach lies principally in the practitioner's ability to model the dependence structure independently of the marginal behaviors. Further advantages of copulas in modeling are that they can capture both linear and non-linear dependence, allow an arbitrary choice of marginal distributions, and can model behavior at extreme endpoints. Above all, the principal advantage of a copula regression is that it places no restrictions on, and requires no specification of, the probability distributions that can be used.

Copula-based regression models offer significant advantages in capturing complex dependencies between variables, which makes them useful in various fields. In finance, they support portfolio risk management by modeling non-linear dependencies between asset returns and macroeconomic factors, especially during market downturns. In insurance, copula-based regression can be applied to explain pricing in terms of different dependent claim characteristics, such as frequency and severity. In environmental studies, copula-based regression is useful for establishing the relationship between rainfall and river discharge, especially in the case of non-linear dependence. In healthcare, copula-based regression enables researchers to examine how lifestyle factors influence health outcomes, such as cholesterol levels, while capturing the potential interdependence among these health indicators.

In the literature, there exist many recent studies of regression based on copulas; as examples, we cite Sheikhi et al. [2] and Ali et al. [3] among others. As a new contribution to this domain, we consider in this paper the estimation problem of the mean regression function for a regression model, where

X = {(X_{1}, \dots, X_{d})}^{T}

is a random vector of dimension

d \geq 1

and Y is a random variable with cumulative distribution function (c.d.f.)

F_{0}

and density function

f_{0}

. Y is the response variable and

X

is the set of covariates. We denote by

F_{i}

the c.d.f. of the variables

X_{i}

and we denote by

f_{i}

its corresponding density. For a given

x = {(x_{1}, \dots, x_{d})}^{T}

, we will note by

F (x)

the shortcut for

(F_{1} (x_{1}), \dots, F_{d} (x_{d}))

. From the inspiring work of Sklar [4], the c.d.f. of

{(Y, X^{T})}^{T}

evaluated at

{(y, x^{T})}^{T}

can be expressed in terms of

C (F_{0} (y), F (x))

, where C is the copula distribution of

{(Y, X^{T})}^{T}

, that is, the function from

{[0, 1]}^{d + 1}

to

[0, 1]

defined by

C (u_{0}, u_{1}, \dots, u_{d}) = P (F_{0} (Y) \leq u_{0}, F_{1} (X_{1}) \leq u_{1}, \dots, F_{d} (X_{d}) \leq u_{d}) .

Recently, Noh et al. [5] exploited the above decomposition to introduce a novel idea consisting of expressing the mean regression function

m (x)

, in terms of the copula and margins of

x

as follows.

m (x) = \frac{E [Y c (F_{0} (Y), F (x))]}{c_{X} (F (x))} .

(1)

where

c (u_{0}, v) \equiv c (u_{0}, v_{1}, \dots, v_{d})

is the copula density corresponding to C and

c_{X} (v) \equiv c_{X} (v_{1}, \dots, v_{d})

is the copula density of

X

. This shows that the mean regression function

m (x)

is the ratio of a numerator that only captures the mean dependence between Y and X and a denominator that captures the dependence within X. It is worth mentioning that the formula is only valid when the covariates are continuous. A new reformulation is needed when the covariates are not all continuous, which is the case for many real-world applications, especially in medicine.
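As a quick numerical illustration of (1), the following sketch assumes a single continuous covariate (d = 1), standard normal margins for Y and X, and a bivariate Gaussian copula with correlation ρ. These choices, and all function names, are assumptions made for illustration only. In this setting the copula density of X is identically 1 and the exact regression function is m(x) = ρx, so the Monte Carlo average should be close to that value.

```python
import numpy as np

def gaussian_copula_density_z(z0, z1, rho):
    """Bivariate Gaussian copula density written in normal scores z = Phi^{-1}(u)."""
    num = 2.0 * rho * z0 * z1 - rho**2 * (z0**2 + z1**2)
    return np.exp(num / (2.0 * (1.0 - rho**2))) / np.sqrt(1.0 - rho**2)

def copula_mean_regression(x, rho, n_draws=400_000, seed=0):
    """Monte Carlo version of m(x) = E[Y c(F0(Y), F1(x))] / c_X(F1(x)) for d = 1."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(n_draws)      # draws from F0 (standard normal, assumed)
    # With standard normal margins, Phi^{-1}(F0(y)) = y and Phi^{-1}(F1(x)) = x.
    weights = gaussian_copula_density_z(y, x, rho)
    return float(np.mean(y * weights))    # c_X = 1 when there is a single covariate

m_hat = copula_mean_regression(x=1.0, rho=0.5)   # should be close to rho * x = 0.5
```

The copula density acts as a weight that tilts the marginal law of Y toward the conditional law of Y given X = x, which is the mechanism behind formula (1).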

Furthermore, Noh et al. [5] proposed a semi-parametric estimator for the regression function given in (1). Specifically, they utilized the inference function for margins (IFM) technique to estimate the copula-based regression curve. This method proceeds in two stages: first, it estimates the marginal parameters, and then it estimates the corresponding dependence parameter. These authors demonstrate, both theoretically and empirically, that the resulting estimates obtained exhibit desirable properties when the parametric copula family is adequately chosen.

Noh et al. [5] stimulated extensive research on copula-based regression. Noh et al. [6] applied the method of Noh et al. [5] to the quantile regression with i.i.d. or time series that are completely observed. De Backer et al. [7] extended the method of Noh et al. [6] to the quantile regression with censored data. Kraus and Czado [8] studied the quantile regression with complete data, using D-vine copulas. Rémillard et al. [9] discussed the asymptotic connection between the estimators of Noh et al. [6] and Kraus and Czado [8]. Chang and Joe [10] proposed an algorithm for computing the conditional distribution function via the vine copula. Furthermore, Nagler and Vatter [11] unified various copula-based regressions by formulating a general loss function which may not be continuously differentiable. Their generalized regression model includes the conditional mean regression of Noh et al. [5], the conditional quantile regression of Noh et al. [6], and the asymmetric least squares of Newey and Powell [12] as special cases. The unified framework enhances the systematic interpretation of the different existing regressions. For additional discussion into similar methods, see [13,14,15,16,17] and the literature cited therein.

As an extension of the framework by Noh et al. [5], we incorporate discrete variables into the set of covariates

X_{1}, \dots, X_{d}

. By establishing a connection with various classes of copulas through an alternative equation to (1), we calculate the conditional mean,

m (x)

, of

Y | X = x

. In this context, we develop the relationship between the copula and the marginals. Furthermore, we illustrate this relationship for specific families of copulas, such as Archimedean copulas and the Gaussian copula, highlighting their properties that are beneficial for our analysis.

The next step involves addressing the estimation problem. Here, we also adopt a semi-parametric approach along with the inference function for margins (IFM) method to estimate the proposed regression curve. First, we estimate the marginal distributions using their empirical distributions, and then we estimate the dependence parameter associated with the underlying copula. Simulation studies for different classes of copulas and different distributions of the response Y are considered to illustrate the usefulness of the findings.

The rest of the paper is organized as follows. Section 2 discusses different copula concepts in the multivariate setting. Section 3 outlines the copula-based regression model proposed for the case where the set of covariates includes both discrete and continuous variables. Section 4 covers the estimation procedure of the proposed regression model. Section 5 is dedicated to a simulation study that assesses the performance of the suggested copula-based regression. Conclusions and remarks are given in Section 6.

2. Preliminaries

Copulas are a mathematical concept used in multivariate analysis to describe the dependency structure between components of a multivariate random vector. They play a central role in various fields employing multivariate statistical analysis, such as risk management and finance. Copulas provide a framework for modeling the relationships between variables by describing their joint dependence separately from their marginal distributions.

This section provides a brief overview of the copula concept, which will be utilized in the development of the proposed model. According to Nelsen [1], a multivariate copula is defined as follows.

Definition 1.

A d-dimensional copula is a function from

{[0, 1]}^{d}

to

[0, 1]

with the following properties:

1.

For every

u = (u_{1}, \dots, u_{d}) \in {[0, 1]}^{d}

,

(i): $C (u) = 0$ if at least one coordinate of u is 0.
(ii): $C (u) = u_{k}$ if all coordinates of u are 1 except $u_{k}$.

2.

For every

a = (a_{1}, \dots, a_{d})

and

b = (b_{1}, \dots, b_{d})

in

{[0, 1]}^{d}

such that

a_{i} \leq b_{i}

,

i = 1, \dots, d

,

Δ_{a_{d}}^{b_{d}} Δ_{a_{d - 1}}^{b_{d - 1}} \dots Δ_{a_{1}}^{b_{1}} C (u) \geq 0, for all u \in {[0, 1]}^{d},

where,

Δ_{a_{k}}^{b_{k}} C (u) = C (u_{1}, \dots, u_{k - 1}, b_{k}, u_{k + 1}, \dots, u_{d}) - C (u_{1}, \dots, u_{k - 1}, a_{k}, u_{k + 1}, \dots, u_{d})
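The second property can be checked numerically by expanding the iterated differences with inclusion-exclusion over the corners of the box. The sketch below is illustrative only: it uses a three-dimensional Clayton copula with an arbitrarily chosen parameter as a convenient test case, and the function names are assumptions.

```python
from itertools import product

def clayton(u, theta=2.0):
    """d-dimensional Clayton copula; equals 0 when any coordinate is 0."""
    if min(u) <= 0.0:
        return 0.0
    return (sum(v ** (-theta) for v in u) - len(u) + 1.0) ** (-1.0 / theta)

def box_volume(C, a, b):
    """C-volume of the box [a, b]: the iterated differences written as a signed corner sum."""
    d = len(a)
    total = 0.0
    for picks in product((0, 1), repeat=d):          # 1 -> upper end b_k, 0 -> lower end a_k
        corner = [b[k] if picks[k] else a[k] for k in range(d)]
        sign = (-1) ** (d - sum(picks))              # one minus sign per lower endpoint
        total += sign * C(corner)
    return total

vol = box_volume(clayton, a=[0.2, 0.1, 0.3], b=[0.7, 0.6, 0.9])
full = box_volume(clayton, a=[0.0, 0.0, 0.0], b=[1.0, 1.0, 1.0])
```

Here `vol` is the probability that a vector with this copula falls in the box, so it must be nonnegative, and `full` is the mass of the whole unit cube.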

Sklar’s Theorem is a fundamental result in copula theory. It enables us to express the joint distribution of a multivariate random vector in terms of their marginal distributions and a copula function. It can be stated as follows (see, Nelsen [1]).

Theorem 1.

Let H be a d-dimensional distribution function with marginal distributions

F_{1}, F_{2}, \dots, F_{d}

. Then, there exists a d-copula C such that for all

(x_{1}, \dots, x_{d}) \in R^{d}

,

H (x_{1}, \dots, x_{d}) = C (F_{1} (x_{1}), F_{2} (x_{2}), \dots, F_{d} (x_{d})) .

If

F_{1}, F_{2}, \dots, F_{d}

are all continuous, then C is unique; otherwise, C is uniquely determined on

Ran (F_{1}) \times \dots \times Ran (F_{d})

. Conversely, if C is a d-copula and

F_{1}, F_{2}, \dots, F_{d}

are distribution functions, then the function H defined by the above equation is a d-dimensional distribution function with margins

F_{1}, F_{2}, \dots, F_{d}

.
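The converse part of Sklar's Theorem can be illustrated with one continuous and one discrete margin, which is the mixed situation studied later in the paper. The sketch below is a minimal example under assumed ingredients: a bivariate Clayton copula, an exponential margin, and a Poisson margin, all chosen arbitrarily. For the discrete coordinate, point masses are obtained as differences of H, and summing them over the support recovers the continuous margin.

```python
import math

def clayton2(u, v, theta=1.5):
    """Bivariate Clayton copula (illustrative choice of copula and parameter)."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

def exp_cdf(x, rate=1.0):
    """Continuous margin F1: exponential distribution function."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def poisson_cdf(k, lam=2.0):
    """Discrete margin F2: Poisson distribution function on {0, 1, 2, ...}."""
    if k < 0:
        return 0.0
    return sum(math.exp(-lam) * lam ** j / math.factorial(j) for j in range(int(k) + 1))

def H(x1, x2):
    """Joint distribution function built as C(F1(x1), F2(x2))."""
    return clayton2(exp_cdf(x1), poisson_cdf(x2))

x1 = 0.8
# P(X1 <= x1, X2 = k) is a first difference of H in the discrete coordinate.
mass = [H(x1, k) - H(x1, k - 1) for k in range(0, 60)]
```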

It is well-known that Sklar’s Theorem has numerous practical applications in various fields involving multivariate data analysis. For instance, Sklar’s Theorem is commonly employed to analyze dependencies among different financial assets. It enables us to understand how the dependence structure between the prices of different assets might affect the overall risk of a portfolio.

2.1. Archimedean Copulas

Archimedean copulas constitute an important class of parametric copulas. This type of copula describes the dependence structure between random variables with greater flexibility through a single function called the generator. The latter is often expressed in terms of dependence parameters that control the strength of dependence among the components of a given random vector.

The generator of a d-dimensional Archimedean copula is a continuous and strictly decreasing function

ϕ

defined from

[0, 1]

to

[0, \infty]

such that

ϕ (1) = 0

and

ϕ (0) = \infty

. Let

ψ

be its inverse, that is,

ψ = ϕ^{- 1}

, which maps

[0, \infty)

to

[0, 1]

with

ψ (0) = 1

and

ψ (\infty) = 0

, and suppose that

ψ

is differentiable up to the order

d - 2

with derivatives noted by

ψ^{(i)}

for

i = 1, \dots, d - 2

. Hereafter is the definition of a multivariate Archimedean copula. For details on this subject, see McNeil and Nešlehová [18].

Definition 2.

The d-dimensional Archimedean copula

C_{ϕ}

is defined through its generator ϕ as follows:

C_{ϕ} (u_{1}, \dots, u_{d}) = ψ (ϕ (u_{1}) + \dots + ϕ (u_{d})), for all (u_{1}, \dots, u_{d}) \in {[0, 1]}^{d},

where the inverse generator ψ is subject to the conditions that

{(- 1)}^{i} ψ^{(i)} (x) \geq 0

for

i = 1, \dots, d - 2

, and

{(- 1)}^{d - 2} ψ^{(d - 2)}

is non-increasing and convex.

Hereafter, we present the Clayton copula and the Frank copula, both considered among the most popular multivariate Archimedean copulas. The d-dimensional Clayton copula is defined as follows:

C_{ϕ} (u_{1}, \dots, u_{d}) = {(u_{1}^{- θ} + \dots + u_{d}^{- θ} - d + 1)}^{- \frac{1}{θ}}, for all (u_{1}, \dots, u_{d}) \in {[0, 1]}^{d}

(2)

It is an Archimedean copula whose inverse generator function is defined, for all

t > 0

, by

ψ_{θ} (t) = {(1 + θ t)}^{- \frac{1}{θ}}, for all θ \in (0, \infty) .

(3)

Likewise, the d-dimensional Frank family copula is expressed, for all

(u_{1}, \dots, u_{d}) \in {[0, 1]}^{d}

, by:

C_{ϕ} (u_{1}, \dots, u_{d}) = - \frac{1}{θ} ln (1 + \frac{(e^{- θ u_{1}} - 1) (e^{- θ u_{2}} - 1) \dots (e^{- θ u_{d}} - 1)}{{(e^{- θ} - 1)}^{d - 1}}), θ > 0 .

(4)

Its generator function is given, for all

t \in (0, 1]

, by

ϕ_{θ} (t) = - ln (\frac{e^{- θ t} - 1}{e^{- θ} - 1}), θ > 0 .

(5)

For

d \geq 3

, the Frank copula describes only the positive dependence, whereas in the two-dimensional case, this copula models both positive and negative association.
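The closed forms (2) and (4) can be cross-checked against the generator construction. The sketch below uses the neutral names `gen` (a map from (0, 1] to [0, ∞)) and `gen_inv` (its inverse, from [0, ∞) to (0, 1]); these names, the Frank inverse obtained by solving (5) for t, and the test point are assumptions made for illustration.

```python
import math

def clayton_gen(t, theta):
    return (t ** (-theta) - 1.0) / theta

def clayton_gen_inv(s, theta):
    return (1.0 + theta * s) ** (-1.0 / theta)          # the function in (3)

def frank_gen(t, theta):
    return -math.log(math.expm1(-theta * t) / math.expm1(-theta))   # the function in (5)

def frank_gen_inv(s, theta):
    return -math.log1p(math.exp(-s) * math.expm1(-theta)) / theta

def archimedean(u, gen, gen_inv, theta):
    """Copula value built as gen_inv(gen(u_1) + ... + gen(u_d))."""
    return gen_inv(sum(gen(v, theta) for v in u), theta)

def clayton_closed(u, theta):
    """Closed form (2)."""
    return (sum(v ** (-theta) for v in u) - len(u) + 1.0) ** (-1.0 / theta)

def frank_closed(u, theta):
    """Closed form (4)."""
    prod = math.prod(math.expm1(-theta * v) for v in u)
    return -math.log1p(prod / math.expm1(-theta) ** (len(u) - 1)) / theta

u, theta = [0.3, 0.6, 0.8], 2.0
clayton_pair = (archimedean(u, clayton_gen, clayton_gen_inv, theta), clayton_closed(u, theta))
frank_pair = (archimedean(u, frank_gen, frank_gen_inv, theta), frank_closed(u, theta))
```

Each pair should agree up to rounding, which confirms that the two functions listed for each family are inverses of one another and generate the stated copula.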

2.2. Gaussian Copulas

The d-Gaussian copula

C_{Σ}

is defined through the standardized d-multivariate normal distribution

Φ_{Σ}

. The correlation matrix

Σ

represents the dependence parameters of this copula. Specifically,

C_{Σ}

is expressed by

C_{Σ} (u_{1}, \dots, u_{d}) = Φ_{Σ} (Φ^{- 1} (u_{1}), \dots, Φ^{- 1} (u_{d})) for all (u_{1}, \dots, u_{d}) \in {[0, 1]}^{d},

where

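Φ

denotes the standard normal distribution function. In other words, the multivariate Gaussian copula is explicitly given by

C_{Σ} (u_{1}, \dots, u_{d}) = \int_{- \infty}^{Φ^{- 1} (u_{1})} \dots \int_{- \infty}^{Φ^{- 1} (u_{d})} \frac{1}{{(2 π)}^{d / 2} \sqrt{\det (Σ)}} exp \{- \frac{1}{2} x^{⊤} Σ^{- 1} x\} d x,

and the corresponding copula density, evaluated at the vector of normal scores z = (Φ^{- 1} (u_{1}), \dots, Φ^{- 1} (u_{d})), is {\det (Σ)}^{- 1 / 2} exp \{- \frac{1}{2} z^{⊤} (Σ^{- 1} - I) z\}, where I is the identity matrix. The bivariate Gaussian copula is reduced to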

C_{ρ} (u_{1}, u_{2}) = \int_{- \infty}^{Φ^{- 1} (u_{1})} \int_{- \infty}^{Φ^{- 1} (u_{2})} \frac{1}{2 π \sqrt{1 - ρ^{2}}} exp \{- \frac{x^{2} - 2 ρ x y + y^{2}}{2 (1 - ρ^{2})}\} d x d y,

where

ρ

represents the Pearson correlation coefficient, a parameter within the range

[- 1, 1]

, serving as the dependence parameter for this copula.
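A Gaussian copula sample can be generated by transforming correlated normal draws with Φ. The sketch below is illustrative: the correlation value, sample size, and function name are assumptions. It compares the empirical value of the copula at (1/2, 1/2) with the standard orthant formula 1/4 + arcsin(ρ)/(2π) for the bivariate normal distribution.

```python
import math
import numpy as np

def sample_gaussian_copula(n, corr, seed=0):
    """Draw n points from the Gaussian copula with correlation matrix corr."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(corr)
    z = rng.standard_normal((n, corr.shape[0])) @ chol.T     # correlated normal scores
    std_normal_cdf = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
    return std_normal_cdf(z)                                  # uniform margins

rho = 0.6
u = sample_gaussian_copula(200_000, np.array([[1.0, rho], [rho, 1.0]]))
empirical = float(np.mean((u[:, 0] <= 0.5) & (u[:, 1] <= 0.5)))
exact = 0.25 + math.asin(rho) / (2.0 * math.pi)
```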

3. Model Description

Starting from a random vector

(Y, X)

, where Y is a continuous random variable with cumulative distribution function

F_{0}

, assume that

X = (X_{q}, X_{d - q})

, where the random vectors

X_{q} = (X_{1}, \dots, X_{q})

and

X_{d - q} = (X_{q + 1}, \dots, X_{d})

are continuous and discrete, respectively, and suppose without loss of generality that, for any

i = q + 1, \dots, d

,

Ran (X_{i}) \subset N = {0, 1, 2, \dots}

. Denote by

F_{1}, \dots, F_{d}

the distribution functions of

X_{1}, \dots, X_{d}

, respectively, and, for all

x_{d} \in R^{q} \times Π_{q + 1}^{d} Ran (X_{i})

, let

F_{d} (x) = (F_{1} (x_{1}), \dots, F_{d} (x_{d}))

. Let C and

C_{X}

be the copulas of

(Y, X)

and

X

, respectively. Moreover, for

(u, v_{d}) = (u, v_{1}, \dots, v_{d}) \in {(0, 1)}^{d + 1}

, set,

\begin{matrix} \partial_{q + 1} C (u, v_{d}) = \frac{\partial^{q + 1}}{\partial u \partial v_{1} \dots \partial v_{q}} C (u, v_{d}) and \partial_{q} C_{X} (v_{d}) = \frac{\partial^{q}}{\partial v_{1} \dots \partial v_{q}} C_{X} (v_{d}) . \end{matrix}

(6)

For

i = q + 1, \dots, d

, set

δ_{i} = (δ_{i, 1}, \dots, δ_{i, d})

such that

δ_{i, i} = 1

and

δ_{i, j} = 0

, for

i \neq j

. For

i = q + 1, \dots, d

, let

Δ_{i}

be the forward difference operator defined by

\begin{matrix} Δ_{i} \partial_{q + 1} C (F_{0} (k), F_{d} (x_{d})) & = & \partial_{q + 1} C (F_{0} (k), F_{d} (x_{d})) - \partial_{q + 1} C (F_{0} (k), F_{d} (x_{d} - δ_{i})), \\ Δ_{i} \partial_{q} C_{X} (F_{d} (x_{d})) & = & \partial_{q} C_{X} (F_{d} (x_{d})) - \partial_{q} C_{X} (F_{d} (x_{d} - δ_{i})) . \end{matrix}

and set

Δ_{q}^{d} = \prod_{i = q + 1}^{d} Δ_{i}

.

Proposition 1.

For all

x \in R^{q} \times Π_{q + 1}^{d} Ran (X_{i})

, the conditional mean of Y given

X = x

is expressed by,

\begin{matrix} m (x) = \frac{E \{Y Δ_{q}^{d} \partial_{q + 1} C (F_{0} (Y), F_{d} (x))\}}{Δ_{q}^{d} \partial_{q} C_{X} (F_{d} (x))} . \end{matrix}

(7)

Proof.

For all

x \in R^{q} \times Π_{q + 1}^{d} Ran (X_{i})

, let

f (. | x)

be the conditional density function of Y given

X = x

. Clearly, one has

\begin{matrix} m (x) & = & E (Y | X = x) \\ & = & \int y f (y | x) d y \\ & = & \int \frac{y Δ_{q}^{d} \partial_{q + 1} C (F_{0} (y), F_{d} (x)) f_{0} (y) f_{1} (x_{1}) \dots f_{q} (x_{q})}{Δ_{q}^{d} \partial_{q} C_{X} (F_{d} (x)) f_{1} (x_{1}) \dots f_{q} (x_{q})} d y \\ & = & \int \frac{y Δ_{q}^{d} \partial_{q + 1} C (F_{0} (y), F_{d} (x)) f_{0} (y)}{Δ_{q}^{d} \partial_{q} C_{X} (F_{d} (x))} d y \\ & = & \frac{E \{Y Δ_{q}^{d} \partial_{q + 1} C (F_{0} (Y), F_{d} (x))\}}{Δ_{q}^{d} \partial_{q} C_{X} (F_{d} (x))} . \end{matrix}

□

Remark 1.

A developed expression of the conditional mean can be obtained by using the expanded expression of

Δ_{q}^{d}

described by

\begin{matrix} Δ_{q}^{d} = 1 + \sum_{S \subset {1, \dots, d - q}} {(- 1)}^{| S |} \prod_{i \in S} T_{i}, \end{matrix}

(8)

where

T_{i} = 1 - Δ_{i}

and

| S |

denotes the cardinality number of any nonempty subset S of

{1, \dots, d - q}

. Therefore, one sees from (7) and (8) that

m (x)

can be expressed by,

\begin{matrix} \frac{E \{Y \partial_{q + 1} C (F_{0} (Y), F_{d} (x_{d}))\} + \sum_{S \subset {1, \dots, d - q}} {(- 1)}^{| S |} E \{Y \prod_{i \in S} T_{i} \partial_{q + 1} C (F_{0} (Y), F_{d} (x_{d}))\}}{\partial_{q} C_{X} (F_{d} (x)) + \sum_{S \subset {1, \dots, d - q}} {(- 1)}^{| S |} \prod_{i \in S} T_{i} \partial_{q} C_{X} (F_{d} (x))}, \end{matrix}

(9)

where

\begin{matrix} T_{i} \partial_{q + 1} C (F_{0} (k), F_{d} (x_{d})) & = & \partial_{q + 1} C (F_{0} (k), F_{d} (x_{d} - δ_{i})), \\ T_{i} \partial_{q} C_{X} (F_{d} (x_{d})) & = & \partial_{q} C_{X} (F_{d} (x_{d} - δ_{i})) . \end{matrix}
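The expansion of the product of difference operators into a signed sum over subsets of backward shifts can be sketched in a few lines. The code below is an illustration under assumed names: `mixed_difference` sums over all subsets S (the empty set gives the leading term), and `iterated_difference` applies the one-coordinate differences one after another. The test function is the joint distribution function of two independent geometric-type variables, for which the mixed difference must return the joint probability mass.

```python
from itertools import combinations

def mixed_difference(g, x):
    """Apply prod_i Delta_i to g at the integer point x, where Delta_i g(x) = g(x) - g(x - e_i)."""
    m = len(x)
    total = 0.0
    for size in range(m + 1):
        for S in combinations(range(m), size):
            shifted = [x[i] - 1 if i in S else x[i] for i in range(m)]   # prod_{i in S} T_i
            total += (-1) ** size * g(shifted)
    return total

def iterated_difference(g, x, i=0):
    """Same quantity computed by applying Delta_i coordinate by coordinate."""
    if i == len(x):
        return g(list(x))
    lower = list(x)
    lower[i] -= 1
    return iterated_difference(g, x, i + 1) - iterated_difference(g, lower, i + 1)

def geom_cdf(k):
    """P(K <= k) for K with probability mass 0.5^(k+1) on k = 0, 1, 2, ..."""
    return 0.0 if k < 0 else 1.0 - 0.5 ** (k + 1)

def joint_cdf(x):
    """Independent coordinates, so the joint distribution function factorises."""
    out = 1.0
    for k in x:
        out *= geom_cdf(k)
    return out

pmf_value = mixed_difference(joint_cdf, [1, 2])      # P(K1 = 1, K2 = 2) = 0.25 * 0.125
pmf_check = iterated_difference(joint_cdf, [1, 2])
```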

3.1. Archimedean Copula-Based Predicted Mean

Suppose that the dependence structure of

(Y, X)

is described by an Archimedean class of copulas C with generator

ϕ

. This means the copulas C and

C_{X}

are expressed, for all

(u, v_{d}) \in {(0, 1)}^{d + 1}

, by

C (u, v_{d}) = ψ (ϕ (u) + \sum_{j = 1}^{d} ϕ (v_{j})) and C_{X} (v_{d}) = ψ (\sum_{j = 1}^{d} ϕ (v_{j})),

where the function

ψ

, defined on [0, \infty), represents the inverse of the generator

ϕ

, defined on [0, 1]. Therefore, the partial derivatives of the copulas C and

C_{X}

are given by

\partial_{q + 1} C (u, v_{d}) = ψ^{(q + 1)} (ϕ (u) + \sum_{j = 1}^{d} ϕ (v_{j})) ϕ^{'} (u) \prod_{j = 1}^{q} ϕ^{'} (v_{j}),

and

\partial_{q} C_{X} (v_{d}) = ψ^{(q)} (\sum_{j = 1}^{d} ϕ (v_{j})) \prod_{j = 1}^{q} ϕ^{'} (v_{j}) .

Hence, the regression curve is given by

\begin{matrix} m (x) = \frac{E \{Y ϕ^{'} (F_{0} (Y)) Δ_{q}^{d} ψ^{(q + 1)} (ϕ (F_{0} (Y)) + \sum_{j = 1}^{d} ϕ (F_{j} (x_{j})))\}}{Δ_{q}^{d} ψ^{(q)} (\sum_{j = 1}^{d} ϕ (F_{j} (x_{j})))} . \end{matrix}

(10)

To exemplify the above conditional mean, let us examine the scenario where

d = 2

and

q = 1

, implying that covariate

X_{1}

is continuous, while covariate

X_{2}

is discrete. In such instances, we have:

\begin{matrix} m (x) & = & \frac{E \{Y ϕ^{'} (F_{0} (Y)) Δ_{1}^{2} ψ^{″} \{ϕ (F_{0} (Y)) + ϕ (F_{1} (x_{1})) + ϕ (F_{2} (x_{2}))\}\}}{Δ_{1}^{2} ψ^{'} \{ϕ (F_{1} (x_{1})) + ϕ (F_{2} (x_{2}))\}}, \end{matrix}

(11)

where

\begin{matrix} Δ_{1}^{2} ψ^{″} \{ϕ (F_{0} (Y)) + ϕ (F_{1} (x_{1})) + ϕ (F_{2} (x_{2}))\} & = & ψ^{″} \{ϕ (F_{0} (Y)) + ϕ (F_{1} (x_{1})) + ϕ (F_{2} (x_{2}))\} \\ & & - ψ^{″} \{ϕ (F_{0} (Y)) + ϕ (F_{1} (x_{1})) + ϕ (F_{2} (x_{2} - 1))\}, \end{matrix}

and

\begin{matrix} Δ_{1}^{2} ψ^{'} \{ϕ (F_{1} (x_{1})) + ϕ (F_{2} (x_{2}))\} & = & ψ^{'} \{ϕ (F_{1} (x_{1})) + ϕ (F_{2} (x_{2}))\} - ψ^{'} \{ϕ (F_{1} (x_{1})) + ϕ (F_{2} (x_{2} - 1))\} . \end{matrix}

Example 1.

Illustrating Equation (11), let us assume that

(Y, X_{1}, X_{2})

follows the Clayton copula

C_{θ}

as described in (2). Specifically, for all

θ \in (0, \infty)

,

C_{θ} (u, v_{1}, v_{2}) = {(u^{- θ} + v_{1}^{- θ} + v_{2}^{- θ} - 2)}^{- \frac{1}{θ}}, for all (u, v_{1}, v_{2}) \in {[0, 1]}^{3} .

The generator

ϕ_{θ}

of this copula and its inverse

ψ_{θ}

given in (3) satisfy

ϕ_{θ} (t) = \frac{t^{- θ} - 1}{θ}, ψ_{θ}^{'} (t) = - {(1 + θ t)}^{- \frac{1}{θ} - 1}, ψ_{θ}^{″} (t) = (θ + 1) {(1 + θ t)}^{- \frac{1}{θ} - 2} .

Hence, standard calculations show that (11) reduces to

\begin{matrix} m (x) & = & \frac{(1 + θ) E \{Y F_{0} {(Y)}^{- θ - 1} Δ_{1}^{2} {(F_{0} {(Y)}^{- θ} + F_{1} {(x_{1})}^{- θ} + F_{2} {(x_{2})}^{- θ} - 2)}^{- \frac{1}{θ} - 2}\}}{Δ_{1}^{2} {(F_{1} {(x_{1})}^{- θ} + F_{2} {(x_{2})}^{- θ} - 1)}^{- \frac{1}{θ} - 1}} . \end{matrix}

(12)

Likewise, let us express Equation (11) when

(Y, X_{1}, X_{2})

follows the Frank copula

C_{θ}

expressed in (4), namely,

C_{θ} (u, v_{1}, v_{2}) = - \frac{1}{θ} ln (1 + \frac{(e^{- θ u} - 1) (e^{- θ v_{1}} - 1) (e^{- θ v_{2}} - 1)}{{(e^{- θ} - 1)}^{2}}), θ > 0 .

Calculations similar to those used previously lead to

\begin{matrix} m (x) & = & \frac{E \{Y Δ_{1}^{2} K_{1, θ} (F_{0} (Y), F_{1} (x_{1}), F_{2} (x_{2}))\}}{Δ_{1}^{2} K_{2, θ} (F_{1} (x_{1}), F_{2} (x_{2}))}, \end{matrix}

(13)

where

\begin{matrix} K_{1, θ} (u, v_{1}, v_{2}) = \frac{\partial^{2}}{\partial u \partial v_{1}} C_{θ} (u, v_{1}, v_{2}) = - \frac{θ {(1 - e^{- θ})}^{2} e^{- θ u} e^{- θ v_{1}} (e^{- θ v_{2}} - 1)}{{[{(1 - e^{- θ})}^{2} + (e^{- θ u} - 1) (e^{- θ v_{1}} - 1) (e^{- θ v_{2}} - 1)]}^{2}}, \end{matrix}

(14)

and

\begin{matrix} K_{2, θ} (v_{1}, v_{2}) = \frac{\partial}{\partial v_{1}} C_{θ} (1, v_{1}, v_{2}) = \frac{e^{- θ v_{1}} (e^{- θ v_{2}} - 1)}{e^{- θ} - 1 + (e^{- θ v_{1}} - 1) (e^{- θ v_{2}} - 1)} . \end{matrix}

(15)
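The Clayton formula (12) can be checked by simulation. The sketch below is a minimal illustration under stated assumptions: Y and X_1 have uniform margins on (0, 1), X_2 is binary with P(X_2 = 0) = 0.4, θ = 2, the expectation over Y is computed with a midpoint rule, and the sample is drawn with the gamma-frailty (Marshall-Olkin) construction. None of these choices come from the paper. The simulated conditional mean uses a narrow window around x_1, so agreement is only approximate.

```python
import numpy as np

THETA, P0 = 2.0, 0.4          # assumed dependence parameter and P(X2 = 0)

def F2(k):
    """Distribution function of the binary covariate X2 (illustrative)."""
    return 0.0 if k < 0 else (P0 if k < 1 else 1.0)

def num_term(y, x1, v2):
    """(F0(y)^-t + F1(x1)^-t + v2^-t - 2)^(-1/t - 2); its limit is 0 as v2 -> 0."""
    if v2 == 0.0:
        return np.zeros_like(y)
    return (y**-THETA + x1**-THETA + v2**-THETA - 2.0) ** (-1.0 / THETA - 2.0)

def den_term(x1, v2):
    """(F1(x1)^-t + v2^-t - 1)^(-1/t - 1); its limit is 0 as v2 -> 0."""
    if v2 == 0.0:
        return 0.0
    return (x1**-THETA + v2**-THETA - 1.0) ** (-1.0 / THETA - 1.0)

def m_clayton(x1, x2, n_grid=20_000):
    """Formula (12) with uniform F0 and F1; E over Y ~ U(0, 1) by the midpoint rule."""
    y = (np.arange(n_grid) + 0.5) / n_grid
    diff = num_term(y, x1, F2(x2)) - num_term(y, x1, F2(x2 - 1))
    numerator = (1.0 + THETA) * np.mean(y * y ** (-THETA - 1.0) * diff)
    return float(numerator / (den_term(x1, F2(x2)) - den_term(x1, F2(x2 - 1))))

def simulate(n, seed=1):
    """Trivariate Clayton sample via the gamma-frailty construction."""
    rng = np.random.default_rng(seed)
    frailty = rng.gamma(1.0 / THETA, 1.0, size=(n, 1))
    u = (1.0 + rng.exponential(size=(n, 3)) / frailty) ** (-1.0 / THETA)
    return u[:, 0], u[:, 1], (u[:, 2] > P0).astype(int)

y_sim, x1_sim, x2_sim = simulate(400_000)
near = np.abs(x1_sim - 0.5) < 0.02                     # window around x1 = 0.5
checks = {k: (m_clayton(0.5, k), float(y_sim[near & (x2_sim == k)].mean())) for k in (0, 1)}
```

Each entry of `checks` pairs the value from (12) with the windowed sample mean for x_2 = 0 and x_2 = 1.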

3.2. Gaussian Copula-Based Predicted Mean

This section presents the expression of the regression curve when the copula C of

(Y, X)

is Gaussian. This means that the copula C is expressed in terms of the standardized

(d + 1)

-multivariate normal distribution

Φ_{Σ}

and the correlation matrix

Σ

, which is assumed to be non-singular, as follows.

C (u, v; Σ) = Φ_{Σ} (Φ^{- 1} (u), Φ^{- 1} (v)) \forall (u, v) \in {[0, 1]}^{d + 1},

where

Φ

denotes the standard normal distribution and where we note that

Φ^{- 1} (v) = (Φ^{- 1} (v_{1}), Φ^{- 1} (v_{2}), \dots, Φ^{- 1} (v_{d})) .

To derive

\partial_{q + 1} C (u, v_{d})

and

\partial_{q} C_{X} (v_{d})

, let us first decompose the correlation matrix

Σ

as follows.

Σ = (\begin{matrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{matrix}), with sizes (\begin{matrix} (q + 1) \times (q + 1) & (q + 1) \times (d - q) \\ (d - q) \times (q + 1) & (d - q) \times (d - q) \end{matrix}),

where

Σ_{11}

and

Σ_{22}

represent the correlation matrices of the

(q + 1)

-continuous random vector

(Y, X_{q})

and the

(d - q)

-discrete random vector

X_{d - q}

, respectively. Furthermore,

Σ_{12}

denotes the correlation matrix between the random vectors

X_{d - q}

and

(Y, X_{q})

and

Σ_{21} = {Σ_{12}}^{⊤}

.

Consider the

(d + 1)

-uniform random vector

(U, V_{1}, \dots, V_{d})

with distribution C, and set

V_{q} = (V_{1}, \dots, V_{q})

and

V_{d - q} = (V_{q + 1}, \dots, V_{d})

. Therefore, one observes, for all

(u, v_{q}, v_{d - q}) \in (0, 1) \times {(0, 1)}^{q} \times {(0, 1)}^{d - q}

,

\begin{matrix} \partial_{q + 1} C (u, v_{d}) = P (V_{d - q} \leq v_{d - q} ∣ U = u, V_{q} = v_{q}) c_{q + 1} (u, v_{q}), \end{matrix}

where

c_{q + 1}

is the copula density of

(U, V_{q})

. Let

Φ^{- 1} (V_{d - q}) = (Φ^{- 1} (V_{q + 1}), \dots, Φ^{- 1} (V_{d}))

and

Φ^{- 1} (u, v_{q}) = (Φ^{- 1} (u), Φ^{- 1} (v_{1}), \dots, Φ^{- 1} (v_{q}))

. Since

(Φ^{- 1} (U), Φ^{- 1} (V_{1}), \dots, Φ^{- 1} (V_{d}))

is normally distributed, then the conditional random vector

Φ^{- 1} (V_{d - q}) ∣ U = u, V_{q} = v_{q}

is distributed as

N (Σ_{21} Σ_{11}^{- 1} Φ^{- 1} (u, v_{q}), Σ_{22} - Σ_{21} Σ_{11}^{- 1} Σ_{21}^{⊤}),

with distribution function

G_{q + 1} (\cdot ∣ Φ^{- 1} (u, v_{q}))

. It follows that

\begin{matrix} \partial_{q + 1} C (u, v_{d}) & = & G_{q + 1} (Φ^{- 1} (v_{d - q}) ∣ Φ^{- 1} (u, v_{q})) c_{q + 1} (u, v_{q}), \end{matrix}

where

c_{q + 1} (u, v_{q})

is the copula density of the random vector

(U, V_{q})

. It remains to derive

\partial_{q} C_{X} (v_{d})

, where

C_{X}

is the Gaussian copula of

X

with correlation matrix

\tilde{Σ} = (\begin{matrix} {\tilde{Σ}}_{11} & {\tilde{Σ}}_{12} \\ {\tilde{Σ}}_{21} & Σ_{22} \end{matrix}), with sizes (\begin{matrix} q \times q & q \times (d - q) \\ (d - q) \times q & (d - q) \times (d - q) \end{matrix}),

where

{\tilde{Σ}}_{11}

and

Σ_{22}

represent the correlation matrices of the q-continuous random vector

X_{q}

and the

(d - q)

-discrete random vector

X_{d - q}

, respectively. Similarly,

{\tilde{Σ}}_{12}

denotes the correlation matrix between the random vectors

X_{d - q}

and

X_{q}

and

{\tilde{Σ}}_{21} = {\tilde{Σ}}_{12}^{⊤}

. It follows that, for all

(v_{q}, v_{d - q}) \in {(0, 1)}^{q} \times {(0, 1)}^{d - q}

,

\begin{matrix} \partial_{q} C_{X} (v_{d}) = P (V_{d - q} \leq v_{d - q} ∣ V_{q} = v_{q}) c_{q} (v_{q}), \end{matrix}

where

c_{q}

is the copula density of the random vector

V_{q}

. The conditional random vector

Φ^{- 1} (V_{d - q}) ∣ V_{q} = v_{q}

is distributed as

N ({\tilde{Σ}}_{21} {\tilde{Σ}}_{11}^{- 1} Φ^{- 1} (v_{q}), Σ_{22} - {\tilde{Σ}}_{21} {\tilde{Σ}}_{11}^{- 1} {\tilde{Σ}}_{21}^{⊤}),

with distribution function

G_{q} (\cdot ∣ Φ^{- 1} (v_{q}))

. It follows that

\begin{matrix} \partial_{q} C_{X} (v_{d}) = G_{q} (Φ^{- 1} (v_{d - q}) ∣ Φ^{- 1} (v_{q})) c_{q} (v_{q}) . \end{matrix}

Therefore, the predicted mean is given by

\begin{matrix} m (x) = \frac{E \{Y c_{q + 1} (F_{0} (Y), F_{q} (x_{q})) Δ_{q}^{d} G_{q + 1} (Φ^{- 1} (F_{d - q} (x_{d - q})) ∣ Φ^{- 1} (F_{0} (Y), F_{q} (x_{q})))\}}{c_{q} (F_{q} (x_{q})) Δ_{q}^{d} G_{q} (Φ^{- 1} (F_{d - q} (x_{d - q})) ∣ Φ^{- 1} (F_{q} (x_{q})))} . \end{matrix}

Example 2.

Consider the case

d = 2

and

q = 1

. This means that the covariate

X_{1}

is continuous and the covariate

X_{2}

is discrete. Assume further that the copula of

(Y, X_{1}, X_{2})

is Gaussian with the correlation matrix

Σ = (\begin{matrix} 1 & ρ_{12} & ρ_{13} \\ ρ_{12} & 1 & ρ_{23} \\ ρ_{13} & ρ_{23} & 1 \end{matrix}) .

In such a case, we have,

\begin{matrix} m (x) = \frac{E \{Y c_{2} (F_{0} (Y), F_{1} (x_{1})) Δ_{1}^{2} G_{2} (Φ^{- 1} (F_{2} (x_{2})) ∣ Φ^{- 1} (F_{0} (Y), F_{1} (x_{1})))\}}{c_{1} (F_{1} (x_{1})) Δ_{1}^{2} G_{1} (Φ^{- 1} (F_{2} (x_{2})) ∣ Φ^{- 1} (F_{1} (x_{1})))}, \end{matrix}

where

c_{1} (F_{1} (x_{1})) = 1

. It remains to calculate the copula density

c_{2}

of

(U, V_{1})

and the conditional normal distribution

G_{1} (\cdot ∣ Φ^{- 1} (F_{1} (x_{1})))

and

G_{2} (\cdot ∣ Φ^{- 1} (F_{0} (Y), F_{1} (x_{1})))

. The copula of

(U, V_{1})

is Gaussian with correlation matrix

Σ_{11} = (\begin{matrix} 1 & ρ_{12} \\ ρ_{12} & 1 \end{matrix}) .

Hence, the copula density

c_{2} (F_{0} (Y), F_{1} (x_{1}))

is given by

\begin{matrix} \frac{1}{\sqrt{1 - ρ_{12}^{2}}} exp \{\frac{2 ρ_{12} Φ^{- 1} (F_{0} (Y)) Φ^{- 1} (F_{1} (x_{1})) - ρ_{12}^{2} [Φ^{- 1} {(F_{0} (Y))}^{2} + Φ^{- 1} {(F_{1} (x_{1}))}^{2}]}{2 (1 - ρ_{12}^{2})}\} . \end{matrix}

Also, we have

Σ_{12} = Σ_{21}^{⊤} = (ρ_{13}, ρ_{23}), Σ_{22} = 1, {\tilde{Σ}}_{11} = 1, {\tilde{Σ}}_{12} = {\tilde{Σ}}_{21} = ρ_{23} .

Standard calculations show that, from this,

G_{2} (\cdot ∣ Φ^{- 1} (F_{0} (Y), F_{1} (x_{1})))

is the distribution of

N (μ (Y, x_{1}), σ^{2})

, such that

\begin{matrix} μ (Y, x_{1}) = \frac{(ρ_{13} - ρ_{12} ρ_{23}) Φ^{- 1} (F_{0} (Y)) + (ρ_{23} - ρ_{12} ρ_{13}) Φ^{- 1} (F_{1} (x_{1}))}{1 - ρ_{12}^{2}}, \end{matrix}

and

\begin{matrix} σ^{2} = \frac{1 - ρ_{12}^{2} - ρ_{13}^{2} - ρ_{23}^{2} + 2 ρ_{12} ρ_{13} ρ_{23}}{1 - ρ_{12}^{2}} . \end{matrix}

Likewise,

G_{1} (\cdot ∣ Φ^{- 1} (F_{1} (x_{1})))

is the distribution of

N (\tilde{μ} (x_{1}), {\tilde{σ}}^{2})

, such that

\begin{matrix} \tilde{μ} (x_{1}) = ρ_{23} Φ^{- 1} (F_{1} (x_{1})) and {\tilde{σ}}^{2} = 1 - ρ_{23}^{2} . \end{matrix}

Therefore, the predicted mean is given by

\begin{matrix} m (x) = \frac{E \{Y c_{2} (F_{0} (Y), F_{1} (x_{1})) Δ_{1}^{2} Φ (σ^{- 1} [Φ^{- 1} (F_{2} (x_{2})) - μ (Y, x_{1})])\}}{Δ_{1}^{2} Φ ({\tilde{σ}}^{- 1} [Φ^{- 1} (F_{2} (x_{2})) - \tilde{μ} (x_{1})])}, \end{matrix}

where

\begin{matrix} Δ_{1}^{2} Φ (σ^{- 1} [Φ^{- 1} (F_{2} (x_{2})) - μ (Y, x_{1})]) & = & Φ (σ^{- 1} [Φ^{- 1} (F_{2} (x_{2})) - μ (Y, x_{1})]) \\ & & - Φ (σ^{- 1} [Φ^{- 1} (F_{2} (x_{2} - 1)) - μ (Y, x_{1})]), \end{matrix}

and

\begin{matrix} Δ_{1}^{2} Φ ({\tilde{σ}}^{- 1} [Φ^{- 1} (F_{2} (x_{2})) - \tilde{μ} (x_{1})]) & = & Φ ({\tilde{σ}}^{- 1} [Φ^{- 1} (F_{2} (x_{2})) - \tilde{μ} (x_{1})]) \\ & & - Φ ({\tilde{σ}}^{- 1} [Φ^{- 1} (F_{2} (x_{2} - 1)) - \tilde{μ} (x_{1})]) . \end{matrix}

Example 3.

As a continuation of Example 2, in order to give a closed form for the conditional mean

m (x)

, we consider the case where the variables Y and

X_{1}

are distributed as standard normal and where the correlation matrix of the Gaussian copula is determined by

ρ_{12} = 0

and

ρ_{13} = ρ_{23} = 0.6

. Thus, the conditional expectation is given by:

\begin{matrix} m (x) = \frac{E [Y \{Φ (1.89 [Φ^{- 1} (F_{2} (x_{2})) - 0.6 Y - 0.6 x_{1}]) - Φ (1.89 [Φ^{- 1} (F_{2} (x_{2} - 1)) - 0.6 Y - 0.6 x_{1}])\}]}{Φ (1.25 [Φ^{- 1} (F_{2} (x_{2})) - 0.6 x_{1}]) - Φ (1.25 [Φ^{- 1} (F_{2} (x_{2} - 1)) - 0.6 x_{1}])}, \end{matrix}

for any discrete random variable

X_{2}

and for any

(y, x_{1}, x_{2}) \in R^{2} \times Ran (X_{2})

.
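To make the closed form of Example 3 concrete, the sketch below evaluates m(x) numerically. It rests on assumptions that are ours rather than the paper's: the discrete margin of X_2 is taken on {1, 2, 3} with F_2(1) = 0.3 and F_2(2) = 0.8, the expectation over Y is approximated by Gauss-Hermite quadrature, and the names `cond_prob` and `m` are hypothetical.

```python
import math
from statistics import NormalDist
import numpy as np

STD = NormalDist()
Phi = np.vectorize(STD.cdf)  # standard normal CDF, applied elementwise

def Phi_inv(p):
    """Standard normal quantile, extended to p = 0 and p = 1."""
    if p <= 0.0:
        return -math.inf
    if p >= 1.0:
        return math.inf
    return STD.inv_cdf(p)

# Assumed distribution function of X2 on {1, 2, 3}; F2(0) = 0 by convention
F2 = {0: 0.0, 1: 0.3, 2: 0.8, 3: 1.0}

RHO = 0.6                                        # rho_13 = rho_23, with rho_12 = 0
SIG_INV = 1.0 / math.sqrt(1.0 - 2.0 * RHO**2)    # 1 / sqrt(0.28), about 1.89
SIG_T_INV = 1.0 / math.sqrt(1.0 - RHO**2)        # 1 / 0.8 = 1.25

# Gauss-Hermite rule: E[g(Y)] ~ sum_k w_k g(sqrt(2) t_k) / sqrt(pi), Y ~ N(0, 1)
T, W = np.polynomial.hermite.hermgauss(80)
Y_NODES = math.sqrt(2.0) * T
Y_WEIGHTS = W / math.sqrt(math.pi)

def cond_prob(x1, x2):
    """Denominator of m(x), i.e. P(X2 = x2 | X1 = x1)."""
    hi, lo = Phi_inv(F2[x2]), Phi_inv(F2[x2 - 1])
    return float(Phi(SIG_T_INV * (hi - RHO * x1)) - Phi(SIG_T_INV * (lo - RHO * x1)))

def m(x1, x2):
    """Conditional mean of Example 3 at (x1, x2)."""
    hi, lo = Phi_inv(F2[x2]), Phi_inv(F2[x2 - 1])
    shift = RHO * Y_NODES + RHO * x1
    inner = Phi(SIG_INV * (hi - shift)) - Phi(SIG_INV * (lo - shift))
    numerator = float(np.sum(Y_WEIGHTS * Y_NODES * inner))
    return numerator / cond_prob(x1, x2)

for x2 in (1, 2, 3):
    print(x2, m(0.5, x2))
```

A useful sanity check is the law of total expectation: since ρ_{12} = 0, weighting m(x_1, x_2) by the conditional probabilities of X_2 and summing over x_2 should give E(Y | X_1 = x_1) = 0.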

4. Estimation

Consider a sample of n observations

(Y_{1}, X_{1}), \dots, (Y_{n}, X_{n})

from the random vector

(Y, X)

. For

i = 1, \dots, n

, denote

X_{i} = (X_{i 1}, \dots, X_{i d})

. To estimate the conditional mean described in (7), we need to estimate the marginal distributions

F_{0}, F_{1}, \dots, F_{d}

as well as the partial derivatives of the copulas C and

C_{X}

, namely,

\partial_{q + 1} C

and

\partial_{q} C_{X}

, provided in (6). In this paper, we use a semi-parametric methodology whose first step consists of estimating the margins

F_{0}, F_{1}, \dots, F_{d}

through their rescaled empirical distributions given by

F_{n 0} (y) = \frac{1}{n + 1} \sum_{i = 1}^{n} I (Y_{i} \leq y) \quad \text{and} \quad F_{n j} (x) = \frac{1}{n + 1} \sum_{i = 1}^{n} I (X_{i j} \leq x), \quad j = 1, \dots, d,

respectively, where

I (A)

stands for the indicator function of the event A. An alternative is to estimate these quantities by kernel smoothing, which produces smooth estimates and is often more accurate than the empirical distribution functions. The idea behind this method is to estimate the distributions

F_{0}, F_{1}, \dots, F_{d}

using

{\hat{F}}_{0} (y) = \frac{1}{n + 1} \sum_{i = 1}^{n} K (\frac{y - Y_{i}}{h}) \quad \text{and} \quad {\hat{F}}_{j} (x) = \frac{1}{n + 1} \sum_{i = 1}^{n} K (\frac{x - X_{i j}}{h}), \quad j = 1, \dots, d .

K (\cdot)

denotes the integrated kernel, that is, the distribution function associated with a non-negative kernel density, while h is the bandwidth. It is well known that the choice of the bandwidth strongly influences the accuracy of the estimation.
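The two marginal estimators can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are hypothetical, the smoothed version uses the standard normal distribution function as integrated kernel, and the bandwidth values are arbitrary rather than recommended.

```python
from statistics import NormalDist
import numpy as np

_PHI = np.vectorize(NormalDist().cdf)  # Gaussian integrated kernel (assumption)

def rescaled_ecdf(sample):
    """Return t -> (1 / (n + 1)) * #{i : sample_i <= t}."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = xs.size

    def F(t):
        # searchsorted(..., side="right") counts the observations <= t
        return np.searchsorted(xs, t, side="right") / (n + 1)

    return F

def kernel_cdf(sample, h):
    """Return t -> (1 / (n + 1)) * sum_i K((t - sample_i) / h), with K = Phi."""
    xs = np.asarray(sample, dtype=float)
    n = xs.size

    def F(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        return _PHI((t[:, None] - xs[None, :]) / h).sum(axis=1) / (n + 1)

    return F

sample = [3.0, 1.0, 2.0, 2.0]
F_emp = rescaled_ecdf(sample)
F_ker = kernel_cdf(sample, h=0.5)   # h = 0.5 is an arbitrary illustrative bandwidth
print(F_emp(2.0), F_ker(2.0)[0])
```

The factor 1/(n + 1) keeps both estimates strictly below 1, so that they can be plugged into a copula density without hitting the boundary of the unit cube.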

The second step is to estimate parametrically the copula of

(Y, X)

. To this end, assume that the copula C is identified as a member of some parametric family,

F = \{C_{θ}, θ \in Θ\}

, where

Θ \subset R .

This means that there exists

θ_{0} \in Θ

such that

C = C_{θ_{0}}

and

C_{X} = C_{X, θ_{0}}

. The copula C is then estimated by

C_{\hat{θ}}

, where

\hat{θ}

is an estimator of

θ_{0}

. This estimator is typically obtained by maximizing, with respect to

θ

, the pseudo-log-likelihood function

L (θ) = \sum_{i = 1}^{n} ln \{c_{θ} ({\hat{F}}_{0} (Y_{i}), {\hat{F}}_{d} (X_{i}))\}

where

{\hat{F}}_{d} (X_{i}) = ({\hat{F}}_{1} (X_{i 1}), \dots, {\hat{F}}_{d} (X_{i d}))

. In other words, the estimator of

θ

is given by

\hat{θ} = \arg \max_{θ \in Θ} L (θ) .
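As an illustration of this maximization step, the sketch below fits a trivariate Clayton copula to rank-based pseudo-observations. It makes several assumptions of its own: all three margins are continuous (so there are no ties), the data are simulated with a Marshall-Olkin sampler, the search interval for θ is arbitrary, and all function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def pseudo_obs(data):
    """Column-wise rescaled ranks R_ij / (n + 1) (assumes no ties)."""
    data = np.asarray(data, dtype=float)
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    return ranks / (data.shape[0] + 1)

def clayton_log_density(u, theta):
    """Log-density of the d-variate Clayton copula at each row of u."""
    d = u.shape[1]
    s = np.sum(u ** (-theta), axis=1) - d + 1
    const = np.sum(np.log1p(theta * np.arange(d)))   # log prod_k (1 + k * theta)
    return const - (theta + 1) * np.log(u).sum(axis=1) - (1 / theta + d) * np.log(s)

def fit_clayton(u, bounds=(1e-3, 20.0)):
    """Maximize the pseudo-log-likelihood L(theta) over an assumed interval."""
    res = minimize_scalar(lambda th: -clayton_log_density(u, th).sum(),
                          bounds=bounds, method="bounded")
    return res.x

def sample_clayton(n, d, theta, rng):
    """Marshall-Olkin sampler: U_j = (1 + E_j / V)^(-1/theta), V ~ Gamma(1/theta)."""
    v = rng.gamma(shape=1.0 / theta, scale=1.0, size=(n, 1))
    e = rng.exponential(size=(n, d))
    return (1.0 + e / v) ** (-1.0 / theta)

rng = np.random.default_rng(0)
u = pseudo_obs(sample_clayton(2000, 3, theta=2.0, rng=rng))
theta_hat = fit_clayton(u)
print(theta_hat)
```

With a discrete covariate, the ranks contain ties and the density-based criterion above would need to be adapted; the sketch only illustrates the optimization itself.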

Finally, the conditional mean

m (x)

is estimated using (7) as follows:

\begin{matrix} \hat{m} (x) = \frac{1}{n} \sum_{i = 1}^{n} \frac{Y_{i} Δ_{q}^{d} \partial_{q + 1} C_{\hat{θ}} ({\hat{F}}_{0} (Y_{i}), {\hat{F}}_{d} (x))}{Δ_{q}^{d} \partial_{q} C_{X, \hat{θ}} ({\hat{F}}_{d} (x))} . \end{matrix}

(16)

Example 4.

Let us examine the above estimation procedure in the scenario where

d = 2

and

q = 1

, that is, with

X_{1}

being a continuous covariate, while

X_{2}

is discrete. Additionally, let us assume that the copula governing

(Y, X_{1}, X_{2})

is the Clayton copula with parameter

θ \in (0, \infty)

. The theoretical conditional mean

m (x)

is provided in (12). Its estimated counterpart can be derived from (16) as follows:

\begin{matrix} \hat{m} (x) & = & \frac{(1 + \hat{θ}) \sum_{i = 1}^{n} Y_{i} {\hat{F}}_{0} {(Y_{i})}^{- \hat{θ} - 1} Δ_{1}^{2} {({\hat{F}}_{0} {(Y_{i})}^{- \hat{θ}} + {\hat{F}}_{1} {(x_{1})}^{- \hat{θ}} + {\hat{F}}_{2} {(x_{2})}^{- \hat{θ}} - 2)}^{- \frac{1}{\hat{θ}} - 2}}{n Δ_{1}^{2} {({\hat{F}}_{1} {(x_{1})}^{- \hat{θ}} + {\hat{F}}_{2} {(x_{2})}^{- \hat{θ}} - 1)}^{- \frac{1}{\hat{θ}} - 1}} . \end{matrix}

(17)
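Formula (17) translates almost literally into code. The sketch below is a hedged illustration: the function name `clayton_mhat` and its argument layout are our own, and the marginal estimates are assumed to have been computed beforehand and passed in as numbers.

```python
import numpy as np

def clayton_mhat(y, u0, v1, v2_hi, v2_lo, theta):
    """Evaluate (17) for one point x = (x1, x2).

    y      : responses Y_i
    u0     : F0_hat(Y_i), same length as y
    v1     : F1_hat(x1)
    v2_hi  : F2_hat(x2)
    v2_lo  : F2_hat(x2 - 1), possibly 0 at the smallest value of X2
    theta  : estimated Clayton parameter (> 0)
    """
    y = np.asarray(y, dtype=float)
    u0 = np.asarray(u0, dtype=float)

    def pw(v):
        # v^(-theta), with the convention 0^(-theta) = +inf
        return np.inf if v <= 0 else v ** (-theta)

    a = -1.0 / theta - 2.0
    b = -1.0 / theta - 1.0
    base = u0 ** (-theta) + pw(v1) - 2.0
    # inf raised to a negative power is 0, so a zero lower level contributes nothing
    num_diff = (base + pw(v2_hi)) ** a - (base + pw(v2_lo)) ** a
    den_diff = (pw(v1) + pw(v2_hi) - 1.0) ** b - (pw(v1) + pw(v2_lo) - 1.0) ** b
    num = (1.0 + theta) * np.mean(y * u0 ** (-theta - 1.0) * num_diff)
    return num / den_diff

# One observation, theta = 1, X2 covering its whole range: reduces to y * c(u0, v1)
print(clayton_mhat([1.0], [0.5], v1=0.5, v2_hi=1.0, v2_lo=0.0, theta=1.0))
```

When F̂_2(x_2) = 1 and F̂_2(x_2 − 1) = 0, the discrete covariate carries no information and the expression collapses to the average of Y_i times the bivariate Clayton density, which gives a convenient hand check.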

The estimators

\hat{θ}

,

{\hat{F}}_{0}

,

{\hat{F}}_{1}

, and

{\hat{F}}_{2}

can be computed using a sample

(Y_{i}, X_{i 1}, X_{i 2})

,

i = 1, \dots, n

, drawn from the distribution of

(Y, X_{1}, X_{2})

. Similarly, in the case where the dependence structure of

(Y, X_{1}, X_{2})

is modeled by a Frank copula, the estimated conditional mean

\hat{m} (x)

can be derived from (13) and (16) as follows:

\begin{matrix} \hat{m} (x) & = & \frac{1}{Δ_{1}^{2} K_{2, \hat{θ}} ({\hat{F}}_{1} (x_{1}), {\hat{F}}_{2} (x_{2}))} \frac{1}{n} \sum_{i = 1}^{n} Y_{i} Δ_{1}^{2} K_{1, \hat{θ}} ({\hat{F}}_{0} (Y_{i}), {\hat{F}}_{1} (x_{1}), {\hat{F}}_{2} (x_{2})), \end{matrix}

(18)

where

K_{1, θ}

and

K_{2, θ}

are given in (14) and (15), respectively.

5. Simulation Study

This section compares, through simulations, the proposed conditional mean estimator with the ordinary least squares (OLS) estimator. To this end, we focus on the case where

d = 2

with mixed covariates; specifically,

X_{1}

is continuous, and

X_{2}

is discrete. In this case, the proposed estimator is deduced from its general form expressed in (16) as follows,

\begin{matrix} \hat{m} (x_{1}, x_{2}) = \frac{1}{n} \sum_{i = 1}^{n} \frac{Y_{i} [\partial_{2} C_{\hat{θ}} ({\hat{F}}_{0} (Y_{i}), {\hat{F}}_{1} (x_{1}), {\hat{F}}_{2} (x_{2})) - \partial_{2} C_{\hat{θ}} ({\hat{F}}_{0} (Y_{i}), {\hat{F}}_{1} (x_{1}), {\hat{F}}_{2} (x_{2} - 1))]}{\partial_{1} C_{X, \hat{θ}} ({\hat{F}}_{1} (x_{1}), {\hat{F}}_{2} (x_{2})) - \partial_{1} C_{X, \hat{θ}} ({\hat{F}}_{1} (x_{1}), {\hat{F}}_{2} (x_{2} - 1))}, \end{matrix}

(19)

where

\partial_{2} C_{\hat{θ}} (u, v_{1}, v_{2}) = \frac{\partial^{2} C_{\hat{θ}} (u, v_{1}, v_{2})}{\partial u \partial v_{1}} \quad \text{and} \quad \partial_{1} C_{X, \hat{θ}} (v_{1}, v_{2}) = \frac{\partial C_{X, \hat{θ}} (v_{1}, v_{2})}{\partial v_{1}} .

As scenarios, we consider several common cases in order to compare our estimator with the OLS estimator. For the copula of

(Y, X_{1}, X_{2})

, we consider the Clayton, Frank, and Gumbel families with parameter

θ \in (0, \infty)

(with θ ≥ 1 for the Gumbel family), together with the Gaussian copula; for the variables, we take

Y \sim N (μ, σ^{2})

or

Y \sim t (d f)

, while

X_{1} \sim U_{[a, b]}

and

X_{2} \in \{1, 2, 3\}

with distribution

F_{2} (1) = p_{1}

,

F_{2} (2) = p_{1} + p_{2}

and

F_{2} (3) = 1

. The generalized inverse of

F_{2}

is

F_{2}^{- 1} (t) = inf \{v \in \{1, 2, 3\} : F_{2} (v) > t\},

or equivalently,

F_{2}^{- 1} (t) = I (0 \leq t < p_{1}) + 2 I (p_{1} \leq t < p_{1} + p_{2}) + 3 I (p_{1} + p_{2} \leq t \leq 1) .

Simulation algorithm:

Given $n, a, b, p_{1}, p_{2}, μ, σ^{2}$, and $θ$:
For $i = 1, \dots, n$:
Generate $(u_{i}, v_{i, 1}, v_{i, 2})$ from the copula $C_{θ}$.
Set $y_{i} = F_{0}^{- 1} (u_{i})$, $x_{i, 1} = F_{1}^{- 1} (v_{i, 1})$, and $x_{i, 2} = F_{2}^{- 1} (v_{i, 2})$.
Use the generated sample $(y_{i}, x_{i, 1}, x_{i, 2})$, $i = 1, \dots, n$, to estimate $θ$ and to compute the empirical distribution functions estimating $F_{0}$, $F_{1}$, and $F_{2}$.
Evaluate the estimator $\hat{m} (x_{1}, x_{2})$ for $(x_{1}, x_{2})$ belonging to the grid defined by

$F = E \times \{1, 2, 3\}$, where $E = \{i / K, i = 1, \dots, K\}$.
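The data-generating part of the algorithm can be sketched as follows. The sketch assumes a Clayton copula sampled by the Marshall-Olkin method, a normal response, and the parameter values used later in this section; the function names and the clipping constant are our own choices, and the estimation steps are deliberately left out.

```python
from statistics import NormalDist
import numpy as np

def sample_clayton(n, d, theta, rng):
    """Marshall-Olkin sampler for the d-variate Clayton copula (theta > 0)."""
    v = rng.gamma(shape=1.0 / theta, scale=1.0, size=(n, 1))
    e = rng.exponential(size=(n, d))
    return (1.0 + e / v) ** (-1.0 / theta)

def F2_inv(t, p1, p2):
    """Generalized inverse of F2 on {1, 2, 3}, matching the indicator form."""
    t = np.asarray(t, dtype=float)
    return 1 + (t >= p1).astype(int) + (t >= p1 + p2).astype(int)

def simulate(n, a, b, p1, p2, mu, sigma, theta, rng):
    """Draw (y_i, x_i1, x_i2), i = 1..n, with Y normal, X1 uniform, X2 discrete."""
    uv = sample_clayton(n, 3, theta, rng)
    uv = np.clip(uv, 1e-12, 1.0 - 1e-12)            # keep quantiles finite
    normal_ppf = np.vectorize(NormalDist(mu, sigma).inv_cdf)
    y = normal_ppf(uv[:, 0])                         # y_i = F0^{-1}(u_i)
    x1 = a + (b - a) * uv[:, 1]                      # x_i1 = F1^{-1}(v_i1), U[a, b]
    x2 = F2_inv(uv[:, 2], p1, p2)                    # x_i2 = F2^{-1}(v_i2)
    return y, x1, x2

rng = np.random.default_rng(1)
y, x1, x2 = simulate(n=200, a=0.0, b=1.0, p1=0.3, p2=0.5,
                     mu=0.0, sigma=1.0, theta=2.0, rng=rng)

# Evaluation grid F = E x {1, 2, 3} with E = {i / K, i = 1..K}; K = 10 is assumed
K = 10
grid = [(i / K, v) for i in range(1, K + 1) for v in (1, 2, 3)]
print(len(grid), np.bincount(x2, minlength=4)[1:])
```

For a Student's t response, only the quantile function applied to the first copula coordinate changes.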

For fixed

(x_{1}, x_{2}) \in F

, we first compute the theoretical value

m (x_{1}, x_{2})

and then evaluate

\hat{m} (x_{1}, x_{2})

using J random samples of size n. We denote the corresponding estimates by

{\hat{m}}^{(j)} (x_{1}, x_{2})

, where

j = 1, \dots, J

. To assess the performance, we employ the empirical integrated mean squared error (IMSE), which is formulated as follows:

IMSE = \frac{1}{J} \sum_{j = 1}^{J} [\frac{1}{| F |} \sum_{(x_{1}, x_{2}) \in F} {({\hat{m}}^{(j)} (x_{1}, x_{2}) - m (x_{1}, x_{2}))}^{2}],

(20)

where

| F |

denotes the cardinality of the grid F. Notably,

IMSE

can be decomposed into the square of empirical bias,

{IBIAS}^{2}

, and the empirical variance,

IVAR

, as follows:

IMSE = {IBIAS}^{2} + IVAR .
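The decomposition can be verified numerically. The text does not spell out IBIAS² and IVAR, so the sketch below assumes the usual definitions: the squared bias compares the average of the J estimates with the true value at each grid point, and the variance is the average squared deviation of the estimates around that average. The function name and the synthetic inputs are illustrative only.

```python
import numpy as np

def imse_decomposition(m_hat, m_true):
    """m_hat: (J, G) array of estimates over J replications and G grid points;
    m_true: (G,) array of true conditional means."""
    m_hat = np.asarray(m_hat, dtype=float)
    m_true = np.asarray(m_true, dtype=float)
    imse = np.mean((m_hat - m_true) ** 2)       # average over j and grid points
    m_bar = m_hat.mean(axis=0)                  # pointwise mean of the J estimates
    ibias2 = np.mean((m_bar - m_true) ** 2)     # integrated squared bias
    ivar = np.mean((m_hat - m_bar) ** 2)        # integrated variance (1/J version)
    return imse, ibias2, ivar

# Synthetic estimates: truth plus a small bias and noise (purely illustrative)
rng = np.random.default_rng(2)
m_true = np.linspace(-1.0, 1.0, 30)
m_hat = m_true + 0.05 + 0.1 * rng.standard_normal((100, 30))
imse, ibias2, ivar = imse_decomposition(m_hat, m_true)
print(imse, ibias2 + ivar)
```

The identity holds exactly when the variance uses the divisor J; with the divisor J − 1 it holds only approximately.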

In this simulation study, different values of the parameters are considered, which represent different dependence scenarios ranging from weak to strong, with Kendall’s tau,

τ

, values lying in the interval

[0.2, 0.75]

. With a sample size

n = 200

, the response, Y, is generated from

N (0, 1)

distribution and Student’s t-distribution with 3 degrees of freedom. Also,

X_{1}

is generated from a Uniform(0, 1) and

X_{2} \in \{1, 2, 3\}

with distribution

F_{2} (1) = p_{1}

,

F_{2} (2) = p_{1} + p_{2}

and

F_{2} (3) = 1

, where

p_{1} = 0.3

and

p_{2} = 0.5

. In this context, we report the integrated mean squared error (IMSE) and the integrated mean absolute error (IMAE) of the proposed estimator and compare them with the corresponding errors of the least squares (ls) regression method. The values reported in Table 1 and Table 2, corresponding to the normal distribution and Student’s t-distribution, respectively, are averages over 100 realizations, which accounts for the variability of the outcomes across samples. At this sample size, the proposed method attains a lower IMSE than least squares in every scenario, with the largest gains under the Frank copula. The IMAE comparison is less uniform: least squares has a slightly lower IMAE for the Clayton copula with a normal response (Table 1), and the two methods are nearly tied in a few Clayton and Gaussian scenarios. We also analyzed the evolution of the IMSE with the sample size (see Table 3). Specifically, we considered

n = 50

and

n = 100

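, which are relatively small. As n increases from 50 to 200, the IMSE of the proposed estimator decreases in every scenario, although for the Gaussian copula least squares retains the lower IMSE at these two smaller sample sizes.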

Table 1. Simulation results for normal distribution with

n = 200

.

Table 2. Simulation results for Student’s t with

d f = 3

and

n = 200

.

Table 3. Simulation results for normal distribution with

n = 50

and

n = 100

.

In particular, the proposed method is more accurate in most scenarios, with a lower IMSE and, in most cases, a lower IMAE than the least squares method. This improvement can be attributed to the ability of the proposed method to capture the underlying dependence structure, whose strength is summarized by Kendall’s tau. Unlike the least squares method, which assumes a linear relationship, the proposed method offers a more flexible approach to analyzing data with varying degrees of dependence and complexity.

6. Conclusions

This paper extends the copula-based regression model introduced by Noh et al. [5] by addressing the scenario where covariates are mixed, encompassing both continuous and discrete explanatory variables. Unlike the original model, which dealt exclusively with continuous covariates, the proposed approach broadens the applicability of the copula-based regression framework. The parameter estimation has been performed using the inference function for margins (IFM), which first estimates the marginal parameters and then estimates the corresponding dependence parameter. Through detailed examples, we demonstrated the estimation of the proposed regression equation and conducted a comprehensive simulation study under various scenarios involving different types of copulas. The results of the simulation study indicate that the suggested model performs favorably compared to classical regression approaches, showcasing its potential to handle mixed-covariate data effectively. This extension provides a valuable contribution to the field of regression analysis, offering a new regression tool for researchers and practitioners dealing with diverse explanatory data types.

An interesting potential research direction involves extending this concept to regression with multivariate responses using the same mixed covariates. This extension is particularly relevant in various practical applications where multiple outcomes need to be modeled simultaneously. For instance, in environmental studies, multivariate regression can be used to assess how industrial emissions simultaneously impact both air and water quality, accounting for the complex interactions between pollutants. In healthcare, it enables researchers to examine how lifestyle factors influence multiple health outcomes, such as blood pressure, cholesterol levels, and blood sugar levels.

Author Contributions

Writing—review & editing, S.A., O.K. and M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by Grant CARP2023 from the College of Business and Economics at UAE University, for the Promotion of research.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

We would like to express our sincere thanks to the anonymous referees for their constructive comments and suggestions, which improved the earlier version of our paper. We are also very grateful to UAE University Research Affairs for funding the APC.

Conflicts of Interest

The authors declare no conflict of interest.

References

Nelsen, R.B. An Introduction to Copulas; Springer Series in Statistics; Springer: New York, NY, USA, 2006.
Sheikhi, A.; Arad, F.; Mesiar, R. A heteroscedasticity diagnostic of a regression analysis with copula dependent random variables. Braz. J. Probab. Stat. 2022, 36, 408–419.
Ali, A.; Pathak, A.K.; Arshad, M.; Emura, T. Copula-based regression estimation in the presence of outliers. Commun. Stat. Simul. Comput. 2024, 1–26.
Sklar, M. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 1959, 8, 229–231.
Noh, H.; Ghouch, A.E.; Bouezmarni, T. Copula-based regression estimation and inference. J. Am. Stat. Assoc. 2013, 108, 676–688.
Noh, H.; Ghouch, A.E.; Van Keilegom, I. Semiparametric conditional quantile estimation through copula-based multivariate models. J. Bus. Econ. Stat. 2015, 33, 167–178.
De Backer, M.; El Ghouch, A.; Van Keilegom, I. Semiparametric copula quantile regression for complete or censored data. Electron. J. Stat. 2017, 11, 1660–1698.
Kraus, D.; Czado, C. D-vine copula based quantile regression. Comput. Stat. Data Anal. 2017, 110, 1–18.
Rémillard, B.; Nasri, B.; Bouezmarni, T. On copula-based conditional quantile estimators. Stat. Probab. Lett. 2017, 128, 14–20.
Chang, B.; Joe, H. Prediction based on conditional distributions of vine copulas. Comput. Stat. Data Anal. 2019, 139, 45–63.
Nagler, T.; Vatter, T. Solving estimating equations with copulas. J. Am. Stat. Assoc. 2023, 119, 1168–1180.
Newey, W.K.; Powell, J.L. Asymmetric least squares estimation and testing. Econometrica 1987, 55, 819–847.
Coia, V.; Joe, H.; Nolde, N. Copula-based conditional tail indices. J. Multivar. Anal. 2023, 201, 105268.
Mesfioui, M.; Bouezmarni, T.; Belalia, M. Copula-based link functions in binary regression models. Stat. Pap. 2023, 64, 557–585.
Smith, M.S. Implicit copulas: An overview. Econom. Stat. 2023, 28, 81–104.
Hans, N.; Klein, N.; Faschingbauer, F.; Schneider, M.; Mayr, A. Boosting distributional copula regression. Biometrics 2023, 79, 2298–2310.
Nazeri Tahroudi, M.; Ramezani, Y.; De Michele, C.; Mirabbasi, R. Application of copula-based approach as a new data-driven model for downscaling the mean daily temperature. Int. J. Climatol. 2023, 43, 240–254.
McNeil, A.J.; Nešlehová, J. Multivariate Archimedean copulas, d-monotone functions and ℓ₁-norm symmetric distributions. Ann. Statist. 2009, 37, 3059–3097.

Table 1. Simulation results for normal distribution with

n = 200

.

		IMSE		IMAE
Copula	Parameters	$\hat{m}_{c}$	$\hat{m}_{ls}$	$\hat{m}_{c}$	$\hat{m}_{ls}$
Clayton	$θ = 0.50 (τ = 0.20)$	0.005	0.015	0.084	0.071
Clayton	$θ = 2.00 (τ = 0.50)$	0.012	0.033	0.116	0.104
Clayton	$θ = 6.00 (τ = 0.75)$	0.032	0.095	0.147	0.136
Frank	$θ = 2.37 (τ = 0.25)$	0.00026	0.120	0.011	0.283
Frank	$θ = 5.73 (τ = 0.50)$	0.00004	0.305	0.004	0.459
Frank	$θ = 14.14 (τ = 0.75)$	0.00001	0.613	0.002	0.670
Gumbel	$θ = 1.25 (τ = 0.20)$	0.020	0.050	0.105	0.151
Gumbel	$θ = 2.00 (τ = 0.50)$	0.036	0.095	0.134	0.208
Gumbel	$θ = 4.00 (τ = 0.75)$	0.062	0.232	0.144	0.285
Gaussian	$ρ_{12} = 0.4, ρ_{13} = 0.4, ρ_{23} = - 0.4 (τ = 0.26)$	0.079	0.080	0.236	0.234
Gaussian	$ρ_{12} = 0.9, ρ_{13} = 0.9, ρ_{23} = 0.85 (τ = 0.69)$	0.051	0.071	0.172	0.210

Table 2. Simulation results for Student’s t with

d f = 3

and

n = 200

.

		IMSE		IMAE
Copula	Parameters	$\hat{m}_{c}$	$\hat{m}_{ls}$	$\hat{m}_{c}$	$\hat{m}_{ls}$
Clayton	$θ = 0.50 (τ = 0.20)$	0.011	0.044	0.121	0.121
Clayton	$θ = 2.00 (τ = 0.50)$	0.022	0.079	0.157	0.158
Clayton	$θ = 6.00 (τ = 0.75)$	0.061	0.291	0.206	0.247
Frank	$θ = 2.37 (τ = 0.25)$	0.00112	0.270	0.023	0.420
Frank	$θ = 5.73 (τ = 0.50)$	0.00024	0.673	0.010	0.679
Frank	$θ = 14.14 (τ = 0.75)$	0.00016	1.556	0.008	1.059
Gumbel	$θ = 1.25 (τ = 0.20)$	0.051	0.138	0.157	0.246
Gumbel	$θ = 2.00 (τ = 0.50)$	0.076	0.264	0.199	0.352
Gumbel	$θ = 4.00 (τ = 0.75)$	0.120	0.709	0.194	0.576
Gaussian	$ρ_{12} = 0.4, ρ_{13} = 0.4, ρ_{23} = - 0.4 (τ = 0.26)$	0.400	0.525	0.431	0.448
Gaussian	$ρ_{12} = 0.9, ρ_{13} = 0.9, ρ_{23} = 0.85 (τ = 0.69)$	0.087	0.301	0.220	0.435

Table 3. Simulation results for normal distribution with

n = 50

and

n = 100

.

IMSE		$n = 50$		$n = 100$
Copula	Parameters	$\hat{m}_{c}$	$\hat{m}_{ls}$	$\hat{m}_{c}$	$\hat{m}_{ls}$
Clayton	$θ = 0.50 (τ = 0.20)$	0.031	0.079	0.014	0.042
Clayton	$θ = 2.00 (τ = 0.50)$	0.047	0.103	0.031	0.080
Clayton	$θ = 6.00 (τ = 0.75)$	0.078	0.257	0.070	0.199
Frank	$θ = 2.37 (τ = 0.25)$	0.001	0.159	0.0006	0.138
Frank	$θ = 5.73 (τ = 0.50)$	0.00017	0.371	0.00009	0.306
Frank	$θ = 14.14 (τ = 0.75)$	0.00006	0.661	0.00002	0.598
Gumbel	$θ = 1.25 (τ = 0.20)$	0.027	0.088	0.026	0.067
Gumbel	$θ = 2.00 (τ = 0.50)$	0.059	0.127	0.043	0.103
Gumbel	$θ = 4.00 (τ = 0.75)$	0.085	0.269	0.064	0.237
Gaussian	$ρ_{12} = 0.4, ρ_{13} = 0.4, ρ_{23} = - 0.4 (τ = 0.26)$	0.127	0.113	0.101	0.095
Gaussian	$ρ_{12} = 0.9, ρ_{13} = 0.9, ρ_{23} = 0.85 (τ = 0.69)$	0.362	0.091	0.101	0.078

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
