Article

Sinkhorn Distributionally Robust Conditional Quantile Prediction with Fixed Design

Department of Statistics and Finance, School of Management, University of Science and Technology of China, Hefei 230052, China
* Author to whom correspondence should be addressed.
Entropy 2025, 27(6), 557; https://doi.org/10.3390/e27060557
Submission received: 7 April 2025 / Revised: 23 May 2025 / Accepted: 23 May 2025 / Published: 25 May 2025
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

This paper proposes a novel data-driven distributionally robust framework for conditional quantile prediction under the fixed design setting of the covariates, which we refer to as Sinkhorn distributionally robust conditional quantile prediction. We derive a convex programming dual reformulation of the proposed problem and further develop a conic optimization reformulation for the case with finite support. Our method’s superior performance is demonstrated through several numerical experiments, highlighting its effectiveness in practical applications.

1. Introduction

Quantile regression [1] aims to estimate the value of a specific quantile of a given random variable. During the past few decades, quantile regression has been widely applied in various fields of academic research, including economics [2], clinical medicine [3], and environmental science [4], among others. In the financial field, for example, the 1% or 5% quantile describing the left tail of the distribution of the profit and loss account (the so-called Value-at-Risk (VaR)) is of interest to risk managers [5].
Under the random design framework [1,6], theoretical performance bounds are established by assuming that the data samples are independent and identically distributed (i.i.d.). Although this assumption holds in some cases, it is often unrealistic in practice, as external features may be predetermined or designed, preventing data samples from being treated as realizations of a random variable from an underlying distribution. This phenomenon of predetermined or designed external features is known as fixed design in regression analysis [7].
Common methods for solving conditional quantile prediction problems under the random design include sample average approximation (SAA) [8]. SAA is broadly applicable, but its computational and storage costs grow with the sample size. Moreover, under the fixed design setting, the solution obtained by the SAA method may not be robust [9]. To achieve a robust yet not overly conservative solution, one can consider the distributionally robust optimization (DRO) approach [10,11]. DRO, a modeling paradigm for decision-making under model uncertainty, seeks an optimal decision by minimizing the worst-case (or adversarial) outcome over a set of plausible probability distributions, commonly referred to as the ambiguity set.
The construction of the ambiguity set plays a crucial role in the performance of the DRO method. In the literature, there are two main approaches to constructing ambiguity sets. The first approach involves defining the ambiguity sets based on descriptive statistics, such as moment conditions [11,12,13], shape constraints [14,15], marginal distributions [16,17], etc. The second approach, which has become popular in recent years, considers distributions within a pre-specified statistical distance from a nominal distribution. Usually, the nominal distribution is the empirical distribution of the data samples. Commonly used statistical distances in the literature include ϕ -divergence [18,19,20], Wasserstein distance [21,22,23], maximum mean discrepancy [24], and Sinkhorn distance [25,26,27,28].
Among all metric-based ambiguity sets, the Kullback–Leibler (KL) ambiguity set received relatively early attention. However, as argued in [21], when the ambiguity set is centered at the discrete empirical distribution and the unknown true distribution is absolutely continuous with respect to the Lebesgue measure, any KL ambiguity set necessarily excludes the true distribution, because a continuous distribution cannot assign positive probability mass to each training sample. In contrast, the Wasserstein ambiguity set naturally contains both discrete and continuous distributions, thereby overcoming this limitation of the KL ambiguity set. Although the Wasserstein distance is popular in DRO research, it still has certain limitations. Wasserstein DRO is typically tractable only under stringent conditions on the loss function (see Table 1 in [27]), and for data-driven Wasserstein DRO, where the nominal distribution is finitely supported (usually the empirical distribution), the worst-case distribution is discrete and supported on at most N + 1 points, even though the underlying true distribution in many practical applications may be continuous [23]. This raises concerns about whether Wasserstein DRO hedges against the correct family of distributions and whether it may lead to overly conservative performance.
To address these potential issues while maintaining the advantages of Wasserstein DRO, this paper adopts the Sinkhorn distance [25,26] to construct the ambiguity set. Sinkhorn distance, a smoothed variant of the Wasserstein distance, is defined as the minimal transport cost between two distributions in an entropy-regularized optimal transport problem. This design improves the computational complexity of Wasserstein distance, with demonstrated efficiency gains enabling widespread applications in domain adaptation [29,30], generative modeling [31,32], and dimensionality reduction [33,34,35].
Sinkhorn DRO refers to constructing an ambiguity set of distributions using Sinkhorn distance, finding the worst-case distribution that maximizes the loss within this ambiguity set, and subsequently finding the optimal decision within the decision set that minimizes the loss corresponding to the worst-case distribution. As argued by [27], computing the Sinkhorn distance and solving the Sinkhorn DRO problem are two different computational tasks. From the perspective of dual formulations, the former is a standard stochastic optimization problem, whereas the latter is a conditional stochastic optimization [36]. In the latter case, computing unbiased gradients becomes extremely challenging. Thus, the Sinkhorn DRO problem is computationally non-trivial.
To the best of our knowledge, research on Sinkhorn DRO is still in its early stages, with representative theoretical works including [27] and [28]. Compared to the Wasserstein DRO, its advantages mainly lie in the fact that the dual problem of Sinkhorn DRO is computationally tractable for a broader class of loss functions, cost functions, nominal distributions, and probability supports [27], and its worst-case distribution is absolutely continuous with respect to a pre-specified reference measure, such as the Lebesgue or counting measure. This characteristic of Sinkhorn DRO highlights its flexibility as a modeling choice and provides a more realistic representation of uncertainty that better aligns with the underlying true distribution in practical scenarios [27].
Currently, under the fixed design setting, the authors of [9] proposed a distributionally robust conditional quantile prediction problem based on a type-1 Wasserstein ambiguity set, establishing the finite sample guarantee and asymptotic consistency. Building on this work, Ref. [37] employed the type-p Wasserstein ambiguity set, thereby generalizing the original problem. Recognizing the advantages of Sinkhorn DRO over Wasserstein DRO, in this paper, we propose a Sinkhorn distributionally robust conditional quantile prediction (DRCQP) problem under the fixed design setting and establish its strong dual reformulation to ensure computational tractability. When the support is finite, the dual form reduces to an exponential cone program. We further investigate the theoretical relationships between our proposed problem and some classic DRCQP problems. Finally, through empirical analysis, we evaluate the out-of-sample performance of our proposed method and conduct sensitivity analyses of the key parameter η , thereby providing practical references for real-world applications.
The remainder of the paper is structured as follows. Section 2 presents the preliminary knowledge, including the standard linear regression model and the conditional quantile regression model. Section 3 introduces the definition of Sinkhorn distance and establishes the Sinkhorn DRCQP problem under the fixed design setting. Section 4 aims to tackle the proposed problem by providing an equivalent tractable reformulation under some mild assumptions, along with a conic optimization reformulation developed for finite support cases. Section 5 evaluates the performance of our DRO method through numerical experiments. Finally, Section 6 concludes the paper with a summary of key findings.
Notations: Let $\mathbb{R}^n$ denote the $n$-dimensional Euclidean space. For a positive integer $N$, let $[N]$ represent the set $\{1, 2, \ldots, N\}$. For a measurable set $\mathcal{Z}$, denote by $\mathcal{M}(\mathcal{Z})$ the set of measures (not necessarily probability measures) on $\mathcal{Z}$, and let $\mathcal{P}(\mathcal{Z})$ denote the set of probability measures on $\mathcal{Z}$. Given a probability distribution $\mathbb{P}$ and a measure $\mu$, we write $\mathbb{P} \ll \mu$ if $\mathbb{P}$ is absolutely continuous with respect to $\mu$. $\mathbb{P} \otimes \mathbb{Q}$ denotes the product measure of two probability distributions $\mathbb{P}$ and $\mathbb{Q}$. Given a measure $\mu \in \mathcal{M}(\mathcal{Z})$ and a measurable function $f : \mathcal{Z} \to \mathbb{R}$, we write $\mathbb{E}_{z \sim \mu}[f(z)]$ for $\int_{\mathcal{Z}} f(z)\, d\mu(z)$. For a given element $z$, we denote by $\delta_z$ the one-point (Dirac) probability distribution supported on $\{z\}$.

2. Preliminaries

In this paper, we consider $Y \in \mathbb{R}$ as the response variable of interest and $(X_1, \ldots, X_p)^\top \in \mathbb{R}^p$ as the vector of explanatory variables. The standard linear regression model without an intercept term is
$Y = \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon, \qquad (1)$
where $\beta = (\beta_1, \ldots, \beta_p)^\top \in \mathbb{R}^p$ is the vector of unknown regression parameters, and $\varepsilon \in \mathbb{R}$ is the random residual term. We assume that $\varepsilon$ is independent of $(X_1, \ldots, X_p)$ with zero mean and unknown variance $\sigma^2$.
The ordinary least squares (OLS) method is the classic approach to estimating the unknown $\beta$, but it is highly sensitive to outliers and performs poorly when $\varepsilon$ follows a non-Gaussian distribution [1]. To overcome these limitations, Koenker and Bassett [1] proposed quantile regression, which minimizes asymmetric absolute deviations to characterize how explanatory variables affect different quantiles of the conditional distribution of the response variable $Y$. This research originated from the definition of the $\alpha$-quantile $Q_\alpha(Y)$ of the response variable $Y$:
$Q_\alpha(Y) = \inf\{q \in \mathbb{R} : P(Y \le q) \ge \alpha\}, \quad 0 < \alpha < 1. \qquad (2)$
In addition to (2), $Q_\alpha(Y)$ can also be expressed as the solution to the following stochastic optimization problem:
$Q_\alpha(Y) = \arg\min_{q \in \mathbb{R}} \ \mathbb{E}[\rho_\alpha(Y - q)], \qquad (3)$
where $\rho_\alpha$ is the quantile loss function (also called the check function), defined as
$\rho_\alpha(u) = u\big(\alpha - \mathbf{1}(u \le 0)\big) = \begin{cases} \alpha u, & \text{if } u > 0, \\ (\alpha - 1)u, & \text{if } u \le 0, \end{cases}$
which assigns different weights to positive and negative residuals.
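To make the role of the check function concrete, the following short sketch (our own illustration, not part of the original analysis) implements $\rho_\alpha$ with NumPy and verifies numerically that minimizing the empirical average of $\rho_\alpha(Y - q)$ over $q$ recovers the sample $\alpha$-quantile.

```python
import numpy as np

def check_loss(u, alpha):
    """Quantile (check) loss rho_alpha(u) = u * (alpha - 1{u <= 0})."""
    return np.where(u > 0, alpha * u, (alpha - 1.0) * u)

rng = np.random.default_rng(0)
y = rng.standard_t(df=5, size=10_000)   # heavy-tailed sample
alpha = 0.7

# Minimize the empirical expected check loss over a grid of candidate q values.
grid = np.linspace(y.min(), y.max(), 2001)
costs = [check_loss(y - q, alpha).mean() for q in grid]
q_star = grid[int(np.argmin(costs))]

print(q_star, np.quantile(y, alpha))    # the two values nearly coincide
```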
In the fixed design setting, we focus on the out-of-sample performance at a given feature vector $(X_1, \ldots, X_p)^\top = (x_1, \ldots, x_p)^f := x^f$ (here, the superscript "f" in $x^f$ indicates the fixed design). According to the linear model (1) and the independence between $(X_1, \ldots, X_p)$ and $\varepsilon$, the $\alpha$-th conditional quantile $Q_\alpha(Y \mid x^f)$ of the response variable $Y$ takes the following form:
$Q_\alpha(Y \mid x^f) = Q_\alpha(\beta^\top x^f + \varepsilon \mid x^f) = \beta^\top x^f + Q_\alpha(\varepsilon) = \beta^\top x^f + s_\alpha,$
where $s_\alpha := Q_\alpha(\varepsilon)$ is the $\alpha$-th quantile of $\varepsilon$. Consider $N$ observed samples $\{(x_i^f, y_i)\}_{i=1}^N$ generated from model (1). According to [1], for $\alpha \in (0, 1)$, the $\alpha$-th regression quantile is defined as the solution to the following problem:
$\min_{\tilde\beta \in \mathbb{R}^p,\ s \in \mathbb{R}} \ \sum_{i=1}^N \rho_\alpha\big(y_i - \tilde\beta^\top x_i^f - s\big).$
The aforementioned quantile regression problem can be viewed as applying the sample average approximation (SAA) method to solve the conditional stochastic optimization problem in (3). Under the random design setting, the SAA method performs well since the empirical distribution converges to the true distribution as the sample size N increases. However, under the fixed design setting, this convergence cannot be guaranteed, resulting in the deterioration of estimation efficiency. This motivates us to find more effective methods to solve the conditional quantile regression problem under the fixed design setting.
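The SAA problem above is a linear program and can be solved directly with a convex-programming toolbox. The sketch below is our own illustration (the variable names and the synthetic design are ours); it fits $(\tilde\beta, s)$ by minimizing the empirical check loss with CVXPY.

```python
import cvxpy as cp
import numpy as np

def saa_quantile_regression(X, y, alpha):
    """Solve min_{beta, s} sum_i rho_alpha(y_i - beta^T x_i - s)."""
    N, p = X.shape
    beta = cp.Variable(p)
    s = cp.Variable()
    resid = y - X @ beta - s
    # rho_alpha(u) = max(alpha * u, (alpha - 1) * u)
    loss = cp.sum(cp.maximum(alpha * resid, (alpha - 1) * resid))
    cp.Problem(cp.Minimize(loss)).solve()
    return beta.value, s.value

# Small synthetic illustration under a fixed (deterministic) design.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 200)
X = np.column_stack([x, x ** 2, np.sin(5 * x)])
y = X @ np.array([1.0, 2.0, 3.0]) + rng.standard_t(df=5, size=200)
beta_hat, s_hat = saa_quantile_regression(X, y, alpha=0.2)
```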

3. Problem Setup

3.1. Sinkhorn Distance

The core of distributionally robust optimization (DRO) lies in selecting a suitable ambiguity set. Such a set must balance computational tractability with practical interpretability. It should be sufficiently rich to cover relevant distributions yet exclude unnecessary ones that may lead to overly conservative decisions. Recently, the Sinkhorn distance has gained significant attention for constructing ambiguity sets. Its definition is as follows:
Definition 1
(Sinkhorn distance, [25,26,27]). Let $\mathcal{Z}$ be a measurable set. Consider distributions $\mathbb{P}, \mathbb{Q} \in \mathcal{P}(\mathcal{Z})$, and let $\mu, \nu \in \mathcal{M}(\mathcal{Z})$ be two reference measures such that $\mathbb{P} \ll \mu$ and $\mathbb{Q} \ll \nu$. For an entropic regularization parameter $\eta \ge 0$, the Sinkhorn distance between the two distributions $\mathbb{P}$ and $\mathbb{Q}$ is defined as
$W_\eta(\mathbb{P}, \mathbb{Q}) = \inf_{\gamma \in \Gamma(\mathbb{P}, \mathbb{Q})} \big\{ \mathbb{E}_{(z, z') \sim \gamma}[d(z, z')^p] + \eta\, H(\gamma \mid \mu \otimes \nu) \big\},$
where $\Gamma(\mathbb{P}, \mathbb{Q})$ is the set of all Borel probability distributions on $\mathcal{Z} \times \mathcal{Z}$ with marginal distributions $\mathbb{P}$ and $\mathbb{Q}$, $d(z, z')^p$ denotes the cost function, and $H(\gamma \mid \mu \otimes \nu)$ denotes the relative entropy (also called the Kullback–Leibler divergence, KL divergence) of $\gamma$ with respect to the product measure $\mu \otimes \nu$:
$H(\gamma \mid \mu \otimes \nu) = \mathbb{E}_{(z, z') \sim \gamma}\Big[ \log \frac{d\gamma(z, z')}{d\mu(z)\, d\nu(z')} \Big].$
When $\eta = 0$, the Sinkhorn distance $W_\eta(\mathbb{P}, \mathbb{Q})$ reduces to the classic Wasserstein metric; as $\eta$ goes to infinity, the entropic regularization term dominates, and the distance increasingly emphasizes the entropy of the coupling rather than the transport cost. In this paper, we study a distributionally robust conditional quantile prediction problem, which is formulated as a one-dimensional min–max optimization problem. This leads us to focus on $\mathcal{Z} \subseteq \mathbb{R}$, and we take the standard distance $d(z, z') = |z - z'|$.
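For intuition, the snippet below computes an entropy-regularized transport cost between two discrete distributions on the real line with cost $|z - z'|^p$ via Sinkhorn scaling iterations. It is our own illustration under the common choice of reference measures $\mu = \mathbb{P}$ and $\nu = \mathbb{Q}$ (i.e., the penalty $H(\gamma \mid \mathbb{P} \otimes \mathbb{Q})$ as in [26]), which differs from the reference measures used later in the paper.

```python
import numpy as np

def sinkhorn_distance(z, p_wts, zp, q_wts, eta, p=1, n_iter=500):
    """Entropy-regularized OT cost between discrete P and Q with H(gamma | P x Q) penalty."""
    C = np.abs(z[:, None] - zp[None, :]) ** p          # cost matrix |z - z'|^p
    K = np.outer(p_wts, q_wts) * np.exp(-C / eta)      # Gibbs kernel scaled by P x Q
    a = np.ones_like(p_wts)
    for _ in range(n_iter):                            # Sinkhorn scaling iterations
        b = q_wts / (K.T @ a)
        a = p_wts / (K @ b)
    gamma = a[:, None] * K * b[None, :]                # optimal entropic coupling
    kl = np.sum(gamma * np.log(gamma / np.outer(p_wts, q_wts)))
    return np.sum(gamma * C) + eta * kl

# Two small empirical distributions on the real line.
z = np.array([-1.0, 0.0, 1.0]); p_wts = np.array([0.2, 0.5, 0.3])
zp = np.array([-0.5, 0.8]);     q_wts = np.array([0.6, 0.4])
print(sinkhorn_distance(z, p_wts, zp, q_wts, eta=0.1, p=1))
```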
Based on the definition of the Sinkhorn distance, we introduce the Sinkhorn ball centered on a reference distribution to effectively capture and characterize the distributional ambiguity in our analysis.
Definition 2.
For a fixed distribution $\mathbb{P} \in \mathcal{P}(\mathcal{Z})$ and $r > 0$, the Sinkhorn ball $B_{r,\eta}(\mathbb{P})$ of radius $r$ centered at $\mathbb{P}$ is defined as
$B_{r,\eta}(\mathbb{P}) := \big\{ \mathbb{Q} \in \mathcal{P}(\mathcal{Z}) \ \big|\ W_\eta(\mathbb{P}, \mathbb{Q}) \le r \big\}. \qquad (4)$
Due to the convex entropic regularizer in $W_\eta(\mathbb{P}, \mathbb{Q})$ with respect to the second argument [38], the Sinkhorn distance $W_\eta(\mathbb{P}, \mathbb{Q})$ is convex in $\mathbb{Q}$, i.e., $W_\eta(\mathbb{P}, \lambda \mathbb{Q}_1 + (1-\lambda)\mathbb{Q}_2) \le \lambda W_\eta(\mathbb{P}, \mathbb{Q}_1) + (1-\lambda) W_\eta(\mathbb{P}, \mathbb{Q}_2)$ for all probability distributions $\mathbb{Q}_1, \mathbb{Q}_2 \in \mathcal{P}(\mathcal{Z})$ and all $0 \le \lambda \le 1$. Therefore, the Sinkhorn ball $B_{r,\eta}(\mathbb{P})$ is a convex set.

3.2. Sinkhorn Distributionally Robust Conditional Quantile Prediction with Fixed Design

Under the fixed design setting, the optimal solution obtained by the SAA method for conditional quantile prediction is not robust [9]. To obtain a robust yet not overly conservative solution, we employ the DRO approach for modeling conditional quantile prediction, building upon similar methodologies proposed in [9]. The corresponding mathematical formulation is presented as follows:
$\min_{\tilde\beta \in \mathbb{R}^p,\ s \in \mathbb{R}} \ \sup_{\mathbb{P}^Y \in \mathcal{P}^Y} \ \mathbb{E}_{\mathbb{P}^Y}\big[ \rho_\alpha\big(Y - \tilde\beta^\top x^f - s\big) \big], \qquad (5)$
where $\mathbb{P}^Y$ is a candidate distribution of $Y$ conditional on $x^f$, and $\mathcal{P}^Y$ is an ambiguity set constructed from the observed data points $\{(x_i^f, y_i)\}_{i=1}^N$ that are generated by model (1).
To transfer the distributional uncertainty from the response variable $Y$ to the residual term $\tilde\varepsilon$ and thereby simplify problem (5), we define $\tilde\varepsilon = Y - \tilde\beta^\top x^f = \varepsilon + (\beta - \tilde\beta)^\top x^f$. Thus, problem (5) can be rewritten as
$\min_{\tilde\beta \in \mathbb{R}^p,\ s \in \mathbb{R}} \ \sup_{\mathbb{P}^{\tilde\varepsilon} \in \mathcal{P}^{\tilde\varepsilon}} \ \mathbb{E}_{\mathbb{P}^{\tilde\varepsilon}}[\rho_\alpha(\tilde\varepsilon - s)],$
where $\mathbb{P}^{\tilde\varepsilon}$ and $\mathcal{P}^{\tilde\varepsilon}$ are, respectively, the candidate distribution and the ambiguity set of $\tilde\varepsilon$.
Note that $\mathcal{P}^{\tilde\varepsilon}$ depends on the unknown $\tilde\beta$. Following [9], we first estimate $\tilde\beta$ using the OLS estimator $\hat\beta_N^{\mathrm{OLS}}$, which can be expressed as
$\hat\beta_N^{\mathrm{OLS}} = \big( (X^f)^\top X^f \big)^{-1} (X^f)^\top y,$
where the $N$ samples $\{(x_i^f, y_i)\}_{i=1}^N$ are observed from model (1), $X^f$ is the fixed design matrix whose rows are $(x_1^f)^\top, \ldots, (x_N^f)^\top$, and $y = (y_1, \ldots, y_N)^\top$. Notably, although the row vectors $(x_1^f)^\top, \ldots, (x_N^f)^\top$ are not i.i.d., we still assume that the fixed design matrix $X^f$ satisfies Assumptions 1 and 2 in [9]; thus, $\hat\beta_N^{\mathrm{OLS}}$ is a consistent estimator of $\beta$.
We define $\varepsilon = \varepsilon + (\beta - \hat\beta_N^{\mathrm{OLS}})^\top x^f$ with an unknown distribution $\mathbb{P}^{\varepsilon}$, which has the corresponding samples $\{\varepsilon_i^{\mathrm{OLS}}\}_{i=1}^N$ with $\varepsilon_i^{\mathrm{OLS}} = y_i - (\hat\beta_N^{\mathrm{OLS}})^\top x_i^f$. Its empirical distribution $\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}}) = \frac{1}{N}\sum_{i=1}^N \delta_{\varepsilon_i^{\mathrm{OLS}}}$ leads to the Sinkhorn ball $B_{r,\eta}\big(\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})\big)$, where $B_{r,\eta}(\mathbb{P})$ is defined by (4). With this clarification, we now formally introduce our distributionally robust conditional quantile prediction problem based on the Sinkhorn ball:
$\min_{s \in \mathbb{R}} \ \sup_{\mathbb{P}^{\varepsilon} \in B_{r,\eta}(\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}}))} \ \mathbb{E}_{\mathbb{P}^{\varepsilon}}[\rho_\alpha(\varepsilon - s)], \qquad (6)$
where $\mathbb{P}^{\varepsilon}$ and $B_{r,\eta}\big(\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})\big)$ denote the candidate distribution and the Sinkhorn uncertainty set of $\varepsilon$, respectively. We refer to problem (6) as the Sinkhorn distributionally robust conditional quantile prediction (DRCQP) problem. In the remainder of this paper, we will continue to study this problem and present its tractable reformulation.
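In practice, setting up problem (6) only requires the OLS residual samples that define the center of the Sinkhorn ball. A minimal sketch of this preprocessing step is given below (our own code, with variable names that are ours).

```python
import numpy as np

def ols_residual_samples(X_f, y):
    """OLS fit on the fixed design and the residual samples eps_i^OLS
    that define the empirical center of the Sinkhorn ball."""
    beta_ols, *_ = np.linalg.lstsq(X_f, y, rcond=None)   # ((X^f)^T X^f)^{-1} (X^f)^T y
    eps_ols = y - X_f @ beta_ols                         # eps_i^OLS = y_i - beta_ols^T x_i^f
    return beta_ols, eps_ols

# Example: each residual sample carries empirical weight 1/N in the center distribution.
rng = np.random.default_rng(2)
X_f = rng.uniform(size=(200, 3))
y = X_f @ np.array([1.0, 2.0, 3.0]) + rng.standard_t(df=5, size=200)
beta_ols, eps_ols = ols_residual_samples(X_f, y)
weights = np.full(eps_ols.shape, 1.0 / eps_ols.size)
```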

4. Tractable Reformulation

In this section, we focus on the Sinkhorn DRCQP problem (6), which is intractable due to the inner maximization problem involving uncountable candidate distributions within the Sinkhorn ambiguity set. Thus, we provide a strong dual reformulation that effectively transforms this problem into a finite-dimensional optimization problem. Moreover, under the assumption of a finite support, problem (6) can be reformulated as an equivalent conic optimization problem, which facilitates its analysis and solution. Finally, we provide the theoretical connections between problem (6) and other formulations studied in the machine learning literature.
Following the discussion in Remark 2 of [27], we take the reference measure $\mu = \hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})$ and let $\nu$ be the Lebesgue measure. Other options for the reference measures are discussed in Remark 2 of [27]. For notational clarity, we define
$\bar r := r + \frac{\eta}{N} \sum_{i=1}^N \log \int_{\mathbb{R}} e^{-|\varepsilon_i^{\mathrm{OLS}} - \varepsilon|^p/\eta}\, d\varepsilon. \qquad (7)$
To reformulate problem (6), we make the following assumption:
Assumption 1.
(i) $\nu\big(\{\varepsilon : 0 \le |\varepsilon^{\mathrm{OLS}} - \varepsilon|^p < \infty\}\big) = 1$ for $\hat{\mathbb{P}}_N$-almost every $\varepsilon^{\mathrm{OLS}}$;
(ii) $\int_{\mathbb{R}} e^{-|\varepsilon^{\mathrm{OLS}} - \varepsilon|^p/\eta}\, d\varepsilon < \infty$ for $\hat{\mathbb{P}}_N$-almost every $\varepsilon^{\mathrm{OLS}}$;
(iii) $\mathcal{Z}$ is a measurable space, and the objective function in problem (6) is measurable;
(iv) Every joint distribution $\gamma$ on $\mathcal{Z} \times \mathcal{Z}$ whose first marginal is $\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})$ admits a regular conditional distribution $\gamma_{\varepsilon^{\mathrm{OLS}}}$, given that the value of the first marginal equals $\varepsilon^{\mathrm{OLS}}$.
Assumption 1 is commonly imposed to ensure that both the Sinkhorn distance $W_\eta(\mathbb{P}, \mathbb{Q})$ and the expected loss $\mathbb{E}_{\mathbb{P}^\varepsilon}[\rho_\alpha(\varepsilon - s)]$ are well-defined. For a more detailed discussion of this assumption, we refer to [27]. Given Assumption 1, we can derive a convex dual reformulation of problem (6).
Theorem 1.
Let the reference measure $\mu = \hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})$, let $\nu$ be the Lebesgue measure on $\mathbb{R}$, and let $\bar r \ge 0$. Under Assumption 1, problem (6) admits the following equivalent reformulation:
$\min_{s \in \mathbb{R},\ \lambda \ge 0} \ F(s, \lambda) := \lambda \bar r + \frac{\lambda \eta}{N} \sum_{i=1}^N \log \mathbb{E}_{Q_i(\varepsilon)}\big[ e^{\rho_\alpha(\varepsilon - s)/(\lambda \eta)} \big], \qquad (8)$
where $\bar r$ is given by (7), and the kernel probability distribution $Q_i(\varepsilon)$ is given by
$\frac{dQ_i(\varepsilon)}{d\varepsilon} = \frac{e^{-|\varepsilon_i^{\mathrm{OLS}} - \varepsilon|^p/\eta}}{\int_{\mathbb{R}} e^{-|\varepsilon_i^{\mathrm{OLS}} - \zeta|^p/\eta}\, d\zeta}, \quad i \in [N]. \qquad (9)$
Proof of Theorem 1.
We begin by addressing the inner maximization problem in (6),
$\sup_{\mathbb{P}^\varepsilon \in B_{r,\eta}(\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}}))} \ \mathbb{E}_{\mathbb{P}^\varepsilon}[\rho_\alpha(\varepsilon - s)]. \qquad (10)$
For a fixed $s \in \mathbb{R}$, problem (10) is equivalent to
$\sup_{\mathbb{P}^\varepsilon \in \mathcal{P}(\mathbb{R})} \ \mathbb{E}_{\mathbb{P}^\varepsilon}[\rho_\alpha(\varepsilon - s)] \quad \text{s.t.} \quad W_\eta\big(\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}}), \mathbb{P}^\varepsilon\big) \le r. \qquad (11)$
According to the definition of the Sinkhorn distance $W_\eta\big(\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}}), \mathbb{P}^\varepsilon\big)$ with $\nu$ taken as the Lebesgue measure on $\mathbb{R}$ (i.e., $d\nu(\varepsilon) = d\varepsilon$), problem (11) admits the following equivalent formulation:
$\sup_{\gamma \in \mathcal{P}(\mathbb{R} \times \mathbb{R})} \ \mathbb{E}_{\mathbb{P}^\varepsilon}[\rho_\alpha(\varepsilon - s)] \quad \text{s.t.} \quad \mathbb{E}_{(\varepsilon^{\mathrm{OLS}}, \varepsilon) \sim \gamma}\Big[ |\varepsilon^{\mathrm{OLS}} - \varepsilon|^p + \eta \log \frac{d\gamma(\varepsilon^{\mathrm{OLS}}, \varepsilon)}{d\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})\, d\varepsilon} \Big] \le r, \qquad (12)$
where $\varepsilon^{\mathrm{OLS}} \sim \hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})$ and $\varepsilon \sim \mathbb{P}^\varepsilon$. The joint distribution $\gamma(\varepsilon^{\mathrm{OLS}}, \varepsilon)$ has marginal distributions $\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})$ and $\mathbb{P}^\varepsilon$, respectively. According to the disintegration theorem in [39], we represent the joint distribution $\gamma$ such that $d\gamma(\varepsilon^{\mathrm{OLS}}, \varepsilon) = d\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})\, d\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)$ holds for any $(\varepsilon^{\mathrm{OLS}}, \varepsilon)$, where $\gamma_{\varepsilon^{\mathrm{OLS}}}$ is the conditional distribution of $\varepsilon$ given that the first marginal of $\gamma$ equals $\varepsilon^{\mathrm{OLS}}$. Thereby, the constraint in problem (12) is equivalent to
$\mathbb{E}_{\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})}\, \mathbb{E}_{\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)}\Big[ |\varepsilon^{\mathrm{OLS}} - \varepsilon|^p + \eta \log \frac{d\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)}{d\varepsilon} \Big] \le r. \qquad (13)$
Note that any feasible solution $\gamma$ satisfies $\gamma \ll \hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}}) \otimes \nu$, and hence, $\gamma_{\varepsilon^{\mathrm{OLS}}} \ll \nu$. Since $\nu$ is the Lebesgue measure on $\mathbb{R}$, the term $\log \frac{d\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)}{d\varepsilon}$ is well-defined. By applying the change-of-measure identity, we have
$\log \frac{d\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)}{d\varepsilon} = \log \frac{dQ_{\varepsilon^{\mathrm{OLS}}, \eta}(\varepsilon)}{d\varepsilon} + \log \frac{d\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)}{dQ_{\varepsilon^{\mathrm{OLS}}, \eta}(\varepsilon)},$
in which
$\frac{dQ_{\varepsilon^{\mathrm{OLS}}, \eta}(\varepsilon)}{d\varepsilon} = \frac{e^{-|\varepsilon^{\mathrm{OLS}} - \varepsilon|^p/\eta}}{\int_{\mathbb{R}} e^{-|\varepsilon^{\mathrm{OLS}} - \zeta|^p/\eta}\, d\zeta}.$
Thus, constraint (13) can be reformulated as
$\mathbb{E}_{\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})}\, \mathbb{E}_{\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)}\Big[ |\varepsilon^{\mathrm{OLS}} - \varepsilon|^p + \eta \log \frac{e^{-|\varepsilon^{\mathrm{OLS}} - \varepsilon|^p/\eta}}{\int_{\mathbb{R}} e^{-|\varepsilon^{\mathrm{OLS}} - \zeta|^p/\eta}\, d\zeta} + \eta \log \frac{d\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)}{dQ_{\varepsilon^{\mathrm{OLS}}, \eta}(\varepsilon)} \Big] \le r, \qquad (14)$
where $\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon) \in \mathcal{P}(\mathbb{R})$. For the integrand in the expectation term on the left-hand side of constraint (14), we can write
$\eta \log \frac{e^{-|\varepsilon^{\mathrm{OLS}} - \varepsilon|^p/\eta}}{\int_{\mathbb{R}} e^{-|\varepsilon^{\mathrm{OLS}} - \zeta|^p/\eta}\, d\zeta} = -|\varepsilon^{\mathrm{OLS}} - \varepsilon|^p - \eta \log \int_{\mathbb{R}} e^{-|\varepsilon^{\mathrm{OLS}} - \zeta|^p/\eta}\, d\zeta.$
By combining this with the first term $|\varepsilon^{\mathrm{OLS}} - \varepsilon|^p$, we find that (14) is equivalent to
$\eta\, \mathbb{E}_{\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})}\, \mathbb{E}_{\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)}\Big[ \log \frac{d\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)}{dQ_{\varepsilon^{\mathrm{OLS}}, \eta}(\varepsilon)} \Big] \le \bar r. \qquad (15)$
Similarly, the objective function of problem (10) can be written as
$\mathbb{E}_{\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})}\, \mathbb{E}_{\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)}\big[ \rho_\alpha(\varepsilon - s) \big].$
By introducing the Lagrange multiplier λ associated with constraint (15), we can reformulate problem (10) as
$\sup_{\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon) \in \mathcal{P}(\mathbb{R})} \ \inf_{\lambda \ge 0} \ \mathbb{E}_{\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})}\, \mathbb{E}_{\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)}\Big[ \rho_\alpha(\varepsilon - s) - \lambda \eta \log \frac{d\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)}{dQ_{\varepsilon^{\mathrm{OLS}}, \eta}(\varepsilon)} \Big] + \lambda \bar r. \qquad (16)$
Note that the objective function $\rho_\alpha(\varepsilon - s)$ is convex in $\varepsilon$ and the feasible region $B_{r,\eta}\big(\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})\big)$ is a convex set; hence, problem (10) is a convex problem, and thus strong duality holds. Therefore, under Assumption 1, according to Theorem 1 of [27], problem (16) is equivalent to
$\inf_{\lambda \ge 0} \ \sup_{\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon) \in \mathcal{P}(\mathbb{R})} \ \mathbb{E}_{\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})}\, \mathbb{E}_{\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)}\Big[ \rho_\alpha(\varepsilon - s) - \lambda \eta \log \frac{d\gamma_{\varepsilon^{\mathrm{OLS}}}(\varepsilon)}{dQ_{\varepsilon^{\mathrm{OLS}}, \eta}(\varepsilon)} \Big] + \lambda \bar r = \inf_{\lambda \ge 0} \ \lambda \bar r + \lambda \eta\, \mathbb{E}_{\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}})}\Big[ \log \mathbb{E}_{Q_{\varepsilon^{\mathrm{OLS}}, \eta}(\varepsilon)}\big[ e^{\rho_\alpha(\varepsilon - s)/(\lambda \eta)} \big] \Big]. \qquad (17)$
For $i = 1, \ldots, N$, given $\varepsilon^{\mathrm{OLS}} = \varepsilon_i^{\mathrm{OLS}}$, denote $Q_i := Q_{\varepsilon_i^{\mathrm{OLS}}, \eta}$, $i \in [N]$. One can verify that for any $(\varepsilon^{\mathrm{OLS}}, \varepsilon)$, there exist distributions $\{Q_i\}_{i=1}^N$ such that the joint distribution $\gamma := \frac{1}{N}\sum_{i=1}^N \delta_{\varepsilon_i^{\mathrm{OLS}}} \otimes Q_i$. As a result, problem (17) is equivalent to
$\inf_{\lambda \ge 0} \ \lambda \bar r + \frac{\lambda \eta}{N} \sum_{i=1}^N \log \mathbb{E}_{Q_i(\varepsilon)}\big[ e^{\rho_\alpha(\varepsilon - s)/(\lambda \eta)} \big].$
Therefore, problem (6) is equivalent to problem (8). This completes the proof. □
From a computational perspective, the dual reformulation (8) of the Sinkhorn DRCQP is non-trivial, as it can be viewed as a conditional stochastic optimization problem [36]. To solve it efficiently, Ref. [27] proposed an efficient stochastic mirror descent algorithm with biased subgradient estimators for Sinkhorn DRO problems, and Ref. [40] developed a nested SGD algorithm designed for large-scale regularized Sinkhorn DRO problems. Detailed discussions of sample complexity and convergence guarantees are provided in these works.
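To illustrate the structure of (8) numerically, the sketch below is our own plug-in implementation (not the algorithm of [27] or [40]): it evaluates $F(s, \lambda)$ by sampling from the kernels $Q_i$ (for $p = 2$, $Q_i$ is a Gaussian centered at $\varepsilon_i^{\mathrm{OLS}}$ with variance $\eta/2$; for $p = 1$, a Laplace distribution with scale $\eta$) and then minimizes the resulting biased estimate over $(s, \lambda)$ with a generic solver.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def check_loss(u, alpha):
    return np.where(u > 0, alpha * u, (alpha - 1.0) * u)

def dual_objective(s, lam, eps_ols, alpha, eta, r, p=1, m=2000, rng=None):
    """Plug-in Monte Carlo estimate of F(s, lambda) in (8)."""
    rng = rng or np.random.default_rng(0)
    N = eps_ols.size
    if p == 1:                                   # Q_i: Laplace(eps_i, scale=eta), normalizer 2*eta
        draws = rng.laplace(eps_ols[:, None], eta, size=(N, m))
        log_norm = np.log(2.0 * eta)
    else:                                        # p = 2: Q_i: Normal(eps_i, var=eta/2), normalizer sqrt(pi*eta)
        draws = rng.normal(eps_ols[:, None], np.sqrt(eta / 2.0), size=(N, m))
        log_norm = 0.5 * np.log(np.pi * eta)
    r_bar = r + eta * log_norm                   # r_bar in (7); the integral is the same for every i
    # log E_{Q_i}[exp(rho_alpha(eps - s)/(lam*eta))], estimated by a log-mean-exp over the draws
    log_mgf = logsumexp(check_loss(draws - s, alpha) / (lam * eta), axis=1) - np.log(m)
    return lam * r_bar + (lam * eta / N) * np.sum(log_mgf)

# Minimize the (biased) estimate over (s, lambda > 0) with a derivative-free solver.
eps_ols = np.random.default_rng(3).standard_t(df=5, size=200)
obj = lambda v: dual_objective(v[0], max(v[1], 1e-6), eps_ols, alpha=0.7, eta=0.05, r=0.2)
res = minimize(obj, x0=np.array([np.quantile(eps_ols, 0.7), 1.0]), method="Nelder-Mead")
s_hat, lam_hat = res.x
```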
Following Wang et al. [27], to distinguish between the cases $F(s, \lambda) < \infty$ and $F(s, \lambda) = \infty$, we introduce the following light-tail condition on the quantile loss function $\rho_\alpha(\varepsilon - s)$.
Assumption 2.
There exists $\tilde\lambda > 0$ such that $\mathbb{E}_{Q_i(\varepsilon)}\big[ e^{|\varepsilon - s|/(\tilde\lambda \eta)} \big] < \infty$ for $\hat{\mathbb{P}}_N$-almost every $\varepsilon^{\mathrm{OLS}}$.
Moreover, when the support $\mathcal{Z}$ is finite, the dual formulation (8) can be recast, under Assumption 2, as the following conic programming problem. The resulting formulation can be solved efficiently using off-the-shelf solvers such as CVX [41], which employ interior-point methods.
Theorem 2
(Conic reformulation for finite support). Suppose that the support $\mathcal{Z}$ contains $L_{\max}$ elements, i.e., $\mathcal{Z} = \{\varepsilon_j\}_{j=1}^{L_{\max}}$. If Assumption 2 holds and $\bar r \ge 0$, problem (8) can be formulated as the following conic optimization problem:
$\begin{aligned} \min_{\lambda, s, g, a} \ & \lambda \bar r + \frac{1}{N} \sum_{i=1}^N g_i \\ \text{s.t.} \ & \lambda \eta \ge \sum_{j=1}^{L_{\max}} q_{i,j}\, a_{i,j}, \quad i \in [N], \\ & \big( \lambda \eta,\ a_{i,j},\ \rho_\alpha(\varepsilon_j - s) - g_i \big) \in \mathcal{K}_{\exp}, \quad i \in [N],\ j \in [L_{\max}], \\ & \lambda \ge 0,\ s \in \mathbb{R},\ g \in \mathbb{R}^N,\ a \in \mathbb{R}^{N \times L_{\max}}, \end{aligned}$
where $q_{i,j} := \mathbb{P}_{Q_i(\varepsilon)}\{\varepsilon = \varepsilon_j\}$ with the distribution $Q_i(\varepsilon)$ defined in (9), and $\mathcal{K}_{\exp}$ denotes the exponential cone $\mathcal{K}_{\exp} = \big\{ (\nu, \lambda, \delta) \in \mathbb{R}_+ \times \mathbb{R}_+ \times \mathbb{R} : \exp(\delta/\nu) \le \lambda/\nu \big\}$.
Proof of Theorem 2.
We now introduce epigraphical variables $g_i$, $i = 1, \ldots, N$, to reformulate the dual problem (8) as
$\min_{s, \lambda, g} \ \lambda \bar r + \frac{1}{N} \sum_{i=1}^N g_i \quad \text{s.t.} \quad \lambda \eta \log \mathbb{E}_{Q_i(\varepsilon)}\big[ e^{\rho_\alpha(\varepsilon - s)/(\lambda \eta)} \big] \le g_i, \ i \in [N], \quad s \in \mathbb{R},\ \lambda > 0,\ g_i \in \mathbb{R}.$
For a fixed $i$, the $i$-th constraint can be reformulated as
$\exp\Big( \frac{g_i}{\lambda \eta} \Big) \ge \mathbb{E}_{Q_i(\varepsilon)}\big[ e^{\rho_\alpha(\varepsilon - s)/(\lambda \eta)} \big] \ \Longleftrightarrow\ 1 \ge \mathbb{E}_{Q_i(\varepsilon)}\big[ e^{(\rho_\alpha(\varepsilon - s) - g_i)/(\lambda \eta)} \big] \ \Longleftrightarrow\ \lambda \eta \ge \mathbb{E}_{Q_i(\varepsilon)}\big[ \lambda \eta\, e^{(\rho_\alpha(\varepsilon - s) - g_i)/(\lambda \eta)} \big] \ \Longleftrightarrow\ \begin{cases} \lambda \eta \ge \sum_{j=1}^{L_{\max}} q_{i,j}\, a_{i,j}, \\ a_{i,j} \ge \lambda \eta \exp\Big( \dfrac{\rho_\alpha(\varepsilon_j - s) - g_i}{\lambda \eta} \Big), \ \forall j, \end{cases}$
where the second set of constraints can be written as
$\big( \lambda \eta,\ a_{i,j},\ \rho_\alpha(\varepsilon_j - s) - g_i \big) \in \mathcal{K}_{\exp}.$
Substituting this expression into the dual objective completes the proof. □
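For a finite support, the conic program in Theorem 2 can be written down almost verbatim with an exponential-cone-capable modeling tool. The paper uses CVX in MATLAB [41]; the sketch below is our own CVXPY translation, in which an auxiliary variable t_{ij} upper-bounds the check loss so that all cone arguments remain affine (valid because the exponential is increasing in that argument).

```python
import cvxpy as cp
import numpy as np

def sinkhorn_drcqp_finite_support(support, q, r_bar, eta, alpha):
    """Exponential-cone formulation of Theorem 2.
    support: (L,) grid of points eps_j;  q: (N, L) kernel weights q_{ij}."""
    N, L = q.shape
    s, lam = cp.Variable(), cp.Variable(nonneg=True)
    g, a, t = cp.Variable(N), cp.Variable((N, L)), cp.Variable((N, L))
    cons = []
    for i in range(N):
        # lam * eta >= sum_j q_{ij} * a_{ij}
        cons.append(lam * eta >= q[i] @ a[i])
        for j in range(L):
            u = support[j] - s
            # t_{ij} >= rho_alpha(eps_j - s); the cone constraint is monotone in this argument
            cons += [t[i, j] >= alpha * u, t[i, j] >= (alpha - 1) * u]
            # (lam*eta) * exp((t_{ij} - g_i) / (lam*eta)) <= a_{ij}, i.e. membership in K_exp
            cons.append(cp.constraints.ExpCone(t[i, j] - g[i], lam * eta, a[i, j]))
    prob = cp.Problem(cp.Minimize(lam * r_bar + cp.sum(g) / N), cons)
    prob.solve()          # any exponential-cone solver, e.g. Clarabel, ECOS, or MOSEK
    return s.value, lam.value, prob.value

# Tiny example with a 5-point support and uniform kernel weights (purely illustrative).
support = np.linspace(-2, 2, 5)
q = np.full((3, 5), 0.2)
s_hat, lam_hat, val = sinkhorn_drcqp_finite_support(support, q, r_bar=0.3, eta=0.1, alpha=0.7)
```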
The following remarks elucidate the connection between the Sinkhorn DRCQP problem (8) and the DRCQP problems based on other ambiguity sets.
Remark 1
(Connection with Wasserstein DRCQP). As the regularization parameter $\eta \to 0$, formulation (6) reduces to
$\min_{s} \ \sup_{\mathbb{P}^\varepsilon \in \mathcal{P}(\varepsilon)} \ \mathbb{E}_{\mathbb{P}^\varepsilon}[\rho_\alpha(\varepsilon - s)],$
where $\mathcal{P}(\varepsilon) = \big\{ \mathbb{P}^\varepsilon : W\big(\hat{\mathbb{P}}_N(\varepsilon^{\mathrm{OLS}}), \mathbb{P}^\varepsilon\big)^p \le r \big\}$ is the ambiguity set based on the Wasserstein distance $W(\mathbb{P}, \mathbb{Q})^p = \inf_{\gamma \in \Gamma(\mathbb{P}, \mathbb{Q})} \mathbb{E}_{(z, z') \sim \gamma}[d(z, z')^p]$. Furthermore, the dual objective of the Sinkhorn DRCQP problem (8) converges to
$\lambda r + \frac{1}{N} \sum_{i=1}^N \sup_{\varepsilon \in \mathbb{R}} \big\{ \rho_\alpha(\varepsilon - s) - \lambda |\varepsilon_i^{\mathrm{OLS}} - \varepsilon|^p \big\}.$
For the complete proof, we refer to Remark 3 in [27]. The derivations and simplified results for the cases $p = 1$ and $p > 1$ are provided in [9] and [37], respectively.
Remark 2
(Connection with KL DRCQP). Applying Jensen's inequality, we obtain an upper bound for the dual objective function of the Sinkhorn DRCQP:
$\lambda \bar r + \lambda \eta \log \Big( \frac{1}{N} \sum_{i=1}^N \mathbb{E}_{Q_i(\varepsilon)}\big[ e^{\rho_\alpha(\varepsilon - s)/(\lambda \eta)} \big] \Big), \qquad (20)$
which corresponds to the dual objective of the following KL DRO problem [19]:
$\min_{s} \ \sup_{\mathbb{P}^\varepsilon \in \mathcal{P}(\varepsilon)} \ \mathbb{E}_{\mathbb{P}^\varepsilon}[\rho_\alpha(\varepsilon - s)], \qquad (21)$
where $\mathcal{P}(\varepsilon) = \big\{ \mathbb{P}^\varepsilon : D_{\mathrm{KL}}(\mathbb{P}^\varepsilon \,\|\, \mathbb{P}_0) \le \bar r / \eta \big\}$ is the ambiguity set based on the KL divergence $D_{\mathrm{KL}}(\mathbb{P}^\varepsilon \,\|\, \mathbb{P}_0) = \mathbb{E}_{\mathbb{P}^\varepsilon}\big[ \log \frac{d\mathbb{P}^\varepsilon(\varepsilon)}{d\mathbb{P}_0(\varepsilon)} \big]$, and $\mathbb{P}_0$ satisfies $d\mathbb{P}_0(\varepsilon) = \frac{1}{N} \sum_{i=1}^N dQ_i(\varepsilon)$.
Proof of Remark 2.
Recall that Assumption 2 requires the quantile loss $\rho_\alpha(\varepsilon - s)$ to have a light tail under the distribution $Q_i(\varepsilon)$ for $\hat{\mathbb{P}}_N$-almost every $\varepsilon^{\mathrm{OLS}}$, which satisfies Assumption 1 in [18]. Consequently, by applying Theorem 1 in [18], problem (21) is equivalent to the following one-layer optimization problem:
$\min_{\lambda' \ge 0} \ \lambda' \frac{\bar r}{\eta} + \lambda' \log \mathbb{E}_{\mathbb{P}_0(\varepsilon)}\big[ e^{\rho_\alpha(\varepsilon - s)/\lambda'} \big].$
Substituting $\lambda' = \lambda \eta$ and $d\mathbb{P}_0(\varepsilon) = \frac{1}{N} \sum_{i=1}^N dQ_i(\varepsilon)$, we obtain problem (20). This completes the proof. □
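As a quick sanity check of Remark 2, one can verify numerically that the Jensen-type expression (20) upper-bounds the Sinkhorn dual objective (8) for any fixed $(s, \lambda)$. The sketch below is our own illustration on a finite support with made-up kernel weights.

```python
import numpy as np
from scipy.special import logsumexp

def check_loss(u, alpha):
    return np.where(u > 0, alpha * u, (alpha - 1.0) * u)

def sinkhorn_vs_kl_bound(support, q, s, lam, eta, r_bar, alpha):
    """Return (Sinkhorn dual objective (8), its Jensen upper bound (20)) on a finite support."""
    N = q.shape[0]
    scaled = check_loss(support - s, alpha) / (lam * eta)        # rho_alpha(eps_j - s)/(lam*eta)
    log_mgf_i = logsumexp(np.log(q) + scaled[None, :], axis=1)   # log E_{Q_i}[exp(.)], one per i
    sinkhorn_obj = lam * r_bar + lam * eta / N * np.sum(log_mgf_i)
    kl_bound = lam * r_bar + lam * eta * logsumexp(log_mgf_i - np.log(N))
    return sinkhorn_obj, kl_bound

support = np.linspace(-3, 3, 7)
q = np.random.default_rng(4).dirichlet(np.ones(7), size=5)       # 5 kernels Q_i on 7 support points
print(sinkhorn_vs_kl_bound(support, q, s=0.1, lam=1.0, eta=0.2, r_bar=0.3, alpha=0.7))
```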

5. Numerical Experiments

In Section 5.1, our numerical experiments on the simulated dataset are presented, and the real-world dataset is addressed in Section 5.2. First, for the conditional quantile prediction problem under the fixed design setting, we compare the out-of-sample performance and computational time of the Sinkhorn DRO method used in this paper with the SAA method [8], other DRO methods (such as Wasserstein DRO [9,37], KL DRO [19]), and other machine learning techniques, like robust quantile regression with Catoni’s log-truncated loss [42,43] and Cauchy-truncated loss [44]. Second, we conduct experiments to examine the impact of varying the entropy regularization parameter η on the out-of-sample performance of our Sinkhorn DRCQP problem.
In this paper, we adopt two metrics to evaluate the out-of-sample performance of all models. The first metric is the out-of-sample expected cost achieved by the optimal solution $\hat s_N$ to the conditional quantile prediction problem, obtained via the SAA method, the DRO method, and the machine learning methods, denoted as $J_{\mathrm{saa}}$, $J_{\mathrm{dro}}$, and $J_{\mathrm{ml}}$, respectively. Additionally, for the different DRO methods, their corresponding out-of-sample expected costs are denoted as $J_{\mathrm{sink}}^p$ (Sinkhorn DRO), $J_{\mathrm{wass}}^p$ (Wasserstein DRO), and $J_{\mathrm{kl}}$ (KL DRO), where $p$ represents the order of the cost function in the Sinkhorn and Wasserstein distances. For the different machine learning methods, the corresponding out-of-sample expected costs are denoted as $J_{\mathrm{cat}}$ (Catoni's log-truncated robust quantile regression) and $J_{\mathrm{cau}}$ (Cauchy-truncated robust quantile regression), respectively. Furthermore, the optimal solution $\hat s_N$ for each model can be expressed as follows.
  • Conditional quantile prediction problem based on the SAA method [8]:
    $(\hat\beta_N, \hat s_N) = \arg\min_{\beta, s} \ \frac{1}{N} \sum_{i=1}^N \rho_\alpha\big(y_i - \beta^\top x_i - s\big).$
  • KL DRCQP problem [19]:
    $(\hat s_N, \hat\lambda_N) = \arg\min_{s, \lambda} \ \lambda \log\Big( \frac{1}{N} \sum_{i=1}^N e^{\rho_\alpha(\varepsilon_i^{\mathrm{OLS}} - s)/\lambda} \Big) + \lambda r.$
  • Type-1 Wasserstein DRCQP problem [9]:
    $(\hat s_N, \hat\lambda_N) = \arg\min_{s,\ \lambda \ge \max\{\alpha, 1-\alpha\}} \ \lambda r + \frac{1}{N} \sum_{i=1}^N \rho_\alpha(\varepsilon_i^{\mathrm{OLS}} - s).$
  • Type-p Wasserstein ($p \ge 2$) DRCQP problem [37]:
    $(\hat s_N, \hat\lambda_N) = \arg\min_{s,\ \lambda \ge 0} \ \lambda r + \frac{1}{N} \sum_{i=1}^N \rho_\alpha(\varepsilon_i^{\mathrm{OLS}} - s) + \lambda^{-\frac{1}{p-1}} (p-1) \Big[ \big(\tfrac{\alpha}{p}\big)^{\frac{p}{p-1}} I_1 + \big(\tfrac{1-\alpha}{p}\big)^{\frac{p}{p-1}} I_2 \Big],$
    where $I_1 = \frac{1}{N} \sum_{i=1}^N \mathbf{1}\{\varepsilon_i^{\mathrm{OLS}} > s + a\}$, $I_2 = 1 - I_1$, $a = (\lambda p)^{-\frac{1}{p-1}} \big(1 - \tfrac{1}{p}\big) \big[ (1-\alpha)^{\frac{p}{p-1}} - \alpha^{\frac{p}{p-1}} \big]$,
    and here $\rho_\alpha(u) = u\big(\alpha - \mathbf{1}(u \le a)\big) = \begin{cases} \alpha u, & \text{if } u > a, \\ (\alpha - 1)u, & \text{if } u \le a. \end{cases}$
  • Catoni’s log-truncated robust quantile regression [43]:
    $(\hat\beta_N, \hat s_N) = \arg\min_{\beta, s} \ \frac{1}{N\kappa} \sum_{i=1}^N \Psi\Big( \kappa\, \rho_\alpha\big(y_i - \beta^\top x_i - s\big) \Big),$
    where $\kappa > 0$ is a robustification parameter to be tuned, and the non-decreasing influence function $\Psi : \mathbb{R} \to \mathbb{R}$ is
    $\Psi(u) = \begin{cases} \log\big(1 + u + \tfrac{u^2}{2}\big), & \text{if } u > 0, \\ -\log\big(1 - u + \tfrac{u^2}{2}\big), & \text{if } u \le 0. \end{cases}$
  • Cauchy-truncated robust quantile regression [44]:
    $(\hat\beta_N, \hat s_N) = \arg\min_{\beta, s} \ \frac{1}{N} \sum_{i=1}^N \Phi_\kappa\Big( \rho_\alpha\big(y_i - \beta^\top x_i - s\big) \Big),$
    where $\kappa > 0$, and the truncation function $\Phi_\kappa : \mathbb{R} \to \mathbb{R}$ is $\Phi_\kappa(u) = \kappa \log\big(1 + \tfrac{u}{\kappa}\big).$
The second metric is the improvement achieved by the DRO strategy relative to the SAA approach, calculated as $(J_{\mathrm{dro}} - J_{\mathrm{saa}})/J_{\mathrm{saa}}$ [9].
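For reference, under the type-1 Wasserstein formulation above as we read it, $\lambda$ enters the objective only through $\lambda r$, so the optimal multiplier is $\hat\lambda_N = \max\{\alpha, 1-\alpha\}$ and the minimization over $s$ is the standard empirical quantile problem. The sketch below (our own code) computes this benchmark, the out-of-sample expected cost $J$, and the improvement metric used in the tables.

```python
import numpy as np

def check_loss(u, alpha):
    return np.where(u > 0, alpha * u, (alpha - 1.0) * u)

def wass1_drcqp(eps_train, alpha, r):
    """Type-1 Wasserstein DRCQP: lambda = max(alpha, 1-alpha), s = empirical alpha-quantile."""
    lam_hat = max(alpha, 1.0 - alpha)
    s_hat = np.quantile(eps_train, alpha)
    worst_case_value = lam_hat * r + check_loss(eps_train - s_hat, alpha).mean()
    return s_hat, lam_hat, worst_case_value

def out_of_sample_cost(s_hat, eps_test, alpha):
    """Out-of-sample expected check loss J achieved by the prediction s_hat."""
    return check_loss(eps_test - s_hat, alpha).mean()

def improvement(j_dro, j_saa):
    """Relative improvement of a DRO strategy over SAA: (J_dro - J_saa) / J_saa."""
    return (j_dro - j_saa) / j_saa

rng = np.random.default_rng(5)
eps_train, eps_test = rng.standard_t(5, 100), rng.standard_t(5, 5000)
s_w1, lam_w1, _ = wass1_drcqp(eps_train, alpha=0.7, r=0.2)
j_w1 = out_of_sample_cost(s_w1, eps_test, alpha=0.7)
# improvement(j_w1, j_saa) reproduces the percentage reported in parentheses in the tables,
# once j_saa has been computed for the SAA prediction in the same way.
```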

5.1. Simulation

In this section, we evaluate the performance of our proposed method on a simulated dataset. We generate the dataset under the fixed design setting, following a setup similar to [45]. Define $m_i : [0, 1] \to \mathbb{R}$, $i = 1, 2, 3$, as follows:
$m_1(x) = \sin(5x), \quad m_2(x) = e^{-5x}(5x)^3, \quad \text{and} \quad m_3(x) = \frac{1}{5x + 1} + \sin(5x),$
and the response variable is generated according to the model
$y_i = m_1(x_i) + 2\, m_2(x_i) + 3\, m_3(x_i) + \varepsilon_i, \quad i = 1, 2, \ldots, N,$
where $x_i = i/n$, $\beta = (1, 2, 3)^\top$, and the $\varepsilon_i$ are i.i.d. samples drawn from a heavy-tailed distribution (e.g., the Pareto distribution or Student's t-distribution). According to the above model, a total of 2000 groups of sample data were generated. The first 1500 groups of samples were used as the training set, and retraining was performed every 50 data points, with a total of 30 training sessions. The last 500 groups of samples served as the test set.
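A sketch of the data-generating process just described is given below (our own code; the noise is drawn from either a Student's t or a classical Pareto distribution, as in the experiments).

```python
import numpy as np

def m1(x): return np.sin(5 * x)
def m2(x): return np.exp(-5 * x) * (5 * x) ** 3
def m3(x): return 1.0 / (5 * x + 1.0) + np.sin(5 * x)

def simulate_fixed_design(n=2000, noise="t", df=5, pareto_shape=2.0, seed=0):
    """Generate the fixed-design data y_i = m1(x_i) + 2*m2(x_i) + 3*m3(x_i) + eps_i."""
    rng = np.random.default_rng(seed)
    x = np.arange(1, n + 1) / n                       # deterministic (fixed) design points
    X = np.column_stack([m1(x), m2(x), m3(x)])        # design matrix; true beta = (1, 2, 3)
    if noise == "t":
        eps = rng.standard_t(df, size=n)
    else:                                             # classical Pareto(shape, scale=1) noise
        eps = 1.0 + rng.pareto(pareto_shape, size=n)
    y = X @ np.array([1.0, 2.0, 3.0]) + eps
    return X, y

X, y = simulate_fixed_design()
X_train, y_train = X[:1500], y[:1500]                 # first 1500 points for (repeated) training
X_test, y_test = X[1500:], y[1500:]                   # last 500 points for testing
```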
Our simulation experiment consists of two parts: Section 5.1.1 presents a comprehensive performance comparison between the Sinkhorn DRCQP problem and the other five problems. The comparison includes the out-of-sample expected cost and the computational time for solving each problem. Section 5.1.2 evaluates the impact of the entropy regularization parameter η on the final results of the Sinkhorn DRO.

5.1.1. Comparison of Out-of-Sample Performance and Computational Time

In this section, we compare the out-of-sample performance of six methods on the simulated dataset under different configurations of the distribution of ε , the quantile α , the order of the cost function p, and the entropy regularization parameter η , with a fixed radius of the ambiguity set r = 0.2 . Figure 1 and Figure 2 illustrate the out-of-sample performance at intervals of 50 data points. Table 1 and Table 2 present the comparative out-of-sample performance and the computational time of six methods for the training sample sizes N = 100 , 800, and 1500, respectively.
Our results show that the proposed method outperforms the five traditional methods in most cases. As the training sample size increases, the out-of-sample expected costs of the SAA method gradually converge, while the out-of-sample costs of the DRO methods remain consistently stable and exhibit excellent performance. Although the out-of-sample performance of the two machine learning methods (denoted $J_{\mathrm{cat}}$ and $J_{\mathrm{cau}}$; $\kappa = 1$) is stable, their out-of-sample expected costs are higher than those of the DRO methods in most cases. Additionally, the performance improvement of the DRO methods over the SAA method decreases as the training sample size grows. Among the three DRO methods, Sinkhorn DRO achieves the largest improvement over the SAA method, and the gap is substantial. For example, when $\varepsilon \sim P(2, 1)$, $\alpha = 0.7$, and $N = 1500$, the improvement of $J_{\mathrm{sink}}^1$ over $J_{\mathrm{saa}}$ is $9.40\%$, whereas the improvement of $J_{\mathrm{wass}}^1$ over $J_{\mathrm{saa}}$ is only $3.33\%$.
In Table 2, the total computation time required for each method is recorded as the sample size increases from 1 to 1500 at intervals of 50 data points. As shown in Table 2, although the Sinkhorn DRO ranks second in terms of the total computation time among the six methods, its out-of-sample performance is the best in most cases. Therefore, it combines computational efficiency and prediction accuracy. In contrast, while the KL DRO has a shorter computation time, its out-of-sample performance is inferior to that of the Sinkhorn DRO method. Meanwhile, although the Wasserstein DRO delivers comparable predictive accuracy to Sinkhorn DRO, its computational time ranges from 14 to 53 times longer, rendering it less efficient computationally. Notably, the SAA method shows both the poorest predictive performance and the lowest computational efficiency among all methods in most cases, indicating that it is not well-suited for solving conditional quantile prediction problems under the fixed design settings.

5.1.2. Comparison of the Impacts of Parameter Settings

In this section, we analyze the effect of the entropy regularization parameter η on the out-of-sample performance of the Sinkhorn DRO method under the fixed design settings.
As shown in Figure 3, regardless of whether $\varepsilon$ follows $t(5)$ or $P(5, 1)$ (both heavy-tailed), the Sinkhorn DRO method can achieve better predictive performance than both Wasserstein DRO and KL DRO through appropriate tuning of the entropy regularization parameter $\eta$. Specifically, when $\varepsilon \sim t(5)$, $J_{\mathrm{sink}}^1$ reaches its minimum value of 0.4199 at $\eta = 0.1$, while $J_{\mathrm{sink}}^2$ achieves its minimum of 0.4243 at $\eta = 0.08$. When $\varepsilon \sim P(5, 1)$, both $J_{\mathrm{sink}}^1$ and $J_{\mathrm{sink}}^2$ attain their respective minima of 0.1025 and 0.1023 at $\eta = 0.3$. Furthermore, as $\eta \to 0$, the results of $J_{\mathrm{sink}}$ approach those of $J_{\mathrm{wass}}$, which aligns with the theoretical results in Remark 1. As $\eta$ continues to increase, $J_{\mathrm{sink}}$ first reaches its optimal value and then gradually deteriorates, though it still maintains a noticeable advantage over $J_{\mathrm{kl}}$.

5.2. Real-World Applications

In this section, we validate the performance of our proposed method using a real-world dataset. Our goal is to predict the hourly total demand for public bicycles in Seoul, South Korea, taking into account external factors such as temperature and humidity. By predicting the quantiles of the stochastic demand process, we can better manage bicycle inventory, ensure a stable supply of rental bicycles, and improve the comfort of public transportation. Section 5.2.1 provides an overview of the real-world dataset and its reconstruction steps. In Section 5.2.2, we evaluate the out-of-sample performance of Sinkhorn DRO against the following: (i) the SAA baseline, (ii) two DRO benchmarks (Wasserstein DRO and KL DRO), and (iii) two robust quantile regression methods (with Catoni's log-truncated loss and Cauchy-truncated loss). In addition, Section 5.2.3 evaluates the impact of relevant parameters on the final results, providing guidance for effective parameter tuning.

5.2.1. Data Selection and Reconstruction

The data used in the real-world applications were sourced from the UCI Machine Learning Repository, established in [46]. It includes a real dataset of the hourly rental count of public bicycles in Seoul, South Korea (which serves as the response variable, denoted as Y), along with corresponding weather data and information about holidays. In total, there are 8760 valid instances (valid means that the hourly rental count of public bicycles is not zero). To account for the influence of multicollinearity, we ultimately retained seven covariates to construct the design matrix X f . The specific information about these covariates is shown in Table 3.
Since the observations are time series data, and covariates, such as temperature, humidity, and wind speed, tend to be closely correlated within a few consecutive days, the observations should not be treated as i.i.d. data. This means we need to consider the fixed design setting. The specific steps we take to process the dataset are as follows.
  • Step 1: Construct a linear regression problem on the preprocessed dataset and use the coefficients estimated using the OLS method as the coefficients for potential true linear relationships, namely β 0 ;
  • Step 2: Discard the demand column of the dataset and retain only the columns of the covariates. Then, generate new demand observations $y_i$ by adding i.i.d. simulated noise $\varepsilon_i$ following a normal distribution. Specifically, $y_i = \beta_0^\top x_i + \varepsilon_i$;
  • Step 3: Divide the observations into training and testing sets based on time periods. Starting from the 8000-th observation, the next 500 time periods are used as the test set. For any given sample size N, the training set consists of the N observations immediately preceding the start of the test set;
  • Step 4: For each observation in the test set, i.i.d. noises are simulated 50 times, resulting in a total of 500 × 50 = 25,000 data points for the test set. These data points are used to evaluate the predictive performance of each method.
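A compact sketch of Steps 1–4 is given below (our own code; the file path and column names are placeholders for the Seoul bike-sharing data described in Table 3).

```python
import numpy as np
import pandas as pd

def rebuild_dataset(csv_path, covariate_cols, target_col, sigma, n_test=500, n_reps=50, seed=0):
    """Steps 1-4: OLS coefficients beta_0 from the raw data, regenerated responses with
    Gaussian noise, a time-based train/test split, and repeated noise draws for the test set."""
    rng = np.random.default_rng(seed)
    df = pd.read_csv(csv_path)
    df = df[df[target_col] > 0]                              # keep only the valid (nonzero) hours
    X = df[covariate_cols].to_numpy(dtype=float)
    y_raw = df[target_col].to_numpy(dtype=float)

    beta0, *_ = np.linalg.lstsq(X, y_raw, rcond=None)        # Step 1: "true" coefficients beta_0
    y_new = X @ beta0 + rng.normal(0.0, sigma, size=len(X))  # Step 2: regenerated responses

    start = 8000                                             # Step 3: time-based split
    X_test, X_pool, y_pool = X[start:start + n_test], X[:start], y_new[:start]
    # Step 4: n_reps i.i.d. noise draws per test covariate vector -> 500 x 50 evaluation points
    y_test = X_test @ beta0 + rng.normal(0.0, sigma, size=(n_reps, n_test))
    return beta0, X_pool, y_pool, X_test, y_test

# Usage with hypothetical column names (the retained covariates are listed in Table 3):
# beta0, X_pool, y_pool, X_test, y_test = rebuild_dataset(
#     "SeoulBikeData.csv",
#     ["Temperature", "Humidity", "Wind speed", "Visibility",
#      "Dew point temperature", "Solar Radiation", "Rainfall"],
#     "Rented Bike Count", sigma=0.2)
```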

5.2.2. Comparison of Out-of-Sample Performance and Computational Time

In this section, we compare the out-of-sample performance of the six methods under different configurations of the quantile $\alpha$, the standard deviation of the error term $\sigma$, the order of the cost function $p$, and the entropy regularization parameter $\eta$, with a fixed radius of the ambiguity set $r = 0.05$. Figure 4 and Figure 5 illustrate the out-of-sample expected costs at intervals of 50 data points. Table 4 presents the comparative out-of-sample performance of the six methods for sample sizes $N = 100$, 550, and 1500, with $(J_{\mathrm{dro}} - J_{\mathrm{saa}})/J_{\mathrm{saa}}$ shown in parentheses. Our results show that the proposed method outperforms the SAA method, the two DRO benchmarks (Wasserstein DRO and KL DRO), and the two robust quantile regression methods (with Catoni's log-truncated loss and Cauchy-truncated loss) in most cases.
Specifically, we observe that while the SAA method maintains some convergence for large samples, its out-of-sample performance is poor for small samples. This is mainly because the data points in the training and testing sets are not randomly generated and thus fail to meet the i.i.d. assumption, so the performance guarantee of the SAA method no longer applies under the fixed design setting [9]. Moreover, while both Wasserstein DRO and KL DRO can be employed to formulate conditional quantile problems under the fixed design setting and demonstrate relatively stable performance, in most scenarios, their out-of-sample performance is slightly inferior to that of the Sinkhorn DRO method. In addition, compared with robust quantile regression using Catoni's log-truncated loss and the Cauchy-truncated loss (with $\kappa = 1$), Sinkhorn DRO is not outperformed by either method in most cases, and both robust regression techniques in turn outperform the traditional quantile regression method.

5.2.3. Comparison of the Impacts of Parameter Settings

In this section, we analyze the effect of the entropy regularization parameter η on the out-of-sample performance of our Sinkhorn DRCQP problem.
Overall, for fixed $\alpha, \sigma, N, p, r$, we compute the corresponding out-of-sample expected cost $J_{\mathrm{sink}}$ and compare it with $J_{\mathrm{wass}}$ and $J_{\mathrm{kl}}$ under the same settings. Figure 6a and Figure 7a show that when $p = 1$, $J_{\mathrm{sink}}$ exhibits an increasing trend as $\eta$ increases. Figure 6b and Figure 7b show that when $p = 2$, $J_{\mathrm{sink}}$ first decreases and subsequently increases as $\eta$ increases.
Specifically, for the case $p = 2$, when $\eta = 0.005$, the values of $J_{\mathrm{sink}}$ (0.0997 in Figure 6b and 0.0947 in Figure 7b) are remarkably close to those of $J_{\mathrm{wass}}$ (0.0999 in Figure 6b and 0.0948 in Figure 7b), which validates Remark 1 in this paper. As $\eta$ increases to 0.045, $J_{\mathrm{sink}}$ achieves its minimum values of 0.0948 in Figure 6b and 0.0916 in Figure 7b. When $\eta > 0.045$, $J_{\mathrm{sink}}$ deteriorates rapidly but consistently remains lower than $J_{\mathrm{kl}}$ (0.1394 in Figure 6b and 0.1440 in Figure 7b). Generally speaking, Figure 6b and Figure 7b clearly demonstrate that by selecting appropriate values of $\eta$, the Sinkhorn DRCQP can outperform both the classic Wasserstein- and the KL divergence-based DRCQP approaches.

6. Conclusions

This paper proposes a Sinkhorn DRCQP problem under the fixed design setting. We demonstrate the tractability of the proposed model by deriving its strong dual formulation, which takes the form of an exponential cone programming problem in the finite support case. Finally, we conduct numerical experiments to evaluate the performance and computational time of the proposed method in comparison with five conventional methods, including SAA, Wasserstein DRO, KL DRO, Catoni's log-truncated robust quantile regression, and Cauchy-truncated robust quantile regression. The experimental results demonstrate the superior performance of our approach when suitable values of the entropy regularization parameter $\eta$ are chosen.

Author Contributions

Conceptualization, G.J.; methodology, G.J. and T.M.; software, G.J.; validation, G.J. and T.M.; formal analysis, G.J.; investigation, T.M.; resources, T.M.; data curation, G.J.; writing—original draft preparation, G.J.; writing—review and editing, T.M.; visualization, G.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 12371476), and by the Key Program for Youth Innovation of the University of Science and Technology of China.

Data Availability Statement

The data that support the analysis of this study are openly available in https://doi.org/10.24432/C5F62R, accessed on 28 February 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
i.i.d.: Independent and identically distributed
SAA: Sample average approximation
DRO: Distributionally robust optimization
KL: Kullback–Leibler
OLS: Ordinary least squares
DRCQP: Distributionally robust conditional quantile prediction

References

  1. Koenker, R.; Bassett, G. Regression Quantiles. Econometrica 1978, 46, 33–50. [Google Scholar] [CrossRef]
  2. Foster, N. The Impact of Trade Liberalisation on Economic Growth: Evidence from a Quantile Regression Analysis. Kyklos 2008, 61, 543–567. [Google Scholar] [CrossRef]
  3. Hong, H.G.; Christiani, D.C.; Li, Y. Quantile regression for survival data in modern cancer research: Expanding statistical tools for precision medicine. Precis. Clin. Med. 2019, 2, 90–99. [Google Scholar] [CrossRef] [PubMed]
  4. Reich, B.J. Spatiotemporal quantile regression for detecting distributional changes in environmental processes. J. R. Stat. Soc. Ser. C Appl. Stat. 2012, 61, 535–553. [Google Scholar] [CrossRef]
  5. Basel Committee on Banking Supervision: Revisions to the Basel II Market Risk Framework. 2009. Available online: https://www.bis.org/publ/bcbs158.htm (accessed on 13 July 2009).
  6. Ban, G.Y.; Rudin, C. The big data newsvendor: Practical insights from machine learning. Oper. Res. 2019, 67, 90–108. [Google Scholar] [CrossRef]
  7. Rosset, S.; Tibshirani, R.J. From fixed-x to random-x regression: Bias-variance decompositions, covariance penalties, and prediction error estimation. J. Am. Stat. Assoc. 2018, 115, 138–151. [Google Scholar] [CrossRef]
  8. Kleywegt, A.J.; Shapiro, A.; Homem-de-Mello, T. The sample average approximation method for stochastic discrete optimization. SIAM J. Optim. 2002, 12, 479–502. [Google Scholar] [CrossRef]
  9. Qi, M.; Cao, Y.; Shen, Z.J. Distributionally robust conditional quantile prediction with fixed-design. Manag. Sci. 2022, 68, 1639–1658. [Google Scholar] [CrossRef]
  10. Scarf, H.E. Studies in the mathematical theory of inventory and production. In A Min–Max Solution of an Inventory Problem; Arrow, K.J., Karlin, S., Scarf, H.E., Eds.; Stanford University Press: Stanford, CA, USA, 1958; pp. 201–209. Available online: https://www.rand.org/pubs/papers/P910.html (accessed on 28 February 2025).
  11. Delage, E.; Ye, Y. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Oper. Res. 2010, 58, 595–612. [Google Scholar] [CrossRef]
  12. Zymler, S.; Kuhn, D.; Rustem, B. Distributionally robust joint chance constraints with second-order moment information. Math. Program. 2013, 137, 167–198. [Google Scholar] [CrossRef]
  13. Wiesemann, W.; Kuhn, D.; Sim, M. Distributionally robust convex optimization. Oper. Res. 2014, 62, 1358–1376. [Google Scholar] [CrossRef]
  14. Popescu, I. A semidefinite programming approach to optimal-moment bounds for convex classes of distributions. Math. Oper. Res. 2005, 30, 632–657. [Google Scholar] [CrossRef]
  15. Van Parys, B.P.; Goulart, P.J.; Kuhn, D. Generalized gauss inequalities via semidefinite programming. Math. Program. 2015, 156, 271–302. [Google Scholar] [CrossRef]
  16. Natarajan, K.; Song, M.; Teo, C.P. Persistency model and its applications in choice modeling. Manag. Sci. 2009, 55, 453–469. [Google Scholar] [CrossRef]
  17. Agrawal, S.; Ding, Y.; Saberi, A.; Ye, Y. Price of correlations in stochastic optimization. Oper. Res. 2012, 60, 150–162. [Google Scholar] [CrossRef]
  18. Hu, Z.; Hong, L.J. Kullback-Leibler Divergence Constrained Distributionally Robust Optimization. 2012. Available online: https://optimization-online.org/?p=12225 (accessed on 23 November 2012).
  19. Ben-Tal, A.; den Hertog, D.; De Waegenaere, A.; Melenberg, B.; Rennen, G. Robust solutions of optimization problems affected by uncertain probabilities. Manag. Sci. 2013, 59, 341–357. [Google Scholar] [CrossRef]
  20. Bayraksan, G.; Love, D.K. Data-driven stochastic programming using phi-divergences. INFORMS TutORials Oper. Res. 2015, 11, 1–19. [Google Scholar] [CrossRef]
  21. Kuhn, D.; Esfahani, P.M.; Nguyen, V.A.; Shafieezadeh-Abadeh, S. Wasserstein distributionally robust optimization: Theory and applications in machine learning. INFORMS TutORials Oper. Res. 2019, 15, 130–166. [Google Scholar] [CrossRef]
  22. Blanchet, J.; Murthy, K. Quantifying distributional model risk via optimal transport. Math. Oper. Res. 2019, 44, 565–600. [Google Scholar] [CrossRef]
  23. Gao, R.; Kleywegt, A. Distributionally robust stochastic optimization with Wasserstein distance. Math. Oper. Res. 2022, 48, 603–655. [Google Scholar] [CrossRef]
  24. Staib, M.; Jegelka, S. Distributionally robust optimization and generalization in kernel methods. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 819, pp. 9134–9144. [Google Scholar] [CrossRef]
  25. Sinkhorn, R. A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Stat. 1964, 35, 876–879. [Google Scholar] [CrossRef]
  26. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’13), New York, NY, USA, 5–10 December 2013; Volume 2, pp. 2292–2300. [Google Scholar] [CrossRef]
  27. Wang, J.; Gao, R.; Xie, Y. Sinkhorn Distributionally Robust Optimization. 2021. Available online: https://arxiv.org/abs/2109.11926 (accessed on 26 March 2025).
  28. Azizian, W.; Iutzeler, F.; Malick, J. Regularization for wasserstein distributionally robust optimization. ESAIM Control Optim. Calc. Var. 2023, 29, 29–33. [Google Scholar] [CrossRef]
  29. Courty, N.; Flamary, R.; Tuia, D. Domain adaptation with regularized optimal transport. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Nancy, France, 15–19 September 2014; Volume 8724, pp. 274–289. [Google Scholar] [CrossRef]
  30. Courty, N.; Flamary, R.; Tuia, D.; Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1853–1865. [Google Scholar] [CrossRef] [PubMed]
  31. Luise, G.; Rudi, A.; Pontil, M.; Ciliberto, C. Differential properties of sinkhorn approximation for learning with wasserstein distance. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 2–8 December 2018; Volume 31, pp. 5864–5874. [Google Scholar] [CrossRef]
  32. Patrini, G.; Van den Berg, R.; Forre, P.; Carioni, M.; Bhargav, S.; Welling, M.; Genewein, T.; Nielsen, F. Sinkhorn Autoencoders. In Proceedings of the 35th Uncertainty in Artificial Intelligence Conference, Tel Aviv, Israel, 22–25 July 2019; Volume 115, pp. 733–743. [Google Scholar]
  33. Lin, T.; Fan, C.; Ho, N.; Cuturi, M.; Jordan, M. Projection robust wasserstein distance and riemannian optimization. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Volume 787, pp. 9383–9397. [Google Scholar] [CrossRef]
  34. Wang, J.; Gao, R.; Xie, Y. Two-sample test using projected wasserstein distance. In Proceedings of the 2021 IEEE International Symposium on Information Theory (ISIT), Melbourne, Australia, 12–20 July 2021; pp. 3320–3325. [Google Scholar] [CrossRef]
  35. Wang, J.; Gao, R.; Xie, Y. Two-sample test with kernel projected wasserstein distance. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, 28–30 March 2022; Volume 151, pp. 3315–3320. [Google Scholar]
  36. Hu, Y.; Chen, X.; He, N. Sample complexity of sample average approximation for conditional stochastic optimization. SIAM J. Optim. 2020, 30, 2103–2133. [Google Scholar] [CrossRef]
  37. Cheng, B.; Xie, X. Distributionally Robust Conditional Quantile Prediction with Wasserstein Ball. JUSTC 2023, accepted. [Google Scholar]
  38. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006; pp. 25–54. [Google Scholar]
  39. Chang, J.T.; Pollard, D. Conditioning as disintegration. Stat. Neerl. 1997, 51, 287–317. [Google Scholar] [CrossRef]
  40. Yang, Y.; Zhou, Y.; Lu, Z. A Stochastic Algorithm for Sinkhorn Distance-Regularized Distributionally Robust Optimization. In Proceedings of the OPT2024: 16th Annual Workshop on Optimization for Machine Learning, Vancouver, BC, Canada, 15 December 2024. [Google Scholar]
  41. Grant, M.; Boyd, S. CVX:Matlab Software for Disciplined Convex Programming. Available online: https://cvxr.com/cvx/ (accessed on 1 January 2020).
  42. Catoni, O. Challenging the empirical mean and empirical variance: A deviation study. Ann. l’IHP Probab. Stat. 2012, 48, 1148–1185. [Google Scholar] [CrossRef]
  43. Xu, L.; Yao, F.; Yao, Q.; Zhang, H. Non-asymptotic guarantees for robust statistical learning under infinite variance assumption. J. Mach. Learn. Res. 2023, 24, 1–46. [Google Scholar] [CrossRef]
  44. Xu, Y.; Zhu, S.; Yang, S.; Zhang, C.; Jin, R.; Yang, T. Learning with non-convex truncated losses by SGD. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI 2019), Tel Aviv, Israel, 22–25 July; Volume 115, pp. 701–711. [CrossRef]
  45. Furer, D.; Kohler, M.; Krzyżak, A. Fixed-design regression estimation based on real and artificial data. J. Nonparametr. Stat. 2013, 25, 223–241. [Google Scholar] [CrossRef]
  46. Dua, D.; Graff, C. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/dataset/560/seoul+bike+sharing+demand (accessed on 29 February 2020).
Figure 1. Out-of-sample performance of six methods under varying configurations with Student's t-distributed ε: (a) α = 0.2, p = 1, η = 0.01, and ε ∼ t(2); (b) α = 0.7, p = 2, η = 0.05, and ε ∼ t(2); (c) α = 0.2, p = 1, η = 0.05, and ε ∼ t(5); (d) α = 0.7, p = 2, η = 0.05, and ε ∼ t(5).
Figure 2. Out-of-sample performance of six methods under varying configurations with Pareto-distributed ε: (a) α = 0.2, p = 1, η = 0.05, and ε ∼ P(2,1); (b) α = 0.7, p = 2, η = 0.05, and ε ∼ P(2,1); (c) α = 0.2, p = 1, η = 0.01, and ε ∼ P(5,1); (d) α = 0.7, p = 2, η = 0.4, and ε ∼ P(5,1).
Figure 3. Out-of-sample performance of three DRO methods for α = 0.7 and r = 0.2 under varying configurations: (a) p = 1 and ε ∼ t(5); (b) p = 2 and ε ∼ t(5); (c) p = 1 and ε ∼ P(5,1); (d) p = 2 and ε ∼ P(5,1).
Figure 4. Out-of-sample performance of six methods for α = 0.2 and p = 1 under varying σ and η: (a) σ = 0.02, η = 0.005; (b) σ = 0.2, η = 0.01; (c) σ = 2, η = 0.8.
Figure 5. Out-of-sample performance of six methods for α = 0.7 and p = 2 under varying σ and η: (a) σ = 0.02, η = 1; (b) σ = 0.2, η = 0.05; (c) σ = 2, η = 0.5.
Figure 6. Out-of-sample performance for α = 0.2, σ = 0.2, N = 500, and r = 0.2: (a) p = 1; (b) p = 2.
Figure 7. Out-of-sample performance for α = 0.7, σ = 0.2, N = 500, and r = 0.2: (a) p = 1; (b) p = 2.
Table 1. Out-of-sample performance of six methods.

Noise | α | N | J_saa | J_cat | J_cau | J_wass^1 | J_wass^2 | J_sink^1 | J_sink^2 | J_kl
t(2) | 0.2 | 100 | 7.78e+04 | 0.630 | 0.612 | 0.574 (−99.99%) | 0.574 (−99.99%) | 0.559 (−99.99%) | 0.559 (−99.99%) | 0.603 (−99.99%)
 | | 800 | 1.060 | 0.630 | 0.612 | 0.587 (−44.63%) | 0.586 (−44.68%) | 0.564 (−46.76%) | 0.559 (−47.31%) | 0.603 (−43.12%)
 | | 1500 | 0.664 | 0.630 | 0.612 | 0.587 (−11.61%) | 0.581 (−12.55%) | 0.564 (−14.99%) | 0.559 (−15.87%) | 0.603 (−9.18%)
 | 0.7 | 100 | 1.75e+04 | 0.688 | 0.660 | 0.667 (−99.99%) | 0.657 (−99.99%) | 0.648 (−99.99%) | 0.635 (−99.99%) | 0.676 (−99.99%)
 | | 800 | 3.842 | 0.688 | 0.660 | 0.664 (−82.71%) | 0.661 (−82.80%) | 0.648 (−83.14%) | 0.635 (−83.48%) | 0.676 (−82.41%)
 | | 1500 | 0.680 | 0.688 | 0.660 | 0.665 (−2.23%) | 0.659 (−3.14%) | 0.648 (−4.75%) | 0.635 (−6.65%) | 0.676 (−0.62%)
t(5) | 0.2 | 100 | 8.50e+03 | 0.389 | 0.384 | 0.363 (−99.99%) | 0.371 (−99.99%) | 0.348 (−99.99%) | 0.348 (−99.99%) | 0.362 (−99.99%)
 | | 800 | 8.755 | 0.389 | 0.384 | 0.353 (−95.97%) | 0.355 (−95.94%) | 0.348 (−96.03%) | 0.348 (−96.03%) | 0.362 (−95.86%)
 | | 1500 | 0.364 | 0.389 | 0.384 | 0.351 (−3.42%) | 0.358 (−1.53%) | 0.348 (−4.40%) | 0.348 (−4.40%) | 0.362 (−0.42%)
 | 0.7 | 100 | 2.86e+03 | 0.433 | 0.425 | 0.428 (−99.99%) | 0.423 (−99.99%) | 0.418 (−99.99%) | 0.418 (−99.99%) | 0.427 (−99.99%)
 | | 800 | 6.885 | 0.433 | 0.425 | 0.421 (−93.88%) | 0.422 (−93.87%) | 0.418 (−93.93%) | 0.418 (−93.93%) | 0.427 (−93.79%)
 | | 1500 | 0.436 | 0.433 | 0.425 | 0.421 (−3.59%) | 0.424 (−2.94%) | 0.418 (−4.33%) | 0.418 (−4.31%) | 0.427 (−2.09%)
P(2,1) | 0.2 | 100 | 2.16e+03 | 0.196 | 0.194 | 0.193 (−99.99%) | 0.193 (−99.99%) | 0.183 (−99.99%) | 0.183 (−99.99%) | 0.194 (−99.99%)
 | | 800 | 0.673 | 0.196 | 0.194 | 0.193 (−71.38%) | 0.193 (−71.38%) | 0.183 (−72.82%) | 0.183 (−72.82%) | 0.194 (−71.11%)
 | | 1500 | 0.194 | 0.196 | 0.194 | 0.193 (−0.97%) | 0.193 (−0.97%) | 0.183 (−5.95%) | 0.183 (−5.95%) | 0.194 (−0.01%)
 | 0.7 | 100 | 4.38e+04 | 0.519 | 0.511 | 0.508 (−99.99%) | 0.508 (−99.99%) | 0.476 (−99.99%) | 0.476 (−99.99%) | 0.513 (−99.99%)
 | | 800 | 0.542 | 0.519 | 0.511 | 0.508 (−6.32%) | 0.508 (−6.31%) | 0.476 (−12.20%) | 0.476 (−12.20%) | 0.513 (−5.33%)
 | | 1500 | 0.525 | 0.519 | 0.511 | 0.508 (−3.33%) | 0.508 (−3.31%) | 0.476 (−9.40%) | 0.476 (−9.40%) | 0.513 (−2.30%)
P(5,1) | 0.2 | 100 | 1.06e+03 | 0.050 | 0.049 | 0.047 (−99.99%) | 0.049 (−99.99%) | 0.046 (−99.99%) | 0.046 (−99.99%) | 0.050 (−99.99%)
 | | 800 | 0.225 | 0.050 | 0.049 | 0.047 (−78.85%) | 0.049 (−78.28%) | 0.046 (−79.56%) | 0.046 (−79.56%) | 0.050 (−77.66%)
 | | 1500 | 0.050 | 0.050 | 0.049 | 0.047 (−5.89%) | 0.049 (−3.36%) | 0.046 (−9.04%) | 0.046 (−9.04%) | 0.050 (−0.57%)
 | 0.7 | 100 | 1.43e+03 | 0.121 | 0.126 | 0.107 (−99.99%) | 0.109 (−99.99%) | 0.102 (−99.99%) | 0.102 (−99.99%) | 0.108 (−99.99%)
 | | 800 | 0.130 | 0.121 | 0.126 | 0.107 (−17.92%) | 0.105 (−19.66%) | 0.102 (−21.46%) | 0.102 (−21.53%) | 0.108 (−16.95%)
 | | 1500 | 0.116 | 0.121 | 0.126 | 0.107 (−7.60%) | 0.105 (−9.62%) | 0.102 (−11.58%) | 0.102 (−11.67%) | 0.108 (−6.51%)

Here, t(ζ) denotes the Student's t-distribution with ζ degrees of freedom, and P(a, b) denotes the Pareto distribution with shape parameter a and scale parameter b. The values in parentheses in our tables are all calculated using the formula (J_dro − J_saa)/J_saa.
Table 2. Total computational time (in seconds) of six methods.

Noise | α | SAA | CAT | CAU | 1-WDRO | 2-WDRO | 1-SDRO | 2-SDRO | KL-DRO
t(2) | 0.2 | 2.11e+02 | 3.35e+02 | 5.39e+02 | 1.80e+02 | 1.05e+02 | 4.76e+00 | 3.23e+00 | 1.90e−01
 | 0.7 | 2.48e+02 | 3.17e+02 | 2.27e+02 | 1.53e+02 | 9.85e+01 | 5.39e+00 | 2.14e+00 | 2.00e−01
t(5) | 0.2 | 4.18e+02 | 3.40e+02 | 5.13e+02 | 9.43e+01 | 8.89e+01 | 4.58e+00 | 2.12e+00 | 1.70e−01
 | 0.7 | 3.90e+02 | 3.35e+02 | 4.00e+02 | 9.96e+01 | 9.18e+01 | 4.75e+00 | 2.59e+00 | 1.90e−01
P(2,1) | 0.2 | 3.26e+02 | 2.91e+02 | 3.16e+02 | 1.29e+02 | 1.38e+02 | 5.08e+00 | 3.17e+00 | 1.50e−01
 | 0.7 | 3.81e+02 | 1.99e+02 | 2.44e+02 | 9.82e+01 | 1.01e+02 | 4.34e+00 | 1.95e+00 | 2.00e−01
P(5,1) | 0.2 | 2.88e+02 | 3.38e+02 | 5.05e+02 | 9.67e+01 | 1.03e+02 | 6.62e+00 | 2.37e+00 | 1.50e−01
 | 0.7 | 3.62e+02 | 3.62e+02 | 3.44e+02 | 8.92e+01 | 9.19e+01 | 5.85e+00 | 3.97e+00 | 1.70e−01

CAT, CAU, 1-WDRO, 2-WDRO, 1-SDRO, and 2-SDRO represent Catoni's log-truncated robust quantile regression, Cauchy-truncated robust quantile regression, type-1 Wasserstein DRO, type-2 Wasserstein DRO, type-1 Sinkhorn DRO, and type-2 Sinkhorn DRO, respectively.
Table 3. Summary of covariates.

Feature | Type | Value Range | Unit
Temperature | Numeric | (−17.8, 39.4) | °C
Humidity | Numeric | (0, 98) | %
Wind speed | Numeric | (0, 7.4) | m/s
Visibility | Numeric | (270, 20000) | m
Dew point temperature | Numeric | (−30.6, 27.2) | °C
Solar radiation | Numeric | (0, 3.52) | MJ/m²
Rainfall | Numeric | (0, 35) | mm
Table 4. Out-of-sample performance of six methods.

σ | α | N | J_saa | J_cat | J_cau | J_wass^1 | J_wass^2 | J_sink^1 | J_sink^2 | J_kl
0.02 | 0.2 | 100 | 0.031 | 0.008 | 0.008 | 0.006 (−81.91%) | 0.006 (−81.91%) | 0.006 (−82.13%) | 0.006 (−82.12%) | 0.007 (−76.73%)
 | | 550 | 0.014 | 0.008 | 0.008 | 0.006 (−60.53%) | 0.006 (−60.53%) | 0.006 (−60.99%) | 0.006 (−60.97%) | 0.007 (−49.20%)
 | | 1500 | 0.009 | 0.008 | 0.008 | 0.006 (−38.39%) | 0.006 (−38.39%) | 0.006 (−39.11%) | 0.006 (−39.07%) | 0.007 (−20.71%)
 | 0.7 | 100 | 0.041 | 0.008 | 0.008 | 0.007 (−83.03%) | 0.007 (−83.03%) | 0.007 (−83.26%) | 0.007 (−83.26%) | 0.007 (−81.91%)
 | | 550 | 0.007 | 0.008 | 0.008 | 0.007 (0.41%) | 0.007 (0.41%) | 0.007 (−0.59%) | 0.007 (−0.43%) | 0.007 (7.45%)
 | | 1500 | 0.008 | 0.008 | 0.008 | 0.007 (−7.97%) | 0.007 (−7.97%) | 0.007 (−8.77%) | 0.007 (−8.73%) | 0.007 (−1.40%)
0.2 | 0.2 | 100 | 0.312 | 0.080 | 0.080 | 0.057 (−81.85%) | 0.073 (−76.72%) | 0.057 (−81.92%) | 0.058 (−81.57%) | 0.082 (−73.89%)
 | | 550 | 0.143 | 0.080 | 0.080 | 0.057 (−60.39%) | 0.075 (−47.31%) | 0.056 (−60.93%) | 0.058 (−59.76%) | 0.082 (−32.75%)
 | | 1500 | 0.092 | 0.080 | 0.080 | 0.057 (−38.18%) | 0.071 (−22.19%) | 0.056 (−39.04%) | 0.058 (−37.18%) | 0.082 (−18.73%)
 | 0.7 | 100 | 0.385 | 0.080 | 0.080 | 0.071 (−81.67%) | 0.090 (−76.71%) | 0.070 (−82.00%) | 0.070 (−81.97%) | 0.089 (−76.97%)
 | | 550 | 0.178 | 0.080 | 0.080 | 0.071 (−60.20%) | 0.081 (−54.40%) | 0.070 (−60.87%) | 0.070 (−60.81%) | 0.089 (−49.93%)
 | | 1500 | 0.097 | 0.080 | 0.080 | 0.070 (−27.51%) | 0.081 (−16.63%) | 0.070 (−28.61%) | 0.070 (−28.49%) | 0.089 (−8.65%)
2 | 0.2 | 100 | 1.715 | 0.677 | 0.687 | 0.567 (−66.94%) | 0.565 (−67.08%) | 0.558 (−67.48%) | 0.558 (−67.48%) | 0.601 (−64.93%)
 | | 550 | 0.758 | 0.677 | 0.687 | 0.567 (−25.19%) | 0.568 (−25.04%) | 0.558 (−26.38%) | 0.558 (−26.35%) | 0.601 (−20.65%)
 | | 1500 | 0.603 | 0.677 | 0.687 | 0.567 (−6.01%) | 0.567 (−5.97%) | 0.558 (−7.48%) | 0.558 (−7.48%) | 0.601 (−0.27%)
 | 0.7 | 100 | 4.153 | 0.696 | 0.696 | 0.704 (−83.06%) | 0.707 (−82.98%) | 0.695 (−83.27%) | 0.694 (−83.29%) | 0.745 (−82.06%)
 | | 550 | 1.663 | 0.696 | 0.696 | 0.701 (−57.85%) | 0.705 (−57.61%) | 0.693 (−58.30%) | 0.694 (−58.26%) | 0.745 (−55.19%)
 | | 1500 | 0.857 | 0.696 | 0.696 | 0.700 (−18.27%) | 0.702 (−18.04%) | 0.694 (−18.98%) | 0.694 (−18.98%) | 0.745 (−13.01%)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
