Constrained Reweighting of Distributions: An Optimal Transport Approach

We commonly encounter the problem of identifying an optimally weight-adjusted version of the empirical distribution of observed data, adhering to predefined constraints on the weights. Such constraints often manifest as restrictions on the moments, tail behavior, shapes, number of modes, etc., of the resulting weight-adjusted empirical distribution. In this article, we substantially enhance the flexibility of such a methodology by introducing a nonparametrically imbued distributional constraint on the weights and developing a general framework leveraging the maximum entropy principle and tools from optimal transport. The key idea is to ensure that the maximum entropy weight-adjusted empirical distribution of the observed data is close to a pre-specified probability distribution in terms of the optimal transport metric, while allowing for subtle departures. The proposed scheme for the re-weighting of observations subject to constraints is reminiscent of the empirical likelihood and related ideas, but offers greater flexibility in applications where parametric distribution-guided constraints arise naturally. The versatility of the proposed framework is demonstrated in the context of three disparate applications where data re-weighting is warranted to satisfy side constraints on the optimization problem at the heart of the statistical task—namely, portfolio allocation, semi-parametric inference for complex surveys, and ensuring algorithmic fairness in machine learning algorithms.


Introduction
The maximum entropy principle (Shannon, 1948; Jaynes, 1957) states that in situations characterized by uncertainty and limited prior-knowledge-guided constraints, the optimal choice among all feasible probability distributions is the one that is least informative, or most uniformly spread. This idea is at the heart of numerous statistical tasks and has permeated every corner of modern machine learning research. Prominent instances of such constrained entropy maximization include applications in image reconstruction (Skilling and Bryan, 1984), ill-posed inverse problems (Gamboa and Gassiat, 1997), portfolio optimization (Bera and Park, 2008), generalised method of moments models (Chib et al., 2018), natural language processing (Gudivada, 2018), network analysis (Magrans de Abril et al., 2018), and reinforcement learning (Eysenbach and Levine, 2021), to name a few. We refer readers to Kardar (2007) and Cover and Thomas (2012) for book-length reviews.
For maximum entropy inference, the constraints imposed on the probability distributions frequently pertain to moments (Chib et al., 2018), tail characteristics (Einmahl et al., 2008), distributional shapes (Chernozhukov et al., 2023), modal counts, and similar properties. In many cases, however, constructing constraints with the desired level of flexibility is challenging, if not infeasible; refer to Sections 3 and 5 for specific examples in the context of inference in complex surveys and moment-condition-based portfolio optimization, respectively. On a related note, a recent article (Chakraborty et al., 2023) introduced a flexible framework for placing more elaborate constraints on probability distributions in the context of robust Bayesian inference.
In this article, we offer a novel solution to this problem by introducing a probability distribution-guided constrained entropy maximization framework that not only offers versatility but also enhances the interpretability of the inferential output. The main idea is to ensure that a weight-adjusted empirical distribution of the observed data closely aligns with a predetermined family of probability distributions, as measured through a statistical distance (Rachev et al., 2007). Importantly, the family of probability distributions is potentially continuous, whereas any weight-adjusted empirical distribution of the observed data is discrete. This rules out many common statistical discrepancies, e.g., the Kullback-Leibler divergence, total variation, and Hellinger's distance, as devices for placing the probability distribution-guided constraints. In practice, the choice of statistical distance requires care and should be tailored to the application of interest. For homogeneity of exposition across all scenarios in this article, we consider the Wasserstein metric (Villani, 2003; Santambrogio, 2015).
The idea of data re-weighting is of course not new. Wang et al. (2017) suggested raising the likelihood of individual observations to data-driven weights in order to conduct robust inference under mild model misspecification. Wen et al. (2014) proposed a data re-weighting scheme to align the data with a different target distribution, enabling inference under covariate shift. Other compelling ideas involving re-weighting can be found in fair learning (Yan et al., 2022), natural language processing (Ramas et al., 2022), variational tempering (Mandt et al., 2016), etc. Complementing the existing literature, we propose a versatile data re-weighting framework, borrowing from the maximum entropy principle and optimal transport, that proves useful in a multitude of statistical tasks.
The rest of the paper is organised as follows. The general framework of the proposed probability distribution-guided constrained entropy maximization is motivated and introduced in Section 2. Sections 3, 4, and 5 present applications of our methodology in the context of semi-parametric inference in complex surveys, ensuring demographic parity in machine learning algorithms, and entropy-based portfolio optimization, respectively. Finally, we conclude with a discussion.

General Framework
Let $[a]$ denote the set of integers $\{1, \ldots, a\}$. Let $\Omega$ denote the set of all possible discrete distributions $\omega$ with atoms $s = (s_1, \ldots, s_m)^T$. The entropy of the discrete probability distribution $\omega = \sum_{i=1}^{m} w_i \delta_{s_i}(\cdot)$ is $H_m(w) = -\sum_{i=1}^{m} w_i \log w_i$, where $\delta$ is the Dirac delta function. The entropy $H_m(w)$ is a measure of randomness that is maximized at the discrete uniform distribution with $w_i = 1/m$ for all $i$. In many statistical tasks, the core challenge consists of optimizing a functional $F : \Omega \to \Omega'$ with respect to $\omega$ subject to a constraint $\omega \in \Omega_0 \,(\subset \Omega)$. A simple example is when $s = (s_1, \ldots, s_m)^T$ is the observed sample itself. Then, the set $\Omega$ is simply characterised by the class of weighted empirical distributions of the observed data, $\{\sum_{i=1}^{m} w_i \delta_{s_i}(\cdot) : w \in S^{m-1}\}$. In the sequel, we shall see more general examples where the constraint set $\Omega_0$ can be identified with a subset of an $(m-1)$-dimensional probability simplex $S^{m-1} = \{w : \sum_{i=1}^{m} w_i = 1,\ w_i > 0,\ i \in [m]\}$, for some $m \in \{1, 2, \ldots\}$.
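As a small numerical illustration of the entropy on the simplex, the sketch below evaluates the Shannon entropy $-\sum_i w_i \log w_i$ of two weight vectors and compares them with the maximal value $\log m$. The function name `entropy` and the example weights are illustrative assumptions, not part of the original development.

```python
import math

def entropy(w):
    """Shannon entropy -sum_i w_i log w_i of a weight vector on the simplex."""
    if any(wi < 0 for wi in w) or abs(sum(w) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")
    # Convention: 0 * log 0 = 0, so zero weights are skipped.
    return -sum(wi * math.log(wi) for wi in w if wi > 0)

m = 5
uniform = [1.0 / m] * m              # w_i = 1/m, the maximum entropy weights
skewed = [0.6, 0.1, 0.1, 0.1, 0.1]   # hypothetical re-weighting

h_uniform = entropy(uniform)         # equals log(m)
h_skewed = entropy(skewed)           # strictly smaller than log(m)
```

The gap between `h_uniform` and `h_skewed` quantifies how far a re-weighting departs from the unweighted empirical distribution.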
Given $s = (s_1, \ldots, s_n)^T$, parametric inference consists of approximating the empirical distribution $(1/n) \sum_{i=1}^{n} \delta_{s_i}(\cdot)$ via a parametric family of distributions $\{f_\theta : \theta \in \Theta\}$ and learning the parameter $\theta$ from data. Such procedures often fall prey to model misspecification (White, 1982), leading to untrustworthy inference. To avoid complete model specification, a popular class of semi-parametric approaches (Hall, 2005) operates under the milder assumption that the weight-adjusted empirical distribution $\sum_{i=1}^{n} w_i \delta_{s_i}(\cdot)$ satisfies moment restrictions of the form $\sum_{i=1}^{n} w_i g(s_i, \theta) = 0$, where $g$ is a vector of known functions on $\mathbb{R}^d \times \Theta$. In numerous instances, achieving such moment-based constraints with the intended degree of flexibility proves to be arduous, if not practically impossible; we elaborate on this in the sequel. To that end, in this article, we offer a middle ground between the fully parametric and semi-parametric moment condition models that allows for flexible modeling assumptions while enjoying coherent interpretability similar to parametric inference. We propose to operate under a restriction of the form $D\big(\sum_{i=1}^{n} w_i \delta_{s_i}(\cdot), f_\theta\big) \le \varepsilon$, where $D$ is a statistical distance and $\varepsilon$ is a user-defined hyper-parameter. Our goal is to infer $\theta$ while allowing for mild deviations from the parametric model $f_\theta$, with $\varepsilon$ measuring the maximum allowable discrepancy.
Inference under moment condition models often consists of computing the maximum entropy weight-adjusted empirical distribution of $(s_1, \ldots, s_n)^T$ that satisfies some prespecified moment conditions (Chib et al., 2018, 2021). That is, for every $\theta \in \Theta$, we calculate $\sum_{i=1}^{n} w_i^\star(\theta)\, \delta_{s_i}(\cdot)$, where $w^\star(\theta) = \arg\max_{w \in S^{n-1}} H_n(w)$ subject to $\sum_{i=1}^{n} w_i g(s_i, \theta) = 0$. Under the proposed framework, we too appeal to the maximum entropy principle and compute the maximum entropy weight-adjusted empirical distribution of $s$ that satisfies the parametric distribution-guided constraint. That is, for every $\theta \in \Theta$, we calculate $\sum_{i=1}^{n} w_i^\star(\theta)\, \delta_{s_i}(\cdot)$, where $w^\star(\theta) = \arg\max_{w \in S^{n-1}} H_n(w)$ subject to $D\big(\sum_{i=1}^{n} w_i \delta_{s_i}(\cdot), f_\theta\big) \le \varepsilon$, where $D$ is a statistical distance and $\varepsilon$ is a user-defined parameter. In the ensuing applications in this article, we often solve the dual optimization problem for operational ease. In that, for each $\theta \in \Theta$ and $\lambda \ge 0$, we calculate $\sum_{i=1}^{n} w_i^\star(\theta)\, \delta_{s_i}(\cdot)$, where $w^\star(\theta) = \arg\max_{w \in S^{n-1}} \big\{H_n(w) - \lambda\, D\big(\sum_{i=1}^{n} w_i \delta_{s_i}(\cdot), f_\theta\big)\big\}$. The parameter $\lambda$ controls the extent of departure from the guiding parametric distribution. One pivotal aspect yet to be addressed within the proposed framework is that a weight-adjusted empirical distribution is discrete, whereas, in the context of a specific problem, the guiding distribution $f_\theta$ is potentially continuous. For instance, in Section 5, in the context of entropy-based portfolio allocation, $f_\theta$ takes the form of a skew-normal distribution (Azzalini and Valle, 1996). This precludes the use of several standard statistical distances, such as total variation, Hellinger's distance, and the $\chi^2$ distance, for implementing the distance-based constraint. In this article, due to its versatility, we employ the Wasserstein metric (Villani, 2003) with $L_2$ cost as the distance measure $D$.
To that end, we briefly recall some relevant facts about the 2-Wasserstein metric. The Wasserstein space $P_2(\mathbb{R}^d)$ is defined as the set of probability measures $\mu$ with finite moment of order 2, i.e., $\{\mu : \int_{\mathbb{R}^d} \|x\|^2 \, d\mu(x) < \infty\}$, where $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^d$.
Definition 1. For $p_0, p_1 \in P_2(\mathbb{R}^d)$, let $\pi(p_0, p_1) \subset P_2(\mathbb{R}^d \times \mathbb{R}^d)$ denote the subset of joint probability measures (or couplings) $\nu$ on $\mathbb{R}^d \times \mathbb{R}^d$ with marginal distributions $p_0$ and $p_1$, respectively. Then, the 2-Wasserstein distance $W_2$ between $p_0$ and $p_1$ is defined as $W_2^2(p_0, p_1) = \inf_{\nu \in \pi(p_0, p_1)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^2 \, d\nu(x, y)$.
Importantly, if both $p_0, p_1 \in P_2(\mathbb{R})$ with quantile functions $F_0^{-1}, F_1^{-1}$, we have an easily tractable expression (Panaretos and Zemel, 2019), $W_2^2(p_0, p_1) = \int_0^1 \big(F_0^{-1}(u) - F_1^{-1}(u)\big)^2 \, du$. This is heavily utilized in subsequent sections. With that, we have all the essential ingredients to delve into the specific applications of interest.
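For univariate data, the squared 2-Wasserstein distance equals the integrated squared difference of the two quantile functions over $(0, 1)$, so the distance between a weighted empirical distribution and a continuous guiding distribution can be approximated by numerical integration. The following is a minimal sketch under stated assumptions: the function name `w2_squared_to_quantile`, the midpoint rule, and the grid size are illustrative choices, and the atoms are synthetic.

```python
from statistics import NormalDist

def w2_squared_to_quantile(atoms, weights, quantile, grid_per_atom=200):
    """Approximate W_2^2 between sum_i w_i * delta_{s_i} and a law with the given quantile function."""
    total, lower = 0.0, 0.0
    # After sorting, the weighted quantile function is a step function:
    # it equals the k-th smallest atom on (c_{k-1}, c_k], with c_k the cumulative weights.
    for s, w in sorted(zip(atoms, weights)):
        if w <= 0.0:
            continue
        step = w / grid_per_atom
        for j in range(grid_per_atom):
            u = lower + (j + 0.5) * step  # midpoint, strictly inside (0, 1)
            total += (s - quantile(u)) ** 2 * step
        lower += w
    return total

n = 50
target = NormalDist(0.0, 1.0)
# Synthetic atoms: standard normal quantiles at evenly spaced levels.
atoms = [target.inv_cdf((i + 0.5) / n) for i in range(n)]
weights = [1.0 / n] * n

d_same = w2_squared_to_quantile(atoms, weights, target.inv_cdf)
d_shift = w2_squared_to_quantile(atoms, weights, NormalDist(3.0, 1.0).inv_cdf)
```

Here `d_same` is close to zero, while `d_shift` is close to 9, the squared mean shift between the two normal distributions. The midpoint rule slightly under-weights the extreme tails, so a finer grid may be preferable for heavy-tailed guiding distributions.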
One application of the proposed framework arises in semi-parametric inference with complex survey data (Gunawan et al., 2020; Lumley, 2010). In survey sampling, we wish to infer a collection of features of a finite population $P := \{X_i, i \in [N]\}$. We are provided with a non-representative sample $(x_1, \ldots, x_n)$ obtained from $P$ via a complex survey, and the corresponding survey weights $\pi = (\pi_1, \ldots, \pi_n)$, $0 < \pi_i < \infty$. In the general framework, this task involves finding the optimal $\omega \in \Omega_0 \subset \Omega$ with atoms $(s_1, \ldots, s_n) = (x_1^\star, \ldots, x_n^\star)$, an i.i.d. pseudo sample of size $n$ obtained from the complex survey sample $(x_1, \ldots, x_n)$ via the weighted finite population Bayesian bootstrap (Dong et al., 2014; Cohen, 1997; Lo, 1993) to adjust for the survey weights, so that $m = n$. The restriction $\Omega_0$ is dictated by the parametric model that the analyst posits on the finite population $P$ to infer the features of interest.
The next application in this article deals with ensuring demographic parity (Agarwal et al., 2019; Gajane and Pechenizkiy, 2018) in machine learning algorithms. Suppose we have data $(x_i, y_i, a_i) \in X \times Y \times \{S, T\}$ for $n$ individuals on covariate $x \in \mathbb{R}^p$, continuous response $y \in \mathbb{R}$, and protected/sensitive attribute $A$ with labels $\{S, T\}$. For the sake of simplicity in exposition, we further assume that the first $n_S$ individuals belong to group $S$ and the remaining $n_T$ to group $T$, with $n = n_S + n_T$. The goal is to learn a predictive rule $h : X \times \{S, T\} \to Y$ that satisfies a specific notion of demographic parity; refer to Section 4 for details. We shall see that this task involves finding the optimal $\omega \in \Omega_0 \subset \Omega$ whose atoms are built from the negative of the loss function utilised to learn the predictive rule $h$ for individuals with $a = T$, where $\theta^{(T)}$ are the associated parameters. The optimality of $\omega$ and the restriction $\Omega_0$ are determined by the notion of demographic parity utilized.
An application of a slightly modified version of the general framework arises in portfolio allocation problems (Markowitz, 1952; Bera and Park, 2008; Elton et al., 2014), where the goal is to identify the optimal atoms of the discrete distributions, rather than the weights assigned to the atoms. This task translates to finding the optimal $\omega \in \Omega_0 \subset \Omega$ with atoms $s_i = \sum_{j=1}^{d} w_j r_{i,j}$, $i \in [n]$; refer to Section 5 for details. The optimality criterion and the restriction $\Omega_0$ are driven by the fund manager's portfolio allocation objectives and the assumed model for the return distribution, respectively.

Semi-parametric Inference in Complex Surveys
Survey data (Gunawan et al., 2020; Lumley, 2010) commonly arise from complex sampling methods such as stratification and multistage sampling, wherein individuals in the finite population have unequal probabilities of inclusion in the sample. Prominent instances of extensive surveys implementing these methodologies include the National Health and Nutrition Examination Surveys (NHANES), the British Household Panel Survey (BHPS), and the Household Income and Labour Dynamics in Australia (HILDA) survey. In complex surveys, the survey sample lacks representativeness, since individuals with varying demographic characteristics in the finite population of interest have varying probabilities of selection into the sample. Consequently, traditional methods of inference and estimation yield biased estimators with poor coverage.
A prevalent approach to address this challenge entails carefully exploiting the sampling weights available with complex survey data sets.These weights could be used to rectify the biases introduced by the unequal probability sampling, and enable us to create pseudo equal probability samples from the population.If a survey participant falls within a demographic group with a low probability of selection or response, their weight is increased accordingly.Commonly, the available information only includes the survey data set and the associated sampling weights for each unit in the sample.That is, there is limited or no information available about the complex sampling methodology or the precise technique employed for deriving these weights.This situation presents a compelling inferential challenge, which we shall delve into further in the following discussion.
Assume we have a finite population $P := \{X_i, i \in [N]\}$, and we wish to infer a collection of features of $P$. We are provided with a non-representative sample $x = (x_1, \ldots, x_n)$ obtained via a complex survey design, and the corresponding survey weights $\pi = (\pi_1, \ldots, \pi_n)$, $0 < \pi_i < \infty$. It is assumed that the weights have been designed so that $\pi_i$ is inversely proportional to the likelihood that the survey design selects an observation with the same demographic characteristics as observation $x_i$. That is, observations with a lower probability of being selected than they would have under a simple random sampling approach are assigned greater weight than they would receive in a simple random sampling scenario. Conversely, observations with a higher probability of selection receive lower weights than they would in a simple random sampling setup. The weights $\pi_i$ are scaled to ensure that $\sum_{i=1}^{n} \pi_i = n$.

Related Works
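Pseudo maximum likelihood estimation (PMLE) is a very popular approach to parametric inference with complex survey data: we posit a parametric model $f_\theta$ for $P$, and $\theta$ encodes the features of interest of $P$. The pseudo log-likelihood of $\theta$ takes the form $L(\theta) = \sum_{i=1}^{n} \pi_i \log f_\theta(x_i)$ (, 2007; Gunawan et al., 2020). The pseudo maximum likelihood estimate $\hat{\theta}_{\mathrm{PMLE}}$ satisfies the first order condition $\frac{\partial L(\theta)}{\partial \theta} = \sum_{i=1}^{n} \pi_i \frac{\partial}{\partial \theta} \log f_\theta(x_i) = 0$. Under certain regularity conditions (White, 1982), $\hat{\theta}_{\mathrm{PMLE}}$ is asymptotically normal and centered at $\theta_0$, the true value of $\theta$, with a covariance determined by the matrices $H_\pi$ and $V_\pi$, both of which are estimated from the weighted sample.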
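For the normal working model used later in the experiments, the weighted score equations $\sum_i \pi_i \frac{\partial}{\partial \theta} \log f_\theta(x_i) = 0$ have a closed-form solution: the survey-weighted mean and the survey-weighted variance. The sketch below illustrates this special case only; the function name `pmle_normal` and the toy data are hypothetical.

```python
def pmle_normal(x, pi):
    """Pseudo maximum likelihood estimates (mu, sigma^2) under a Normal(mu, sigma^2) model.

    Solves sum_i pi_i * d/dtheta log f_theta(x_i) = 0, which for the normal
    model gives the weighted mean and the weighted (biased) variance.
    """
    if len(x) != len(pi) or any(p <= 0 for p in pi):
        raise ValueError("need one strictly positive survey weight per observation")
    total = sum(pi)
    mu = sum(p * xi for p, xi in zip(pi, x)) / total
    sigma2 = sum(p * (xi - mu) ** 2 for p, xi in zip(pi, x)) / total
    return mu, sigma2

# Toy data (made up): the first unit represents three times as many
# population units as the second one.
x = [0.0, 2.0]
pi = [3.0, 1.0]
mu_hat, sigma2_hat = pmle_normal(x, pi)
```

The estimates are invariant to rescaling of the weights, so it does not matter whether the weights are normalised to sum to $n$ or to one.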
As an alternative, a semi-parametric inference framework can be developed where the feature of interest $\theta$ of the finite population $P := \{X_i, i \in [N]\}$, instead of being tied to a parametric family of distributions as earlier, is described by the set of estimating equations $\frac{1}{N} \sum_{i=1}^{N} g(X_i, \theta) = 0$ with a vector of known functions $g$. This approach avoids complete parametric specification of the model and is widely utilized in Statistics and Econometrics (Chib et al., 2018, 2021). Given a sample $x = (x_1, \ldots, x_n)^T$ and survey weights $\pi = (\pi_1, \ldots, \pi_n)^T$, the exponentially tilted empirical likelihood (ETEL; Schennach, 2005) takes the form $L_{\mathrm{MCM}}(\theta) = \prod_{i=1}^{n} w_i^\star(\theta)$, where $w^\star(\theta)$ denotes the maximum entropy weights under the moment conditions. Here and elsewhere, we use MCM as an acronym for moment condition model. The likelihood $L_{\mathrm{MCM}}(\theta)$ is well defined when the convex hull of $\cup_{i=1}^{n} g(x_i, \theta)$ contains the origin.
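To make the exponential tilting concrete, the sketch below computes maximum entropy weights under a single scalar moment condition $\sum_i w_i g_i = 0$. It uses the standard textbook form, in which $w_i \propto \exp(\lambda g_i)$ and $\lambda$ minimises $\sum_i \exp(\lambda g_i)$. This is an assumption made for illustration: it ignores the survey weights and is not necessarily the exact variant used in this section. The function name `etel_weights` and the data are hypothetical.

```python
import math

def etel_weights(g_values, tol=1e-12, max_iter=100):
    """Maximum entropy weights subject to sum_i w_i * g_i = 0 (scalar moment)."""
    # The origin must lie inside the convex hull of the g_i, here an interval.
    if not (min(g_values) < 0.0 < max(g_values)):
        raise ValueError("zero must lie strictly inside the range of the g_i")
    lam = 0.0
    for _ in range(max_iter):
        e = [math.exp(lam * g) for g in g_values]
        grad = sum(ei * g for ei, g in zip(e, g_values))      # derivative of sum exp(lam * g)
        hess = sum(ei * g * g for ei, g in zip(e, g_values))  # second derivative, positive
        step = grad / hess
        lam -= step                                           # Newton update on a convex function
        if abs(step) < tol:
            break
    e = [math.exp(lam * g) for g in g_values]
    total = sum(e)
    return [ei / total for ei in e]

# Moment function g(x, mu) = x - mu evaluated at a candidate mu (toy data).
x = [1.0, 2.0, 3.0, 4.0, 10.0]
mu = 3.0
g = [xi - mu for xi in x]
w = etel_weights(g)
log_etel = sum(math.log(wi) for wi in w)  # log of prod_i w_i at this mu
```

Since the sample mean of `x` exceeds the candidate `mu`, the weights shift mass towards the smaller observations until the weighted mean equals `mu`.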

Proposed Methodology
Importantly, it is often unwieldy, if not impossible, to put more flexible constraints on the parameter of interest via moment conditions. In this article, we intend to provide additional flexibility to the ETEL framework by providing the scope for statistical distance-based, parametric distribution-guided constraints. However, it is not straightforward to accomplish this in the context of complex survey data, due to the presence of the survey weights. To circumvent this issue, we first reconstruct $M$ pseudo true populations of size $N$ from the observed complex survey sample of size $n$ via the weighted finite population Bayesian bootstrap (Dong et al., 2014; Cohen, 1997; Lo, 1993) to adjust for the survey weights; next, we draw an i.i.d. pseudo sample of size $n$ from each of the pseudo true populations; and finally, we construct an ETEL based on each of the $M$ pseudo samples. For a given pseudo sample $(x_1^\star, \ldots, x_n^\star)$, the likelihood with the parametric distribution-guided constraint is obtained by replacing the moment restriction in the ETEL with $D\big(\sum_{i=1}^{n} w_i \delta_{x_i^\star}(\cdot), f_\theta\big) \le \varepsilon$, where $\delta$ is the indicator function, $f_\theta$ is a parametric distribution of choice, and $\varepsilon$ is a user-defined parameter denoting the maximum extent of departure from the parametric distribution of choice. Here and elsewhere, we use BDCM as an acronym for bootstrapped distributionally constrained models. Importantly, the inference on the $M$ pseudo samples can be carried out in parallel. The final estimate of $\theta$ is obtained by combining the estimates obtained from the $M$ i.i.d. pseudo samples.

Experiments
Based on the numerical experiments in Gunawan et al. (2020), we design simulation studies to compare the proposed distribution-guided entropy maximization approach with the popular pseudo likelihood approach. Suppose the random variables $(X, Z)$ jointly follow a bivariate normal distribution with mean $(\mu_x, \mu_z)' = (0, 10)'$, marginal variances $(\sigma_x^2, \sigma_z^2)' = (4, 16)'$, and correlation $\rho \in \{0.1, 0.5, 0.8\}$. The variable $X$ is the variable of interest; we aim to estimate its mean $\mu_x$ and variance $\sigma_x^2$. The variable $Z$ is a selection variable, i.e., the $Z$-value of a population unit determines the probability of inclusion of the unit in the sample. In particular, we posit that the inclusion probability of $X_s$ in the sample is given by $\pi_s^\star = \Phi(\beta_0 + \beta_1 Z_s)$, where $\Phi(\cdot)$ is the cumulative distribution function of a standard normal distribution. When a population unit is included in the sample, we observe $x_s$ and assign a survey weight $\pi_s$ such that $\pi_s \propto 1/\pi_s^\star$. Importantly, we assume that we do not directly observe $Z_s$. The selected sample of size $n$ is denoted as $(x, \pi)'$. We scale the weights such that they sum up to 1. The objective is to utilize $(x, \pi)'$ to estimate the population parameters of interest $(\mu_x, \sigma_x^2)$. We generate $N = 100{,}000$ values of $(X_s, Z_s)$ as a finite population. We set $\beta_0 = 0.1$, $\beta_1 = -1.8$, and draw samples of sizes $n \in \{500, 1000, 1500, 2000, 2500\}$ from the finite population. Under each data generating set-up, we utilize 100 Monte Carlo simulations. For the pseudo maximum likelihood (PMLE) approach, we simply posit the model $f_\theta \equiv \mathrm{Normal}(\mu_x, \sigma_x^2)$. For the proposed BDCM approach, we assume the moment constraint based on the function $g(x, \mu_x) = x - \mu_x$, and the Wasserstein distance constraint based on the parametric family of distributions $f_\theta \equiv \mathrm{Normal}(\mu_x, \sigma_x^2)$. For each of the replicates, we choose $M = 500$; and to ensure comparability of PMLE and BDCM, we set $\varepsilon = W_2^2\big(\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i^\star}(\cdot), f_{\hat{\theta}}\big)$, where $\hat{\theta}$ is the estimate of $\theta$ obtained via PMLE. The bias and the coverage of the pseudo maximum likelihood and moment condition model based estimators for varying data generating mechanisms are presented in Table 1. A case study with complex survey data from the National Health and Nutrition Examination Surveys (NHANES) is provided in the supplement.

National Health and Nutrition Examination Surveys (NHANES) Data Analysis
NHANES is a series of surveys designed to assess the health and nutritional status of individuals in the United States.

Demographic Parity
Discrimination pertains to unfair treatment of individuals based on specific demographic characteristics known as protected attributes.The goal of demographic parity or statistical parity (Agarwal et al., 2019;Gajane and Pechenizkiy, 2018) in machine learning is to design algorithms that yield fair inferences devoid of discrimination due to membership to certain demographic groups determined by a protected attribute.First, we introduce the mathematical formalization of the notions of demographic parity.To that end, we assume that X denotes the feature vector used for predictions, A is the protected attribute with two levels {S, T }, and Y is the response.Parity constraints are phrased in terms of the distribution over (X, A, Y ).Two definitions are in order.
Definition 2 (Demographic parity, (Agarwal et al., 2019)). A predictor $h$ satisfies demographic parity under the distribution over $(X, A, Y)$ if $h(X)$ is independent of the protected attribute $A$, i.e., $P[h(X) \ge z \mid A = a] = P[h(X) \ge z]$ for all $a \in \{S, T\}$ and all $z$.
Definition 3 (Demographic parity in expectation, (Agarwal et al., 2019)). A predictor $h$ satisfies demographic parity in expectation under the distribution over $(X, A, Y)$ if $E[h(X) \mid A = a] = E[h(X)]$ for all $a \in \{S, T\}$.

Proposed Methodology
Although the notions of demographic parity in Definitions 2 and 3 coincide when we work with binary responses, the latter may be amenable to simpler computational algorithms (Fitzsimons et al., 2019) than the general definition. However, the notion of demographic parity in expectation is somewhat restrictive, since one cannot control the predictor $h$ over its entire domain. For example, depending on the application of interest, we may be solely interested in controlling the tails of the predictor (Yang et al., 2019). Drawing on our semi-parametric inference framework, we offer a flexible and computationally feasible compromise between the notions in Definitions 2 and 3. To that end, we introduce the notion of demographic parity in the Wasserstein metric next.
Definition 4 (Demographic parity in Wasserstein metric). A predictor $h$ achieves demographic parity in Wasserstein metric with bias $\varepsilon$, under the distribution over $(X, A, Y)$, if the Wasserstein distance between the conditional distributions of $h(X)$ given $A = S$ and given $A = T$ is at most $\varepsilon$.
Suppose we have data $(x_i, y_i, a_i) \in \mathbb{R}^d \times \mathbb{R} \times \{S, T\}$ for $n$ individuals on the covariate $x$, univariate continuous response $y$, and levels of the protected attribute $a \in \{S, T\}$. For the sake of simplicity in exposition, we also assume that the first $n_S$ individuals belong to sub-population $S$ and the remaining $n_T$ to sub-population $T$, where $n = n_S + n_T$. Next, we posit a predictive model for $y$ through a predictor $h$, where $h$ is potentially non-linear, and $(\theta^{(S)}, \theta^{(T)})$ is the model parameter of interest to be estimated under the demographic parity constraint. In particular, we consider the empirical cdf of $h$ under sub-population $S$, $F_S^h = \frac{1}{n_S} \sum_{i=1}^{n_S} \delta_{h(x_{iS})}(\cdot)$, and a weighted empirical cdf of $h$ under sub-population $T$, $F_T^h = \sum_{i=n_S+1}^{n} w_i \delta_{h(x_{iT})}(\cdot)$. Here $\delta$ is the Dirac delta measure. The goal is to infer $(\theta^{(S)}, \theta^{(T)}, w)$ while ensuring that the demographic parity constraint holds, i.e., $F_S^h$ and $F_T^h$ are close with respect to $W_2^2$, and that at the same time the extent of re-weighting in $F_T^h$ is minimal, i.e., the entropy $-\sum_{i=n_S+1}^{n} w_i \log w_i$ is close to the maximal entropy $\log n_T$. A related idea in Jiang et al. (2020) deals with $W_1$-constrained fair classification problems, but our approach of additionally re-weighting the observations offers more flexibility, with possible ramifications for studying fairness in misspecified models.
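Both ingredients of this trade-off are cheap to evaluate for univariate predictions: the squared 2-Wasserstein distance between two discrete distributions follows from the quantile coupling, and the entropy of the weights is a finite sum. The sketch below is illustrative only; the function names and the toy predictions are assumptions rather than part of the proposed method.

```python
import math

def w2_squared_discrete(atoms_a, weights_a, atoms_b, weights_b):
    """Exact W_2^2 between two univariate discrete distributions via the quantile coupling."""
    a = sorted((x, w) for x, w in zip(atoms_a, weights_a) if w > 0)
    b = sorted((x, w) for x, w in zip(atoms_b, weights_b) if w > 0)
    i = j = 0
    rem_a, rem_b = a[0][1], b[0][1]
    total = 0.0
    while i < len(a) and j < len(b):
        mass = min(rem_a, rem_b)              # mass moved between the current atoms
        total += mass * (a[i][0] - b[j][0]) ** 2
        rem_a -= mass
        rem_b -= mass
        if rem_a <= 1e-15:                    # current atom of a is exhausted
            i += 1
            rem_a = a[i][1] if i < len(a) else 0.0
        if rem_b <= 1e-15:                    # current atom of b is exhausted
            j += 1
            rem_b = b[j][1] if j < len(b) else 0.0
    return total

def entropy(w):
    return -sum(wi * math.log(wi) for wi in w if wi > 0)

# Toy fitted values h(x) for groups S and T (made up).
pred_s = [1.0, 2.0, 3.0]
pred_t = [2.0, 3.0, 4.0]
unif = [1.0 / 3] * 3
tilted = [0.5, 0.3, 0.2]    # hypothetical re-weighting of group T

d_uniform = w2_squared_discrete(pred_s, unif, pred_t, unif)
d_tilted = w2_squared_discrete(pred_s, unif, pred_t, tilted)
h_tilted = entropy(tilted)
```

In this toy example, shifting weight towards the smaller predictions in group T lowers the distance to group S from 1.0 to 0.7, at the price of an entropy below the maximal value $\log 3$.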
We achieve this through an in-model approach that solves the optimization problem (4.1), where $\sum_{i=n_S+1}^{n} w_i = 1$ and $l_i$ denotes the negative of the loss function for individual $i$. For a resulting re-weighting vector $w^\star = (w^\star_{n_S+1}, \ldots, w^\star_n)'$, we can obtain a fair prediction at a new $x \in T$ via a weighted kernel density estimate at $x$. As a competitor to the in-model scheme, motivated by popular post-processing schemes to ensure fairness (Xian et al., 2023; Nandy et al., 2022), we utilize a two-step procedure. Step 1: We obtain the model parameter estimates $(\hat{\theta}^{(S)}, \hat{\theta}^{(T)}, \hat{\sigma}^2)$ by solving (4.2). Step 2: We carry out a post-processing step at $(\hat{\theta}^{(S)}, \hat{\theta}^{(T)}, \hat{\sigma}^2)$ to obtain $w^\star$ by solving (4.3). A case study on algorithmic mental health monitoring is provided next. An additional case study on algorithmic criminal risk assessment is also included.

Distress Analysis Interview Corpus (DAIC)
The Distress Analysis Interview Corpus (DAIC) (Gratch et al., 2014) is a multi-modal collection of clinical interviews, accessible upon request via the DAIC-WOZ website. Computer agents based on such clinical interviews are envisaged for use in making mental health diagnoses in relation to certain employment decisions, and concerns have been raised about the fairness of such tools with respect to the biological gender of the individuals. Specifically, we focus on predicting the PHQ-8 score, which captures the individual's severity of depression, as a function of the individual's verbal signals during the clinical interviews, while biological gender serves as a protected attribute. In particular, Fourier series analysis of the speech signal of the individuals yields verbal attributes of interest that in turn could potentially be used to diagnose the individual's severity of depression. Therefore, it is of interest to develop novel methods to produce predictions while avoiding disparate treatment on the basis of biological gender. More precisely, we want to ensure that the demographic parity constraint is satisfied here, which in this context simply dictates that the weighted empirical CDFs of biological gender-specific fitted PHQ-8 scores are identical or similar.
The PHQ-8 scores range from 0 to 27, with a score of 0-4 considered none or minimal, 5-9 mild, 10-14 moderate, 15-19 moderately severe, and 20-27 severe. In this application, we work with the PHQ-8 score (continuous response), biological gender (binary protected attribute), and 17 derived audio/verbal features (continuous covariates) corresponding to the $n = 107$ subjects. The PHQ-8 scores for the two biological genders show a clear discrepancy. Therefore, we shall assess the relative performance of the in-model scheme in (4.1) and the two-step scheme in (4.2)-(4.3) in ensuring demographic parity with respect to biological gender (refer to Figure 1). As earlier, for the sake of simplicity of exposition, we use linear regression (i.e., $h$ is linear in the covariates) as our predictive model of choice. When we fit the predictive model without any fairness constraint, the fitted empirical cumulative distribution functions corresponding to the two biological genders are widely different. Our in-model scheme, as well as the two-step scheme, significantly reduces the discrepancy owing to the in-built fairness-based regularization. As noted earlier, the in-model scheme provides lower bias since it performs the two-step optimization simultaneously.

COMPAS Recidivism Data Analysis
We consider a case study on algorithmic criminal risk assessment. We focus on the popular COMPAS data set (Aliverti et al., 2021), which includes information on criminal history for defendants in Broward County, Florida, and is available from the ProPublica website.
For each individual, several features on criminal history are available, such as the number of past felonies, misdemeanors, and juvenile offenses; additional demographic information includes the sex, age, and ethnic group of each defendant.We focus on predicting two-year recidivism score y (continuous) as a function of the defendant's demographic information except for race and criminal history x, while race (categorical) serves as a protected attribute.
Algorithms for making such predictions are routinely used in courtrooms to advise judges, and concerns about the fairness of such tools with respect to the race of the defendants are raised.Therefore, it is of interest to develop novel methods to produce predictions while avoiding disparate treatment on the basis of the protected attribute race.More precisely, we want to ensure that the demographic parity constraint is satisfied, which in this context, simply dictates that the weighted empirical CDFs of race-specific fitted recidivism scores are identical or similar.
For simplicity of exposition, we only consider two levels for the protected attribute race, namely, African-American or non-African-American, and consider a sub-sample of the entire data set with 100 defendants corresponding to each level of the protected attribute. As covariates, for each defendant, we consider demographic information, namely sex (binary), age (continuous), and marital status (categorical); and criminal status, namely legal status (categorical), supervision level (categorical), and custody status (categorical). We use linear regression (i.e., $h$ is linear in the covariates) as our predictive model of choice; the methodology readily extends to more complicated models. The histograms of the raw recidivism scores for African-Americans versus non-African-Americans show a clear discrepancy (refer to Figure ??). We shall assess the relative performance of the in-model scheme in (4.1) and the two-step scheme in (4.2)-(4.3) in ensuring demographic parity with respect to the protected attribute race (refer to Figure 3). When we fit the predictive model without any fairness constraint, the fitted empirical cumulative distribution functions corresponding to the two sub-populations are widely different. Our in-model scheme, as well as the two-step scheme, significantly reduces the discrepancy owing to the in-built fairness-based regularization. As expected, the in-model scheme provides slightly lower bias since it performs the two-step optimization simultaneously.

Entropy Based Portfolio Allocation
We present an application of the proposed parametric distribution-guided entropy maximization framework to portfolio allocation problems (Markowitz, 1952; Bera and Park, 2008; Elton et al., 2014). Portfolio optimization is concerned with the allocation of an investor's wealth over several assets to optimize specific objective(s) based on historical data on asset returns. To elucidate the problem clearly, let $R^{(i)} = (R_{i,1}, R_{i,2}, \ldots, R_{i,d})'$ be the excess returns on $d$ risky assets recorded at time $i \in [n]$. The portfolio $(w_1, \ldots, w_d)$ is a vector of weights that represents the investor's relative allocation of wealth, satisfying $\sum_{i=1}^{d} w_i = 1$ and $w_i \ge 0$, $i \in [d]$. The goal is to learn $(w_1, \ldots, w_d)$ subject to specific constraints based on historical data.
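In this setting the portfolio weights determine the atoms of the empirical return distribution, $s_i = \sum_{j} w_j R_{i,j}$, while the entropy of the weights measures diversification. A minimal sketch follows, with a made-up return matrix and hypothetical function names.

```python
import math

def portfolio_returns(returns, w):
    """Atoms s_i = sum_j w_j * R_{i,j}; `returns` is a list of rows, one per time point."""
    if any(wj < 0 for wj in w) or abs(sum(w) - 1.0) > 1e-9:
        raise ValueError("portfolio weights must be non-negative and sum to one")
    return [sum(wj * rij for wj, rij in zip(w, row)) for row in returns]

def entropy(w):
    """Entropy of the portfolio weights; larger values mean more diversification."""
    return -sum(wj * math.log(wj) for wj in w if wj > 0)

# Made-up excess returns for d = 3 assets over n = 3 periods.
returns = [
    [0.01, 0.03, -0.02],
    [-0.02, 0.04, 0.01],
    [0.00, -0.01, 0.02],
]
equal = [1.0 / 3] * 3
concentrated = [0.9, 0.05, 0.05]

s_half = portfolio_returns(returns, [0.5, 0.5, 0.0])
h_equal = entropy(equal)
h_concentrated = entropy(concentrated)
```

The equally weighted portfolio attains the maximal entropy $\log d$, whereas the concentrated portfolio has a much lower value.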

Related Works
Markowitz's mean-variance optimization (Markowitz, 1952) is widely recognized as one of the foundational formulations of the portfolio selection problem. The traditional mean-variance (MV) optimal portfolio weights (Markowitz, 1952) are obtained by trading off the portfolio mean $w^T \mu$ against the portfolio variance $w^T \Sigma w$, where $\mu = (\mu_1, \ldots, \mu_d)^T$ and $\Sigma$ are the mean and covariance matrix of the historical returns, and $\lambda > 0$ is a risk aversion parameter governing the trade-off. Given a specific mean and covariance matrix, the Markowitz paradigm offers an elegant approach to achieving an efficient allocation, where the pursuit of higher expected returns inevitably entails assuming greater risk. However, in this framework, it is essential either for the asset returns to follow a normal distribution or for the utility to depend solely on the first two moments. Real-world financial returns, as indicated by empirical evidence (Mills, 1995; Peiro, 1999), diverge from normal distribution assumptions and commonly exhibit heavier tails and lack of symmetry. To that end, (Mehlawat et al., 2021; Campbell R. Harvey and Müller, 2010) proposed to utilize higher order moments in the portfolio allocation problem. However, portfolios created using sample moments of stock returns tend to be excessively concentrated in a small number of assets, which contradicts the fundamental principle of diversification. To address this, several approaches proposed in the literature shrink the portfolio weights towards maximum diversification (Bera and Park, 2008; Zhou et al., 2015; li Kang et al., 2021), i.e., they maximize the entropy of the portfolio weights. In particular, Bera and Park (2008) proposed to obtain the portfolio weights by solving the optimization problem $\arg\max_w H_d(w)$ subject to $\sum_{i=1}^{d} w_i \mu_i \ge \mu_0$, $w^T \Sigma w \le \sigma_0^2$, and $\sum_{i=1}^{d} w_i = 1$, where $(\mu_0, \sigma_0^2)$ are the target mean and variance of the portfolio return. In essence, this approach obtains the portfolio weights via entropy maximization subject to moment-based constraints.

Proposed Methodology
Importantly, empirical evidence suggests that there is merit in modeling asset returns via non-normal distributions (Campbell R. Harvey and Müller, 2010; Park, 2021), e.g., the skew-normal distribution (Azzalini and Valle, 1996). However, it is often unwieldy to impose more flexible constraints on the portfolio weights in terms of moment conditions. In this section, we add flexibility to the entropy-based portfolio optimization framework by allowing constraints that are guided by a parametric distribution and expressed through a statistical distance. Our semi-parametric framework provides a formidable alternative to the existing literature, since (a) we can flexibly specify the distribution of the expected return, and (b) the entropy provides a direct handle on portfolio diversity. We achieve this by obtaining the portfolio weights via the optimization problem $\arg\max_{w} H_d(w)$ subject to the constraint that the statistical distance between the empirical distribution of the portfolio return and $f_{\theta_0}$ is at most $\varepsilon$. Here $f_{\theta}$ is the centering parametric family of distributions of choice, $\theta_0$ is the fixed target value of $\theta$, and $\varepsilon$ is a user-defined parameter. For practical purposes, it is useful to express the optimization problem above as the minimization problem (5.1), in which the user-defined parameter $\lambda^{\star} \in [0, 1]$ controls the balance between portfolio diversity and the extent of deviation from the target distribution $f_{\theta_0}$.
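To make the trade-off governed by $\lambda^{\star}$ concrete, the sketch below minimizes a penalized objective on synthetic data. Several ingredients are assumptions for illustration only: the objective is taken to be $(1 - \lambda^{\star})\, W_1 - \lambda^{\star} H_d(w)$, which need not coincide with the exact form of (5.1); the distance is the 1-Wasserstein distance between the $n$ portfolio returns and $n$ quantiles of a skew-normal target; $H_d$ is the Shannon entropy; and all function names and parameter values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import skewnorm


def entropy(w, eps=1e-12):
    """Shannon entropy of the portfolio weights (assumed form of H_d)."""
    w = np.clip(w, eps, None)
    return -np.sum(w * np.log(w))


def w1_to_target(returns, target_quantiles):
    """1-Wasserstein distance between two equal-size, equally weighted samples.

    target_quantiles must be sorted in increasing order.
    """
    return np.mean(np.abs(np.sort(returns) - target_quantiles))


def guided_portfolio(R, target_quantiles, lam_star):
    """Minimize the ASSUMED objective (1 - lam_star) * W1 - lam_star * entropy."""
    d = R.shape[1]

    def objective(w):
        distance = w1_to_target(R @ w, target_quantiles)
        return (1.0 - lam_star) * distance - lam_star * entropy(w)

    res = minimize(
        objective,
        np.full(d, 1.0 / d),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * d,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    w = np.clip(res.x, 0.0, None)
    return w / w.sum()  # renormalize to absorb solver round-off


# Synthetic returns and a hypothetical skew-normal target (not the paper's data).
rng = np.random.default_rng(1)
n, d = 120, 5
R = rng.normal(0.01, 0.05, size=(n, d))
levels = (np.arange(1, n + 1) - 0.5) / n
q = skewnorm.ppf(levels, -2.0, loc=0.03, scale=0.04)  # target quantiles

w_half = guided_portfolio(R, q, lam_star=0.5)
print(np.round(w_half, 3), entropy(w_half), w1_to_target(R @ w_half, q))
```

The distance term is piecewise linear in $w$, so a gradient-based solver such as SLSQP is only a rough tool here; in practice one may also want to rescale the two terms, since the entropy and the distance live on different scales.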
For exposition in this article, we choose $f_{\theta_0}$ to be a skew-normal distribution (Azzalini and Valle, 1996) with parameters $\theta = (\omega, \zeta, \alpha)'$. For $\alpha = 0$ we recover the normal distribution, and the absolute value of the skewness increases as the absolute value of $\alpha$ increases. For $\alpha > 0$ the distribution is right skewed, and it is left skewed for $\alpha < 0$. If $Z \sim \mathrm{SN}(\zeta, \omega, \alpha)$, then we write $\mu_0 = E(Z)$, $\sigma_0^2 = \mathrm{Var}(Z)$, and $\gamma_0 = \mathrm{Skewness}(Z)$, each of which is a function of $(\zeta, \omega, \alpha)$. This allows us to set $(\zeta, \omega, \alpha)$ to achieve the target $\theta_0 = (\mu_0, \sigma_0^2, \gamma_0)$ of the portfolio return distribution. The resulting skew-normal density with fully specified parameters then serves as the target distribution for calculating the portfolio weights based on (5.1). The user can select any flexible probability distribution for modelling the portfolio return and follow the prescribed recipe to compute the target parameter values.
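The mapping from target moments to skew-normal parameters can be written in closed form. The sketch below relies on the standard skew-normal moment identities, with $\delta = \alpha/\sqrt{1+\alpha^2}$: $E(Z) = \zeta + \omega\delta\sqrt{2/\pi}$, $\mathrm{Var}(Z) = \omega^2(1 - 2\delta^2/\pi)$, and $\mathrm{Skewness}(Z) = \frac{4-\pi}{2}\,(\delta\sqrt{2/\pi})^3 / (1 - 2\delta^2/\pi)^{3/2}$. The function name and the example targets are hypothetical, and the round-trip check uses `scipy.stats.skewnorm`, whose shape, `loc`, and `scale` arguments correspond to $\alpha$, $\zeta$, and $\omega$.

```python
import numpy as np
from scipy.stats import skewnorm


def skewnormal_from_moments(mu0, sigma2_0, gamma0):
    """Return (zeta, omega, alpha) matching a target mean, variance, and skewness."""
    # The skew-normal skewness is bounded in absolute value (about 0.995).
    max_skew = 0.5 * (4 - np.pi) * (2 / (np.pi - 2)) ** 1.5
    if abs(gamma0) >= max_skew:
        raise ValueError("target skewness is outside the skew-normal range")
    g = abs(gamma0) ** (2 / 3)
    c = ((4 - np.pi) / 2) ** (2 / 3)
    # Invert the skewness identity for delta; the sign follows the target skewness.
    delta = np.sign(gamma0) * np.sqrt(0.5 * np.pi * g / (g + c))
    alpha = delta / np.sqrt(1 - delta ** 2)
    omega = np.sqrt(sigma2_0 / (1 - 2 * delta ** 2 / np.pi))
    zeta = mu0 - omega * delta * np.sqrt(2 / np.pi)
    return zeta, omega, alpha


# Hypothetical targets: small positive mean, negative skewness.
zeta, omega, alpha = skewnormal_from_moments(mu0=0.01, sigma2_0=0.002, gamma0=-0.5)

# Round-trip check against SciPy's skew-normal moments.
mean, var, skewness = skewnorm.stats(alpha, loc=zeta, scale=omega, moments="mvs")
print(zeta, omega, alpha)
print(float(mean), float(var), float(skewness))
```

Targets with skewness at or beyond the bound cannot be matched by a skew-normal distribution, in which case a more flexible family would be needed.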

Historical Stock Returns Data Analysis
We consider stock returns data for 5 companies (AMZN, AAPL, XOM, T, MS) over the period January 2000 to December 2020, publicly available from Yahoo! Finance. The data are aggregated at the monthly level. The goal is to compare the mean-variance optimal portfolio with the proposed parametric-distribution-guided portfolio allocation framework. First, we compute the mean-variance optimal portfolio for varying values of the risk aversion parameter $\lambda \in [0, 10]$. Figure 5 records the skewness, excess kurtosis, and number of zero portfolio weights of the mean-variance optimal portfolio for varying $\lambda$. We focus on $\lambda = 1$, a choice at which 3 out of 5 portfolio weights are 0 and the optimal portfolio return distribution is negatively skewed and leptokurtic. This exposes the fact that, once $\lambda$ is fixed, the mean-variance optimization framework offers no direct control over portfolio diversity, and the resulting allocation may be concentrated on very few assets. Next, we fix the parameters $\theta = (\omega, \zeta, \alpha)'$ of a skew-normal density such that its mean, variance, and skewness match those of the mean-variance optimal portfolio return at $\lambda = 1$. Finally, we compute the skew-normal-distribution-guided maximum entropy portfolio for varying values of the balance parameter $\lambda^{\star} \in [0, 1]$ in (5.1). Figure 6 presents the entropy of the portfolio weights and the departure of the portfolio return distribution from the guiding skew-normal distribution as functions of $\lambda^{\star} \in [0, 1]$. This showcases that, in contrast to mean-variance optimal portfolio allocation, the fund manager can choose a specific $\lambda^{\star}$ to ensure the desired level of portfolio diversity while maintaining fidelity to a pre-specified distribution of the portfolio return.
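The diagnostics reported for the mean-variance portfolio (skewness, excess kurtosis, and number of zero weights as $\lambda$ varies) can be computed along the following lines. This is a sketch under stated assumptions: the mean-variance objective is taken to be $w^{T}\mu - \frac{\lambda}{2} w^{T}\Sigma w$ with long-only weights, which may differ from the paper's exact scaling of $\lambda$; the returns are synthetic heavy-tailed draws rather than the Yahoo! Finance data; and the function names and the zero-weight tolerance are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import kurtosis, skew


def mean_variance_weights(mu, Sigma, lam):
    """Long-only MV weights under the ASSUMED objective w'mu - (lam / 2) w'Sigma w."""
    d = len(mu)
    res = minimize(
        lambda w: -(w @ mu) + 0.5 * lam * (w @ Sigma @ w),
        np.full(d, 1.0 / d),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * d,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    w = np.clip(res.x, 0.0, None)
    return w / w.sum()  # renormalize to absorb solver round-off


def diagnostics(R, w, tol=1e-6):
    """Shape summaries of the portfolio return and count of (near-)zero weights."""
    r = R @ w
    return {
        "skewness": float(skew(r)),
        "excess_kurtosis": float(kurtosis(r)),  # Fisher definition: normal gives 0
        "n_zero": int(np.sum(w < tol)),
    }


# Synthetic heavy-tailed monthly returns for 5 hypothetical assets.
rng = np.random.default_rng(2)
means = np.array([0.010, 0.020, 0.005, 0.015, 0.012])
R = 0.05 * rng.standard_t(df=5, size=(252, 5)) + means
mu, Sigma = R.mean(axis=0), np.cov(R, rowvar=False)

results = {lam: diagnostics(R, mean_variance_weights(mu, Sigma, lam))
           for lam in (0.5, 1.0, 5.0, 10.0)}
for lam, summary in results.items():
    print(lam, summary)
```

Replacing `mean_variance_weights` with a distribution-guided solver and looping over $\lambda^{\star}$ instead of $\lambda$ would give the analogous entropy-versus-departure curves.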

Concluding Remarks
We introduced a nonparametrically flavoured framework that aims to align the maximum entropy weight-adjusted empirical distribution of observed data closely with a predefined and potentially continuous probability distribution, while permitting mild deviations. The framework's versatility is showcased in three distinct applications. We anticipate the proposed methodology's utility in numerous other statistical tasks requiring data re-weighting, e.g., robustness (Wang et al., 2017), covariate shifts (Wang et al., 2017), ill-posed inverse problems (Gamboa and Gassiat, 1997), etc.

Figure 2: Distress Analysis Interview Corpus. Maximum likelihood estimates of the regression coefficients under both the two-step and in-model schemes. In the in-model scheme the estimates are slightly modified, since the regression coefficients and the weights assigned to the data are learned simultaneously. For details on the in-model and two-step approaches, refer to equations (4.1) and (4.2)-(4.3), respectively.

Figure 5: Limitations of the mean-variance optimal portfolio: (i) The skewness and excess kurtosis plots provide evidence that the normality assumption for expected returns does not hold. (ii) A small value of $\lambda$ leads to zero weight on several assets.

Figure 6: With a fixed target skew-normal return, varying values of $\lambda^{\star} \in [0, 1]$ provide different balances between diversity and departure from the target. The desired degree of diversification can be achieved via a simple grid search over $\lambda^{\star} \in [0, 1]$.

Table 2: NHANES Data. Bias $= \|\hat{\beta} - \beta\|$ and coverage of the moment condition model based estimates of the regression parameters for varying sample sizes.
as a function of race and age category. For each $n \in \{250, 500, 1000, 2000\}$, we utilize 100 Monte Carlo simulations. For the distribution-guided entropy maximization approach, we assume constraints on the score function of the logistic regression. The coverage of the moment condition model based estimates of the regression coefficients is presented in Table