Article

Optimal Subsampling for Upper Expectation Parametric Regression

Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan 250100, China
Mathematics 2025, 13(7), 1133; https://doi.org/10.3390/math13071133
Submission received: 7 March 2025 / Revised: 26 March 2025 / Accepted: 27 March 2025 / Published: 30 March 2025
(This article belongs to the Special Issue Statistical Theory and Application, 2nd Edition)

Abstract

In classic regression analysis, the error term of the model is typically required to be independent and identically distributed. In the realm of big data, however, it is common for the error term to follow different distributions owing to differences in data collection timing and sources. In this article, we extend and refine the upper expectation parametric regression model, introducing a novel form of the upper expectation loss that handles distribution heterogeneity via group-specific values μ_i, ensuring consistent and efficient parameter estimation. We also establish the asymptotic properties of this estimator. To address the challenges posed by big data and by privacy restrictions, we propose a new loss function based on Poisson subsampling; under certain assumptions, the resulting estimator is asymptotically normal. Based on these asymptotic properties, we derive the optimal sampling probability and introduce the optimal subsampling technique. Our sampling method surpasses uniform sampling, reducing the MSE by about 50% in our experiments, and is straightforward to implement in practical scenarios. Simulation experiments and real-world examples further demonstrate the effectiveness of our approach.

1. Introduction

Big data usually refers to datasets that are large in different respects: there can be many observations, many variables, or both. Big data presents new challenges for statistical methods and inference. Excessively large datasets may be impossible to store and process on a single machine. In addition, the data may come from different sources and be inhomogeneous, which can cause traditional statistical methods to fail. Specifically, the question we are interested in is whether it is possible to extract a computationally feasible model that is suitable for data from different times or different sources, or more generally, data with different underlying distributions. Our goal is to obtain a robust parameter estimate and a prediction based on the upper expectation.
Almost all classical statistical models rely on various assumptions, with the most crucial being that the model in question possesses a probability distribution, whether known or unknown. Classical linear expectations and determinant statistics hinge on the certainty of this distribution or model. However, in cases of model heterogeneity, traditional statistical methods may become inapplicable. For instance, the classical maximum likelihood may not exist or be uniquely determined due to the absence of a definitive likelihood function. In addition, the classical least squares estimation is invalid because the parameters are defined by linear expectations. Moreover, classic statistical models, such as linear regression models, may not be well defined, since their identifiability hinges on the average certainty, and the regression function becomes unidentifiable in its absence. Consequently, to attain the objective of statistical inference, it is imperative to devise a novel statistical framework and corresponding methodologies.
In the absence of certainty in distribution, the expectations derived are often nonlinear. Early studies on nonlinear expectations can be traced back to [1] in the realm of robust statistics and [2] in the field of imprecise probability. Over recent decades, theories and methods related to nonlinear expectations have seen substantial development and have gained recognition in application areas such as financial risk measurement and control. Ref. [3] presented a notable instance of nonlinear expectation within the context of backward stochastic differential equations, termed g-expectation. As an extension, ref. [4] introduced g-expectation and its related forms. Within the framework of nonlinear expectations, the most prevalent distribution is the so-called G-normal distribution, which was first put forth by [4]. Moreover, refs. [5,6] established the law of large numbers and the central limit theorem, serving as the theoretical cornerstone for nonlinear expectations.
Addressing statistical problems arising from distribution heterogeneity, ref. [7] examined k-diverse distributions and introduced the upper expectation regression model. Subsequently, ref. [8] advanced a mini-max risk and mini-mean risk regression approach within the context of distribution heterogeneity. Furthermore, ref. [9] introduced the notion of “maximin effects” along with a suitable estimator, evaluating its predictive accuracy from a theoretical perspective in mixture models with either known or unknown group structures. Then, ref. [10] focused on learning models that ensure uniform performance through distributionally robust optimization, incorporating considerations of the worst-case distribution and tail effects.
On the other hand, in the big data era, the rapid proliferation of data introduces fresh obstacles to numerous traditional statistical challenges. Foremost among these is the practical impossibility of utilizing standard computer data storage and analysis techniques. To address this issue, a multitude of statistical and computational methodologies have been devised thus far. The main strategies include subsampling, divide-and-conquer, and online update [11,12,13,14]. In this paper, we primarily consider the method of subsampling.
A central concept of the subsampling approach is to employ nonuniform sampling probabilities, ensuring that data points with higher information content are more likely to be selected. A notable method in this regard is the leverage-score-based subsampling introduced by [15]. Subsequently, ref. [16] suggested an information-driven optimal subsample selection technique specifically for linear models. This technique avoids random sampling and instead selects subsamples deterministically for statistical analysis. Additionally, ref. [17] derived the optimal Poisson subsampling probability for quasi-likelihood estimation and devised a distributed optimal subsampling strategy.
The main contribution of this paper is to improve and develop the upper expectation regression method within the framework of big data, or under privacy constraints, for models with distribution heterogeneity. Such heterogeneity is common in practical applications because differences in data sources and environments lead to variations in data distributions, as in the model for the influencing factors of air quality in Section 5. Upper expectation regression differs from classical regression in that it tends to use larger values to predict the response variable, attaining the mini-max prediction risk. Unlike the method proposed by [7], we address model heterogeneity by introducing group-specific values μ_i, which allows us to obtain a consistent estimator for β, thereby avoiding the potential inconsistency of the estimator of β found in the literature.
Another major contribution is to study the asymptotic theory of mini-max estimates for upper expectation regression under subsampling conditions. And then we provide a method to obtain the optimal subsampling probability based on the asymptotic theory. Furthermore, we employ an effective and robust estimation and prediction method, making sampling more stable and feasible. This is further supported by simulations and real data.
The rest of this paper is organized as follows. Section 2 briefly reviews the motivation, methods, and theoretical properties of upper expectation regression, and presents our improvement of the method. The subsampling method and its asymptotic theory are studied in Section 3. Section 4 provides the selection method and the specific implementation of the optimal subsampling probability. Section 5 presents simulation and real-data examples demonstrating the effectiveness and feasibility of the proposed method. Proofs of the theorems are deferred to Appendix A.

2. Upper Expectation Regression Model

2.1. Preliminary of Upper Expectation Model

We analyze the linear regression model
$$Y = \beta^{\top} x + \varepsilon,$$
where $Y$ is a scalar response variable, $x = (x_1, \ldots, x_p)^{\top}$ is a $p$-dimensional covariate vector, and $\beta = (\beta_1, \ldots, \beta_p)^{\top}$ is a $p$-dimensional vector of unknown parameters. For simplicity, we impose an independence-type assumption: the conditional expectation of $\varepsilon$ given $x$ is a constant that does not depend on $x$, that is,
$$E[\varepsilon \mid x] = \mu,$$
where $\mu$ is a constant free of $x$.
In the classic regression model, it is often assumed that the error ε is an independent and identically distributed random variable. However, in practical applications, due to the different time and sources of data collection, the assumption of independent and identically distributed for error terms may not always hold.
Firstly, we briefly review the k-sample upper expectation linear regression in [7]. The essential difference from the classic regression model is that the error term $\varepsilon$ exhibits distribution heterogeneity. The possible distributions of the error term form a set
$$\mathcal{F} = \{F_1, \ldots, F_k\},$$
where k, the number of distinct distributions, is finite.
Under the framework of sublinear expectations, the distribution of ε can be defined as
$$\mathbb{E}[\varphi(\varepsilon)] = \sup_{F \in \mathcal{F}} E_F[\varphi(\varepsilon)], \qquad \varphi \in C_{l,\mathrm{Lip}}.$$
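For a finite family $\mathcal{F}$, the sublinear expectation is simply a maximum of ordinary expectations. The following Monte Carlo sketch illustrates this with a hypothetical family of three centered normal distributions (the family, sample size, and seed are our illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite family F = {F_1, F_2, F_3}: centered normals with
# standard deviations 0.5, 1, 2 (variances 0.25, 1, 4).
sigmas = [0.5, 1.0, 2.0]
samples = [rng.normal(0.0, s, size=100_000) for s in sigmas]

def upper_expectation(phi):
    """Monte Carlo version of E[phi(eps)] = sup_{F in F} E_F[phi(eps)]."""
    return max(phi(eps).mean() for eps in samples)

# For phi(x) = x^2, the supremum is attained by the largest-variance member.
print(upper_expectation(lambda x: x ** 2))
```

For $\varphi(x) = x^2$ the value approaches the largest variance (here 4), while for $\varphi(x) = x$ all members share mean zero, so the upper expectation is near zero.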
Subsequently, we express the conditional expectations as follows:
$$\overline{\mu} = \mathbb{E}[\varepsilon \mid x], \qquad \underline{\mu} = -\mathbb{E}[-\varepsilon \mid x].$$
Given these definitions, we introduce the concept of upper expectation regression:
$$\mathbb{E}[Y \mid x] = \beta^{\top} x + \overline{\mu}.$$
Let $\{(y_{ij}, x_{ij})\}_{j=1}^{n}$, $i = 1, \ldots, k$, be a sample from model (1), so the data are divided into k groups. We assume that samples in different groups have different distributions and that samples within the same group share the same distribution. For simplicity, we assume that the groups have equal sizes, i.e., $n_1 = n_2 = \cdots = n_k = n$. In practice the group sizes may differ, but when they are of comparable magnitude we obtain essentially the same theoretical results.
In order to achieve the upper expectation loss, we use the empirical version of it. Specifically, the empirical version of the upper expectation loss is
$$Q(\beta) = \max_{1 \le i \le k} \frac{1}{n} \sum_{j=1}^{n} \big(y_{ij} - \beta^{\top} x_{ij} - \widetilde{\mu}\big)^2.$$
By minimizing the upper expectation loss $Q(\beta)$, we obtain the estimator of $\beta$, which we call the mini-max estimator of $\beta$:
$$\big(\widehat{\beta}_G, \widehat{\widetilde{\mu}}\big) = \arg\min_{\beta \in B,\, \widetilde{\mu} \in U} \max_{1 \le i \le k} \frac{1}{n} \sum_{j=1}^{n} \big(y_{ij} - \beta^{\top} x_{ij} - \widetilde{\mu}\big)^2.$$
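The mini-max estimator can be approximated numerically. The sketch below (plain NumPy, with a hypothetical data layout of per-group `(X, y)` arrays) uses a heuristic motivated by Theorem 1 — the solution behaves like least squares on the group attaining the largest loss — and iteratively refits on the currently worst group. It is an illustration under our own assumptions, not the authors' implementation:

```python
import numpy as np

def minimax_fit(groups, max_iter=20):
    """Heuristic sketch of the mini-max estimator: iteratively refit
    (beta, mu_tilde) by least squares on the group that currently
    attains the largest empirical loss."""
    # Pooled least squares start; the intercept column absorbs mu_tilde.
    X = np.vstack([np.c_[x, np.ones(len(x))] for x, _ in groups])
    y = np.concatenate([yi for _, yi in groups])
    theta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(max_iter):
        losses = [np.mean((yi - np.c_[xi, np.ones(len(xi))] @ theta) ** 2)
                  for xi, yi in groups]
        i_star = int(np.argmax(losses))          # worst-performing group
        xi, yi = groups[i_star]
        new_theta = np.linalg.lstsq(np.c_[xi, np.ones(len(xi))], yi,
                                    rcond=None)[0]
        if np.allclose(new_theta, theta):
            break
        theta = new_theta
    return theta[:-1], theta[-1]                 # beta_hat, mu_tilde_hat
```

On data sharing a common β but with heterogeneous error variances, the loop stabilizes on the largest-variance group, consistent with the asymptotics of Theorem 1.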
We write $\sigma_i^2 = E(\varepsilon_{ij} - \widetilde{\mu})^2$ and $\sigma_{i^*}^2 = \max_{1 \le i \le k} \sigma_i^2$.
The following theorem gives the asymptotic normality of the mini-max estimator of β .
Theorem 1
([7]). Suppose that $E[xx^{\top}]$ is a positive definite matrix and that $\sigma_{i^*}^2 > \sigma_i^2$ for all $i \ne i^*$. Then, as $n$ approaches infinity,
$$\sqrt{n}\big(\widehat{\beta}_G - \beta\big) \stackrel{d}{\longrightarrow} N\big(0, \sigma_{i^*}^2 (E[xx^{\top}])^{-1}\big),$$
where $\stackrel{d}{\longrightarrow}$ denotes convergence in distribution, and $N\big(0, \sigma_{i^*}^2 (E[xx^{\top}])^{-1}\big)$ is a normal distribution with the indicated covariance matrix.
However, the prediction based on μ ˜ ^ may not align with the upper expectation prediction, as μ ˜ ^ might not consistently estimate the upper expectation μ ¯ . Prior to concluding this section, we propose a two-stage estimation approach to develop a consistent estimator for μ ¯ and subsequently formulate a prediction grounded in upper expectation. Utilizing the consistent estimator β ^ G derived previously, the second-stage estimator for μ ¯ is defined as
$$\widehat{\overline{\mu}}_S = \max_{1 \le i \le k} \frac{1}{n} \sum_{j=1}^{n} \big(y_{ij} - x_{ij}^{\top} \widehat{\beta}_G\big).$$
Let $c = 1 - E[x^{\top}](E[xx^{\top}])^{-1}E[x]$ and
$$\Omega_1(x) = (E[xx^{\top}])^{-1} + \frac{(E[xx^{\top}])^{-1} E[x]\, E[x^{\top}]\, (E[xx^{\top}])^{-1}}{c}.$$
We present the following theorem based on these definitions.
Theorem 2
([7]). Under the conditions of Theorem 1, if the sequences $\{\varepsilon_{ij}, j = 1, \ldots, n\}$ and $\{\varepsilon_{sj}, j = 1, \ldots, n\}$ are independent for $i \ne s$, then the second-stage estimator satisfies the asymptotic distribution
$$\sqrt{n}\big(\widehat{\overline{\mu}}_S - \overline{\mu}\big) \stackrel{d}{\longrightarrow} N\big(0, \sigma_{*}^2 + \sigma_{\bar{i}}^2\, E[x^{\top}]\, \Omega_1(x)\, E[x]\big),$$
where $\overline{\mu}$ equals $\mu_{\bar{i}}$ for some index $\bar{i}$ in $\{1, \ldots, k\}$. Note that $E[x^{\top}]\, E[\Omega_1(x)]\, E[x]$ has been abbreviated to $E[x^{\top}]\, \Omega_1(x)\, E[x]$ for clarity.
The proof of Theorems 1 and 2 can be found in [7].

2.2. Improvement of the Method

We first consider the upper expectation loss given in Section 2.1:
$$\mathbb{E}\big[(Y - \beta^{\top}x - \widetilde{\mu})^2\big] = \max_{1 \le i \le k} E_i\big[(Y_i - \beta^{\top}x_i - \widetilde{\mu})^2\big].$$
Across groups, the expectation of the error term on the right-hand side may differ, yet this loss uses the same value $\widetilde{\mu}$ for every group. When the group losses are compared, the group attaining the largest loss is likely to be the one whose error-term expectation is farthest from $\widetilde{\mu}$, which is somewhat unreasonable.
To solve this problem, we propose a new upper expectation loss
$$\max_{1 \le i \le k} E_i\big[(Y_i - \beta^{\top}x_i - \mu_i)^2\big],$$
where μ i denotes the expectation of the error term of the group i, and E i denotes the expectation under the distribution of the data in the group i.
Because the values $\mu_i$ are unknown, we first form a priori estimates of them so that the subsequent estimation can proceed. That is, within each group we use simple least squares to approximate $\mu_i$ by $\widehat{\mu}_i$, namely
$$\widehat{\mu}_i = \arg\min_{\mu_i \in U} \frac{1}{n} \sum_{j=1}^{n} \big(y_{ij} - \beta^{\top}x_{ij} - \mu_i\big)^2.$$
In this optimization, $\beta$ is also treated as a parameter, so an estimate of $\beta$ is produced as a by-product; we do not need that estimate here, so it is not listed.
Then, we give a new empirical version of the upper expectation loss:
$$Q_1(\beta) = \max_{1 \le i \le k} \frac{1}{n} \sum_{j=1}^{n} \big(y_{ij} - \beta^{\top}x_{ij} - \widehat{\mu}_i\big)^2.$$
By minimizing the upper expectation loss, we can obtain the estimator of β . We write it as β ^ N .
For convenience, we use the same notation as in Section 2.1. In this setting, we write $\sigma_i^2 = E(\varepsilon_{ij} - \mu_i)^2$ and $\sigma_{i^*}^2 = \max_{1 \le i \le k} \sigma_i^2$; all later appearances of the $\sigma_i^2$ refer to these definitions.
To better express our method, we summarize it in Algorithm 1.
Algorithm 1 The two-stage estimation process for $\mu_i$ and $\beta$.
Require:
  $\{(y_{ij}, x_{ij})\}_{j=1}^{n}$, $i = 1, \ldots, k$, a sample from model (1);
Ensure:
  the estimations μ ^ i and β ^ N ;
   1:
for  i = 1 to k do
   2:
     Use (3) to solve for the parameter estimate μ ^ i ;
   3:
end for
   4:
Via minimizing the empirical upper expectation loss (4), obtain the estimation β ^ N ;
   5:
return  μ ^ i and β ^ N .
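Algorithm 1 can be sketched in code as follows. The group data layout and the stage-2 heuristic (refitting β on the group with the largest loss, motivated by the asymptotics of Theorem 3) are our own illustrative choices rather than the paper's implementation:

```python
import numpy as np

def two_stage_fit(groups, max_iter=20):
    """Sketch of Algorithm 1. Stage 1: per-group least squares gives the
    prior estimates mu_hat_i (as the fitted intercepts). Stage 2: minimize
    the empirical upper expectation loss Q1(beta) by iteratively refitting
    beta on the group with the largest loss (a heuristic)."""
    mu_hat = []
    for x, y in groups:
        coef = np.linalg.lstsq(np.c_[x, np.ones(len(x))], y, rcond=None)[0]
        mu_hat.append(coef[-1])      # intercept = prior estimate of mu_i
    # Stage 2: pooled start, then follow the worst group.
    beta = np.linalg.lstsq(
        np.vstack([x for x, _ in groups]),
        np.concatenate([y - m for (_, y), m in zip(groups, mu_hat)]),
        rcond=None)[0]
    for _ in range(max_iter):
        losses = [np.mean((y - x @ beta - m) ** 2)
                  for (x, y), m in zip(groups, mu_hat)]
        i = int(np.argmax(losses))
        x, y = groups[i]
        new_beta = np.linalg.lstsq(x, y - mu_hat[i], rcond=None)[0]
        if np.allclose(new_beta, beta):
            break
        beta = new_beta
    return np.array(mu_hat), beta
```

Because each group carries its own $\widehat{\mu}_i$, the estimator of β remains consistent even when the group means of the errors differ, which is the point of the improvement.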
Before presenting the asymptotic results, we first make the following assumption about ε i j primarily concerning data independence and distributional assumptions.
Assumption 1.
There exists an index decomposition $I_i$, $i = 1, \ldots, k$, such that when $(ij) \in I_i$, the variables $\varepsilon_{i1}, \ldots, \varepsilon_{in_i}$ are identically distributed with bounded variance, and
$$\frac{1}{n} \sum_{j=1}^{n} E\Big[(\varepsilon_{ij} - \mu_i)^2\, \mathbf{1}\big\{|\varepsilon_{ij} - \mu_i| > \delta \sqrt{n}\big\}\Big] \longrightarrow 0 \quad \text{for every } \delta > 0.$$
The following theorem gives the asymptotic normality of the new mini-max estimator of β .
Theorem 3.
Under Assumption 1, and further assuming that $E[xx^{\top}]$ is a positive definite matrix and $\sigma_{i^*}^2 > \sigma_i^2$ for all $i \ne i^*$, we have, as $n \to \infty$,
$$\sqrt{n}\big(\widehat{\beta}_N - \beta\big) \stackrel{d}{\longrightarrow} N\big(0, \sigma_{i^*}^2 (E[xx^{\top}])^{-1}\big),$$
where $\stackrel{d}{\longrightarrow}$ stands for convergence in distribution, and $N\big(0, \sigma_{i^*}^2 (E[xx^{\top}])^{-1}\big)$ is a classical normal distribution.
The proof of the theorem is provided in the Appendix A.

3. General Poisson Subsampling

For large datasets, the sheer volume of data causes great difficulties in computation and storage. To address this problem, we adopt the Poisson subsampling method. We first consider Poisson subsampling in general, i.e., without specifying the probability with which each sample is selected; a concrete sampling scheme and its implementation are given in later sections.

3.1. Poisson Subsampling Method

We consider the same datasets as in Section 2.1. Let $p_{ij}$ be the probability of selecting the $j$th sample point in the $i$th group, where $i = 1, \ldots, k$ and $j = 1, \ldots, n$. Let $S_i$ denote the set of observations and sampling probabilities of the subsample drawn from the $i$th group, that is,
$$S_i = \{\delta_{ij}(x_{ij}, y_{ij}, p_{ij}),\ j = 1, \ldots, n\},$$
where $\delta_{ij}$ is a Bernoulli random variable; we write $\delta_{ij} \sim \mathrm{Bernoulli}(p_{ij})$.
As introduced in the previous section, $Q_1(\beta)$ is the empirical version of the upper expectation loss $\max_{1 \le i \le k} E_i\big[(Y_i - \beta^{\top}x_i - \widehat{\mu}_i)^2\big]$.
Due to the large amount of data, a natural idea is to perform statistical analysis on the sampled subsets. Specifically, by using the obtained sample set, we define a new weighted upper expectation loss as
$$\max_{1 \le i \le k} E_i\Big[\frac{\delta}{p}\big(Y_i - \beta^{\top}x_i - \widehat{\mu}_i\big)^2\Big].$$
Because of the independence of δ and ( x , Y ) , we know the weighted upper expectation loss is equal to the upper expectation loss. That is,
$$\mathbb{E}\Big[\frac{\delta}{p}(Y - \beta^{\top}x - \widetilde{\mu})^2\Big] = \max_{1 \le i \le k} E_i\Big[\frac{\delta}{p}(Y - \beta^{\top}x - \widehat{\mu}_i)^2\Big] = \max_{1 \le i \le k} E_i\Big[\frac{\delta}{p}\Big] E_i\big[(Y - \beta^{\top}x - \widehat{\mu}_i)^2\big] = \max_{1 \le i \le k} E_i\big[(Y - \beta^{\top}x - \widehat{\mu}_i)^2\big] = \mathbb{E}\big[(Y - \beta^{\top}x - \widetilde{\mu})^2\big],$$
since $E_i[\delta/p] = 1$.
Then, we define the empirical version of the weighted upper expectation loss as
$$Q_1^*(\beta) = \max_{1 \le i \le k} \frac{1}{n} \sum_{j \in S_i} \frac{1}{p_{ij}} \big(y_{ij} - \beta^{\top}x_{ij} - \widehat{\mu}_i\big)^2 = \max_{1 \le i \le k} \frac{1}{n} \sum_{j=1}^{n} \frac{\delta_{ij}}{p_{ij}} \big(y_{ij} - \beta^{\top}x_{ij} - \widehat{\mu}_i\big)^2.$$
We can solve the parameter β by minimizing the weighted estimation function. It means that
β ˜ = arg min β B Q 1 * ( β ) .
From simple calculations, we obtain
$$Q_1^*(\beta) = \max_{1 \le i \le k} \frac{1}{n} \sum_{j \in S_i} \frac{1}{p_{ij}} \big(y_{ij}^2 + \beta^{\top}x_{ij}x_{ij}^{\top}\beta + \widehat{\mu}_i^2 - 2\beta^{\top}y_{ij}x_{ij} - 2y_{ij}\widehat{\mu}_i + 2\beta^{\top}x_{ij}\widehat{\mu}_i\big).$$
One advantage of Poisson subsampling is that the probability p i j only depends on the dataset ( x i j , y i j ) j = 1 n . Therefore, the probability p i j can be generated block by block, without using all the data at once. In addition, according to the above formula, we only need to calculate the sum of y i j 2 , x i j x i j , y i j x i j , y i j , and x i j in each group. These statistics can be sent to the central machine for calculation, without transmitting the original data of each group. Thereby, it can reduce the time and cost of communication.
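The role of the inverse-probability weights can be illustrated with a small simulation: Poisson subsampling keeps each point independently with probability $p_{ij}$, and the weights $\delta_{ij}/p_{ij}$ make the subsampled loss unbiased for the full-data loss. All constants below (single group, dimensions, probabilities, seed) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical single group: y = x'beta + mu + noise.
n, beta_true, mu = 100_000, np.array([1.0, -2.0]), 0.3
x = rng.normal(size=(n, 2))
y = x @ beta_true + mu + rng.normal(size=n)

p = np.full(n, 0.02)          # Poisson inclusion probabilities (uniform here)
delta = rng.random(n) < p     # independent Bernoulli(p_j) selection

def weighted_loss(beta):
    """(1/n) * sum_j (delta_j / p_j) * (y_j - x_j'beta - mu)^2."""
    resid2 = (y - x @ beta - mu) ** 2
    return np.sum((delta / p) * resid2) / n

full_loss = np.mean((y - x @ beta_true - mu) ** 2)
print(full_loss, weighted_loss(beta_true))
```

Although only about 2% of the points are retained, the weighted loss concentrates around the full-data loss, which is what makes minimizing $Q_1^*(\beta)$ a sensible surrogate for minimizing $Q_1(\beta)$.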

3.2. Theoretical Properties

In order to establish asymptotic results of β ˜ , we need the following assumptions.
Assumption 2.
The regression parameter lies in the $\ell_1$ ball $\Lambda = \{\beta \in \mathbb{R}^p : \|\beta\|_1 \le C\}$, where C is a constant. Moreover, $\widetilde{\beta}$ and $\widehat{\beta}_N$ are interior points of Λ.
Assumption 3.
We assume
$$E_i \|x_i\|^3 < \infty$$
for all $i \in \{1, \ldots, k\}$.
Assumption 4.
For every $i \in \{1, \ldots, k\}$, we have
$$E_i \big|y_i - \beta^{\top}x_i - \widehat{\mu}_i\big|^3 < \infty.$$
The size of the subsample, $r^*$, is a random variable with $E[r^*] = \sum_{j=1}^{n} p_j$ within each group. We use $r$ to denote $E[r^*]$. Throughout this article, we naturally assume that $r < n$.
Assumption 5.
We assume that $\max_{j=1,\ldots,n} (n p_j)^{-1} = O_P(r^{-1})$.
Assumption 2 ensures that neighborhoods of the estimators $\widetilde{\beta}$ and $\widehat{\beta}_N$ have reasonable properties. Assumption 3 imposes moment conditions on the covariates, and Assumption 4 requires a moment condition on the loss function; Assumptions 3 and 4 serve as the key moment conditions in the proofs of Theorems 4 and 5 in Appendix A. Assumption 5 bounds the sampling probability of each point from below.
First, we give the property of Q 1 * ( β ) .
Theorem 4.
If Assumptions 2–5 hold, then we have
$$Q_1^*(\beta) = Q_1(\beta) + o_P(1).$$
The proof of Theorem 4 is in the Appendix A. From Theorem 4, we can see that the Q 1 * ( β ) we proposed is reasonable.
Then, we give the asymptotic normality of β ˜ .
Theorem 5.
If Assumptions 2–5 hold, then as $n \to \infty$ and $r \to \infty$, we have
$$V^{-1/2}\big(\widetilde{\beta} - \widehat{\beta}_N\big) \stackrel{d}{\longrightarrow} N(0, I),$$
where $V = \Sigma_{i^*}^{-1} V_c \Sigma_{i^*}^{-1}$, and
$$V_c = \frac{1}{n^2} \sum_{j=1}^{n} \frac{\big(y_{i^*j} - \widehat{\beta}_N^{\top}x_{i^*j} - \widehat{\mu}_{i^*}\big)^2 x_{i^*j}x_{i^*j}^{\top}}{p_{i^*j}} - \frac{1}{n^2} \sum_{j=1}^{n} \big(y_{i^*j} - \widehat{\beta}_N^{\top}x_{i^*j} - \widehat{\mu}_{i^*}\big)^2 x_{i^*j}x_{i^*j}^{\top},$$
$$\Sigma_i = \frac{1}{n} \sum_{j=1}^{n} x_{ij}x_{ij}^{\top}.$$
The proof of Theorem 5 is in the Appendix A.

4. Optimal Subsampling Strategies

In this section, we derive the optimal subsampling probability so that $\widetilde{\beta}$ approximates $\widehat{\beta}_N$ as well as possible. After the theoretical derivation, we give a selection method for practical computation. We mainly draw on the method of [17].

4.1. Theoretical Method

We derive the optimal subsampling probability from the result of Theorem 5; that is, we minimize the asymptotic mean squared error of $\widetilde{\beta}$ as an estimator of $\widehat{\beta}_N$, which is equivalent to minimizing $\mathrm{tr}(V)$. This criterion is called A-optimality in the language of optimal design (see [18]).
Method 1 (A-optimality).
For ease of presentation, define the statistics in the ith group:
$$d_{ij}^{V} = \big|y_{ij} - \widehat{\beta}_N^{\top}x_{ij} - \widehat{\mu}_i\big|\, \big\|\Sigma_i^{-1}x_{ij}\big\|_2, \quad j = 1, \ldots, n.$$
Let $d_{i(1)}^{V} \le d_{i(2)}^{V} \le \cdots \le d_{i(n)}^{V}$ denote the order statistics of $\{d_{ij}^{V}\}_{j=1}^{n}$. By minimizing the asymptotic mean squared error, i.e., minimizing $\mathrm{tr}(V)$, we obtain the sampling probabilities of the $i$th group as
$$p_{i(j)}^{V} = \frac{(r-s)\, d_{i(j)}^{V}}{\sum_{k=1}^{n-s} d_{i(k)}^{V}}, \quad 1 \le j \le n-s,$$
and
$$p_{i(j)}^{V} = 1, \quad n-s+1 \le j \le n,$$
where
$$s = \min\Big\{t : 0 \le t \le r,\ (r-t)\, d_{i(n-t)}^{V} \le \sum_{j=1}^{n-t} d_{i(j)}^{V}\Big\}.$$
This means that $s$ satisfies
$$(r-s+1)\, d_{i(n-s+1)}^{V} > \sum_{j=1}^{n-s+1} d_{i(j)}^{V}, \qquad (r-s)\, d_{i(n-s)}^{V} \le \sum_{j=1}^{n-s} d_{i(j)}^{V}.$$
The probabilities $\{p_{ij}^{V}\}_{j=1}^{n}$ can be obtained directly from the covariates and response variables of each group.
Sampling probabilities must lie in $[0, 1]$, so the probability computed from (6) should not exceed 1. As the rank $(j)$ increases, the computed probability approaches 1. To keep every probability at most 1, we introduce the threshold $s$ satisfying (8); once $(j)$ exceeds $n - s$, we set the sampling probability directly to 1.
However, computing $\{p_{ij}^{V}\}_{j=1}^{n}$ requires evaluating $\|\Sigma_i^{-1}x_{ij}\|_2$ for $j = 1, \ldots, n$, which takes $O(np^2)$ time. To further reduce the computation time, a simple and natural alternative is to work with the matrix $V_c$ directly, so that $\|\Sigma_i^{-1}x_{ij}\|_2$ need not be computed. We therefore modify the method to minimize $\mathrm{tr}(V_c)$ when calculating the optimal subsampling probability. This criterion is called the linear optimality criterion, or L-optimality for short (see [18]).
Below, we describe in detail the process of obtaining optimal subsampling probability by minimizing t r ( V c ) .
Method 2 (L-optimality).
For ease of presentation, define the statistics in the ith group:
$$d_{ij}^{V_c} = \big|y_{ij} - \widehat{\beta}_N^{\top}x_{ij} - \widehat{\mu}_i\big|\, \|x_{ij}\|_2, \quad j = 1, \ldots, n.$$
Let $d_{i(1)}^{V_c} \le d_{i(2)}^{V_c} \le \cdots \le d_{i(n)}^{V_c}$ denote the order statistics of $\{d_{ij}^{V_c}\}_{j=1}^{n}$. By minimizing the asymptotic mean squared error, i.e., minimizing $\mathrm{tr}(V_c)$, we obtain the sampling probabilities of the $i$th group as
$$p_{i(j)}^{V_c} = \frac{(r-s)\, d_{i(j)}^{V_c}}{\sum_{k=1}^{n-s} d_{i(k)}^{V_c}}, \quad 1 \le j \le n-s,$$
and
$$p_{i(j)}^{V_c} = 1, \quad n-s+1 \le j \le n,$$
where
$$s = \min\Big\{t : 0 \le t \le r,\ (r-t)\, d_{i(n-t)}^{V_c} \le \sum_{j=1}^{n-t} d_{i(j)}^{V_c}\Big\}.$$
This means that $s$ satisfies
$$(r-s+1)\, d_{i(n-s+1)}^{V_c} > \sum_{j=1}^{n-s+1} d_{i(j)}^{V_c}, \qquad (r-s)\, d_{i(n-s)}^{V_c} \le \sum_{j=1}^{n-s} d_{i(j)}^{V_c}.$$
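The threshold construction shared by Methods 1 and 2 can be implemented directly: given the statistics $d_{ij}$ (either $d^V$ or $d^{V_c}$) and the expected subsample size r, find the smallest s for which scaling the remaining probabilities keeps them at most 1, set the top s probabilities to 1, and scale the rest. A sketch under these conventions (function name and interface are ours):

```python
import numpy as np

def optimal_poisson_probs(d, r):
    """Map per-point statistics d_ij (A- or L-optimality) to Poisson
    inclusion probabilities p_ij <= 1 with expected subsample size r."""
    d = np.asarray(d, dtype=float)
    n = len(d)
    order = np.argsort(d)                # ascending: d_(1) <= ... <= d_(n)
    d_sorted = d[order]
    # Smallest s with (r - s) * d_(n-s) <= sum_{j <= n-s} d_(j).
    s = 0
    while s < r and (r - s) * d_sorted[n - s - 1] > d_sorted[: n - s].sum():
        s += 1
    p_sorted = np.ones(n)                # top s points are kept with prob. 1
    p_sorted[: n - s] = (r - s) * d_sorted[: n - s] / d_sorted[: n - s].sum()
    p = np.empty(n)
    p[order] = p_sorted                  # undo the sort
    return p
```

By construction the probabilities sum to r (so the expected subsample size is r) and never exceed 1; a single dominant $d_{ij}$ is assigned probability 1 rather than an invalid value above 1.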

4.2. Robust Implementation

In the above construction, $\delta_{ij}/p_{ij}$ enters the estimation function as a weight. Consider data points satisfying $y_{ij} - \widehat{\beta}_N^{\top}x_{ij} - \widehat{\mu}_i \approx 0$. The probability of such points being sampled is quite small, but they may still be selected, and when they are, the large inverse-probability weights can make the estimating equation very sensitive to them. This may cause Methods 1 and 2 to fail in practical applications. To make the estimation more stable, we adopt a more robust sampling scheme to implement Methods 1 and 2 of the previous section.
We use the following subsampling probability:
$$p_{ij}^{sos} = (1-\rho)\, p_{ij}^{os} + \rho\, \frac{r}{n}, \quad j = 1, \ldots, n,$$
where $p_{ij}^{os}$ is either $p_{ij}^{V}$ or $p_{ij}^{V_c}$, and $\rho \in (0, 1)$.
$\{p_{ij}^{sos}\}_{j=1}^{n}$ is a convex combination of $\{p_{ij}^{os}\}_{j=1}^{n}$ and the uniform subsampling probability $r/n$, inheriting the advantages of both. Here, ρ is a preference tuning parameter: a smaller ρ yields a sampling probability closer to the optimal one, while a larger ρ yields a more robust one. When ρ is close to 1, the estimation function is insensitive to individual data points, so the resulting estimate is more stable. On the other hand, whatever value ρ takes, the ordering of $\{p_{ij}^{sos}\}$ is consistent with that of $\{p_{ij}^{os}\}$, so the estimator retains the advantages of optimal subsampling.
In practical implementation, the parameters β ^ N in p i j os are unknown. Therefore, we need to estimate the values of β ^ N to continue the following calculations.
To estimate $\widehat{\beta}_N$, we use uniform sampling to obtain an a priori estimate. In detail, we first take a uniform subsample of size $r_0$ from each group. Then, we substitute the obtained subsamples into the estimation function to calculate a pilot estimator of $\widehat{\beta}_N$; we also use the $\widehat{\mu}_i$ in this calculation. We write the pilot estimator as $\widetilde{\beta}_0$.
In addition, our subsampling setting typically requires $r \ll n$. To ensure that each probability lies in $[0, 1]$, we truncate at 1 and take the inverses $(p_{ij}^{sos} \wedge 1)^{-1}$ as weights in the weighted upper expectation loss.
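The shrinkage step (11) is a one-liner; the sketch below also truncates at 1, which we include as a safeguard (the function name and interface are ours):

```python
import numpy as np

def robust_probs(p_os, r, n, rho=0.2):
    """Blend the optimal probabilities p^os with the uniform probabilities
    r/n as in (11), then truncate at 1 so each value is a valid inclusion
    probability."""
    p_sos = (1.0 - rho) * np.asarray(p_os) + rho * r / n
    return np.minimum(p_sos, 1.0)

# Hypothetical optimal probabilities for n = 3 points, expected size r = 1.
p_os = np.array([0.90, 0.07, 0.03])
print(robust_probs(p_os, r=1, n=3))
```

Because the map is affine and increasing, the blended probabilities preserve the ordering of $p_{ij}^{os}$ while bounding every probability away from zero, which tames the inverse-probability weights.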
At the end of this section, we summarize the estimation methods under the optimal sampling probability as Algorithm 2, paving the way for the simulations and real-data applications in the next section.
Algorithm 2 The robust subsampling method for β .
Require:
   y i j , x i j j = 1 n be a sample in model (1), where i = 1 , , k ;
Ensure:
  the estimations β ˜ ;
  1:
for  i = 1 to k do
  2:
     Use (3) to solve for the parameter estimate μ ^ i ;
  3:
end for
  4:
Use uniform sampling to obtain the pilot estimate $\widetilde{\beta}_0$ based on (5);
  5:
Use (6) and (7) or (9) and (10) to obtain the subsampling probability p i j V or p i j Vc ;
  6:
Obtain the robust subsampling probability p i j s o s according to (11);
  7:
Compute the estimation β ˜ according to (5) with the subsampling probability p i j s o s ;
  8:
return  β ˜ .

5. Simulation Study and Real Data Analysis

5.1. Simulation Study

In this part, we give several simulation examples. By comparing the performance of our proposed optimal subsampling methods and uniform sampling method, the rationality and effectiveness of our proposed method are further illustrated.
From the simulations presented below, we draw the following conclusions. Whether the mean of the error is certain or uncertain, our proposed optimal subsampling methods perform significantly better than simple uniform subsampling, and their performance improves as the sampling ratio increases.

5.1.1. Experiment 1

In the first simulation experiment, we consider the following linear model:
Y = β 1 X 1 + β 2 X 2 + β 3 X 3 + ε .
In the simulation, we examine a scenario where the error term has a known mean but an uncertain variance. We set the regression coefficients to $\beta_j = 1$ for $j = 1, 2, 3$, and the predictors $X_j$ are independently and identically distributed as $N(0, 1)$ for $j = 1, 2, 3$. The error term ε follows a normal distribution with mean zero and unknown variance $\sigma^2$, giving a generalized regression model that accommodates variance uncertainty. In the simulation process, we draw the variance values $\sigma_i^2$, $i = 1, \ldots, k$, from the uniform distribution on $[0, 4]$, and then generate the errors $\varepsilon_{ij}$, $j = 1, \ldots, n$, from normal distributions centered at zero with variances $\sigma_i^2$.
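The data-generating process of Experiment 1 can be reproduced in a few lines; the seed and the array layout are our own choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# Experiment 1 setup: k groups of size n, beta = (1, 1, 1)', covariates
# X_j ~ N(0, 1), group variances sigma_i^2 drawn from U(0, 4), errors
# centered at zero with group-specific variance.
k, n, beta = 10, 10_000, np.ones(3)
sigma2 = rng.uniform(0.0, 4.0, size=k)

groups = []
for i in range(k):
    X = rng.normal(size=(n, 3))
    eps = rng.normal(0.0, np.sqrt(sigma2[i]), size=n)
    groups.append((X, X @ beta + eps))
```

The resulting list of per-group `(X, y)` pairs is exactly the input format assumed by the estimation and subsampling sketches earlier in the paper's workflow.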
We compare three subsampling methods. The first is simple uniform sampling, which serves as a baseline against which to assess the effectiveness and rationality of our proposal. The second is Method 1 of Section 4, called A-optimality, and the third is Method 2 of Section 4, called L-optimality. We set the preference tuning parameter to ρ = 0.2, chose k = 10 and n = 10,000, and took the per-group subsample size r to be 100, 300, and 500, so that the impact of different subsample sizes r can be compared. We repeated the subsampling 500 times on the generated data and compared the three methods via the bias and mean squared error of the parameter estimates.
The simulation results are reported in Table 1, Table 2 and Table 3. The bolded results in the tables correspond to the best-performing method.
To further illustrate the effect of our sampling method in upper expectation regression, we changed the experimental settings to induce stronger heterogeneity: we increased the number of groups k and correspondingly reduced the number of samples per group, choosing k = 50 and n = 2000 while keeping the other experimental conditions unchanged.
The simulation results are reported in Table 4, Table 5 and Table 6. Based on the entire simulation experiment, we draw the following conclusions:
  • By comparing the results of the three methods, it can be observed that our proposed A-optimality and L-optimality approaches significantly outperformed the uniform sampling method in terms of mean squared error. On the other hand, there was little difference in the effectiveness of the mean estimates obtained by the three methods. This demonstrates the effectiveness and stability of our methods, particularly for the method of A-optimality.
  • By comparing Table 1, Table 2 and Table 3 or Table 4, Table 5 and Table 6, it can be observed that as the number of sampling instances r increased, the estimation performance of all three methods improved significantly, both in terms of the mean and mean squared error. This underscores the importance of the number of sampling instances.
  • By comparing Table 1, Table 2 and Table 3 with Table 4, Table 5 and Table 6, it can be observed that under high heterogeneity, where k is large and n is small, the estimation performance of all three methods deteriorated. However, this discrepancy diminished as the number of sampling instances r increased.

5.1.2. Experiment 2

We reconsider the linear model
Y = β 1 X 1 + β 2 X 2 + β 3 X 3 + ε ,
which has the same form as in Experiment 1. In this model, we consider the situation where the mean and variance of the error are both uncertain; that is, $\varepsilon \sim N(\mu, \sigma^2)$ with uncertain mean μ and variance $\sigma^2$.
In the simulation, we again specified that $X_j$ follows a normal distribution $N(0, 1)$ for $j = 1, 2, 3$. The error terms were generated as follows: first, the mean values $\mu_i$ and variance values $\sigma_i^2$ were drawn from the uniform distributions $U(-1, 1)$ and $U(0, 1)$, respectively; then, the error values $\varepsilon_{ij}$, $j = 1, \ldots, n$, were sampled from the normal distribution $N(\mu_i, \sigma_i^2)$. As before, we set k = 10 and n = 10,000 or k = 50 and n = 2000, and took r to be 100, 300, or 500.
The simulation results are reported in Table 7, Table 8, Table 9, Table 10, Table 11 and Table 12. The simulation results are similar to those of Experiment 1, which are briefly summarized here:
1. In summary, comparing the methods shows that our proposed A-optimality and L-optimality approaches outperformed the uniform sampling method in terms of mean squared error, while the mean estimates of all three methods differed little in effectiveness.
2. As the subsample size r increased, the estimation performance of all three methods improved significantly.
3. Reviewing Table 7, Table 8, Table 9, Table 10, Table 11 and Table 12 shows that when heterogeneity is high, the estimation performance of all three methods deteriorates.

5.2. Real Data Analysis

An increasing number of urban areas are grappling with persistent air pollution, primarily due to fine particulate matter, especially PM2.5. PM2.5 comprises particles suspended in the air with aerodynamic diameters smaller than 2.5 micrometers. These particles are recognized for their impact on visibility, human health, and climate patterns. Epidemiological studies indicate that exposure to PM2.5 can lead to respiratory ailments, severe cardiovascular diseases, and potentially fatal outcomes.
There are many factors behind PM2.5 pollution, including the SO2 concentration, O3 concentration, temperature, and wind speed; this article mainly considers the impact of these four factors on the PM2.5 concentration. The dataset used in this article was taken from the UCI machine learning repository. Specifically, we obtained the PM2.5 concentration, SO2 concentration, O3 concentration, temperature, and wind speed recorded at the Gucheng ("ancient city") site in Beijing from 1 March 2013 to 29 February 2016. Because of the large time span involved, model heterogeneity may arise, so we grouped the data by time for processing. We then considered the linear influence of the covariates on the response variable and tested the effect of the subsampling method proposed in this article.
We first centered the 25,217 records collected over the three years, taking the centered PM2.5 concentration as the response variable and the centered SO2 concentration, O3 concentration, temperature, and wind speed as the four covariates. These covariates all affect the PM2.5 concentration: for instance, SO2 can undergo chemical reactions that transform it into components of PM2.5, while wind speed affects the PM2.5 concentration through dispersion and dilution. Next, we divided the data into 12 groups by time, with every three months forming one group. This yielded about 2100 data points per group, although the group sizes were not exactly equal.
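One way to implement the three-month grouping described above is sketched below; the helper name is ours, and the 1 March 2013 start date follows the stated collection period.

```python
from datetime import date

def quarter_group(d, start=date(2013, 3, 1)):
    # Assign a record to one of the three-month groups used in the real-data
    # analysis: group 0 is Mar-May 2013, group 1 is Jun-Aug 2013, and so on,
    # giving 12 groups over the 1 March 2013 to 29 February 2016 span.
    months = (d.year - start.year) * 12 + (d.month - start.month)
    return months // 3

assert quarter_group(date(2013, 3, 1)) == 0
assert quarter_group(date(2013, 6, 15)) == 1
assert quarter_group(date(2016, 2, 29)) == 11  # last of the 12 groups
```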
For the linear regression problem, we considered two different situations. In the first case, the model has only variance uncertainty; that is, the mean of the error term is the same in every group. In the second case, the model has both variance uncertainty and mean uncertainty. The two sampling methods we proposed performed well in both cases and were significantly better than the ordinary uniform sampling method.
Case 1 (Mean certainty model)
We assumed that the model has only variance heterogeneity; in other words, the errors from different groups have the same mean. Consider the following model:
\[ Y=\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{3}+\beta_{4}X_{4}+\varepsilon. \]
Our purpose is to estimate the parameters \(\beta_{1},\beta_{2},\beta_{3}\), and \(\beta_{4}\), which are the same in every group. Using the subsampling methods proposed above, we obtained the mean prediction error and computational time of each method under different settings (Table 13). The subsample size r in each group was set to 50.
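The paper does not spell out the exact formula for the mean prediction error, so the sketch below adopts one plausible definition (average absolute gap between the observed response and the linear prediction); the function name and the absolute-error choice are our assumptions.

```python
def mean_prediction_error(test_data, beta):
    # One plausible MPE definition (an assumption, not the paper's stated
    # formula): the average absolute gap between the observed response and
    # the linear prediction x . beta over a held-out set.
    err = 0.0
    for x, y in test_data:
        pred = sum(b * xj for b, xj in zip(beta, x))
        err += abs(y - pred)
    return err / len(test_data)

# toy check on a noiseless observation: the true coefficients give zero error
sample = [([1.0, 2.0, 0.5, -1.0], 1.0 * 1.0 + 0.5 * 2.0 - 2.0 * 0.5 + 0.3 * (-1.0))]
print(mean_prediction_error(sample, [1.0, 0.5, -2.0, 0.3]))  # 0.0
```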
Case 2 (Mean–variance heterogeneity model)
We then considered a model in which the errors of different groups exhibit not only mean heterogeneity but also variance heterogeneity. Consider the following model:
\[ Y=\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{3}+\beta_{4}X_{4}+\varepsilon. \]
In this model, \(\varepsilon_{i}\sim N(\mu_{i},\sigma_{i}^{2})\). We again set the subsample size in each group to 50 and recorded the mean prediction error and computational time of each method under different settings (Table 14).
The results are presented in Table 13 and Table 14. Through the actual data analysis of the two cases, we obtained the following conclusions:
  • First, in terms of computational time, uniform sampling was clearly the fastest, while A-optimality was the slowest. This is because the sampling probabilities for uniform sampling are trivially obtained, whereas A-optimality requires computing the inverse of the covariance matrix, which is more expensive than L-optimality.
  • Based on the mean prediction error metric, we found that the A-optimality and L-optimality methods significantly outperformed the uniform sampling method, with the A-optimality method being the best.
  • By comparing Table 13 and Table 14, it can be observed that the prediction performance of the mean–variance heterogeneous model was significantly better than that of the variance heterogeneous model, while their computational times were comparable. This suggests that the mean–variance model is more suitable for the real data example.

6. Conclusions

Against the background of heterogeneous distributions, statistical modeling and inference remain a significant challenge. In this paper, we improved the k-sample upper expectation regression model under distribution heterogeneity by employing group-specific \(\mu_{i}\) values to absorb the impact of heterogeneity, obtaining consistent estimates. Additionally, we developed optimal subsampling techniques specifically designed for big data scenarios, addressing the challenges posed by large datasets and privacy protection through a robust optimal sampling probability. We analyzed the theoretical properties of our method and conducted comprehensive numerical experiments on both simulated and real datasets to assess its practical effectiveness. Both the theoretical analysis and the numerical results confirm the validity of our method for large datasets. Furthermore, the real data analysis demonstrates that our method can be applied in related fields such as finance and environmental monitoring.
However, there are still some shortcomings in our method that require subsequent work. First, the k-sample assumption requires prior knowledge of the k sample groups; that is, we must assume that the clusters of samples with different distributions are already known, typically from experience or other methods. This is a rather stringent requirement. Future work could develop methods that accommodate full distribution heterogeneity when the grouping is unknown, for example, by first applying data-driven clustering and then upper expectation regression. Additionally, a large k often implies a small sample size in each group and hence a high degree of heterogeneity; in such cases, both the computational complexity and the estimation quality face significant challenges, and overcoming these issues is another direction for future work. Lastly, the robust optimal sampling probability involves a tuning parameter \(\rho\), and selecting the optimal \(\rho\) to balance the robustness and efficiency of the sampling probability is also a problem worth studying.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the author.

Acknowledgments

The author would like to express his sincere gratitude to the anonymous referees for their valuable comments and suggestions, which significantly improved the quality of this paper.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A

First, we present two lemmas that play crucial roles in subsequent proofs.
Lemma A1.
Given Assumptions 2–5, as both n and r tend to infinity, for any \(s_{r}\) that converges to zero in probability, we have
\[
\frac{1}{n}\sum_{j=1}^{n}\frac{\delta_{i^{*}j}}{p_{i^{*}j}}x_{i^{*}j}^{\top}x_{i^{*}j}
=\frac{1}{n}\sum_{j=1}^{n}x_{i^{*}j}^{\top}x_{i^{*}j}+o_{p}(1).
\]
Proof. 
Let \(\mathcal{F}=\{x_{ij},y_{ij}\}_{i=1,j=1}^{k,n}\). Direct calculation shows that, conditionally on \(\mathcal{F}\),
\[
E\left[\frac{1}{n}\sum_{j=1}^{n}\frac{\delta_{i^{*}j}}{p_{i^{*}j}}x_{i^{*}j}^{\top}x_{i^{*}j}\,\Big|\,\mathcal{F}\right]
=\frac{1}{n}\sum_{j=1}^{n}x_{i^{*}j}^{\top}x_{i^{*}j},
\]
and
\begin{align*}
&E\left[\left\|\frac{1}{n}\sum_{j=1}^{n}\frac{\delta_{i^{*}j}}{p_{i^{*}j}}x_{i^{*}j}^{\top}x_{i^{*}j}
-\frac{1}{n}\sum_{j=1}^{n}x_{i^{*}j}^{\top}x_{i^{*}j}\right\|^{2}\Big|\,\mathcal{F}\right]
=\sum_{j=1}^{n}\frac{p_{i^{*}j}(1-p_{i^{*}j})}{p_{i^{*}j}^{2}}\frac{\left\|x_{i^{*}j}^{\top}x_{i^{*}j}\right\|^{2}}{n^{2}}\\
&\quad=\sum_{j=1}^{n}\frac{1}{p_{i^{*}j}}\frac{\left\|x_{i^{*}j}^{\top}x_{i^{*}j}\right\|^{2}}{n^{2}}
-\sum_{j=1}^{n}\frac{\left\|x_{i^{*}j}^{\top}x_{i^{*}j}\right\|^{2}}{n^{2}}
\le\sum_{j=1}^{n}\frac{1}{p_{i^{*}j}}\frac{\left\|x_{i^{*}j}^{\top}x_{i^{*}j}\right\|^{2}}{n^{2}}\\
&\quad\le\max_{j=1,\ldots,n}\frac{1}{np_{i^{*}j}}\cdot\frac{1}{n}\sum_{j=1}^{n}\left\|x_{i^{*}j}^{\top}x_{i^{*}j}\right\|^{2}
\le\max_{j=1,\ldots,n}\frac{1}{np_{i^{*}j}}\cdot\frac{1}{n}\sum_{j=1}^{n}\left\|x_{i^{*}j}\right\|^{4}
=O_{p}(r^{-1})\to 0.
\end{align*}
Then, we obtain the conclusion
\[
\frac{1}{n}\sum_{j=1}^{n}\frac{\delta_{i^{*}j}}{p_{i^{*}j}}x_{i^{*}j}^{\top}x_{i^{*}j}
=\frac{1}{n}\sum_{j=1}^{n}x_{i^{*}j}^{\top}x_{i^{*}j}+o_{p}(1)
\]
by using Chebyshev's inequality. □
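The unbiasedness and vanishing variance behind Lemma A1 can be checked numerically in the scalar case; the script below is an illustrative Monte Carlo check (not part of the proof), using an arbitrary valid set of Poisson sampling probabilities.

```python
import random

# Monte Carlo check of the identity behind Lemma A1 in the scalar case:
# conditionally on the data, the inverse-probability-weighted sum
# (1/n) sum_j (delta_j / p_j) x_j^2 is unbiased for (1/n) sum_j x_j^2.
rng = random.Random(7)
n, r = 1000, 300
x = [rng.gauss(0, 1) for _ in range(n)]
total = sum(abs(v) for v in x)
p = [min(1.0, r * abs(xi) / total) for xi in x]  # any valid probabilities work

target = sum(xi * xi for xi in x) / n
reps = 1000
est = []
for _ in range(reps):
    s = sum(xi * xi / pi for xi, pi in zip(x, p) if rng.random() < pi)
    est.append(s / n)

mean_est = sum(est) / reps
print(abs(mean_est - target) < 0.05 * target)  # close to the full-data value
```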
Lemma A2.
Under Assumptions 2–5, as r and n ,
\[ R^{*}(\hat{\beta}_{N})\rightarrow N(0,V_{c}) \]
in distribution, where
\[
V_{c}=\frac{4}{n^{2}}\sum_{j=1}^{n}\frac{\left(y_{i^{*}j}-\hat{\beta}_{N}x_{i^{*}j}-\hat{\mu}_{i^{*}}\right)^{2}x_{i^{*}j}^{\top}x_{i^{*}j}}{p_{i^{*}j}}
-\frac{4}{n^{2}}\sum_{j=1}^{n}\left(y_{i^{*}j}-\hat{\beta}_{N}x_{i^{*}j}-\hat{\mu}_{i^{*}}\right)^{2}x_{i^{*}j}^{\top}x_{i^{*}j}.
\]
Proof. 
To prove the conclusion of the lemma, we use the Lindeberg–Feller central limit theorem; thus, we only need to verify its conditions.
Let
\[
r_{j}=-\frac{2}{n}\frac{\delta_{i^{*}j}}{p_{i^{*}j}}\left(y_{i^{*}j}-\hat{\beta}_{N}x_{i^{*}j}-\hat{\mu}_{i^{*}}\right)x_{i^{*}j}.
\]
It is obvious that
\[ R^{*}(\hat{\beta}_{N})=\sum_{j=1}^{n}r_{j}. \]
By direct calculation, we can obtain
\[
E[r_{j}]=-\frac{2}{n}E\left[\frac{\delta_{i^{*}j}}{p_{i^{*}j}}\left(y_{i^{*}j}-\hat{\beta}_{N}x_{i^{*}j}-\hat{\mu}_{i^{*}}\right)x_{i^{*}j}\right]
=-\frac{2}{n}E\left[\frac{\delta_{i^{*}j}}{p_{i^{*}j}}\right]E\left[y_{i^{*}j}-\hat{\beta}_{N}x_{i^{*}j}-\hat{\mu}_{i^{*}}\right]x_{i^{*}j}=0,
\]
\[
\sum_{j=1}^{n}\mathrm{var}(r_{j})
=\frac{4}{n^{2}}\sum_{j=1}^{n}\frac{\left(y_{i^{*}j}-\hat{\beta}_{N}x_{i^{*}j}-\hat{\mu}_{i^{*}}\right)^{2}x_{i^{*}j}^{\top}x_{i^{*}j}}{p_{i^{*}j}^{2}}\mathrm{var}(\delta_{i^{*}j})
=\frac{4}{n^{2}}\sum_{j=1}^{n}\frac{\left(y_{i^{*}j}-\hat{\beta}_{N}x_{i^{*}j}-\hat{\mu}_{i^{*}}\right)^{2}x_{i^{*}j}^{\top}x_{i^{*}j}}{p_{i^{*}j}^{2}}p_{i^{*}j}(1-p_{i^{*}j})=V_{c}.
\]
And for any ε > 0 ,
\begin{align*}
\sum_{j=1}^{n}E\left[\|r_{j}\|^{2}1_{\{\|r_{j}\|>\varepsilon\}}\right]
&\le\frac{1}{\varepsilon}\sum_{j=1}^{n}E\|r_{j}\|^{3}
=\frac{8}{n^{3}}\frac{1}{\varepsilon}\sum_{j=1}^{n}\frac{\left|y_{i^{*}j}-\hat{\beta}_{N}x_{i^{*}j}-\hat{\mu}_{i^{*}}\right|^{3}\left\|x_{i^{*}j}\right\|^{3}}{p_{i^{*}j}^{2}}\\
&\le\frac{8}{\varepsilon}\max_{j=1,\ldots,n}\left(\frac{1}{np_{i^{*}j}}\right)^{2}\cdot\frac{1}{n}\sum_{j=1}^{n}\left|y_{i^{*}j}-\hat{\beta}_{N}x_{i^{*}j}-\hat{\mu}_{i^{*}}\right|^{3}\left\|x_{i^{*}j}\right\|^{3}.
\end{align*}
According to Assumptions 3 and 4, we know that
\[
\frac{1}{n}\sum_{j=1}^{n}\left|y_{i^{*}j}-\hat{\beta}_{N}x_{i^{*}j}-\hat{\mu}_{i^{*}}\right|^{3}\left\|x_{i^{*}j}\right\|^{3}<\infty.
\]
Combining Assumption 5, we can obtain a bound for (A1), namely,
\[
\sum_{j=1}^{n}E\left[\|r_{j}\|^{2}1_{\{\|r_{j}\|>\varepsilon\}}\right]\le\frac{1}{\varepsilon}O_{p}(r^{-2})\to 0.
\]
After verifying the conditions, according to the Lindeberg–Feller central limit theorem in [19], we obtain the conclusion of this lemma. □
Proof of Theorem 3.
Since the data in each group are independent and identically distributed, we have
\[
\frac{1}{n}\sum_{j=1}^{n}\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{2}
=\frac{1}{n}\sum_{j=1}^{n}\left(\varepsilon_{ij}-\hat{\mu}_{i}\right)^{2}=\sigma_{i}^{2}+\delta_{n},
\]
where \(\delta_{n}\) is of order \(O_{p}(1/\sqrt{n})\). Note that \(\sigma_{i^{*}}^{2}>\sigma_{i}^{2}\) for all \(i\ne i^{*}\).
The above two results lead to
\[
\max_{1\le i\le k}\frac{1}{n}\sum_{j=1}^{n}\left(\varepsilon_{ij}-\hat{\mu}_{i}\right)^{2}=\sigma_{i^{*}}^{2}+O_{p}(1/\sqrt{n}).
\]
Consequently, when n is large enough,
\[
\max_{1\le i\le k}\frac{1}{n}\sum_{j=1}^{n}\left(\varepsilon_{ij}-\hat{\mu}_{i}\right)^{2}
=\frac{1}{n}\sum_{j=1}^{n}\left(\varepsilon_{i^{*}j}-\hat{\mu}_{i^{*}}\right)^{2}+O_{p}(1/\sqrt{n}).
\]
Then,
\[
\max_{1\le i\le k}\frac{1}{n}\sum_{j=1}^{n}\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{2}
=\frac{1}{n}\sum_{j=1}^{n}\left(y_{i^{*}j}-\beta x_{i^{*}j}-\hat{\mu}_{i^{*}}\right)^{2}+O_{p}(1/\sqrt{n}).
\]
We denote the true values of \(\beta\) and \(\mu_{i^{*}}\) as \(\beta_{0}\) and \(\mu_{i^{*}}^{0}\), respectively, and let \(\beta\) and \(\hat{\mu}_{i^{*}}\) satisfy \(\|\beta-\beta_{0}\|=O(1/\sqrt{n})\) and \(|\hat{\mu}_{i^{*}}-\mu_{i^{*}}^{0}|=O(1/\sqrt{n})\). Then,
\begin{align*}
\max_{1\le i\le k}\frac{1}{n}\sum_{j=1}^{n}\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{2}
&=\max_{1\le i\le k}\frac{1}{n}\sum_{j=1}^{n}\left(\varepsilon_{ij}-\mu_{i}^{0}-(\beta-\beta_{0})x_{ij}-(\hat{\mu}_{i}-\mu_{i}^{0})\right)^{2}\\
&=\frac{1}{n}\sum_{j=1}^{n}\Big[\left(\varepsilon_{i^{*}j}-\mu_{i^{*}}^{0}\right)^{2}
-2\left((\beta-\beta_{0})x_{i^{*}j}+\hat{\mu}_{i^{*}}-\mu_{i^{*}}^{0}\right)\left(\varepsilon_{i^{*}j}-\mu_{i^{*}}^{0}\right)\\
&\qquad\quad+\left((\beta-\beta_{0})x_{i^{*}j}+\hat{\mu}_{i^{*}}-\mu_{i^{*}}^{0}\right)^{2}\Big]+O_{p}(1/\sqrt{n})\\
&=\frac{1}{n}\sum_{j=1}^{n}\left(\varepsilon_{i^{*}j}-\mu_{i^{*}}^{0}\right)^{2}
-\frac{2}{n}\sum_{j=1}^{n}\left((\beta-\beta_{0})x_{i^{*}j}+\hat{\mu}_{i^{*}}-\mu_{i^{*}}^{0}\right)\left(\varepsilon_{i^{*}j}-\mu_{i^{*}}^{0}\right)\\
&\qquad+\frac{1}{n}\sum_{j=1}^{n}\left((\beta-\beta_{0})x_{i^{*}j}+\hat{\mu}_{i^{*}}-\mu_{i^{*}}^{0}\right)^{2}+O_{p}(1/\sqrt{n}).
\end{align*}
Note that the differences \(\varepsilon_{i^{*}j}-\mu_{i^{*}}^{0}\) are independent and identically distributed with mean zero and variance \(\sigma_{i^{*}}^{2}\). Additionally, both \(\frac{1}{n}\sum_{j=1}^{n}\big((\beta-\beta_{0})x_{i^{*}j}+\hat{\mu}_{i^{*}}-\mu_{i^{*}}^{0}\big)\big(\varepsilon_{i^{*}j}-\mu_{i^{*}}^{0}\big)\) and \(\frac{1}{n}\sum_{j=1}^{n}\big((\beta-\beta_{0})x_{i^{*}j}+\hat{\mu}_{i^{*}}-\mu_{i^{*}}^{0}\big)^{2}\) are bounded in probability.
On the other hand, we know that \(\varepsilon_{i^{*}j}\) and \(\delta_{n}\) are independent of \(\beta\). Thus, by the construction of \(\hat{\beta}_{N}\), minimizing \(\max_{1\le i\le k}\frac{1}{n}\sum_{j=1}^{n}(y_{ij}-\beta x_{ij}-\hat{\mu}_{i})^{2}\) is equivalent to minimizing
\[
\sum_{j=1}^{n}\left[-2\left((\beta-\beta_{0})x_{i^{*}j}+\hat{\mu}_{i^{*}}-\mu_{i^{*}}^{0}\right)\left(\varepsilon_{i^{*}j}-\mu_{i^{*}}^{0}\right)
+\left((\beta-\beta_{0})x_{i^{*}j}+\hat{\mu}_{i^{*}}-\mu_{i^{*}}^{0}\right)^{2}\right].
\]
We rewrite this objective function as
\[
Z_{n}(\gamma)=\sum_{j=1}^{n}\left[-\frac{2\left(\varepsilon_{i^{*}j}-\mu_{i^{*}}^{0}\right)}{\sqrt{n}}x_{i^{*}j}\gamma
+\frac{1}{n}\gamma^{\top}x_{i^{*}j}^{\top}x_{i^{*}j}\gamma\right],
\]
where \(Z_{n}(\gamma)\) is convex and minimized at \(\hat{\gamma}_{n}=\sqrt{n}(\hat{\beta}_{N}-\beta_{0})\). According to Assumption 1 and the Lindeberg–Feller central limit theorem in [19], we obtain
\[
Z_{n}(\gamma)\stackrel{d}{\to}Z_{0}(\gamma)=-2W^{\top}\gamma+\gamma^{\top}E\left(x^{\top}x\right)\gamma,
\]
where \(W\sim N\left(0,\sigma_{i^{*}}^{2}E\left(x^{\top}x\right)\right)\). Minimizing \(Z_{n}(\gamma)\) and rearranging yields the conclusion of the theorem:
\[
\sqrt{n}\left(\hat{\beta}_{N}-\beta_{0}\right)\stackrel{d}{\to}N\left(0,\sigma_{i^{*}}^{2}\left(E\left(x^{\top}x\right)\right)^{-1}\right). \qquad\square
\]
Proof of Theorem 4.
Let \(\mathcal{F}=\{x_{ij},y_{ij}\}_{i=1,j=1}^{k,n}\). Direct calculation shows that, conditionally on \(\mathcal{F}\),
\[
E\left[\frac{1}{n}\sum_{j=1}^{n}\frac{\delta_{ij}}{p_{ij}}\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{2}\Big|\,\mathcal{F}\right]
=\frac{1}{n}\sum_{j=1}^{n}\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{2},
\]
and
\begin{align*}
&E\left[\left(\frac{1}{n}\sum_{j=1}^{n}\frac{\delta_{ij}}{p_{ij}}\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{2}
-\frac{1}{n}\sum_{j=1}^{n}\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{2}\right)^{2}\Big|\,\mathcal{F}\right]
=\sum_{j=1}^{n}\frac{p_{ij}(1-p_{ij})}{p_{ij}^{2}}\frac{\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{4}}{n^{2}}\\
&\quad=\sum_{j=1}^{n}\frac{1}{p_{ij}}\frac{\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{4}}{n^{2}}
-\sum_{j=1}^{n}\frac{\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{4}}{n^{2}}
\le\sum_{j=1}^{n}\frac{1}{p_{ij}}\frac{\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{4}}{n^{2}}\\
&\quad\le\max_{j=1,\ldots,n}\frac{1}{np_{ij}}\cdot\frac{1}{n}\sum_{j=1}^{n}\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{4}
=O_{p}(r^{-1})\to 0.
\end{align*}
Then, we obtain the conclusion
\[
\frac{1}{n}\sum_{j=1}^{n}\frac{\delta_{ij}}{p_{ij}}\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{2}
=\frac{1}{n}\sum_{j=1}^{n}\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{2}+o_{p}(1),
\]
by using Chebyshev’s inequality.
Letting i range over 1 to k and taking the maximum, we obtain
\[
Q_{1}^{*}(\beta)=Q_{1}(\beta)+o_{p}(1). \qquad\square
\]
Proof of Theorem 5.
We divided the data into k groups, and the data within each group share the same distribution. So, we have
\[
\frac{1}{n}\sum_{j=1}^{n}\left(\varepsilon_{ij}-\hat{\mu}_{i}\right)^{2}=\sigma_{i}^{2}+\delta_{n},
\]
where \(\delta_{n}\) is of order \(O_{p}(1/\sqrt{n})\). It should be noted that \(\sigma_{i^{*}}^{2}\) exceeds \(\sigma_{i}^{2}\) for any \(i\ne i^{*}\). These two findings imply that
\[
\max_{1\le i\le k}\frac{1}{n}\sum_{j=1}^{n}\left(\varepsilon_{ij}-\hat{\mu}_{i}\right)^{2}=\sigma_{i^{*}}^{2}+O_{p}(1/\sqrt{n}).
\]
Consequently, when n is large enough,
\[
\max_{1\le i\le k}\frac{1}{n}\sum_{j=1}^{n}\left(\varepsilon_{ij}-\hat{\mu}_{i}\right)^{2}
=\frac{1}{n}\sum_{j=1}^{n}\left(\varepsilon_{i^{*}j}-\hat{\mu}_{i^{*}}\right)^{2}+O_{p}(1/\sqrt{n}).
\]
Then,
\[
\max_{1\le i\le k}\frac{1}{n}\sum_{j=1}^{n}\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{2}
=\frac{1}{n}\sum_{j=1}^{n}\left(y_{i^{*}j}-\beta x_{i^{*}j}-\hat{\mu}_{i^{*}}\right)^{2}+O_{p}(1/\sqrt{n}).
\]
Using the results obtained above in the sampling estimation function Q 1 * ( β ) , we obtain
\[
Q_{1}^{*}(\beta)=\max_{1\le i\le k}\frac{1}{n}\sum_{j=1}^{n}\frac{\delta_{ij}}{p_{ij}}\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{2}
=\frac{1}{n}\sum_{j=1}^{n}\frac{\delta_{i^{*}j}}{p_{i^{*}j}}\left(y_{i^{*}j}-\beta x_{i^{*}j}-\hat{\mu}_{i^{*}}\right)^{2}+O_{p}(1/\sqrt{r}).
\]
Similarly, we have
\[
Q_{1}(\beta)=\max_{1\le i\le k}\frac{1}{n}\sum_{j=1}^{n}\left(y_{ij}-\beta x_{ij}-\hat{\mu}_{i}\right)^{2}
=\frac{1}{n}\sum_{j=1}^{n}\left(y_{i^{*}j}-\beta x_{i^{*}j}-\hat{\mu}_{i^{*}}\right)^{2}+O_{p}(1/\sqrt{n}).
\]
Therefore, \(Q_{1}^{*}(\beta)\) and \(Q_{1}(\beta)\) are continuous and differentiable near the true value \(\beta_{0}\) of \(\beta\). Taking the derivatives of \(Q_{1}^{*}(\beta)\) and \(Q_{1}(\beta)\) yields \(R^{*}(\beta)\) and \(R(\beta)\):
\[
R^{*}(\beta)=\frac{dQ_{1}^{*}(\beta)}{d\beta}
=-\frac{2}{n}\sum_{j=1}^{n}\frac{\delta_{i^{*}j}}{p_{i^{*}j}}\left(y_{i^{*}j}-\beta x_{i^{*}j}-\hat{\mu}_{i^{*}}\right)x_{i^{*}j},
\]
\[
R(\beta)=\frac{dQ_{1}(\beta)}{d\beta}
=-\frac{2}{n}\sum_{j=1}^{n}\left(y_{i^{*}j}-\beta x_{i^{*}j}-\hat{\mu}_{i^{*}}\right)x_{i^{*}j}.
\]
We obtain the estimator \(\tilde{\beta}\) by minimizing \(Q_{1}^{*}(\beta)\); equivalently, \(\tilde{\beta}\) solves \(R^{*}(\beta)=0\):
\[
R^{*}(\tilde{\beta})=-\frac{2}{n}\sum_{j=1}^{n}\frac{\delta_{i^{*}j}}{p_{i^{*}j}}\left(y_{i^{*}j}-\tilde{\beta}x_{i^{*}j}-\hat{\mu}_{i^{*}}\right)x_{i^{*}j}=0.
\]
We write
\[
R^{*}(\hat{\beta}_{N})=-\frac{2}{n}\sum_{j=1}^{n}\frac{\delta_{i^{*}j}}{p_{i^{*}j}}\left(y_{i^{*}j}-\hat{\beta}_{N}x_{i^{*}j}-\hat{\mu}_{i^{*}}\right)x_{i^{*}j}.
\]
Subtracting \(R^{*}(\hat{\beta}_{N})\) from \(R^{*}(\tilde{\beta})\), we obtain
\[
R^{*}(\tilde{\beta})=R^{*}(\hat{\beta}_{N})+\frac{2}{n}\sum_{j=1}^{n}\frac{\delta_{i^{*}j}}{p_{i^{*}j}}x_{i^{*}j}^{\top}x_{i^{*}j}\left(\tilde{\beta}-\hat{\beta}_{N}\right).
\]
By using Lemma A1, we have
\[
\frac{1}{n}\sum_{j=1}^{n}\frac{\delta_{i^{*}j}}{p_{i^{*}j}}x_{i^{*}j}^{\top}x_{i^{*}j}
=\frac{1}{n}\sum_{j=1}^{n}x_{i^{*}j}^{\top}x_{i^{*}j}+o_{p}(1).
\]
Then, we can obtain the expression for \(\tilde{\beta}-\hat{\beta}_{N}\) as follows:
\[
\tilde{\beta}-\hat{\beta}_{N}=-\Sigma_{i^{*}}^{-1}R^{*}(\hat{\beta}_{N})+o_{p}\left(\left\|\tilde{\beta}-\hat{\beta}_{N}\right\|\right).
\]
According to Lemma A2, \(R^{*}(\hat{\beta}_{N})\) is asymptotically normal; that is,
\[
R^{*}(\hat{\beta}_{N})\rightarrow N(0,V_{c}).
\]
After a simple calculation, we obtain
\[
V^{-1/2}\left(\tilde{\beta}-\hat{\beta}_{N}\right)\stackrel{d}{\to}N(0,I).
\]
This is the result of Theorem 5. □
Proof of Methods 1 and 2.
We write \(d_{ij}\) for \(d_{ij}^{V}\) for simplicity. Generally, we assume that all \(d_{ij}\) are positive; if some \(d_{ij}\) equal zero, we can assign zero subsampling probabilities to the corresponding pairs and restrict attention to the subsampling probabilities of the remaining pairs.
To achieve the minimum asymptotic mean squared error, namely \(\mathrm{tr}(V)\) as stated in Theorem 5, we solve the following optimization problem to derive the desired method:
\[
\min\ \tilde{H}=\sum_{j=1}^{n}\frac{1}{p_{ij}}\left(y_{ij}-\hat{\beta}_{N}x_{ij}-\hat{\mu}_{i}\right)^{2}\left\|\Sigma_{i}^{-1}x_{ij}\right\|_{2}^{2}
\quad\text{s.t.}\quad\sum_{j=1}^{n}p_{ij}=r,\quad 0<p_{ij}\le 1\ \text{for}\ j=1,\ldots,n.
\]
Without loss of generality, we further assume that \(d_{i1}\le d_{i2}\le\cdots\le d_{in}\).
From the Cauchy–Schwarz inequality,
\[
\tilde{H}=\sum_{j=1}^{n}\frac{1}{p_{ij}}\left(y_{ij}-\hat{\beta}_{N}x_{ij}-\hat{\mu}_{i}\right)^{2}\left\|\Sigma_{i}^{-1}x_{ij}\right\|_{2}^{2}
=\frac{1}{r}\left(\sum_{j=1}^{n}p_{ij}\right)\left(\sum_{j=1}^{n}p_{ij}^{-1}d_{ij}^{2}\right)
\ge\frac{1}{r}\left(\sum_{j=1}^{n}d_{ij}\right)^{2},
\]
where the equality holds if and only if \(p_{ij}\propto d_{ij}\).
When \(p_{ij}=r d_{ij}/(\sum_{j=1}^{n}d_{ij})\) satisfies \(p_{ij}\le 1\) for all \(j=1,\ldots,n\), these \(p_{ij}\) give the optimal solution.
Otherwise, we can easily see that \(p_{in}=1\) when \(r d_{in}/(\sum_{j=1}^{n}d_{ij})>1\). Thus, the original problem turns into the following optimization problem:
\[
\min\ \tilde{H}=\sum_{j=1}^{n-1}\frac{1}{p_{ij}}\left(y_{ij}-\hat{\beta}_{N}x_{ij}-\hat{\mu}_{i}\right)^{2}\left\|\Sigma_{i}^{-1}x_{ij}\right\|_{2}^{2}
\quad\text{s.t.}\quad\sum_{j=1}^{n-1}p_{ij}=r-1,\quad 0<p_{ij}\le 1\ \text{for}\ j=1,\ldots,n-1.
\]
This is a recursive problem. We repeat the step above until all remaining \(p_{ij}\le 1\), which happens after s steps, where
\[
s=\min\left\{t:0\le t\le r,\ (r-t)\,d_{i(n-t)}\le\sum_{j=1}^{n-t}d_{i(j)}\right\}.
\]
Then, we obtain the optimal solution
\[
p_{ij}=\frac{(r-s)\,d_{ij}}{\sum_{k=1}^{n-s}d_{ik}},\quad 1\le j\le n-s,
\]
and
\[
p_{ij}=1,\quad n-s+1\le j\le n.
\]
The derivation for Method 2 is similar to that for Method 1, so we omit the details. □
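The capped, proportional allocation that the argument above arrives at (assign \(p_{ij}\propto d_{ij}\), cap at 1, and re-allocate the remaining budget) can be implemented directly; the function below is an illustrative sketch with a made-up name, covering both Methods 1 and 2 since they differ only in the scores \(d_{ij}\).

```python
def capped_probabilities(d, r):
    # Capped proportional allocation (a sketch): assign p_j proportional to the
    # score d_j, cap the largest at 1, and re-allocate the remaining budget to
    # the rest, so that sum(p) == r and 0 <= p_j <= 1 for every j.
    order = sorted(range(len(d)), key=lambda j: -d[j])  # largest score first
    p = [0.0] * len(d)
    budget, idx = float(r), order[:]
    while idx:
        total = sum(d[j] for j in idx)
        top = idx[0]
        if budget * d[top] / total <= 1.0:
            for j in idx:            # proportional allocation now feasible
                p[j] = budget * d[j] / total
            break
        p[top] = 1.0                 # cap the largest score and recurse on the rest
        budget -= 1.0
        idx = idx[1:]
    return p

d = [10.0, 5.0, 1.0, 1.0, 1.0, 1.0, 1.0]
p = capped_probabilities(d, r=3)
print(round(sum(p), 10), max(p) <= 1.0)  # expected size r, all probabilities <= 1
```

Here the largest score is capped at 1, and the leftover budget of 2 is split proportionally among the remaining scores, exactly as in the recursion defining s.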

References

  1. Huber, P.J. Robust Statistics; John Wiley & Sons: Hoboken, NJ, USA, 1981. [Google Scholar]
  2. Walley, P. Statistical Reasoning with Imprecise Probabilities; Chapman and Hall: London, UK; New York, NY, USA, 1991. [Google Scholar]
  3. Peng, S. Backward SDE and related g-expectations. In Backward Stochastic Differential Equations, Pitman Research Notes in Math. Series, No. 364; El Karoui, M., Ed.; Longman: Harlow, UK, 1997; pp. 141–159. [Google Scholar]
  4. Peng, S. G-Expectation, G-Brownian Motion and Related Stochastic Calculus of Itôs type. In Stochastic Analysis and Applications: The Abel Symposium 2005; Springer: Berlin/Heidelberg, Germany, 2007; pp. 541–567. [Google Scholar]
  5. Peng, S. Multi-dimensional G-Brownian motion and related stochastic calculus under G-expectation. Stoch. Process. Their Appl. 2008, 118, 2223–2253. [Google Scholar] [CrossRef]
  6. Peng, S. Survey on normal distributions, central limit theorem, Brownian motion and the related stochastic calculus under sublinear expectations. Sci. China Ser. A Math. 2009, 52, 1391–1411. [Google Scholar] [CrossRef]
  7. Lin, L.; Shi, Y.; Wang, X.; Yang, S. k-sample upper expectation linear regression-Modeling, identifiability, estimation and prediction. J. Stat. Plan. Infer. 2016, 170, 15–26. [Google Scholar] [CrossRef]
  8. Lin, L.; Liu, Y.; Lin, C. Mini-max-risk and mini-mean-risk inferences for a partially piecewise regression. Statistics 2017, 51, 745–765. [Google Scholar]
  9. Meinshausen, N.; Buhlmann, P. Maximin effects in inhomogeneous large-scale data. Ann. Stat. 2015, 43, 1801–1830. [Google Scholar] [CrossRef]
  10. Duchi, J.C.; Namkoong, H. Learning Models with Uniform Performance via Distributionally Robust Optimization. Ann. Stat. 2020, 49, 1378–1406. [Google Scholar]
  11. Schifano, E.D.; Wu, J.; Wang, C.; Yan, J.; Chen, M.H. Online updating of statistical inference in the big data setting. Technometrics 2016, 58, 393–403. [Google Scholar] [CrossRef] [PubMed]
  12. Wang, H.; Zhu, R.; Ma, P. Optimal subsampling for large sample logistic regression. J. Am. Stat. Assoc. 2018, 113, 829–844. [Google Scholar] [CrossRef] [PubMed]
  13. Wang, H.; Ma, Y. Optimal subsampling for quantile regression in big data. Biometrika 2021, 108, 99–112. [Google Scholar] [CrossRef]
  14. Xu, G.; Shang, Z.; Cheng, G. Distributed Generalized Cross-Validation for Divide-and-Conquer Kernel Ridge Regression and Its Asymptotic Optimality. J. Comput. Graph. Stat. 2019, 28, 891–908. [Google Scholar]
  15. Ma, P.; Mahoney, M.W.; Yu, B. A statistical perspective on algorithmic leveraging. J. Mach. Learn. Res. 2015, 16, 861–919. [Google Scholar]
  16. Wang, H.Y.; Yang, M.; Stufken, J. Information-based optimal subdata selection for big data linear regression. J. Am. Stat. Assoc. 2019, 114, 393–405. [Google Scholar] [CrossRef]
  17. Yu, J.; Wang, H.; Ai, M.; Zhang, H. Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators with Massive Data. J. Am. Stat. Assoc. 2020, 117, 265–276. [Google Scholar] [CrossRef]
  18. Pukelsheim, F. Optimal Design of Experiments (Classics in Applied Mathematics, 50); Society for Industrial and Applied Mathematics, University City Science Center: Philadelphia, PA, USA, 2006. [Google Scholar]
  19. Van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, UK, 2000; Volume 3. [Google Scholar]
Table 1. The mean and MSE of three methods when r = 100 in experiment 1, with k = 10 and n = 10,000.

Method | Statistics | β1 | β2 | β3
uniform | mean | 1.072 | 1.082 | 1.075
uniform | MSE | 7.25 × 10⁻² | 8.05 × 10⁻² | 7.13 × 10⁻²
A-optimality | mean | 1.050 | 1.052 | 1.044
A-optimality | MSE | 4.54 × 10⁻² | 4.08 × 10⁻² | 3.92 × 10⁻²
L-optimality | mean | 1.043 | 1.052 | 1.056
L-optimality | MSE | 4.57 × 10⁻² | 4.08 × 10⁻² | 4.18 × 10⁻²
Table 2. The mean and MSE of three methods when r = 300 in experiment 1, with k = 10 and n = 10,000.

Method | Statistics | β1 | β2 | β3
uniform | mean | 1.045 | 1.037 | 1.033
uniform | MSE | 2.86 × 10⁻² | 2.54 × 10⁻² | 2.35 × 10⁻²
A-optimality | mean | 1.015 | 1.025 | 1.028
A-optimality | MSE | 1.51 × 10⁻² | 1.29 × 10⁻² | 1.46 × 10⁻²
L-optimality | mean | 1.015 | 1.034 | 1.023
L-optimality | MSE | 1.52 × 10⁻² | 1.53 × 10⁻² | 1.51 × 10⁻²
Table 3. The mean and MSE of three methods when r = 500 in experiment 1, with k = 10 and n = 10,000.

Method | Statistics | β1 | β2 | β3
uniform | mean | 1.025 | 1.027 | 1.028
uniform | MSE | 1.50 × 10⁻² | 1.66 × 10⁻² | 1.43 × 10⁻²
A-optimality | mean | 1.014 | 1.028 | 1.026
A-optimality | MSE | 8.51 × 10⁻³ | 9.68 × 10⁻³ | 8.96 × 10⁻³
L-optimality | mean | 1.009 | 1.024 | 1.023
L-optimality | MSE | 8.67 × 10⁻³ | 9.19 × 10⁻³ | 8.59 × 10⁻³
Table 4. The mean and MSE of three methods when r = 100 in experiment 1, with k = 50 and n = 2000.

Method | Statistics | β1 | β2 | β3
uniform | mean | 1.146 | 1.034 | 1.112
uniform | MSE | 1.54 × 10⁻¹ | 1.40 × 10⁻¹ | 1.46 × 10⁻¹
A-optimality | mean | 1.089 | 1.030 | 1.104
A-optimality | MSE | 8.02 × 10⁻² | 7.87 × 10⁻² | 9.23 × 10⁻²
L-optimality | mean | 1.091 | 1.009 | 1.104
L-optimality | MSE | 8.23 × 10⁻² | 7.99 × 10⁻² | 9.34 × 10⁻²
Table 5. The mean and MSE of three methods when r = 300 in experiment 1, with k = 50 and n = 2000.

Method | Statistics | β1 | β2 | β3
uniform | mean | 1.061 | 1.008 | 1.064
uniform | MSE | 5.18 × 10⁻² | 4.14 × 10⁻² | 4.65 × 10⁻²
A-optimality | mean | 1.040 | 0.971 | 1.028
A-optimality | MSE | 2.66 × 10⁻² | 2.29 × 10⁻² | 2.06 × 10⁻²
L-optimality | mean | 1.038 | 0.965 | 1.039
L-optimality | MSE | 2.51 × 10⁻² | 2.23 × 10⁻² | 2.43 × 10⁻²
Table 6. The mean and MSE of three methods when r = 500 in experiment 1, with k = 50 and n = 2000.

Method | Statistics | β1 | β2 | β3
uniform | mean | 1.052 | 0.981 | 1.040
uniform | MSE | 2.75 × 10⁻² | 2.61 × 10⁻² | 2.47 × 10⁻²
A-optimality | mean | 1.026 | 0.958 | 1.009
A-optimality | MSE | 1.36 × 10⁻² | 1.29 × 10⁻² | 1.00 × 10⁻²
L-optimality | mean | 1.022 | 0.964 | 1.012
L-optimality | MSE | 1.33 × 10⁻² | 1.47 × 10⁻² | 8.45 × 10⁻³
Table 7. The mean and MSE of three methods when r = 100 in experiment 2, with k = 10 and n = 10,000.

Method | Statistics | β1 | β2 | β3
uniform | mean | 1.086 | 1.063 | 1.072
uniform | MSE | 8.11 × 10⁻² | 6.92 × 10⁻² | 7.30 × 10⁻²
A-optimality | mean | 1.028 | 1.047 | 1.046
A-optimality | MSE | 4.23 × 10⁻² | 4.22 × 10⁻² | 4.57 × 10⁻²
L-optimality | mean | 1.042 | 1.057 | 1.038
L-optimality | MSE | 4.68 × 10⁻² | 4.61 × 10⁻² | 3.88 × 10⁻²
Table 8. The mean and MSE of three methods when r = 300 in experiment 2, with k = 10 and n = 10,000.

Method | Statistics | β1 | β2 | β3
uniform | mean | 1.028 | 1.036 | 1.038
uniform | MSE | 2.61 × 10⁻² | 2.48 × 10⁻² | 2.53 × 10⁻²
A-optimality | mean | 1.008 | 1.017 | 1.030
A-optimality | MSE | 1.44 × 10⁻² | 1.44 × 10⁻² | 1.37 × 10⁻²
L-optimality | mean | 1.015 | 1.021 | 1.028
L-optimality | MSE | 1.40 × 10⁻² | 1.59 × 10⁻² | 1.64 × 10⁻²
Table 9. The mean and MSE of three methods when r = 500 in experiment 2, with k = 10 and n = 10,000.

Method | Statistics | β1 | β2 | β3
uniform | mean | 1.010 | 1.028 | 1.036
uniform | MSE | 1.58 × 10⁻² | 1.60 × 10⁻² | 1.60 × 10⁻²
A-optimality | mean | 1.001 | 1.014 | 1.026
A-optimality | MSE | 8.39 × 10⁻³ | 8.38 × 10⁻³ | 9.72 × 10⁻³
L-optimality | mean | 1.009 | 1.010 | 1.018
L-optimality | MSE | 8.86 × 10⁻³ | 9.90 × 10⁻³ | 7.79 × 10⁻³
Table 10. The mean and MSE of three methods when r = 100 in experiment 2, with k = 50 and n = 2000.

Method | Statistics | β1 | β2 | β3
uniform | mean | 1.125 | 1.055 | 1.116
uniform | MSE | 1.50 × 10⁻³ | 1.37 × 10⁻¹ | 1.53 × 10⁻¹
A-optimality | mean | 1.063 | 1.011 | 1.088
A-optimality | MSE | 7.69 × 10⁻² | 8.01 × 10⁻² | 7.81 × 10⁻²
L-optimality | mean | 1.085 | 1.009 | 1.082
L-optimality | MSE | 8.80 × 10⁻² | 7.02 × 10⁻² | 8.23 × 10⁻²
Table 11. The mean and MSE of three methods when r = 300 in experiment 2, with k = 50 and n = 2000.

Method | Statistics | β1 | β2 | β3
uniform | mean | 1.068 | 1.000 | 1.057
uniform | MSE | 5.44 × 10⁻² | 4.16 × 10⁻² | 4.65 × 10⁻²
A-optimality | mean | 1.032 | 0.975 | 1.024
A-optimality | MSE | 2.65 × 10⁻² | 2.54 × 10⁻² | 2.23 × 10⁻²
L-optimality | mean | 1.028 | 0.973 | 1.035
L-optimality | MSE | 2.89 × 10⁻² | 2.33 × 10⁻² | 2.31 × 10⁻²
Table 12. The mean and MSE of three methods when r = 500 in experiment 2, with k = 50 and n = 2000.

Method | Statistics | β1 | β2 | β3
uniform | mean | 1.035 | 0.974 | 1.028
uniform | MSE | 2.98 × 10⁻² | 2.34 × 10⁻² | 2.67 × 10⁻²
A-optimality | mean | 1.013 | 0.961 | 1.016
A-optimality | MSE | 1.34 × 10⁻² | 1.42 × 10⁻² | 9.99 × 10⁻³
L-optimality | mean | 1.018 | 0.962 | 1.011
L-optimality | MSE | 1.46 × 10⁻² | 1.61 × 10⁻² | 9.50 × 10⁻³
Table 13. The computational time and MPE of three methods in the mean certainty model when r = 50.

Method | Uniform | A-Optimality | L-Optimality
computational time | 0.67 s | 4.05 s | 3.13 s
MPE | 0.737 | 0.630 | 0.632
Table 14. The computational time and MPE of three methods in the mean–variance heterogeneity model when r = 50.

Method | Uniform | A-Optimality | L-Optimality
computational time | 0.68 s | 4.03 s | 3.10 s
MPE | 0.676 | 0.513 | 0.550

Share and Cite

MDPI and ACS Style

Liu, Z. Optimal Subsampling for Upper Expectation Parametric Regression. Mathematics 2025, 13, 1133. https://doi.org/10.3390/math13071133


