1. Introduction
Big data usually refers to datasets that are large in different respects: there can be many observations, many variables, or both. Big data presents new challenges for statistical methods and inference. The sheer volume may make it impossible to store and process the data on a single machine. In addition, the data may come from different sources and exhibit inhomogeneities, which can cause traditional statistical methods to fail. Specifically, the question we are interested in is whether it is possible to extract a computationally feasible model suitable for data from different times or different sources, or, more generally, data with different underlying distributions. Our goal is to obtain a robust parameter estimate and a prediction based on the expected upper limit.
Almost all classical statistical models rely on various assumptions, the most crucial being that the model in question possesses a probability distribution, whether known or unknown. Classical linear expectation and the statistics derived from it hinge on the certainty of this distribution or model. In cases of model heterogeneity, however, traditional statistical methods may become inapplicable. For instance, the classical maximum likelihood estimator may not exist or may not be uniquely determined, owing to the absence of a definitive likelihood function. In addition, classical least squares estimation is invalid because the parameters are defined through linear expectations. Moreover, classic statistical models, such as linear regression models, may not even be well defined, since their identifiability hinges on mean certainty, and the regression function becomes unidentifiable in its absence. Consequently, to attain the objective of statistical inference, it is imperative to devise a novel statistical framework and corresponding methodologies.
In the absence of certainty in distribution, the expectations derived are often nonlinear. Early studies on nonlinear expectations can be traced back to [
1] in the realm of robust statistics and [
2] in the field of imprecise probability. Over recent decades, theories and methods related to nonlinear expectations have seen substantial development and have gained recognition in application areas such as financial risk measurement and control. Ref. [
3] presented a notable instance of nonlinear expectation within the context of backward stochastic differential equations, termed g-expectation. As an extension, ref. [
4] introduced G-expectation and its related forms. Within the framework of nonlinear expectations, the most prevalent distribution is the so-called G-normal distribution, which was first put forth by [
4]. Moreover, refs. [
5,
6] established the law of large numbers and the central limit theorem, serving as the theoretical cornerstone for nonlinear expectations.
Addressing statistical problems arising from distribution heterogeneity, ref. [
7] examined
k-diverse distributions and introduced the upper expectation regression model. Subsequently, ref. [
8] advanced a mini-max risk and mini-mean risk regression approach within the context of distribution heterogeneity. Furthermore, ref. [
9] introduced the notion of “maximin effects” along with a suitable estimator, evaluating its predictive accuracy from a theoretical perspective in mixture models with either known or unknown group structures. Then, ref. [
10] focused on learning models that ensure uniform performance through distributionally robust optimization, incorporating considerations of the worst-case distribution and tail effects.
On the other hand, in the big data era, the rapid proliferation of data introduces fresh obstacles for numerous traditional statistical problems. Foremost among these is that standard single-machine data storage and analysis techniques become impractical. To address this issue, a multitude of statistical and computational methodologies have been devised. The main strategies include subsampling, divide-and-conquer, and online updating [
11,
12,
13,
14]. In this paper, we primarily consider the method of subsampling.
A central concept of the subsampling approach is to employ nonuniform sampling probabilities, ensuring that data points with higher information content are more likely to be selected. A notable method in this regard is the leverage-score-based subsampling introduced by [
15]. Subsequently, ref. [
16] suggested an information-driven optimal subsample selection technique specifically for linear models. This technique avoids random sampling and instead selects subsamples deterministically for statistical analysis. Additionally, ref. [
17] derived the optimal Poisson subsampling probability for quasi-likelihood estimation and devised a distributed optimal subsampling strategy.
The main contribution of this paper is to improve and develop the upper expectation regression method within the framework of big data, or under privacy constraints, for models with distribution heterogeneity. Such heterogeneity is common in practice because of differences in data sources and environments, which often lead to variations in data distributions, as in the model for the influencing factors of air quality in
Section 5. Upper expectation regression differs from classical regression in that it tends to use larger values to predict the response variable and to attain the mini-max prediction risk. Unlike the method proposed by [
7], we address model heterogeneity by introducing group-specific
values, which allows us to obtain a consistent estimator for
, thereby avoiding the potential inconsistency of the estimator of beta noted in the literature.
Another major contribution is to study the asymptotic theory of mini-max estimates for upper expectation regression under subsampling. We then provide a method to obtain the optimal subsampling probability based on this asymptotic theory. Furthermore, we employ an effective and robust estimation and prediction method, making the sampling more stable and feasible. This is further supported by simulations and real data.
The rest of this paper is organized as follows. The second section briefly reviews the motivation, methods, and theoretical properties of upper expectation regression and presents our improvement of the method. The subsampling method and its asymptotic theory are studied in the third section.
Section 4 provides the selection method and a specific implementation of the optimal subsampling probability. In the fifth section, simulation and real-data examples are given to demonstrate the effectiveness and feasibility of the proposed method. The proofs of the theorems are deferred to
Appendix A.
2. Upper Expectation Regression Model
2.1. Preliminary of Upper Expectation Model
We consider the following linear regression model:
where
Y represents a scalar response variable,
is a
p-dimensional covariate vector, and
is a
p-dimensional vector of unknown parameters. For simplicity, we impose an independence assumption; namely, we require that the conditional expectation of
given
is a constant that does not depend on
. That is,
where
is a constant when
is given.
In the classic regression model, the error terms are often assumed to be independent and identically distributed. In practical applications, however, because data are collected at different times and from different sources, this i.i.d. assumption on the error terms may not hold.
First, we briefly review the k-sample upper expectation linear regression of [
7]. The essential difference between this and the classic regression model is that the error term
has distribution heterogeneity. The possible distribution of the error term forms a set
where
k is the number of distinct distributions, assumed finite.
Under the framework of sublinear expectations, the distribution of
can be defined as
Subsequently, we express the conditional expectations as follows:
Given these definitions, we introduce the concept of upper expectation regression:
Let be a sample in model (1), where , meaning that the data are divided into k groups. We assume that samples in different groups have different distributions and that samples within the same group share the same distribution. For simplicity, we assume that the groups are of equal size, i.e., . In practice the group sizes may differ, but when they do not differ greatly, essentially the same theoretical results hold.
Since the upper expectation loss cannot be computed directly, we use its empirical version. Specifically, the empirical version of the upper expectation loss is
By minimizing the upper expectation loss
, we can obtain the estimator of
, and we call it the mini-max estimator of
.
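To make the mini-max estimation concrete, here is a minimal numerical sketch (our own illustrative code, not the authors' implementation): the empirical upper expectation loss is taken to be the maximum of the per-group mean squared losses, and it is minimized by a simple subgradient method that, at each step, moves along the gradient of the currently worst group.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated k-sample data: a common beta, group-specific error variances.
k, n_i, p = 4, 500, 2
beta_true = np.array([1.0, -2.0])
X = [rng.normal(size=(n_i, p)) for _ in range(k)]
y = [X[i] @ beta_true + rng.normal(0.0, 0.3 * (i + 1), size=n_i)
     for i in range(k)]

def group_losses(beta):
    """Per-group mean squared losses; their maximum is the empirical
    upper expectation loss."""
    return np.array([np.mean((y[i] - X[i] @ beta) ** 2) for i in range(k)])

# Subgradient descent on max_i L_i(beta): step along the worst group's gradient.
beta_hat = np.zeros(p)
for t in range(2000):
    i_star = int(np.argmax(group_losses(beta_hat)))
    resid = y[i_star] - X[i_star] @ beta_hat
    beta_hat += (0.05 / (1.0 + 0.01 * t)) * (2.0 / n_i) * X[i_star].T @ resid

print(beta_hat)  # close to beta_true
```

With zero-mean errors the mini-max estimator stays close to the common beta; with group-specific error means it would be biased, which is precisely the issue addressed in Section 2.2.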
We write
and
.
The following theorem gives the asymptotic normality of the mini-max estimator of .
Theorem 1 ([
7])
. We postulate that is a positive definite matrix, with exceeding for all . As n approaches infinity, it follows that
where denotes convergence in distribution, and represents the standard normal distribution. However, the prediction based on
may not align with the upper expectation prediction, as
might not consistently estimate the upper expectation
. Prior to concluding this section, we propose a two-stage estimation approach to develop a consistent estimator for
and subsequently formulate a prediction grounded in upper expectation. Utilizing the consistent estimator
derived previously, the second-stage estimator for
is defined as
Let
and
We present the following theorem based on these definitions.
Theorem 2 ([
7])
. Given the conditions outlined in Theorem 1, if the sequences and are independent for , then the second-stage estimator satisfies the asymptotic distribution:
where is equivalent to for some index within the set . Note that has been simplified to for clarity. The proofs of Theorems 1 and 2 can be found in [
7].
2.2. Improvement of the Method
We first consider the upper expectation loss given in
Section 2.1:
Across different groups of data, the expectation of the error term on the right-hand side of the formula can differ. In the upper expectation loss, however, every group uses the same value. By comparing the losses of the groups, the group selected as having the largest expected loss is likely to be the one whose error-term expectation is farthest from . This is unreasonable to some extent.
To solve this problem, we propose a new upper expectation loss
where
denotes the expectation of the error term of the group
i, and
denotes the expectation under the distribution of the data in the group
i.
Since the value of
is unknown, in order to proceed with the subsequent estimation, we first construct an a priori estimate of
. That is, in each group, we use simple least squares estimation to approximate the value of
by
, which is
In this optimization problem,
is also treated as a parameter. In other words, an estimate of the value of
is produced as well, but we do not need an estimate of
here, so it is not listed.
Then, we give a new empirical version of the upper expectation loss:
By minimizing the upper expectation loss, we can obtain the estimator of . We write it as .
For convenience, we use the same notation as in
Section 2.1. In this situation, we write
and
. The
s that appear later are all defined here.
To better express our method, we summarize it in Algorithm 1.
Algorithm 1 The two-stage estimation process for and .
Require: be a sample in model (1), where ;
Ensure: the estimations and ;
1: for to n do
2:  Use (3) to solve for the parameter estimate ;
3: end for
4: Via minimizing the empirical upper expectation loss (4), obtain the estimation ;
5: return and .
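As an illustration of the two-stage procedure, the sketch below (illustrative code with synthetic data and a plain subgradient solver; not the authors' implementation) fits a per-group intercept by ordinary least squares in stage one and then minimizes the group-centered empirical upper expectation loss over the common slope in stage two.

```python
import numpy as np

rng = np.random.default_rng(1)

# Heterogeneous groups: common slope beta, but group-specific error means mu_i.
k, n_i, p = 5, 400, 2
beta_true = np.array([0.5, 1.5])
mu_true = rng.uniform(-1.0, 1.0, size=k)
X = [rng.normal(size=(n_i, p)) for _ in range(k)]
y = [mu_true[i] + X[i] @ beta_true + rng.normal(0.0, 0.4, size=n_i)
     for i in range(k)]

# Stage 1: per-group OLS with an intercept gives pilot estimates of mu_i.
mu_hat = np.empty(k)
for i in range(k):
    Z = np.column_stack([np.ones(n_i), X[i]])   # intercept column
    coef, *_ = np.linalg.lstsq(Z, y[i], rcond=None)
    mu_hat[i] = coef[0]

# Stage 2: minimize the group-centered upper expectation loss over beta.
def centered_losses(beta):
    return np.array([np.mean((y[i] - mu_hat[i] - X[i] @ beta) ** 2)
                     for i in range(k)])

beta_hat = np.zeros(p)
for t in range(2000):
    i_star = int(np.argmax(centered_losses(beta_hat)))
    resid = y[i_star] - mu_hat[i_star] - X[i_star] @ beta_hat
    beta_hat += (0.05 / (1.0 + 0.01 * t)) * (2.0 / n_i) * X[i_star].T @ resid
```

Because each group is centered by its own pilot estimate, the worst-case group is no longer simply the one whose error mean is farthest from a common value, which is the motivation given above.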
Before presenting the asymptotic results, we first state the following assumption, which primarily concerns data independence and the error distributions.
Assumption 1. There exists an index decomposition such that when , are identically distributed with bounded variance, and
The following theorem gives the asymptotic normality of the new mini-max estimator of .
Theorem 3. Under Assumption 1, and further assuming that is a positive definite matrix with for all , when , we have
where stands for convergence in distribution, and is a classic normal distribution. The proof of the theorem is provided in the
Appendix A.
3. General Poisson Subsampling
For large datasets, the sheer growth in data volume causes great difficulties in computation and storage. To address this problem, we adopt the Poisson subsampling method. We first consider Poisson subsampling in general; that is, we initially leave the probability of each sample being selected unspecified. A concrete sampling scheme and its implementation are then provided in later sections.
3.1. Poisson Subsampling Method
We consider the same datasets as those in
Section 2.1. Then, let
be the probability of selecting the
jth sample point in the
ith group, where
and
. Let
denote the set of observation values and sampling probabilities of the sampled subsample in the
ith group. That is,
where
is a random variable with the Bernoulli distribution. We write
Bernoulli
.
According to the introduction in the previous section, we know that is the empirical version of the upper expectation loss .
Due to the large amount of data, a natural idea is to perform statistical analysis on the sampled subsets. Specifically, by using the obtained sample set, we define a new weighted upper expectation loss as
Because of the independence of
and
, we know the weighted upper expectation loss is equal to the upper expectation loss. That is,
Then, we define the empirical version of the weighted upper expectation loss as
We can solve the parameter
by minimizing the weighted estimation function. It means that
From simple calculations, we obtain
One advantage of Poisson subsampling is that the probability only depends on the dataset . Therefore, the probabilities can be generated block by block, without using all the data at once. In addition, according to the above formula, we only need to calculate the sums of , , , , and in each group. These statistics can be sent to a central machine for computation, without transmitting the raw data of each group, thereby reducing the time and cost of communication.
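A minimal sketch of Poisson subsampling with inverse-probability weighting (the per-point probabilities below are illustrative placeholders proportional to squared residuals at a candidate parameter value, not the optimal probabilities derived in Section 4):

```python
import numpy as np

rng = np.random.default_rng(2)

n = 100_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Per-point "information" scores at a candidate parameter value (placeholder).
resid2 = (y - 2.1 * x) ** 2
# Target an expected subsample size of about 0.01 * n.
pi = np.minimum(1.0, 0.01 * n * resid2 / resid2.sum())

# Poisson subsampling: each point is kept independently, Bernoulli(pi_j).
keep = rng.random(n) < pi

# Weighting each kept loss term by 1/pi_j makes the subsample loss
# unbiased for the full-data loss (E[I_j / pi_j] = 1).
full_loss = resid2.mean()
sub_loss = (resid2[keep] / pi[keep]).sum() / n
```

Only the kept points and their probabilities need to be retained, and the probabilities can be computed block by block, consistent with the communication-efficiency point above.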
3.2. Theoretical Properties
In order to establish asymptotic results of , we need the following assumptions.
Assumption 2. The regression parameter lies in the ball , where C is a constant. This means that and are interior points of Λ.
Assumption 3. We assume
for all . Assumption 4. For every , we have
The size of the subsample is a random variable, and we have . We use r to denote . In this article, we naturally assume that .
Assumption 5. We assume that .
Assumption 2 ensures that a neighborhood of the estimated values
and
has reasonable properties. Assumption 3 gives several moment conditions for the variables. Assumption 4 requires a moment condition on the loss function. Assumptions 3 and 4 are used as the key moment conditions in the proofs of Theorems 4 and 5 in the
Appendix A. Assumption 5 places a restriction on the sampling probability of each point.
First, we give the property of .
Theorem 4. If the Assumptions 2–5 hold, then we have The proof of Theorem 4 is in the
Appendix A. From Theorem 4, we can see that the
we proposed is reasonable.
Then, we give the asymptotic normality of .
Theorem 5. If Assumptions 2–5 hold, then when , , we have
where , and
4. Optimal Subsampling Strategies
In this section, we derive the best subsampling probability to obtain a better estimate of
. After the theoretical derivation, we give a selection method for actual calculations. We mainly draw on the method of [
17].
4.1. Theoretical Method
We derive the optimal subsampling probability from the result of Theorem 5. That is, we minimize the asymptotic mean squared error of
approaching
. This is equivalent to minimizing
. This method is called A-optimality in the language of optimal design (see [
18]).
Method 1 (A-optimality)
. For ease of presentation, define the statistics in the ith group: Let denote the order statistics of . By minimizing the asymptotic mean square error, i.e., minimizing , we obtain the sampling probability of the ith group as
and we have
where
This means that s satisfies
The probabilities can thus be obtained directly by using the covariates and response variables of each group.
We know that the range of sampling probability values is
. Therefore, the sampling probability calculated based on (
6) should be less than or equal to 1. As
increases, the sampling probability gets closer to 1. To ensure that the sampling probability remains less than or equal to 1, we introduce a threshold
s that satisfies (
8). When
exceeds this threshold, we directly set the sampling probability to 1.
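One way to implement this thresholding is an iterative scaling loop, sketched below with synthetic "information scores" (our own illustration): scores are scaled so the probabilities sum to the target expected subsample size r, and any point whose scaled score reaches 1 is assigned probability 1 before the remaining budget is re-scaled.

```python
import numpy as np

def capped_probabilities(scores, r):
    """Scale nonnegative scores into sampling probabilities with expected
    subsample size r, setting p = 1 for points whose scaled score would
    exceed 1 (the threshold rule)."""
    p = np.zeros_like(scores, dtype=float)
    free = np.ones(len(scores), dtype=bool)
    budget = float(r)
    while True:
        scale = budget / scores[free].sum()
        over = free & (scale * scores >= 1.0)
        if not over.any():
            p[free] = scale * scores[free]
            return p
        p[over] = 1.0        # scores above the threshold get probability 1
        budget -= over.sum()
        free &= ~over

rng = np.random.default_rng(3)
g = rng.exponential(size=1000) ** 2   # heavy-tailed information scores
p = capped_probabilities(g, r=200)
```

On exit the probabilities sum to r, points above the final threshold have probability exactly 1, and the ordering of the scores is preserved.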
However, when calculating
, we need to calculate
for
. It takes
time. To further reduce the calculation time, a simple and natural way is to use the matrix
directly, and thus, we do not need to calculate
. In this way, we could change the method by minimizing
to calculate the optimal subsampling probability. This method is called the linear optimality criterion, or L-optimality for short (see [
18]).
Below, we describe in detail the process of obtaining the optimal subsampling probability by minimizing .
Method 2 (L-optimality)
. For ease of presentation, define the statistics in the ith group: Let denote the order statistics of . By minimizing the asymptotic mean square error, i.e., minimizing , we obtain the sampling probability of the ith group as
and we have
where
This means that s satisfies
4.2. Robust Implementation
In the above calculation, serves as the weight in the estimation function. Therefore, we consider the data points that satisfy . The probability of such points being sampled is quite small, but they may still be sampled, and if they are, the estimating equation may be highly sensitive to them. This can cause Methods 1 and 2 to fail in practical applications. To make the estimation more stable, we adopt a more robust sampling scheme to implement Methods 1 and 2 of the previous section.
We use the following subsampling probability:
where
is
or
, and
.
is the convex combination of and the uniform subsampling probability . It inherits the advantages of both. Here, is a preference tuning parameter. A smaller results in a more optimal sampling probability, while a larger yields a more robust sampling probability. When is close to 1, the estimation function is not sensitive to data points, so the estimation obtained becomes more stable. On the other hand, no matter what value takes, the order of is consistent with . Thus, the estimator can still retain the advantages of optimal subsampling.
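Assuming the combination in (11) is the plain convex mixture described here (with the uniform probability chosen to preserve the expected subsample size), a minimal sketch is:

```python
import numpy as np

def robust_probabilities(pi_opt, lam):
    """Convex combination of an 'optimal' probability vector with the uniform
    vector of the same expected subsample size; lam in [0, 1] tunes how far
    the result moves toward uniform (more robust) sampling."""
    n = len(pi_opt)
    r = pi_opt.sum()                  # expected subsample size, preserved
    pi_unif = np.full(n, r / n)
    return (1.0 - lam) * pi_opt + lam * pi_unif

pi_opt = np.array([0.001, 0.01, 0.10, 0.50, 0.389])
pi_rob = robust_probabilities(pi_opt, lam=0.3)
```

The mixture bounds every probability below by lam * r / n, so the inverse-probability weights 1/pi stay bounded, while the ordering of the probabilities, and hence the preference for informative points, is unchanged.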
In practical implementation, the parameters in are unknown. Therefore, we need to estimate the values of to continue the following calculations.
In order to estimate , we use uniform sampling to obtain an a priori estimate. In detail, we first take a uniform subsample in each group for which the subsample size is . Then, we substitute the obtained subsamples into the estimation function to calculate the pilot estimator of . In addition, we also use the in the calculation of the pilot estimator. We write the pilot estimator as .
In addition, the typical situation is that is required in our subsampling setting. To ensure that each , we need to take the inverses of as weights in the weighted upper expectation loss.
At the end of this section, we summarize the estimation methods under the optimal sampling probability as Algorithm 2, paving the way for the simulations and real-data applications in the next section.
Algorithm 2 The robust subsampling method for .
Require: be a sample in model (1), where ;
Ensure: the estimations ;
1: for to n do
2:  Use (3) to solve for the parameter estimate ;
3: end for
4: Use uniform sampling to obtain an a priori estimate based on (5);
5: Use (6) and (7) or (9) and (10) to obtain the subsampling probability or ;
6: Obtain the robust subsampling probability according to (11);
7: Compute the estimation according to (5) with the subsampling probability ;
8: return .
5. Simulation Study and Real Data Analysis
5.1. Simulation Study
In this part, we present several simulation examples. By comparing the performance of the proposed optimal subsampling methods with that of the uniform sampling method, we further illustrate the rationality and effectiveness of our approach.
The simulations below support the following conclusions. Whether the mean is certain or uncertain, the proposed optimal subsampling methods perform significantly better than simple uniform subsampling, and performance improves as the sampling ratio increases.
5.1.1. Experiment 1
In the first simulation experiment, we consider the following linear model:
During the simulation, we examined a scenario where the error term has a known mean but an uncertain variance. For this setup, we selected the regression coefficients as for . Additionally, the predictors were independently and identically distributed as for . The error term was presumed to follow a normal distribution with mean zero and an unknown variance , thereby forming a generalized regression model accommodating variance uncertainty. During the simulation process, we drew the variance values for i ranging from 1 to k from a uniform distribution on [0, 4]. Following this, we generated the error values for j ranging from 1 to n from normal distributions centered at zero with variances .
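The data-generating scheme of Experiment 1 can be sketched as follows (dimensions and coefficient values are scaled-down placeholders rather than the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(4)

# Experiment 1 style data: common regression coefficients across k groups,
# group-specific error variances drawn from Uniform[0, 4] (mean certain,
# variance uncertain).
k, n, p = 10, 1000, 5
beta = np.ones(p)                        # placeholder coefficient values
sigma2 = rng.uniform(0.0, 4.0, size=k)   # one error variance per group

X = rng.normal(size=(k, n, p))
eps = rng.normal(0.0, np.sqrt(sigma2)[:, None], size=(k, n))
Y = X @ beta + eps
```

Each group shares one variance, so the errors are mean-certain but variance-uncertain across groups.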
We compared three subsampling methods. The first is simple uniform sampling, which serves as a control to reflect the effectiveness and rationality of the proposed approach. The second is Method 1 proposed in
Section 4, called A-optimality, and the third is Method 2 in
Section 4, called L-optimality. Let
be the preference tuning parameter. We chose
and
10,000, and we set the subsample size per group
r to 100, 300, and 500, so that we could compare the impact of different subsample sizes
r on the method. We repeated the sampling 500 times on the generated data and compared the three methods by computing the bias and mean squared error of the parameter estimates.
The simulation results are reported in
Table 1,
Table 2 and
Table 3. The bolded results in the tables correspond to the best-performing method.
In order to further illustrate the effect of our sampling method in upper expectation regression, we changed the experimental settings to represent stronger heterogeneity: we increased the number of groups k and reduced the number of samples in each group accordingly. We then examined the parameter estimation performance in the new setting. We chose and as the new experimental settings, with the other experimental conditions unchanged.
The simulation results are reported in
Table 4,
Table 5 and
Table 6. Based on the entire simulation experiment, we draw the following conclusions:
By comparing the results of the three methods, it can be observed that our proposed A-optimality and L-optimality approaches significantly outperformed the uniform sampling method in terms of mean squared error. On the other hand, there was little difference among the three methods in the mean of the estimates. This demonstrates the effectiveness and stability of our methods, particularly A-optimality.
By comparing
Table 1,
Table 2 and
Table 3 or
Table 4,
Table 5 and
Table 6, it can be observed that as the subsample size
r increased, the estimation performance of all three methods improved significantly, both in terms of the mean and the mean squared error. This underscores the importance of the subsample size.
By comparing
Table 1,
Table 2 and
Table 3 and
Table 4,
Table 5 and
Table 6, it can be observed that in cases of high heterogeneity, where
k is large and
n is small, the estimation performance of all three methods deteriorated. However, as the subsample size
r increased, this discrepancy diminished.
5.1.2. Experiment 2
We reconsider the linear model
which is the same in form as in Experiment 1. In this model, we consider the situation that the mean and variance of the error are both uncertain. This means that
∼
have uncertain mean
and variance
.
During the simulation, we also specified that follows a normal distribution for . The error terms were generated as follows: first, the mean values and variance values were drawn from uniform distributions and , respectively; subsequently, the error values for were sampled from the normal distribution . Additionally, we set and or and , and r to be 100, 300, or 500.
- 1.
In summary, by comparing the methods, our proposed A-optimality and L-optimality approaches outperformed the uniform sampling method in terms of mean squared error, while the mean estimates of all three methods showed little difference in effectiveness.
- 2.
It can be observed that as the subsample size r increased, the estimation performance of all three methods improved significantly.
5.2. Real Data Analysis
An increasing number of urban areas are grappling with persistent air pollution, primarily due to fine particulate matter, especially PM2.5. PM2.5 comprises particles suspended in the air with aerodynamic diameters smaller than 2.5 micrometers. These particles are recognized for their impact on visibility, human health, and climate patterns. Epidemiological studies indicate that exposure to PM2.5 can lead to respiratory ailments, severe cardiovascular diseases, and potentially fatal outcomes.
There are many causes of PM2.5 pollution, related to factors including SO2 concentration, O3 concentration, temperature, and wind speed. This article mainly considers the impact of these four factors on PM2.5 concentration. The dataset used in this article was taken from the UCI Machine Learning Repository. Specifically, we obtained the PM2.5 concentration, SO2 concentration, O3 concentration, temperature, and wind speed at the Gucheng site in Beijing from 1 March 2013 to 29 February 2016. Because of the large time span involved, model heterogeneity may arise. Therefore, we grouped the data by time for processing. We then considered the linear influence of the covariates on the response variable and tested the effect of the subsampling method proposed in this article.
We first centered the 25,217 observations collected over three years, taking the centered PM2.5 concentration as the response variable and the centered SO2 concentration, O3 concentration, temperature, and wind speed as four covariates. These covariates all have an impact on PM2.5 concentrations. For instance, SO2 can undergo chemical reactions to transform into components of PM2.5, while wind speed affects the concentration of PM2.5 through dispersion and dilution. Next, we divided the data into 12 groups by time, with every three months forming one group. This yielded about 2100 data points per group; the group sizes were not exactly equal.
For the linear regression problem, we considered two different situations. In the first case, the model has only variance uncertainty; that is, the mean of the error term is the same in every group. In the second case, the model has both variance uncertainty and mean uncertainty. The two sampling methods we proposed performed well in both cases and were significantly better than the ordinary uniform sampling method.
Case 1 (Mean certainty model)
We assumed that the model only has variance heterogeneity; in other words, the errors from different groups have the same mean. Consider the following model:
Our purpose is to estimate the parameters , , , and , which are the same in every group. Using the subsampling methods proposed above, we obtained the following table, which reports the mean prediction error and computational time of each method under different settings. The subsample size r in each group was set to 50.
Case 2 (Mean–variance heterogeneity model)
Then, we considered a model with not only mean heterogeneity but also variance heterogeneity in the errors of different groups. Consider the following model:
In this model, . We again set the subsample size in each group to 50. The following table reports the mean prediction error and computational time of each method under different settings.
The results are presented in
Table 13 and
Table 14. Through the actual data analysis of the two cases, we obtained the following conclusions:
First, in terms of computational time, it can be observed that uniform sampling was significantly the fastest, while A-optimality was the slowest. This is because the sampling probability for uniform sampling is easily obtained, whereas A-optimality requires the calculation of the inverse of the covariance matrix, which is more complex than L-optimality.
Based on the mean prediction error metric, we found that the A-optimality and L-optimality methods significantly outperformed the uniform sampling method, with the A-optimality method being the best.
By comparing
Table 13 and
Table 14, it can be observed that the prediction performance of the mean–variance heterogeneous model was significantly better than that of the variance heterogeneous model, while their computational times were comparable. This suggests that the mean–variance model is more suitable for the real data example.
6. Conclusions
Against the background of heterogeneous distributions, statistical modeling and inference remain a significant challenge. In this paper, we improved the k-sample upper expectation regression model under distribution heterogeneity by employing group-specific values to address the impact of heterogeneity and to obtain consistent estimates. Additionally, we studied optimal subsampling techniques designed for big data scenarios, addressing the challenges posed by large datasets and privacy protection through a robust optimal sampling probability. We analyzed the theoretical properties of our method and conducted comprehensive numerical experiments on both simulated and real datasets to assess its practical effectiveness. Both the theoretical analysis and the numerical results confirm the validity of our method for large datasets. Furthermore, the real data analysis demonstrates that our method can be applied to various similar fields, such as finance and environmental monitoring.
However, our method still has some shortcomings that require subsequent work. First, the k-sample assumption requires prior knowledge of the k sample groups; that is, we must assume that the clusters of samples with different distributions are already known, typically from experience or other methods. This is a rather stringent requirement. Future work could develop methods that accommodate full distribution heterogeneity when the grouping is unknown, for example, by first applying data-driven clustering and then upper expectation regression. Additionally, a large k often implies a smaller sample size in each group and hence a high degree of heterogeneity; in such cases, both computational complexity and estimation accuracy face significant challenges, and overcoming these issues is another direction for future work. Lastly, the robust optimal sampling probability involves a tuning parameter , and selecting the optimal to balance the robustness and efficiency of the sampling probability is also a problem worth studying.