1. Introduction
With the rapid development of modern networks, science, and technology, large-scale and high-dimensional data have become ubiquitous. Because of the limits on memory and computing power, processing and storing such data on a single machine is a great challenge, so it is necessary to handle data scattered across multiple machines. McDonald et al. [1] considered a simple divide-and-conquer strategy: the model parameters of interest are learned separately on the local samples of each machine, and these local estimates are then averaged on a master machine. Divide-and-conquer is communication-efficient for large-scale data, but its learning accuracy is low. Van de Geer et al. [2] proposed an AVG-debias method, which improves the accuracy under a strong hypothesis but is computationally time-consuming because of the debiasing step. It is therefore necessary to develop distributed learning frameworks that are both communication-efficient and accurate.
Subsequently, a novel communication-efficient distributed learning framework was proposed in [3,4]. Wang et al. [3] developed the Efficient Distributed Sparse Learning (EDSL) algorithm, in which the master machine optimizes a shifted $\ell_1$-regularized M-estimation problem while the other machines compute gradients on their local data for the high-dimensional model. Jordan et al. [4] adopted the same distributed learning framework, called the Communication-efficient Surrogate Likelihood (CSL), to solve distributed statistical inference problems in low-dimensional learning, high-dimensional regularized learning, and Bayesian inference. The two algorithms significantly improve the communication efficiency of distributed learning and have been widely used to analyze big data in medical research, economic development, social security, and other fields. Under the CSL framework, Wang and Lian [5] investigated distributed quantile regression; Tong et al. [6] developed privacy-preserving and communication-efficient distributed learning that accounts for the heterogeneity caused by a few clinical sites in a distributed electronic health records dataset; and Zhou et al. [7] developed two types of Byzantine-robust distributed learning with optimal statistical rates for strongly convex losses and convex (non-smooth) penalties. The CSL framework thus plays an important role in distributed learning, and in this paper we adopt the communication-efficient CSL for our distributed bootstrap simultaneous inference for high-dimensional data.
When the data are complex, with outliers or heteroscedasticity, conventional mean regression cannot fully capture the information contained in the data. Quantile regression (QR), proposed by [8], not only captures the relationship between features and outcomes but also characterizes the conditional distribution of the outcomes given the features. Compared with mean regression, quantile regression handles heterogeneous data better, especially outcomes with heavy-tailed distributions or outliers, and it is widely used in many fields [9,10,11]. For quantile regression in high-dimensional sparse models, Belloni and Chernozhukov [12] considered $\ell_1$-penalized QR and post-penalized QR and showed that, under general conditions, the two estimators are uniformly consistent at the near-oracle rate; however, they did not consider the large-scale distributed setting. Under the distributed framework, quantile regression has also received considerable attention. For example, Yu et al. [13] proposed a parallel QPADM algorithm for large-scale heterogeneous high-dimensional quantile regression models; Chen et al. [14] proposed a computationally efficient method that requires only an initial QR estimator on a small batch of data and proved that, with only a few rounds of aggregation, it achieves the same efficiency as the QR estimator computed on all the data; Chen et al. [15] developed a distributed learning algorithm that is both computationally and communicationally efficient and showed that it achieves a near-oracle convergence rate without any restriction on the number of machines; Wang and Lian [5] analyzed high-dimensional sparse quantile regression under the CSL; and Hu et al. [16] considered an ADMM distributed quantile regression model for massive heterogeneous data under the CSL. However, the above works mainly focus on distributed estimation of the parameters of quantile regression models for massive or high-dimensional data and do not address distributed statistical inference; Volgushev et al. [17] studied distributed inference for quantile regression processes, but statistical inference for high-dimensional quantile models remains elusive, especially distributed bootstrap simultaneous inference for high-dimensional quantile regression.
The bootstrap is a generic method for learning the sampling distribution of a statistic, typically by resampling one's own data [18,19]. It can be used to assess the quality of estimators and effectively addresses statistical inference for high-dimensional parameters [20,21]; we refer to [22] for the fundamentals of the bootstrap for high-dimensional data. Kleiner et al. [23] introduced the Bag of Little Bootstraps (BLB) for massive data, which incorporates features of both the bootstrap and subsampling, is suited to modern parallel and distributed computing architectures, and maintains the statistical efficiency of the bootstrap. However, the BLB restricts the number of machines in distributed learning. Recently, Yu et al. [24] proposed the K-grad and n+K-1-grad distributed bootstrap algorithms for simultaneous inference in linear and generalized linear models, which do not constrain the number of machines and provably achieve optimal statistical efficiency with minimal communication. Yu et al. [25] extended the K-grad and n+K-1-grad distributed bootstrap to simultaneous inference for high-dimensional data under the CSL framework of [4], which not only relaxes the restriction on the number of machines but also effectively reduces communication costs. In this paper, we further extend the K-grad and n+K-1-grad distributed bootstrap to simultaneous inference in high-dimensional quantile regression models. This is challenging because the quantile regression loss is non-smooth, so the existing methodology cannot be applied directly.
In this paper, we design a communication-efficient distributed bootstrap simultaneous inference algorithm for high-dimensional quantile regression and provide its theoretical analysis. The algorithm and its statistical theory, which belong to the sub-field of bootstrap statistical inference within probability and statistics, are the focus of this article; bootstrap inference is a traditional issue in statistics, but it is novel in the context of big data. We consider the following ingredients to fit our model best. First, we adopt the communication-efficient CSL framework for large-scale distributed data, a novel distributed learning algorithm proposed in [3,4]. Under a master-worker architecture, CSL makes full use of the total information in the data at the master machine while merging only the first-order gradients from the workers. In particular, a quasi-Newton optimization at the master yields the final estimator, instead of merely aggregating the local estimators as one-shot methods do [7]. It has been shown in [3,4] that CSL-based distributed learning preserves the sparsity structure and achieves optimal statistical estimation rates for convex problems within finitely many iterations. Second, we consider high-dimensional quantile regression for large-scale heterogeneous data, especially outcomes with heavy-tailed distributions or outliers, which yields more robust bootstrap inference. Third, motivated by the communication-efficient multiplier bootstrap methods K-grad/n+K-1-grad, originally proposed in [24,25] for mean regression, we propose the K-grad-Q/n+K-1-grad-Q distributed bootstrap simultaneous inference for high-dimensional quantile regression (Q-DistBoots-SI). Our method relaxes the constraint on the number of machines and provides more accurate and robust inference for large-scale heterogeneous data. To the best of our knowledge, no distributed bootstrap simultaneous inference method more advanced than Q-DistBoots-SI is available for high-dimensional quantile regression.
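To make the CSL gradient-merging step concrete for the quantile loss, the following minimal Python sketch shows one way a master machine could assemble the surrogate gradient from the workers' first-order information; the function names and the plain-subgradient form of the check-loss gradient are our own illustrative assumptions, not the exact specification of the algorithm in Section 2.
```python
import numpy as np

def quantile_grad(X, y, beta, tau):
    """Subgradient of the average check loss (1/n) * sum_i rho_tau(y_i - x_i' beta)."""
    r = y - X @ beta
    return X.T @ ((r <= 0).astype(float) - tau) / len(y)

def csl_surrogate_grad(X1, y1, worker_grads_at_bar, beta, beta_bar, tau):
    """CSL surrogate gradient at the master: the local gradient at beta, shifted
    by the gap between the global and local gradients at the broadcast point
    beta_bar. worker_grads_at_bar stacks one gradient per machine (master included)."""
    g_local = quantile_grad(X1, y1, beta, tau)
    g_local_bar = quantile_grad(X1, y1, beta_bar, tau)
    g_global_bar = np.mean(worker_grads_at_bar, axis=0)
    return g_local - g_local_bar + g_global_bar
```
The master then minimizes the corresponding penalized surrogate loss locally, so each communication round transmits only one p-dimensional gradient per worker, matching the communication cost discussed in Remark 8.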
Our main contributions are as follows: (1) We develop a communication-efficient distributed bootstrap for simultaneous inference in high-dimensional quantile regression under the CSL framework of distributed learning. Meanwhile, the ADMM is embedded for penalized quantile learning with distributed data; it is well suited to distributed convex optimization under minimal structural assumptions [26] and can handle the non-smoothness of both the quantile loss and the Lasso penalty (see the sketch after this paragraph). (2) We theoretically prove the convergence of the algorithm and establish a lower bound on the number of communication rounds that warrants statistical accuracy and efficiency. (3) The validity and efficiency of the distributed bootstrap are corroborated by an extensive simulation study.
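To illustrate the embedded ADMM on a single machine, here is a minimal Python sketch for the $\ell_1$-penalized quantile objective; the splitting scheme, the closed-form proximal steps, and all parameter defaults are illustrative choices in the spirit of [26,27], not our exact production algorithm.
```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft-thresholding, the proximal operator of kappa*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def prox_check(v, tau, sigma):
    """Elementwise proximal operator of sigma * rho_tau (the check loss)."""
    return (v - sigma * tau) * (v > sigma * tau) \
         + (v - sigma * (tau - 1.0)) * (v < sigma * (tau - 1.0))

def admm_quantile_lasso(X, y, tau, lam, rho=1.0, iters=500):
    """Minimal ADMM sketch for (1/n) sum_i rho_tau(y_i - x_i' beta) + lam*||beta||_1
    using the splitting X beta + r = y, beta = z (scaled dual form)."""
    n, p = X.shape
    beta, z, w = np.zeros(p), np.zeros(p), np.zeros(p)
    r, u = y.copy(), np.zeros(n)
    Q = np.linalg.inv(X.T @ X + np.eye(p))       # cached for the beta-update
    for _ in range(iters):
        beta = Q @ (X.T @ (y - r + u) + z - w)   # quadratic beta-update
        r = prox_check(y - X @ beta + u, tau, 1.0 / (n * rho))
        z = soft_threshold(beta + w, lam / rho)  # lasso shrinkage step
        u += y - X @ beta - r                    # dual updates
        w += beta - z
    return z
```
The key point is that both non-smooth terms are handled by cheap closed-form proximal maps: a shifted soft-threshold for the check loss and the usual soft-threshold for the Lasso penalty.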
The rest of this article is organized as follows. In Section 2, we present the communication-efficient distributed bootstrap inference algorithm. Asymptotic properties establishing bootstrap validity for high-dimensional quantile learning are given in Section 3. The validity and efficiency of the distributed bootstrap are corroborated by an extensive simulation study in Section 4. Finally, Section 5 contains the conclusion and a discussion. The proofs of the main results and additional experimental results are provided in Appendix A.
  Notations
For every integer $p$, $\mathbb{R}^p$ denotes the $p$-dimensional Euclidean space. The inner product of two vectors $u, v \in \mathbb{R}^p$ is defined by $\langle u, v \rangle = u^\top v$. We denote the $\ell_q$-norm ($1 \le q \le \infty$) of a vector $v$ by $\|v\|_q$. The induced $q$-norm and the max-norm of a matrix $M$ are denoted by $\|M\|_q$ and $\|M\|_{\max} = \max_{i,j} |M_{ij}|$, where $M_{ij}$ is the $i$-th row and $j$-th column element of $M$. $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue of a real symmetric matrix. Let $f_{y|x}$ and $F_{y|x}$ be the conditional density and the conditional cumulative distribution function of $y$ given $x$, respectively. Denote by $S$ the index set of the nonzero coefficients and by $s = |S|$ the cardinality of $S$; the worker machines are indexed by $k = 1, \dots, K$. We write $a \lesssim b$ for $a = O(b)$ and $a \gtrsim b$ for $b = O(a)$, $a \asymp b$ for $a \lesssim b$ and $b \lesssim a$, and $a \ll b$ if $a = o(b)$.
  3. Theoretical Analysis
Recall that the quantile regression model specifies the conditional quantile $Q_\tau(y \mid x)$ of $y$ given the feature $x$ at quantile level $\tau \in (0, 1)$, that is,
$$Q_\tau(y \mid x) = x^\top \beta^*$$
with $F_{y|x}(x^\top \beta^* \mid x) = \tau$. In this section, we establish theoretical results for distributed bootstrap simultaneous inference on high-dimensional quantile regression. We use the following assumptions.
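For completeness, the standard check loss underlying this model (see, e.g., [8,12]; this display is added here for reference and is not reproduced from the original derivations) is

$$\rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr), \qquad \beta^* \in \operatorname*{arg\,min}_{\beta}\ \mathbb{E}\bigl[\rho_\tau(y - x^\top \beta)\bigr],$$

so the non-smoothness emphasized throughout the paper is precisely the kink of $\rho_\tau$ at $u = 0$.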
Assumption 1.  $x$ is sub-Gaussian, that is, for some absolute constant .

Assumption 2.   is the unique minimizer of the objective function , and  is an inner point of , where  is a compact subset.

Assumption 3.   is absolutely continuous in y, its conditional density function  is bounded and continuously differentiable in y for all x in the support of x, and  is uniformly bounded by a constant. In addition,  is uniformly bounded away from zero.

Assumption 4.   and  are sparse for , where  with . In particular, , , , , and .

Remark 1.  Assumption 1 holds if the covariates are Gaussian. Under Assumption 1,  when  by Lemma 2.2.2 in [29]. Assumptions 2 and 3 are common in standard quantile regression [9]. Assumption 4 is a sparsity assumption typically adopted in penalized variable selection.

In order to state the next assumptions, define the restricted set , and let  be the support of the J largest (in absolute value) components of  outside S.
Remark 2.  The assumptions on  and q come from [12]; they are called the restricted eigenvalue condition and the restricted nonlinear impact coefficient, respectively.  holds when x has mean zero with the diagonal elements of  being 1's, by Lemma 1 in [12]. The restricted eigenvalue condition is analogous to the condition in [30]. The quantity q controls the quality of minoration of the quantile regression objective function by a quadratic function over the restricted set, which holds under Design 1 in [12].

First, we give the convergence rates of distributed learning for high-dimensional quantile regression models under the CSL framework.
Theorem 1.  Assume that Assumptions 1–5 hold and . Then, with probability at least , we have
where .

Remark 3.  Recall from Lemma A1 that
where . We can take
Thus, Theorem 1 upper bounds the learning error  as a function of . Applying it to the iterative program, we obtain the following learning error bound, which depends on the local $\ell_1$-regularized estimation error .
Corollary 1.  Suppose that the conditions of Theorem 1 are satisfied,  is taken as in (22), and  for all t. Then, with probability at least , we have
where  and .

Remark 4.  For the initialization estimator, we refer to Theorem 2 in [12]. We further explain the bound and examine its scaling with respect to n, K, s, and p. When , it is easy to see that, by taking
we have the following error bounds:
Moreover, as long as t is large enough so that  and , then
which matches the centralized lasso without any additional error term [30], as [3] achieved for distributed sparse linear regression.

Based on the proposed Q-DistBoots-SI algorithm, we define
      
Theorem 2.  (K-grad-Q) Assume that Assumptions 1–5 hold, let
for , and  for . Then, if , , and for , where
we have
where , in which  is the K-grad-Q bootstrap statistic with the same distribution as  in (12), and  denotes the probability with respect to the randomness from the multipliers. In addition,
where  is defined in (7).

Theorem 2 ensures the validity of constructing simultaneous confidence intervals for the quantile regression model parameters using the “K-grad-Q” bootstrap method in Algorithm 1. Moreover, it indicates that the bootstrap quantiles can approximate those of the oracle statistic, implying that our proposed bootstrap procedure possesses statistical validity similar to that of the oracle method.
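To illustrate how the “K-grad-Q” bootstrap quantile can be simulated from the communicated gradients, here is a hedged Python sketch; the sqrt(nK) scaling, the Gaussian multipliers, and the nodewise-lasso estimate Theta_hat follow the K-grad construction of [24,25] as we adapt it, so treat this as a sketch rather than the definitive form of Algorithm 1.
```python
import numpy as np

def k_grad_q_quantile(local_grads, Theta_hat, n, B=500, alpha=0.05, rng=None):
    """Sketch of the K-grad-Q multiplier bootstrap quantile. local_grads is a
    (K, p) array of per-machine check-loss gradients at the final iterate;
    Theta_hat is a (p, p) approximate inverse Hessian (e.g., nodewise lasso)."""
    rng = np.random.default_rng() if rng is None else rng
    K, p = local_grads.shape
    centered = local_grads - local_grads.mean(axis=0)   # g_k - g_bar
    stats = np.empty(B)
    for b in range(B):
        eps = rng.standard_normal(K)                    # Gaussian multipliers
        g_star = centered.T @ eps / K                   # multiplier average
        stats[b] = np.sqrt(n * K) * np.abs(Theta_hat @ g_star).max()
    return np.quantile(stats, 1.0 - alpha)              # bootstrap quantile
```
The “n+K-1-grad-Q” variant additionally recycles the n individual score contributions on the master alongside the K-1 worker gradients, which improves the covariance approximation when K is small.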
Remark 5.  If , , and  for some constants , , and , then a sufficient condition is , , and
Notice that the above representation of  is independent of the dimension p; the direct effect of p enters only through an iterated logarithmic term , which is dominated by .
Theorem 3.  (n+K-1-grad-Q) Assume that Assumptions 1–5 hold; take  as in (24) for  and  for . Then, if , , and when , where
we have
where , in which  is the n+K-1-grad-Q bootstrap statistic with the same distribution as  in (13), and  denotes the probability with respect to the randomness from the multipliers. In addition, (27) also holds.

Theorem 3 establishes the statistical validity of the distributed bootstrap method when using “n+K-1-grad-Q”. To gain insight into the difference between “K-grad-Q” and “n+K-1-grad-Q”, we compare the covariance of the oracle score A with the conditional covariances of  and  given the data, and we obtain

and
Remark 6.  If , , and  for some constants , , and , then a sufficient condition is , , and
Notice that the above representation of  is independent of the dimension p; the direct effect of p enters only through an iterated logarithmic term , which is dominated by .
Remark 7.  The rates of  and  in Theorems 2 and 3 are motivated by Theorem 1 and [2]. Therefore, we fix  (e.g., 0.01 in the simulation study) and use cross-validation to choose .

Remark 8.  The total communication cost of our algorithm is of the order  because, in each iteration, we communicate p-dimensional vectors between the master node and the K-1 worker nodes, and  grows only logarithmically in K when n and p are fixed. This order matches those in the existing communication-efficient statistical inference literature, e.g., [3,4,25].

4. Simulation Experiments
In this section, we demonstrate the advantages of our proposed approach through numerical simulation. We consider parameter estimation for high-dimensional quantile regression models in a distributed environment. In Section 4.1, we compare our algorithm Q-ADMM-CSL with the oracle estimation (Q-Oracle) and simple divide-and-conquer (Q-Avg) for high-dimensional quantile regression, evaluating the computational effectiveness of our proposed algorithm. In Section 4.2, we construct confidence intervals and assess their validity. The data are generated from the following model:
where the noise  is taken, respectively, as  and , to demonstrate the benefits of our method for large-scale high-dimensional data with heavy-tailed distributions.
In this section, we consider a high-dimensional quantile regression model with feature dimension ; fix the total sample size N = 3000; and select the number of machines K = 5, 10, and 20, respectively. The sample size on each machine is therefore n = N/K, that is, 600 for K = 5, 300 for K = 10, and 150 for K = 20. Considering parameter sparsity, we choose the true coefficient vector  to be p-dimensional, with s nonzero coefficients and the remaining  coefficients equal to 0. We consider two cases: (1) sparsity with , , and the rest of the components equal to 0; (2) sparsity with , , and the rest equal to 0. We generate independent and identically distributed covariates  from a multivariate normal distribution  with covariance matrix . For the quantile level , we consider three levels: 0.25, 0.5, and 0.75.
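For concreteness, a Python sketch of this data-generating process follows. The specific constants here, namely the dimension p = 500, the Toeplitz covariance Sigma_ij = 0.5^|i-j|, the nonzero coefficient value 1.0, and t(3) noise as the heavy-tailed alternative, are illustrative assumptions, not necessarily the exact settings of our experiments.
```python
import numpy as np

def make_data(N=3000, p=500, s=5, K=5, noise="normal", seed=0):
    """Sketch of the simulation design: sparse beta, Gaussian features with a
    Toeplitz covariance (an assumed choice), and normal or t(3) noise."""
    rng = np.random.default_rng(seed)
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
    beta = np.zeros(p)
    beta[:s] = 1.0                                   # s nonzero coefficients
    eps = rng.standard_normal(N) if noise == "normal" else rng.standard_t(3, size=N)
    y = X @ beta + eps
    # split evenly across K machines: n = N // K samples per machine
    return np.array_split(X, K), np.array_split(y, K), beta
```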
  4.1. Parameter Estimation
In this section, we study the performance of our proposed algorithm. We repeatedly generate 100 independent datasets and use the $\ell_2$-norm to evaluate the quality of a parameter estimate, that is, the error . Meanwhile, we compare the estimate  obtained by our proposed algorithm with the estimate  obtained from all the data and the estimate  obtained by naive averaging.
For the choice of the penalty parameter , in the oracle estimation we follow the method of selecting the penalty parameter in [12]; in the construction of the averaging estimation, we choose ; in our proposed distributed multi-round communication process, ; and when  we set . For the parameters  and  in the ADMM, we follow the selection in [27] and choose  and .
Figure 1 and Figure 2 show the relationship between the number of communication rounds and the estimation error of the parameters when the noise distributions are normal and , for the sparsity levels  and , respectively. We consider various scenarios involving different quantile levels and numbers of machines. It can be observed that, after sufficient communication rounds, our parameter estimation method (Q-ADMM-CSL) approximates the performance of the oracle estimation (Q-Oracle), and its performance is significantly better than that of Q-Avg after one round of communication. In addition, our proposed method converges quickly and matches the oracle method after only about 30 rounds of communication.
   4.2. Simultaneous Confidence Intervals
In this section, we demonstrate the statistical validity of the confidence intervals constructed by our proposed method. For each choice of s and K, we run Algorithm 1 with “K-grad-Q” and “n+K-1-grad-Q” on 100 independently generated datasets and compute the empirical coverage probabilities and the average widths over the 100 runs. In each run, we draw  bootstrap samples and calculate B bootstrap statistics ( or ) simultaneously. We obtain the  and  empirical quantiles and thereby the  and  simultaneous confidence intervals. For the selection of the adjustment parameter  in the nodewise algorithm, we follow the method proposed in [25].
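The last two steps can be summarized in a few lines of Python; the common half-width c/sqrt(N), with N the total sample size, mirrors the K-grad-Q sketch in Section 3 and is an illustrative assumption rather than a transcription of Algorithm 1.
```python
import numpy as np

def simultaneous_cis(beta_hat, boot_stats, n_total, alpha=0.05):
    """Simultaneous (1 - alpha) confidence intervals from B bootstrap sup-norm
    statistics: every coordinate shares the half-width c / sqrt(n_total)."""
    c = np.quantile(boot_stats, 1.0 - alpha)   # empirical (1 - alpha) quantile
    half = c / np.sqrt(n_total)
    return beta_hat - half, beta_hat + half

# usage sketch: 95% simultaneous intervals from B bootstrap statistics
# lower, upper = simultaneous_cis(beta_hat, stats, n_total=3000, alpha=0.05)
```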
Figure 3, for the case of , and Figure 4, for the case of , display the coverage probabilities and the ratios of the average widths to the oracle widths, calculated using the “K-grad-Q” and “n+K-1-grad-Q” methods at different quantile levels. The confidence level is 95%. The sparsity levels are  and . To determine whether the true values of the non-zero elements of the parameter  lie within the intervals constructed by our proposed method, we examine different numbers of machines K. We observe that the confidence intervals constructed by our method effectively cover the true values of the unknown parameters. In Figure 5 for  and Figure 6 for , we construct a 95% confidence interval for the fifth element of the true parameter (). The case of the  confidence level is given in Appendix A.
When the number of communication rounds is small, the accuracy of the parameter estimation is poor and the coverage probabilities of both methods are low. However, when the number of communication rounds is sufficiently large, the estimation accuracy is relatively high and the “K-grad-Q” method tends to attain the nominal coverage; moreover, the “n+K-1-grad-Q” method is relatively more accurate. The confidence intervals produced by our method effectively cover the unknown true parameters. We also find that when the number of machines is too large (with a small amount of data on each machine), the estimation accuracy is low, which in turn leads to a low coverage probability; when K is too small, both algorithms perform poorly, which is consistent with the results in Remarks 5 and 6.
  5. Conclusions and Discussions
Constructing confidence intervals for the parameters of high-dimensional sparse quantile regression models is a challenging task. The bootstrap, as a standard inference tool, has been shown to be useful for this issue. However, previous works extending the bootstrap to high-dimensional models focus on non-distributed mean regression [25] or distributed mean regression [24]. We extend their “K-grad” and “n+K-1-grad” bootstrap techniques to the “K-grad-Q” and “n+K-1-grad-Q” distributed bootstrap simultaneous inference for high-dimensional quantile regression, which is applicable to large-scale heterogeneous data. Our proposed Q-DistBoots-SI algorithm is based on a communication-efficient distributed learning framework [3,4]. Therefore, Q-DistBoots-SI is a novel communication-efficient distributed bootstrap inference method, which relaxes the constraint on the number of machines and is more accurate and robust for large-scale heterogeneous data. We theoretically prove the convergence of the algorithm and establish a lower bound on the number of communication rounds that warrants statistical accuracy and efficiency. This enriches the statistical theory of distributed bootstrap inference and provides a theoretical basis for its widespread application.

In addition, our proposed Q-DistBoots-SI algorithm can be applied to large-scale distributed data in various fields. The bootstrap method has long been applied to statistical inference. For example, Chatterjee and Lahiri [31] studied the performance of bootstrapping Lasso estimators on the prostate cancer data and stated that the covariates log(cancer volume), log(prostate weight), seminal vesicle invasion, and Gleason score have a nontrivial effect on log(prostate specific antigen), while the remaining variables (age, log(benign prostatic hyperplasia amount), log(capsular penetration), and percentage Gleason scores 4 or 5) were judged insignificant at level ; Liu et al. [32] applied their bootstrap lasso + partial ridge method to a dataset containing 43,827 gene expression measurements from the Illumina RNA sequencing of 498 neuroblastoma samples and found some significant genes; and Yu et al. [25] tested their distributed bootstrap for simultaneous inference on a semi-synthetic dataset based on the US Airline On-time Performance dataset and successfully selected the relevant variables associated with arrival delay. However, Refs. [25,31,32] mainly focused on bootstrap inference for mean regression and therefore cannot select predictors relevant to the response at different quantile levels. In contrast, our method can be applied to the US Airline On-time Performance dataset and to gene expression datasets to infer predictors with significant effects on the response at each quantile level. This matters because the influencing factors of the response may be of particular interest at extreme quantile levels. For example, our approach can be applied to gene expression data to identify genes that have significant effects on predicting a cancer gene's expression level in a quantile regression model. Compared with mean regression methods, the genes found by our method should be biologically more reasonable and interpretable because of the characteristics of quantile regression. Future work is needed to investigate applications of our distributed bootstrap simultaneous inference for quantile regression to large-scale distributed datasets in various fields.

Although our Q-DistBoots-SI algorithm is communication-efficient, when the feature dimension of the data is extremely high, the gradient transmitted by each worker machine is still an ultra-high-dimensional vector, which incurs an unbearable communication cost. Thus, we need to develop a more communication-efficient Q-DistBoots-SI algorithm via quantization and sparsification techniques (such as Top-k) for large-scale ultra-high-dimensional distributed data. In addition, our Q-DistBoots-SI algorithm cannot cope with Byzantine failures in distributed statistical learning. Byzantine failures have recently attracted significant attention [7] and are becoming more common in distributed learning frameworks because worker machines may exhibit abnormal behavior due to crashes, faulty hardware, stalled computation, or unpredictable communication channels. Byzantine-robust distributed bootstrap inference will also be a topic of our future research. Additionally, we can extend our distributed bootstrap inference method to transfer learning and graphical models for large-scale high-dimensional data.