Abstract
Biomedical researchers typically investigate the effects of specific exposures on disease risks within a well-defined population. The gold standard for such studies is to design a trial with an appropriately sampled cohort. However, due to the high cost of such trials, the collected sample sizes are often limited, making it difficult to accurately estimate the effects of certain exposures. In this paper, we discuss how to leverage the information from external “big data” (datasets with significantly larger sample sizes) to improve the estimation accuracy at the risk of introducing a small amount of bias. We propose a family of weighted estimators to balance the bias increase and variance reduction when incorporating the big data. We establish a connection between our proposed estimator and the well-known penalized regression estimators. We derive optimal weights using both second-order and higher-order asymptotic expansions. Through extensive simulation studies, we demonstrate that the improvement in mean square error (MSE) for the regression coefficient can be substantial even with finite sample sizes, and that our weighted method outperforms existing approaches such as penalized regression and the James–Stein estimator. Additionally, we provide a theoretical guarantee that, in general, the proposed estimators never yield an asymptotic MSE larger than that of the maximum likelihood estimator using the small data only. Finally, we apply our proposed methods to the Asia Cohort Consortium China cohort data to estimate the associations of age, BMI, smoking, and alcohol use with mortality.
MSC:
62J12; 62P10
1. Introduction
In most research settings in medicine, we aim to learn the effect of a specific exposure on the risk of a specific disease in a well-defined target population. To achieve this goal, well-designed trials are usually used. However, the sample sizes for such studies are usually limited due to the high cost of recruitment, so the sample size typically only provides power to detect the effect of the primary exposure of interest. Meanwhile, the number of available observational studies or trials from other populations is accumulating quickly. Can we use information from these data to improve our inference on the population from which the small data are drawn? Here, we refer to the randomized clinical trial data that have a clearly defined target population by design and sampling scheme as “small data” and refer to other external data as “big data”. Our goal is to efficiently combine the information from these two types of data to obtain a more accurate estimate of the association between quantities that are readily available in both datasets (e.g., age and gender) and the risk of disease in the population from which the small data are drawn. Since the distributions of the predictors, as well as the relationship between these predictors and the event of interest, are likely to differ between the “big data” and the “small data”, we are at risk of introducing some bias when using information from the big data. However, given the size of the big data, they can still provide insightful information about how we predict risk in the target population, assuming that the two populations share a certain degree of similarity in their prediction model forms. This motivates us to seek an estimator that is better than using the “small data” only through a bias–variance trade-off. Directly pooling the two datasets can lead to substantial bias and an increased mean square error (MSE) when the difference between the two sources is large. This motivates us to find estimators that never lead to an increased MSE and that lead to a decreased MSE in certain situations when compared with the estimator using the “small data” only.
Previous studies have shown the plausibility of this type of idea. For simple mean estimation, ref. [1] showed that the sample mean is inadmissible. Ref. [2] studied combining regression results from small and big data in a linear regression setting and showed Stein-type results for Gaussian responses, i.e., using the small data only is inadmissible when the dimension is sufficiently large and the degrees of freedom are more than 10. Ref. [3] proposed using the data-shared Lasso to achieve this in a linear regression setting. However, a similar enhanced regression approach for non-Gaussian outcomes has not been fully studied. To our knowledge, existing risk prediction work mostly depends on the assumption that certain reduced marginal models or marginal information from the big data are accurate [4,5]. In this work, we propose to fill this gap in risk prediction by combining information from small and big data for binary outcomes under an alternative structural assumption, where we assume the effect structure rather than the effect magnitude is the same. Specifically, we propose two new estimators that can incorporate information from the big data to improve efficiency in estimating parameters related to the small data. We show that these estimators are better than using the small data only, both theoretically and via extensive numerical studies. In addition, we compare our proposed estimators with several existing alternative estimators (the pooled estimator, penalized regression estimators, and the James–Stein estimator).
The structure of this paper is as follows: In Section 2, we introduce the notation and models, followed by our proposed estimators and their connection to other existing estimators. In Section 3, we study the finite sample performance of the different estimators and show the improvement achieved by our proposed estimators. In Section 4, we provide theoretical results that guarantee that our proposed estimator is no worse than the small-data-only analysis in terms of MSE. In Section 5, we apply our method to analyze the Asia Cohort Consortium data, with sensitivity analyses for potential violations of the model assumptions. In Section 6, we discuss potential extensions of our proposed estimator to more general settings.
2. Methods
2.1. Notation and Model
We denote our outcome of interest as $Y$ and denote the design matrices by $X_S$ and $X_B$, where $n_S$ and $n_B$ represent the sample sizes for the small data and the big data, respectively. In general, we reserve the subscripts B and S to denote quantities related to the big data and the small data, respectively. Since the outcomes of interest are binary (disease occurrence), we assume logistic regression models for both the small data and the big data and write them as
where $\beta_S$ and $\beta_B$ are the unknown regression parameters and $1\{\cdot\}$ is the indicator function. Our goal is to obtain an accurate estimate of $\beta_S$ while treating $\beta_B$ as a nuisance parameter, using information from both the small and the big data. In this project, we propose novel weighted shrinkage estimators and relate them to the penalized regression-based estimators. We compare the performance of the weighted estimators, the penalized estimators, and the James–Stein-type shrinkage estimator [6,7]. In this paper, we adopt the following big O and small o notation: a sequence $a_n$ is said to be $O(b_n)$ if $|a_n/b_n|$ is bounded and is said to be $o(b_n)$ if $a_n/b_n \to 0$; a random variable sequence $Z_n$ is said to be $O_p(b_n)$ if, for any $\epsilon>0$, there exists a constant $M$ such that $P(|Z_n/b_n|>M)<\epsilon$ for all $n$, and is said to be $o_p(b_n)$ if $P(|Z_n/b_n|>\epsilon)\to 0$ for arbitrary $\epsilon>0$ as $n\to\infty$.
2.2. Penalized Regression-Based Estimators
Here, we first introduce the existing penalized regression-based estimators that can be used to integrate the information from the two datasets. We consider minimizing the following objective function to obtain an estimate of $\beta_S$, where a penalty is put on the difference between $\beta_B$ and $\beta_S$ only:
where the last term is the penalty. The estimators from the above optimization problem can be implemented using penalized logistic regression software such as glmnet in R 4.0 [8], with an augmented design matrix and a penalty factor that leaves the shared parameters unpenalized. Here, the form of the penalty term can be flexible; for example, the $L_1$ norm gives the LASSO penalty [9,10] and the squared $L_2$ norm gives the ridge regression penalty. Other penalties such as the elastic net [11], SCAD [12], and MCP [13] can also be used, but for comparison purposes, we use the $L_1$ and $L_2$ penalties to represent the performance of this class of estimators.
When prediction in the small dataset is more important than estimation of the regression parameter $\beta_S$, instead of penalizing the parameter difference directly, we might penalize the extra linear predictor. The tuning parameter $\lambda$ can be determined via $k$-fold cross-validation [14] in the small dataset, as in the sketch below.
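As a concrete illustration, the following R sketch sets up one common stacked-design construction (in the spirit of the data-shared Lasso [3]); the object names X_S, X_B, y_S, and y_B are hypothetical, and, for brevity, cv.glmnet is run on the stacked data rather than on the small data only as described above.

```r
library(glmnet)

## Stacked design: a small-data row contributes (x, 0), a big-data row (x, x),
## so the second coefficient block plays the role of the deviation between the
## big-data and small-data coefficients, and only that block is penalized.
p   <- ncol(X_S)                                  # covariates (no intercept column)
X   <- rbind(cbind(X_S, matrix(0, nrow(X_S), p)),
             cbind(X_B, X_B))
y   <- c(y_S, y_B)
pen <- c(rep(0, p), rep(1, p))                    # penalty.factor: shared part unpenalized

cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1,   # alpha = 1: L1; alpha = 0: L2
                    penalty.factor = pen, nfolds = 10)
beta_S_pen <- coef(cv_fit, s = "lambda.min")[2:(p + 1)]     # penalized estimate of the shared coefficients
```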
2.3. Weighted Shrinkage Estimator
We propose an alternative to the penalized regression method via a weighted shrinkage approach. This kind of method has been shown to be useful under a linear regression model [2]; however, as we will see in this section, its application to this nonlinear model is not straightforward.
The basic idea of this kind of weighted shrinkage estimator is to first fit the logistic regression model in the small data and in the big data separately to obtain $\hat\beta_S$ and $\hat\beta_B$ as estimators of $\beta_S$ and $\beta_B$, and then combine the two estimators through the weighted average $\hat\beta(W) = W\hat\beta_S + (I - W)\hat\beta_B$ for a specific weight matrix $W$. Specifically, when $W = I$, this is just the estimator using the small data only, and when $W$ weights the two fits by their relative information, this can be approximately viewed as a pooled estimator of the small and the big data under the assumption $\beta_S = \beta_B$.
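A minimal R sketch of this two-step construction (hypothetical data frames small_data and big_data sharing the same covariates and a binary outcome y; the weight matrix W is left generic here):

```r
## Step 1: fit the working logistic models separately in the two datasets.
fit_S <- glm(y ~ ., data = small_data, family = binomial())
fit_B <- glm(y ~ ., data = big_data,   family = binomial())

beta_S_hat <- coef(fit_S)          # estimate of beta_S
beta_B_hat <- coef(fit_B)          # estimate of beta_B
V_S_hat    <- vcov(fit_S)          # model-based variance estimates,
V_B_hat    <- vcov(fit_B)          # used later to build the weights

## Step 2: combine through a weighted average for a given weight matrix W.
combine <- function(W) as.vector(W %*% beta_S_hat + (diag(nrow(W)) - W) %*% beta_B_hat)
```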
It is obvious that the performance of $\hat\beta(W)$ depends strongly on the choice of the weight matrix $W$. The major goal here is to find the optimal weight matrix as a function of $\beta_S$, $\beta_B$, and the data, where $X_S$ and $X_B$ are the design matrices for the small and the big data, respectively. Here, we define the optimal weight as the weight that minimizes the coefficient estimation error $E\|\hat\beta(W) - \beta_S\|^2$.
To find the form of the optimal weight, we use asymptotic expansions of $\hat\beta_S$ and $\hat\beta_B$. The optimal weight obtained via the second-order approximation is denoted as $W_2$, and the optimal weight obtained via the higher-order Edgeworth expansion [15] is denoted as $W_E$. For all these weights, we can plug in $\hat\beta_S$ and $\hat\beta_B$ for $\beta_S$ and $\beta_B$ to obtain estimated versions, $\hat W_2$ and $\hat W_E$, of these optimal weights.
Another existing estimator that we compare our newly proposed estimators with is the James–Stein estimator. For the James–Stein estimator, we use the positive-part form from [16], which shrinks the small-data estimate toward the big-data estimate by a data-driven factor based on the test statistic for the hypothesis $\beta_S = \beta_B$.
Now, we provide more details on how to obtain these weights. We begin with the second-order approximation. Using the standard expansion of the logistic regression estimator, we have
where $V_S$ is the variance of the leading term in the expansion of $\hat\beta_S$ and $V_B$ is the variance of the leading term in the expansion of $\hat\beta_B$. Ignoring the higher-order remainder terms in the above expansion leads to the second-order optimal weight $W_2$,
where $W_2$ depends on $V_S$, $V_B$, and the bias $\beta_B - \beta_S$. The corresponding estimated second-order optimal weight $\hat W_2$ is obtained by replacing these quantities with their consistent estimates, namely $\hat V_S$, $\hat V_B$, and $\hat\beta_B - \hat\beta_S$.
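Continuing the R sketch above, the following illustrates the plug-in construction; because the exact display is given only implicitly above, the weight here uses the familiar bias-and-variance form $W = (V_B + dd^\top)(V_S + V_B + dd^\top)^{-1}$ with $d = \hat\beta_B - \hat\beta_S$, which should be read as our reconstruction rather than the paper's exact expression.

```r
## Estimated second-order-type weight and the resulting weighted estimator
## (a sketch; the weight formula is an assumed reconstruction, see text).
d       <- beta_B_hat - beta_S_hat                 # estimated bias of the big data
W2_hat  <- (V_B_hat + tcrossprod(d)) %*%
           solve(V_S_hat + V_B_hat + tcrossprod(d))
beta_W2 <- combine(W2_hat)                         # weighted shrinkage estimate of beta_S
```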
For the higher-order approximation, we use the expansion
where the leading terms are as above and the expressions of the higher-order terms can be found in Appendix A. Collecting the means, variances, and covariances of the leading and higher-order terms, and ignoring the remaining approximation error terms, the mean square error as a function of $W$ is minimized at the higher-order optimal weight $W_E$,
where the components are defined through the expansion terms given in Appendix A.
The estimated version $\hat W_E$ is obtained by replacing $\beta_S$ and $\beta_B$ with $\hat\beta_S$ and $\hat\beta_B$ and replacing the expectations and variances with sample means and sample variance–covariances within the expressions of these terms.
2.4. Relationship Between Two Types of Estimators
The shrinkage estimator defined above is closely related to the penalized estimator: the weight $W$ can be expressed in terms of the penalty using the asymptotic linear expansion of the GLM estimators, as in Equations (2) and (5), whose form can be obtained from the Edgeworth expansion [17]. To relate this to the penalized regression method, we consider the following penalized version and minimize
Denoting the components of the penalized objective accordingly, the score function is
and the information matrix is
So, we have the approximation
Define
Then, we have
So, we can see that, with appropriate choices of the tuning parameter and the penalty matrix, the penalized estimator is asymptotically equivalent to the weighted estimator.
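To make the connection concrete, the following display sketches the equivalence under a local quadratic approximation of the small-data log-likelihood; the symbols $I_S$ (observed information), $\lambda$, and $\Omega$ (a generic ridge-type penalty matrix) are our own notation for this sketch, not necessarily the paper's.

```latex
% Quadratic approximation of the penalized problem with penalty
% \lambda (\beta - \hat\beta_B)^\top \Omega (\beta - \hat\beta_B):
\hat\beta(\lambda)
  \;\approx\; (I_S + \lambda\Omega)^{-1}\bigl(I_S\,\hat\beta_S + \lambda\Omega\,\hat\beta_B\bigr)
  \;=\; W\hat\beta_S + (I - W)\hat\beta_B,
\qquad W = (I_S + \lambda\Omega)^{-1} I_S .
```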
3. Simulations
To see how much efficiency we can gain using our proposed estimators in a finite sample setting, we use a detailed simulation exercise, as described below, to compare the different methods with each other. In these simulations, we consider different values of p, the dimension of $\beta_S$ (including the intercept), while keeping the norm of $\beta_S$ fixed. We also vary the amount of bias in the big data, i.e., the magnitude of $\beta_B - \beta_S$, over a range of values. We generate the small data and the big data from the same Gaussian distribution, and the covariates are assumed to be uncorrelated with each other. We also vary $n_S$, the size of the small data, between 100 and 500 in increments of 50, while we consider two fixed sizes for $n_B$, namely 1000 and 10,000. For each simulation, we generate Y based on our assumed logistic models for the small and the big data, given in Equation (1). In addition, we also consider the setting where the bias is 0 and the setting where the bias in the big data is due to a covariate missing in the big data.
For each simulation scenario, we perform 100 simulations to compute the mean squared error (MSE) in estimation, $E\|\hat\beta_S - \beta_S\|^2$. We obtain estimates for $\beta_S$ by (1) using small data only (Small), (2) pooling big and small data (Pool), (3) weighting with the optimal weight from the second-order approximation (W2), (4) weighting with the optimal weight from the higher-order approximation (WE), (5) $L_1$-penalized regression (PL1), (6) $L_2$-penalized regression (PL2), or (7) the JS+ weighted estimator (JSP).
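The following R sketch shows how one replicate of such a scenario could be generated and evaluated; the function and argument names are illustrative, and only the small-data-only estimator is computed explicitly.

```r
## One simulation replicate: shared Gaussian, uncorrelated covariates; the
## big-data coefficients are shifted by a bias vector delta.
one_rep <- function(nS, nB, p, beta_S, delta) {
  X_S <- cbind(1, matrix(rnorm(nS * (p - 1)), nS, p - 1))   # intercept + covariates
  X_B <- cbind(1, matrix(rnorm(nB * (p - 1)), nB, p - 1))
  y_S <- rbinom(nS, 1, plogis(X_S %*% beta_S))              # logistic model, Equation (1)
  y_B <- rbinom(nB, 1, plogis(X_B %*% (beta_S + delta)))    # biased big data
  fit_S <- glm.fit(X_S, y_S, family = binomial())
  ## ... the pooled, penalized, weighted, and JSP estimators would be computed
  ## here in the same way; each method's squared error is recorded and averaged
  ## over 100 replicates to obtain the MSEs reported in Tables 1 and 2.
  sum((fit_S$coefficients - beta_S)^2)                      # small-data-only squared error
}
```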
Table 1 and Table 2 provide the mean square errors (MSE) of the 7 estimators listed above under various small data sample sizes ($n_S$), numbers of covariates (p), and magnitudes of the bias in the big data. The simulation results are further summarized in Figure 1 and Figure 2 to compare the efficiency gain between different methods and to show the trend of the efficiency gain over the sample size of the small data. Specifically, Table 1 and Figure 1 provide results for different simulation settings when $n_B$ = 1000, while Table 2 and Figure 2 provide results for different simulation settings when $n_B$ = 10,000. In each figure, nine plots are presented in a grid of three rows and three columns, where the columns show plots for a particular level of the relative bias (in increasing order of magnitude from left to right), while the rows show plots for different dimension sizes p (in increasing order of magnitude from top to bottom). Figure 3 shows the setting without bias, and Figure 4 shows the setting where the bias in the big data is due to a missing covariate. Each plot in the grid presents the graphs of the log-transformed ratio of the MSE of $\hat\beta_S$, when we use each of the procedures (Small, Pool, W2, WE, PL1, PL2, JSP), versus that when we only use the small data (Small), as a function of the varying sizes for $n_S$. A curve below the horizontal zero line indicates that the corresponding estimator is consistently more efficient than the estimator using the small data only. When the curve for an estimator A is lower than the curve for another estimator B, estimator A is consistently more efficient than estimator B. From the plots, we can make the following observations:
Table 1.
Mean square errors (MSE) of $\hat\beta_S$ when we use each of the procedures (Small, Pool, W2, WE, PL1, PL2, JSP) for varying $n_S$, number of covariates p, and magnitude of bias when $n_B$ is 1000.
Table 2.
Mean square errors (MSE) of $\hat\beta_S$ when we use each of the procedures (Small, Pool, W2, WE, PL1, PL2, JSP) for varying $n_S$, number of covariates p, and magnitude of bias when $n_B$ is 10,000.
Figure 1.
Plot for the log-transformed ratios of the mean squared error of $\hat\beta_S$, when we use each of the procedures (Small, Pool, W2, WE, PL1, PL2, JSP), versus that when we only use the small data (Small), for varying $n_S$, p, and bias when $n_B$ is 1000.
Figure 2.
Plot for the log-transformed ratios of the mean squared error of $\hat\beta_S$, when we use each of the procedures (Small, Pool, W2, WE, PL1, PL2, JSP), versus that when we only use the small data (Small), for varying $n_S$, p, and bias when $n_B$ is 10,000.
Figure 3.
Plot for the log-transformed ratios of the mean squared error of $\hat\beta_S$, when we use each of the procedures (Small, Pool, W2, WE, PL1, PL2, JSP), versus that when we only use the small data (Small), for varying $n_S$ and p when the bias is 0.
Figure 4.
Plot for the log-transformed ratios of the mean squared error of $\hat\beta_S$, when we use each of the procedures (Small, Pool, W2, WE, PL1, PL2, JSP), versus that when we only use the small data (Small), for varying $n_S$ and p when the bias is caused by a covariate missing in the big data.
- The performances of the W2 and WE procedures are very close to each other in every setting, suggesting that the optimal second-order approximation weights probably suffice for our problem (in fact, it is difficult to visually distinguish the two graphs, as they are exactly overlaid).
- In every simulation setting, the W2 and WE procedures outperform every other method. The gain in performance of W2 and WE over the next best-performing method increases with increasing dimension size p and increasing relative bias. The same trend is observed for both sizes of $n_B$ (1000/10,000).
- The better of the two penalized procedures is the third best-performing method overall (after W2 and WE), and its performance is similar to the W2/WE procedures when the dimension size is small and the relative bias is low.
- The pooled procedure is the worst-performing method overall and is quite sensitive to the bias. Although it shows relatively good performance when the dimension size is small and the relative bias is low, its performance becomes very poor with increasing dimensions and especially with increasing bias. Apart from JSP in some scenarios, it is the only procedure that shows extremely elevated MSEs in comparison to the small data.
- The performance of the proposed weighted procedures is similar to that of JSP in some settings but better in others. For example, with increasing dimensions and increasing bias, the JSP procedure sometimes tends to have a higher MSE than that obtained from the small data (Small), especially when the size of the small data is on the higher end, whereas the proposed weighted procedures always perform better than Small.
- All methods (except for Pool and in some instances JSP) show lower MSE than the estimates obtained from the small data themselves, and the gain in efficiency is most pronounced when the size of the small data is small.
4. Theoretical Results
The simulation results indicate that our proposed estimators always outperform the small-data-only analysis in terms of MSE, and such improvement is sometimes substantial. However, to apply our proposed method in general, a natural question is whether there exists a scenario under which the proposed estimator will underperform the small-data-only analysis, especially when the difference between the two sources of data is large. To answer this question, we summarize the theoretical guarantees of our proposed weighted estimators in the following theorems with brief proof ideas; the detailed expressions and proofs can be found in Appendix A.
Theorem 1.
The second-order optimal weight $W_2$ and its estimated version $\hat W_2$ approximately minimize the mean square error $E\|\hat\beta(W) - \beta_S\|^2$, up to a higher-order approximation error, in the sense that
where the infimum is taken over all random weight matrices that are measurable with respect to the data.
$W_2$ and $\hat W_2$ are also approximately optimal weights for prediction purposes in the sense that
For the proof of Theorem 1, we can use the second-order expansion of $\hat\beta_S$ and $\hat\beta_B$ around $\beta_S$ and $\beta_B$ to obtain the optimal weight as a function of the bias and the variances. Since the approximation error of this expansion is of higher order, the mean square error of $\hat\beta(W)$ and the mean square prediction error can both be approximated with an error of that higher order. Replacing the optimal weight with the estimated optimal weight introduces a further error term, but the leading term remains dominant. Details can be found in Appendix A.
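To make the key step concrete, the following display is a reconstruction of the second-order calculation in our own notation ($V_S$, $V_B$ are the leading-order variances and $\delta = \beta_B - \beta_S$ is the bias); it is a sketch, not the paper's exact display.

```latex
% Writing \hat\beta_S \approx \beta_S + \varepsilon_S and
% \hat\beta_B \approx \beta_S + \delta + \varepsilon_B with independent errors of
% variance V_S and V_B, the second-order MSE of \hat\beta(W) is
\mathrm{MSE}(W) \;\approx\; \mathrm{tr}\!\left(W V_S W^\top\right)
   + \mathrm{tr}\!\left((I - W)\,(V_B + \delta\delta^\top)\,(I - W)^\top\right),
% and setting the matrix derivative with respect to W to zero gives
W^{\ast} \;=\; \left(V_B + \delta\delta^\top\right)\left(V_S + V_B + \delta\delta^\top\right)^{-1}.
```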
Theorem 2.
The higher-order optimal weight $W_E$ and its estimated version $\hat W_E$ approximately minimize the mean square error $E\|\hat\beta(W) - \beta_S\|^2$, up to an approximation error of even higher order, in the sense that
Similarly, for Theorem 2, we can use the higher-order expansion of $\hat\beta_S$ and $\hat\beta_B$ around $\beta_S$ and $\beta_B$, whose approximation error is of smaller order than that of the second-order expansion; the approximation errors for the mean square error of $\hat\beta(W)$ and for the mean square prediction error are then of the same smaller order. In addition, we can show that the difference between the MSE based on this estimated optimal weight and that based on the optimal oracle weight is of a negligible order, so that the final approximation error remains of the stated order.
Theorem 3.
Under the conditions on the bias and the sample sizes detailed in Appendix A, the weighted estimator based on the estimated higher-order optimal weights is more efficient than using the small data only, i.e., the corresponding MSE inequality holds asymptotically as $n_S \to \infty$.
For Theorem 3, since using the small data only is equivalent to using the weight matrix $I$, we can show that the improvement from the estimated higher-order optimal weight is always larger than that from the estimated second-order optimal weight, which is of an order larger than the approximation error; thus, the approximation error can be ignored when comparing the weighted estimator with the small-data-only estimator.
5. Analysis of ACC Data
The Asia Cohort Consortium (ACC) is a collaborative effort born out of the need to study the Asian population, seeking to understand the relationship between genetics, environmental exposures, and the etiology of a disease through the establishment of a cohort of at least one million healthy people around different countries in Asia, followed over time to various disease endpoints and death. This pooling project, with its huge sample size across 29 subcohorts from 10 Asian countries (https://www.asiacohort.org/ParticipatingCohorts/index.html (accessed on 23 March 2017)), provides the perfect opportunity to explore informative relationships (association of exposure with disease, genome variability with disease, etc.) among major Asian ethnic groups.
Over the last few decades, obesity has become an important health issue in many countries. According to World Health Organization estimates, more than a billion adults around the world are overweight, and at least 300 million of them are obese (see [18]). Many epidemiological studies have found an association between the body mass index (BMI) and a variety of health outcomes, including mortality (see [19]). However, most of these inferences have been drawn from studies in populations of European origins, and very little focus has been given to the relationship between BMI and the overall risk of death among Asians, who account for more than 60% of the world population (see [20]). The data collected as part of the ACC can be used to answer these important questions.
To show the usefulness of our proposed methodology in a practical setting, we use data from the ACC to explore the relationship between BMI and mortality. In particular, we concentrate only on the cohorts from China—data from the Shanghai Cohort Study (SCS) are used to form our small data—while data from the rest of the Chinese subcohorts—China Hypertension Survey Epidemiology Follow-up Study (CHEFS), Linxian General Population Trial Cohort, Shanghai Men’s Health Study (SMHS), and Shanghai Women’s Health Study (SWHS)—are pooled together to form the initial big data. Since the SCS cohort only included males, we decided to restrict the big data to include only male participants from the other subcohorts (which completely excluded the SWHS). For individuals in the small data, enrollment started in 1986 and the study continued until 2007, while for the pooled large data, enrollment started in 1985, and the last year of follow-up was 2011. Missingness in covariates is not a big concern (no missingness in the small data and only 0.79% missingness in the large data). The baseline age distribution of the individuals is found to be different in the small and the large data, and since mortality strongly depends on age, for better comparability, we decided to restrict the two datasets to individuals whose baseline age was between 50 and 60. Because the methods described in this paper pertain to binary outcomes only, and the time to follow-up varies for different individuals in the two datasets, we decided to only consider the first year of follow-up for each individual. Firstly, this makes the binary mortality statuses comparable for individuals in the two datasets, and secondly, the short period of follow-up ensures that not too many individuals are lost to follow-up. Such individuals form only 0.06% of the small data and 2.88% of the large data and are removed from the analysis. After performing all these data management steps, the small data are found to contain 10,675 individuals with 40 mortality events, while the large data are found to contain 46,779 individuals with 206 events. Apart from BMI, baseline age is also included as a covariate in the model, as well as indicators for each individual’s smoking and drinking habits, as these covariates have been shown to be important predictors of mortality in many settings.
We start by analyzing the small and the big data separately, using the standard logistic regression model, and then by pooling them together. We then estimate the regression coefficients using the proposed weighted shrinkage methods, namely, with the optimal second-order weights (W2) and the optimal higher-order Edgeworth weights (WE), and, for comparison, with the James–Stein weights (JSP). We also obtain the penalized estimates using the PL1 and PL2 procedures.
The estimates, their standard errors, and the estimated mean square errors for the various procedures are presented below in Table 3. Unlike in the simulations, we cannot know the true bias, and thus, we cannot obtain the true MSE. To illustrate the efficiency gain of the proposed estimators, we estimate the MSE for each method by approximating the bias by the difference between the point estimate of that method and the small data estimate. From the table, we can see that the two proposed weighting estimators W2 and WE provide smaller estimated MSEs than the estimator using the small data only. These numerical results match the theoretical results from Theorem 3, as expected, since the sample sizes $n_S$ and $n_B$ are large enough here for the asymptotic results to hold. In addition, we notice that, except for the intercept, the two proposed weighting estimators W2 and WE also provide smaller estimated MSEs than the other alternative estimators, which is consistent with the simulation results. As can be seen, the pooled procedure obtains the lowest standard errors, as expected, because it uses the entirety of the big and the small data; but this also means that the estimates from this procedure are inherently biased towards the ones obtained from the big data, as the big data contain much more information (than the small data) because of their size, so naive pooling inappropriately shifts most of the focus to the big data. The weighted shrinkage procedures seem to be better adjusted in this respect, with estimates shrunk somewhat but much closer to the ones from the small data, while having much lower standard errors than the small data estimates. The proposed optimal second-order and higher-order Edgeworth weights (W2 and WE) perform similarly in this regard and have lower standard errors than the estimates from the James–Stein-adjusted weights (JSP), except for BMI, in which case the JSP procedure obtains a lower standard error than W2 or WE; however, the estimate for BMI obtained by JSP is shrunk completely to the one obtained from the big data, which implies a potentially large bias. Similarly, for the other variables, the JSP estimates are farther from the small data estimates than our two proposed weighted estimators, which suggests a potentially larger bias. Among the penalized procedures, one borrows more strength from the big data and thus has lower standard errors and a higher amount of shrinkage, which leads to potentially higher bias, while the estimates from the other are closer to the small data estimates and thus have higher standard errors compared with the proposed weighted estimators. In general, the two proposed weighted estimators show a better balance of bias and SD in this example.
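For clarity, the estimated MSE reported in Table 3 can be computed as in the short R sketch below (the vector names est_method, est_small, and se_method are hypothetical).

```r
## Estimated MSE: squared distance from the small-data estimate (a proxy for
## the bias) plus the squared model-based standard error.
est_mse <- (est_method - est_small)^2 + se_method^2
```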
Table 3.
Estimates (Est), their standard errors (Std Err), and estimated mean square error (Est MSE) from ACC data analysis.
6. Discussion
In this paper, we proposed better estimators that allow more accurate estimation of the regression coefficients and more accurate risk prediction for our target population using information from another, different population with more observations. Although the expansion and the detailed form of the weight we provided are specific to logistic regression, the optimal weight formula is general in terms of the expansion terms (the C, D, and E terms defined in Appendix A). Therefore, the framework we proposed here could be extended straightforwardly to generalized linear models and estimating equation models, though the more involved Edgeworth expansions for these estimating-equation-based estimators would need to be derived to obtain the optimal weights.
To utilize the big data, we do not need to know the exact relationship between the small and the big data in terms of association strength, but we do need the model form in the big data to be correctly specified. In our setting, the same logistic form needs to hold for both the big and the small data. When the covariates are few and categorical, this assumption is weak and easy to satisfy. When there are continuous covariates, we can apply existing model-checking tools to the big data to check whether our model assumption holds, as in the sketch below.
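As one concrete option (our suggestion, not a tool named in the paper), a Hosmer–Lemeshow goodness-of-fit test from the R package ResourceSelection could be applied to the big-data fit; the variable names below are hypothetical.

```r
## A possible check of the logistic model form in the big data.
library(ResourceSelection)

fit_B <- glm(death ~ age + bmi + smoke + alcohol, data = big_data, family = binomial())
hoslem.test(fit_B$y, fitted(fit_B), g = 10)   # a small p-value suggests lack of fit
```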
In our analysis of the ACC data, we only concentrated on the first year of follow-up, because the methods presented in this paper are only relevant for binary outcomes, and the short period of follow-up ensured that we did not lose too many individuals to loss to follow-up, which would have otherwise introduced unforeseen sources of bias in our analysis. However, in doing so, we lost a lot of rich information that is contained in the time-to-follow-up data. This shows the need to extend our methods to the case when we have time-to-event data, and this indeed is one of our future research goals.
Author Contributions
Conceptualization, C.Z. and Y.-Q.C.; methodology, C.Z. and Y.-Q.C.; software and data analysis, S.D., Y.X. and A.H.; resources, Y.-Q.C.; data curation, Y.-Q.C.; writing—original draft preparation, C.Z., S.D., Y.X. and A.H.; writing—review and editing, C.Z., S.D., Y.X., A.H. and Y.-Q.C. All authors have read and agreed to the published version of the manuscript.
Funding
This research was partly funded by the National Institute of General Medical Sciences, U54 GM115458.
Data Availability Statement
Restrictions apply to the availability of these data. Data were obtained from ACC and are available from the corresponding author at https://www.asiacohort.org/ParticipatingCohorts/index.html (accessed on 23 March 2017) in a collaboration mode with the permission of ACC.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| MSE | Mean square error |
| ACC | Asia Cohort Consortium |
| BMI | Body mass index |
| LASSO | Least Absolute Shrinkage and Selection Operator regression |
| SCAD | Smoothly Clipped Absolute Deviation |
| MCP | Minimax Concave Penalty |
Appendix A
Appendix A.1. High-Order Expansion for Logistic Regression
In this section, we derive the high-order asymptotic expansion for the maximum likelihood estimator based on n i.i.d. observations sampled from the logistic regression model. The small and big data expansions can be obtained by replacing $X$ with $X_S$ and $X_B$, replacing $\beta$ with $\beta_S$ and $\beta_B$, and replacing n with $n_S$ and $n_B$, respectively.
For logistic regression, the score function takes the standard form. After appropriate standardization, the score converges to a standard normal distribution. So, following Sun, Loader, and McCormick [17], we have the following expansion, where
with
where is a matrix with the th element equal to .
Similarly, we have , where
where
and
where
where is a matrix with the th element equal to and
Now, we have , where
where
and
where
where
So, we have
and
Using the subscripts S and B, we can obtain the corresponding expansion terms for the small data and the big data, and we can estimate the means, variances, and covariances of these terms using their empirical versions.
Appendix A.2. Proof of Theorem 1
Proof of Theorem 1.
Under second-order approximation, we have
which the second-order optimal weight minimizes, as a quadratic form in W, when the remainder term is ignored. Denote the true optimal weight and the corresponding optimal value accordingly; then, we have
For the main part, the same weight minimizes both the estimation MSE and the prediction MSE with the same approximation error rate. This finishes the first part of the proof.
Note that both $\hat\beta_S$ and $\hat\beta_B$ are consistent, so the estimated weight converges to the optimal weight. This leads to the conclusion
and assume is bounded; then, we have
□
Appendix A.3. Proof of Theorem 2
Proof of Theorem 2.
Using the higher-order expansions and the notation introduced in Section 2.3, the mean square error for a weight matrix W will be
This can be simplified as
where
So, the higher-order optimal weight minimizes this expression. Denote the true optimal weight and the corresponding optimal value accordingly; then, we have
So, we finish the first part of the proof.
Following the proof of Theorem 1, the estimated quantities are consistent for their population counterparts, which leads to the result
□
Appendix A.4. Proof of Theorem 3
Proof of Theorem 3.
When the bias and the sample sizes are such that the improvement in MSE from using the oracle higher-order weight is of a larger order than the approximation error, then, combined with Theorem 2, we conclude that the estimated higher-order weight leads to a smaller MSE compared with using the small data only.
When, instead, the improvement is only of the same order as the approximation error (as can be seen from a Taylor expansion), we cannot directly apply Theorem 2 to conclude that the estimated higher-order weight is always superior to using the small data only.
To obtain the desired conclusion, we need to show that, ignoring the approximation error term, the estimated weight leads to a smaller main term compared with using the small data only, i.e.,
It is clear that the main estimation error in the estimated weight comes from its leading term, while the estimation errors of the other parts are asymptotically negligible. The improvement from the higher-order approximation is asymptotically larger than that from the second-order approximation, and the improvement from the second-order approximation part is shown to always be positive and of a non-negligible order (see Section 9.7 of [2]). This proves the desired inequality, and thus, we have
asymptotically. □
References
- Stein, C.M. Estimation of the mean of a multivariate normal distribution. Ann. Stat. 1981, 9, 1135–1151. [Google Scholar] [CrossRef]
- Chen, A.; Owen, A.B.; Shi, M. Data enriched linear regression. Electron. J. Stat. 2015, 9, 1078–1112. [Google Scholar] [CrossRef]
- Gross, S.M.; Tibshirani, R. Data shared Lasso: A novel tool to discover uplift. Comput. Stat. Data Anal. 2016, 101, 226–235. [Google Scholar] [CrossRef] [PubMed]
- Chatterjee, N.; Chen, Y.; Maas, P.; Carroll, R.J. Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. J. Am. Stat. Assoc. 2016, 111, 107–117. [Google Scholar] [CrossRef] [PubMed]
- Cheng, W.; Taylor, J.M.G.; Gu, T.; Thomlins, S.A.; Mukherjee, B. Informing a risk prediction model for binary outcomes with external coefficient information. J. R. Stat. Soc. Ser. C 2019, 68, 121–139. [Google Scholar] [CrossRef] [PubMed]
- Efron, B.; Morris, C. Stein’s estimation rule and its competitors—An empirical Bayes approach. J. Am. Stat. Assoc. 1973, 68, 117–130. [Google Scholar]
- James, W.; Stein, C. Estimation with quadratic loss. Proc. Fourth Berkeley Symp. Math. Statist. Prob. 1961, 1, 361–379. [Google Scholar]
- Friedman, J.; Hastie, T.; Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 2008, 33, 1–22. [Google Scholar] [CrossRef]
- Efron, B.; Hastie, T.; Johnstone, L.; Tibshirani, R. Least angle regression (with discussion). Ann. Stat. 2004, 32, 407–499. [Google Scholar] [CrossRef]
- Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [Google Scholar] [CrossRef]
- Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320. [Google Scholar] [CrossRef]
- Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
- Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef] [PubMed]
- Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 2010, 4, 40–79. [Google Scholar] [CrossRef]
- Hall, P. The Bootstrap and Edgeworth Expansion; Springer: New York, NY, USA, 1992. [Google Scholar]
- An, L.; Fung, K.Y.; Krewski, D. Mining pharmacovigilance data using Bayesian logistic regression with James-Stein type shrinkage estimation. J. Biopharm. Stat. 2010, 20, 998–1012. [Google Scholar] [CrossRef]
- Sun, J.; Loader, C.; McCormick, W.P. Confidence bands in generalized linear models. Ann. Stat. 2000, 28, 429–460. [Google Scholar] [CrossRef]
- Abelson, P.; Kennedy, D. The obesity epidemic. Science 2004, 304, 1413. [Google Scholar] [CrossRef] [PubMed]
- Haslam, D.W.; James, W.P. Obesity. Lancet 2005, 366, 1197–1209. [Google Scholar] [CrossRef] [PubMed]
- Zheng, W.; McLerran, D.F.; Roll, B.; Zhang, X.; Inoue, M.; Matsuo, K.; He, J.; Gupta, P.C.; Ramadas, K.; Tsugane, S.; et al. Association between body-mass index and risk of death in more than 1 million Asians. N. Engl. J. Med. 2011, 364, 719–729. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).