1. Introduction
A challenging statistical problem is the estimation of lower and upper tail probabilities from a given small data set. Challenging as it is, the problem becomes even more arduous when the data set lacks information about lower or upper tail data to the extent that the use of the empirical distribution becomes problematic. This calls for additional data in some form. In this study, the fusion of data from multiple sources allows us to compute tail probabilities, which could not otherwise be obtained due to the lack of lower or upper tail data.
In particular, if the data from a certain source all exceed a sufficiently high threshold, then information about values below the threshold can be obtained by fusion with other sources that do have data below the threshold. Likewise, a source whose data all fall below a given threshold must be fused with data sources containing data above the threshold. Our approach is particularly useful when the sample sizes are relatively small and yet probabilities of unusual or extreme values are of interest [1]. Here, we present a multivariate extension of our methodology and demonstrate its application using small pertussis case count data sets.
Pertussis, or whooping cough, is an acute infectious disease of the respiratory tract caused by the gram-negative bacterium Bordetella pertussis. It is highly contagious and is transmitted from infected to susceptible individuals by airborne droplets produced by coughing and sneezing. Pertussis affects people of all ages but can be serious in infants less than 1 year of age and causes 195,000 infant deaths annually, mostly in developing countries. The global estimates in 2014 were 24.1 million cases and 160,700 deaths from the disease among children below five years of age [2]. Pertussis is endemic in all countries, and epidemics tend to occur every two to five years in North America and Europe [3].
After widespread vaccination began in the U.S. in the 1940s, the number of new infections fell to 10,000–40,000 reported cases of pertussis each year, a 100-fold reduction in the incidence of the disease that made it a likely candidate for elimination. However, since the mid-1970s, pertussis incidence has steadily increased [4]. In 2012, 48,277 pertussis cases were reported in the U.S. (an incidence rate of 15.1 per 100,000), the largest number since 1955 [5]. In Washington state alone, more than 4600 pertussis cases were reported in 2012, mostly among infants aged less than 1 year and children aged 10 years [6]. The incidence of the disease among adolescents aged 13–14 years and among adults, including those previously vaccinated, has also increased, suggesting early waning of vaccine-acquired immunity.
While vaccination remains the most effective means of preventing illness, pertussis has re-emerged in countries that have sustained high vaccine coverage. In the U.S., pertussis has been a reportable disease since 1922, and case-based surveillance data are available through the National Notifiable Diseases Surveillance System (NNDSS) of the Centers for Disease Control and Prevention (CDC) and, additionally, the Enhanced Pertussis Surveillance (EPS) in seven states [7]. This re-emergence is attributed to several factors, including changes in diagnostic testing and reporting, increased awareness, mismatch between vaccine antigens and circulating strains, reduced duration of immunity from the acellular pertussis (aP) vaccines that replaced whole-cell vaccines in the U.S. during the 1990s, and changes in the B. pertussis organism at the molecular level [7].
During the 2012 pertussis outbreak in Washington state, it was observed that the incidence was highest in infants aged <1 year and in children aged 10, 13 and 14 years [6]. The statewide incidence rate was higher among Hispanics than non-Hispanics [6]. Household size [7] and vaccination coverage [8] have been considered among the risk factors of the disease. We have noted such risk factors in Table A1.
Apart from the analysis of factors that affect the resurgence of pertussis, forecasting upper and lower joint tail probabilities of high incidence in a given period of time is another key topic of interest to epidemiologists. While a variety of methods for modeling pertussis incidence have been proposed in recent years [9, 10], here we present a method for forecasting both univariate and multivariate joint tail probabilities by fusing pertussis count data obtained from neighboring counties in Washington state. Our approach is based on the so-called density ratio model with variable tilts, presented here with a multivariate extension, which is the novel contribution of this study.
2. Density Ratio Model
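Given $m+1$ independent $p$-dimensional multivariate random samples $X_j = (x_{j1}, \ldots, x_{jn_j})$, $j = 0, 1, \ldots, m$, where the $n_j$'s are the corresponding sample sizes, suppose that $X_j$ has a density $g_j$ for $j = 0, 1, \ldots, m$, where the $g_j$ satisfy the density ratio structure
$$\frac{g_j(x)}{g_0(x)} = \exp\{\alpha_j + \beta_j^\top h_j(x)\}, \quad j = 1, \ldots, m,$$
where $h_j(x)$ is referred to as a tilt function or simply tilt. The sample $X_0$ is referred to as the reference sample and the rest of the samples are referred to as tilted samples.
Let $\alpha = (\alpha_1, \ldots, \alpha_m)^\top$, $\beta = (\beta_1^\top, \ldots, \beta_m^\top)^\top$ and $\theta = (\alpha^\top, \beta^\top)^\top$. Let $w_j(x) = \exp\{\alpha_j + \beta_j^\top h_j(x)\}$ and $\rho_j = n_j / n_0$. Denote the combined sample by $t = (t_1, \ldots, t_n)$ with the corresponding sample size $n = n_0 + n_1 + \cdots + n_m$. Let $G$ be the reference cumulative distribution function that corresponds to the density $g_0$. The empirical likelihood function can be written as
$$L(\theta, G) = \prod_{i=1}^{n} p_i \prod_{j=1}^{m} \prod_{i=1}^{n_j} w_j(x_{ji}),$$
with constraints
$$\sum_{i=1}^{n} p_i = 1, \qquad \sum_{i=1}^{n} p_i \left[ w_j(t_i) - 1 \right] = 0, \quad j = 1, \ldots, m,$$
where $p_i = dG(t_i)$ is the jump of $G$ at $t_i$. By profiling, the $p_i$'s that maximize the empirical likelihood are given by
$$p_i = \frac{1}{n_0} \cdot \frac{1}{1 + \sum_{j=1}^{m} \rho_j w_j(t_i)}.$$
Therefore, the likelihood becomes a function of $\theta$ only and we can find the estimator $\hat{\theta}$ that maximizes the likelihood. Subsequently, the estimator of $p_i$ is obtained as
$$\hat{p}_i = \frac{1}{n_0} \cdot \frac{1}{1 + \sum_{j=1}^{m} \rho_j \hat{w}_j(t_i)}, \qquad \hat{w}_j(t_i) = \exp\{\hat{\alpha}_j + \hat{\beta}_j^\top h_j(t_i)\}.$$
It can be shown that $\hat{\theta}$ has the asymptotic normal distribution
$$\sqrt{n}\,(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, \Sigma)$$
as $n \to \infty$, where $\theta_0$ denotes the true parameter and $\Sigma$ the asymptotic covariance matrix. Details can be found in [11, 12, 13, 14, 15].
The estimated $\hat{G}$ is obtained from the accumulation of the $\hat{p}_i$'s,
$$\hat{G}(t) = \sum_{i=1}^{n} \hat{p}_i \, I(t_i \le t).$$
In the above expression for $\hat{G}$, replacing $\hat{p}_i$ by $1/n_0$ for the observations in the reference sample and by $0$ otherwise, we get the reference empirical distribution $\tilde{G}$.
The selection of the tilts $h_j$ can be based on [16, 17, 18].
A flowchart is provided in Figure 1 to illustrate the steps in the data fusion analysis.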
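As a minimal sketch of the estimation steps in this section, the Python code below fits the model in the simplest setting: one tilted sample (m = 1), univariate data, and a single scalar tilt assumed here to be h(x) = x. In that setting the profile log-likelihood is concave in (alpha, beta), so a damped Newton iteration is one reasonable way to maximize it. The function names and the two small samples are hypothetical and are not taken from the study; the study's own software and tilts may differ.

```python
import math

def fit_density_ratio(ref, tilted, h=lambda x: x, max_iter=100):
    """Fit g1(x) = exp(alpha + beta*h(x)) * g0(x) for one tilted sample
    by damped Newton ascent on the profile log-likelihood."""
    pooled = list(ref) + list(tilted)          # combined sample t_1..t_n
    n0, n1 = len(ref), len(tilted)
    rho = n1 / n0
    hs = [h(x) for x in pooled]
    sum_h1 = sum(h(x) for x in tilted)

    def loglik(a, b):
        # Profile log-likelihood up to an additive constant.
        return n1 * a + b * sum_h1 - sum(
            math.log1p(rho * math.exp(a + b * u)) for u in hs)

    a = b = 0.0
    for _ in range(max_iter):
        # pi_i = rho*w(t_i) / (1 + rho*w(t_i))
        pi = [1.0 - 1.0 / (1.0 + rho * math.exp(a + b * u)) for u in hs]
        ga = n1 - sum(pi)                                   # score for alpha
        gb = sum_h1 - sum(p * u for p, u in zip(pi, hs))    # score for beta
        if abs(ga) + abs(gb) < 1e-9:
            break
        v = [p * (1.0 - p) for p in pi]
        haa = sum(v)
        hab = sum(vi * u for vi, u in zip(v, hs))
        hbb = sum(vi * u * u for vi, u in zip(v, hs))
        det = haa * hbb - hab * hab
        da = (hbb * ga - hab * gb) / det                    # Newton direction
        db = (haa * gb - hab * ga) / det
        step, base = 1.0, loglik(a, b)
        # Halve the step until the log-likelihood does not decrease.
        while step > 1e-8 and loglik(a + step * da, b + step * db) < base - 1e-12:
            step /= 2.0
        a, b = a + step * da, b + step * db

    # Estimated jumps p_i of the reference distribution at the pooled points.
    weights = [1.0 / (n0 * (1.0 + rho * math.exp(a + b * u))) for u in hs]
    return a, b, pooled, weights

def tail_probability(pooled, weights, threshold):
    """Estimated reference probability of exceeding the threshold."""
    return sum(p for t, p in zip(pooled, weights) if t > threshold)

# Hypothetical samples: the reference sample has no value above 8.
ref = [1, 2, 2, 3, 3, 4, 5, 6, 7, 8]
tilted = [5, 6, 7, 8, 9, 10, 11, 12, 15, 20]
alpha, beta, pooled, weights = fit_density_ratio(ref, tilted)
print(round(sum(weights), 6))                 # jumps should sum to about 1
print(tail_probability(pooled, weights, 8))   # positive despite no reference data above 8
```

The estimated jumps sum to one at the maximizer because the score equation for alpha enforces the first constraint, which gives a quick sanity check on convergence. The iteration assumes the two samples overlap; with completely separated samples no finite maximizer exists.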
3. Application: County-level Pertussis Cases in Washington State
We collected Washington state county-level annual counts of pertussis cases from 1997 to 2018 (Washington Department of Health website, https://www.doh.wa.gov/ (accessed on 1 March 2021)). For each county, we have a sample of size 22. Without any distributional assumption, when county tail data are available we can estimate tail probabilities from the empirical distribution. However, such estimation is not feasible if tail data are absent. For example, from Table 1 we see that no observation exceeds 30 in Jefferson County, so estimating the chance of exceeding the threshold of 30 from the empirical distribution is not viable.
Nevertheless, the estimation of this probability is possible via the density ratio model if we fuse the sample from Jefferson county with samples from the counties of Cowlitz and Snohomish for which sufficient amounts of data above 30 are available.
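The limitation of the empirical distribution can be illustrated with a short Python sketch. The two series below are hypothetical annual counts of length 22, not the Washington data; one never exceeds 30 and the other does so in several years.

```python
def empirical_tail(sample, threshold):
    """Fraction of observations strictly above the threshold."""
    return sum(x > threshold for x in sample) / len(sample)

# Hypothetical annual counts (22 years each), for illustration only.
low_county = [3, 0, 5, 12, 7, 1, 9, 4, 2, 15, 6,
              8, 0, 11, 3, 27, 5, 2, 10, 4, 6, 1]
high_county = [12, 45, 8, 33, 60, 21, 17, 95, 28, 14, 39,
               52, 9, 24, 71, 30, 18, 41, 26, 110, 35, 22]

print(empirical_tail(low_county, 30))   # no exceedances, so the estimate is 0
print(empirical_tail(high_county, 30))  # a positive fraction of years
```

For the first series the empirical estimate is exactly zero for every threshold at or above its maximum, which is uninformative rather than evidence that exceedance is impossible.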
3.1. Univariate Analysis
The sample from 0-Jefferson is taken as the reference while the samples from 1-Cowlitz and 2-Snohomish are tilted with tilts
as suggested in [14]. Using the fused data from the three counties, and appealing to the density ratio model, tail probabilities for Jefferson County are given in Table 2 for thresholds 30, 40 and 50. As discussed above, these tail probabilities cannot be estimated by the empirical distribution for lack of tail data.
To validate the model, we used the graphical goodness-of-fit check discussed in [15]. The idea is to see whether the points pairing the estimated reference distribution $\hat{G}$ with the reference empirical distribution $\tilde{G}$, evaluated at the same arguments, lie on or close to a 45° line. From the goodness-of-fit graph in Figure 2, we see that some points lie close to the 45° line while others do not, pointing to a possible lack of fit. Moreover, little improvement was observed when different tilt functions were used. To resolve this question about the suitability of the density ratio model, we turn to the multivariate version of the model, where a somewhat richer class of possible tilts is used. As we shall see in the next section, this leads to a remarkable improvement in the fit.
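The graphical check can be sketched in Python as follows. Given estimated jumps at the pooled observations, the code evaluates both the estimated reference distribution and the reference empirical distribution at each distinct pooled value and reports the largest vertical distance from the 45° line. The function names, the tiny data set and the weights are all hypothetical; in practice the weights would come from a fitted density ratio model.

```python
def gof_pairs(ref, pooled, weights):
    """Pairs (G_tilde(t), G_hat(t)) at each distinct pooled value t."""
    n0 = len(ref)
    pairs = []
    for t in sorted(set(pooled)):
        g_tilde = sum(x <= t for x in ref) / n0                       # empirical CDF of the reference
        g_hat = sum(p for u, p in zip(pooled, weights) if u <= t)     # accumulated estimated jumps
        pairs.append((g_tilde, g_hat))
    return pairs

def max_deviation(pairs):
    """Largest absolute distance of the pairs from the 45-degree line."""
    return max(abs(a - b) for a, b in pairs)

# Hypothetical reference sample, pooled sample and weights (weights sum to 1).
ref = [1, 2, 3, 4]
pooled = ref + [3, 5, 6, 7]
weights = [0.22, 0.20, 0.16, 0.13, 0.16, 0.07, 0.04, 0.02]

pairs = gof_pairs(ref, pooled, weights)
print(pairs)
print(max_deviation(pairs))
```

Plotting the pairs against the line y = x gives the visual version of the same check; the maximum deviation is only a rough numerical summary and not a formal test.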
3.2. Multivariate Analysis
We took 3-dimensional (that is, $p = 3$) samples from three different regions: 0-(Grays Harbor, Jefferson, Clallam), 1-(Clark, Cowlitz, Lewis), 2-(King, Snohomish, Skagit). Within each region, the counties are ordered from the most to the least populated. We thus obtained three 3-dimensional multivariate random samples, each of size 22, with the sample from (Grays Harbor, Jefferson, Clallam) taken as the reference sample. The summary statistics of the nine counties are shown in Table 3.
We initiated tilt selection with
suggested in [14, 15]. The tilts selected were
and
giving the smallest AIC = 483.22 as shown in
Table 4. The pairs formed by $\hat{G}$ and $\tilde{G}$ in Figure 3 fall along the 45° line, indicating a good fit (the estimate $\hat{G}$ is close to the empirical distribution $\tilde{G}$).
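Tilt selection by AIC can be sketched as below, using the standard definition AIC = -2 × (maximized log-likelihood) + 2 × (number of parameters). The candidate labels, log-likelihood values and parameter counts are hypothetical placeholders, not results from this study; with two tilted samples, a tilt of dimension d is assumed to contribute 2(1 + d) parameters.

```python
def aic(loglik, n_params):
    """Akaike information criterion: smaller is better."""
    return -2.0 * loglik + 2.0 * n_params

# Hypothetical candidates: label -> (maximized log-likelihood, tilt dimension d).
candidates = {
    "tilt A (d = 3)": (-245.0, 3),
    "tilt B (d = 6)": (-238.0, 6),
    "tilt C (d = 9)": (-237.5, 9),
}

m = 2  # number of tilted samples
scores = {name: aic(ll, m * (1 + d)) for name, (ll, d) in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print(best)
```

In this illustration the richest tilt attains the highest log-likelihood but is penalized for its extra parameters, so the intermediate candidate has the smallest AIC.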
Table 5 reports the estimates of several selected joint threshold probabilities, which can be regarded as predictions for a future year. It is worth noting that the selected probabilities cannot be estimated by the empirical distribution $\tilde{G}$ due to the lack of observations, whereas estimation becomes feasible by fusing data from the other two regions.
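Once the estimated jumps at the pooled multivariate observations are available, a joint exceedance probability is obtained by summing the jumps over the points whose coordinates all exceed their thresholds. The Python sketch below assumes this "all coordinates exceed" form of a joint threshold probability; the 3-dimensional points and weights are hypothetical and serve only to show the computation.

```python
def joint_tail(points, weights, thresholds):
    """Sum of jumps over points with every coordinate above its threshold."""
    return sum(p for x, p in zip(points, weights)
               if all(xi > c for xi, c in zip(x, thresholds)))

# Hypothetical pooled 3-dimensional observations and estimated jumps (sum to 1).
points = [(2, 1, 0), (5, 3, 2), (12, 8, 4),
          (40, 35, 10), (60, 20, 15), (90, 70, 30)]
weights = [0.35, 0.30, 0.20, 0.08, 0.05, 0.02]

print(joint_tail(points, weights, (30, 15, 5)))
print(joint_tail(points, weights, (50, 30, 20)))
```

Other joint events, such as one coordinate above and another below a threshold, can be handled by changing the condition inside the sum.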
4. Discussion
Our data fusion approach allows us to combine information from multiple sources that can together describe dynamic and multifactorial phenomena more comprehensively than a single source alone. Infectious disease dynamics are well suited to such integrative modeling, since during an outbreak a county is usually affected by its neighboring counties, especially in populated areas, due to population mobility [19]. As the re-emergence of pertussis in the U.S. and Europe in recent years has shown, it is important to have the modeling capacity to predict the incidence of the disease even when the available data sets are small and, by themselves, inadequate for the precise estimation of tail probabilities.
The multivariate density ratio model described in this study allowed us to examine the joint behavior of pertussis resurgence in adjacent counties. The model was validated by goodness-of-fit plots. Importantly, the observed support of the reference distribution of cases was enlarged by fusing the reference sample with data from nearby regions under the density ratio model. While time series modeling of disease incidence is common in epidemiology, few methods can augment small or moderate data sources so as to yield multivariate tail probabilities and confidence intervals, which cannot otherwise be estimated.
In future work, we plan to further enrich our model with regional covariates to provide key insights for disease surveillance and public health researchers. For instance, the risk factors of pertussis cases that are studied in the U.S. include household size, vaccination coverage and demographics (see Table A1). Such factors are observed with regional variation that is often spatially clustered across neighboring counties [20]. Indeed, data fusion is well suited to the systematic modeling of regions with socioeconomic, political or cultural overlap (e.g., school districts) that are characterized by nonmedical vaccine exemptions, migration and vaccine refusal [21, 22, 23]. In times of increasingly common vaccine hesitancy, such applications could be very effective for public health.
While the world is currently experiencing the COVID-19 pandemic, pertussis is, in comparison, an ancient disease that was recognized even in the Middle Ages. Connections between these diseases have recently been considered [24], but examining them is beyond the scope of the present study. However, the multivariate approach that we used for the fusion of inter-county pertussis data could also be applied to other regionally transmissible diseases, including COVID-19. We leave this to future studies.