Multivariate Tail Probabilities: Predicting Regional Pertussis Cases in Washington State

In disease modeling, a key statistical problem is the estimation of lower and upper tail probabilities of health events from given data sets of small size and limited range. Assuming such constraints, we describe a computational framework for the systematic fusion of observations from multiple sources to compute tail probabilities that could not be obtained otherwise due to a lack of lower or upper tail data. The estimation of multivariate lower and upper tail probabilities from a given small reference data set that lacks complete information about such tail data is addressed in terms of pertussis case count data. Fusion of data from multiple sources in conjunction with the density ratio model is used to give probability estimates that are non-obtainable from the empirical distribution. Based on a density ratio model with variable tilts, we first present a univariate fit and, subsequently, improve it with a multivariate extension. In the multivariate analysis, we selected the best model in terms of the Akaike Information Criterion (AIC). Regional prediction, in Washington state, of the number of pertussis cases is approached by providing joint probabilities using fused data from several relatively small samples following the selected density ratio model. The model is validated by a graphical goodness-of-fit plot comparing the estimated reference distribution obtained from the fused data with that of the empirical distribution obtained from the reference sample only.


Introduction
A challenging statistical problem is the estimation of lower and upper tail probabilities from a given small data set. Challenging as it is, the problem becomes even more arduous when the data set lacks information about lower or upper tail data to the extent that the use of the empirical distribution becomes problematic. This calls for additional data in some form. In this study, the fusion of data from multiple sources allows us to compute tail probabilities, which could not otherwise be obtained due to the lack of lower or upper tail data.
In particular, if the data from a certain source exceed a sufficiently high threshold, then information about lower values below the threshold can be obtained by fusion with other sources that do have data below the threshold. The same holds for sources with data below a given threshold. This necessitates fusion with data sources containing data above the threshold. Our approach is particularly useful when the sample sizes are relatively small and yet probabilities of unusual or extreme values are of interest [1]. Here, we present a multivariate extension of our methodology and demonstrate its application using small pertussis case count data sets.
Pertussis, or whooping cough, is an acute infectious disease of the respiratory tract caused by the gram-negative bacterium Bordetella pertussis. It is highly contagious and transmitted from infected to susceptible individuals by airborne droplets due to coughing and sneezing. Pertussis affects people of all ages but can be serious in infants less than 1 year of age and causes 195,000 infant deaths annually, mostly in developing countries. The global estimates in 2014 were 24.1 million cases and 160,700 deaths from the disease among children below five years of age [2]. Pertussis is endemic in all countries and tends to occur every two to five years in North America and Europe [3].
After widespread vaccination began in the U.S. in the 1940s, the number of new infections reduced to 10,000-40,000 cases of pertussis reported each year, resulting in a 100-fold reduction in the incidence of the disease, thereby making it a likely candidate for elimination. However, since the mid 1970s, pertussis incidence has steadily increased [4]. In 2012, 48,277 pertussis cases were reported in the U.S. (an incidence rate of 15.1 per 100,000), the largest number since 1955 [5]. In Washington state alone, more than 4600 pertussis cases were reported in 2012, mostly among infants aged less than 1 year and children aged 10 years [6]. The incidence of the disease among adolescents of age 13-14 years and adults has also increased, including those previously vaccinated, suggesting early waning of vaccine-acquired immunity.
While vaccination remains the most effective means of preventing illness, pertussis has re-emerged in countries that have sustained high vaccine coverage. In the U.S., pertussis has been a reportable disease since 1922, and case-based surveillance data are available through the National Notifiable Diseases Surveillance System (NNDSS) of the Centers for Disease Control and Prevention (CDC) and, additionally, the Enhanced Pertussis Surveillance (EPS) in seven states [7]. The reasons for this re-emergence are attributable to several factors including changes in diagnostic testing and reporting, increased awareness, mismatch of vaccine antigens and circulating strains, reduced duration of immunity from acellular pertussis (aP) vaccines that replaced whole-cell vaccines in the U.S. during the 1990s, and changes in the B. pertussis organism at the molecular level [7].
During the 2012 pertussis outbreak in Washington state, it was observed that the incidence was highest in infants of age <1 year and children of age 10, 13 and 14 years [6]. The statewide incidence rate was higher among Hispanics than non-Hispanics [6]. Household size [7] and vaccination coverage [8] have been considered among the risk factors of the disease. We have noted such risk factors in Table A1.
Apart from the analysis of factors that affect the resurgence of pertussis, forecasting upper and lower joint tail probabilities of high incidence in a given period of time is another key topic of interest to epidemiologists. While a variety of methods for modeling pertussis incidence have been proposed in recent years [9,10], here we present a method for the forecasting of both univariate as well as multivariate joint tail probabilities using the fusion of pertussis count data obtained from neighboring counties in Washington state. Our approach is based on the so-called density ratio model with variable tilts presented here with a multivariate extension, which is the novel contribution of this study.

Density Ratio Model
Given m + 1 independent p-dimensional multivariate random samples X k = {X k1 , . . . , X kn k }, k = 0, . . . , m, where n k 's are the corresponding sample sizes. Suppose that X k has a density g k for k = 0, . . . , m, where the g k satisfy the density ratio structure where h k is referred to as a tilt functions or simply tilt. The sample X 0 is referred to as the reference sample and the rest of the samples are referred to as tilted samples. Let α = (α 1 , . . . , α m ) T , β = (β T 1 , . . . , β T m ) T and θ = (α T , β T ) T . Let w 0 (·; θ) = 1 and w k (·; θ) = exp (α k + β T k h k (·)). Denote the combined sample by t = {t 1 , . . . , t n } = {X 01 , . . . , X 0n 0 , . . . , X m1 , . . . , X mn m } with the corresponding samples size n = n 0 + · · · + n m . Let G 0 be the reference cumulative distribution function that corresponds to the density g 0 . The empirical likelihood function can be written as where p i is the jump of G 0 at t i . By profiling, the p i 's that maximize the empirical likelihood are given by . Therefore, the likelihood becomes a function of θ only and we can find the estimatorθ that maximizes the likelihood. Subsequently, the estimator of p i is obtained as .
The estimated G 0 is obtained from the accumulation of thep i 's, In the above expression forG 0 , replacingp i by 1/n we get the reference empirical distributionĜ 0 .
The selection of the tilts h k 's can be based on [16][17][18]. A flowchart in Figure 1 is provided to illustrate the steps in the data fusion analysis.

Start Choose a tilt function h(·).
Construct density ratio models with tilts taking the form h(·) and its reduced forms. Select the best model based on AIC.
Obtain empirical and density ratio estimate of reference CDF,Ĝ 0 andG 0 respectively. Is the selected model a good fit?
Calculate probability estimates based on the selected model.

Application: County-level Pertussis Cases in Washington State
We collected Washington state county-level annual data of the number of pertussis cases from 1997-2018 (Washington Department of Health Website https://www.doh.wa. gov/ (accessed on 1 March 2021)). For each county, we have a sample of size 22. Without any distribution assumption, when county tail data are available we can estimate tail probabilities from the empirical distribution. However, such an estimation is not feasible if tail data are absent. For example, from Table 1 we see that no observation exceeds 30 in Jefferson county so that estimating the chance of exceeding the threshold of 30 from the empirical distribution is not viable. Nevertheless, the estimation of this probability is possible via the density ratio model if we fuse the sample from Jefferson county with samples from the counties of Cowlitz and Snohomish for which sufficient amounts of data above 30 are available.

Univariate Analysis
The sample from 0-Jefferson is taken as the reference while the samples from 1-Cowlitz and 2-Snohomish are tilted with tilts h 1 (x) = h 2 (x) = x as suggested in [14]. Using the fused data from the three counties, and appealing to the density ratio model, tail probabilities for Jefferson County are given in Table 2 for thresholds 30, 40 and 50. As discussed above, these tail probabilities cannot be estimated by the empirical distribution for lack of tail data. Table 2. Selected joint probability estimates non-obtainable from the empirical distribution and the corresponding 95% confidence intervals. Here, t represents the annual pertussis cases in Jefferson. To validate the model, we used the graphical goodness-of-fit discussed in [15]. The idea is to see whether the points (Ĝ 0 ,G 0 ) lie on or close to a 45 • -line. From the goodness-of-fit graph in Figure 2, we see that some points lie not far from a 45 • -line while others do not, pointing to a possible lack of fit. Moreover, little improvement has been observed by using different tilt functions. To resolve this issue as to the suitability of the density ratio model, we turn to the multivariate version of the model, where a somewhat richer class of possible tilts is used. This leads to, as we shall see in the next section, remarkable improvement in the fit.

Multivariate Analysis
We took 3-dimensional (that is p = 3) samples from three different regions: 0-(Grays Harbor, Jefferson, Clallam), 1-(Clark, Cowlitz, Lewis), 2-(King, Snohomish, Skagit). The order for each region is from the most to the least populated. Therefore, we obtained three 3-dimensional multivariate random samples with sample sizes all equal to 22 where the sample from (Grays Harbor, Jefferson, Clallam) was considered as the reference sample. The summary statistics of the nine counties are shown in Table 3. We initiated tilt selection with [14,15]. The tilts selected were h 1 (x) = (x 1 , x 2 , x 3 ) T and h 2 (x) = (x 1 , x 3 ) T giving the smallest AIC = 483.22 as shown in Table 4. The 45 • -line formed by the pairs (Ĝ 0 ,G 0 ) in Figure 3 indicating a good fit (G 0 is closed to the empirical distributionĜ 0 ). Table 4. AIC values for different choices of h 1 and h 2 . A hyphen "-" indicates that h k (x) ≡ 0 and therefore g 0 and g k are identical for k = 1, 2.  We computed in Table 5 the estimates of several selected joint threshold probabilities, which can be regarded as predictions for a future year. It is worth noticing that the probabilities selected cannot be estimated by the empirical distributionĜ 0 due to the lack of observations while this is made feasible by fusing data from the other two regions. Table 5. Selected joint probability estimates non-obtainable from the empirical distribution and the corresponding 95% confidence intervals. Here, (t 1 , t 2 , t 3 ) represents the number of annual pertussis cases in (Grays Harbor, Jefferson, Clallam) respectively.

Discussion
Our data fusion approach allows us to combine information from multiple sources that can together describe dynamic and multifactorial phenomena more comprehensively than a single source alone. Infectious disease dynamics are ideally suited for such integrative modeling of an outbreak in which a county is usually affected by its neighboring counties, especially in populated areas, due to population mobility [19]. As the re-emergence of pertussis in the U.S. and Europe in recent years has shown, it is important to have the modeling capacity to predict the incidence of the disease even if the data are usually of small size, which are in themselves not adequate for the precise estimation of tail probabilities.
The multivariate density ratio model described in this study allowed us to examine the joint behavior of pertussis resurgence in adjacent counties. The model was validated by goodness-of-fit plots. Importantly, the observed support of the reference distribution of cases was enlarged by fusing the reference sample with data from nearby regions and applied to the density ratio model. While time series modeling of disease incidence is common in epidemiology, in the face of small or moderate data sources few methods can enhance their input to yield multivariate tail probabilities and confidence intervals, which are not possible to estimate otherwise.
In future work, we plan to further enrich our model with regional covariates to provide key insights for disease surveillance and public health researchers. For instance, the risk factors of pertussis cases that are studied in the U.S. include household size, vaccination coverage and demographics (see Table A1). Such factors are observed with regional variation that is often spatially clustered across neighboring counties [20]. Indeed, data fusion is well suited to the systematic modeling of regions with socioeconomic, political or cultural overlap (e.g., school districts) that are characterized by nonmedical vaccine exemptions, migration and vaccine refusal [21][22][23]. In times of increasingly common vaccine hesitancy, such applications could be very effective for public health.
While the world is currently seeing outbreaks of the COVID-19 pandemic, pertussis is, in comparison, an ancient disease, which was recognized even in the Middle Ages. While connections between these diseases have recently been considered [24], it is beyond the scope of the present study. However, the multivariate approach that we used for fusion of pertussis inter-county data could also be applied to other regionally transmissible diseases, including COVID-19. We leave this to future studies.
Author Contributions: All authors have contributed equally to the paper. This includes conceptualization, methodology, formal analysis, investigation, and writing. In addition, all authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.