1. Introduction
In regions lacking robust infectious disease surveillance mechanisms, health authorities can substantially benefit from utilizing detailed hospitalization records that typically include the patient’s demographic information, personal data, confirmed diagnosis, and history of illness [1,2,3]. With increasing computerization of medical records and diagnostic improvements, electronic hospitalization records (EHRs) can be effectively utilized for individualized treatment and healthcare management and as a tool for targeted disease surveillance [4,5,6,7]. In resource-poor areas, the use of EHRs for disease monitoring on local and regional scales could be of high value, especially if the patient profile, hospital capture geographic area, and sources of exposure are well understood.
Three location-based pieces of information are relevant to source-tracking of a disease: the patient’s place of residence (PoR), the place of exposure (PoE), and the place of health care (PoH) (Figure 1). PoE is typically determined based on patient recall and epidemiological investigations and ideally could be recorded in medical history or EHR. PoR is likely to be reported by the patient during admission into a hospital, which is often the PoH. In some situations of mild infection, two or more of these locations may be the same. For example, if the individual consumes contaminated water collected from a well close to their home, suffers diarrhea, and self-medicates with over-the-counter antidiarrheal drugs, we might conclude that PoR, PoH, and PoE are the same. For severe cases that require medical assistance and hospitalization, both PoR and PoH are known, but there may be substantial uncertainty regarding PoE. Detailed PoR and PoE information are often collected in investigations following disease outbreaks and may not be part of a standard diagnostic questionnaire.
Individual EHRs may contain information about PoE, PoR, and PoH, as well as time-stamped information on laboratory-confirmed disease vector or infectious agents. These EHR-derived data can be combined to create a unique and complete picture of infectious disease exposure, manifestation, and treatment patterns. Detected patterns can be useful for characterizing a hospital capture area, for example, by defining the average distance between PoR and PoH, or for identifying hotspots of infections based on patient PoRs [5,8]. When PoE is not included in EHRs, PoH or PoR could be used as a proxy, but only after careful examination of spatial patterns of hospitalizations.
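The average PoR-to-PoH distance mentioned above can be sketched as follows. This is a minimal illustration rather than the computation used in this study: it assumes each PoR has already been geocoded to a latitude and longitude, uses the haversine great-circle formula, and relies on approximate city coordinates that are illustrative values, not study data. The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def mean_por_to_poh_km(poh, pors):
    """Average distance from each patient's PoR to the hospital (PoH)."""
    distances = [haversine_km(lat, lon, poh[0], poh[1]) for lat, lon in pors]
    return sum(distances) / len(distances)

# Approximate, illustrative coordinates (assumptions, not taken from the dataset).
VELLORE = (12.92, 79.13)
CHENNAI = (13.08, 80.27)
KOLKATA = (22.57, 88.36)

if __name__ == "__main__":
    print(round(mean_por_to_poh_km(VELLORE, [VELLORE, CHENNAI, KOLKATA])))
```

Because a mean is sensitive to a few very distant PoRs, a median or a distance histogram may describe a capture area with a visiting population more faithfully.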
The Christian Medical College (CMC) Hospital in the city of Vellore, in the Indian state of Tamil Nadu, plays an important role in documenting the changing landscape of cholera on the Indian subcontinent and serves as a national reference laboratory. The Department of Microbiology at CMC was instrumental in detecting the first outbreak of cholera caused by the O139 serogroup. The outbreak started in Vellore in September 1992 and spread to Madras (now Chennai) by October 1992 [9]. This epidemic subsequently spread to Calcutta, in the Indian state of West Bengal, and to Bangladesh [9]. The new serogroup, designated O139 (synonym “Bengal”), became the most prevalent serogroup worldwide [10,11]. The Department of Microbiology at CMC has tracked the progress of Vibrio cholerae (V. cholerae) O139 since first detection, documenting the virtual absence of O1 V. cholerae during 1992–1993, its reappearance in late 1993, and the prevalence of both O1 and O139 V. cholerae serogroups in Vellore since then [12]. The O139 serogroup has been transported around the world through trade and tourism and is now well established in most South Asian countries [13,14].
Understanding the spatiotemporal patterns of cholera among patients at CMC is crucial for managing the hospital’s caseload given travel and migration patterns around Chennai and Vellore. Vellore is located near many major tourist destinations, and CMC Hospital is an important destination for medical tourism in the subcontinent. CMC is also located three hours from Chennai, the second most frequently visited destination for foreign tourists in India [15]. While foreign travel to Vellore peaks during January and February, domestic travel to both Chennai and Vellore peaks from October through December [16], coinciding with major Indian festivals such as Diwali, Dussehra, and Christmas. During these months, individuals are most likely to travel large distances through multiple transit modes to visit hometowns and relatives. Skilled employment is the primary motivation for out-migration from Chennai, especially to nearby centers such as Bengaluru [17]. There is also significant economic migration between Tamil Nadu and the nearby states of Karnataka, Kerala, and Andhra Pradesh [17]. Given this mobile population, hospital-based surveillance in this region provides valuable information on endemic diseases and novel pathogens [18].
Cholera is a highly variable disease driven by local environments as well as seasonal and community-level factors governing disease transmission. Toxigenic cholera has a median incubation period of 1.4 days, with 95% of cases developing symptoms within five days [19]. Due to the variety of drivers and quick onset, establishing exposure-disease associations for cholera at the individual level can be challenging [19]. In such situations, one can use point process methods, in which the locations of events observed within a study area are treated as a realization of a random point process in two-dimensional space [20]. Point process methods are based on individual events in a study region and therefore offer a distinct advantage over standard epidemiological modeling methods that are based on data aggregated in space and time [20].
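To illustrate what it means to work with individual event locations rather than aggregated counts, the sketch below computes the Clark-Evans nearest-neighbour index for a set of points. This is a simple textbook summary statistic chosen for illustration; it is not the point process model developed in this study. It assumes planar (projected) coordinates, a known study-area size, and applies no edge correction.

```python
import math

def clark_evans_index(points, area):
    """Ratio of observed to expected mean nearest-neighbour distance.

    Under complete spatial randomness (a homogeneous Poisson process), the
    expected mean nearest-neighbour distance is 1 / (2 * sqrt(density)).
    Values well below 1 suggest clustering; values above 1 suggest regularity.
    """
    n = len(points)
    if n < 2 or area <= 0:
        raise ValueError("need at least two points and a positive area")
    nearest = []
    for i, (xi, yi) in enumerate(points):
        # Distance from point i to its closest other point.
        d = min(math.hypot(xi - xj, yi - yj)
                for j, (xj, yj) in enumerate(points) if j != i)
        nearest.append(d)
    observed = sum(nearest) / n
    expected = 1.0 / (2.0 * math.sqrt(n / area))
    return observed / expected

# Two tight groups of points in a 10 x 10 study area (synthetic example).
clustered = [(1, 1), (1.1, 1), (1, 1.1), (9, 9), (9.1, 9), (9, 9.1)]
# A regular 5 x 5 grid with spacing 2 in the same area (synthetic example).
regular = [(x, y) for x in (1, 3, 5, 7, 9) for y in (1, 3, 5, 7, 9)]

if __name__ == "__main__":
    print(clark_evans_index(clustered, area=100.0))
    print(clark_evans_index(regular, area=100.0))
```

The clustered set yields an index far below 1 and the grid an index above 1, which is the kind of contrast that event-level data can reveal and that district-level aggregation would hide.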
This study examined geographic patterns of cholera-related hospitalization records maintained by CMC Hospital in Vellore, Tamil Nadu State, India. These records were used to examine spatiotemporal patterns of cholera based on patient PoR during 2000–2014. We used laboratory-confirmed clinical isolates of V. cholerae abstracted from CMC logbooks and electronic databases to generate the 15-year record of cholera at the hospital. We geocoded each patient’s self-reported town and region and developed point process models to identify clusters of PoR. Identified clusters were then modeled to examine their temporal characteristics, including peak timing and disease trend. These models were studied in conjunction with temporal covariates, including holidays and weekends, to characterize temporal and demographic differences between the two clusters.
4. Discussion
Detailed EHRs with accurate case information and patient PoR allow public health practitioners and data analysts to better understand the profile of the hospitalized population and monitor infectious disease patterns. Our study demonstrates that patients diagnosed with cholera and treated at CMC Hospital represent a wide range of residential locations (PoR). These PoRs can be classified into two distinct geographic clusters: Vellore (61% of cases) and Bengal (22%). We found that the two clusters differ by their patient profiles, with patients in the Bengal cluster being most likely older males who are traveling to Vellore [21]. Both clusters show well-aligned seasonal peaks in mid-July, only one week apart, and they also show the same proportion of the predominant O1 serogroup.
Travel and migration are complex phenomena which cannot fully explain the link between observed PoR in Bengal and PoH in Vellore. However, we hypothesize that PoR locations as far as the Bengal cluster may indicate established travel patterns. Per the 2011 Indian Census, 1% of the population in the Bengal cluster was born in Tamil Nadu, which may explain why patients from this region may have travelled to the Vellore vicinity [37]. Employment opportunities may also motivate travel between these regions [17]. Given a median incubation period of 1.4 days [19], we posit that long-distance travel (over 1000 km) with symptomatic cholera seems unlikely. This hypothesis is generally supported in the literature: in an assessment of cholera-related hospitalization for children under five in Bangladesh, rural distances to hospital are classified in groups of less than 3 km, 3–5 km, 5–7 km, and greater than 7 km [38]. Other studies report mean distance to hospital of 4.9 and 6.7 km [39], and a maximum distance of 16.8 km [40]. Given this range of reported distances, we conclude that patients would likely not travel more than a few kilometers to seek treatment for cholera. Therefore, distances of greater than 1000 km between Vellore (PoH) and Bengal (PoR) observed in our dataset lead us to suspect that place of exposure (PoE) for the visiting population is within the Vellore cluster boundary. However, sound conclusions on this topic require further inquiry and microbiological analysis that are outside the scope of the current study.
The present investigation is subject to several challenges and limitations. One underlying challenge is that the detection of disease clusters depends on the accuracy of the underlying data. In our study, many of the records before 2004 were collected from paper logbooks and validated with data from one of two hospital databases. The address entries had to be checked for discrepancies such as misspelling or misspecification of town or region. These non-standard data entry methods led to several uncertainties regarding the abstracted fields. There were also discrepancies between the two databases in use at CMC. Some entries were duplicated with varying patient attribute details, and some others were not diagnosed with cholera per one database. Attempts were made to resolve discrepancies by using matched records where available and by treating different entries of cholera as different cases, despite the possibility that they were the same case stored in different databases. However, the transitional nature of data migration to EHR databases makes it difficult to retrospectively analyze the accuracy of the records.
Spatial uncertainty regarding patient PoR also directly affects observed cluster boundaries. Several geographic boundaries and place names have changed since the reported case date. For example, Calcutta was renamed Kolkata in 2001, Pondicherry was renamed Puducherry in 2006, and the Indian state of Uttaranchal was created in 2000 and renamed Uttarakhand in 2007. Several entries were corrected in the second round of geocoding (Figure 2) to address changing geographies. Despite these challenges, EHRs utilized in this dataset were quite robust. Over 97% of the dataset was geocodable with minor data cleaning, suggesting that hospitalization datasets can have high quality and fidelity. Since only the Town and Region for each patient were abstracted to protect patient privacy, there was also inherent spatial uncertainty in the exact location of each PoR. A few studies have utilized high spatial resolution information from hospitalization records for source-tracking [4,5,41]; however, we were unable to abstract this level of detail in our dataset. Utilizing the centroid of a town as the PoR was a useful but coarse assumption for this study. The accuracy of future studies can be improved by geocoding the complete patient address obtained with appropriate patient consent and IRB approval. Patient travel information contained in EHRs can also help characterize the PoE more precisely for a thorough investigation of disease transmission among and across communities [4].
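The handling of renamed places before geocoding can be sketched as a small lookup step. This is an illustrative assumption about how such cleaning might be done, not the procedure used in this study; the mapping contains only the three renamings mentioned above, and the function names are hypothetical. It does not correct misspellings, which would need a separate matching step.

```python
# Renamings mentioned in the text; a real gazetteer would need many more entries.
RENAMED_PLACES = {
    "calcutta": "Kolkata",
    "pondicherry": "Puducherry",
    "uttaranchal": "Uttarakhand",
}

def normalize_place(name):
    """Map a historical place name to its current name.

    Names not in the lookup are only tidied: repeated whitespace is
    collapsed and each word is capitalized.
    """
    key = " ".join(name.split()).lower()
    if key in RENAMED_PLACES:
        return RENAMED_PLACES[key]
    return key.title()

def normalize_record(town, region):
    """Normalize the Town and Region fields of one abstracted record."""
    return normalize_place(town), normalize_place(region)

if __name__ == "__main__":
    print(normalize_record("  CALCUTTA ", "west  bengal"))
    print(normalize_record("Dehradun", "Uttaranchal"))
```

Applying the same normalization to every record before geocoding keeps a town from being split across its old and new names.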
Another limitation is that hospital-based surveillance only allows us to observe extreme cases that require intensive care. Such systems can only detect a narrow range of cases and do not capture less severe cases that did not require hospitalization, patients who cannot afford hospitalization, or patients who self-medicate. According to the PoR-PoE-PoH framework introduced in Figure 1, our analysis does not capture cases with missing PoH. Therefore, any epidemiological study utilizing hospitalization records cannot make conclusions about prevalence or incidence in the general population. Furthermore, hospitals have limited capacity, and we do not have information about upstream admission and testing decisions which led to each record observed in our dataset. However, we demonstrate that hospitalization records are extremely valuable for understanding mobile patient populations. Given the distribution of patient PoRs in this dataset, we conclude that CMC Hospital’s capture area for cholera includes a local population from Tamil Nadu, Andhra Pradesh, and Kerala states and a visiting population from the Bengal region. This information is particularly useful for hospital administrators making daily decisions regarding staffing, procurement of laboratory supplies, seasonal testing schedules, and other factors that affect the quality of patient care. Knowledge of patient demographics and disease-specific peak timing across mobile populations is also extremely important to effectively manage outbreaks that may quickly overwhelm even large regional hospitals.
Like most infectious diseases, cholera exposure and manifestation are determined by a complex set of factors. Our study offers a preliminary inquiry considering spatiotemporal properties of the cholera caseload in this region. This analysis would benefit from additional demographic, socioeconomic, and public health covariates in the model. Traditional variables used in epidemiological clustering analysis include meteorological characteristics (temperature, humidity, and precipitation) and individual patient characteristics (age, sex, and water and sanitation access) [42]. However, given the ambiguity in PoE, it is difficult to determine an appropriate spatial extent for environmental data extraction. While demographic characteristics can be abstracted at the district level, access to water and sanitation facilities varies widely across space and demographic groups, and information on these facilities was not available at a scale relevant for this study. In the presence of better information, existing methods to develop local patient profiles from EHRs [21] can be supplemented with PoR and PoE information to improve surveillance efforts and help rapidly characterize susceptible populations in the event of an outbreak. Local authorities can also implement simpler surveillance solutions by similarly geocoding patient PoRs to characterize the burden of local and imported cases of any disease over time.
Moving forward, we recommend permanent solutions to improve primary data collection across EHR fields through standardized data entry formats. For example, address entry can be streamlined by using dropdown menus for key fields instead of text boxes or manual entry systems. Hospital database interfaces could also display warning signs for incorrectly entered data and limit implausible values while entering patient age, sex, and town. These steps can establish a more streamlined data-to-results pipeline for active up-to-date surveillance results. With ongoing improvements in hospital data infrastructure, we expect rapid advances in the spatial and temporal resolution provided by EHRs for optimal local and regional infectious disease surveillance. Utilizing accurate PoR and PoE information in conjunction with the methods presented in this analysis can produce high-quality hospital-based predictive models of multiple locally significant diseases.
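The kind of entry-time checks recommended above might look like the following sketch. It is a hypothetical illustration and does not describe any existing CMC system: the allowed values, the town list standing in for a dropdown menu, and the age bound are all assumptions chosen for the example.

```python
ALLOWED_SEX = {"female", "male", "other", "unknown"}  # hypothetical coded values
KNOWN_TOWNS = {"Vellore", "Chennai", "Kolkata"}       # stand-in for a full dropdown list
MAX_AGE_YEARS = 120                                   # hypothetical plausibility bound

def validate_entry(age, sex, town):
    """Return a list of warning messages; an empty list means the entry passes."""
    warnings = []
    is_number = isinstance(age, (int, float)) and not isinstance(age, bool)
    if not is_number or not 0 <= age <= MAX_AGE_YEARS:
        warnings.append(f"age {age!r} is not a number between 0 and {MAX_AGE_YEARS}")
    if str(sex).strip().lower() not in ALLOWED_SEX:
        warnings.append(f"sex {sex!r} is not one of {sorted(ALLOWED_SEX)}")
    if town not in KNOWN_TOWNS:
        warnings.append(f"town {town!r} is not in the list of known towns")
    return warnings

if __name__ == "__main__":
    print(validate_entry(34, "Male", "Vellore"))   # passes, so no warnings
    print(validate_entry(-3, "M", "Velore"))       # three warnings
```

Returning warnings instead of rejecting the record outright lets an interface flag suspicious values while still allowing staff to save an unusual but genuine entry.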
Improvements to the outlined methodology as well as primary data collection can help create an active surveillance network based on high-resolution spatial and temporal data. While this analysis only studies the hospitalization record of cholera from CMC Vellore, this methodology can be expanded to many more diseases and centers. EHRs can also be combined with health survey and morbidity data for a more complete picture of regional infectious disease burdens. As capacities for data storage and analysis increase exponentially, robust data standards can be used to develop a hospitalization record network, which allows epidemiologists to gather, analyze, and monitor large amounts of high-quality data at multiple scales. Such data must be curated in machine-readable format in secure and accessible data repositories to facilitate further research.