Evaluating Completeness of Foodborne Outbreak Reporting in the United States, 1998–2019

Public health agencies routinely collect time-referenced records to describe and compare foodborne outbreak characteristics. Few studies provide comprehensive metadata to inform researchers of data limitations prior to conducting statistical modeling. We described the completeness of 103 variables for 22,792 outbreaks publicly reported by the United States Centers for Disease Control and Prevention’s (US CDC’s) electronic Foodborne Outbreak Reporting System (eFORS) and National Outbreak Reporting System (NORS). We compared monthly trends of completeness during eFORS (1998–2008) and NORS (2009–2019) reporting periods using segmented time series analyses adjusted for seasonality. We quantified the overall, annual, and monthly completeness as the percentage of outbreaks with non-blank records per our study period, calendar year, and study month, respectively. We found that outbreaks of unknown genus (n = 7401), Norovirus (n = 6414), Salmonella (n = 2872), Clostridium (n = 944), and multiple genera (n = 779) accounted for 80.77% of all outbreaks. However, crude completeness ranged from 46.06% to 60.19% across the 103 variables assessed. Variables with the lowest crude completeness (ranging 3.32–6.98%) included pathogen, specimen etiological testing, and secondary transmission traceback information. Variables with low (<35%) average monthly completeness during eFORS increased by 0.33–0.40%/month after transitioning to NORS, most likely due to the expansion of surveillance capacity and coverage within the new reporting system. Examining completeness metrics in outbreak surveillance systems provides essential information on the availability of data for public reuse. These metadata offer important insights for public health statisticians and modelers to precisely monitor and track the geographic spread, event duration, and illness intensity of foodborne outbreaks.


Introduction
Worldwide, public health agencies routinely collect time-referenced records to monitor ~600 million foodborne or waterborne outbreaks occurring annually [1][2][3][4][5]. In the United States (US) alone, approximately 1 in 6 Americans suffer from a foodborne illness, resulting in ~48 million cases, ~128,000 hospitalizations, and ~3000 deaths annually [6]. Nearly 90% of these illnesses and hospitalizations are caused by five pathogens, including Salmonella, Toxoplasma, Staphylococcus aureus, Norovirus, and Campylobacter [6]. In 2013, the US Department of Agriculture (USDA) Economic Research Service (ERS) estimated that the frequency and severity of foodborne illnesses culminate in ~$15.5 billion (USD 2013) of losses annually attributed to medical costs, productivity losses, and economic burden due to death [7]. A recent 2021 report suggests that these expenses have risen by 13% to an [...] completeness metrics to evaluate the credibility and usability of event-based surveillance data.
In this study, we developed a framework to perform systematic screening of data completeness to serve as metadata for outbreak surveillance systems. We evaluated the completeness of publicly reported foodborne outbreak data in the eFORS and NORS event-based electronic surveillance systems and explored how the implementation of NORS improved data completeness. We extracted, aligned, and merged 25 data tables containing 213 variables for 22,792 outbreaks publicly reported from 1 January 1998 through 31 December 2019. We compared the patterns of completeness for 103 variables before and after the transition from eFORS (1998–2008) to NORS (2009–2019) using segmented linear regression models adapted to time-referenced monthly values and controlled for outbreak seasonality. Our results provide the basis for a standardized metadata report to accompany publicly available surveillance system data downloads to assist data users in effectively utilizing reported electronic health records.

Data Source
On 4 March 2021, we requested and received integrated NORS data for all available foodborne outbreak records from 1 January 1998 through 31 December 2019 [28], where records from 1 January 1998 to 31 December 2008 were collected by eFORS. Data were unavailable for 2020 and 2021 due to a ~12–18-month delay in the public distribution of outbreak records. Extracted data included 213 variables in 25 data tables broadly categorized by general (109 variables in 5 tables), etiological (48 variables in 2 tables), and food-related (56 variables in 18 tables) outbreak characteristics. As an event-based surveillance system, NORS recorded outbreaks using identification numbers (CDCID) to permit alignment and merging of variables across data tables.
NORS records related to general outbreak information (e.g., location, illness date, incubation time, and case information) categorized primary cases by health outcomes (e.g., hospitalization, ER visits, etc.), age group (groups vary between eFORS and NORS), gender (i.e., male, female, unknown), and case definition (e.g., confirmed, probable, estimated). NORS etiological information described clinical and environmental testing procedures, sampling techniques, and pathogen etiology. Food-related characteristics pertained to the suspected modes of infection transmission, point of contamination in the supply chain, and where the contaminated foods were prepared or consumed. Some tables described only small subsets of observations, such as school-related outbreaks (3 tables), ground beef information (1 table), or egg information (1 table). Given the variability of food ingredients per outbreak, NORS provided an implicated food identification number (FID) for merging 7 ingredient-related data tables.
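The alignment of tables on CDCID can be illustrated with a minimal sketch. The column names and values below are hypothetical and stand in for the much wider NORS tables; the function assumes one row per CDCID in each table, which does not hold for the repeated-observation tables discussed in the next section.

```python
from collections import defaultdict


def merge_by_cdcid(*tables):
    """Merge lists of row dictionaries into one record per CDCID.

    Assumes each table holds at most one row per CDCID; later rows
    would overwrite earlier ones for the same key.
    """
    merged = defaultdict(dict)
    for table in tables:
        for row in table:
            merged[row["CDCID"]].update(row)
    return dict(merged)


# Hypothetical example rows (not real NORS records).
general = [
    {"CDCID": 101, "state": "MA", "total_cases": 12},
    {"CDCID": 102, "state": "TX", "total_cases": 40},
]
etiology = [{"CDCID": 101, "genus": "Salmonella"}]

outbreaks = merge_by_cdcid(general, etiology)
# Outbreak 102 has no etiology row, so "genus" stays absent (a blank).
```

Variables that never appear for an outbreak after merging are the blanks that the completeness measures later count as missing.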

Data Preparation
NORS reported 3 types of variables: strings (dates, text answers, notes), binary-choice responses (dichotomous: 0 for absent, 1 for present), and multiple-choice responses. During data pre-processing, the CDC converted multiple-choice questions either to numerous binary-choice variables or to repeated outbreak observations under the same CDCID, one for each categorical option selected. The latter introduced a repeated observation structure that duplicated outbreak records within our dataset and required more extensive data cleaning and transposition to create a uniformly structured dataset. For these multiple-choice questions, data completeness depended on the ratio of options selected to the total response options available. For example, the number of responses recorded under "case information for signs or symptoms of illnesses" could not exceed the number of "signs or symptoms" available for reporting per outbreak. Similarly, ingredient-related variables depended on the number of ingredients involved in an outbreak. We define variables whose completeness depended on the ratio of categorical options selected to all those available per outbreak as conditional variables.
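The repeated-observation structure can be collapsed to one record per outbreak as in the following sketch. The option labels and the number of available options are hypothetical, and the function is only one possible way to obtain the ratio that defines a conditional variable.

```python
def collapse_multiple_choice(rows, n_options):
    """Collapse repeated (CDCID, option) rows into one entry per outbreak.

    Returns, for each CDCID, the number of distinct options selected and
    the ratio of selected options to the n_options available.
    """
    selected = {}
    for cdcid, option in rows:
        options = selected.setdefault(cdcid, set())
        if option is not None:  # None marks a row with no option reported
            options.add(option)
    return {
        cdcid: {"n_selected": len(opts), "ratio": len(opts) / n_options}
        for cdcid, opts in selected.items()
    }


# Hypothetical repeated observations: outbreak 101 appears three times.
rows = [(101, "vomiting"), (101, "diarrhea"), (101, "vomiting"), (102, None)]
collapsed = collapse_multiple_choice(rows, n_options=4)
```

Duplicate rows for the same option are counted once, so the ratio cannot exceed 1.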
To generate a single dataset, we cleaned and merged the 213 variables in all 25 data tables across 22,792 outbreaks (Supplementary Table S1). During our cleaning process, we first excluded:
• 26 variables generated by CDC personnel to track the reporting of electronic records (e.g., data recorder ID, local report date, CDC report date, etc.);
• 12 variables providing reporter contact information and optional comments written during reporting (e.g., recall comments, agency title, reporting site, etc.);
• 9 variables providing clarification responses to specific questions asked only for specific outbreaks (e.g., clarification of supply chain stage of contamination, questions regarding antimicrobial resistance testing, etc.); and
• 17 variables unavailable for the entire study period duration (e.g., illness attack rate, percentage of illnesses by age group, food contaminant infecting exposed persons, age percentage, etc.).
Next, we collapsed 60 variables relating to multiple-choice questions into 14 variables estimated as the count of multiple-choice options per question. After completing this process, the final dataset consisted of 103 variables (Figure 1). We calculated completeness as the proportion of variable information reported per outbreak.
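The variable counts reported in this section can be checked with a short tally. The dictionary keys are shorthand labels introduced here for the four exclusion groups; the numbers are those stated in the text.

```python
extracted = 213  # variables across the 25 data tables

# Exclusion groups, in the order listed in the text.
excluded = {
    "cdc_tracking": 26,
    "contact_and_comments": 12,
    "clarification_responses": 9,
    "unavailable_full_period": 17,
}

collapsed_from, collapsed_to = 60, 14  # multiple-choice variables

remaining = extracted - sum(excluded.values())         # 213 - 64
final_count = remaining - collapsed_from + collapsed_to
print(final_count)
```

The tally reproduces the 103 variables retained in the final dataset.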

Crude Completeness Estimation
For each outbreak, we determined whether each of the 103 variables had a complete, partial, or absent record. We distinguished incomplete records, for which no information was available for a variable (e.g., blanks), from reported values of 0. For all but conditional variables, we created a dichotomous indicator defined as 1 if an outbreak had complete information for that variable and 0 if not. Dichotomous indicators were left blank if variables did not pertain to an outbreak, such as clarification questions asked only of a subset of outbreaks (e.g., the handling of beef food products for non-beef-associated outbreaks). Leaving these indicators blank restricted completeness estimates to only those outbreaks eligible to report information on a given variable. For conditional variables, we estimated completeness as the ratio of reported categorical responses to the total responses available per outbreak. Indicators thus contained 4 types of information: completely missing records (0), partially missing records (values between 0 and 1), non-missing records (1), and records where completeness information was not applicable for a given outbreak (blanks).
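The four indicator states can be made concrete with a small sketch. The function name and arguments are hypothetical, and Python's `None` stands in for a blank; it is an illustration of the coding rules described above, not the authors' code.

```python
def completeness_indicator(value, applicable=True, n_available=None):
    """Code one variable of one outbreak as 1.0, 0.0, a 0-1 ratio, or None.

    value       -- reported value; for conditional variables, the count of
                   categorical responses reported
    applicable  -- False when the variable does not pertain to the outbreak
    n_available -- total responses available (conditional variables only)
    """
    if not applicable:
        return None  # blank indicator, excluded from denominators
    if n_available is not None:  # conditional variable
        if n_available == 0:
            return None  # nothing could have been reported
        reported = 0 if value is None else value
        return reported / n_available
    if value is None or value == "":
        return 0.0  # true blank: information missing
    return 1.0  # any reported value, including a reported 0


examples = [
    completeness_indicator(0),                       # reported zero
    completeness_indicator(None),                    # blank
    completeness_indicator(None, applicable=False),  # not applicable
    completeness_indicator(3, n_available=4),        # conditional variable
]
```

Keeping a reported 0 distinct from a blank is what prevents true zero counts from being scored as missing.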
We measured crude completeness across outbreaks and variables. Crude outbreak completeness ($C_i$) reflected the percentage of variables with complete information per outbreak, while crude variable completeness ($C_j$) reflected the percentage of outbreaks with complete information per variable, such that:

$$C_i = \frac{j_i}{J} \times 100\%, \qquad C_j = \frac{i_j}{I} \times 100\% \tag{1}$$

where $C_i$ and $C_j$ are the completeness of i-outbreak and j-variable, respectively; $j_i$ is the sum of variables reporting complete information for i-outbreak; $J$ is the total number of variables for which information was collected; $i_j$ is the sum of outbreaks reporting complete information for j-variable; and $I$ is the total number of outbreaks for which information was collected.
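A minimal sketch of the two crude measures follows, assuming indicators are stored per outbreak and per variable with `None` for not-applicable blanks. Two choices here are assumptions rather than details confirmed by the text: blanks are dropped from both numerator and denominator, and partial (0 to 1) indicators are summed as fractions. Variable names and values are hypothetical.

```python
def crude_completeness(indicators):
    """Return (per-outbreak, per-variable) crude completeness in percent.

    indicators -- dict mapping CDCID to a dict of variable -> indicator,
                  where an indicator is 1.0, 0.0, a 0-1 ratio, or None.
    """
    def pct(values):
        eligible = [v for v in values if v is not None]
        if not eligible:
            return None
        return 100 * sum(eligible) / len(eligible)

    by_outbreak = {i: pct(row.values()) for i, row in indicators.items()}
    variables = {j for row in indicators.values() for j in row}
    by_variable = {
        j: pct([row.get(j) for row in indicators.values()])
        for j in variables
    }
    return by_outbreak, by_variable


# Hypothetical indicator matrix for two outbreaks and three variables.
indicators = {
    101: {"state": 1.0, "genus": 1.0, "beef_handling": None},
    102: {"state": 1.0, "genus": 0.0, "beef_handling": 0.0},
}
by_outbreak, by_variable = crude_completeness(indicators)
```

In this toy matrix the beef-handling variable is scored only for outbreak 102, the one outbreak eligible to report it.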
Next, we calculated annual and monthly completeness for all outbreaks and for selected pathogens using monthly and yearly time series of outbreak counts and completeness values. We selected only 6 pathogen subgroups of interest: the 3 most reported pathogens in the study period (Norovirus, Salmonella, and Clostridium), 2 unique groups (outbreaks associated with unknown and multiple etiologies), and all pathogens (total reported outbreaks). We used yearly time series to examine trends in completeness across study years, while monthly time series described the seasonality of completeness by Gregorian calendar month. We created yearly and monthly time series using the date of first reported outbreak illness. We calculated variable-based completeness for all outbreaks ($N_t$) and each selected pathogen ($N_{t,p}$) as:

$$C_{i,t} = \frac{r_{i,t}}{N_t} \times 100\%, \qquad C_{i,t,p} = \frac{r_{i,t,p}}{N_{t,p}} \times 100\%$$

where $C_{i,t}$ and $C_{i,t,p}$ are the completeness for i-variable in t-time unit (t = 1–22 for annual completeness, t = 1–264 for monthly completeness) for all outbreaks and for p-pathogen, respectively; $r_{i,t}$ and $r_{i,t,p}$ are the outbreak records with complete information for i-variable in t-time unit for all outbreaks and for p-pathogen; and $N_t$ and $N_{t,p}$ are the total counts of outbreaks reported in t-time unit for all outbreaks and for p-pathogen. After creating a time series of monthly completeness, we examined patterns of completeness over time, by reported pathogen, and according to completeness categories, by estimating average completeness as:

$$A_{g,t,p} = \frac{S_{g,t,p}}{L_g}$$

where $A_{g,t,p}$ is the average completeness of g-category in t-month for p-pathogen; $S_{g,t,p}$ is the sum of monthly completeness of g-category in t-month for p-pathogen; and $L_g$ is the total number of variables included in g-category (ranging from 18 to 22).
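The monthly series and category averages can be sketched as follows, indexing study months from January 1998 (1) through December 2019 (264). The function names and the example records are hypothetical. As a simplification, the category average below divides by the number of variables that have a value in the given month, which equals $L_g$ only when every variable in the category has at least one eligible outbreak that month.

```python
from collections import defaultdict
from datetime import date

START_YEAR = 1998  # January 1998 is study month 1


def study_month(d):
    """Map a date of first illness to a study month index (1-264)."""
    return (d.year - START_YEAR) * 12 + d.month


def monthly_completeness(records):
    """Percent completeness of one variable by study month.

    records -- iterable of (first_illness_date, indicator); indicators
               of None (not applicable) are skipped.
    """
    complete = defaultdict(float)
    total = defaultdict(int)
    for d, indicator in records:
        if indicator is None:
            continue
        t = study_month(d)
        complete[t] += indicator
        total[t] += 1
    return {t: 100 * complete[t] / total[t] for t in total}


def category_average(series_in_category, t):
    """Average monthly completeness across the variables of one category."""
    values = [s[t] for s in series_in_category if t in s]
    return sum(values) / len(values) if values else None


# Hypothetical records for a single variable.
records = [
    (date(2009, 1, 5), 1.0),
    (date(2009, 1, 20), 0.0),
    (date(2009, 2, 1), 1.0),
]
series = monthly_completeness(records)
```

With this indexing, January 2009 (the first month of NORS reporting) is study month 133.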

Temporal Trend Analyses
To best capture trend differences between eFORS and NORS, we defined the critical point of the segmented regression analysis as January 2009 (onset of NORS reporting). We used segmented negative binomial regression models adjusted for linear monthly trends to estimate counts of outbreaks under both surveillance systems (Equation (6)):

$$\ln\left(E[N_{t,p}]\right) = \beta_0 + \beta_b t_b + \beta_a t_a \tag{6}$$

where $N_{t,p}$ is the monthly number of outbreaks at t-month for p-pathogen; the exponential of $\beta_0$ is the outbreak count at the critical point (January 2009); $\beta_b$ and $\beta_a$ are the linear trend estimates for periods before and after the critical point, respectively; and $t_b$ and $t_a$ are the continuous time series of study months from onset to the critical point and from the critical point to conclusion, respectively.
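One way to code the two time terms is sketched below. It assumes both terms are centered on the critical point, so that the intercept corresponds to January 2009 as described above; the source does not state the exact coding. The coefficient values in the example are hypothetical and are not estimates from this study.

```python
import math

CRITICAL_MONTH = 133  # January 2009 when January 1998 is study month 1


def segmented_terms(t, critical=CRITICAL_MONTH):
    """Split study month t into before/after terms centered on the critical point.

    t_b runs from negative values up to 0 during eFORS and stays at 0
    afterwards; t_a is 0 during eFORS and counts up during NORS.
    """
    t_b = min(t - critical, 0)
    t_a = max(t - critical, 0)
    return t_b, t_a


def expected_count(t, beta0, beta_b, beta_a):
    """Expected monthly outbreak count under a log-linear segmented model."""
    t_b, t_a = segmented_terms(t)
    return math.exp(beta0 + beta_b * t_b + beta_a * t_a)


# Hypothetical coefficients: 50 outbreaks at the critical point.
at_critical = expected_count(133, math.log(50), -0.004, 0.001)
```

Under this coding both time terms equal 0 at study month 133, so the expected count there reduces to the exponential of the intercept.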
To examine trends in average monthly completeness, we used a segmented linear regression model adjusted for linear monthly trends and outbreak counts (Equation (7)) as well as harmonic regression terms (Equation (8)), where $A_{g,t,p}$ is the average completeness of g-category in t-month for p-pathogen; $\beta_0$ is the average completeness at the critical point (January 2009); $\beta_b$ and $\beta_a$ are the estimates of linear trends in completeness before and after the critical point, respectively; $t_b$ and $t_a$ are the continuous study time series before and after January 2009, respectively; and $\beta_c$ and $\beta_s$ are the harmonic trend coefficients for each critical period such that $\omega = 1/M$, where $M$ is the length of the annual cycle in Gregorian calendar months (12). We determined seasonality by the presence of a significant sine or cosine regression term. We used the Akaike Information Criterion (AIC) to examine model fit in Equation (6) and R² values to examine model fit in Equations (7) and (8). We defined statistical significance in all analyses as p < 0.05.

We performed data extraction, alignment, management, and cleaning using Excel 2016 Version 2103, Stata SE/16.1, and R (1.2.5033) software. We conducted statistical analyses using R (1.2.5033) software and created visualizations using R (1.2.5033) and Adobe Illustrator (25.4.1) software.
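Read together with the symbol definitions above, a segmented linear model with an outbreak-count adjustment, and its extension with harmonic terms, can be written as in the sketch below. This is a plausible form only: the count coefficient $\beta_n$ and the error term $\varepsilon_t$ are assumed symbols that the text does not define, and the phrase "for each critical period" suggests the harmonic coefficients may have been estimated separately for the eFORS and NORS periods, which this single-pair form does not show.

```latex
% Sketch of a segmented linear model adjusted for outbreak counts
% (beta_n and epsilon_t are assumed symbols, not taken from the source)
A_{g,t,p} = \beta_0 + \beta_b t_b + \beta_a t_a + \beta_n N_{t,p} + \varepsilon_t

% Same model with one pair of harmonic terms, where \omega = 1/M and M = 12
A_{g,t,p} = \beta_0 + \beta_b t_b + \beta_a t_a + \beta_n N_{t,p}
          + \beta_s \sin(2\pi\omega t) + \beta_c \cos(2\pi\omega t) + \varepsilon_t
```

With $\omega = 1/12$, the sine and cosine terms complete one cycle per calendar year, so a significant $\beta_s$ or $\beta_c$ indicates an annual seasonal pattern in completeness.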
We found that NORS reported 88–103 variables per outbreak with variable-based crude completeness ranging 3.05–100.00% (Supplementary Table S2). Average crude completeness per outbreak varied from 41.96% to 85.37% for all contaminants (Table 1). Among pathogen groups, estimates were high for parasitic pathogens for both crude completeness per outbreak and per variable (66.05% and 64.10%, respectively). In contrast, outbreaks of unknown etiology had the lowest variable-based and outbreak-based crude completeness (51.22% and 46.06%, respectively). Within each pathogen group, variable-based completeness varied greatly: from 72.49% for Cyclospora to nearly half of this value at 38.86% for the other-parasite group. We found similar variation within the bacterial group, with 73.03% crude variable completeness for Enterococcus and only 48.23% completeness for Shigella.
Location-related variables and total case counts reached 100.00% completeness across pathogens, with >95% completeness for epidemiologic information related to illness symptoms. School-, beef-, and egg-related information had much lower completeness (ranging 10.00–20.00%) despite only being asked for a subset of outbreaks. Category 5's variable on the secondary mode of illness transmission had the lowest completeness (3.05%), followed by etiology serotype and variables related to specimen testing types (6.98–9.42%).

Annual Completeness
We identified an increased annual trend in average completeness for 21-31 pathogens consistently reporting outbreak characteristics (

Segmented and Seasonality Trend Analyses
We examined temporal trends for the three most-reported individual pathogens across five variable categories using monthly completeness values. We also selected outbreaks with unknown etiology to understand whether missing etiology information lowers the completeness of other variables, and outbreaks with multiple etiologies to determine whether multiple-pathogen outbreaks increase the completeness of other variables.
For each category, the average monthly completeness for all outbreaks showed a similar general trend to those of other selected pathogens (Table 2, Figure 3). Category 1 had ~100% completeness while other categories showed an increasing trend, especially Categories 4 and 5. Category 5's average monthly completeness increased most after the transition to NORS, with values increasing from ~0.00% completeness before 2009 to as high as 60.82% thereafter. We also saw large increases in average monthly completeness in Category 4, with values increasing from 1.28–35.99% during eFORS to 16.77–72.59% during NORS. Nonetheless, the completeness of outbreaks with unknown etiology remained low in Categories 4 and 5 even after the reporting system transition. In addition to the completeness trend, we observed a decrease in total outbreaks, Norovirus outbreaks, and outbreaks of unknown etiology after the transition to NORS. With respect to outbreak trends, outbreak counts for all pathogens decreased throughout the eFORS study period but remained stable during NORS (Figure 3 and Supplementary Figure S1). The percent change in outbreak counts by period differed for each pathogen (Table 2).
During eFORS, Norovirus had a 0.048% yearly increase while Clostridium, outbreaks with unknown etiology, and outbreaks with multiple etiologies decreased by 0.048%, 0.096%, and 0.048% per year, respectively. During NORS, we found a decreased annual trend for both Norovirus and outbreaks with unknown etiologies whereas outbreaks with multiple etiologies increased by 0.072% per year. We found no significant trend for either Salmonella during eFORS or Salmonella and Clostridium during NORS.
Category 1 maintained high completeness despite differing outbreak counts in eFORS and NORS (Supplementary Figures S2 and S3). The effect of outbreak counts on completeness differed by pathogen. We observed a decreasing trend in completeness as outbreak counts increased for outbreaks with unknown etiology. Except for Category 1, completeness was relatively higher in NORS than in eFORS (Supplementary Figure S2). The highest monthly outbreak counts occurred in eFORS for all pathogens, outbreaks with unknown etiology, outbreaks with multiple etiologies, Norovirus, and Clostridium.

Table 2. Monthly outbreak count estimates by pathogen type (related to fitted curves in Supplementary Figure S1). Results include the number of outbreaks at the time of system change (January 2009) and the yearly percentage change in the eFORS and NORS study periods (with 95% confidence intervals). LCI and UCI are the lower and upper bounds of the 95% confidence interval, respectively.

We found that the monthly completeness of Categories 1 and 2 reached over 60% for all pathogen groups when transitioning from eFORS to NORS (Table 3). We also found that outbreak counts had limited association with completeness, with significant associations in Categories 1 and 2. Outbreak counts were negatively associated with the average completeness of outbreaks with unknown etiology yet positively associated with that of outbreaks of multiple etiologies. Furthermore, we found the greatest improvement in variable completeness for Categories 2 and 3 during eFORS and Categories 4 and 5 during NORS. For example, the monthly completeness for all outbreaks increased faster in NORS than in eFORS for [...].

We calculated monthly completeness as the average completeness of all outbreaks per month (as defined by illness onset date) for each completeness category (represented by colored lines from yellow (least complete variables) to red (most complete variables)). We report study months from January 1998 (1) through December 2019 (264).

Table 3. Estimated monthly change in completeness effect by outbreak counts and system type for each pathogen group and category (related to Supplementary Figures S2 and S3). Estimates marked ** have p-value < 0.001 and estimates marked * have p-value < 0.05. Bacterial and certain viral genera are italicized due to scientific nomenclature.

The results of harmonic regression models (Supplementary Table S3 and Supplementary Figure S4) showed no seasonality in completeness across all categories for all outbreaks. Yet, we detected seasonality for Norovirus in Categories 1 and 4 and for outbreaks of unknown etiology in Category 4 (p ≤ 0.037). These seasonal patterns in completeness were detected only during the NORS study period.

Discussion
In this study, we described and evaluated variable completeness by pathogens and pathogen groups over time. Our findings provide essential information on data availability and suitability for modelers performing time series analyses. Temporal patterns of this completeness metric illustrate substantial improvements in foodborne outbreak surveillance reporting after integrating surveillance system reporting under NORS. Furthermore, annual completeness increased steadily and exceeded 60% after 2009. We also observed improvements in completeness across variables, especially for those that describe specific characteristics rarely reported at the beginning of the study period.
The examination of average completeness by variable category is useful in assisting researchers with variable extraction and planning data analysis when using national outbreak surveillance data. Our findings suggest that NORS data are well suited to studying outbreaks' general characteristics, such as outbreak location, locations where food was eaten and prepared, symptoms, and hospitalization information, as those variables had >70% average completeness (Supplementary Table S2). However, investigation and reporting could be improved for variables related to etiology and food products. Pathogen factors, such as a long incubation period, latent symptom onset, and delayed diagnosis, could potentially complicate outbreak investigations and the identification of pathogen etiology and contaminated food products. Our findings of low completeness, especially for pathogen etiology and food product variables, likely highlight these challenges. Future research should focus on studying completeness patterns by food-related variables. Moreover, we noticed outbreaks where the completeness of certain related variables varied substantially, even though these variables should have been collected or missed simultaneously in the same outbreak report. For example, NORS had high completeness for the incubation period time unit but relatively low completeness for the incubation time itself (80% vs. 70%, respectively; Supplementary Table S2). Similarly, NORS had high completeness for the total number of cases and total primary cases, but low completeness for total secondary cases (~100% vs. 20.88%, respectively; Supplementary Table S2). Further improvements and validations could be performed by triangulating various data sources, such as surveillance and hospitalization records, allowing detection of detailed discrepancies [29].
By examining pathogen completeness over time, data users can identify pathogens and pathogen groups with less missingness. We found that Vibrio, Scombroid toxin, and Clostridium had the highest average annual completeness. We also found that as the total outbreak counts of an individual pathogen decreased, the fluctuation of the corresponding annual completeness increased. This fluctuation is attributable to the small number of outbreaks reported for those pathogens. Therefore, we used the five most reported outbreak types for trend analysis. Among the five most reported outbreak types (outbreaks of unknown etiology, Norovirus, Salmonella, Clostridium, and outbreaks of multiple etiologies), Clostridium had the highest completeness, followed by multiple etiologies, Norovirus, Salmonella, and unknown etiologies. It is very likely that data completeness is influenced by disease dynamics and diagnostic modalities implemented in the investigation protocols. Unfortunately, the metadata for both systems have very limited information on reporting capacity or testing rigor from the state and local facilities investigating each outbreak. No variables in eFORS and NORS describe the quality of reported records. For outbreaks of unknown etiology, missing etiology-related information lowers average completeness. This could explain why the outbreak counts for unknown etiology decreased over time as the system improved. Moreover, after the transition to NORS, data cleaning became more rigorous, and NORS is more likely to distinguish the different modes of transmission (e.g., person to person, waterborne, foodborne, etc.) [30]. For example, a portion of Norovirus outbreaks previously reported as foodborne are now classified as non-foodborne outbreaks [30]. In addition, we noticed an increasing trend in outbreak counts for outbreaks with multiple etiologies. Prior studies have indicated that outbreaks of multiple etiologies were more likely caused by improper handling or environmental cross-contamination in cultured farms and dairy beef farms [31,32]. However, the reason for increased multiple etiology outbreaks in recent years remains unknown.
In prior research, we found variations in seasonal peak timing across pathogens using harmonic regression modeling [33][34][35]. Such fluctuations could occur for a variety of social and environmental reasons [36][37][38][39]. Studies investigating temporal variations or seasonal patterns of illness depend on sufficient sample size for regression analyses to ensure proper statistical power [40]. The ability to detect a pathogen's seasonality is influenced by data aggregation and completeness [35,40]. Thus, the proposed metrics of completeness could be used as a tool for planning statistical analysis and determining needed statistical power when investigating foodborne illness or outbreak seasonality with surveillance data. Furthermore, we detected seasonality in the completeness of records for Norovirus and outbreaks of unknown etiology during the NORS period. The capability of identifying completeness seasonality in NORS might be indicative of the maturity of the surveillance system.
Our study was subject to several limitations, some of which are due to the constraints of the surveillance reporting system. First, foodborne outbreak reporting was based on local government reporting standards and regulations, which may contribute to state-level differences in categorizing foodborne and waterborne outbreaks [30]. These differences in reporting may lead to missing information related to outbreak contamination sources. Incomplete records could potentially occur when optional variables were not included in local investigations. Local variations in reporting practices could also affect outbreak grouping. For instance, some states may regard a multi-location outbreak as one combined outbreak, while other states report it as several distinct outbreaks [41]. Some states report outbreaks using the broad CDC definition (the number of cases ≥2 per outbreak) while other states only report notifiable outbreaks [42,43]. When reporting practices depend on outbreak size, reporting small outbreaks could better identify the sources of sporadic illnesses and disease patterns [44]. The inconsistency among outbreak definitions across states might prevent early outbreak detection and forecasting [45]. Outbreak, as a disease measurement term, needs to be more clearly and uniformly defined to better capture disease characteristics and detect disease patterns [45]. Second, as NORS is a dynamic system, public health agencies can submit new reports or revise previous reports after new information becomes available [46]. Accordingly, the completeness results could vary depending on the time of data requests.
Besides limitations due to the surveillance system, there are also limitations inherent to our study design. For example, eFORS and NORS differ in variable definitions and structure: eFORS contained 6 age groups whereas NORS contained 8 age groups. To compare the completeness of the age-related variable between the two periods, we had to use the average completeness across all age groups, which reduced data granularity. During our data cleaning process, we excluded variables that were related to contact information, optional comments, and clarification responses due to their low relevance to study objectives. We further collapsed variables that were related to multiple-choice questions into single responses. Due to the vast number of variables and differences in variable structures, we were unable to examine all variables in their original structure as presented in multiple surveillance reports. Lastly, we studied completeness at the outbreak event level. We did not investigate completeness for outbreak case information specifically because of reporting practices for public data. As case-based reporting is time consuming and labor intensive, public health agencies must balance cost-effectiveness and reporting accuracy.
For decades, the government has collected foodborne disease outbreak information to investigate the occurrence, prevent outbreaks, and reduce the severity of foodborne illnesses. The United States Public Health Service and Centers for Disease Control (CDC) have been collecting and publishing periodic reports since the 1990s [47]. The launch of eFORS provided valuable outbreak information, which was further enhanced by NORS [42,47]. The CDC has been improving the surveillance system through multiple actions. In October 1999, the CDC simplified its outbreak reporting form [48]. As a result, we observed an increase in average annual completeness in 1999–2000. As laboratory and epidemiology methods have been improving over time, the completeness and level of detail in outbreak data should improve [47,49]. Yet, unknown etiology and missing information are still present in the current surveillance system, which can undermine statistical power [50]. A good surveillance system improves outbreak information, reduces medical costs, better informs policies, and improves public health accountability [51]. Because NORS is a passive surveillance system, outbreaks are likely underreported. While active surveillance can provide more accurate and timely information, this type of surveillance system is expensive to maintain. To curtail missingness in outbreak surveillance systems, health practitioners and data curators could:

1. Create a Standard Operating Procedure (SOP) to identify must-have variables, variables that are related to one another, and less-relevant variables. This SOP can assist in the streamlining of data cleaning procedures to identify true missingness, zero values, and information that is not applicable for an outbreak. Moreover, the SOP can be used as a guideline to create NORS checkpoints to avoid missing information between related variables.

2. Consider removing variables with consistently low completeness or conduct a thorough investigation into the obstacles preventing adequate reporting of these variables.

3. Publicly report documentation explaining reasons for incomplete data. NORS has a rigorous data cleaning process that includes 30+ checkpoints for foodborne outbreaks. Outbreak data are reported as missing until all issues are solved [52]. Although incomplete outbreak reports cannot provide all information, these checkpoints and their completion may still be useful for researchers to study.

4. In accordance with the Population Health Surveillance Theory, perform periodic system audits to evaluate data reporting procedures and data quality at the local level [53]. In addition, these periodic system audits can be used as an assessment to evaluate both workforce resources and laboratory testing capacities. For any local agency with low audit scores, the CDC can provide training materials or reallocate necessary resources.
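A checkpoint for related variables, as suggested in the first recommendation, could take a form like the following sketch. The field names and the pairing are hypothetical and do not reflect the actual NORS checkpoints; the check simply flags records in which exactly one of two related fields is blank.

```python
# Hypothetical pairs of fields that should be reported together.
RELATED_PAIRS = [
    ("incubation_time", "incubation_time_unit"),
]


def flag_inconsistent_pairs(record, pairs=RELATED_PAIRS):
    """Return the related pairs where exactly one field is blank."""
    def is_blank(value):
        return value is None or value == ""

    return [
        (a, b)
        for a, b in pairs
        if is_blank(record.get(a)) != is_blank(record.get(b))
    ]


# Hypothetical record: the time unit is reported but the time itself is not.
report = {"incubation_time": None, "incubation_time_unit": "hours"}
flags = flag_inconsistent_pairs(report)
```

A record with both fields reported, or both blank, produces no flags; a reported value of 0 is treated as reported, consistent with the distinction between blanks and zeros made in the completeness estimates.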

Conclusions
Information on secondary mode of illness transmission and specimen testing types had the lowest completeness in the assessed public surveillance data, yet such information could be of value to better understand the contribution of food products to outbreak etiology. Understanding completeness is essential in estimating statistical power and identifying the effective length of disease surveillance time series to examine disease trends and characteristics at the population level. Our completeness analysis is the first attempt to examine the missingness in publicly available national outbreak surveillance systems. Future work can assess completeness by variables or outbreak types across locations to improve the outbreak information at the local level within NORS. The continuous improvement of surveillance records enables researchers to better utilize surveillance data and to model diseases with greater reliability.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijerph19052898/s1, Supplementary Table S1: Variable index in NORS with data cleaning procedure, Table S2: Crude variable completeness estimates across all outbreaks and pathogens, Table S3: Summary of seasonality and % completeness change in eFORS and NORS system by categories and pathogen groups, Supplementary Figure S1: The number of foodborne outbreaks per month reported in NORS from 1998 to 2019 with fitted negative binomial regression, Figures S2 and S3: outbreak count per month in relation to average completeness in each category, Figures S4 and S5: seasonality analysis for overall data and per pathogen group in each category.
Author Contributions: E.S., L.E.S. and R.B.S. constructed the original dataset for this manuscript. Y.Z. contributed to conceptualization and formal analysis for completeness analyses. Y.Z., R.B.S. and E.S. contributed to the writing, editing, and review of the publication manuscript. Y.Z. and K.M.M. collaborated to construct visual aids for this publication. E.N.N. contributed to conceptualization, project administration, supervision, review and editing, and funding acquisition. All authors have read and agreed to the published version of the manuscript.
Funding: This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via 2017-17072100002 (Naumova-PI). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes, notwithstanding any copyright annotation therein. This research is also based upon work supported in part by the United States Department of Agriculture (USDA) National Institute of Food and Agriculture (NIFA) Cooperative State Research, Education, and Extension Service Fellowship via grant award number 2020-38420-30724. The views represented in this article are solely those of the authors and do not represent the views of the United States Government, the Department of Defense, or the U.S. Army. Additionally, this article does not represent endorsement of any organization or association by the authors or any United States Government agency. This work, in part, was supported by the STOP Spillover project through the United States Agency for International Development (USAID). The contents are the responsibility of STOP Spillover and do not necessarily reflect the views of USAID or the United States Government. This manuscript was completed for a course enrolled in the National Science Foundation Innovations in Graduate Education Program grant (Award #1855886) entitled SOLution-oriented, STudent-Initiated, Computationally-Enhanced (SOLSTICE) Training.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: We used publicly available datasets in this study, which are referenced throughout our manuscript text [13][14][15][16].