At the turn of the decade (2019/2020), a highly infectious virus, COVID-19, suddenly struck China and rapidly spread worldwide. Presently, the virus is essentially controlled in many parts of the world, but infection rates remain severe in regions such as South America and the U.S. Moreover, with the gradual reopening of cities around the world, a potential second outbreak is predicted by some epidemic experts. To track, control, and prevent the pandemic from getting worse, a comprehensive virus and related data collection platform is urgently needed for studying, modeling, predicting, and validating the spatiotemporal spread of COVID-19.
A large number of studies have shown that the occurrence, development, and spread of diseases, especially infectious diseases, are closely related to meteorological conditions [1]. Since the outbreak of SARS in 2003, many studies have focused on atypical pneumonia and meteorological factors and found that the propagation characteristics and spread of the SARS virus are correlated with meteorological factors such as air temperature, air pressure, cloud cover, and precipitation [2]. In most cases, as humidity and wind speed increase, the prevalence of SARS decreases. Both COVID-19 and SARS are caused by coronaviruses, but COVID-19 is more infectious than SARS.
Recent studies mention that the outbreak of COVID-19 is strongly correlated with meteorological and weather factors (e.g., precipitation, humidity, heat) [3]. For example, average temperature, maximum and minimum temperature, and air quality impact infection rates in the COVID-19 pandemic [4]. Similarly, humidity, visibility, and wind speed affect environmental stability and the viability of viruses, while air temperature impacts virus transmission. Furthermore, absolute air temperature and humidity significantly affect COVID-19 transmission [3]. On the other hand, the pandemic has an adverse impact on human daily lives, activities, and the environment. Liu et al. [5] found a decrease in nighttime light usage, especially within commercial areas, as well as in air pollution emissions. It has also been proposed that meteorological variables can aid in predicting worldwide outbreaks of COVID-19 and help investigate viral impacts on humans [6]. Moreover, poor air quality accompanied by strong winds likely accelerates the dispersal of the virus [7], which leads to an increase in new COVID-19 infections [8]. Furthermore, environmental factors such as temperature, humidity, and air quality are important input parameters for COVID-19 transmission models [9], and these models are highly sensitive to the accuracy of their input predictors [11]. A high-quality data collection of these factors is therefore vital for accurately predicting COVID-19 spread and outbreaks, and environmental data are crucial for the study of COVID-19 impacts and modeling [12].
To provide integrated and conveniently usable data sources for users in different communities, a standardized data collection with widely accepted formats and variables needs to be produced on a stable data platform. More specifically, a COVID-19 data platform should be an integrated technology solution that allows users to access, explore, and acquire COVID-19-relevant data from pre-processed database(s) [13]. A number of data platforms and repositories have been built by various organizations since the outbreak of COVID-19. COVID-19 data platforms are provided and maintained by official authoritative organizations [14], long-tail departments of news agencies [19], research groups [20], as well as non-governmental organizations (NGOs) [21]. For example, the World Health Organization (WHO) provides international virus data, while local health and medical departments [16] publish state/province- and county/city-level data.
Long-tail platforms integrate and summarize first- and second-hand COVID-19 data sources into visualization dashboards and data repositories to meet users’ demands. Most platforms provide initial analysis and tracking of COVID-19 case numbers for each administrative scale and region. However, almost all existing COVID-19 data platforms only provide virus case data such as confirmed, suspected, death, and recovered patient numbers. Other relevant factors, especially environmental variables, are rarely included. A comprehensive and complete data collection is necessary to fill this gap. Our proposed COVID-19-related environmental data collection is associated with and distributed through the COVID-19 rapid response platform established and maintained by George Mason University’s (GMU) site of the National Science Foundation (NSF) spatiotemporal innovation center (https://covid-19.stcenter.net/), with standardized spatiotemporal data structures at multiple spatiotemporal scales.
This paper offers a comprehensive description of the COVID-19-related environmental data collection. The paper is organized as follows: Section 2 introduces the raw data, derived values, and metadata of the collection; Section 3 describes the methodology concerning how derived values are produced and how data are processed and stored; Section 4 illustrates the data publishing method and provides download addresses; and finally, Section 5 introduces the data quality control method.
3.1. Spatiotemporal Aggregation and Collocation
For the reprocessed environmental data (e.g., temperature, humidity, precipitation), it is necessary to establish relationships between the data in time and space. This study statistically analyzes the environmental characteristics on daily and monthly scales and at different administrative levels based on vector boundaries.
As shown in Figure 1, global maps of daily average factors are generated by aggregating the hourly and half-hourly data along the temporal dimension for each spatial location, with the means output to NetCDF files.
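The temporal aggregation step can be sketched as follows; this is a minimal illustration with NumPy only, assuming a small hypothetical grid (real inputs are global NetCDF files read with the netCDF4 library, and the result would be written back to NetCDF):

```python
import numpy as np

# Hypothetical hourly grid for one factor over a small 10 x 20 region
# (24 time steps per day; shapes here are illustrative only).
hourly = np.arange(24 * 10 * 20, dtype=np.float32).reshape(24, 10, 20)

# Daily average for each spatial location: collapse the time dimension.
daily_mean = hourly.mean(axis=0)

# In production the aggregated grid would be written to a NetCDF file;
# here we only inspect the result shape.
print(daily_mean.shape)  # (10, 20)
```

Half-hourly inputs are handled identically, with 48 time steps collapsed instead of 24.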
Collocation with different administrative levels is realized in the Python programming language. The open-source libraries GDAL and netCDF4 are used to convert the reprocessed data (e.g., temperature, humidity, precipitation) from NetCDF to GeoTIFF. Using open-source libraries such as “geopandas”, “shapely”, and “rasterio”, the vector boundaries of different administrative levels (country, province/state, and county/city) are used as masks to obtain the covered GeoTIFF pixels as a statistical array. The NumPy scientific computing library then extracts the statistical characteristics (maximum, minimum, and average) of the obtained pixel array under the configured calculation conditions, and the array is finally exported to a CSV file for storage. The specific procedure is shown in Figure 1.
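The masking and statistics steps above can be sketched as below. This is a simplified illustration: the raster values and the boolean boundary mask are hypothetical stand-ins (in the actual workflow, rasterio and geopandas would rasterize the administrative polygon against the GeoTIFF grid), and only the NumPy extraction and CSV export are shown:

```python
import csv
import numpy as np

# Hypothetical daily temperature raster (Kelvin), already converted from
# NetCDF; NaN marks an invalid pixel filtered by the calculation conditions.
raster = np.array([[280.1, 281.5, np.nan],
                   [279.8, 282.0, 283.3],
                   [278.9, 280.7, 281.1]])

# Hypothetical rasterized boundary mask for one county: True where the
# pixel falls inside the administrative polygon.
mask = np.array([[True,  True,  False],
                 [True,  True,  True],
                 [False, False, False]])

# Extract the covered pixels as a statistical array and drop invalid values.
pixels = raster[mask]
pixels = pixels[~np.isnan(pixels)]

# Statistical characteristics: maximum, minimum, and average.
stats = {
    "max": float(np.max(pixels)),
    "min": float(np.min(pixels)),
    "mean": float(np.mean(pixels)),
}

# Export the statistics row to a CSV file for storage.
with open("county_stats.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["max", "min", "mean"])
    writer.writeheader()
    writer.writerow(stats)

print(stats)
```

In the full pipeline this loop runs once per administrative unit and per day, with one output row per unit.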
3.2. Collocating Environmental Factors with COVID-19 Case Data
The proposed environmental data collection is integrated and published together with the GMU STC Center data cube to associate it with COVID-19 case data. The data cube structure is established and utilized to represent factors and values from a spatiotemporal perspective. Due to the multiple scales of target regions, the dataset is divided by country and region at the first level, and the administrative scales are archived and shared under distinct regional folders. Daily reports and time-series summary reports are processed and published for each country and administrative level. For example, the United States folder includes an administrative 1 folder for the state-level dataset and an administrative 2 folder for the county-level dataset. Under the USA administrative 1 folder, a group of CSV files keeps a one-day timestamp of all extracted and processed environmental values for each state, defined as the daily report dataset. The summary report only keeps the latest updated files, divided by factor, to record the time-series value of each state.
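The archive layout described above can be mirrored with a small sketch; the folder names, file name, and column names below are hypothetical illustrations of the country/administrative-level/daily-report hierarchy, not the exact published paths:

```python
import csv
from pathlib import Path

# Hypothetical mirror of the data cube layout: country folder, then an
# administrative-level subfolder, then one daily-report CSV per day.
base = Path("datacube") / "United States" / "administrative 1"
base.mkdir(parents=True, exist_ok=True)

daily_report = base / "2020-06-01.csv"  # one-day timestamp per file
with open(daily_report, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["state", "mean_temperature_k"])  # illustrative columns
    writer.writerow(["Virginia", 295.4])

print(daily_report)
```

The time-series summary report, by contrast, would keep a single latest file per factor, appending one column or row per day rather than one file per day.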
3.3. Data Computing and Storage on AWS Cloud Platform
Cloud computing is becoming the standard approach to handling large-scale and remotely sensed (RS) imagery dataset processing, storage, access, and management. Many cloud providers offer a “pay as you go” service to support customized computing needs. For example, Amazon Web Services (AWS), Microsoft Azure, and Google Compute Engine provide IaaS (Infrastructure as a Service), PaaS (Platform as a Service), or SaaS (Software as a Service). In this study, AWS was adopted as the cloud to support elastic storage and processing tasks for the nighttime light radiance, temperature, humidity, pollutant, and precipitation datasets. With automatic data scraping from multiple RS data portals, these data were stored in a storage-optimized virtual instance and published to AWS S3 distributed storage. Exploiting a computing capacity of over one hundred computing cores and two hundred gigabytes of memory, a multi-tasked Python-based processing pipeline was deployed to mine those datasets and produce COVID-19-related results from the perspective of RS observation. An ongoing distributed computing approach will be developed to accommodate global-scale multi-sourced RS data processing in a single run within a reasonable processing time.
5. Quality Control
To provide reliable environmental data sources to the geospatial and COVID-19 communities, populated data are evaluated in three dimensions, namely data integrity, consistency, and validity, to ensure high-quality data publishing.
Raw data selection, cleaning, and qualification: The first and crucial step in creating a high-quality COVID-19-related environmental data collection is to select proper input raw data. To guarantee this, we first review as much literature on COVID-19-related research as we can to decide which environmental factors should be included in the final collection and which data sources researchers usually work with. We then sift through the potential data sources and choose the most frequently adopted, stable, and authoritative one for each environmental factor. In the data processing step, we filter out all invalid and unreasonable values as well as variables that are not related to COVID-19.
Data integrity: This means that populated data should be comprehensive. A thorough check is applied to time-series data, making sure they contain all historical data stored in the data sources. In addition, since daily grid environmental data are mapped to an administrative-level shapefile to provide regional environmental data, the integrity check ensures the generated data are provided for each unit (e.g., counties in the US) at a given administrative level whenever the data are available in the source files.
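The per-unit integrity check reduces to a set comparison; the sketch below uses hypothetical county names to stand in for the units listed in the boundary shapefile and the rows actually generated:

```python
# Hypothetical integrity check: every administrative unit present in the
# boundary shapefile should have a row in the generated daily report.
expected_units = {"Fairfax", "Loudoun", "Arlington"}   # from the shapefile
generated_rows = {"Fairfax", "Loudoun"}                # from the output CSV

# Units missing from the output are flagged for re-checking against the
# source files (they may be legitimately absent if the source lacks data).
missing = expected_units - generated_rows
print(sorted(missing))  # ['Arlington']
```

The same comparison applied along the time axis (expected dates versus generated files) covers the historical-completeness half of the check.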
Data consistency: This requires that data in our repository are consistent with other sources. On the one hand, extracted data should be consistent with values from the data sources; on the other hand, regional derived values (e.g., country-level monthly mean temperature) should be consistent with global and temporal distributions. For example, the mean temperature at a location in winter is lower than the mean temperature there in summer, and precipitation values are relatively larger and more frequent in the Inter-Tropical Convergence Zone (ITCZ) and the South Pacific Convergence Zone (SPCZ).
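A distributional check like the winter-versus-summer example can be expressed as a simple rule over the derived monthly means; the values and month keys below are illustrative, not taken from the actual collection:

```python
# Hypothetical consistency rule for a Northern Hemisphere location:
# the winter monthly mean temperature should fall below the summer one.
monthly_mean_k = {"2020-01": 275.2, "2020-07": 298.6}  # illustrative Kelvin values

def winter_below_summer(means, winter="2020-01", summer="2020-07"):
    """Return True if the derived means match the expected seasonal pattern."""
    return means[winter] < means[summer]

print(winter_below_summer(monthly_mean_k))  # True
```

Records violating such rules are not discarded automatically but flagged for comparison against the original source values.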
Data validity: This dimension estimates data reliability. Data sources are provided together with the populated data, making them available to data consumers so that the sources themselves can be investigated to verify validity.
Our proposed data collection encompasses COVID-19-related environmental datasets that serve as a data basis and reference for users in broader communities (e.g., governmental and urban planning departments, meteorological and climatological scientists, medical and disease control researchers). This is an alternative to other data collection efforts, which are virus-case-only platforms. The proposed collection is associated with the COVID-19 gateway of GMU’s NSF Spatiotemporal Center and is stored on a stable and highly available AWS server to provide multiple-scale spatiotemporal data at high acquisition speed [33]. The collection includes various data types and features, including temperature, humidity, air quality, nighttime light, and precipitation.
The raw datasets are automatically downloaded from the data sources using Python programs, and the derived values are produced as soon as the newest raw data are released, which guarantees the timeliness of the collection.
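The update logic can be sketched as a release-driven polling step; `fetch` and `process` below are hypothetical stand-ins for the real download and derivation code, and the string timestamps are illustrative:

```python
# Hypothetical polling routine: derived values are produced as soon as a
# newer raw file appears at the data source.
def update_if_newer(remote_timestamp, local_timestamp, fetch, process):
    """Fetch and process the raw data only when the source is newer."""
    if remote_timestamp > local_timestamp:
        raw = fetch()        # stand-in for the actual HTTP download
        process(raw)         # stand-in for producing the derived values
        return remote_timestamp  # becomes the new local timestamp
    return local_timestamp

processed = []
ts = update_if_newer("2020-06-02", "2020-06-01",
                     fetch=lambda: "raw-data",
                     process=processed.append)
print(ts, processed)  # 2020-06-02 ['raw-data']
```

Running this check on a schedule (e.g., a daily cron job per data source) is what keeps the derived products aligned with the newest raw releases.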
The proposed framework is a growing data collection whose content is extended according to the needs and requirements of users and the evolution of the pandemic. For example, the team is working to automatically collocate OMI NO2 data with administrative shapefiles and to provide country- and county-level NO2 information to the communities. It is proposed that these NO2 data contribute to the Earth data aspects of big spatiotemporal data analytics in fighting the COVID-19 pandemic [33].