Application of Machine Learning to Study the Association between Environmental Factors and COVID-19 Cases in Mississippi, USA

Because of the large-scale impact of COVID-19 on human health, several investigations are being conducted to understand the underlying mechanisms affecting the spread and transmission of the disease. The present study aimed to assess the effects of selected environmental factors such as temperature, humidity, dew point, wind speed, pressure, and precipitation on the daily increase in COVID-19 cases in Mississippi, USA, during the period from January 2020 to August 2021. A machine learning model was used to predict COVID-19 cases so that preventive measures could be implemented if necessary. A statistical analysis using Python programming showed that the humidity ranged from 56% to 78%, and COVID-19 cases increased from 634 to 3546. Negative correlations were found between temperature and COVID-19 incidence rate (−0.22) and between humidity and COVID-19 incidence rate (−0.15). The linear regression model showed the model linear coefficients to be 0.92 and −1.29, respectively, with the intercept being 55.64. For the test dataset, the R² score was 0.053. The statistical analysis and the machine learning model show no linear dependence of the COVID-19 incidence rate on temperature and humidity.


Introduction
The virus SARS-CoV-2 is a member of a large family of viruses called coronaviruses [1,2]. As the incidence of Coronavirus Disease 2019 (COVID-19) began to increase rapidly across the world [3], the World Health Organization (WHO) declared a global pandemic on 11 March 2020 [4].
Like other diseases caused by the coronavirus family, COVID-19 is infectious, and human-to-human contact is the primary factor in transmission of the virus, by touching infected surfaces and then passing the infection through the mouth, nose, or eyes. The complexity and gravity of the situation also led machine learning investigators to study the mechanism of the spread of the disease with a view to controlling and mitigating it. Machine learning is a non-invasive tool that acts on a large dataset of observations to find associations among the data. Machine learning is being used in different research fields and applications such as genetic programming for the nondestructive testing of critical aerospace systems [5], machine learning-based detection techniques for NDT in industrial manufacturing [6], and machine learning in medical imaging [7]. Similarly, machine learning can be applied to COVID-19 data to predict useful features from the complex data, in contrast to traditional computation-based methods. In particular, machine learning with COVID-19 data can be used to deduce risk factors related to weather, air quality, social habits, demographics, and location. A recent survey of applications of machine learning for the COVID-19 pandemic is provided by Kushwaha et al. [8]. Hybrid machine learning methods have also been used to predict the time series of infected individuals and mortality rate [9]. Machine learning has also been used to accurately predict the risk for critical COVID-19 [10]. Several machine learning methods have been compared in terms of their COVID-19 transmission forecasting performance [11].
Apart from using machine learning for the prediction of COVID-19 transmission, the scientific community has sought to study and understand the impact of environmental factors such as temperature and humidity on the prevalence of COVID-19.
The survivability and persistence of SARS-CoV-2 depend on weather conditions that indirectly control the virus transmission. The association between weather variables and COVID-19 transmission is complex. Some studies have shown that weather factors such as humidity are a determining factor for virus survival in aerosols [12,13]. The effect of sunshine on the transmission of pathogens is not positive [14]. Yasir et al. [15] showed that humidity was associated with a lower incidence of COVID-19 and a lower death rate, whereas temperature was associated with a higher daily incidence and death rate due to COVID-19. Colin et al. [16] pointed out that weather probably influences COVID-19, but not significantly compared to other preventive measures. Merow et al. [17] investigated the seasonality and uncertainty of global COVID-19 growth rates and reported that uncertainty remains high in establishing an association between them.
The study by Gupta et al. [18] on the effect of weather on COVID-19 spread showed that it is possible to predict vulnerable regions with high chances of weather-based spread in already affected countries and in countries with high populations, such as India. Zohair et al. [19] studied the association between weather data and COVID-19 to predict mortality rate using a machine learning approach.
Given the continued interest of the scientific community in the role of weather factors in COVID-19, there is a need to consider local prevailing cases and weather in order to identify an association between them, and to examine, on a local scale, whether a rise in temperature or low humidity decreases the transmission of the disease and hence reduces the number of COVID-19 cases.
In the present study, we examined the effect of weather factors on COVID-19 cases in Jackson, MS, USA, to understand and predict the potential association between them. We also sought to determine whether local weather conditions could be a factor in the spread of COVID-19. Statistical and machine learning methods were used to corroborate the results.

Data Sources
Daily cases of COVID-19 in MS, USA, were obtained from the Department of Health, MS, USA [20], and the incidence rates were computed. The weather data used for the study included temperature, humidity, dew point, pressure, wind speed, and precipitation. Daily averages of the weather data were taken from Weather Underground [21] for the same region and period of study. It was assumed that the weather conditions of the neighboring regions did not vary much from those of Jackson, MS, USA. The period from 22 January 2020 to 4 August 2021 was selected due to simultaneous weather and COVID-19 data availability. The Mississippi region was selected to identify local effects. The cumulative dataset consisted of daily COVID-19 incidence rates, temperature, humidity, dew point, pressure, wind speed, and precipitation. For the cross-correlation analysis, COVID-19 incidence rates were used. Table 1 shows a sample of the collected data.
Using statistical methods and a machine learning model, the data were analyzed to determine the correlations between weather factors and COVID-19 incidence rate, if any, and to make inferences that would help policymakers to take preventive measures.

Analytical Procedures
The Scikit-learn module of Python 3 [22–26] was used to analyze the data and identify a correlation between the weather data and COVID-19 incidence rate using machine learning. Here, it was assumed that high temperature and humidity would decrease the incidence of COVID-19 cases. In the present work, a linear-regression machine learning model was applied to the dataset to determine the relationship between weather-data variables and the spread of COVID-19 and to draw inferences, if any exist. The linear algorithm was selected to predict the COVID-19 incidence rate from its dependence on environmental factors. A Jupyter Notebook was used to run the Python code on the NVIDIA Xavier NX developer kit [27].
For each variable of the dataset, plots of the daily values were obtained. Exploratory data analysis (EDA) was conducted to determine the frequency, mean, standard deviation, minimum, maximum, and quantiles. To understand the inter-relationships between the variables of the data, a cross-correlation analysis was carried out.
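The exploratory step described above can be sketched in Python with pandas. This is a minimal illustration rather than the code used in the study: the column names and the small synthetic dataset are assumptions standing in for the real data, and "cross-correlation" is assumed here to mean the pairwise Pearson correlation between variables.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the study data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "temperature": rng.uniform(20, 86, n),   # degrees Fahrenheit
    "humidity": rng.uniform(12, 86, n),      # percent
    "wind_speed": rng.uniform(0, 15, n),     # arbitrary units
})
# Incidence rate drawn independently of the weather, purely for illustration.
df["incidence_rate"] = rng.uniform(0, 100, n)

# Summary statistics per column: count, mean, std, min, quartiles, max.
summary = df.describe()

# Pearson correlation matrix between every pair of variables.
corr = df.corr(method="pearson")

# Correlation of each weather variable with the incidence rate.
corr_with_incidence = corr["incidence_rate"].drop("incidence_rate")

print(summary)
print(corr_with_incidence)
```

With the real dataset, `df` would instead be loaded from the collected daily records, and the remaining weather variables (dew point, pressure, precipitation) would appear as additional columns.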

Machine Learning Model
In addition to the cross-correlation analysis, a linear-regression machine learning model [19,22–26] was run to determine model fitting for the relationship between the COVID-19 incidence rate and the humidity and temperature taken from the weather data. See Table 1 for the features used to train the linear model. Here, the input features (X) of the model are limited to humidity and temperature because of the assumption that high temperature and humidity would decrease the spread of COVID-19 cases. The target variable (Y) of the linear model is the COVID-19 incidence rate. The methodology of linear models for implementation in Python is well documented [22,26]. The general form of the linear model [22,26] is given by

$$Y = B_0 + B_1 X_1 + B_2 X_2 \qquad (1)$$

where Y is the COVID-19 incidence rate, X1 is humidity, and X2 is temperature. The corresponding model coefficients are represented by B1 and B2, respectively, with B0 being the intercept.
The dataset consisting of weather data and the COVID-19 incidence rate was divided into two parts, namely the training data set and the testing data set. The model was trained on the training data set, and the test set, which was withheld from training, was used for validation and prediction. The performance of the model was evaluated by standard performance evaluation metrics, namely R² (R-square metric), Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Square Error (RMSE).
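The train/test procedure and the four metrics can be sketched with scikit-learn as follows. This is a hedged illustration, not the study's code: the data are synthetic, the 80/20 split ratio and the random seeds are assumptions (the split used in the study is not stated here), and the coefficients used to generate the synthetic target are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): weak linear signal buried in noise.
rng = np.random.default_rng(42)
n = 500
humidity = rng.uniform(12, 86, n)       # X1, percent
temperature = rng.uniform(20, 86, n)    # X2, degrees Fahrenheit
X = np.column_stack([humidity, temperature])
y = 50 + 0.1 * humidity - 0.2 * temperature + rng.normal(0, 25, n)

# Hold out 20% of the rows for testing (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit Y = B0 + B1*X1 + B2*X2 on the training rows only.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate on the held-out rows.
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("intercept B0:", model.intercept_)
print("coefficients B1, B2:", model.coef_)
print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R2:", r2)
```

Here `model.intercept_` corresponds to B0 and `model.coef_` holds B1 and B2 in the column order of `X` (humidity first, then temperature).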

Time Series Analysis Results
A sample of the time series of COVID-19 cases, temperature, and humidity over Mississippi for the study period (22 January 2020 to 4 August 2021) is shown in Figure 1A,B.

Cross-Correlation Analysis Results
The results of the cross-correlation analysis are shown in Table 3. A scatter plot of the COVID-19 incidence rate against each of the weather data variables (Temperature, Humidity, Dew Point, Wind Speed, Pressure, and Precipitation) is shown in Figure 2.
The correlation coefficients between the COVID-19 incidence rate and the weather variables (Temperature, Humidity, Dew Point, Wind Speed, Pressure, Precipitation) are −0.221, −0.148, 0.143, −0.155, 0.089, and −0.049, respectively. Figure 3 shows the correlation between humidity and COVID-19 cases in Jackson, MS, USA, as a function of temperature for the period of study 22 January 2020 to 4 August 2021.

Machine Learning Model Results
A linear regression machine learning model [22,26] was run on the data set. By applying Equation (1)

Discussion
Among the six weather variables in the Jackson dataset for the study period from 22 January 2020 to 4 August 2021, the statistical description of the data (Table 3) shows considerable variation in the ranges of temperature (19.6 °F to 86.4 °F), humidity (11.6% to 85.7%), and dew point (40.5 °F to 93.5 °F). However, the cross-correlation analysis (Table 3, Figures 2 and 3) shows only slight negative or positive correlations of the COVID-19 incidence rate with these weather variables, of −0.221, −0.148, and 0.143, respectively. Nevertheless, we fitted a linear regression model to test the hypothesis that increased temperature and humidity would decrease the spread of COVID-19 cases. The results of the linear regression model, shown in Table 4 and Figure 4, indicate that the R² value of 0.0529 is too small to support any linear dependency between the COVID-19 incidence rate and the input features of temperature and humidity. The results of the machine learning model agree with those of the statistical method (Figures 1B and 3). The results of the statistical method do show a linear dependency between temperature and humidity, but not between these variables and COVID-19 incidence.
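To make the interpretation of a small R² concrete, the sketch below computes the coefficient of determination directly from its standard definition, R² = 1 − SS_res/SS_tot. The function name and the toy values are illustrative assumptions and are not taken from the study data. Under this definition, a model that always predicts the mean of the observed values scores 0, so an R² near 0.05 indicates that the fitted line explains only about 5% more of the variance than that constant baseline.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy observations (hypothetical values).
y = np.array([10.0, 20.0, 30.0, 40.0])

# Baseline that ignores all inputs and always predicts the mean.
baseline = np.full_like(y, y.mean())

print(r_squared(y, baseline))  # 0.0: no better than predicting the mean
print(r_squared(y, y))         # 1.0: perfect prediction
```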
There is increasing interest in understanding the regional effects of weather factors on COVID-19 in order to reduce the large-scale impact of COVID-19 on mortality and health disorders. More specifically, the identification of incidence rates and their distribution in semi-rural and rural plain geographical terrain with relatively poor populations has not been addressed. It is a common understanding that a rise in temperature or low humidity will decrease the transmission of the disease and hence reduce the number of COVID-19 cases. Our results agree with the findings described by Colin et al. [16] that weather probably influences COVID-19, but not significantly compared to other preventive measures, and by Merow et al. [17] that uncertainty remains high in establishing an association between seasonality and COVID-19 growth rates. However, the present study provides a relatively efficient method of studying weather impacts on the COVID-19 incidence rate that would be useful for policymakers in terms of taking preventive measures.

Conclusions
This study illustrates that the association between weather variables and the COVID-19 incidence rate is not statistically significant in the study region. The computed values of correlation coefficients were −0.221, −0.148, 0.143, −0.155, 0.089, and −0.049 between the COVID-19 incidence rate and temperature, humidity, dew point, wind speed, pressure, and precipitation, respectively. Additionally, a low R² score of 0.053 was generated from the machine learning model, rejecting the hypothesis that increased temperature and humidity would decrease the spread of COVID-19 cases in the study region.