Toward Smart Lockdown: A Novel Approach for COVID-19 Hotspots Prediction Using a Deep Hybrid Neural Network

Abstract: COVID-19 caused the largest economic recession in history by placing more than one third of the world's population under lockdown. The prolonged restrictions on economic and business activities caused severe economic turmoil that significantly affected the financial markets. To ease the growing pressure on the economy, scientists proposed intermittent lockdowns, commonly known as "smart lockdowns". Under a smart lockdown, areas that contain infected clusters of population, namely hotspots, are placed under lockdown, while economic activities are allowed to operate in un-infected areas. In this study, we propose a novel deep learning prediction framework for the accurate prediction of hotspots. We exploit the benefits of two deep learning models, i.e., the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), and propose a hybrid framework that can extract multi-time-scale features from the convolutional layers of the CNN. The multi-time-scale features are then concatenated and provided as input to a two-layer LSTM model. The LSTM model identifies short-, medium-, and long-term dependencies by learning the representation of the time-series data. We perform a series of experiments and compare the proposed framework with other state-of-the-art statistical and machine learning based prediction models. The experimental results demonstrate that the proposed framework outperforms the existing methods by a clear margin.


Introduction
In December 2019, a novel coronavirus, namely COVID-19, caused an outbreak in the city of Wuhan, China. Soon after its emergence, it rapidly spread to more than 200 countries. The number of confirmed COVID-19 cases is increasing exponentially throughout the world: to date, more than 17 million people have been infected and more than 0.7 million people have died. Without specific vaccines to control further spread of the disease, several countries have, as an alternative, completely locked down daily business. Countrywide lockdown controls the spread of the disease to some extent; however, it has severely affected national and global economies. The COVID-19 pandemic has mainly affected small, medium, and large enterprises, which face problems such as decreased demand, no exports, shortages of raw materials, and disruptions in transportation and the supply chain.
The unprecedented COVID-19 pandemic has affected the developing and underdeveloped countries of the world to a greater extent, as these countries already suffer from other social and economic problems. The main contributions of this work are summarized as follows:
1. We propose a novel prediction framework that forecasts potential hotspots by exploiting the benefits of different deep learning models.
2. The framework utilizes a unique CNN model to extract multi-time-scale features that incorporate short-, medium-, and long-term dependencies in time-series data.
3. From the experimental results, we demonstrate that the proposed framework achieves state-of-the-art performance in comparison to other existing methods.
The rest of the paper is organized as follows: Section 2 discusses the related work, Section 3 presents the proposed methodology, Section 4 discusses the experimental results, and Section 5 concludes the paper.

Related Work
Predicting hotspots is a special case of the time-series prediction problem. Therefore, in this section, we discuss different models and techniques for time-series prediction.
Time-series data is a sequence of data points arranged in time order. Such data is generated in a wide range of domains, for example, stock exchanges, gold prices, power consumption, sales forecasting, signals, and text. Analysis of time-series data characterizes different trends, and predicting those trends has a wide range of applications in industry and business. During the last decade, several models and techniques have been proposed to forecast trends in time-series data. Among these, deep learning has gained widespread importance due to its superior performance.
Classical machine learning models, such as support vector regression, random forests, and Hidden Markov Models (HMMs), are memoryless models and were predominantly used for time-series prediction over the last decade. These models achieve strong performance; however, they require a significant amount of task-specific knowledge. Chang et al. [15] proposed a forecasting model that predicted trends in time-series data by fitting the predicted values to the estimated trend. This multi-step-ahead prediction model is memoryless and provides inaccurate predictions. Makridou et al. [10] proposed a forecasting model based on the Adaptive Neural Fuzzy Inference System (ANFIS). The model achieved superior performance compared to the auto-regression model, artificial neural networks (ANNs), and the auto-regressive integrated moving average (ARIMA) model. Dubey [7] proposed two prediction models: the first is based on ANFIS and utilizes grid partitioning and a clustering algorithm, while the second is based on the support vector machine. Their experimental results demonstrated that the support vector model achieves better performance than the ANFIS-based model. A novel prediction model based on random forests was proposed by Liu and Li [9]; the model exploited a number of factors in the time-series data to improve prediction ability. Artificial neural network and linear regression models were employed in [16] to predict future trends in gold rates.
Recently, deep neural networks have achieved state-of-the-art performance in forecasting trends of time-series data. Among these models, the Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) have gained much popularity [17-19] due to their internal memory mechanism. An empirical study by Lipton et al. [18] demonstrated the ability of LSTM to recognize and learn long-term dependencies in multivariate time-series data related to clinical measurements. Similarly, an LSTM network was proposed in [20] to predict stock data; the authors demonstrated that LSTM, due to its memory mechanism, outperforms memory-free models. Several LSTM models [21-23] have been proposed for predicting stock trends.
Convolutional neural networks have the ability to learn hierarchical features from the input data using different convolutional layers. CNNs have gained much popularity due to their superior performance in speech classification [24], image classification [25], semantic segmentation [26], and detection [27] tasks. Inspired by the success of CNNs in various machine learning tasks, they have also been explored for time-series classification [28]. A one-dimensional CNN was used by Chen et al. [29] to predict the stock index. Sezer et al. [30] first converted the one-dimensional data to a 2-D image and then employed a CNN to learn the important features for prediction.
In recent years, hybrid deep neural networks have been proposed that combine the benefits of both CNN and LSTM. The two models have complementary capabilities: CNNs are good at learning spatial features, while LSTMs are good at temporal modelling [31]. For this reason, hybrid neural networks have been widely adopted in many applications, such as image captioning [19], action recognition [32], human behaviour recognition [33], and tracking [34]. However, hybrid neural networks have received comparatively little attention for time-series prediction problems. Kim et al. [35] proposed a hybrid model to predict residential energy consumption. Du et al. [36] proposed a similar network for forecasting traffic flow; the authors used a CNN to learn spatial features, while an LSTM was exploited to learn long-term temporal dependencies. Lin et al. [37] proposed a hybrid network that learns local and contextual features for predicting trends in time-series data, trained in an end-to-end fashion. Xue et al. [38] proposed a CNN-LSTM network for inventory forecasting. Most recently, Livieris et al. [3] proposed a hybrid model for gold price prediction.
Differences. Our proposed hybrid framework differs from the above-mentioned hybrid networks in the following ways: 1. The CNN in existing hybrid networks utilizes features from the last convolutional layer only, which represent information at a single time scale. In our work, we exploit the CNN in an innovative way and extract features from different layers that represent different time scales, which is vital for the prediction process. 2. The LSTM in existing hybrid networks uses the single-time-scale features obtained from the CNN to learn temporal dependencies. In the proposed framework, the LSTM takes input from different convolutional layers of the CNN to learn short-, medium-, and long-term time-series dependencies.

Proposed Framework
In this section, we present the details of the proposed framework. The purpose of this work is to design and develop a prediction model that can forecast potential hotspots by employing sophisticated deep learning models. The pipeline of the proposed framework is illustrated in Figure 1. As shown in the figure, the framework exploits two deep learning models: (1) the Convolutional Neural Network (CNN) and (2) the Long Short-Term Memory (LSTM) model. CNNs have the ability to extract hierarchical features that form a generic representation of time-series data, while LSTM models learn short- and long-term dependencies from time-series data. The goal of the proposed framework is to combine the advantages of both. In the following sub-sections, we discuss the details of each component.

Data Pre-Processing
In this section, we discuss how we prepare the data for training the deep neural networks. In this work, we conduct our analysis on data collected by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19). The dataset contains daily statistics on the number of positive cases, the number of recoveries, and the number of deaths. We observe that the dataset contains long sequences, and training a deep model with such long sequences causes the following problems: (1) the models take a significant amount of time and require a large amount of memory during training; (2) back-propagating through long sequences results in rapid gradient vanishing, which leads to poor learning. Therefore, as a solution, we pre-process the data before feeding it to the deep neural networks.
The overall process of data preparation is shown in Figure 2. As shown in the figure, we divide the long sequence into sub-sequences of fixed length β. In this paper, we set the value of β to 10. Let S = {x_t, x_t+1, . . ., x_T} represent a long time-series sequence over a duration T. We divide sequence S into a set of sub-sequences Ŝ = {s_1, s_2, . . ., s_n}, where each sub-sequence s_i = {x_t, x_t+1, . . ., x_t+β−1} contains the data of ten (since β = 10) consecutive days. In the next step, we pre-process the data. The pre-processing step deals with cleaning the data of noise, accounting for missing data, and normalization. Normalizing the data prevents the network nodes from saturation [39]. There are different normalization techniques that can be used for pre-processing [40]; however, in this work we use the Min-Max normalization technique, formulated as in Equation (1):

x̂_t = (x_t − Min(x_t)) / (Max(x_t) − Min(x_t)) (1)
where x̂_t is the normalized value at time t, x_t is the original value, and Min(x_t) and Max(x_t) represent the minimum and maximum values over the range of the data. After normalization, we then prepare the data for training the CNN. The CNN takes the input data and maps a vector of past observations (provided as input) to output predictions. In order to train the CNN using time-series data, the input sequence must be divided into a set of multiple examples. For this purpose, we re-arrange the data by dividing the sub-sequences into input and output patterns. For each sub-sequence s_i, we use its ten data points as input and the data point of the following day as the output for one-step prediction. After data preparation, we then provide the data as input to the CNN for hierarchical feature extraction.
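The windowing and normalization steps above can be sketched as follows. This is a minimal illustration with hypothetical helper names, assuming β = 10 and one-step-ahead targets (ten past points mapped to the next point):

```python
# Sketch of the pre-processing described above (hypothetical helper names).

def min_max_normalize(series):
    """Min-Max normalization as in Equation (1)."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]

def make_windows(series, beta=10):
    """Split a long sequence into (input, target) pairs: beta past points -> next point."""
    inputs, targets = [], []
    for t in range(len(series) - beta):
        inputs.append(series[t:t + beta])   # ten consecutive days
        targets.append(series[t + beta])    # the following day
    return inputs, targets

daily_cases = [5, 8, 13, 21, 34, 55, 89, 101, 120, 150, 180, 210]
norm = min_max_normalize(daily_cases)
X, y = make_windows(norm, beta=10)
```

Each element of `X` is one training example of length ten, and the corresponding element of `y` is the normalized case count of the following day.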

Convolutional Neural Network
In this section, we discuss the architecture of the CNN used in the proposed framework for feature extraction. Traditionally, the Multi-Layer Perceptron (MLP) model was widely adopted as a feature extractor; however, it has lost popularity because the dense connection of each perceptron to every other perceptron makes the model redundant, inefficient, and unscalable. CNN models address this problem by adopting sparse connections between neurons in multiple layers. A CNN is composed of multiple convolutional and pooling layers. Unlike in MLP models, a neuron in each layer does not need to be connected to every neuron in the preceding layer. The neurons are connected only to a region of the input data and perform dot-product operations on the input according to a certain filter size and stride. The filter size and stride need to be determined experimentally or using specific domain knowledge. These location-invariant filters slide over the data and learn specific patterns through parameter and weight sharing. Generally, the neurons in each layer of a CNN have weights and biases that are initialized and then learned during the training process.
Acknowledging the success of CNNs in feature extraction and classification tasks [41,42], we employ a CNN to extract morphological features from the data. Our CNN is composed of two 1-D convolutional layers, each followed by a max-pooling layer. This shallow architecture reduces the computational complexity during training and inference while learning multi-time-scale hierarchical features effectively. The architecture of the proposed CNN with two convolutional and two pooling layers is shown in Figure 3. We provide the details of each layer as follows:

The 1-D convolutional layer performs the dot product of the time-series input data and the convolution kernels. The convolutional layer depends on a number of parameters, including the number of filters f_n, the filter size f_s, and the stride s. These parameters need to be determined before starting the training process. The convolution kernel can be viewed as a window that contains coefficient values in the form of a vector. During the forward pass, the window slides over the input data and produces a feature vector as output. We can generate different feature vectors by applying convolution kernels of different sizes. In this way, we extract more useful information from the input data, which ultimately enhances the performance of the CNN. The convolutional layer is always followed by a nonlinear activation function, namely the Rectified Linear Unit (ReLU), which outputs zero if the input is negative and passes the input through unchanged if it is positive. The activation function can be expressed as f(x) = max(0, x).
The 1-D max-pooling layer employs a down-sampling technique that reduces the size of the feature map obtained by the convolutional layers. The feature maps obtained by convolutional layers are sensitive to the location of features in the data. To make the feature vector more robust, a pooling layer is applied that extracts certain values from the feature vector produced by the convolutional layers. Similar to convolutional layers, pooling layers employ a sliding-window approach that takes a small portion of the input feature vector (equal to the window size) and keeps the maximum or average value. In this way, the pooling layer produces a robust and summarized version of the feature map obtained by the convolutional layers. Furthermore, the new feature map obtained by the pooling layer is translation invariant, since a small shift in the input data will not affect the output values of the pooling layer.
Generally, our CNN, shown in Figure 3, follows the architecture of [43] and consists of two blocks. The overall architecture of the proposed CNN is illustrated in Figure 3a. As is evident from the figure, each block consists of one convolutional layer followed by one pooling layer. The convolutional layer of the first block contains 32 filters of size 2; the convolutional layer of the second block contains 64 filters of size 2. We use max-pooling in both blocks with a pool size of 2. Let S = {s_1, s_2, . . ., s_n} be the set of time-series sub-sequences applied as input to the CNN. The input passes through the first block and, after the convolutional and pooling layers, yields a feature vector Ω_1. The feature vector Ω_1 is then passed through the second block, which produces feature vector Ω_2.
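To make the two-block structure concrete, the following is a minimal pure-Python sketch of a 1-D convolution (with ReLU) followed by max-pooling. The kernel weights here are illustrative placeholders, not the learned filters of the actual network, and only one or two kernels are used instead of the 32 and 64 filters described above:

```python
def conv1d(x, kernels, stride=1):
    """Valid 1-D convolution followed by ReLU; kernels is a list of weight vectors."""
    fs = len(kernels[0])
    out = []
    for k in kernels:
        fmap = []
        for i in range(0, len(x) - fs + 1, stride):
            v = sum(w * xi for w, xi in zip(k, x[i:i + fs]))  # dot product with the window
            fmap.append(max(0.0, v))                           # ReLU activation
        out.append(fmap)
    return out

def max_pool1d(fmaps, size=2):
    """Non-overlapping max pooling over each feature map."""
    return [[max(f[i:i + size]) for i in range(0, len(f) - size + 1, size)]
            for f in fmaps]

x = [0.1, 0.3, 0.2, 0.6, 0.5, 0.9, 0.8, 1.0, 0.7, 0.4]   # one sub-sequence (beta = 10)
omega1 = max_pool1d(conv1d(x, kernels=[[0.5, 0.5], [1.0, -1.0]]), size=2)  # block 1
omega2 = max_pool1d(conv1d(omega1[0], kernels=[[0.5, 0.5]]), size=2)       # block 2
```

Each block shortens the sequence, which is why Ω_1 and Ω_2 summarize the input at progressively coarser time scales.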
Convolutional neural networks have achieved considerable success in predicting time-series data [28,44,45]. These models use single-time-scale features from the last convolutional layer (Ω_2) to predict the sequences and do not incorporate multi-time-scale features. Such models perform well in long-term prediction; however, their performance degrades when predicting short- or medium-term data. To predict hotspots, it is useful to incorporate multiple time-scale features, since the dataset contains data from different cities, where the number of positive cases is affected by many factors such as short- and long-duration interactions, physical proximity, and environmental conditions. Due to these factors, populous cities have a higher number of positive cases compared to less populous cities. From empirical studies, we observe that using a single-time-scale feature for small cities results in a high root-mean-square error (RMSE). Since we are using a single model for prediction, instead of using a single-time-scale feature, we employ a fusion strategy that fuses the features from both blocks. Feature vectors Ω_1 and Ω_2 have different receptive fields that describe the input time-series data at different time scales. The difference between the two feature vectors is illustrated in Figure 3b. Each data point of a feature vector describes a region of the original time-series data. Since the receptive fields of the two blocks are different, feature vectors Ω_1 and Ω_2 describe the time-series data at two different time scales. Furthermore, we also regard the original input sequence as a feature vector, represented by Ω_0. Ω_0 is the feature vector of the shortest time scale and reflects local changes in the data. We then employ the Long Short-Term Memory (LSTM) model to learn the sequential or temporal dependencies of the features obtained by the CNN.

Long Short-Term Memory
Long Short-Term Memory (LSTM) models are a special type of Recurrent Neural Network (RNN) that can learn long-term temporal or sequential dependencies thanks to feedback connections that create an internal memory. Standard RNNs, lacking an explicit memory mechanism, cannot perform well on long time-series sequences: their cyclic connections make them suffer from the vanishing or exploding gradient problem. LSTMs, on the other hand, solve this problem by using memory cells to store useful information and discard undesired information, and thus perform better than traditional RNNs. Each LSTM unit consists of a cell, and each cell contains three gates, i.e., an input gate, a forget gate, and an output gate. With the help of these gates, the LSTM unit can control the flow of information by remembering important information and forgetting unnecessary information. In this way, LSTM can capture temporal changes in sequential data and has been widely adopted for different problems.
To learn temporal dependencies at different time scales, we concatenate the feature vectors Ω_0, Ω_1, and Ω_2 and provide them as input to the LSTM. We first describe the structure of the LSTM cell and then show how a complete LSTM network is built by concatenating LSTM cells. Let K = [0, 1] denote the unit interval; then ±K = [−1, 1]. The LSTM cell L maintains two recurrent features, i.e., the hidden state hs and the current cell state cs at time t. At time t, the output of the cell L is a function of three values, expressed as in Equation (2):
(hs_t, cs_t) = L(hs_t−1, cs_t−1, d_t) (2)

where hs_t and cs_t represent the hidden state and current state of the cell, and hs_t−1 and cs_t−1 represent the previous hidden state and previous cell state, respectively. d_t is the time-series data provided as input at time t. Inside the LSTM cell, the hidden state and the input data d_t flow through three gates and update the current state of the cell using a sigmoid function, denoted by σ. Let I_t, O_t, and F_t be the input, output, and forget gates, respectively. The output of each gate is a scalar value in the range K, computed as follows.
The input gate I_t is calculated by:

I_t(hs_t−1, d_t) = σ(w_I,d d_t + w_I,h hs_t−1 + B_I) (3)

The output gate O_t is calculated by:

O_t(hs_t−1, d_t) = σ(w_O,d d_t + w_O,h hs_t−1 + B_O) (4)

The forget gate F_t is calculated by:

F_t(hs_t−1, d_t) = σ(w_F,d d_t + w_F,h hs_t−1 + B_F) (5)

where w_I,d, w_O,d, w_F,d and w_I,h, w_O,h, w_F,h are weight vectors and B_I, B_O, B_F are biases. These weights and biases are learned during the training process. The gates behave like switches: a gate is "ON" when its value is close to 1 and "OFF" when it is close to 0. Memory inside the cell is implemented by incorporating new information through the input gate and removing unnecessary information through the forget gate. New information is incorporated into the cell using Equation (6):

C̃_t(hs_t−1, d_t) = tanh(w_d d_t + w_h hs_t−1 + B) (6)

where w_d and w_h are weight vectors and B is a bias. The current state cs of the cell at time t is computed as in Equation (7):

cs_t = F_t(hs_t−1, d_t) ⊙ cs_t−1 + I_t(hs_t−1, d_t) ⊙ C̃_t(hs_t−1, d_t) (7)

The hidden state hs at time t is calculated as in Equation (8):

hs_t = O_t(hs_t−1, d_t) ⊙ tanh(cs_t) (8)

where ⊙ denotes element-wise multiplication. All of the above equations are implemented inside a single LSTM cell.
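A minimal scalar sketch of Equations (2)-(8), with hypothetical parameter names, may help make the cell's data flow concrete:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell(hs_prev, cs_prev, d_t, w):
    """One step of the LSTM cell, Equations (2)-(8).
    Scalar version for illustration; parameter names in w are hypothetical."""
    I = sigmoid(w['wI_d'] * d_t + w['wI_h'] * hs_prev + w['BI'])   # input gate,  Eq. (3)
    O = sigmoid(w['wO_d'] * d_t + w['wO_h'] * hs_prev + w['BO'])   # output gate, Eq. (4)
    F = sigmoid(w['wF_d'] * d_t + w['wF_h'] * hs_prev + w['BF'])   # forget gate, Eq. (5)
    C = math.tanh(w['wd'] * d_t + w['wh'] * hs_prev + w['B'])      # candidate,   Eq. (6)
    cs = F * cs_prev + I * C                                       # cell state,  Eq. (7)
    hs = O * math.tanh(cs)                                         # hidden state, Eq. (8)
    return hs, cs

w = {k: 0.5 for k in ('wI_d', 'wI_h', 'BI', 'wO_d', 'wO_h', 'BO',
                      'wF_d', 'wF_h', 'BF', 'wd', 'wh', 'B')}
hs, cs = 0.0, 0.0
for d_t in [0.2, 0.5, 0.9]:          # a short normalized sub-sequence
    hs, cs = lstm_cell(hs, cs, d_t, w)
```

In a trained network the gate values steer how much of the past cell state is kept and how much of the new candidate is written in at each step.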
To design the LSTM layer, we simply concatenate m LSTM cells, i.e., {L_1, L_2, . . ., L_m}, where each cell L_i contains a different set of weights and biases. We stack the individual weights and biases of all LSTM cells into matrices. The LSTM layer is obtained by applying the activation functions in an element-wise manner. The gate equations then contain weight matrices w_I,d, w_O,d, w_F,d of size R^(m×d), compatible with the input data in R^d. Similarly, the dimensions of the hidden-state weight matrices w_I,h, w_O,h, w_F,h will be R^(m×m), and the sizes of the biases B_I, B_O, B_F, B will be R^m.
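Under the same scalar assumptions, a layer of m cells can be sketched by iterating m independent cells over the shared input, a loop-based stand-in for the stacked weight matrices described above (parameter layout here is hypothetical):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cell_step(hs_prev, cs_prev, d_t, p):
    """One LSTM cell step; p packs the 12 scalar parameters of this cell."""
    wId, wIh, BI, wOd, wOh, BO, wFd, wFh, BF, wd, wh, B = p
    I = sigmoid(wId * d_t + wIh * hs_prev + BI)
    O = sigmoid(wOd * d_t + wOh * hs_prev + BO)
    F = sigmoid(wFd * d_t + wFh * hs_prev + BF)
    C = math.tanh(wd * d_t + wh * hs_prev + B)
    cs = F * cs_prev + I * C
    return O * math.tanh(cs), cs

def lstm_layer_step(hs_vec, cs_vec, d_t, params):
    """One time step of a layer of m cells, each with its own parameters."""
    out = [cell_step(h, c, d_t, p) for h, c, p in zip(hs_vec, cs_vec, params)]
    return [h for h, _ in out], [c for _, c in out]

m = 3
params = [tuple(0.1 * (i + 1) for _ in range(12)) for i in range(m)]  # toy weights
hs, cs = [0.0] * m, [0.0] * m
for d_t in [0.2, 0.5]:
    hs, cs = lstm_layer_step(hs, cs, d_t, params)
```

In practice the per-cell parameters are stacked into the R^(m×d) and R^(m×m) matrices above so the whole layer is computed with a few matrix products instead of a Python loop.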
The final output provided by Equation (8) is regarded as the temporal dependencies learned by the LSTM and is denoted by Ψ. The vector Ψ is then provided as input to a fully connected layer that produces the final prediction. The description of the CNN-LSTM framework is provided in Table 1, and the pipeline of the overall framework is illustrated in Figure 1.

Experiment Results
In this section, we discuss the experimental results obtained using the proposed method. We also demonstrate the effectiveness of the proposed framework by comparing it with other state-of-the-art methods.

Experimental Data
We perform experiments using the data collected and maintained by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. The data are publicly available at https://github.com/CSSEGISandData/COVID-19 and cover daily cumulative cases, deaths, and recoveries starting from 22 January 2020 to date. In our study, we focus on the total number of daily reported cases over the last months. In our analysis, we use the data from the United States for the following reasons: (1) in the USA, the number of laboratory- and clinically-confirmed cases is increasing daily by a large margin compared to other countries;
(2) the data also contain town-by-town counts of positive cases, which are vital for our analysis. Figure 4 illustrates the progression of COVID-19 positive cases over time in the United States. The x-axis of the heat map shows the dates from 15 February 2020 to 23 July 2020 in steps of three days. The y-axis shows the names of the US states. Each cell in the heat map (encoded with colors) shows the number of new cases by state, normalized by each state's population. Blue indicates a lower number of positive cases, while red indicates a large number of positive cases. We also report the number of COVID-19 positive cases collected from 183 towns of New Hampshire, as shown in Figure 5; the x-axis shows the different towns in New Hampshire and the y-axis shows the total number of positive cases in each town. The code at https://github.com/JasonRBowling/covid19NewCasesPer100KHeatmap is used to generate Figure 4. After obtaining the data, we normalize it as discussed in Section 3.1 before further processing. After normalization, we form a series of sequences that combine both historical data and target data.
The prediction horizon usually affects the accuracy of every predicting model. We first define the predicting horizon as the number of data points taken into consideration by a prediction model for predicting the next data point. In this paper, we fix the value of the predicting horizon to 10 days. In other words, we sample data points for 10 days and then predict the value on the 11th day.
We evaluate and compare the proposed framework with other existing state-of-the-art prediction models. We group the methods into two categories: (1) statistical models and (2) deep models. Statistical models include Support Vector Regression (SVR) [46], the Feed-Forward Neural Network (FFNN) [6,47-49], and the Auto-Regressive Integrated Moving Average (ARIMA) model [48,50,51]. In the second category, we use different deep models, such as Long Short-Term Memory (LSTM) [18,34] and the Convolutional Neural Network (CNN) [28]. We briefly discuss each model as follows: Performance Metrics: The performance of different time-series prediction models is evaluated by the root-mean-square error (RMSE), formulated as in Equation (9), and the mean absolute error (MAE), formulated as in Equation (10):

RMSE = sqrt( (1/m) Σ_{i=1..m} (p_i − p̂_i)^2 ) (9)

MAE = (1/m) Σ_{i=1..m} |p_i − p̂_i| (10)
where m is the number of predicted points, and p_i and p̂_i represent the actual and predicted values of the ith point, respectively.
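The two metrics can be implemented directly from Equations (9) and (10); the sketch below uses small illustrative numbers:

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error, Equation (9)."""
    m = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / m)

def mae(actual, predicted):
    """Mean absolute error, Equation (10)."""
    m = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / m

actual = [100, 120, 140]
predicted = [110, 115, 150]
# rmse -> sqrt((100 + 25 + 100) / 3) = sqrt(75) ≈ 8.66; mae -> (10 + 5 + 10) / 3 ≈ 8.33
```

RMSE penalizes large errors more heavily than MAE, which is why both are reported side by side in the comparison tables.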

Hyper Parameters of CNN
We analyse the effect of the number of filters and the filter size of the convolutional layers on accuracy. There is no optimal way of determining these hyper-parameters; usually, their values are determined experimentally. The convolution filter captures the local features of the input data, so the filter size has an impact on learning performance. If the filter size is too small, the filter cannot learn the contextual information of the input data; if it is too large, it may lose local details of the input data. Furthermore, the number of filters also has an impact on accuracy.
Using a smaller number of filters cannot capture enough features to perform the prediction task. The accuracy increases with the number of filters; however, too many filters may not improve the accuracy further and also make training computationally expensive. Table 1 shows the prediction performance (in terms of accuracy) using different filter sizes and numbers of filters. These values are selected while keeping in view the input size of the data (10 data points per sequence in our case). From the table, it is evident that a filter size of 2 and a large number of filters increase the accuracy.
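The experimental selection of these hyper-parameters amounts to a small grid search. The sketch below illustrates the procedure only; `evaluate` is a hypothetical placeholder for training the model and returning a validation error, not the actual training loop:

```python
import itertools

def evaluate(filter_size, n_filters):
    """Placeholder for training the CNN-LSTM and returning validation RMSE.
    Hypothetical stand-in: a deterministic dummy score for illustration only."""
    return 1.0 / (n_filters * filter_size)

# Candidate filter sizes and filter counts, as in the table's grid.
grid = itertools.product([2, 3, 5], [16, 32, 64])
best = min(grid, key=lambda cfg: evaluate(*cfg))
# best -> (5, 64): the configuration with the lowest (dummy) RMSE
```

With a real `evaluate`, each grid point would train the network once on the training split and score it on held-out data before the best configuration is kept.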

Hyper Parameters of LSTM
Long short-term memory networks are widely used in different time-series forecasting problems, as mentioned before. However, the performance of these models is always influenced by various hyper-parameters that must be tuned to achieve better performance. The selection of these hyper-parameters is not effortless and requires a sophisticated optimization method to find optimized values. These hyper-parameters include the depth of the network, the number of units in the hidden layers, and the dropout ratio. In our experiments, we fix the number of units (100 in our case) and analyse the performance of the network while varying the number of layers. We trained three different networks: the first network, LSTM 1, has one layer; the second, LSTM 2, has two layers; and the third, LSTM 3, has three layers. The performance of all LSTMs is reported in Table 2. From the table, it is evident that, with a constant number of units, increasing the number of layers increases accuracy. From the experimental evidence, we observe that LSTM 3 beats LSTM 2 by a small margin; however, it takes almost double the training time. We also observe that LSTM 1 produces comparable results in some cases; however, it loses generalization capability on the validation data. Therefore, we observe a trade-off between the number of layers and training efficiency.

Comparison of Results
In this section, we discuss and compare the performance of different methods with the proposed framework. These state-of-the-art methods are employed for various time-series prediction problems. Due to differences in datasets, model training, and targets, these methods achieve different performance in different problem domains. For a fair comparison, we carefully tune the parameters of each model and select the version that performs best. The results are reported in Table 3. From the table, it is evident that the proposed method beats the other reference methods by a clear margin. Furthermore, we analyse the hotspot prediction performance of all models. For this purpose, we treat the prediction problem as a multi-class classification problem by analysing the increase or decrease in the number of cases with respect to the previous day for each region/state. In more detail, the prediction model takes the daily number of positive cases over the last ten days and predicts the number for the next day. To characterize the regions/states, we assign a label to each as L = {l_1, l_2, . . ., l_n}, where l_i is the label of region/state i and n is the total number of regions/states. Let X = {X_1, X_2, . . ., X_n} represent a set of time-series data for n regions/states, where X_i is an ordered set of real values (the number of cases per day) over the length of the time series (10 days in our case) for region/state i.
We provide X as input to each model and obtain a set of predicted values Y = {Y_1, Y_2, . . ., Y_n}, where each Y_i represents the number of cases in region/state i predicted by the model. Let D = {(X_1, Y_1, l_1), (X_2, Y_2, l_2), . . ., (X_n, Y_n, l_n)} be the collection of inputs, outputs, and labels. We then select the top 10 labels by sorting the list Y. We compute the following evaluation metrics to measure the performance of each model. Following the convention of [3], we evaluate the classification performance using two metrics: (1) accuracy and (2) Area Under the Curve (AUC). The results are reported in Figures 6 and 7. From the figures, it is evident that the proposed method outperforms the other state-of-the-art methods.
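The hotspot selection step can be sketched as below; we read the "top" labels as the regions with the largest predicted case counts, which is an interpretation on our part:

```python
def top_hotspots(labels, predictions, k=10):
    """Rank regions by predicted next-day cases and return the k labels
    with the largest counts (our reading of the 'top' labels above)."""
    ranked = sorted(zip(labels, predictions), key=lambda lp: lp[1], reverse=True)
    return [label for label, _ in ranked[:k]]

labels = ['A', 'B', 'C', 'D']
preds = [120.0, 540.0, 80.0, 300.0]
top_hotspots(labels, preds, k=2)   # ['B', 'D']
```

The returned labels are the candidate hotspots against which the accuracy and AUC metrics are then computed.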

Discussion
From the experimental results, it is evident that the proposed method outperforms the other state-of-the-art methods. Among the reference models, Linear Regression (LR) performs relatively poorly. This is due to the fact that the LR model linearly approximates the target data; in other words, it assumes a linear relationship between the forecast and the predictor values. However, in our case, the relationship between the number of positive cases and the day is not linear. Furthermore, the LR model is susceptible to outliers. We also evaluate the performance of the CNN model; the results using different evaluation metrics are reported in Table 3. The CNN performs better than the LR model; however, it lags behind the other reference methods. This is because CNN models alone cannot learn temporal dependencies and are therefore not well suited to time-series prediction problems. SVR, ARIMA, and FFNN show similar performance in our case.
From the empirical studies, we observe that the ARIMA and FFNN models achieve a similar level of accuracy and produce lower MAE and RMSE values. This empirical evidence also validates the findings of [50,52]; however, ARIMA performs better than FFNN in many time-series prediction problems, as reported in [51]. The LSTM model, on the other hand, beats all the other reference methods. This is due to the fact that the LSTM model learns temporal dependencies from the data and learns the context of observations over time through its memory state. The proposed framework outperforms all reference methods by a significant margin, because we combine the benefits of two deep neural networks, i.e., CNN and LSTM: the CNN learns multi-time-scale features from the time-series data, which are then provided as input to the LSTM, and the LSTM learns temporal dependencies from these features. From the comparison results, we conclude that the proposed hybrid model, which exploits features from different time scales, performs better than the other reference methods.

Conclusions
In this paper, we proposed a novel hybrid framework for the prediction of potential COVID-19 hotspots. The proposed framework consists of two sequential networks. The first is a convolutional neural network composed of two blocks, each containing one convolutional layer followed by a pooling layer; this network extracts multi-time-scale features from the input data. The features extracted from the two blocks are concatenated and provided as input to an LSTM model, which then learns the temporal dependencies. The proposed framework was evaluated and compared with different state-of-the-art forecasting models. From the experimental results, we demonstrated that the proposed framework outperforms other state-of-the-art models by a considerable margin. From a series of experimental studies, we pointed out that LSTM models perform well and have been widely used in different forecasting problems for the past few years; nevertheless, we achieve a significant performance boost by combining the LSTM with a CNN. However, the accuracy is still low. In our future work, we intend to integrate the Meta-iPVP, Meta-i6mA, and HLPpred-Fuse predictors to further increase performance. We believe that the proposed framework can easily be adopted in other time-series forecasting applications, for example, stock market prediction, weather prediction, earthquake prediction, gold price prediction, and other pandemic predictions.