A Hybrid DenseNet-LSTM Model for Epileptic Seizure Prediction

Abstract: Epilepsy is a common brain disease that affects about 1% of the world's population. Seizure prediction can improve the lives of patients with epilepsy and has attracted growing attention in recent years. In this paper, we propose a novel hybrid deep learning model that combines a Dense Convolutional Network (DenseNet) and Long Short-Term Memory (LSTM) for epileptic seizure prediction using EEG data. The proposed method first converts the EEG data into the time-frequency domain through the Discrete Wavelet Transform (DWT) for use as the model input. Then, the transformed images are used to train a hybrid model combining DenseNet and LSTM. To evaluate the performance of the proposed method, experiments are conducted for preictal lengths of 5, 10, and 15 min using the CHB-MIT scalp EEG dataset. With a preictal length of 5 min, we obtained a prediction accuracy of 93.28%, a sensitivity of 92.92%, a specificity of 93.65%, a false positive rate of 0.063 per hour, and an F1-score of 0.923. Finally, a comparison with previous studies confirms that the proposed method significantly improves seizure prediction performance.


Introduction
The number of people suffering from epilepsy worldwide is about 50 million. Epilepsy is a neurological brain disorder identified by the frequent occurrence of seizures [1]. Seizures show movement, sensory, cognitive, and behavioral disorders due to the release of abnormal electrical signals from the cerebral cortex. About 30% of patients have incurable epilepsy, whose seizures are not well controlled even with Anti-Epileptic Drugs (AED) [2].
To diagnose and analyze seizures, an electroencephalogram (EEG) is used, which records the electrical activity generated when signals are transmitted between neurons in the brain. EEG can be classified into two types, intracranial EEG and scalp EEG, depending on where it is measured. Intracranial EEG records the electrical activity of the cerebral cortex through electrodes attached directly to the cortex exposed during surgery. Scalp EEG measures EEG signals through electrodes attached to the scalp. Intracranial EEG yields signals with little noise, but it requires opening the skull. Scalp EEG, which can be used for routine patient monitoring and seizure alarm generation, therefore has higher potential in terms of applicability and ease of use. In addition, the EEG of a patient with epilepsy can be classified into four states. First, the ictal state is the period during a seizure. Second, the preictal state is the period before the onset of a seizure. Third, the postictal state is the period after the seizure is over. Finally, the interictal state is the interval between seizures, excluding the three previously mentioned states [3]. These four states are shown in Figure 1.
Seizures usually occur irregularly, and because their exact timing is difficult to predict, patients with epilepsy are limited in their social activities and are constantly exposed to the risk of trauma. Studies on seizure prediction using EEG signals have therefore been conducted steadily, with the aim of raising an alarm early enough before seizure onset for appropriate action to be taken. Seizure prediction rests on the existence of a difference between the interictal and preictal intervals: a predictor detects the preictal interval before seizure onset and generates an alarm. Machine learning has been widely used in seizure prediction over the past few years, but recent research has mainly used deep learning algorithms, which show strong performance in fields such as computer vision and speech recognition. In particular, Convolutional Neural Networks (CNN) [4], which are widely used in image processing and perform well there, have attracted the attention of researchers. In a supervised setting, a CNN is trained on the difference between interictal and preictal states, and the trained classifier predicts the occurrence of seizures by detecting the preictal interval in new EEG recordings.
In this paper, we propose a seizure prediction method using DenseNet-LSTM. The Dense Convolutional Network (DenseNet) [5] is an architecture that mitigates problems such as vanishing gradients and parameter growth that arise as CNNs become deeper, and it is better suited than a plain CNN to learning from limited EEG data. In addition, Long Short-Term Memory (LSTM) [6] is an architecture that addresses the long-term dependency problem of the Recurrent Neural Network (RNN) and is mainly used to predict time-series data, so it is suitable for finding temporal features of EEG, which is time-series data. The proposed method consists of two stages. In the first stage, the EEG signal is converted into image data in the time-frequency domain using the Discrete Wavelet Transform (DWT) so that it can be used as input to DenseNet. In the second stage, seizures are predicted by training the model on the difference between the interictal and preictal states using the EEG signals converted into images.
The rest of this paper is organized as follows. Section 2 covers previous studies of seizure prediction. Section 3 describes the dataset used, the preprocessing method, and the proposed model. Section 4 presents the performance evaluation according to preictal length and comparative analysis with previous studies. Finally, Section 5 concludes the paper.

Related Work
Research in the field of seizure prediction has been ongoing for the past few years. The basic assumption of seizure prediction is that there is a difference between the interictal and preictal states. Early seizure prediction studies mostly used threshold-based methodologies [7][8][9][10][11] or machine learning techniques such as Support Vector Machines (SVM) [12][13][14], whereas deep learning methods such as CNN [15][16][17] have been widely studied more recently. Ref. [18] was the first to propose training a deep learning classifier to identify seizures in EEG images, similar to how clinicians identify seizures through visual inspection. Ref. [19] proposed a method of extracting the univariate spectral power of intracranial EEG signals, classifying them with an SVM, and removing sporadic and incorrect information using Kalman filters. Their method was evaluated on 80 seizures from 18 patients in the Freiburg dataset, reaching 98.3% sensitivity and a false positive rate (FPR) of 0.29. Ref. [20] proposed a method of extracting the power spectral density ratio of the EEG signal, further processing it with a second-order Kalman filter, and then feeding it into an SVM classifier. The evaluation used the same dataset and reached 100% sensitivity and an FPR of 0.03. Ref. [21] proposed computing the phase-locking values between scalp EEG signals and using them to classify interictal and preictal states with an SVM. Their method was applied to 65 seizures from 21 patients in the CHB-MIT dataset, reaching a sensitivity of 82.44% and a specificity of 82.76%.
Among seizure prediction studies using deep learning algorithms, CNN is attracting the most attention. Since CNN-based seizure prediction usually requires image-like input, the EEG signal is converted into a two-dimensional form through preprocessing. The authors of [22] proposed a method of dividing the raw EEG signal into windows of 30 s, applying the Short-Time Fourier Transform (STFT) to extract spectral information, and then using the result as input to a CNN. In an experiment using 64 seizures from 13 patients in the CHB-MIT dataset, the method reached a sensitivity of 81.2% and an FPR of 0.16. In [23], the EEG signal is transformed into a time-frequency image using the Continuous Wavelet Transform (CWT) in order to capture the various frequency bands of EEG. The authors proposed predicting seizures by learning the difference between interictal and preictal states, using the transformed data as input to a CNN. The same dataset was used, and in a test on 18 seizures from 15 patients, the average FPR was 0.142 and three seizures could not be predicted. In [24], features preprocessed with spectral band power, statistical moments, and Hjorth parameters are used as inputs to a multiframe 3D CNN model, achieving a sensitivity of 85.71% and an FPR of 0.096 on the CHB-MIT dataset.
Figure 2 shows the overall system model of the proposed method. First, the EEG signal is preprocessed so that it can be used as input to the deep learning model. The preprocessing divides the raw EEG signal by channel, segments it by the window size, and applies the DWT with the db4 mother wavelet to convert each segment into a two-dimensional time-frequency image. db4 is a member of the Daubechies wavelet family; it encodes polynomials with two coefficients and has a relatively short computation time. Next, the preprocessed data are used as the input of DenseNet, and the resulting feature map is used as the input of LSTM. The proposed model thus learns the difference between interictal and preictal states and then predicts seizures by detecting the preictal state before seizure onset.

Dataset
The CHB-MIT dataset used in this paper consists of scalp EEG recordings from 23 pediatric patients at Children's Hospital Boston. It is a public dataset available with open access at PhysioNet.org. The recordings were made at a sampling rate of 256 Hz using 22 electrodes placed according to the International 10-20 Electrode Positioning System and contain a total of 983 h of continuous EEG and 198 seizures [25]. The annotation files of the dataset show that the channels recorded for a patient change frequently. Therefore, out of the 22 electrode channels, we used the 18 channels ("FP1-F7", "F7-T7", "T7-P7", "P7-O1", "FP1-F3", "F3-C3", "C3-P3", "P3-O1", "FP2-F4", "F4-C4", "C4-P4", "P4-O2", "FP2-F8", "F8-T8", "T8-P8", "P8-O2", "FZ-CZ", "CZ-PZ") that are common to all 24 patients. Although there are some differences between patients, a segment must be a certain distance from the ictal phase to be regarded as interictal. If it is too close, seizure activity may be included in the interictal period. Since the available distance from the ictal phase varies from patient to patient, two cases are considered. For patients whose recordings offer only short distances, we used interictal data as far from the ictal phase as possible. For patients with sufficient distance, we used interictal data that lie at least a certain distance from the ictal phase. In addition, because the preictal phase is not clearly delimited, we assume preictal lengths of 5, 10, and 15 min. As shown in Figure 3, the preictal interval is followed by a 5 min interval immediately before the ictal period. This 5 min interval preceding the seizure is purposefully excluded from the preictal data used to train the model: in real situations, if a seizure is predicted before the ictal period and the patient is treated immediately, a certain amount of time (e.g., 5 min) is needed for the treatment to have an effect on the seizure.
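The labeling rule described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `label_window`, the use of seizure onset times only (seizure duration and the postictal period are ignored), and the `interictal_margin` of one hour are assumptions made for the example, since the exact interictal distance is not stated here.

```python
def label_window(t_start, t_end, seizure_onsets, preictal_len=300.0,
                 gap=300.0, interictal_margin=3600.0):
    """Label an EEG window [t_start, t_end), given in seconds.

    preictal_len: assumed preictal length (300, 600 or 900 s in the paper).
    gap: interval right before onset that is excluded (5 min in the paper).
    interictal_margin: hypothetical minimum distance from any onset for a
        window to count as interictal (not specified in the text).
    Returns "preictal", "interictal", or None (window is not used).
    """
    for onset in seizure_onsets:
        pre_start = onset - gap - preictal_len
        pre_end = onset - gap
        # The window must lie entirely inside the preictal interval.
        if t_start >= pre_start and t_end <= pre_end:
            return "preictal"
    # Interictal only if the window is far away from every seizure onset.
    far_from_all = all(
        t_end <= onset - interictal_margin or t_start >= onset + interictal_margin
        for onset in seizure_onsets
    )
    return "interictal" if far_from_all else None
```

For a seizure starting at 10,000 s and a 5 min preictal length, the window from 9400 s to 9410 s is labeled preictal, a window inside the final 5 min before onset is discarded, and a window at the very start of the recording is labeled interictal.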

Preprocessing
The raw EEG signal is difficult to analyze because it is represented in the time-amplitude domain. We therefore use a signal processing method to convert the EEG signal into the time-frequency domain, which is more suitable for analysis. Ref. [26] extracted spectral information from EEG data converted to the frequency domain using the Short-Time Fourier Transform (STFT). The STFT and the wavelet transform are typical methods for converting a signal into the time-frequency domain. Of the two, we selected the wavelet transform, which compensates for the shortcomings of the STFT and can reflect a more diverse range of frequency bands. The wavelet transform allows effective analysis at both high and low frequencies, and it comes in two forms, the Continuous Wavelet Transform (CWT) and the Discrete Wavelet Transform (DWT) [27].
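Before the wavelet transform is applied, each EEG channel is cut into fixed-length windows (10 s windows with a 1 s overlap at 256 Hz in this paper). The following sketch shows one way to do this for a single channel. It assumes that "overlap" means consecutive windows share their last and first second, so the step is 9 s. The function names are illustrative, and the DWT itself is not included.

```python
def window_starts(n_samples, fs=256, window_s=10, overlap_s=1):
    """Start indices of all full windows that fit into n_samples."""
    win = window_s * fs
    step = (window_s - overlap_s) * fs  # assumption: step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")
    return list(range(0, n_samples - win + 1, step))


def segment(signal, fs=256, window_s=10, overlap_s=1):
    """Split one EEG channel (a sequence of samples) into overlapping windows."""
    win = window_s * fs
    return [signal[s:s + win]
            for s in window_starts(len(signal), fs, window_s, overlap_s)]
```

With these settings each window holds 2560 samples and the step is 2304 samples, so one minute of a single channel (15,360 samples) yields six windows. Each window would then be passed to the DWT to form one row group of the time-frequency image.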
As shown in Figure 4, the original EEG signal is separated by channel and then segmented with a window size of 10 s. After that, Daubechies 4 (db4) is applied as the mother wavelet of the DWT to convert the EEG signal into a two-dimensional image in the time-frequency domain. As additional parameters, the overlap was set to 1 s, and the decomposition level of the DWT was set to 7 (frequency bandwidth in the 2-128 Hz range).

DenseNet
As a network deepens, input or gradient information may vanish by the time it reaches the end of the network. Various studies have addressed this problem, and all of them share the feature of creating shortcuts from early layers to later layers. The densely connected convolutional network, introduced at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2017 [5], extends this idea through a new concept of dense connectivity and offers great advantages in terms of vanishing gradients, reduced computation, and a reduced number of parameters. As shown in Figure 5, dense connectivity continuously connects the feature maps of previous layers to the input of later layers to reinforce the information flow between layers. DenseNet is composed of dense blocks and transition layers. A dense block is characterized by its bottleneck layers and a growth rate. Because the feature maps of different layers in DenseNet are connected using channel-wise concatenation, the number of parameters of the network can become excessive, which affects computational efficiency. To avoid this, the DenseNet authors used the growth rate (k) as a hyperparameter and applied the nonlinear transformation Batch Normalization (BN) -> Rectified Linear Unit (ReLU) -> Conv(1 × 1) -> BN -> ReLU -> Conv(3 × 3) in the DenseNet structure. This bottleneck layer is shown in Figure 6a. It reduces the number of input feature maps and improves computational efficiency.
Figure 6. Bottleneck layer and transition layer.
Batch normalization is performed independently for each feature map. ReLU is a piecewise linear function that outputs its input directly if the input is positive and outputs zero otherwise. The 1 × 1 convolution is used to reduce the number of feature maps and thereby improve computational efficiency.
As shown in Figure 6b, the transition layer reduces the width and height of the feature maps as well as their number. It is placed after a dense block and consists of BN -> ReLU -> Conv(1 × 1) -> Avg pool(2 × 2). How much the number of feature maps is reduced is determined by a hyperparameter between 0 and 1 called the compression factor. If this value is 1, the number of feature maps does not change. In addition, DenseNet applies a composite function in the order BN -> ReLU -> Conv to each layer, citing the efficiency results for different orderings of BN, ReLU, and Conv tested in [28].
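To make the roles of the growth rate and the compression factor concrete, the sketch below tracks the number of feature maps through a DenseNet-style stack. It assumes that the initial convolution outputs 2k feature maps (as stated for the hybrid model in this paper), that every layer in a dense block adds k feature maps, and that each transition layer keeps floor(compression × channels) of them. The number of layers per block is a hypothetical input, since it is not given in the text.

```python
import math


def densenet_channels(layers_per_block, growth_rate=32, compression=0.5):
    """Trace the feature-map count after each stage of a DenseNet-style stack.

    layers_per_block: hypothetical list with the number of layers in each
        dense block, e.g. [6, 6, 6].
    Returns a list of (stage name, number of feature maps) pairs.
    """
    channels = 2 * growth_rate  # initial convolution: twice the growth rate
    trace = [("stem conv", channels)]
    for i, n_layers in enumerate(layers_per_block, start=1):
        # Each layer concatenates growth_rate new feature maps to its input.
        channels += n_layers * growth_rate
        trace.append((f"dense block {i}", channels))
        if i < len(layers_per_block):
            # Transition layer: the 1 x 1 convolution compresses the channels.
            channels = math.floor(compression * channels)
            trace.append((f"transition {i}", channels))
    return trace
```

With k = 32, a compression factor of 0.5 and three blocks of six layers, the count goes 64 -> 256 -> 128 -> 320 -> 160 -> 352. With a compression factor of 1 the transitions leave the count unchanged, which matches the description above.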

LSTM
LSTM is a special RNN structure that solves the long-term dependency problem, which arises when past information is not carried through to the end of a sequence. By solving this problem, LSTM performs well in analyzing and predicting not only short sentences but also long data such as voice, video, and time-series data. Figure 7 shows the structure of the LSTM. The top line in Figure 7 is the cell state, which is the core of the LSTM. The cell state flows like a conveyor belt, adding and removing information through the gates and passing the information on to the next step. It also lets earlier information directly influence future outputs. An LSTM unit goes through four steps. The first step is the forget gate layer, expressed by Equation (1), in which a sigmoid layer decides what information to forget. Here, $x_t \in \mathbb{R}^{d}$ is the input vector to the LSTM unit, and $h_{t-1}$, whose elements lie in $(-1, 1)$, is the previous hidden state vector, which can be seen as the output vector of the previous LSTM unit. $W_f$ and $b_f$ are the weight matrix and bias vector of the forget layer, which are optimized during model training. $\sigma$ is the sigmoid function, which returns values in the range 0 to 1. A value close to 0 means that the corresponding information is forgotten, and a value close to 1 means that it is kept. The second step comprises the input gate layer of Equation (2) and the tanh layer of Equation (3). The input gate layer determines which values to update through a sigmoid layer, and the tanh layer creates a new candidate value $\tilde{C}_t$, the cell input activation vector. The outputs of the two layers are then multiplied element-wise and added to the cell state. The third step creates a new cell state by updating the past state as shown in Equation (4). First, the information that the forget gate decided to drop is discarded, and then the information selected for addition is added. The last step decides which values to output using the output gate layer of Equation (5). First, a sigmoid layer applied to the input data determines which parts of the cell state are to be output. The final output is then obtained by multiplying this gate value by the cell state passed through the tanh layer, as shown in Equation (6).
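Equations (1) to (6) are not reproduced in this text. For reference, the standard LSTM formulation that matches the four steps described above can be written as follows. The recurrent weight matrices $U$ and the number of hidden units $h$ are notation assumed here, and the numbering is assumed to follow the order of the description.

```latex
\begin{align}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \tag{1} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \tag{2} \\
\tilde{C}_t &= \tanh\left(W_C x_t + U_C h_{t-1} + b_C\right) \tag{3} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{4} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \tag{5} \\
h_t &= o_t \odot \tanh\left(C_t\right) \tag{6}
\end{align}
```

Here $x_t \in \mathbb{R}^{d}$, the gates $f_t, i_t, o_t \in (0, 1)^{h}$, each $W \in \mathbb{R}^{h \times d}$, each $U \in \mathbb{R}^{h \times h}$, each $b \in \mathbb{R}^{h}$, and $\odot$ denotes element-wise multiplication.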

Hybrid Model
As shown in Figure 8, we propose a hybrid model that combines DenseNet and LSTM. The first half of the proposed model is built with the DenseNet structure. Its output feature map is used as the input of the LSTM so that sequence information is reflected in the features, and the final classification is performed with the sigmoid function. Specifically, the input data are images obtained by applying the DWT to the raw EEG signal, with the dimensions frequency (DWT level), time, and channel. The input image first passes through a Conv layer, which produces a number of output feature maps equal to twice the growth rate. All dense blocks have the same number of layers, and the Conv(3 × 3) operations in them use 1-pixel zero-padding so that the size of the feature maps does not change. Each dense block is followed by a transition layer, which reduces the number of feature maps through Conv(1 × 1) and reduces their size through average pooling. Then, instead of a fully connected layer, which would add too many parameters, global average pooling is used to output the feature maps as a 1-D vector. This vector is reshaped into an input format suitable for the LSTM and fed into it. Finally, the features generated by the LSTM are classified into interictal and preictal states using the sigmoid function. The detailed structure is shown in Table 1.
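The hand-off between the convolutional part and the LSTM can be illustrated with plain Python lists. This sketch only covers global average pooling, the reshape, and the final sigmoid decision; the LSTM itself is omitted. The number of time steps used in the reshape, the 0.5 threshold, and the choice of preictal as the positive class are assumptions for illustration, as the text does not specify them.

```python
import math


def global_average_pool(feature_maps):
    """Average each 2-D feature map to one value, giving a 1-D vector."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def reshape_for_lstm(vector, timesteps):
    """Split a 1-D vector into `timesteps` equal-length steps (hypothetical layout)."""
    if len(vector) % timesteps != 0:
        raise ValueError("vector length must be divisible by timesteps")
    size = len(vector) // timesteps
    return [vector[i * size:(i + 1) * size] for i in range(timesteps)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def classify(score, threshold=0.5):
    """Map the final pre-activation score to a class label."""
    return "preictal" if sigmoid(score) >= threshold else "interictal"
```

For example, four 2 × 2 feature maps are pooled to a vector of length four, which can be reshaped into two time steps with two features each before being passed to a recurrent layer.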

Experimental Setup
This section describes the workstation environment, the hyperparameters of DenseNet-LSTM, the experimental method, and the evaluation metrics. As shown in Table 2, an AMD Ryzen 7 3700X was used as the CPU with a total of 64 GB of memory, and the proposed model was trained on a GeForce RTX 2080 Ti GPU. The software environment consisted of Python 3.6, TensorFlow 1.14, and Keras 2.2.4. As shown in Table 3, the growth rate of DenseNet-LSTM was set to 32 and the compression factor to 0.5. ReLU was used as the activation function, Adam was selected as the optimizer, and the learning rate was set to 0.001. The experiments were performed using k-fold cross-validation, which divides the data into k folds, trains on k − 1 of them, and tests on the remaining one. This process is repeated k times, and the average of the results is used as the validation result of the model.
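The k-fold procedure described above can be sketched as follows. The value of k and whether the data are shuffled are not stated here, so the example uses contiguous, unshuffled folds and leaves k as a parameter. The `evaluate` callback stands in for training and testing the model and is a hypothetical placeholder.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Distribute the remainder so that fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

Every sample is used for testing exactly once and for training k − 1 times, and the reported result is the mean over the k test folds.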
To evaluate the seizure prediction performance of the model, accuracy, sensitivity, specificity, false positive rate (FPR), and F1-score, calculated as shown in Table 4, are used as performance metrics. Accuracy is the proportion of correctly classified data in the entire dataset. Sensitivity is the proportion of actual preictal data that is correctly predicted as preictal. Specificity is the proportion of actual interictal data that is correctly predicted as interictal, and FPR is the proportion of interictal data that is incorrectly judged to be preictal. Precision is the proportion of data predicted as preictal that is actually preictal. The F1-score is the harmonic mean of precision and recall (sensitivity).
Table 4. Evaluation metrics (TP is true positive, TN is true negative, FP is false positive, FN is false negative).
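The contents of Table 4 are not reproduced in this text. Assuming the standard confusion-matrix definitions with preictal as the positive class, the metrics can be computed as in the sketch below. Under these definitions FPR equals 1 − specificity, which is consistent with the reported values (1 − 0.9365 = 0.0635).

```python
def seizure_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts.

    Preictal is treated as the positive class (an assumption). No guard
    against empty classes is included, so all denominators must be non-zero.
    """
    sensitivity = tp / (tp + fn)  # recall on preictal data
    specificity = tn / (tn + fp)  # recall on interictal data
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": fp / (fp + tn),
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }
```

For example, `seizure_metrics(tp=90, tn=95, fp=5, fn=10)` gives an accuracy of 0.925, a sensitivity of 0.90, a specificity of 0.95, and an FPR of 0.05.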

Experimental Results
In this section, we set the preictal length to 5, 10, and 15 min and present the experimental results and a comparison with existing algorithms. Figure 9 shows the average accuracy (Acc), sensitivity (Sen), specificity (Spec), FPR, and F1-score for preictal lengths of 5, 10, and 15 min. The model trained with an assumed preictal length of 5 min achieves a higher sensitivity than the models trained with 10 and 15 min. This means that the 5 min model learned the preictal interval better than the other models, which suggests that preictal characteristics appear most strongly between 0 and 5 min. On the other hand, the models trained with assumed preictal lengths of 10 and 15 min have a higher specificity and a lower FPR than the 5 min model. This means that these models distinguished the interictal interval more clearly than the 5 min model.
Table 5 shows the average Acc, Sen, Spec, FPR, and F1-score for each patient according to the preictal length. Looking at the results for each patient, the average sensitivity is highest for the model with a preictal length of 5 min, but for patient 4 the sensitivity at 5 min is lower than at 10 and 15 min, while the specificity is high. This indicates that for this patient the preictal characteristics did not appear clearly during 0-5 min and appeared after 5 min. For patient 24, in contrast, the sensitivity of the models trained with preictal lengths of 10 and 15 min was lower than that of the 5 min model, which means that the preictal features were more pronounced at 0-5 min. Overall, the average result is best when the preictal length is assumed to be 5 min. However, the model that gave the most balanced results across patients, without a marked degradation for any individual patient, was the one with a preictal length of 15 min.
To objectively verify the performance of the proposed method, we compared it with existing algorithms [22][23][24]. The authors of [22] proposed a method of converting EEG signals into image data through the STFT and classifying them with a CNN. In [23], the EEG signal is transformed into image data through the CWT and classified with a CNN. The authors of [24] predicted seizures using features obtained through Hjorth parameters as input to a 3D CNN. As shown in Figure 10 and Table 6, the proposed method performs better than the existing methods. We attribute this to the use of DenseNet, which, unlike the CNNs used in the existing algorithms, reinforces the information flow throughout the network and thus makes learning more effective. In addition, adding the LSTM in the second half of the model allowed the sequence information of the EEG signal to be learned well.
Figure 10. Comparison of the proposed method with previous studies [22,23].
Table 6. Results of recent epileptic seizure prediction approaches on the CHB-MIT scalp EEG dataset. In the case of "This work", the results for a preictal length of 5 min, which were the best, were used.

Conclusions
In this paper, we have proposed a new hybrid deep learning model, DenseNet-LSTM, for predicting patient-specific epileptic seizures using scalp EEG data. The method achieves a prediction accuracy of 93.28%, a sensitivity of 92.92%, a specificity of 93.65%, an FPR of 0.063 per hour, and an F1-score of 0.923. The DenseNet used in this study addresses the problems of existing CNNs by enhancing the information flow throughout the network and increasing computational efficiency. In addition, by applying LSTM, the network learns the long-term temporal features of the EEG data. Since the CHB-MIT dataset used in this study consists mostly of pediatric patients, the method needs to be tested extensively with more EEG data. Nevertheless, our experimental results and the comparison with previous studies show that the proposed method is efficient and reliable, which suggests its potential as a seizure prediction tool that can effectively mitigate the risks faced by patients with epilepsy.