An Efficient Hybrid Model for Patient-Independent Seizure Prediction Using Deep Learning

Recently, many researchers have deployed different deep learning techniques to predict epileptic seizures using electroencephalogram (EEG) signals. However, most of this research requires very large amounts of memory and complicated feature extraction algorithms. In addition, such approaches could not precisely examine EEG signal characteristics, which led to poor prediction performance. In this research, a non-patient-specific epileptic seizure prediction approach is proposed. The proposed model integrates wavelet-based EEG signal processing with deep learning architectures for efficient prediction of pre-ictal and inter-ictal signals. The proposed system uses different one-dimensional convolutional neural network models to discriminate between inter-ictal and pre-ictal signals in order to enhance prediction performance. Experiments were carried out on a benchmark dataset to validate the robustness of the proposed model. The experimental results showed that the proposed approach achieved 93.4% for 16 patients and 97.87% for 6 patients. These experiments showed that the proposed model can predict epileptic seizures effectively, which gives it remarkable potential in clinical applications.


Introduction
Epilepsy is a brain disease that causes sudden recurring seizures, which are provoked by unusual electrical discharges in some brain cells [1,2]. Such seizures can severely harm patients, as they might cause nervousness and injuries [3]. The World Health Organization (WHO) has stated that around 10% of the world population will have one seizure during their lifetime, which makes epilepsy one of the most common neurological disorders around the world [1][2][3][4]. Hence, finding an efficient methodology for epileptic seizure prediction is a very important issue [5]. Studies on seizure prediction using electroencephalogram (EEG) signals have shown that EEG holds valuable physiological and pathological information that can significantly enhance the performance of seizure prediction models if properly analyzed [1][2][3][4][5][6][7].
In addition, the use of deep learning (DL) and neural network (NN) techniques in the development of computer models for disease detection and identification, among many other domains, has attracted the interest of numerous researchers [8][9][10][11][12][13]. For instance, the authors of [12] introduced a COVID-19 patient tracking system using Lagrange optimization and a distributed DL model to ensure that all data needed to track any suspected patient are received efficiently and reliably. To achieve a low mean square error, they deployed a one-dimensional convolutional neural network (1D-CNN). Moreover, in real-world applications such as that proposed in [13], a 1D-CNN deep learning model was used to calculate the optimum interference transmission power in vehicle-to-everything communication. In addition, the authors of [14] showed the utility of DL for vehicle pollution detection using real-time images. Their outcome clearly demonstrated that a simple deep neural network (DNN) architecture is capable of producing significant results.
The aim of this research is to design and implement an efficient hybrid model for the prediction of epileptic seizures using DL techniques. In addition, it aims to study the effect of introducing a DWT-based noise removal module on the performance of the presented model.
The contributions of this paper can be summarized as follows. A generalized, non-patient-specific epileptic seizure prediction model is developed. The proposed model integrates the DWT for EEG signal processing with DL techniques for efficient classification of EEG signals. Furthermore, it introduces a robust noise removal technique that is adaptable to different EEG signal channel selections.
The rest of this paper is organized as follows. The next section describes the materials and methods used, including the dataset, explains the different modules used in the proposed approach and shows the 1-D CNN models' architectures. In Section 3 the experimental results using the two proposed DL models are presented. The analysis is then discussed in Section 4. Finally, Section 5 presents the conclusions.

Materials and Methods
The dataset used in this research was assembled from the CHB-MIT [26], which is a benchmark dataset and is freely accessible for research.
Each case (chb1, chb2, etc.) contained between 9 and 42 continuous '.edf' files from a single patient. To protect the patients' privacy, all protected health information (PHI) in the original files was replaced with surrogate information in the provided files. Surrogate dates replaced the dates in the original files, but the time relationships between the individual files belonging to each case were preserved. The files, in most cases, contained exactly one hour of digitized EEG signals, though those in case chb10 were two hours long, and those in cases chb04, chb06, chb07, chb09, and chb23 were four hours long; files in which seizures were recorded were occasionally shorter. For these recordings, the International 10-20 system of EEG electrode positions and nomenclature was used.
Files labeled 'RECORDS' contained a list of all 664 files in the collection, while files labeled 'RECORDS-WITH-SEIZURES' contained a list of the 129 files that contained one or more seizures. These records contained 198 seizures in total, with the beginning and the end of each seizure annotated.
This dataset included continuous EEG recordings for 24 patients, comprising 5 male and 17 female patients. The records for patients 1 and 21 were gathered from the same female subject, 1.5 years apart, while the records for patient 24 were added to the dataset later, in December 2010. The female patients' ages ranged from 1.5 to 19 years, while the male patients' ages ranged from 3 to 22 years. The data were collected by positioning 23 electrodes on the patients' scalps, meaning that all EEG signals were scalp EEG (sEEG). The dataset was sampled at 256 Hz with a 16-bit resolution. Figure 1 shows a sample of the dataset used. The signal depicted represents patient one's ictal phase, which is the phase in which seizures occur, typically lasting a few seconds. The figure shows that, for each patient, there were several parallel channels, which could differ from one patient to the next. Every patient's record file contained information about the exact location and duration of the ictal state. The pre-ictal state is the state that occurs immediately before the onset of the ictal phase, whereas the inter-ictal state is the state in which the patient is normal. To achieve standardization, 18 channels common to all 24 patients were chosen from the 23 electrodes.
Seizure prediction methods are divided into patient-specific and patient-independent methods. EEG signal patterns are unique to each epilepsy patient and vary from one individual to another. Moreover, seizure prediction is a challenging task because of the noise introduced during EEG recording by eye blinking, eyelid movements, swallowing, and breathing.
The proposed epileptic seizure prediction system, as shown in Figure 2, is composed of three main modules: EEG signal pre-processing, EEG signal segmentation, and a 1-D CNN-based model used to predict whether the signal is pre-ictal or inter-ictal.
As shown in Figure 2, the signal channels are first pre-processed and noise is removed from every channel using DWT techniques. After channel cleaning, the pre-ictal and inter-ictal segments are extracted from the signals. Each channel is then passed to the 1-D CNN model for the prediction process.

Figure 2. Proposed model architecture, consisting of a pre-processing module followed by an EEG signal segmentation phase and a deep learning phase.

EEG Signal Pre-Processing
The goal of the pre-processing step is to remove noise from the EEG signal in order to obtain more accurate features and results. Many studies have discussed different noise removal techniques, and one of the most effective is the discrete wavelet de-noising scheme. Wavelet de-noising is effective for physiological signals because it tends to preserve signal properties while reducing noise; for this reason, it is preferred over, and more frequently used than, frequency-domain filtering. The reason for this is that threshold strategies are available that allow reconstruction based on chosen coefficients. It has also been proven in recent years that the following set of wavelets is effective for cleaning signals: ['Daub4', 'Daub6', 'Daub8', 'Daub12', 'DMeyer', 'Symlet9'].
In this paper, these wavelets were applied to choose the best de-noising technique for every EEG channel. The dataset signals were subdivided into individual channels, and each channel was fed into the noise removal module as an input. This module compared six major wavelets: ['Daub4', 'Daub6', 'Daub8', 'Daub12', 'DMeyer', 'Symlet9']. The 'denoise_wavelet' method from the Python library 'skimage' was used for each of the wavelets, with wavelet_levels = 4.
The wavelet domain is a sparse representation of the signal, similar to the frequency domain of the Fourier transform. Sparse representations have values that are primarily zero or near-zero, and truly random noise is represented by many small values in the wavelet domain. Setting all values below a certain threshold to zero helps to reduce signal noise.
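The idea of shrinking small wavelet coefficients can be illustrated with a self-contained NumPy sketch. The Haar wavelet is used here only as an illustrative stand-in for the Daubechies/Meyer/Symlet wavelets mentioned above (the paper itself uses 'denoise_wavelet' from skimage), and the function names are illustrative:

```python
import numpy as np

def haar_dwt(x):
    # One decomposition level: low-pass (approximation) and
    # high-pass (detail) coefficients.
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def haar_idwt(a, d):
    # Invert one decomposition level.
    x = np.empty(a.size * 2)
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

def soft_threshold(c, t):
    # Shrink coefficients toward zero; magnitudes below t become zero.
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def wavelet_denoise(x, levels=4, threshold=0.5):
    # Decompose to `levels` and soft-threshold the detail coefficients.
    # Assumes len(x) is divisible by 2**levels.
    details, a = [], np.asarray(x, dtype=float)
    for _ in range(levels):
        a, d = haar_dwt(a)
        details.append(soft_threshold(d, threshold))
    # Reconstruct from the approximation and the modified details.
    for d in reversed(details):
        a = haar_idwt(a, d)
    return a
```

With the threshold set to zero the transform reconstructs the signal exactly, which confirms that only the thresholding step removes information.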
A de-noising procedure is usually divided into three steps. First, a wavelet and a level N are selected, and the wavelet decomposition of the signal s at level N is computed. In this study, N was set to 4, and the list of wavelets given above was used. Figure 3a depicts the decomposition of a signal into different levels using low-pass and high-pass filters. Second, a threshold was chosen for each level, from 1 to N, and soft thresholding was applied to the detail coefficients. Finally, wavelet reconstruction was computed using the original approximation coefficients of level N and the modified detail coefficients of levels 1-N.
For each channel, the best wavelet was chosen according to the lowest root mean square error (RMSE) and the highest signal-to-noise ratio (SNR). SNR is defined as the ratio between the power of the signal and the power of the noise; this ratio should be maximized to obtain a cleaner signal. RMSE is the standard deviation of the residual (the difference between the original signal and the cleaned signal) and is frequently used in forecasting and regression analyses to verify experimental results. RMSE and SNR have been widely used as criteria to choose the most effective wavelet for cleaning a signal. Table 1 shows examples of different SNR and RMSE values for a channel of an EEG signal from patient one.
Based on the experimental results, the wavelet 'db12' had the highest SNR and the lowest RMSE, so it was chosen as the best wavelet for noise removal in this channel.
In order to prove the robustness of the system and the effectiveness of the noise removal module, artificial noise, such as Gaussian noise, was added to the signal at different percentages. In Figure 4, the average accuracy of four classifiers (SVM, random forest, KNN, and bagging classifier) is presented against different percentages of white Gaussian noise. The system achieved high accuracy even with added noise; it remained robust up to 60% noise, as the accuracy was still above 95%.
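The per-channel wavelet selection by SNR and RMSE can be sketched as follows. This is a minimal illustration; the function names are assumptions, and the candidate cleaned signals would come from de-noising the channel with each wavelet in turn:

```python
import numpy as np

def snr_db(original, cleaned):
    # Ratio of signal power to noise power, in dB; the residual
    # (original - cleaned) is treated as the noise.
    noise = original - cleaned
    return 10.0 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2))

def rmse(original, cleaned):
    # Root mean square error of the residual.
    return np.sqrt(np.mean((original - cleaned) ** 2))

def best_wavelet(original, candidates):
    # candidates: dict mapping wavelet name -> cleaned signal.
    # Pick the highest SNR, breaking ties by the lowest RMSE.
    return max(candidates,
               key=lambda w: (snr_db(original, candidates[w]),
                              -rmse(original, candidates[w])))
```

For a channel cleaned by several wavelets, the wavelet whose output maximizes SNR (and minimizes RMSE) is kept, matching the 'db12' choice described above.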

EEG Signal Segmentation Phase
After noise removal in all channels for each patient, the following segments were extracted:
1. Pre-ictal phase segment (with window = k), which represents the window of signal occurring just before the ictal phase and which could be used as an alarm for caregivers to take precautions.
2. Inter-ictal phase segment (with window = k), which represents the normal signal, where the patient is in a normal state.
The 'k' value was tested with different values to study the effect of segmenting the signal. First, k was set to 8 min for both the pre-ictal and inter-ictal phases. The extracted window was then further segmented into equal-length windows of length L, where L was chosen to be 10 s and 30 s. Both overlapping and non-overlapping windows were considered in the segmentation phase. Figure 5 shows the channel signal after extracting the inter-ictal phase, where k = 8 min.
Figure 5. The orange part of the signal represents k minutes of the inter-ictal phase.
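The windowing described above can be sketched as follows. This is a minimal illustration (the function name and the overlap fraction are assumptions; the 256 Hz sampling rate comes from the dataset description):

```python
import numpy as np

def segment_channel(signal, fs=256, window_s=10, overlap=0.0):
    # Split a 1-D channel into fixed-length windows of `window_s` seconds.
    # `overlap` is the fraction shared between consecutive windows
    # (0.0 gives non-overlapping windows).
    win = int(window_s * fs)
    step = max(int(win * (1.0 - overlap)), 1)
    return np.array([signal[i:i + win]
                     for i in range(0, len(signal) - win + 1, step)])
```

For example, an 8-min (k = 8) segment sampled at 256 Hz split into 10-s non-overlapping windows yields 48 windows of 2560 samples each.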

EEG Signal Prediction Using Deep Learning Models
In this paper, two DL models, both based on 1D CNNs, are presented. Each of the EEG channels common to all subjects in the dataset was passed through the proposed models, and the final prediction for a given signal was obtained using a majority voting scheme. Figure 6 shows the diagram of the proposed deep learning model. Each channel is passed as an input to the CNN model, independently of the other channels.
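The channel-fusion step can be sketched as follows, assuming each channel's model emits a binary label (1 = pre-ictal, 0 = inter-ictal). The tie-break toward inter-ictal is an assumption; the paper does not specify tie handling:

```python
import numpy as np

def majority_vote(channel_preds):
    # Fuse per-channel binary predictions (1 = pre-ictal, 0 = inter-ictal).
    # A strict majority of channels must vote pre-ictal; ties fall back to
    # inter-ictal (an assumption not specified in the text).
    preds = np.asarray(channel_preds)
    return int(preds.sum() > preds.size / 2)
```

With the 18 common channels, the signal is labeled pre-ictal only when more than nine per-channel models agree.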


Layers of CNN Model Architecture
Appl. Sci. 2022, 12, x FOR PEER REVIEW 8 of 15

▰ Convolution layer: In this layer, the input dataset is convolved with a kernel of weights, and the output is a feature map. The stride parameter determines how far the kernel moves across the input data. By learning from the diverse input signals, this operation acts as a feature extractor.

Conv1D (filters, kernel-size, strides, activation = none)
▻ Filters: Number of sliding windows;
▻ Kernel-size: The size of the sliding window;
▻ Strides: Modifies the amount of movement over a sample of data;
▻ Activation: The activation function is a mathematical gate that connects the current neuron's input to its output, which is passed to the next layer as input.

▰ Pooling layer: This is a down-sampling layer that uses the pooling process to lower the input data's dimension (reducing the amount of computation performed in the network) and increase the robustness of feature extraction while keeping all the important information. Average, max, and sum are all common pooling operations; max-pooling is used most often in modern deep learning architectures.
Most used in 1D: MaxPooling1D (pool-size = 2)
▻ Pool-size: Take the maximum value in a defined spatial window.

▰ Batch normalization: A method for training deep neural networks that normalizes each input to a layer for every mini-batch. This stabilizes the learning process and reduces the number of training epochs required to train deep neural networks.

▰ Dropout: Overfitting is avoided by randomly setting input units to zero at a given rate at each step during the training period. Inputs that are not set to zero are scaled up by 1/(1 - rate) so that the total sum remains the same over all inputs.
▻ Dropout (rate = 0.1~0.5)
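To make the filters / kernel-size / strides semantics concrete, here is a minimal NumPy sketch of a single-channel 1-D convolution and max-pooling forward pass. It is illustrative only (no padding, no bias, linear activation); the model itself uses the Keras-style Conv1D and MaxPooling1D layers described above:

```python
import numpy as np

def conv1d(x, kernels, stride=1):
    # x: 1-D input signal; kernels: (n_filters, kernel_size) weights.
    # Each filter slides over x with the given stride, producing one
    # feature-map row per filter.
    k = kernels.shape[1]
    n_steps = (len(x) - k) // stride + 1
    return np.array([[x[i * stride:i * stride + k] @ w
                      for i in range(n_steps)]
                     for w in kernels])

def maxpool1d(feature_map, pool_size=2):
    # Down-sample each feature-map row by taking the maximum of each
    # non-overlapping window of `pool_size` values.
    n_f, length = feature_map.shape
    n = length // pool_size
    return feature_map[:, :n * pool_size].reshape(n_f, n, pool_size).max(axis=2)
```

A single summing filter of size 2 applied to [0, 1, 2, 3, 4, 5] yields the feature map [1, 3, 5, 7, 9], which max-pooling with pool-size 2 reduces to [3, 7].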

First Proposed Model Architecture
The first proposed model architecture is presented in Figure 7. The model is structured as three main CNN blocks, each of which includes four layers: 1D CNN, MaxPool 1D, BatchNormalization, and spatial dropout. After the third CNN block, the output is passed to a flatten layer and then to three dense layers with the 'relu' activation function. Using more than one dense layer helps improve the model's performance. The next layer is BatchNormalization, followed by a dropout layer, and finally a sigmoid layer that classifies the signal as either inter-ictal or pre-ictal. The main goal of the layers after the third block is to avoid overfitting of the data.
The model was initially constructed with just one block containing four layers: 1-D CNN, MaxPool 1D, BatchNormalization, and spatial dropout. During the experiments, the effect of adding blocks two and three on the performance was examined. The experimental results showed that the performance increased when block two and then block three were added, while adding more than three blocks began to reduce the model performance. Hence, three blocks were adopted in the proposed model.
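A hedged Keras sketch of this three-block architecture follows. The layer sequence matches the description above, but the filter counts, kernel sizes, dense-layer widths, and dropout rates are illustrative assumptions, not the paper's exact hyperparameters:

```python
from tensorflow.keras import layers, models

def build_model(input_len, n_filters=(32, 64, 128)):
    # Three CNN blocks, each: Conv1D -> MaxPool1D -> BatchNorm -> SpatialDropout.
    # Filter counts and kernel size below are assumptions for illustration.
    m = models.Sequential([layers.Input(shape=(input_len, 1))])
    for f in n_filters:
        m.add(layers.Conv1D(f, kernel_size=5, activation='relu'))
        m.add(layers.MaxPooling1D(pool_size=2))
        m.add(layers.BatchNormalization())
        m.add(layers.SpatialDropout1D(0.2))
    m.add(layers.Flatten())
    for units in (64, 32, 16):            # three dense 'relu' layers
        m.add(layers.Dense(units, activation='relu'))
    m.add(layers.BatchNormalization())    # anti-overfitting tail
    m.add(layers.Dropout(0.3))
    m.add(layers.Dense(1, activation='sigmoid'))  # pre-ictal vs. inter-ictal
    return m
```

Each channel's window (e.g., 2560 samples for a 10-s window at 256 Hz) would be fed through this model independently before the majority vote.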

Conv1D (filters, kernel-size, strides, activation = none)
▻ Filters: Number of sliding windows; ▻ Kernel-size: The size of the sliding window; ▻ Strides: Modifies the amount of movement over a sample of data; ▻ Activation: The activation function is a mathematical port that connects the current neuron's input to its output, which is passed to the next layer as input. ▰ Pooling layer: This is a down-sampling layer that uses the pooling process to lower the input data's dimension (reduce the amount of computation performed in the network) and increase the robustness of feature extraction while keeping all the important information. Average, max, and sum are all considered as pooling layer operations. The max-pooling operation is used more often in modern deep learning architecture.
Most used in 1D  MaxPooling1D (pool-size = 2) ▻ Pool-size: take the maximum value in a defined spatial window.
▰ Batch normalization: A method for training deep neural networks that normalizes each contribution to a layer for every small batch. This slows down the learning process and reduces the number of training epochs required to train and build deep neural networks. ▰ Dropout: Overfitting is avoided by setting input units to zero at random with a rate of frequency at each step during the training period. Inputs that are not set to zero are scaled up by 1/(1-rate) so that the total sum remains the same for all inputs. ▻ Dropout (rate = 0.1~0.5)

First Proposed Model Architecture
The first proposed model architecture is presented in Figure 7. The model is structured as three main CNN blocks, each of which includes four layers, 1D CNN, Max Pool 1D, BatchNormalization, and spatial dropout. After the end of the third CNN block, the output is passed to a flatten layer and then three dense layers with the activation function 'relu'. The presence of more than one dense layer aids in improving the model's performance. The next layer is BatchNormalization, followed by a dropout layer, and finally a Sigmoid layer for classification of the signal as either inter-ictal or pre-ictal. The main goal of the layers after the third block is to avoid overfitting of the data.
The model was initially constructed with just one block, containing four layers, 1-D CNN, maxPool 1D, BatchNormalization, and spatial dropout. During the experiments, the effect of adding blocks two and three on the performance were examined. The experimental results showed that the performance increased when block two and then block three were added. Adding more than three blocks began to reduce the model performance. Hence, three blocks were adopted in the proposed model. Activation: The activation function is a mathematical port that connects the current neuron's input to its output, which is passed to the next layer as input.

Layers of CNN Model Architecture ▰ Convolution layer:
In this layer, the input dataset is convolved with a kernel or weight. A feature map is considered the output of this layer. The stride parameter determines how much the kernel convolves with the input data. By learning from the diverse input signals, this operation is considered as a feature extractor.

Conv1D (filters, kernel-size, strides, activation = none)
▻ Filters: Number of sliding windows; ▻ Kernel-size: The size of the sliding window; ▻ Strides: Modifies the amount of movement over a sample of data; ▻ Activation: The activation function is a mathematical port that connects the current neuron's input to its output, which is passed to the next layer as input. ▰ Pooling layer: This is a down-sampling layer that uses the pooling process to lower the input data's dimension (reduce the amount of computation performed in the network) and increase the robustness of feature extraction while keeping all the important information. Average, max, and sum are all considered as pooling layer operations. The max-pooling operation is used more often in modern deep learning architecture.
Most used in 1D  MaxPooling1D (pool-size = 2) ▻ Pool-size: take the maximum value in a defined spatial window.
Layers of CNN Model Architecture
▰ Convolution layer: In this layer, the input data are convolved with a kernel of learned weights, and the output is a feature map. The stride parameter determines how far the kernel moves across the input at each step. Because the kernel weights are learned from the diverse input signals, this operation acts as a feature extractor.

Conv1D (filters, kernel-size, strides, activation = none)
▻ Filters: the number of sliding windows (kernels); ▻ Kernel-size: the size of each sliding window; ▻ Strides: the amount the window moves over the data at each step; ▻ Activation: the activation function, a mathematical gate that maps the current neuron's input to its output, which is passed to the next layer as input.
▰ Pooling layer: This is a down-sampling layer that lowers the dimension of the input data (reducing the amount of computation performed in the network) and increases the robustness of the extracted features while keeping the important information. Average, max, and sum are all common pooling operations; max pooling is the one most often used in modern deep learning architectures.
Most used in 1D: MaxPooling1D (pool-size = 2) ▻ Pool-size: the size of the spatial window from which the maximum value is taken.
▰ Batch normalization: A method for training deep neural networks that normalizes the inputs to a layer for every mini-batch. This stabilizes and speeds up the learning process, reducing the number of training epochs required to train deep neural networks. ▰ Dropout: Overfitting is reduced by randomly setting input units to zero with a given rate at each step during training. Inputs that are not set to zero are scaled up by 1/(1 − rate) so that the expected sum over all inputs remains the same. ▻ Dropout (rate = 0.1~0.5)
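The pooling and dropout operations described above can be illustrated with a minimal NumPy sketch (not the authors' implementation; the function names are ours). It shows non-overlapping 1-D max pooling and "inverted" dropout, where surviving inputs are scaled by 1/(1 − rate) so that the expected sum is preserved.

```python
import numpy as np

def max_pool_1d(x, pool_size=2):
    """Down-sample a 1-D signal by taking the max of each non-overlapping window."""
    n = len(x) // pool_size
    return x[:n * pool_size].reshape(n, pool_size).max(axis=1)

def inverted_dropout(x, rate, rng):
    """Zero inputs with probability `rate`; scale survivors by 1/(1 - rate)."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
print(max_pool_1d(x))                    # [3. 5. 4.]

rng = np.random.default_rng(0)
y = inverted_dropout(np.ones(10000), rate=0.3, rng=rng)
print(y.mean())                          # close to 1.0, as expected
```

Note that dropout is active only during training; at inference time the layer passes its input through unchanged.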

First Proposed Model Architecture
The first proposed model architecture is presented in Figure 7. The model is structured as three main CNN blocks, each of which includes four layers: 1D CNN, MaxPool 1D, BatchNormalization, and spatial dropout. After the third CNN block, the output is passed to a flatten layer and then to three dense layers with the 'relu' activation function; using more than one dense layer helps to improve the model's performance. The next layer is BatchNormalization, followed by a dropout layer, and finally a sigmoid layer that classifies the signal as either inter-ictal or pre-ictal. The main goal of the layers after the third block is to avoid overfitting.
The model was initially constructed with just one block containing four layers: 1-D CNN, MaxPool 1D, BatchNormalization, and spatial dropout. During the experiments, the effect of adding blocks two and three on the performance was examined. The experimental results showed that the performance increased when block two and then block three were added, while adding more than three blocks began to reduce the model's performance. Hence, three blocks were adopted in the proposed model.
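A simple way to see how the three stacked blocks act on the signal is to track the temporal dimension through them. The sketch below does this arithmetic for a hypothetical configuration (10 s window at the CHB-MIT sampling rate of 256 Hz, kernel size 5, 'valid' convolution, pool size 2); the paper does not report these hyperparameters, so the numbers are illustrative only. BatchNormalization and spatial dropout leave the length unchanged.

```python
def conv1d_out_len(n, kernel_size, stride=1):
    """Output length of a 'valid' 1-D convolution."""
    return (n - kernel_size) // stride + 1

def pool1d_out_len(n, pool_size=2):
    """Output length after non-overlapping max pooling."""
    return n // pool_size

# Hypothetical settings: 10 s window at 256 Hz, three Conv1D -> MaxPool1D blocks.
n = 10 * 256                     # 2560 samples per epoch
for block in range(3):
    n = conv1d_out_len(n, kernel_size=5)
    n = pool1d_out_len(n, pool_size=2)
print(n)                         # 316 features per channel reach the flatten layer
```

Each block roughly halves the temporal dimension, which is what keeps the flatten/dense stage after block three tractable.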


Second Proposed Model Architecture
The second proposed model architecture is presented in Figure 8. The model is structured as four main CNN blocks, each of which includes four layers: two layers of 1D CNN followed by MaxPool 1D and spatial dropout. After the fourth CNN block, the output is passed to a dropout layer and finally a sigmoid layer that classifies the signal as inter-ictal or pre-ictal. This model differs from the first model, and the goal was to compare and study the effects of varying the deep learning model on the results. The differences are that each block has one more 1-D CNN layer and that dropout occurs only once, after the four blocks. A fourth block was also added to make the model complex enough for the EEG signal prediction process.

Data augmentation was integrated in the model to guarantee a large enough dataset: deep learning models require large datasets in order to achieve two main goals, the first being to avoid overfitting and the second being to achieve high performance in validation and testing. In this paper, data augmentation was applied using the noise-addition technique.
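Noise-addition augmentation can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the paper does not report the noise distribution or level, so `noise_std` and the Gaussian choice are assumptions.

```python
import numpy as np

def augment_with_noise(epochs, noise_std=0.01, rng=None):
    """Double the dataset by appending copies with additive Gaussian noise.

    `noise_std` is a hypothetical noise level; the paper does not report the
    exact parameters of its noise-addition step.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    noisy = epochs + rng.normal(0.0, noise_std, size=epochs.shape)
    return np.concatenate([epochs, noisy], axis=0)

epochs = np.zeros((100, 2560))       # 100 example EEG segments of 2560 samples
augmented = augment_with_noise(epochs)
print(augmented.shape)               # (200, 2560)
```

The labels of the noisy copies are inherited from the originals, since small additive noise does not change whether an epoch is pre-ictal or inter-ictal.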

Results
Both models were trained with various batch sizes. The dataset was divided into 70 percent for training and 30 percent for testing. To effectively assess the predictive power of the model, the training set was further split, with 20 percent of it used for validation. Different optimizers were tested, such as SGD and Adam. In the experimental analysis, the Adam optimizer outperformed the other optimizers by a wide margin; hence, Adam was chosen as the optimizer for the proposed model.
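The 70/30 train-test split, with a further 20% of the training set held out for validation, can be sketched as follows (an illustrative NumPy version; function and variable names are ours):

```python
import numpy as np

def split_indices(n_samples, test_frac=0.30, val_frac=0.20, seed=0):
    """Shuffle indices, hold out 30% for testing, then carve 20% of the
    remaining training set out for validation, as described in the paper."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(n_samples * test_frac)
    test, rest = idx[:n_test], idx[n_test:]
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))   # 560 140 300
```

With 1000 samples this yields 560 for training, 140 for validation, and 300 for testing, i.e., the validation fraction is 20% of the training portion, not of the whole dataset.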
In the following tables (Tables 2-7), a sample of the results is presented for model 1 and model 2, with different k window values, where k = 8 min, 30 s, and 10 s, in terms of accuracy, area under the curve, precision, and recall. It is clear from the results presented in these tables that model 1 outperforms model 2. Furthermore, as shown in the tables, the model performs best for k = 10 s, followed by k = 30 s, and finally for k = 8 min. In Figure 9, examples of the validation vs. training accuracy are presented. A comparison between the results for testing, validation, and training is described in Tables 8 and 9.

Discussion
It is notable from the previous section that there were promising results for both models 1 and 2 with the DL approach. The proposed model can help medical systems to predict seizures in patients with a previous medical history, as well as patients lacking a medical history.
The work presented in this paper shows the great effect on the results of segmenting the data into smaller epochs in order to enrich the data fed into the DL model. Figure 10 shows the effect of segmentation on the proposed model 1, using a window of length L, where L = 10 s, 30 s, and 8 min.
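The segmentation step can be sketched as below, assuming non-overlapping windows (the paper does not state whether the windows overlap) and the CHB-MIT sampling rate of 256 Hz:

```python
import numpy as np

def segment_signal(signal, fs, window_s):
    """Split a single-channel recording into non-overlapping epochs of
    `window_s` seconds; any trailing partial window is dropped."""
    win = int(window_s * fs)
    n = len(signal) // win
    return signal[:n * win].reshape(n, win)

fs = 256                              # CHB-MIT EEG sampling rate (Hz)
recording = np.zeros(3600 * fs)       # one hour of example signal
print(segment_signal(recording, fs, 10).shape)   # (360, 2560)
print(segment_signal(recording, fs, 30).shape)   # (120, 7680)
```

Shorter windows multiply the number of training epochs extracted from the same recording, which is consistent with the best results being obtained for L = 10 s.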

A comparative study was undertaken against prior related work using the same benchmark dataset (CHB-MIT), and the results are presented in Table 10. As shown in the table, the proposed work outperforms state-of-the-art methods in seizure prediction by testing a larger number of samples and by being patient-independent. In [19], a binary single-dimensional CNN (BSDCNN) for epileptic seizure prediction was proposed; their model uses 1D convolutional kernels in the prediction process. The authors of [20] implemented a one-dimensional capsule network for epileptic seizure prediction that achieved an accuracy of 97.74% using only two EEG signal channels. In [21], a patient-specific prediction model using a temporal-spatial multi-scale CNN was presented; working with 16 patients, they achieved an accuracy of 93.3%. Finally, a patient-specific epileptic seizure prediction model was presented in [23]; they converted raw signals into images, and their model reached an accuracy of 94.3% when seizure prediction was performed for 5 patients. However, the conversion of EEG signals into images could cause the loss of important information that a CNN can extract from raw signals. The results showed that the proposed model performed better in different cases, such as when using 6 and 10 patients for prediction. The improvements were 0.87% compared to [19], 0.13% compared to [20], 0.1% compared to [21], and 3.54% compared to [23], respectively. The work presented in this paper provides a generic model that can use patients' historical data to predict seizures for other, new patients with high accuracy and precision.
Table 10. Comparative analysis study with previous state-of-the-art work in the same domain.

Conclusions
A generalized hybrid epileptic seizure prediction model using DL techniques was introduced to minimize the risks that can affect patients during a seizure. The proposed model first performed pre-processing on the EEG signals, using the DWT technique, in order to acquire a noise-free signal, followed by segmentation of the channel signal into segments of different k-window lengths in order to enhance the performance of epileptic seizure prediction. The proposed model achieved an accuracy of 97.87% for 6 patients and 93.4% for 16 patients, with improvement rates of 0.87% compared to the work in [19], 0.13% compared to [20], 0.1% compared to [21], and 3.54% compared to [23], respectively. These results show that the proposed work enhances the performance of epileptic seizure prediction through examining EEG signals with a high accuracy compared to state-of-the-art models in the literature. The findings in this paper could help to save patients' lives, improve understanding of the nature of pre-ictal vs. inter-ictal signals, and support patients in taking important precautions to reduce risks during seizures.
A limitation of the proposed work is that only two categories of signal, pre-ictal and inter-ictal, are analyzed; further work can extend the model to other seizure categories. In addition, further research can examine integrating the proposed model with cloud services and extending it for application to IoT platforms.

Conflicts of Interest:
The authors declare no conflict of interest.