Training Datasets for Epilepsy Analysis: Preprocessing and Feature Extraction from Electroencephalography Time Series

Abstract: We describe 20 datasets derived through signal filtering and feature extraction steps applied to the raw time series EEG data of 20 epileptic patients, as well as the methods we used to derive them. Background: Epilepsy is a complex neurological disorder which has seizures as its hallmark. Electroencephalography plays a crucial role in epilepsy assessment, offering insights into the brain’s electrical activity and advancing our understanding of seizures. The availability of tagged training sets covering all seizure phases (inter-ictal, pre-ictal, ictal, and post-ictal) is crucial for data-driven epilepsy analyses. Methods: Using the sliding window technique with a two-second window length and a one-second sliding step, we extract multiple features from the preprocessed EEG time series of 20 patients from the Freiburg Seizure Prediction Database. In addition, we assign a class label to each instance to specify its corresponding seizure phase. All these operations are performed by a software application we developed, named Training Builder. Results: The 20 tagged training datasets each contain 1080 univariate and bivariate features and are openly and publicly available. Conclusions: The datasets support the training of data-driven models for seizure detection, prediction, and clustering, based on feature engineering. Dataset: Data are available at https://doi.org/10.5281/zenodo.10808054. Dataset License: CC-BY 4.0


Summary

Problem Statement
Epilepsy is a neurological disorder that affects millions globally. It is characterized by recurrent seizures and presents substantial challenges in medical diagnosis and clinical management. Central to these challenges is the analysis of electroencephalography (EEG) time series data. An EEG captures the brain's electrical activity and is critical for identifying and understanding epileptic seizures. Despite the richness of information contained within EEG signals, the raw time series data, as recorded by sensors, present considerable difficulties for direct analysis due to their complexity and the high variability of signal characteristics among patients. This complexity is compounded by a lack of adequate techniques for directly analyzing raw series, necessitating advanced data processing methodologies for effective interpretation.
Our work underscores the importance of employing data processing techniques for EEG signal analysis in epilepsy. Our datasets, which are derived from the Freiburg EEG Database [1] through advanced preparation analyses, encompass diverse seizure phases and form a comprehensive foundation for the development of advanced diagnostic tools. By processing raw EEG signals and extracting a large number of features from the filtered signals, our approach may support research in this domain.
The main objective of this paper is to provide a foundational corpus for analyzing seizures and training analytical models for seizure detection, prediction, and clustering. We describe the derivation of 20 datasets from EEG data of patients with focal epilepsy through signal filtering and feature extraction and make these datasets freely available.
Our datasets provide insights into the dynamics of epileptic seizures, encompassing recordings across the brain states that are critical for comprehensive seizure analysis. As summarized in Table 1, these states include the pre-ictal, ictal, post-ictal, and inter-ictal phases, each one offering a distinct perspective on the seizure cycle. This segmentation underlines the database's utility in exploring the mechanisms of seizure onset, progression, and recovery, enhancing our understanding and prediction of epileptic events. Many studies [2,3] highlight that the duration of epilepsy phases can be quite variable and patient-specific, influenced by factors such as the type of epilepsy, the nature of individual seizures, and the physiological state of the patient at the time.

| State | Description | Abbreviation |
| --- | --- | --- |
| Pre-ictal | This state occurs before the onset of a seizure, without a standard duration due to the unclear starting point. | PRE |
| Ictal | This state starts with the onset of the seizure and concludes with the end of the seizure. | IKTAL |
| Post-ictal | This state begins immediately after the ictal phase. | POST |
| Inter-ictal | This state occurs after the post-ictal phase and concludes before the onset of the pre-ictal state of a subsequent seizure. | INTER |

This work not only contributes to the epilepsy research community by providing access to meticulously annotated EEG data, but also fosters innovation in Data Science (DS) methodologies for analyzing large amounts of data generated at high frequencies. Through rigorous data processing, we aim to advance epilepsy monitoring and improve patient outcomes, highlighting the critical role of interdisciplinary research in medical diagnostics and underscoring the necessity of novel approaches to raw EEG data processing. For the sake of clarity, this manuscript is aimed at a specialized audience, including computer engineers, data scientists, and related professionals focused on developing software (SW) for seizure analysis. It is not intended for engineers and physicists at epilepsy centers who require ready-to-use seizure detection SW.

Related Works
The datasets discussed in this study, in whole or in part, as well as others generated using the same methods but with varying temporal parameters (refer to Sections 3 and 6), served as the training data for developing models aimed at epilepsy analysis. These models were obtained utilizing DS techniques and Machine Learning (ML) algorithms. Table 2 lists the related research efforts, which primarily focused on the detection of epileptic seizures (for further details, see Section 4). Specifically, the work cited in [4] provides a comprehensive account of the seizure detection analyses, efforts to reduce false alarms, and the portability of models, conducted using the training datasets we make openly available. Table 2 also presents the performance metrics of models trained with ML algorithms (k-NN, MLP, SVM, BayesNet) using the datasets described in this paper, highlighting the effectiveness of our methodology in creating them.

About This Paper
The rest of this paper is organized as follows: in Section 2, we provide a detailed description of the training datasets that we make available, introducing the Freiburg Seizure Prediction EEG database and outlining its significance in epilepsy research. In Section 3, we present our methods, including the software tool developed for signal processing and feature extraction from EEG time series, which is called the Training Builder (TrB) tool. In Section 4, we explore various DS techniques for epilepsy analysis, focusing on prediction, forecasting, and detection. Section 5 concludes the paper with reflections on the study's implications and future research directions. Finally, Section 6 provides information about requesting additional training datasets.

Data Description
Being able to access large quantities of neurological data from individuals with epilepsy is crucial for analysis when using DS methodologies and techniques. In this section, we describe the datasets generated by our data preprocessing and feature extraction SW tool (TrB tool), which analyzes EEG signals from patients in the Freiburg Seizure Prediction Database.

Training Datasets
The training datasets we provide consist of 20 csv files obtained through the TrB tool [8], one for each epileptic patient of the Freiburg Seizure Prediction Database.
The Freiburg EEG Database stands out as a fundamental resource in epilepsy research, as it has been carefully curated to support advancements in detection and prediction and enhance our understanding of the underlying mechanisms of seizures. It comprises intracranial EEG recordings from a selected cohort of 21 patients (although data from only 20 patients are available to us because patient number 12 is missing), each dealing with drug-resistant focal epilepsy. These patients underwent comprehensive pre-surgical evaluation at the University Hospital of Freiburg, Germany, making the dataset particularly relevant for those investigating the potential of surgical interventions in epilepsy treatment. The EEG recordings comprise up to 128 channels per patient. They encompass a wide array of brain states, including pre-ictal, ictal, post-ictal, and inter-ictal phases (Table 1). These phases provide an overall view, which is necessary for developing algorithms that can accurately distinguish between normal and abnormal brain activity. The temporal resolution of the database is very high, with recordings sampled at 256 Hz. This ensures that the fast dynamics of epileptic activity are captured in detail, which is crucial for analyzing the rapid changes associated with seizure onset and progression. Each patient is monitored for 24 h on average. In addition, the DB includes an extensive array of patient metadata (Table 3) [9]. Each of the 20 file names we provide uniquely identifies the patient using a number as suffix, so the file Pat001.csv refers to the training data of patient number 001.
In each file we have a column header with three metadata fields as follows:
• Registration: the Freiburg EEG database has several registrations for each patient. This column specifies the registration number from which the training data are extracted.
• Actual Timestamp: this column specifies the time interval, in terms of the initial sample and the final sample, from which all the features are extracted. This interval is also related to the length of the selected window (the L parameter), e.g., the symbol 1_512 states that a 2 s window was used because it contains 512 samples (i.e., 2 * 256, where 256 is the sample frequency of the Freiburg DB recordings).
• Actual TAG: this column is the class value identified in the actual timestamp interval. It summarizes which portion of the signal the record (data vector) refers to and can therefore take on the values in the set {PRE, IKTAL, POST, INTER} (see Table 1).
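As a minimal sketch of how the Actual Timestamp field can be decoded, the Python snippet below splits a value such as 1_512 into its first and last sample and converts the span to seconds. The function name is hypothetical, and the code assumes that both sample indices are inclusive and separated by an underscore, as the 1_512 example suggests.

```python
SAMPLING_RATE_HZ = 256  # sampling frequency of the Freiburg recordings, as stated above


def decode_timestamp(stamp, fs=SAMPLING_RATE_HZ):
    """Return (first_sample, last_sample, window_length_in_seconds).

    Assumption: the field has the form "<first>_<last>" with both
    sample indices inclusive, as in the example "1_512".
    """
    first_text, last_text = stamp.split("_")
    first, last = int(first_text), int(last_text)
    n_samples = last - first + 1  # inclusive interval
    return first, last, n_samples / fs


print(decode_timestamp("1_512"))  # (1, 512, 2.0)
```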
The remaining fields of the csv file represent the features, which are calculated by applying the sliding window technique (see Section 3). Each feature name is in the form EiBjFCMk, where:
• Ei is the i-th electrode number, with i = 1, 2, ..., 6. Electrodes 1, 2, and 3 are in focus because they are located in the epileptic brain areas, while electrodes 4, 5, and 6 are out of focus, as they are situated in the healthy regions of the brain.
• Bj is the j-th band in which the signal is filtered, with j = 1, 2, ..., 6.
• FC is the code of the extracted feature. Table 4 lists all implemented features and specifies, among other things, whether a feature is univariate or bivariate (UB column) (for further details on feature descriptions and formulas, see [5,8]).
• Mk is the calculation method used to compute features. For univariate features, the value is always MU. In contrast, for bivariate features, the value can be MA if the reference signal is sourced from the preceding L window, or MB if the reference signal is the zero constant signal.
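The naming scheme above can be unpacked programmatically. The following Python sketch is illustrative only: it assumes that the four parts are concatenated literally (for example E3B5XYZMA), and XYZ is a placeholder, since the real feature codes are those listed in Table 4. The function and pattern names are hypothetical.

```python
import re

# Assumed literal layout: "E<i>B<j><FC><Mk>", e.g. "E3B5XYZMA".
# "XYZ" is a placeholder; the real feature codes are those of Table 4.
FEATURE_NAME = re.compile(r"^E([1-6])B([1-6])(.+)(MU|MA|MB)$")


def parse_feature_name(name):
    """Split a feature column name into electrode, band, code and method."""
    match = FEATURE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected feature name: {name!r}")
    electrode, band, code, method = match.groups()
    return {
        "electrode": int(electrode),
        "in_focus": int(electrode) <= 3,  # electrodes 1-3 are in focus
        "band": int(band),
        "feature_code": code,
        "method": method,
        "bivariate": method in ("MA", "MB"),  # MU marks univariate features
    }


print(parse_feature_name("E3B5XYZMA"))
```

If the column names in the csv files use a different separator or casing, only the regular expression needs to change.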
Each feature is extracted from a window of length L of the EEGs, registered by the 6 electrodes Ei, previously filtered in the 6 Bj bands. Thus, we have 1080 features:

$$6\ \text{bands} \times 6\ \text{electrodes} \times \left(14\ \text{univariate features} + 8\ \text{bivariate features} \times 2\ \text{calculation methods}\right) = 1080 \qquad (1)$$

All the datasets that we make available are obtained using a window length L = 2 s and a sliding step S = 1 s as temporal parameters of the sliding window. This choice follows the methodology described in [4]. A selection of windowing time parameters L and S is reported in [6]. Moreover, the window parameters L = 2 s and S = 2 s can be obtained directly from the provided training sets by discarding every other row, so that only non-overlapping windows remain. In conclusion, for each windowed signal of length L, we have 1083 fields (3 metadata + 1080 features).
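The arithmetic behind the feature count, and the derivation of non-overlapping windows from the provided files, can be checked with a short Python sketch. It assumes that timestamps use 1-based inclusive sample indices (1_512, 257_768, 513_1024, and so on); the helper name is hypothetical.

```python
N_BANDS, N_ELECTRODES = 6, 6
N_UNIVARIATE, N_BIVARIATE, N_METHODS = 14, 8, 2

n_features = N_BANDS * N_ELECTRODES * (N_UNIVARIATE + N_BIVARIATE * N_METHODS)
n_fields = 3 + n_features  # 3 metadata columns + features
print(n_features, n_fields)  # 1080 1083


def non_overlapping(timestamps, window_samples=512):
    """Return the row indices of windows that do not overlap (L = 2 s, S = 2 s).

    Assumption: timestamps look like "1_512", "257_768", "513_1024", ...
    with 1-based inclusive sample indices.
    """
    kept = []
    for row, stamp in enumerate(timestamps):
        first = int(stamp.split("_")[0])
        if (first - 1) % window_samples == 0:  # window starts on a 2 s boundary
            kept.append(row)
    return kept


print(non_overlapping(["1_512", "257_768", "513_1024", "769_1280"]))  # [0, 2]
```

Selecting rows by timestamp rather than by row parity is a design choice made here so that the selection still works if a file contains several registrations.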

Methods
In this section, we provide a description of all the methods used to create the 20 final training datasets that we are making available. These datasets are the result of EEG preprocessing, whose signal filtering and feature extraction steps are performed using the TrB SW tool that we developed for time series analysis.

Training Builder Tool
The TrB tool, a modular and extensible SW application, filters large quantities of time series (using low-pass, band-pass or high-pass filters) and extracts from them all the features listed in Table 4, using the sliding window technique. The final outputs of the tool are the training sets, which can be used as input for the training of models with data-driven learning techniques. Therefore, each dataset varies depending on:
• Time series (or the recorded part of them).
• Filter parameters: type (low-pass, band-pass, etc.) and cut-off frequencies.
The time series of the signals are analysed by the TrB using the sliding window technique. Signal windowing is achieved by using two user-selectable temporal parameters (or window parameters):
• L: the length of the signal window to be analysed, expressed in seconds.
• S: the slippage of the window (i.e., how often the algorithm is applied), expressed in seconds.
If the sliding step size S is smaller than the window size L, the windows overlap, while if S = L, we obtain a tumbling window.
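The windowing scheme described above can be sketched in a few lines of Python. This is an illustration, not the TrB implementation: the function name is hypothetical, and the indices are 0-based and half-open (Python slice convention), which differs from the 1-based timestamps stored in the csv files.

```python
def sliding_windows(n_samples, fs, L, S):
    """Yield (start, stop) sample indices of each window.

    L and S are in seconds; indices are 0-based and half-open,
    so signal[start:stop] selects one window.
    """
    window = int(round(L * fs))
    step = int(round(S * fs))
    if window <= 0 or step <= 0:
        raise ValueError("L and S must be positive")
    # Only complete windows are produced; a trailing partial window is skipped.
    for start in range(0, n_samples - window + 1, step):
        yield start, start + window


fs = 256  # Hz
print(list(sliding_windows(4 * fs, fs, L=2, S=1)))  # overlapping windows
print(list(sliding_windows(4 * fs, fs, L=2, S=2)))  # tumbling windows
```

With four seconds of signal, S = 1 gives three windows that overlap by one second, while S = L = 2 gives two disjoint windows.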
Through TrB's GUI, the user can select which and how many univariate and bivariate features to compute. For bivariate features, the user has to choose which reference signal to use to calculate the feature. Currently, this reference signal can be of two types:
• The previous L: i.e., the same signal taken at a previous L interval.
• The zero signal: i.e., the zero constant signal.
Each final dataset consists of a comma-separated values (csv) file, where features are recorded as vectors.
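To make the two reference-signal options concrete, the Python sketch below builds the reference for a window under either method and applies a toy bivariate measure to it. The mean absolute difference used here is a stand-in chosen for brevity and is not one of the Table 4 features; the function names are hypothetical, and the code assumes that "previous L" means the L samples immediately preceding the current window.

```python
def reference_window(signal, start, length, method):
    """Return the reference samples for the window signal[start:start + length].

    "MA": the same signal over the immediately preceding window (assumption).
    "MB": the zero constant signal.
    """
    if method == "MB":
        return [0.0] * length
    if method == "MA":
        if start < length:
            return None  # no complete preceding window exists
        return signal[start - length:start]
    raise ValueError(f"unknown method: {method!r}")


def mean_abs_difference(window, reference):
    """Toy bivariate measure (illustrative only, not a Table 4 feature)."""
    return sum(abs(a - b) for a, b in zip(window, reference)) / len(window)


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
start, length = 2, 2
current = signal[start:start + length]  # [3.0, 4.0]
print(mean_abs_difference(current, reference_window(signal, start, length, "MA")))  # 2.0
print(mean_abs_difference(current, reference_window(signal, start, length, "MB")))  # 3.5
```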

Software Architecture
The TrB SW architecture has been designed following the Client/Server architectural model, in which the Server part is composed of the algorithms for massively extracting features, windowing and filtering functions, and other support utilities, while the Client part is composed of a web-oriented Graphical User Interface application, which enables output result visualization and shows a form for user input selection and validation. Figure 1 shows the high-level diagram of the designed and implemented SW architecture, including the input data sources and the outputs delivered; accordingly, two possible time series data sources are provided:
• Recorded in text format (txt or csv).
• Stored in a time series DB (TSDB). Using a TSDB, instead of formatted files, allows optimization of the management of time series, with regard to their storage and recovery, while ensuring high reliability and availability.
The output, in contrast, consists of the features computed on these time series and is provided in csv format. The csv file can be saved by the client and stored on a local file system.
TrB has been developed to be as extensible as possible, with the aim of being able to run feature calculation algorithms developed with different programming languages; currently, Java, C/C++, Matlab and Python are natively supported, but compatibility with other languages can be easily configured.

Epilepsy Analysis
Data Mining, ML, Deep Learning, AI, and other DS techniques revolutionize epilepsy analysis by interpreting complex neurological data to enhance seizure detection, predict seizures, identify epileptogenic zones, personalize treatments, and plan surgeries, ultimately improving patient outcomes. Analyzing EEG data, patient information, and feature-based training datasets deepens our understanding of epilepsy. This section introduces potential applications for the 20 provided training datasets, expanding on the studies listed in Table 2.

Prediction, Forecasting, and Detection
Predicting the pre-ictal state is highly valuable for managing epilepsy. Research on epileptic seizure prediction has been underway for several decades [10], leveraging both ML and Deep Learning algorithms [11,12]. These approaches aim to enhance the accuracy and timeliness of predictions, thereby offering significant improvements in patient care and the quality of life of people with epilepsy.
The definition of seizure prediction involves utilizing an alert system when an algorithm detects the pre-ictal period, indicating that a seizure will occur within a well-defined period known as the Seizure Occurrence Period, after a certain time horizon that allows for intervention, which is referred to as the Seizure Prediction Horizon [13].The results of this approach are not always satisfactory, given the high complexity of the neurological phenomenon under analysis [14,15].
On the other hand, seizure forecasting, which is a new development in EEG analysis, takes a probabilistic approach, in which the patient is not alerted to an imminent seizure but instead is provided a constant analysis of seizure likelihood.This method identifies states of low, moderate, and high risk, continuously conveying this information to the user [16,17].
The main difference between seizure prediction and forecasting lies in their approach: prediction is based on a deterministic alert for a specific event, whereas forecasting evaluates the likelihood of an event over time, offering a continuous risk assessment without generating specific alerts.
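The contrast between the two approaches can be illustrated with a toy Python sketch. It assumes a hypothetical model that outputs one seizure probability per window; the thresholds (0.8 for the alert, 0.3 and 0.7 for the risk levels) and the function names are arbitrary illustrative choices, not values taken from the cited works.

```python
def prediction_alerts(probabilities, threshold=0.8):
    """Prediction: a deterministic alert is raised whenever the threshold is crossed."""
    return [p >= threshold for p in probabilities]


def forecast_risk(probabilities, low=0.3, high=0.7):
    """Forecasting: every window is mapped to a risk level and no alert is raised."""
    levels = []
    for p in probabilities:
        if p < low:
            levels.append("low")
        elif p < high:
            levels.append("moderate")
        else:
            levels.append("high")
    return levels


model_output = [0.1, 0.5, 0.9]  # made-up probabilities for three windows
print(prediction_alerts(model_output))  # [False, False, True]
print(forecast_risk(model_output))      # ['low', 'moderate', 'high']
```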
Seizure prediction has been studied recently in many works, but most of the existing works that rely on EEG data analysis concern seizure detection [18,19]. The main goal of a seizure detection model is to accurately identify the occurrence of seizures in real time or from recorded data. This involves distinguishing seizure activities (ictal phase) from non-seizure activities (pre-ictal or inter-ictal phases) within the brain's electrical signals.
The detection model aims to enable timely intervention, improve patient management, and reduce the risk associated with unattended seizures.
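For a detection task, the class labels of the training files can be reduced to a binary target. The Python sketch below is a minimal illustration on a made-up four-row sample: the column names follow the metadata fields described in Section 2 but their exact spelling in the files is an assumption, E1B1XYZMU is a placeholder feature column, and treating the post-ictal phase as non-seizure by default is a choice made here, not one prescribed by the paper.

```python
import csv
import io

# Tiny stand-in for a training file with made-up values;
# "E1B1XYZMU" is a placeholder feature column.
SAMPLE_CSV = """Registration,Actual Timestamp,Actual TAG,E1B1XYZMU
1,1_512,INTER,0.12
1,257_768,PRE,0.30
1,513_1024,IKTAL,0.95
1,769_1280,POST,0.40
"""


def detection_label(tag, post_label=0):
    """Map a seizure-phase tag to 1 (seizure) or 0 (non-seizure)."""
    if tag == "IKTAL":
        return 1
    if tag in ("PRE", "INTER"):
        return 0
    if tag == "POST":
        return post_label  # assumption: post-ictal counts as non-seizure
    raise ValueError(f"unknown tag: {tag!r}")


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
labels = [detection_label(row["Actual TAG"]) for row in rows]
print(labels)  # [0, 0, 1, 0]
```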

Other Investigations
Other types of analyses can be conducted on datasets obtained from the EEG signals of epileptic patients. An unsupervised clustering-based approach could be useful, allowing us to group patients with respect to some of their characteristics [20,21]. In more detail, meta-characteristics, such as statistical features or complexity measures, can be calculated for each patient from the datasets with the extracted features. This new dataset, which may be referred to as a meta-set, contains 20 records, with 1 record per patient, and is the basis for new analyses such as a cluster analysis, with the aim of determining similar patients considering epileptic characteristics. This newly extracted knowledge is useful for directing new pharmacological therapies to groups of patients and not just to a single patient.
Although obtaining a single detection, prediction, or forecasting model for all patients is complex, it may be simpler and more efficient to develop one for a group of similar patients.
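One possible way to build such a meta-set is sketched below in Python. It is only an illustration: the mean and the population standard deviation stand in for the statistical meta-characteristics mentioned above, the function name is hypothetical, and the toy input uses two made-up patients with a single placeholder feature column (E1B1XYZMU) instead of the 20 real files.

```python
from statistics import mean, pstdev


def meta_record(patient_id, rows, feature_names):
    """Summarize all windows of one patient into a single record."""
    record = {"patient": patient_id}
    for name in feature_names:
        values = [float(row[name]) for row in rows]
        record[name + "_mean"] = mean(values)
        record[name + "_std"] = pstdev(values)  # population standard deviation
    return record


# Toy input: two hypothetical patients, one placeholder feature column.
toy = {
    "Pat001": [{"E1B1XYZMU": "1.0"}, {"E1B1XYZMU": "3.0"}],
    "Pat002": [{"E1B1XYZMU": "2.0"}, {"E1B1XYZMU": "2.0"}],
}
meta_set = [meta_record(pid, rows, ["E1B1XYZMU"]) for pid, rows in toy.items()]
print(meta_set[0])
```

The resulting list holds one record per patient and could then be passed to any clustering algorithm.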

Conclusions
In this study, our objectives are to describe and provide sets of data obtained from the processing and analysis of neurological data related to epileptic patients, leveraging advanced EEG signal preprocessing and feature extraction techniques.
The training datasets, generated through the TrB tool via two successive steps of signal filtering and feature extraction, are useful for subsequent investigations and modeling, for instance, using DS methodologies.
Future steps may include integrating additional algorithms for univariate and bivariate feature calculations into the TrB tool. We also plan to enable the provision of the algorithm's code at runtime, which is particularly feasible when using runtime-interpreted programming languages like Python or Matlab.
A further direction is data visualization, which would allow the graphical exploration of time series for both raw and processed signals.

Table 1. Distinct states of epileptic seizures.

Table 3. Metadata exploration and data insights of the Freiburg EEG database.