1. Summary
The Earth’s ionosphere lies below the magnetosphere, with altitudes ranging from approximately 60 to 1000 km. As a partially ionized region of the upper atmosphere, it overlaps with the thermosphere, mesosphere, and portions of the exosphere [1,2]. Its physical and chemical characteristics are influenced by both incoming solar radiation and localized energetic processes within the ionized medium [3,4,5,6].
The D-region of the ionosphere, located between 60 km and 90 km above sea level, is particularly significant because even small variations in ionization can substantially affect radio wave propagation and, consequently, a range of technologies essential to modern life. These include over-the-horizon radar, long-range communication and navigation systems, and other applications that depend on stable ionospheric conditions [7,8,9,10].
Within this region, the D-layer remains the least studied of the ionospheric layers, primarily due to its comparatively low ionization levels and complex interactions with neutral atmospheric constituents [11]. Technical limitations, operational risks, and high costs restrict direct in situ investigations of the D-region [12,13]. Consequently, its characteristics are predominantly examined using remote sensing methods, particularly sub-ionospheric Very Low Frequency (VLF, 3–30 kHz) and Low Frequency (LF, 30–300 kHz) radio wave transmission techniques [14,15,16].
Given its technological significance and the challenges of direct measurement, the continuous monitoring and detailed characterization of the D-region are essential for improving our understanding of ionospheric dynamics, mitigating its impacts on communication and navigation systems, and advancing our broader knowledge of upper atmospheric processes.
Machine learning (ML) has become an important tool for anomaly detection (AD). Given suitable target data labels and feature sets in a supervised ML task, a model can effectively distinguish between normal and anomalous data points within a signal’s characteristics. In the case of ionospheric VLF amplitude data, out-of-the-box solutions yield erroneous classification results, necessitating the development of a tailored approach. The VLF amplitude signal typically consists of various characteristics, including normal daytime signals, nighttime signals, solar flare effects, instrumental noise, instrumental errors, and outlier data points, as illustrated in Arnaut et al. [17]. All of these signal characteristics exhibit unique shapes, amplitude variations, and durations, suggesting that a customized approach employing various ML techniques and different feature sets could be effective for classifying them.
Prior research utilized the Random Forest algorithm [18] on a binary-labeled VLF amplitude signal [19], revealing that the model performed notably well in distinguishing between the normal data class (specifically, the normal daytime VLF amplitude signal) and the anomalous data class (comprising all other classes). In certain instances, however, the model produced noisy output classifications, and in some cases the entire signal was inaccurately labeled as anomalous. In continuation of our research, we shift from a binary to a multi-class scenario, wherein the labels for the training and testing sets are represented as multi-class variables; here, each signal characteristic is designated as its own data class.
To transition from a binary to a multi-class scenario, the dataset previously presented in [20] must be reclassified, with each data group assigned to its corresponding new class. Data labeling is a time-intensive and laborious task [21], requiring a skilled researcher to manually annotate each data point within the entire dataset for future model training and development. The dataset presented in this data descriptor is the manually reclassified dataset featured in Arnaut et al. [20]. Together with the previous, binary-classified dataset, it currently represents the sole dataset of its kind that allows for the analysis, visualization, and classification of multi-class, binary, or mixed-class target variables of VLF amplitude data. Additionally, this data descriptor outlines a workflow for optimal data labeling and provides the tools designed or utilized for reclassification, which can also be applied for first-time classification purposes.
2. Data Description
The dataset’s time span, as well as the transmitter and receiver list, remained unchanged from the original binary classification dataset available from Arnaut et al. [20]; specifically, the training dataset comprised data collected from 3 September 2011 to 9 September 2011, with a resolution of 1 min. The testing dataset comprised data collected from 19 October 2011 to 25 October 2011, with a resolution of 1 min. The employed transmitters were NAA, NLK, NPM, NML, and NAU, situated in the USA. The receiver list included Oklahoma East, Oklahoma South, Sheridan, and Walsenburg, also located in the USA. The total number of transmitter–receiver pairs amounted to 19.
The dataset consists of seven columns contained in a .csv file: the timestamp in ISO 8601 format, the transmitter code (e.g., NAA, NAU, etc.), the receiver’s name (e.g., OklahomaEast, Sheridan, etc.), and the train–test split, where 0 signifies the training segment of the dataset and 1 indicates the testing segment. The dataset also includes X-ray irradiance, measured in Wm−2 (obtained from GOES 15 via the National Centers for Environmental Information (NCEI) [22]), and VLF amplitude, expressed in dB (obtained from WALDO [23]). The final and most significant column is “label,” which serves as the target column containing the classification for each data point in the dataset.
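For orientation, the seven-column layout can be read with standard tools. The following minimal Python sketch uses hypothetical header names (the descriptor lists the columns, but the exact header strings are assumptions here) and an invented two-row sample rather than the real file:

```python
import csv
import io

# Hypothetical header names and synthetic rows; the real .csv uses the seven
# columns described above, but these exact strings are illustrative only.
sample = io.StringIO(
    "timestamp,transmitter,receiver,train_test_split,"
    "xray_irradiance,vlf_amplitude,label\n"
    "2011-09-03T00:00:00Z,NAA,OklahomaEast,0,1.2e-07,45.3,nighttime\n"
    "2011-10-19T12:00:00Z,NLK,Sheridan,1,3.4e-06,52.1,daytime_undisturbed\n"
)

rows = list(csv.DictReader(sample))
for row in rows:
    # train_test_split: 0 = training segment, 1 = testing segment
    segment = "train" if row["train_test_split"] == "0" else "test"
    print(row["timestamp"], row["transmitter"], row["receiver"],
          segment, row["label"])
```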
The dataset comprises 383,041 distinct instances labeled for ML anomaly detection. The original research by Arnaut et al. [19] categorizes the features into primary, secondary, and tertiary. The VLF amplitude and X-ray irradiance are the sole primary features, from which tertiary, or derived, statistical features can subsequently be computed. Furthermore, the transmitter and receiver information is provided, allowing a comprehensive list of features to be generated from the base features present in this dataset and the corresponding target variable.
Of the total 383,041 manually labeled instances, 40.65% (155,761 instances) were categorized as the daytime undisturbed signal, with 19.65% in the training segment and 21% in the testing segment of the dataset, i.e., the VLF amplitude signal measured during the time windows in September and October 2011, respectively. Solar flares constitute 8.87% of the total dataset, predominantly within the training segment at 6.66%. Conversely, for instrumental noise, the majority is found in the testing segment, comprising 3.79% of its 3.96% share of the entire dataset. Instrumental errors were predominantly found in the training segment, comprising approximately 7.44% of their 9.39% total. The nighttime signal represents the second largest data class in the dataset, with training and testing proportions of 16.03% and 21.02%, respectively. Finally, outliers were uniformly distributed across both the training and testing segments, constituting the smallest data class with a total of 296 instances, representing 0.08% of the dataset. The final, manually labeled dataset exhibits strong class imbalances, which was anticipated, as solar flares, instrumental errors/noise, and outlier data points occur with significantly lower frequency than the daytime undisturbed and nighttime signals. This raises a question regarding the use of a single, generally tuned model versus a combination of models specifically fine-tuned for individual classes or class pairs for future modeling purposes. In this data descriptor, we provide general features and basic characteristics of the manually labeled, multi-class dataset of VLF signal amplitude for ML anomaly detection, including the dataset itself. Tuning of the modeling procedure with the employed models falls outside the scope of this paper and is the topic of our future research.
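The class shares above can be recomputed directly from the “label” column. A minimal Python sketch, using made-up miniature counts rather than the real 383,041 instances, illustrates the calculation:

```python
from collections import Counter

# Toy label sequence standing in for the dataset's "label" column; the class
# names follow the descriptor, but these counts are illustrative only.
labels = (["daytime_undisturbed"] * 40 + ["nighttime"] * 37 +
          ["solar_flare"] * 9 + ["instrumental_error"] * 9 +
          ["instrumental_noise"] * 4 + ["outlier"] * 1)

counts = Counter(labels)
total = len(labels)
shares = {cls: 100.0 * n / total for cls, n in counts.items()}
for cls, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{cls:>22s}: {pct:5.1f}%")
```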
Figure 1 illustrates the reclassification process applied to the classified VLF amplitude time series from 3 September 2011 at 00:00:00 UT to 4 September 2011 at 07:15:00 UT.
Figure 1b presents a binary classification example from our prior research, wherein nighttime signals, solar flare effects, and instrumental errors, among others, were categorized as “Anomalous” data. Conversely, in multi-class classification, these sub-classes were categorized into distinct data classes; for instance, the nighttime signal constitutes its own class, as do solar flares and instrumental errors, as illustrated in Figure 1c. The reclassification was conducted based on the initial binary classification, in which the “Normal” data class remained unchanged, while the “Anomalous” data class was further partitioned into several distinct data classes. In the novel multi-class dataset, the data classes are represented as daytime_undisturbed, nighttime, solar_flare, instrumental_error, instrumental_noise, and outlier.
Figure 2 illustrates an additional instance of reclassification, in this example concerning an X2.1 solar flare that occurred on 6 September 2011. The VLF signal in Figure 2 exhibits instrumental errors, characterized by a minor outlier blink in the latter section of the instrumental error trace. Following the restoration of the signal, the normal daytime data class was assigned, which persisted until disturbances associated with the X2.1 solar flare occurred. The notable distinction from the binary scenario is that instrumental errors, outlier data points, and solar flares are each categorized as separate data classes, unlike the prior instance where they were aggregated into a single (Anomalous) data class.
3. Methods
The labeling process for machine learning anomaly detection is a time-intensive and laborious task that must be conducted by a researcher knowledgeable about the specific characteristics of the signal in question. The VLF amplitude data repository is the Worldwide Archive of Low-Frequency Data and Observations (WALDO), which contains data in MATLAB (.mat) format, presenting certain challenges due to the relative inconvenience of handling MATLAB files. Consequently, for more user-friendly data handling in this research, a web-based tool was developed (Supplementary Materials) to convert MATLAB VLF amplitude data into a .csv file compatible with TRAINSET, a web-based application designed for efficient data annotation. Furthermore, the web-based tool completes an entirely free pipeline from the WALDO data repository to the open-source labeling tool TRAINSET, serving as a significant resource that helps researchers navigate various data formats and label selected data with ease.
The first step is to upload the .mat file into the web-based converter and subsequently select the downsampling factor (Figure 3). A downsampling option is provided because VLF data can be sampled either at a frequency of 1 Hz, resulting in 86,400 data points per day, or at a higher frequency of 50 Hz, yielding approximately 4.3 million data points per day. Given that VLF data is typically examined over extended consecutive days, the capability to reduce resolution through downsampling has been incorporated. Furthermore, TRAINSET may encounter issues when a large amount of data is loaded, which, together with the data quantity considerations above, motivated the downsampling factor. The default value is 1, indicating that no downsampling is implemented.
Following the application of the downsampling factor, the data conversion process commences with the extraction of timestamps and VLF amplitude values from the MATLAB file. The timestamps are converted to the ISO 8601 format, as TRAINSET only accepts that format. TRAINSET requires data to consist of four columns: a series column, where each data point is represented as “series_a”; a timestamp column; a value column, containing VLF amplitude values; and a label column, where each data point is initially designated as “Normal”. Following the conversion, the exported .csv file is suitable for the TRAINSET web-based labeling tool.
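The conversion steps described above (downsampling, ISO 8601 timestamps, and the four TRAINSET columns) can be sketched as follows. This is an illustrative reimplementation with synthetic amplitude values, not the code of the web-based tool itself:

```python
import csv
import io
from datetime import datetime, timedelta, timezone

def to_trainset_rows(start, amplitudes, factor=60):
    """Downsample a 1 Hz amplitude trace by `factor` and emit TRAINSET-style
    rows; every point starts out labeled "Normal", timestamps are ISO 8601."""
    rows = []
    for i, value in enumerate(amplitudes[::factor]):
        ts = start + timedelta(seconds=i * factor)
        rows.append({"series": "series_a",
                     "timestamp": ts.isoformat().replace("+00:00", "Z"),
                     "value": value,
                     "label": "Normal"})
    return rows

# Illustrative input: 600 one-second samples downsampled by a factor of 60,
# i.e., one point per minute (the amplitude values are synthetic).
start = datetime(2011, 9, 3, tzinfo=timezone.utc)
rows = to_trainset_rows(start, [40.0 + 0.01 * i for i in range(600)])

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["series", "timestamp", "value", "label"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue().splitlines()[0])  # header line
print(len(rows), "data rows")          # 600 / 60 = 10
```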
The labeling framework, as presented by Arnaut et al. [17], identified five distinct data classes, with an additional sixth class incorporated during the initial data labeling process. The data categories include the daytime undisturbed signal (daytime signal undisturbed by solar flare effects), the nighttime signal (usually larger amplitude values than the daytime signal), solar flare effects (characteristic features of a solar flare on the VLF amplitude, marked by an abrupt increase in signal amplitude followed by a gradual decrease), instrumental errors (constant zero or non-zero values), outlier data points (individual or few data points distinct from surrounding data points), and an additional data class for instrumental noise (rapid fluctuation of the VLF amplitude that is non-characteristic of the VLF amplitude signal). A comprehensive description of the conditions for each data class is available in Arnaut et al. [17].
The manual multi-class labeling of the 19 transmitter–receiver pairs of VLF amplitude data was conducted by two researchers in a dual-phase approach. The primary researcher conducted the initial data labeling, while the secondary researcher performed independent quality control (QC) and corrected certain data labels on a case-by-case basis. This dual approach ensured high data quality with minimized bias. Bias may arise from the subjectivity of a researcher’s manual labeling, influenced by their experience with VLF data; by having the primary researcher label the data and the secondary researcher conduct quality control, this risk is believed to be mitigated. While some subjectivity may arise for low-energy solar flare events, those of high intensity carry minimal risk of labeling uncertainty.
The first researcher initiated the labeling process in the TRAINSET software after data conversion to the compatible .csv format, where both the VLF amplitude and X-ray irradiance data were displayed on the same graph using two separate axes. This allowed the researcher to monitor the X-ray irradiance data alongside the VLF amplitude data and provide precise data labeling. Subsequently, the second researcher conducted quality control within the TRAINSET software, replicating the process of the first researcher while inspecting for any mislabeling or residual data points from the original dataset. All data was examined for remnants of the prior data classifications.
The standardization framework presented in Arnaut et al. [17] was employed for all data classes, while the instrumental_noise data class was incorporated to capture noise from the VLF receiver, which is separate from the instrumental_error data class (when the VLF receiver shows consistent, prolonged, low values). The instrumental_noise data class was designated when the VLF amplitude exhibits abrupt, high-frequency, rapid fluctuations between values, typically in cyclic, repetitive bursts.
As only the “Anomalous” signal was reclassified, no ambiguity is anticipated in the “Normal” signal data class, i.e., in the daytime_undisturbed data class of the multi-class dataset. Furthermore, given that VLF data exhibits distinct visual signal characteristics, no misclassified instances are expected within the current multi-class data categories.
4. User Notes
The dataset holds significant potential for ML-based anomaly detection of VLF amplitude data. First, as it builds upon prior research and improves data quality through a more detailed classification of the target variable, it offers greater flexibility in developing accurate, precise, and reliable models for anomaly detection in VLF amplitude data. The presence of six data classes in the target variable presents a new research opportunity to develop multiple models for anomaly detection, as opposed to the singular model approach utilized in binary classification, each with unique feature sets tailored to the respective data classes. Given that the target variable comprises six classes, a total of 63 non-empty class combinations can be generated. For instance, a model may be developed and optimized with a specific feature set to exclusively identify the nighttime signal while ignoring other signal characteristics. This approach can similarly be applied to other classes, as well as to various combinations of classes, whether in pairs, trios, quartets, quintets, or all six target classes. Consequently, a customized solution can be developed for each target data class, resulting in multiple models that operate collaboratively to deliver the most precise and dependable classifications, i.e., ensemble models. Such an endeavor is time-intensive; by sharing the dataset with a diverse array of researchers, we aim to engage more individuals in this research activity.
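The count of 63 follows from the number of non-empty subsets of six classes, i.e., C(6,1) + C(6,2) + … + C(6,6) = 2⁶ − 1. A one-line check:

```python
from math import comb

# Non-empty subsets of six target classes: sum of binomial coefficients,
# which equals 2**6 - 1 = 63.
n_classes = 6
combinations = sum(comb(n_classes, k) for k in range(1, n_classes + 1))
print(combinations)  # 63
```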
In terms of the selection of specific models, we suggest tree-based models, such as Random Forest or extreme gradient boosting, or related models as baselines, as they have demonstrated their classification power in previous studies.
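A minimal baseline along these lines can be sketched with scikit-learn on synthetic, well-separated stand-in features; the class means and values below are invented for illustration and are not drawn from the dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in features (a crude "amplitude" and "irradiance" pair) for
# two easily separable classes; real work would use the dataset's features.
X_day = rng.normal(loc=[50.0, 1e-7], scale=[2.0, 2e-8], size=(200, 2))
X_night = rng.normal(loc=[60.0, 1e-9], scale=[2.0, 5e-10], size=(200, 2))
X = np.vstack([X_day, X_night])
y = np.array(["daytime_undisturbed"] * 200 + ["nighttime"] * 200)

# A tree-based baseline classifier, as suggested above.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```

In practice the score would of course be estimated on the held-out testing segment of the dataset rather than on the training data.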
An additional benefit and challenge of this dataset lies in the class balance among the six data classes. Geophysical datasets used in ML tasks typically exhibit significant imbalances favoring a particular data class. The largest data class in this instance is the daytime undisturbed signal, comprising approximately 40% of the entire dataset, closely followed by the nighttime signal at about 37%; together, these classes represent approximately ¾ of the dataset. In other words, phenomena resulting from external factors, such as the effects of solar flares, instrument noise/errors, and outliers arising from various causes, represent a minority of instances. This presents the challenge of appropriately balancing the data. From the viewpoint of this data descriptor, and without utilizing or developing any anomaly detection workflow, this question is difficult to answer, as it pertains to individual cases and is significantly influenced by the strategy adopted for model creation. It is, however, evident that random undersampling, as demonstrated in the binary anomaly detection example [19], is not advisable for all data classes due to the limited number of data points: for the outlier class in particular, random undersampling would result in approximately 600 samples remaining, which is considered insufficient for any dependable ML workflow. While other data classes appear to possess adequate data points for random undersampling, alternative data balancing strategies are recommended, as the loss of data points in this context would be detrimental.
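The limitation of random undersampling can be demonstrated in miniature. In this sketch the class sizes are made up, but the effect is the same: every class is cut down to the size of the rarest one:

```python
import random
from collections import Counter

random.seed(0)

# Made-up class sizes that mirror the descriptor's imbalance in miniature.
sizes = {"daytime_undisturbed": 4000, "nighttime": 3700,
         "solar_flare": 900, "instrumental_error": 940,
         "instrumental_noise": 400, "outlier": 3}
pool = [(cls, i) for cls, n in sizes.items() for i in range(n)]

# Random undersampling: cap every class at the smallest class size.
floor = min(sizes.values())
balanced = []
for cls in sizes:
    members = [x for x in pool if x[0] == cls]
    balanced += random.sample(members, floor)

print(Counter(x[0] for x in balanced))
# Every class shrinks to the size of the rarest one, discarding nearly all of
# the majority-class data, which is why alternative strategies are preferable.
```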
Additional attention must be given to the selection of features for each data class, specifically if the adopted strategy involves developing six independent models for each data class, which currently appears to be the optimal initial approach for the dataset, followed by integrating each classification method into a singular ensemble model. Different features must be evaluated for each data class; for instance, outliers represent rapid fluctuations in the dataset, rendering long-term features, such as a rolling mean with a 60-min window, ineffective. Furthermore, post-processing techniques, such as cluster analysis, should be considered, specifically cluster correction within the ML pipeline, where the signal is “corrected” post-classification, thereby enhancing evaluation metrics and reducing model noise.
The information provided regarding the strategy to formulate an optimal model for multi-class anomaly detection requires significant time and effort from researchers. The primary objective of this data descriptor is to present a novel multi-class labeled VLF dataset for ML-based anomaly detection, enabling researchers to develop their own models, report findings, and collaboratively strive towards an optimal method.
The provided web-based tool seeks to reduce the entry cost for labeling VLF amplitude data, as this dataset is extensive; however, additional data could prove advantageous in the current context or in the future. The primary objective of creating the web-based tool is to establish a free and open pipeline for dataset labeling and preparation, in conjunction with the openly accessible WALDO datasets and the TRAINSET tool, enabling a wide array of researchers to participate.
This dataset can also be utilized by early career researchers and students as a reference for both binary (provided in our previous research, Arnaut et al. [20]) and multi-class classification (presented in this research) during ML studies. By offering a dataset of multi-class labeled data, we aim to grant other researchers the freedom to utilize this data according to their specific needs and objectives, while also highlighting its potential for educational purposes.
The future potential of applying anomaly detection techniques to VLF amplitude data extends beyond ionospheric data and can be utilized for a broader spectrum of space weather parameters, provided the signal is continuous and time-dependent. Interested readers are referred to recent papers dealing with different topics related to ionospheric research using different ML approaches, e.g., [24,25,26,27,28]. Investigating optimal labeling techniques and classification methods for VLF amplitude may enhance the application of anomaly detection and ML techniques in space weather physics.