Machine Learning Classification Workflow and Datasets for Ionospheric VLF Data Exclusion

Machine learning (ML) methods are commonly applied in the fields of extraterrestrial physics, space science, and plasma physics. In a prior publication, an ML classification technique, the Random Forest (RF) algorithm, was utilized to automatically identify and categorize erroneous signals, including instrument errors, noisy signals, outlier data points, and the impact of solar flares (SFs) on the ionosphere. This data communication includes the pre-processed dataset used in that research, along with a modeling workflow that utilizes the PyCaret library and a post-processing workflow. The code and data serve educational purposes in the interdisciplinary field of ML and ionospheric physics, and may also be useful to other researchers for diverse objectives.


Summary
Numerous machine learning (ML) algorithms and pre-processing techniques have been made possible by rapid advancements in computer science, data science, and data analysis. Manually verifying, reviewing, and excluding data from an ionospheric very-low-frequency (VLF) investigation during intense occurrences takes considerable time and effort [1,2]. Nevertheless, ML classification methods can be used to automate this job. In our prior publication [4], we evaluated the Random Forest (RF) algorithm [3] with the purpose of automatically classifying erroneous ionospheric VLF amplitude data points during solar flare (SF) investigation/detection. These erroneous data points were categorized as representing SF events, instrumentation errors, or noisy signals. Due to its ease of use and simplicity (few hyperparameters to tune and a reduced likelihood of overfitting due to averaging/voting [5] and the law of large numbers [3]), the RF algorithm is considered a first choice for various ML tasks [6,7]. Consequently, it was a suitable selection for the given research purpose. However, as stated in the research paper [4], it is advantageous to extend the original dataset and test additional classification algorithms in order to possibly increase the predictive power of the models.
This data report fulfills two objectives: firstly, it catalogs and provides a link to the data employed in this study, thereby making them accessible to a broader range of researchers, professionals, and others; secondly, it includes a workflow that integrates the PyCaret library [8], enabling the comparison and testing of fifteen models in total (data and code are available in the Supplementary Material). The Methods section, i.e., the workflow description, provides a comprehensive overview of the workflow utilized in conjunction with the data. A synopsis of the pre-processing steps performed during the construction of the original dataset is also available in [4].

Data Description
The datasets that were originally obtained and labeled (erroneous values were filtered out) for other research purposes in September and October of 2011 provided a favorable opportunity to evaluate ML classification on this type of data. The original and labeled samples were combined into a single data source, to which additional data (X-ray irradiance, transmitter and receiver data, and local receiver time) were added (Figure 1). The target variable was obtained from the labeled, i.e., filtered, dataset: each instance that was filtered out of the original dataset was annotated as 1 (anomalous data class), and the data that remained were annotated as 0 (normal data class). Feature extraction was performed by analyzing statistical features of the VLF amplitude and X-ray irradiance signals. The statistical features included rolling-window statistics such as the mean, standard deviation, and median, with three different window sizes (5, 20, and 180 min). Additionally, lagged signals and other measures, such as the rate of change and first- and second-order differences, were employed. Due to the imbalanced nature of the given ML task, random undersampling [9-11] was performed to balance the distribution of the target labels. As a result, the final dataset was generated. For a more comprehensive explanation of the data pre-processing, refer to [4].
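The feature-extraction and undersampling steps described above can be sketched in pandas. This is a minimal illustration, not the authors' actual pre-processing code: the column names (`vlf_amplitude`, `xray_irradiance`, `label`) and the assumption of one sample per minute are hypothetical.

```python
import numpy as np
import pandas as pd

def build_features(df, windows=(5, 20, 180)):
    """Rolling-window statistics, lags, and differences for the VLF
    amplitude and X-ray irradiance signals (column names illustrative)."""
    out = df.copy()
    for col in ("vlf_amplitude", "xray_irradiance"):
        for w in windows:  # window sizes in minutes, assuming 1 sample = 1 min
            out[f"{col}_mean_{w}"] = df[col].rolling(w, min_periods=1).mean()
            out[f"{col}_std_{w}"] = df[col].rolling(w, min_periods=1).std()
            out[f"{col}_median_{w}"] = df[col].rolling(w, min_periods=1).median()
        out[f"{col}_lag1"] = df[col].shift(1)          # lagged signal
        out[f"{col}_diff1"] = df[col].diff()           # first-order difference
        out[f"{col}_diff2"] = df[col].diff().diff()    # second-order difference
        out[f"{col}_roc"] = df[col].pct_change()       # rate of change
    return out

def undersample(df, target="label", seed=0):
    """Random undersampling: shrink every class to the minority-class size."""
    n_min = df[target].value_counts().min()
    return df.groupby(target).sample(n=n_min, random_state=seed)
```

In practice, the balanced frame returned by `undersample` would then be split into the training and test sets provided with this report.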


Methods (Workflow Description)
The dataset was divided into two separate sets: the training dataset and the test dataset. Both sets have been pre-processed and are provided as links, and the workflow described below can be readily applied to them. Prior to executing the code, the user must specify the input variables, which include the training and test datasets, as well as the visualization range. The initial stage of the workflow involves ML modeling, where the PyCaret library employs the training dataset to conduct a comparison among 15 ML algorithms (Figure 2 and Table 1). After the comparison, the model with the best evaluation metrics and statistics is selected as the overall best model. This model then undergoes hyperparameter tuning for further optimization. The last step employs the optimized model to make predictions on the given test dataset. This process generates an output file that includes the predictions made by the best-performing, fine-tuned ML algorithm, along with the features and target variables. The post-processing workflow can be summarized in four steps: decoding the data from each transmitter and receiver, separating the individual transmitter-receiver pairs (there are 19 pairs in the test dataset), computing per-class evaluation metrics for each pair, and finally visualizing the true and predicted data labels using the range specified at the start of the workflow.

The evaluation metrics employed for the workflow consist of the confusion matrix, as well as the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts for each class. Furthermore, the workflow presents precision and F1-score values for each class, and the overall metrics are also displayed as an output of the workflow. However, due to the highly imbalanced ML task in question, greater emphasis should be given to the per-class evaluation metrics. A comprehensive overview of all evaluation metrics can be found in [4]; for the purpose of this brief data descriptor, we provide only a short definition of the F1-score. The F1-score is calculated as the harmonic mean of the true positive rate, also known as recall, and the precision parameter. When evaluating imbalanced ML tasks, the F1-score is typically preferred over accuracy [12,13].
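The per-class evaluation step can be illustrated with scikit-learn. This is a sketch with made-up labels for a single hypothetical transmitter-receiver pair, not output from the actual workflow; it shows how the confusion-matrix counts and the per-class F1-scores (the harmonic mean of precision and recall) are obtained.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Illustrative true/predicted labels for one transmitter-receiver pair
# (0 = normal data class, 1 = anomalous data class).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 1])

# Confusion matrix: rows = true class, columns = predicted class.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Per-class precision, recall, and F1-score.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1], zero_division=0)

for cls in (0, 1):
    # F1 is the harmonic mean of precision and recall: 2PR / (P + R).
    assert np.isclose(f1[cls], 2 * prec[cls] * rec[cls] / (prec[cls] + rec[cls]))
```

In the workflow, this computation is repeated for each of the 19 transmitter-receiver pairs in the test dataset.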
The ML workflow's results are displayed in Figure 3b. All three panels represent the signal obtained from the NAA-Walsenburg transmitter-receiver pair. The signal in Figure 3 spans 600 min, beginning on 19 October 2011 at 14:37 UT and concluding on 20 October 2011 at 01:37 UT. In addition, the workflow provides the evaluation metrics for each transmitter-receiver pair, specifically the F1-score. In the example shown in Figure 3b, the anomalous data class has an F1-score of 0.65, while the normal data class has a score of 0.96, resulting in a total F1-score of 0.93.
A comparison of the outcomes produced by the workflow integrating the PyCaret library and by the RF algorithm applied to the same transmitter-receiver pair in [4] reveals a distinction. PyCaret identifies the Extra Trees Classifier (ET) as the optimal overall model for the given task. The comparison of the outputs clearly illustrates that, at least for the instance depicted in Figure 3, the ET classifier is more suitable. However, additional investigation is required to ascertain the specific conditions under which a particular model is preferable.
In addition to conducting further research to identify the most suitable model for different scenarios, it is crucial to undertake a comprehensive data acquisition effort to further enhance the predictive capabilities of a model. Additional data collection would allow the model to learn from a wider array of events and varying degrees of potential noise, for example, data from different time periods within one or several solar cycles. Acquiring this level of detailed data requires the collaboration of a larger team of researchers to label and verify the data in a semi-manual manner. Following this undertaking, the model could be significantly enhanced, and additional solutions could be devised to cater to a larger research community, for instance, standalone software with a user-friendly interface that a diverse group of researchers could utilize for VLF data pre-processing.

Figure 2. Workflow for ML modeling and post-processing.


Figure 3. (a) Visualization of actual class labels for the NAA_Walsenburg transmitter-receiver pair from 19 October 2011 14:37 to 20 October 2011 01:37, obtained from [4]; (b) predictions made by the Extra Trees Classifier from the PyCaret library for the same time period; (c) predictions made by the Random Forest Classifier for the same time period, obtained from [4].