The Choice of Time–Frequency Representations of Non-Stationary Signals Affects Machine Learning Model Accuracy: A Case Study on Earthquake Detection from LEN-DB Data

Abstract: Non-stationary signals are often analyzed using raw waveform data or spectrograms of those data; however, the possibility of alternative time–frequency representations being more informative than the original data or spectrograms is yet to be investigated. This paper tested whether alternative time–frequency representations could be more informative for machine learning classification of seismological data. The mentioned hypothesis was evaluated by training three well-established convolutional neural networks using nine time–frequency representations. The results were compared to the base model, which was trained on the raw waveform data. The signals that were used in the experiment are three-component seismogram instances from the Local Earthquakes and Noise DataBase (LEN-DB). The results demonstrate that Pseudo Wigner–Ville and Wigner–Ville time–frequency representations yield significantly better results than the base model, while spectrogram and Margenau–Hill perform significantly worse (p < 0.01). Interestingly, the spectrogram, which is often used in signal analysis, had inferior performance when compared to the base model. The findings presented in this research could have notable impacts in the fields of geophysics and seismology as the phenomena that were previously hidden in the seismic noise are now more easily identified. Furthermore, the results indicate that applying Pseudo Wigner–Ville or Wigner–Ville time–frequency representations could result in a large increase in earthquakes in the catalogs and lessen the need to add new stations with an overall reduction in the costs. Finally, the proposed approach of extracting valuable information through time–frequency representations could be applied in other domains as well, such as electroencephalogram and electrocardiogram signal analysis.


Introduction
Non-stationary signal analysis is usually focused on detecting components of interest in a measured signal. It can be performed on a wide variety of non-stationary signals, such as electroencephalogram (EEG), electrocardiogram (ECG), speech, and so on. The signals can be analyzed raw, in their original waveform configuration, or after a time-frequency representation (TFR) has been applied. This also holds true for non-stationary earthquake signals used for earthquake monitoring, which plays a crucial role in seismology. In recent years, technological advancements made it possible to gather and store vast amounts of data, which surpassed the processing capabilities of human analysts. Automatic earthquake detection algorithms have been under development since the 1960s (see review [1]), but often fail to detect smaller earthquakes buried in seismic noise. The detection of smaller earthquakes is important for improving seismic catalogs, which are crucial for seismic hazard and risk analysis. Automatic detection is also important for larger earthquakes, mainly to improve the rapid response capabilities for natural hazards. New approaches for earthquake detection have been developed recently, such as template matching [2] and machine learning (see review [3]). Advancements in the field of neural networks, combined with the ever-available colossal amounts of data, can lead to increased accuracy in earthquake detection and characterization. Perol et al. (2018) proposed a convolutional neural network (CNN), ConvNetQuake, and formulated earthquake detection as a supervised classification problem [4]. ConvNetQuake achieved substantial results by having a precision of 94.8% and a recall of 100% for local earthquake detection in Oklahoma, USA. Building on top of [4], Lomax et al. (2019) built a single CNN for local, regional, and teleseismic earthquake detection and characterization [5]. The subsequent studies of local earthquake detection (e.g., [6,7]) performed at least as well as [4] or better. Most of the algorithms which deal with earthquake detection have used the waveforms in the time domain, and few have used waveform transformations (e.g., spectrograms [8,9] and continuous wavelet transform [10]). The importance of extracting as much information as possible from the data we have is self-explanatory; however, to our knowledge, the use and comparison of different TFRs along with machine learning in earthquake detection has not yet been explored.
The goal of this research is to explore the potential of several powerful TFRs for earthquake detection, and their influence on algorithm performance. Furthermore, an important contribution of this paper is raising awareness that Cohen's class of TFRs could be a good way to extract more useful information from non-stationary data for machine learning applications.

Experimental Setup
The concept of our experiment was to train three state-of-the-art CNNs using different TFRs, and compare their performance on earthquake detection with the base model. The base model was trained on the original waveform data. The experiment was divided into two parts: data preprocessing and neural network training. Data preprocessing included random sampling of the Local Earthquakes and Noise DataBase (LEN-DB) [11] (further described in Section 2.2) and TFR extraction (further described in Section 2.3). Both the base model and the CNNs are detailed in Section 2.4. The main phases of the experiment are depicted in Figure 1.

LEN-DB Dataset
To conduct the experiment, we used the LEN-DB seismological dataset [11]. The dataset includes earthquake waveforms and noise recorded by high-gain, velocimetric seismometers from earthquakes in the magnitude range 0.4 ≤ M ≤ 7.1, originating up to 134 km from the recording stations. The data were filtered in the frequency range 0.1 < f < 5 Hz. All this implies that the dataset includes very diversified trace records in terms of maximum amplitude and frequency content. Each instance of the dataset contains three-component seismograms sampled at 20 Hz. Each seismogram contains 540 points, i.e., it is 27 s long. Illustrations of the three-component (3C) seismograms depicting an event (earthquake) and a noise instance are presented in Figure 2. The dataset contains 631,105 instances labeled as earthquakes, and 618,306 instances labeled as noise. This gives an approximate earthquake-to-noise ratio of 50:50. In terms of uniformity, seismological data tend to have lower intra-class variance and higher inter-class variance than, e.g., the ImageNet data [12]. To clarify, in most cases, the differences between an earthquake and noise are much easier to learn and require fewer features than the differences between a tabby cat and a tiger cat (which are only two of the 1000 classes in the ImageNet dataset). Having that in mind, one can assume that even a relatively small amount of data will suffice in this experiment. Furthermore, the time required to extract TFRs and train three complex CNN models is quite significant. Consequently, we decided to use a subset of 150,000 randomly sampled three-component seismograms from the LEN-DB dataset while maintaining the original earthquake-to-noise ratio.
We split the data into training, validation, and test sets. The training set contained 104,999 instances, the validation set 22,499 instances, and the test set 22,502 instances. Each set had an approximately equal number of earthquake and noise instances, following the original ratio of the LEN-DB dataset. These waveform data were used to train the base model. The same division was maintained throughout the experiment. That means that each CNN model was trained on the same set of data as the base model, only processed by different TFRs. The same was true for the validation and the test sets, so the results should reflect the effects of using TFRs with great credibility. In addition, it is worth noting that the data preprocessing procedure was parallelized due to the time-consuming nature of the process.
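The split above can be reproduced in principle with a simple per-class partition. The following sketch preserves the earthquake-to-noise ratio in each set; `stratified_split` is an illustrative helper (not code from the paper), and the 70/15/15 fractions are an assumption that matches the reported set sizes (104,999/22,499/22,502 out of 150,000).

```python
import numpy as np

def stratified_split(labels, fracs=(0.70, 0.15, 0.15), seed=0):
    """Split indices into train/val/test sets, preserving the class ratio.

    `labels` is a 1-D array of 0/1 class labels; returns three index arrays.
    """
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        # shuffle the indices of this class, then cut them into three parts
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(fracs[0] * len(idx)))
        n_val = int(round(fracs[1] * len(idx)))
        train.append(idx[:n_train])
        val.append(idx[n_train:n_train + n_val])
        test.append(idx[n_train + n_val:])
    return np.concatenate(train), np.concatenate(val), np.concatenate(test)
```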

Extracting Time-Frequency Representations from Seismograms
Due to the oscillatory nature of seismic signals, which are non-stationary (having frequency content varying in time) and multi-component, it is necessary to develop tools to characterize and analyze them simultaneously in time and frequency. Namely, the Fourier transform provides only spectral information (information on the frequencies present in the signal without their time localization). Considering the time dimension alongside the spectral information naturally leads to the concept of TFRs. These representations result in the 2D spreading of signal information content along the time-frequency plane.
A straightforward way to obtain a signal TFR, called the short-time Fourier transform (STFT), is by applying the Fourier transform to a signal windowed by a sliding window, Equation (1) [13], where x is the analyzed time series, t denotes time, f frequency, h the smoothing window (Hamming windows were used in the paper), and * denotes the complex conjugate. Well-known shortcomings of the STFT are the time/frequency resolution limits caused by the fixed window width. Namely, short time windows provide a better time resolution and a worse resolution in the frequency domain. On the other hand, wide windows improve frequency resolution and, at the same time, decrease the time resolution. Further, it is not possible to adjust the time and frequency resolutions independently due to the single fixed-size window (also known as Heisenberg's uncertainty principle).
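As an illustration, a spectrogram in the sense above can be computed with SciPy. The Hamming window follows the paper; the window length (`nperseg=64`) and the synthetic chirp test signal are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 20.0                    # LEN-DB sampling rate (Hz)
t = np.arange(540) / fs      # 27 s, 540 samples, as in the dataset
# synthetic non-stationary test signal: a slow chirp buried in noise
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * (0.5 + 0.1 * t) * t) + 0.3 * rng.standard_normal(540)

# squared-modulus STFT (the spectrogram); short windows favor time resolution,
# long windows favor frequency resolution
f, tt, Sxx = spectrogram(x, fs=fs, window='hamming', nperseg=64, noverlap=56)
```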
In many practical applications, such as seismology, instead of the STFT, the squared modulus of the STFT (known as the spectrogram, SP) is often used [13][14][15], Equation (8).
Another TFR used in this paper is the Margenau-Hill (MH), defined as in [16] and presented in Equation (5), where X(f) stands for the signal's Fourier transform. In addition, the MH can also be interpreted as the real part of the product between two STFTs having different time windows. The two windows, consequently, allow for independently adjusting the time and frequency resolution. The shortcoming of this approach is the possible generation of artifacts when the window lengths are significantly different and the signal contains components with similar frequencies [17].
Another TFR was introduced by E. Wigner and further applied to the analytic signal by J. Ville. Unlike the SP, where the Fourier transform is applied to a windowed signal, the Wigner-Ville distribution (WV) operates on the signal's autocorrelation function, resulting in a quadratic TFR calculated as in [18,19], Equation (10). The WV, a member of Cohen's class of time- and frequency-shift-invariant TFRs, improves auto-term resolution compared to the smeared-out SP; however, in the case of multi-component signals, the WV suffers from cross-terms due to its quadratic nature.
In order to reduce the cross-terms, numerous modified methods were proposed, such as the windowed WV with a frequency smoothing window, or Pseudo WV (PWV), calculated as in [20,21], Equation (6). The PWV distribution can further be modified by introducing a time-smoothing window, resulting in the so-called Smoothed PWV (SPWV) defined as in [20,22], Equation (9), where g stands for the time smoothing window.
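For readers who want to experiment, a minimal discrete Pseudo Wigner-Ville sketch — the FFT over the lag variable of the windowed instantaneous autocorrelation x(t + τ)x*(t − τ) — might look as follows. The function, the Hamming window choice, and the window length are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np
from scipy.signal import hilbert

def pseudo_wvd(x, win_len=63):
    """Pseudo Wigner-Ville distribution of a real 1-D signal (sketch).

    Returns a (win_len, len(x)) array; rows index frequency, columns time.
    """
    z = hilbert(x)                      # analytic signal reduces aliasing
    n = len(z)
    half = win_len // 2
    w = np.hamming(win_len)             # frequency-smoothing window h
    tfr = np.zeros((win_len, n))
    for t in range(n):
        taumax = min(t, n - 1 - t, half)
        tau = np.arange(-taumax, taumax + 1)
        # windowed instantaneous autocorrelation at time t
        r = w[half + tau] * z[t + tau] * np.conj(z[t - tau])
        buf = np.zeros(win_len, dtype=complex)
        buf[tau % win_len] = r          # circular placement over lags
        tfr[:, t] = np.fft.fft(buf).real
    return tfr
```

Note that, because the lag variable effectively doubles the frequency, a pure tone at normalized frequency f0 concentrates near FFT bin 2·f0·win_len.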
Another method to reduce unwanted interferences (cross-terms) is the Born-Jordan distribution (BJ), whose kernel contains the sinus cardinalis, sinc, which can be viewed as the Fourier transform of the first B-spline [23]. In particular, it damps so-called "ghost frequencies" and also effectively suppresses the noise. The BJ is expressed as in [22,24], Equation (2).
Hyung-Ill Choi and William J. Williams proposed an exponential time-frequency kernel to suppress the cross-terms, which resulted in another TFR called the Choi-Williams distribution (CW), also belonging to Cohen's class [25], Equation (4). The CW distribution does not provide an independent choice of the amount of smoothing in the time and frequency domains. Namely, its weighting function does not decay along the axes in the ambiguity domain (otherwise, it would not satisfy the marginal properties), and thus, it can only filter out the cross-terms that result from components that differ in both time and frequency center [26].
Finally, a reduced interference distribution based on the Bessel function of the first kind (RIDB) [31,32] was used in this research, Equation (7). The method is known to effectively reduce cross-terms while the auto-terms are kept, due to the low-pass filter characteristics of the used Bessel kernel.
To summarize, we used the following TFRs:
• Short-time Fourier transform (STFT) [13]
• Spectrogram (SP) [13][14][15]
• Margenau-Hill (MH) [16]
• Wigner-Ville (WV) [18,19]
• Pseudo Wigner-Ville (PWV) [20,21]
• Smoothed Pseudo Wigner-Ville (SPWV) [20,22]
• Born-Jordan (BJ) [22,24]
• Choi-Williams (CW) [25]
• Reduced Interference Distribution with Bessel kernel (RIDB) [31,32]
The results of the TFR extractions conducted on the LEN-DB data were three-channel, 224 × 540 images. Sample images are shown in Figures 3-6. The heatmaps depicted in the mentioned figures indicate the presence and intensity of different frequencies in the original signal at a specific point in time. The heatmap values are TFR-dependent and do not have a specific unit of measurement. A complete list of images of all the representations is shown in Figures S1-S9. Each channel consists of 224 frequencies and 540 timeframes. TFR extraction originally produced 540 × 540 images, but we decided to logarithmically choose the first 224 frequencies, starting from the lowest one. The former was performed to reduce the size of the output data and to adapt it to the input size of the CNNs listed in Section 2.4. As can be noticed, some of the used TFRs may take negative values (which are, same as the interferences, the results of the mathematical transformation). This indicates that the TFRs are not energy density functions. The cause for this lies in the fact that time and frequency are non-commuting variables. Hence, it is not possible to simultaneously achieve absolute accuracy in both domains. Thus, no joint distribution (even if non-negative) may be interpreted, in a strict mathematical sense, as a probability density [33]. Nevertheless, TFRs still offer valuable insight into the spectral nature of the signal, the number of components, their time localization, and frequency modulation, to name a few.

Convnet Models and the Base Model
CNNs are a class of feed-forward neural networks whose architecture includes convolutional layers. In addition, pooling layers are also common in CNNs. CNNs are mainly used in image classification and pattern recognition due to their invariance to translation, but also, to some extent, to rotation and scale. One of the first CNNs was LeNet [34]. LeNet proved successful in identifying handwritten zip code numbers supplied by the US Postal Service.
In this research, we used three well-established CNN architectures: AlexNet [35], VGG16 [36], and ResNet50 [37]. Each of the listed architectures won the ImageNet classification challenge [12]. We refrained from making major modifications to the network architectures; however, we did adjust the following characteristics architecture-wise (the motivation behind this is explained in the sequel):
• Batch normalization (BN) layers were added after each convolutional layer and after each fully connected layer, except for the output layer. BN layers were not added to ResNet50 because the mentioned network originally has BN layers incorporated in its architecture.
• Inputs into the networks were adapted so that each model accepted a TFR image and three maximum channel values before data scaling. Data scaling/normalization is thoroughly described in Section 2.5.
• Outputs from the networks were adapted so that each model predicted the probability of the input data being an earthquake.
Batch normalization is a neural network training acceleration technique devised by Ioffe and Szegedy in 2015 [38]. Batch normalization layers reduce internal covariate shift, allowing the training to progress at a quicker pace. Additionally, batch normalization layers have regularization effects. That is why dropout and other regularization techniques should be removed or reduced in strength when used in conjunction with BN layers. Table 1 shows the original and modified regularization parameters used by the CNN models. The original parameters are those reported in [35][36][37], while the modified parameters are the ones used in this research. Following the practice in [38], we removed dropout and reduced L2 in strength. ResNet50 originally uses BN layers and does not employ other regularization techniques.
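To illustrate the mechanism (a minimal sketch, not the framework implementation with running statistics), batch normalization standardizes each feature over the mini-batch and then applies a learnable scale γ and shift β:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BN forward pass for a (batch, features) array.

    Normalizes each feature to zero mean and unit variance over the batch,
    then scales by gamma and shifts by beta; beta acts as a bias term.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

Because β already shifts the normalized activations, a bias in the preceding layer is redundant, which motivates disabling the bias wherever a BN layer is added.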
To utilize the full potential of BN layers, Ioffe and Szegedy recommend, among other things, accelerating the learning rate decay and increasing the learning rate. Where BN layers should be added is still a topic of debate, but the original paper states to "apply it before the nonlinearity since that is where matching the first and second moments is more likely to result in a stable distribution" [38], so we followed the same principle. Wherever we added a BN layer, we disabled the use of bias, since the role of the bias is subsumed by β in the batch normalization layer. β is one of the parameters of a BN layer that is determined through the model training process and serves as a bias value. As was already mentioned, training the CNN models takes quite a lot of time, so we used BN layers primarily as a means of accelerating model training. As described in Section 2.3, the results of TFR extractions were three-channel, 224 × 540 images. Each channel represents a single seismogram axis: X, Y, and Z. The chosen CNN architectures expect considerably smaller inputs (224 × 224 × 3 for VGG16 and ResNet50, and 227 × 227 × 3 for AlexNet), so some sort of input adaptation was necessary. For VGG16 and ResNet50, we added two max pooling layers before the original input. The first max pooling layer had both a size, (height, width), and a stride of (1, 2), while the second pooling layer had a size of (1, 47) and a stride of (1, 1). AlexNet's input size differs from the formerly mentioned CNNs, so a slightly different scheme was applied there. Firstly, a zero padding layer was added, having a dimension of (3, 0). Because the time component of a TFR image is large enough, only the frequency component needs to be padded. The zero padding layer produces 230 × 540 × 3 images, which are then fed to the two max pooling layers. The first one has a size and a stride of (1, 2), while the second one has a size of (4, 44) and a stride of (1, 1). By applying the described scheme, we believe to have adapted the input layers of the networks while minimizing the modifications of the original architectures.
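The dimension arithmetic of this adaptation scheme can be checked with the standard valid-pooling output-size formula; `pooled_dim` below is an illustrative helper.

```python
def pooled_dim(size, pool, stride):
    """Output length of a pooling window `pool` moved with `stride` (no padding)."""
    return (size - pool) // stride + 1

# VGG16 / ResNet50 path: 224 x 540 -> 224 x 270 -> 224 x 224
h, w = 224, 540
h, w = pooled_dim(h, 1, 1), pooled_dim(w, 2, 2)    # pool and stride (1, 2)
h, w = pooled_dim(h, 1, 1), pooled_dim(w, 47, 1)   # pool (1, 47), stride (1, 1)

# AlexNet path: zero pad (3, 0) -> 230 x 540, then the two pooling layers
ah, aw = 224 + 2 * 3, 540
ah, aw = pooled_dim(ah, 1, 1), pooled_dim(aw, 2, 2)   # -> 230 x 270
ah, aw = pooled_dim(ah, 4, 1), pooled_dim(aw, 44, 1)  # -> 227 x 227
```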
The three used CNN architectures have 1000 neurons in their output layers. This stems from the fact that the architectures were originally devised for dealing with the ImageNet [12] dataset, which contains 1000 classes. Because we are interested in classifying whether something is or is not an earthquake, we only need one sigmoid neuron in the output layer. In addition, because we believed that the maximum values of each of the three channels (before data scaling) would enhance classification performance, along with modifying the output layers to have a single sigmoid neuron, we also added the channel max data input as close to the output layer as possible. The same was performed in [5]. Jozinović et al. showed that this procedure is highly effective [39]. In VGG16 and AlexNet, max channel data were concatenated with the last fully connected layer before the output neuron. In ResNet50, channel max data were concatenated with the global average pooling layer. All the described modifications resulted in CNNs that are capable of accepting 224 × 540 × 3 TFR input along with three channel max values before data scaling, and are capable of outputting the probability that the input data represent an earthquake. Each CNN model takes approximately 24 h to train.
We also trained the base model as a means of comparing classification accuracy with the CNNs trained on TFR representations. We chose the model described in [4]. The original model accepts three-channel waveform data of size 1000 × 3. As described in Section 2.2, LEN-DB waveform data have three channels, each consisting of 540 samples. To match the LEN-DB data size with the model input, we padded the waveform data with 230 zeros on each side. Further, as formerly described, maximum channel values before scaling were added to the last fully connected layer. The output layer was also adapted to enable earthquake classification. No other modifications were made to the base model. We retained L2 regularization with λ = 0.001. The base model, being much simpler than the CNN models, takes about 1 h to train; therefore, BN layers were not added to the architecture and the original regularization procedure was retained. It is important to emphasize that we normalized the inputs into the neural networks, both in the case of the CNN models and the base model. Input normalization is described in Section 2.5.
Since our primary concern is the usefulness of TFRs, the single most important thing when choosing the models is that they have sufficient capacity. Neural network capacity can "be viewed as an upper bound on the number of bits that can be stored in the architecture during learning, or equivalently an upper bound on the number of bits that can be extracted from the training data" [40]. It is clear that the capacity of a neural network should be assessed with respect to the complexity of the problem the network is expected to solve. Consequently, a neural network of insufficient capacity is guaranteed to underfit. Given the fact that the used CNN architectures were originally designed to deal with the ImageNet dataset [12], which is a multi-class, high-variance dataset, we are confident that all the architectures have sufficient capacity in the context of this experiment. To make the comparisons as unbiased as possible, we tried to use the original hyperparameter values and refrained from making any major modifications to the networks, as described earlier in this subsection. Table 2 lists the training parameters that were used. For the CNNs, we mostly relied on the ResNet50 training parameters due to the fact that ResNet50 originally uses BN layers in its architecture.
However, while conducting the experiment, we noticed that the spectrogram TFR was underperforming. The results achieved on the validation dataset by all three CNNs were a clear indicator of an overly large learning rate; therefore, we assessed the performance with many different learning rates ranging from 0.01 to 0.00001. The learning rate of 0.0001 was the first to yield significant improvements, while 0.00001 achieved the same results as 0.0001, but had a longer training time, which was expected. The tested learning rates had no positive or negative effects when applied to other TFRs. The only difference was the extended training time. Due to that, we decided to apply the learning rate of 0.0001 when CNNs were trained on spectrogram data, and a learning rate of 0.1 when we used other TFRs. Each CNN model was trained for a maximum of 30 epochs. The learning rate was divided by 10 if there was no improvement in validation loss over 5 consecutive epochs. Similarly, the training was stopped if there was no improvement in validation loss over 10 consecutive epochs.
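The schedule described above can be expressed as a small state machine. `PlateauScheduler` is an illustrative sketch (the experiments presumably used a framework callback); only the patience values and the divide-by-10 rule come from the text.

```python
class PlateauScheduler:
    """Divide the learning rate by 10 after `lr_patience` epochs without
    validation-loss improvement; stop after `stop_patience` such epochs."""

    def __init__(self, lr, lr_patience=5, stop_patience=10):
        self.lr = lr
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.best = float('inf')
        self.bad_epochs = 0
        self.stop = False

    def update(self, val_loss):
        if val_loss < self.best:           # improvement: reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.stop_patience:
                self.stop = True           # early stopping
            elif self.bad_epochs % self.lr_patience == 0:
                self.lr /= 10.0            # reduce on plateau
        return self.lr
```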

Input Normalization
Input normalization brings specific benefits to neural network training [41,42]. More specifically, data normalization brings different features into a similar value range, and therefore facilitates the gradient flow during neural network training, consequently yielding a model with superior performance. We normalized the data by scaling each channel separately, i.e., by determining the minimum and maximum values and applying the following expression: η = (γ − β) / (α − β), Equation (11), where η represents the value after scaling, α is the maximum channel value, β is the minimum channel value, and γ is the value that needs to be scaled. This brought each value of each channel into the range between zero and one, inclusive. Further, we saved the maximum values of each channel for each instance and used them in neural network training. Furthermore, we calculated a per-channel mean image (i.e., a three-channel mask) on the training set for each TFR separately. The mask is of the same shape as the TFR images and contains the average of each pixel, i.e., each pixel of each channel is a separate feature. Let M be a zero tensor of dimension 224 × 540 × 3. Let T be a 4-D tensor containing the TFR training set images. T's shape is N × 224 × 540 × 3, where N represents the number of images in the training set. Equation (12), M_{i,j,k} = (1/N) ∑_{n=1}^{N} T_{n,i,j,k}, describes how the mask calculation is conducted for each element of the mask tensor. The mask was then applied to the validation set and the test set.
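The per-channel scaling and the mean mask can be sketched in numpy as follows; `minmax_per_channel` and `mean_mask` are illustrative helpers, not the paper's code.

```python
import numpy as np

def minmax_per_channel(x):
    """Scale each channel of an (H, W, C) image to [0, 1] using
    eta = (gamma - beta) / (alpha - beta), and also return the per-channel
    maxima, which are kept as an extra network input before scaling."""
    alpha = x.max(axis=(0, 1), keepdims=True)   # per-channel maximum
    beta = x.min(axis=(0, 1), keepdims=True)    # per-channel minimum
    return (x - beta) / (alpha - beta), alpha.ravel()

def mean_mask(train_images):
    """Per-pixel, per-channel mean over the training set; `train_images`
    has shape (N, H, W, C) and the result has shape (H, W, C)."""
    return train_images.mean(axis=0)
```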
The same normalization procedure we applied to the TFR data was also applied to the original waveform data that were used for the base model training. We scaled each channel separately by using Equation (11). As described in Section 2.2, the waveform data dimensions are 540 × 3. Hence, when calculating the mask, each value of each channel represents a separate feature. Equation (12) can therefore be adapted to allow for the smaller number of dimensions when dealing with waveform data. Let M be a zero tensor of dimension 540 × 3. Let T be a 3-D tensor containing the training set waveform data. T's shape is N × 540 × 3, where N represents the number of instances in the training set. Equation (13), M_{i,j} = (1/N) ∑_{n=1}^{N} T_{n,i,j}, describes how the mask calculation is conducted for each element of the mask tensor.

Evaluation Metrics
We evaluated the results based on three indicators: the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), and classification accuracy. A model is evaluated using different probability thresholds. The obtained true positive rates (TPRs) and false positive rates (FPRs) show the level of certainty the model has while classifying the data. TPRs and FPRs can be plotted one against another, and the AUC can be calculated. The ideal AUC value is 1, but in practice, it is fairly hard to achieve. We also tested the classification accuracies of our models by applying the standard threshold of 0.5, using the following rounding rule: f(x) = 1 if x ≥ 0.5, and f(x) = 0 otherwise, where x represents the model's prediction, while f(x) represents the output class: earthquake (1) or noise (0).
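Both metrics are standard; for concreteness, the threshold rule and a rank-based AUC (the Mann-Whitney identity, equivalent to the ROC integral when the scores are tie-free) can be written as follows. `classify` and `auc_rank` are illustrative helpers.

```python
import numpy as np

def classify(p, threshold=0.5):
    """Round predicted earthquake probabilities to class labels (1 if p >= 0.5)."""
    return (np.asarray(p) >= threshold).astype(int)

def auc_rank(y_true, scores):
    """AUC via the rank-sum identity (assumes tie-free scores)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```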

Statistical Significance
A series of statistical tests were performed to determine if the results are strong enough to draw solid conclusions about the usefulness of TFRs. We used McNemar's test because "for algorithms that can be executed only once, McNemar's test is the only test with acceptable Type I error" [43]. Our goal was to assess if any of the used TFRs enhances the original data and, as a consequence, yields models whose performance significantly surpasses the performance of the base model. The idea was to keep the significance level α at 0.01 for each TFR. Due to the fact that each TFR was applied to three different CNN architectures, we utilized the Bonferroni correction [44] and divided α by 3. This resulted in a significance level of α = 0.01/3 ≈ 0.0033 for each TFR-CNN couple. Each comparison was made between the results of a CNN coupled with a TFR and the results of the base model.
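For reference, a continuity-corrected chi-square form of McNemar's test (one common variant; the paper does not specify which form of the test [43] was used) together with the corrected significance level can be written as:

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_p(b, c):
    """McNemar's test with continuity correction.

    b and c are the off-diagonal counts: instances classified correctly by
    exactly one of the two compared models.
    """
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return chi2.sf(stat, df=1)

alpha = 0.01 / 3   # Bonferroni-corrected level per TFR-CNN couple
```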

Results
The final results on the test dataset are presented in Tables 3 and 4. Table 3 lists the final accuracies on the test set, while Table 4 lists the AUC values. Base model results are marked in gray, the best TFR-model combination in green, and the worst TFR-model combination in red. Additional results are presented in Figure S10, where each subfigure represents the base model and all the TFRs coupled with a specific CNN. Figures S11-S19 focus on the base model results and how they compare to the TFRs. Table 3. The results of model evaluation using classification accuracy measured on the test dataset. Each time-frequency representation is coupled with a convolutional neural network. The best performing combination is marked in green, while the worst performing one is marked in red. The results of the base model are marked in gray. The base model achieved an accuracy of 94.15% and an AUC of 0.9785. The best TFR results were achieved by the AlexNet architecture combined with the Pseudo Wigner-Ville TFR. The mentioned combination achieved an accuracy of 95.71% and an AUC of 0.9859. On the other hand, the worst performing combination was the one between VGG16 and the Margenau-Hill TFR, which achieved an accuracy of 86.45% and an AUC of 0.9253.

Table 5 lists p-values for all the comparisons. We used McNemar's test [43]. We should note that the p-values were rounded to three decimal places. Statistically significant cases are marked in green and red. Green is used to indicate that a TFR model performed better than the base model, while red is used to emphasize the opposite. The cases where no statistically significant differences between the results of a time-frequency representation model and the base model were observed are marked in gray.

Discussion
This research shows that the most common TFR traditionally used in seismology, the spectrogram, has better performing alternatives. Because we used fairly complex CNNs and a well-determined training protocol, a reasonable assumption is that both the CNNs and the base model had an equal opportunity to learn the differences between earthquake and noise seismograms; therefore, the key factor preventing CNN results from being significantly better than those of the base model is the data representation that was used to train the CNN models. Further to the previous point, if a TFR yields significantly better results than the base model when coupled with all three CNNs, we conclude that the particular TFR significantly enhances waveform data and, also, gives superior accuracy. The opposite is also true. In addition, if a row connected to a certain TFR (see Table 5) contains even a single result that is not statistically significant, we cannot draw a conclusion about that particular TFR. The same is true if a single row contains cells that contradict each other, i.e., the p-values infer TFR superiority and inferiority at the same time. Given the p-values presented in Table 5, and the statistical significance level of α = 0.01/3 ≈ 0.0033, we can conclude that the Pseudo Wigner-Ville and Wigner-Ville TFRs significantly enhance waveform data, while the Margenau-Hill and spectrogram TFRs significantly worsen the results. We cannot claim anything about the remaining five TFRs; that is, the remaining TFRs are as informative as the original data. The conclusions are based on the randomly sampled subset of the LEN-DB dataset that was used in the experiment.
The Pseudo Wigner-Ville TFR coupled with the AlexNet CNN achieves 1.56 percentage points better results than the base model, while the Wigner-Ville TFR coupled with the ResNet50 CNN surpasses the base model by 1.09 percentage points. The increase in accuracy is important as the waveforms misclassified in the original study [11] mainly come from the stations that have high noise levels. This suggests that the selection of a TFR helps the machine learning model to achieve significant improvements on those stations. The former could be of crucial importance for the applications of machine learning in seismology on sensors featuring higher instrumental noise. Consequently, researchers will have higher quality data to further study earthquake occurrence and the phenomena linked to it.
Because the difference in performance between the base and the TFR models is quite close, we used McNemar's test of statistical significance. The test does not directly compare model accuracies, but focuses on the cases where the base model and the TFR model disagreed in their predictions. It then checks if there is a statistically significant difference in their performance. In the context of our experiment, a significant result means it is highly unlikely that the observed disagreements between a TFR model and the base model would have arisen by chance if there were truly no difference between machine learning model performance on raw waveform data and on the TFRs.
Perhaps the most important result is that, by applying the PWV or WV TFRs to the recorded data, seismologists can exploit the data better to lower the magnitude detection threshold for earthquakes. Indeed, because of the magnitude-occurrence Gutenberg-Richter law, this can result in a large increase in earthquakes in the catalogs and, to some extent, reduce the need to add new stations, with an overall reduction in the costs.
An interesting observation is that the spectrogram TFR, which is often used in non-stationary signal analysis, performed significantly worse than the original waveform data. Seeing that the results achieved using TFRs suggest their usefulness in machine learning modeling, we believe that TFRs could prove useful in other domains as well, e.g., EEG and ECG signals, gravitational waves, and speech data; however, this should be investigated further.
Since the LEN-DB dataset contains an approximately equal number of earthquake and noise instances, the work presented in this paper does not take possible class imbalance into account; however, the proposed methodology could easily be adapted to investigate such a scenario. The only necessary modification of the experiment workflow is the use of evaluation metrics appropriate for imbalanced datasets, such as the Matthews correlation coefficient, the precision-recall curve, or the F1 score.
We should note that, in addition to the LEN-DB dataset, we also considered other datasets for this research, such as the STanford EArthquake Dataset (STEAD) [45]; however, given the complexity and time-consuming nature of extracting TFRs, along with the time needed to train three intricate CNN models on nine TFRs, we limited the experiment to the LEN-DB dataset, as it had the smallest waveforms among the considered datasets. Since our hypothesis is that TFRs might affect the classification accuracy of machine learning models, confirming the hypothesis on a single dataset provides sufficient evidence to accept it, which mitigates the need to use other datasets.
The inclusion of spatio-temporal information has proved to be a viable way to enhance deep learning models [46][47][48] and has been applied in seismic applications of machine learning mainly through multi-station input data (e.g., [49,50]). Such applications could perhaps also be improved by using the alternative TFRs suggested here, although this remains to be tested. Additionally, our analysis could easily be expanded by, e.g., incorporating magnitude prediction in the experiment (similarly to our recent paper examining transfer learning between different domains [51]). This is, however, beyond the scope of this study.

Conclusions
Building on [4,5], this research explored the possibility that TFRs are more informative than raw waveform data or commonly used spectrograms for the analysis of non-stationary signals. More specifically, we tested the effects of TFRs on machine learning classification. The data used in the experiment were a subset of the LEN-DB seismological dataset [11]. We used nine TFRs coupled with three well-established CNNs.
The results show that a number of combinations surpass the accuracy of the base model described in [4]. In addition, tests of statistical significance suggest that the Pseudo Wigner-Ville and Wigner-Ville TFRs significantly enhance waveform representations and yield higher classification accuracies, while the Margenau-Hill and spectrogram TFRs worsen the results (p < 0.01). The best combination, the AlexNet architecture coupled with the Pseudo Wigner-Ville TFR, achieves accuracy 1.56 percentage points higher than the base model.
Given the vast number of earthquakes that happen every year, this results in a significant increase in the number of correctly classified events. Moreover, the analysis of the results suggests that phenomena previously hidden in the seismic noise are now more easily identified; however, the main conclusion of the paper is that the machine

Figure 1. A diagram depicting the main phases of the experiment.

Figure 2. Two randomly sampled seismogram instances from the LEN-DB dataset, depicting (a) an event (earthquake) and (b) noise.

Figure 3. Pseudo Wigner-Ville transformations of (a) an earthquake and (b) noise shown in Figure 2. All three axes are presented.

Figure 4. Wigner-Ville transformations of (a) an earthquake and (b) noise shown in Figure 2. All three axes are presented.

Figure 5. Margenau-Hill transformations of (a) an earthquake and (b) noise shown in Figure 2. All three axes are presented.

Figure 6. Spectrogram transformations of (a) an earthquake and (b) noise shown in Figure 2. All three axes are presented.

Table 1. Original and modified regularization parameters used by the convolutional neural networks.

Table 2. Training parameters that were used for the experiment.

Table 4. The results of model evaluation using the area under the receiver operating characteristic curve, measured on the test dataset. Each time-frequency representation is coupled with a convolutional neural network. The best performing combination is marked in green, the worst performing one in red, and the results of the base model in gray.

Table 5. p-values of the performed McNemar's tests (α = 0.003). Each time-frequency representation is coupled with a convolutional neural network. The values are rounded to three decimal places. Statistically significant cases are marked in green and red: green indicates that a time-frequency representation model performed better than the base model, while red emphasizes the opposite. Cases where no statistically significant difference between a time-frequency representation model and the base model was observed are marked in gray.