In this section, we first describe the dataset and the preprocessing of raw seismometer time-series recordings into STFT images for input to the Siamese network. We then briefly review state-of-the-art classification methods, followed by the proposed Siamese network architecture for characterising each detected seismic event into one of four classes.
2.1. Dataset
The Résif seismic records dataset (available at [
9]) contains a subset of recordings from an active landslide in the French Alps, for which a catalogue—containing labels generated using an STA/LTA detector in the frequency domain followed by a supervised random forest algorithm and then expertly validated [
16]—is available for three monitoring periods, namely 11 October to 19 November 2013, 10 November to 30 November 2014 and 9 June to 15 August 2015. The seismic records are acquired by two permanent arrays of the French Landslide Observatory OMIV (Observatoire Multi-disciplinaire des Instabilités de Versants) installed at the east and west sides of the Super-Sauze landslide (Southeast France) developed in weathered black marls [
16]. Data are gathered by two sensor arrays (SZB and SZC), each with one 3D sensor and three 1D (Z-axis) sensors. The sampling rate is 250 Hz and events last between 0.2 s and 105 s. Please refer to [
10,
16,
19] for further details on the monitoring instruments, setup and location, and on how the data are processed for classification. For our experiments, we use the MT.SZC station, as it has the most complete data with the fewest gaps. Our model uses all three channels from the 3D sensor (sensor 0) of the SZC array.
Table 1 shows the four types of labelled events available. Rockfalls mainly occur at the main scarp of the landslide, where rigid blocks fall from the steep slope (height > 100 m); they are characterised by distinctive impacts over several seconds. The earthquake class comprises regional seismic events in this area as well as teleseisms; earthquake events have triangular spectrogram components with decaying high-frequency content ranging from 2 to 50 Hz. Micro-earthquakes (also called micro-quakes) are likely triggered by material damage, surface cracks and openings, and are very short events of less than 5 s. N/A noise events include all anthropogenic and environmental noise (due to, e.g., transportation, pedestrians, heavy rain, animals and strong wind), which can last tens of seconds and generally has a higher frequency range, between 0 and 100 Hz, with very distinct spectral patterns. Note that the events in the Résif catalogue can have frequency content as low as 2 Hz. Please refer to [
19], which provides a description and visualisation, in the time and frequency domains, of the four types of events considered in this study, and to the references therein for more details about the site’s endogenous seismicity.
The second column in
Table 1 shows the total number of labelled events per class in the Résif catalogue. The dataset is split into training, validation and testing sets. The test set comprises the chronologically last 30% of events per class and is identical for all runs. A further 10% of events of each class is kept aside for validation, and the remaining events are used for training. Data augmentation, as described in
Section 2.5, is performed during training so that each class has the same number of events in the random event start pool. The training and validation events are sorted into chronological order and stratified. Training is completed using Python 3.10.10 and TensorFlow 2.10.1 on a system with an Intel i7-10700K CPU and an NVIDIA RTX 2080 Ti GPU. Throughput is around 46.5 STFT images per millisecond. With a dataset repetition of 10, a training epoch with validation takes around 15 s.
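The per-class chronological split described above can be sketched as follows. This is a minimal illustration of the 30%/10% hold-out scheme; the function and variable names are ours, not from the paper's code:

```python
import numpy as np

def chronological_split(event_times, labels, test_frac=0.30, val_frac=0.10):
    """Per-class split: the chronologically last `test_frac` of each class
    is held out for testing, the preceding `val_frac` for validation, and
    the remainder is used for training (sketch of the split in the text)."""
    train_idx, val_idx, test_idx = [], [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        idx = idx[np.argsort(event_times[idx])]          # chronological order
        n = len(idx)
        n_test = int(round(test_frac * n))
        n_val = int(round(val_frac * n))
        test_idx.extend(idx[n - n_test:])                # last 30% per class
        val_idx.extend(idx[n - n_test - n_val: n - n_test])
        train_idx.extend(idx[: n - n_test - n_val])
    return np.array(train_idx), np.array(val_idx), np.array(test_idx)
```

Because the test events are the chronologically last ones per class, the split is identical for every run, matching the evaluation protocol above.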
In the catalogue [
16], the challenge of expert labelling is obvious: a number of events, although labelled as one class, contain a note suggesting there may be doubt. The value of expert labelling with the support of machine learning, especially for uncertain events, is described in the human-in-the-loop approach of [
11]. As shown in the third column of
Table 1, uncertainty occurs most frequently within the microseismic class, where 15 events are noted as possible rockfalls and 2 as possible earthquakes. Within the earthquake class, there are also nine teleseismic events between 12 and 25 October 2013, possibly aftershocks of the 2013 Bohl earthquake. The teleseismic events have longer durations than the other earthquake events (around 40 to 60 s), and hence are likely to lead to classification errors. We take all of the catalogue labels as ground truth during training and testing, and show examples of labels that differ from their class in
Section 3.3, concluding that they are most likely mislabelled. This underscores the importance of the proposed work.
2.3. Classification Methods
Recently, machine learning, and specifically deep learning, approaches have been prevalent within the Earth Sciences community. Indeed, the scale of this is highlighted in a recent review paper [
26], which breaks down the various applications that machine learning is being used for—event detection, source localisation [
27] and cluster analysis [
28] being the most prominent. For recent approaches using hand-crafted feature generation and selection for classification, ref. [
10] provides a review, including a breakdown of the most commonly used features, and highlights some of the current issues faced by industry-standard approaches using Short-Term Average/Long-Term Average (STA/LTA) pickers, which struggle to detect events with low spatial and temporal separation. For recent deep learning-based classification approaches, including multi-class classification, please refer to [
19]. Popular approaches are Convolutional Neural Networks (CNN)-based [
11,
19,
21], although other deep learning architectures have been used as well including a pseudo-Siamese implementation of EQTransformer performing template matching, which is then fed into P and S networks [
22] and Region-based CNNs (R-CNN) enabling capture of earthquake events in 1-D time series data across multiscale (time dilation) anchors [
18]. We next consider only deep learning-based approaches that perform both detection and classification and do not require additional feature generation and selection. These approaches are broken down into either binary solutions (for seismic event detection) or multi-class solutions, as is the case in this paper.
EQTransformer (EQT) [
20] is a deep encoder/decoder model, trained on the STanford EArthquake Dataset (STEAD), to detect earthquakes. EQT comprises an encoder branching into three decoders: a detector and S-phase and P-phase pickers. EQT picks/detects arrivals from the raw input signal in the time domain and was tested on the Japanese Meteorological Agency HiNet dataset, showing that it can detect magnitudes of around 1.5–1.8 M, although it is capable of detecting smaller micro-earthquakes given high-resolution network coverage and smaller station spacing. The majority of the literature [
29,
30,
31] employs EQT to detect only earthquakes or ‘noise’ (i.e., everything else) using the time domain rather than the frequency domain signal representation. In [
29], large pre-trained image classification networks, such as GoogLeNet, AlexNet and SqueezeNet, are adapted for binary classification of earthquake arrivals.
DeepQuake [
21] proposes two classification solutions for temporal and frequency domain classification, and concludes that the frequency approach proves more successful at classifying multiple types of events, namely earthquake (natural and induced), noise (everything else) and other. The ‘other’ classification in this case involved events typically associated with mining, such as explosion, mine collapse, sonic boom, quarry blast and a ‘not specified’ category which made up around 30% of ‘other’ events. RockNet [
32] is a deep learning model supporting single- and multi-station setups, proposed for classifying rockfall, earthquake and noise events. It is designed to take in 3-axis seismograms and a spectrogram from the vertical axis. The inputs are processed through their respective encoders and an attention gate, then through decoders for earthquake picks and for earthquake and rockfall masks. This forms the single-station model of RockNet. The multi-station model makes use of the frozen weights of the individual station models to train across multiple stations, with the association model outputting earthquake occurrence, rockfall occurrence, or no occurrence (considered as noise).
Performance comparison of three CNN multi-label classifiers, each with a different input domain type: time, frequency and wavelet, is conducted in [
19]. Each classifier takes in six channels, comprising three-axis recordings from one station and three vertical components from other stations. The three networks, all with a similar design (1D input for time, 2D for frequency and wavelet), perform almost identically: the time series-based network is the fastest, while the wavelet-based one is the slowest, with insignificant differences in classification performance for earthquakes and rockfalls, although the Short-Time Fourier Transform (STFT) and wavelet models work best for micro-quakes. The multi-classifier also demonstrates good transferability between geologically distinct sites with different monitoring network geometries. Furthermore, explainability of the outputs of the CNN-based multi-class classifier, based on Layer-wise Relevance Propagation, is introduced in [
11].
While a number of binary and multi-class deep learning-based classifiers proposed in the literature are capable of detecting earthquakes, mostly of large magnitudes, and to a lesser extent rockfalls and noise, with high accuracy, they are highly dependent on large labelled training datasets due to the complex nature of seismic signals, which differ with source and geology.
The benefit of our proposed approach is the ability to extend the number of labels available once trained, a benefit of the soft-labelling of Siamese networks, with minimal new data. We show good performance with minimal network parameters in comparison to other Siamese approaches (e.g., [
33]) and without the use of a pre-trained model unlike [
22]. Additionally, the proposed semi-supervised methodology includes how to identify good exemplar anchors for each class, which can be used to reduce inference time. We also show how identification of outliers is possible and can highlight misclassifications within the original dataset, which in turn can aid experts to remove uncertainty in labelling during review. Finally, the proposed sliding-window approach to data ingest means that the overhead of individual P- and S-wave picking, as in [
20], is not necessary to determine event starts.
2.4. Siamese Network Architecture
Siamese networks were initially proposed as a way of validating signatures but have rapidly become a key component of the architecture within change-detection networks. They are based on the concept of similarity learning, and operate by comparing two inputs (typically images) and calculating the difference between the encoded network representations of the input sequences. This can range from pixel-by-pixel classification, often used in lower-resolution remote sensing images [
34], to creating change maps, e.g., ref. [
35] shows the change in built-up areas highlighting building construction and demolition between specified dates. As such, our approach can be transferred to remote sensing applications allowing for classification of features based on similarity scores. When applied to time-series networks, in contrast to traditional distance-based methods used to compare time-series signals (e.g., dynamic time warping), Siamese networks perform similarity comparison on learnt features, thus capturing intrinsic differences between time-series signals that cannot be achieved with traditional distance-based methods [
36].
Let $x_i \in \mathbb{R}^n$ be the $n$-dimensional input sample with corresponding label $y_i \in \{1, \dots, K\}$, where $i$ is used to index the sample number and $K$ is the number of unique classes. Conventional supervised classifiers use training sets to learn a function that maps the input attributes to class labels. Instead, Siamese networks perform metric or similarity learning, i.e., they learn a function $s$ that compares two input samples. Specifically, the networks first perform an identical transformation $f$ of the two input samples and then apply a distance metric to estimate similarity, i.e., $s(x_i, x_j) = h(f(x_i), f(x_j))$, where the function $h$ is the chosen distance metric and the transform $f$ generates the embeddings $f(x_i)$ and $f(x_j)$. Note that the same transform is applied to both input data samples for identical representation learning, and the transform $f$ is implemented as a deep CNN.
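The structure $s(x_i, x_j) = h(f(x_i), f(x_j))$ can be made concrete with a small sketch. Here $f$ is a stand-in linear map with ReLU (in the paper it is a deep CNN) and $h$ is Euclidean distance; all names are illustrative:

```python
import numpy as np

def f(x, W):
    """Stand-in for the learned transform f: a fixed linear map followed
    by ReLU. In the paper, f is a deep CNN and W its learned weights."""
    return np.maximum(W @ x, 0.0)

def h(e1, e2):
    """Distance metric h on embeddings (Euclidean, for illustration)."""
    return np.linalg.norm(e1 - e2)

def s(x_i, x_j, W):
    """Siamese similarity: the SAME transform f is applied to both
    inputs, then a distance is taken between the two embeddings."""
    return h(f(x_i, W), f(x_j, W))
```

Because the transform is shared, two identical inputs necessarily map to the same embedding, giving a distance of zero.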
Siamese networks are by design highly flexible and hence have a large number of applications, mainly related to image analysis or target tracking which makes best use of their ability to identify similarities. Classification is by far the most popular implementation but they can also be used for natural language processing assessing document similarity [
37] and in time-series analysis by identifying trends for prediction [
38].
Specifically, when applied to the classification problem, in contrast to traditional methods that learn a function that maps input samples to their labels, Siamese networks learn a metric function that compares respective embeddings of two input samples. During the testing phase, the learnt function is used to measure similarity between a chosen exemplar sample (referred to as anchor) and each test sample—if the distance is below a threshold, the test sample belongs to the same class as the anchor.
The main advantage of Siamese networks, compared to other metric learning approaches, lies in joint feature representation and metric learning. A recent survey [
39] of Siamese network implementations in the real world highlights the benefits these networks provide, specifically with regard to similarity and matching tasks, and, importantly, their ability to learn from unlabelled data by directly comparing embeddings rather than relying on labels.
The second advantage of Siamese networks is their ability to perform well even with minimal training data. This is a key feature when investigating seismic datasets, since, as discussed previously, seismic labelling can be highly subjective. Pickers based on methods such as STA/LTA are commonly used, comparing the signal amplitude ratio between long- and short-term averages against a user-set threshold. This introduces user bias, as low thresholds will result in more picks, which require more manual validation but likely capture more low-amplitude events. Siamese networks, however, have been shown to produce excellent results with very low labelled data requirements. Indeed, ref. [
40] shows that high performance is possible with as few as 10 labels per class during training to identify the make and/or model of digital cameras. However, ref. [
40] does not discuss choosing specific images as exemplars and does not take into account the possibility of incorrect labels. In [
33], the performance of Siamese networks is shown in comparison to more traditional text classification methods, with the inclusion of label tuning as an alternative to fine-tuning; the paper also highlights that Siamese network inference is independent of the number of labels, unlike cross-attention models, whose performance can halve when the number of labels doubles. The test results are generated by comparing every example against every other and do not account for outliers or incorrect labelling. Note that the models used in [
33] are over 400 times larger, with 109,486,464 parameters, compared to our proposed model with only 232,704 parameters—see
Section 2.4.
Extracted features, together with corresponding labels, can be fed into another network for supervised learning. Additionally, as the Siamese network structure is based on typical network architecture and layering, pre-trained models can be used as experts to generate pseudo-labels from unlabelled data, enabling a new lightweight, domain-specific network to be created; once training is finished, the pre-trained domain-agnostic networks can be dropped [
41].
Despite the high potential to mitigate the issues of lacking and unreliable labels, noise and misalignment of time-series signals [
36], the adoption of Siamese networks for classification of seismic signals has not been attempted yet, except in [
22], where a Siamese model is used to enhance earthquake picks (detection) from EQT [
20]. This is performed by using a confident pick from EQT as the anchor input and then comparing against other stations at the same time to cross-correlate picks. Using this method the authors generate more P and S phase picks, helping to identify earthquakes that EQT could not, due to not having enough stations to meet its required threshold of three stations.
The proposed Siamese network architecture, and the specific size and shape of the network are presented in
Figure 1 and listed in
Table 2. The network comprises a feature extractor, a comparison head and a decision-making head. The feature extractor comprises two identical branches, each processing one input and implementing the feature-learning transform $f$
via a fully convolutional network. The comparison head applies a distance metric to measure similarity between the two embeddings.
Similarity scores between two network inputs are calculated based on a distance metric, which needs to be discriminative enough to separate instances of different classes and group together those of the same class. The top-performing distance functions are Euclidean, Manhattan and Cosine, given, respectively, by
$$d_E(P, Q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2},$$
$$d_M(P, Q) = \sum_{k=1}^{n} |p_k - q_k|,$$
$$d_C(P, Q) = 1 - \frac{\sum_{k=1}^{n} p_k q_k}{\sqrt{\sum_{k=1}^{n} p_k^2}\,\sqrt{\sum_{k=1}^{n} q_k^2}},$$
where $P$ and $Q$ are $n$-length vectors with entries $p_k$ and $q_k$, respectively.
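The three distance functions can be written directly in NumPy (a straightforward transcription of the standard formulas; cosine is expressed as a distance so that identical directions give 0):

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance between n-length vectors p and q."""
    return np.sqrt(np.sum((p - q) ** 2))

def manhattan(p, q):
    """Manhattan (L1) distance between p and q."""
    return np.sum(np.abs(p - q))

def cosine_distance(p, q):
    """1 minus cosine similarity, so identical directions give 0."""
    return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
```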
Finally, the decision-making head is a single dense neuron providing a soft-label in the range of 0 to 1, where two identical sequences would result in an output of 0 and highly dissimilar inputs would produce an output close to 1.
The convolutional layers and all but the last dense layer use ‘relu’ as their activation function; the final dense layer uses ‘sigmoid’. The convolution layers are 2D due to the 3D-based input (in our case, a 65 × 66 STFT image for each of the three channels). Dropout is used after the first two convolution layers, and max pooling is used to reduce the size of the feature maps. The network has a total of 232,704 trainable parameters.
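A sketch of this architecture in Keras is given below. The layer widths are illustrative only and do not reproduce the exact sizes in Table 2, and the comparison here uses an element-wise absolute difference of the embeddings for simplicity (the paper ultimately selects cosine distance):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_branch(input_shape=(65, 66, 3)):
    """One feature-extractor branch: 2D convolutions with ReLU,
    dropout after the first two conv layers, max pooling to shrink
    the feature maps. Widths are illustrative, not from Table 2."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(16, 3, activation="relu")(inp)
    x = layers.Dropout(0.4)(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, activation="relu")(x)
    x = layers.Dropout(0.4)(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    return models.Model(inp, x)

def build_siamese(input_shape=(65, 66, 3)):
    branch = build_branch(input_shape)   # one branch object = shared weights
    a = layers.Input(shape=input_shape)
    b = layers.Input(shape=input_shape)
    diff = layers.Subtract()([branch(a), branch(b)])
    dist = layers.Lambda(tf.abs)(diff)   # comparison head (abs difference)
    out = layers.Dense(1, activation="sigmoid")(dist)  # decision-making head
    return models.Model([a, b], out)
```

Instantiating the branch once and calling it on both inputs is what enforces identical weights on the two sides of the network.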
2.5. Siamese Network Training and Classification
During training, the data are randomly paired up to create a range of data sample pairs $(x_i, x_j)$ with their corresponding labels $(y_i, y_j)$. The labels are then converted to a similarity score $z_{ij} \in \{0, 1\}$. These pairs are then sent to each branch, simultaneously updating the weights of the two CNN branches so that the loss function is minimised. Once the network has been trained, it can be used to extract feature vectors (rather than similarity scores): every event encoding can be computed once and processed later. This enables a large speed-up when testing over an entire dataset, as repeated passes through the full network are avoided.
Specifically, during training, each of the two branches of the network is shown one STFT image, $x_i$ or $x_j$, with flag 0 if $y_i = y_j$ (e.g., both $x_i$ and $x_j$ are earthquakes), or 1 otherwise (e.g., $x_i$ is an earthquake and $x_j$ is a rockfall). The samples $x_i$ and $x_j$ are randomly chosen from the training-set start pool to keep the classes balanced. In order to balance the number of events across classes, data augmentation is performed by shifting class events from the training set in time within the window to create new events. The two branches create their own encoded vectors (i.e., embeddings) representing the learned features of the input, where the distance between these two vectors determines the similarity score. Given two identical inputs, both sides of the network would produce the same encoded vector and therefore the distance between the two would be zero.
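The pairing with dissimilarity flags can be sketched as follows. This is an illustrative generator, not the paper's code; the class-cycling is one simple way to keep classes balanced:

```python
import numpy as np

def make_pairs(features, labels, n_pairs, rng=None):
    """Randomly pair training samples and attach a flag: 0 if the two
    samples share a class label, 1 otherwise. Cycling through classes
    for the first element of each pair keeps classes represented
    evenly (illustrative sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    classes = np.unique(labels)
    by_class = {c: np.where(labels == c)[0] for c in classes}
    left, right, flags = [], [], []
    for k in range(n_pairs):
        c = classes[k % len(classes)]        # cycle classes for balance
        i = rng.choice(by_class[c])          # first sample from class c
        j = rng.choice(len(labels))          # second sample from anywhere
        left.append(features[i])
        right.append(features[j])
        flags.append(0 if labels[i] == labels[j] else 1)
    return np.array(left), np.array(right), np.array(flags)
```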
Classification is performed on an unseen dataset (test set) by choosing an ‘anchor’ from the training set for each class. In a fully supervised approach, the anchor would be selected by the user/expert during testing. As noted previously, this is time-consuming and subjective for microseismic signals. Instead, for a semi-supervised approach, we choose, in a data-driven manner, the anchor from each class that scores the highest accuracy among its peers. Specifically, we take the event with the highest F1-score during training for each class. Note that the F1-score is a standard classification metric, the harmonic mean of precision and sensitivity, providing a balanced measure that considers both the number of wrongly detected events and the number of missed events—see [
10] on how to calculate the F1-score for seismic classification. This is important, as a label in the training set may be incorrect, or the signal-to-noise ratio may be too low for the event to perform well as an anchor. It is also possible to provide multiple anchors of the same class to create an aggregate score, which provides better generalisation (assuming that all of the class anchors are similar) but at the expense of processing time.
A threshold is required to determine which events are considered ‘similar’ enough to be classified/grouped in the same class as the anchor. The best threshold values depend on the distance function used but are generally in the centre of the expected range, e.g., 0.5 for the 0–1 range.
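Anchor-based classification with a decision threshold can be sketched as below. Names and the dictionary-of-anchors interface are ours; the 0.5 default simply reflects the centre of the 0–1 range discussed above:

```python
import numpy as np

def classify(embedding, anchor_embeddings, threshold=0.5):
    """Assign the class of the nearest anchor if its cosine distance is
    below `threshold`; otherwise return None (no confident class).
    `anchor_embeddings` maps class name -> anchor embedding (sketch)."""
    def cos_dist(p, q):
        return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
    dists = {c: cos_dist(embedding, e) for c, e in anchor_embeddings.items()}
    best = min(dists, key=dists.get)
    return best if dists[best] < threshold else None
```

A sample far from every anchor is left unassigned, which is one way such a scheme can surface outliers for expert review.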
2.6. Determining Siamese Network Parameters, Distance Function, Decision Threshold and Anchors
In this section, we explain the data-driven approach to setting the network parameters, i.e., choosing the distance function, decision threshold and anchor exemplars. This methodology needs to be performed for each dataset to be classified, and is thus important for the purposes of reproducibility.
2.6.1. Network Parameters
To determine optimal network parameters, we tested different learning rates, both static and decaying, with the Adam optimiser. A learning rate of 0.0005 was found to be optimal for our model and is used in all experiments. The final best-performing model shape is listed in
Table 2. Note that initial model configurations with more neurons per layer and the same kernel size resulted in many more trainable parameters but with results which were not significantly different. A dropout of 0.4 was found to be optimal when applied only to the first two layers of the model.
2.6.2. Choice of Distance Function, Decision Threshold and Model Weights
Figure 2 shows that all three
distance metrics performed well, in terms of F1-score, on rockfall and earthquake classes. However, as expected, the network struggles with the micro-quake class, which is often labelled with uncertainty in relation to the noise class, as discussed in reference to
Table 1. Consistently, the cosine distance function resulted in the best F1-scores across all classes, which is expected since it is less sensitive to magnitude differences; it is therefore used from now on.
The next step is to determine the
optimal decision threshold to decide if a similarity score between a sample and the anchor is high enough for the network to infer that they belong to the same class. We experimentally evaluate the effect of threshold from 0.00 to 1.00 on F1-score. In
Figure 3, it can be seen that there is little variation across the threshold range, with extremely sharp drop-offs at the two extremes and the optimal value reached at 0.6. This shows that the network has learned distinct differences between the classes well.
To determine optimal model weights, multiple runs were performed for every configuration using scikit-learn’s StratifiedKFold with 5 folds. At the start of each fold, the training and validation datasets were created using tensorflow.data.Dataset and were then repeated, shuffled (with a set seed) and zipped. Initial training involved all of the training and validation data (70% of all data) being split into 5 folds, with 80% used for training and 20% for validation.
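The 5-fold setup can be sketched with scikit-learn alone (the tf.data repeat/shuffle/zip pipeline is omitted here; X and y are synthetic stand-ins for the 70% train+validation pool):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins: 50 "events" with 4 class labels.
X = np.arange(100).reshape(50, 2).astype(float)
y = np.array([0, 1, 2, 3] * 12 + [0, 1])

# 5 stratified folds: each fold trains on 80% and validates on 20%,
# with class proportions preserved in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
for train_idx, val_idx in skf.split(X, y):
    fold_sizes.append((len(train_idx), len(val_idx)))
```

Fixing `random_state` makes the fold assignment reproducible across runs, mirroring the set seed used for shuffling in the text.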
Figure 4 shows the effect that 5-fold validation has on performance. F1-scores for rockfall and earthquake events remain high between folds, showing that the data variation between training and validation sets makes little difference to the final performance. Noise performs worse, but has a similar spread in F1-scores. The noise class contains a number of different sub-events which are detected across the training-validation set, and it is therefore understandable that changes between folds do not affect performance significantly. Micro-quake events have the greatest spread in performance between folds. Indeed, the micro-quake class has the most varied signal-to-noise ratios and very short event times; thus, a sub-set of events can be considered more challenging than others, as reflected by the large spread in F1-scores, with the worst being below 0.5. Of the five folds, the best model weights were chosen from fold 3, which had the following average F1-scores: rockfall, 0.79; earthquake, 0.94; micro-quake, 0.72; noise, 0.73.
2.6.3. Anchor Identification
The Siamese network performs classification by comparing each testing sample with each class representative (i.e., anchor). Careful anchor selection can significantly improve performance by using only the most representative anchors per class from the training set. We demonstrate how to determine the best anchors in the absence of labels with 100% labelling certainty.
First, the best choice of anchors is obtained using the cosine distance values, illustrated via a heat map in
Figure 5. In this heat map, anchor events in darker colours are similar (smaller cosine distance values), whereas lighter (yellow) colours indicate dissimilarity. That is, the smaller the cosine distance, the more similar the event on the vertical axis is to the class/label on the horizontal axis. The well-defined dark triangular blocks (equivalent to a cosine distance of less than 0.05) on the vertical axis represent events that make good anchors for each of the four class labels on the horizontal axis. Dark vertical or horizontal lines within the lighter portion of the heat map indicate anchor events which have higher similarity, i.e., smaller distance, to events outside their class. These events are outliers and should not be used as anchors.
Thus, the heat map provides a visual indication of events which would be unlikely to produce valid results for a given class, and helps to identify events which may have been wrongly labelled in the catalogue due to expert uncertainty, since they are only highly similar to a different class.
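The pairwise distance matrix underlying such a heat map can be computed in a few lines (sketch; `E` holds one embedding per row, and the matrix can then be passed to any heat-map plotting routine):

```python
import numpy as np

def cosine_distance_matrix(E):
    """Pairwise cosine distances between row-wise embeddings E.
    Low values (dark in the heat map) mark similar events; an event
    whose off-class distances are smaller than its in-class distances
    is a candidate outlier."""
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)  # row-normalise
    return 1.0 - unit @ unit.T                           # 1 - cosine sim
```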
Performance of the model was determined using the test set, generating the distance matrix of the encoded test set and comparing events against each other to obtain the performance across all ‘anchors’ of each class. We observe that training with only the anchors that gave high F1-scores reduced the model performance and, as expected, limiting training to only the lowest-performing anchors made all classes significantly worse. The best performance trade-off occurred when the middle-performing events were chosen from the rockfall and earthquake classes and all of the examples from the micro-quake and noise classes were considered as training data. The inclusion of all examples from micro-quake reflects the uncertainty around this class, given its short durations, while for noise the wide range of possible sources means that all examples should be considered. Removing the best- and worst-performing anchors from the rockfall and earthquake classes did not affect the performance of those classes, but improved the performance of micro-quake and noise. Finally, by picking events with higher F1-scores using the same process as above, the best exemplar anchor candidate per class was chosen. This improved the performance of the model from 0.79 to 0.98 for rockfall, 0.94 to 0.98 for earthquake, 0.72 to 0.97 for micro-quake and 0.73 to 0.97 for noise.
We can therefore be confident that the above combination of optimal model weights obtained from 5-fold validation, decision threshold and the optimal anchor per class will result in the best labels when testing on unseen datasets.