We present an algorithm for detecting onsets based on higher-order statistics (HOS), specifically designed for applications to arrays of stations. Hereinafter, we will denote our algorithm as higher-order statistics for arrays (HOSA).
As the method aims to reveal P-wave arrival times, we will use only the vertical component of motion, where P-waves are more visible.
3.1. Single-Trace Stage
The single-trace stage (STS) aims to determine P-wave arrival times in the waveforms. It is based on the HOS approach introduced by [5]. The basic assumption is that the sought signal and the background noise are governed by different physical processes, resulting in distinct statistics. The method uses gradient-based HOS characteristic functions. These functions have the desirable property of vanishing when computed on time windows containing Gaussian-distributed random processes, such as windows consisting solely of noise, while taking non-zero values when a signal is encountered. The onset (hereinafter denoted as $o$) of the signal is sought among the samples where the characteristic function deviates significantly from zero.
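For each seismic trace $u$, we can define the time series $\mathrm{HOS}_n$, representing the $n$-th order relevant statistic. Indicating with $E[\cdot]$ the expectation operator, for a generic discrete random process $x$ with expected value $\mu$, $\mathrm{HOS}_n$ is given by the following:

$$\mathrm{HOS}_n(t) = \frac{\hat{E}_{w,t}\left[(x-\mu)^n\right]}{\left(\hat{E}_{w,t}\left[(x-\mu)^2\right]\right)^{n/2}} - G_n \qquad (1)$$

where $\hat{E}_{w,t}[\cdot]$ is the estimation of $E[\cdot]$ in a time window of length $w$ ending at time $t$, and $G_n$ is the value of $E[(x-\mu)^n]/\left(E[(x-\mu)^2]\right)^{n/2}$ for a Gaussian variable [31]. Thus, the estimation of $\mathrm{HOS}_n$ for Gaussian variables is zero. Explicitly, for a seismic trace $u$ we have the following:

$$\mathrm{HOS}_n(t) = \frac{\frac{1}{N}\sum_{j=1}^{N}\left(u_j-\bar{u}\right)^n}{\left[\frac{1}{N}\sum_{j=1}^{N}\left(u_j-\bar{u}\right)^2\right]^{n/2}} - G_n \qquad (2)$$

where $u_j$ ($j = 1, \dots, N$) are the samples of $u$ in the window of length $w$ ending at time $t$ and $\bar{u} = \frac{1}{N}\sum_{j=1}^{N} u_j$.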
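As a minimal sketch of the statistic described above, the following Python functions compute, for each window position, the normalized n-th central moment of the samples minus the value that moment takes for a Gaussian variable. The function names, the window length expressed in samples, and the convention of returning 0 for a perfectly flat window are illustrative assumptions and are not taken from the original algorithm.

```python
def gaussian_moment(n):
    """Normalized n-th central moment of a Gaussian: (n-1)!! for even n, 0 for odd n."""
    if n % 2:
        return 0.0
    value = 1
    for k in range(n - 1, 0, -2):
        value *= k
    return float(value)  # 3 for n = 4, 15 for n = 6


def hos_window(samples, n):
    """Normalized n-th central moment of one window, minus its Gaussian value."""
    mean = sum(samples) / len(samples)
    m2 = sum((s - mean) ** 2 for s in samples) / len(samples)
    if m2 == 0.0:
        return 0.0  # flat window: statistic undefined, set to 0 here (assumption)
    mn = sum((s - mean) ** n for s in samples) / len(samples)
    return mn / m2 ** (n / 2) - gaussian_moment(n)


def sliding_hos(trace, n, w):
    """One HOS value per window of w samples ending at index t, for t >= w - 1."""
    return [hos_window(trace[t - w + 1 : t + 1], n) for t in range(w - 1, len(trace))]
```

For n = 4 this reduces to the excess kurtosis of the samples in the window, which is why windows containing only Gaussian noise are expected to give values close to zero.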
We seek the P-arrival in a part of the trace $u$ that starts at the event origin time (OT) and has a fixed duration. To increase the signal-to-noise ratio (SNR), we first apply a band-pass filter to the signals in a fixed frequency range. Each time series is then scanned by a sliding time window of length $w$. The values of these parameters depend on the dataset and the geometry of the receivers and will be fixed later (Section 4). An HOS value is computed in each time window $w$, which slides by one sample at each step, as described in Equation (2), resulting in a new time series, $\mathrm{HOS}_n(t)$. A three-point moving average is used to smooth $\mathrm{HOS}_n(t)$ and mitigate noise fluctuations. In the following, based on previous works [31,32], we test two values of $n$: $n = 4$ and $n = 6$, which are, respectively, the 4th- and 6th-order statistics. As noted in [7], the time derivative of $\mathrm{HOS}_n(t)$ exhibits a considerably more impulsive character, making it more appropriate as a characteristic function. Consequently, derivatives are computed, generating the time series $\mathrm{HOS}'_n(t)$. Some typical waveforms, along with their $\mathrm{HOS}_n(t)$ and $\mathrm{HOS}'_n(t)$, are depicted in Figure 3.
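The last two steps of this processing chain, the three-point smoothing and the time derivative, can be sketched as follows. This is only an illustration under stated assumptions: the band-pass filtering is omitted, the end points of the series are left unsmoothed, and the derivative is approximated by a backward difference; the text does not specify these details, and the function names are hypothetical.

```python
def moving_average3(series):
    """Centered three-point moving average; the two end points are kept as they are (assumption)."""
    if len(series) < 3:
        return list(series)
    smoothed = list(series)
    for i in range(1, len(series) - 1):
        smoothed[i] = (series[i - 1] + series[i] + series[i + 1]) / 3.0
    return smoothed


def time_derivative(series, dt):
    """Backward finite difference with sampling step dt; the first value is set to 0 (assumption)."""
    return [0.0] + [(series[i] - series[i - 1]) / dt for i in range(1, len(series))]


# Usage: smooth an HOS series, then differentiate it to obtain a characteristic function.
hos_series = [0.0, 0.0, 3.0, 0.0, 0.0]
characteristic_function = time_derivative(moving_average3(hos_series), dt=0.01)
```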
How do we identify P-onsets from the characteristic functions $\mathrm{HOS}'_n$? Several studies have simply used the maximum of $\mathrm{HOS}'_n$. However, as explained in [32], this criterion can lead to inaccurate detection in the presence of spurious spikes before the actual P-onset or in the case of a very unstable $\mathrm{HOS}'_n$. This can be seen in Figure 3, where $\mathrm{HOS}_n$ and $\mathrm{HOS}'_n$ have been plotted for three representative seismograms in which the P-onset is clearly impulsive, emergent, and not clearly identifiable, respectively. In the first case, the functions $\mathrm{HOS}_n$ show a sharp increase (corresponding to the maxima of $\mathrm{HOS}'_n$) at the correct P-onset. For an emergent arrival, the functions $\mathrm{HOS}_n$ increase more smoothly, reaching smaller maxima in both $\mathrm{HOS}_n$ and $\mathrm{HOS}'_n$, and highlight an area before the maximum of $\mathrm{HOS}'_n$ where the onset is likely to be found. Remarkably, the functions $\mathrm{HOS}'_n$ show multiple relative maxima in the case of an emergent arrival. An even smoother increase in $\mathrm{HOS}_n$ and several relative maxima of $\mathrm{HOS}'_n$ appear in the third case, in which the P-onset is masked by the noise.
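To account for this variability in $\mathrm{HOS}_n$ and $\mathrm{HOS}'_n$, and thus for the uncertainty of the retrieved P-onset, HOSA employs a multi-threshold criterion and includes an acceptance stage to evaluate the reliability of arrival times. HOSA uses a set of thresholds $T_k$ ($k = 1, \dots, 10$), which are defined relative to the maximum of $\mathrm{HOS}'_n$; in other words, $T_1$ corresponds to one-tenth of the maximum of $\mathrm{HOS}'_n$, and $T_{10}$ corresponds to the value of the maximum of $\mathrm{HOS}'_n$. Starting from $\mathrm{HOS}'_n$, we obtain a set of potential onsets $\{o_k\}$ for each trace as follows. We first identify the time corresponding to the maximum of $\mathrm{HOS}'_n$:

$$M = \underset{t}{\arg\max}\ \mathrm{HOS}'_n(t) \qquad (3)$$

We then look for the potential onset $o_k$ for each $T_k$, as described in Equation (4) and at point 1.7 of Algorithm 1:

$$o_k = \min\left\{ t \in [M - 1\ \mathrm{s},\ M + 1\ \mathrm{s}] : \mathrm{HOS}'_n(t) \ge T_k \right\} \qquad (4)$$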
As indicated in Equation (4), the onsets are sought in a 2 s time window centered on $M$. This width is fixed because the characteristic functions show a dispersion around $M$ that is typically on the order of 0.3 s for emergent arrivals (see Figure 3) and is ultimately connected to the uncertainty in the P-onset. To account for this uncertainty, we seek the potential onsets in a time window centered on $M$ with a half-width of about three times this dispersion, that is, approximately ±1 s.
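The search for potential onsets can be illustrated with the short Python sketch below. It assumes ten equally spaced thresholds between one-tenth of the maximum and the maximum itself, and it takes as potential onset the first sample inside the ±1 s window around $M$ at which the characteristic function reaches each threshold. Both choices, together with the function and parameter names, are assumptions made for illustration and may differ from the actual implementation of Algorithm 1.

```python
def potential_onsets(cf, dt, n_thresholds=10, half_window=1.0):
    """Return (M, onsets) for a characteristic function cf sampled every dt seconds from t = 0.

    M is the time of the maximum of cf; onsets holds one candidate time per threshold.
    """
    i_max = max(range(len(cf)), key=lambda i: cf[i])
    peak = cf[i_max]
    if peak <= 0:
        raise ValueError("the characteristic function has no positive maximum")
    half = int(round(half_window / dt))
    start = max(0, i_max - half)
    stop = min(len(cf), i_max + half + 1)
    onsets = []
    for k in range(1, n_thresholds + 1):
        threshold = peak * (k / n_thresholds)
        # First sample of the search window at or above the k-th threshold.
        # The sample at i_max always qualifies, so the search cannot fail.
        first = next(i for i in range(start, stop) if cf[i] >= threshold)
        onsets.append(first * dt)
    return i_max * dt, onsets


# Usage: the small spike at index 2 lies outside the search window and is ignored.
cf = [0.0, 0.0, 1.5, 0.0, 2.5, 5.5, 10.0, 3.0, 0.0, 0.0]
M, onsets = potential_onsets(cf, dt=0.5)
```

In this toy example the candidate times step from 2.0 s to 3.0 s as the threshold increases, which is the kind of spread the multi-threshold criterion is meant to measure.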
The onset associated with the highest threshold intercepts, by definition, the main peak of the characteristic function. The lower thresholds identify any lower-amplitude peaks preceding the main one. We assume that time consistency among the potential onsets implies a small error in the P-onset. In other words, we assume that if the potential onset does not change with the threshold, the estimated P-onset is robust. Conversely, if the potential onsets span a wide time range, we assume that the estimated P-onset is affected by a large uncertainty. Such uncertainty is quantified as the standard deviation $\sigma$ within each set of potential onsets. The user can tune the degree of accuracy of STS by allowing only solutions with $\sigma$ lower than a threshold $\sigma_{\max}$. We will refer to this condition as the multi-threshold criterion.
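A minimal sketch of the multi-threshold criterion is given below. It assumes that the population standard deviation is used (the text does not say whether the population or the sample estimator is meant) and that acceptance requires a value strictly below the user-defined limit; the function name is hypothetical.

```python
import statistics


def multi_threshold_check(onsets, sigma_max):
    """Return (sigma, accepted) for the potential onsets of one trace, in seconds."""
    sigma = statistics.pstdev(onsets)  # population standard deviation (assumption)
    return sigma, sigma < sigma_max


# Usage: a stable set of potential onsets passes, a scattered one does not.
stable = [3.0] * 10
scattered = [2.0, 2.0, 2.5, 2.5, 2.5, 3.0, 3.0, 3.0, 3.0, 3.0]
sigma_stable, ok_stable = multi_threshold_check(stable, sigma_max=0.075)
sigma_scattered, ok_scattered = multi_threshold_check(scattered, sigma_max=0.075)
```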
Finally, from each set of potential onsets satisfying the multi-threshold criterion, one final P-onset must be identified. As noted in the previous work by [6], the optimal picking performance occurs when the actual onset matches the earliest part of the characteristic function’s increase; this corresponds to the moment when the very beginning of the sought signal enters the time window used for the HOS calculation, leading to a small increase in HOS. The value of the threshold that best captures this increase depends on the dataset and receiver setup. In this study, we identify the final onset with the first potential onset $o_1$ (related to the lowest threshold), as Section 4 indicates that this is the best choice in our case.
From each set of potential onsets, an estimate of the uncertainty on the arrival is given by its standard deviation $\sigma$. Specifically, we assign a weight to the P-onset ranging between 0 and 3 based on the ratio $\sigma/\sigma_{\max}$, as reported in Table 2.
For example, considering the 4th-order statistic in the cases displayed in Figure 3, the standard deviation of the set of potential onsets for the case in the left panel is equal to
s. In contrast, the traces in the central and right panels exhibit standard deviations of about 0.1 s and 0.2 s, respectively. The choice $\sigma_{\max}$ = 75 ms, for example, would assign a weight of 0 (the highest reliability) to the left case and reject the onsets in the other cases. Similar outcomes are obtained with the 6th-order statistic.
3.2. Multi-Channel Stage
The second part of the algorithm is a multi-channel stage (MCS), which jointly analyzes the channels of an array of receivers. This stage exploits the array geometry to enhance the reliability of the onsets that passed the multi-threshold criterion.
To formalize this process, we introduce the following notation. Suppose that a seismic event is recorded by the receivers $R_{ij}$, resulting in the traces $u_{ij}$, where the indices $i$ and $j$ run, respectively, over the arrays and over the receivers of array $i$. Let us assume that, for all or a portion of the traces, STS has assigned an onset $o_{ij}$. For each array $i$, we denote the set of onsets as $O_i$.
To evaluate the consistency of the arrivals, we first perform a clustering analysis of the onsets of each array.
In general, a clustering procedure is an unsupervised learning technique [33] used to group data points, based on some similarity criterion, into subsets or clusters that are homogeneous or well separated [34]. In this study, we employed hierarchical agglomerative clustering [35,36,37,38,39] to build groups of concordant onsets. Hierarchical agglomerative clustering begins by partitioning the dataset into single data points (the onsets in our case), each representing a cluster. A dissimilarity value is then computed for each pair of initial clusters. Subsequently, the clustering procedure iteratively merges the current pair of mutually closest clusters into a new object and computes the new dissimilarities, until only one final cluster is left. The user can stop the fusion of the objects at a certain level of dissimilarity or when a desired number of clusters is reached.
Here, we adopt the complete linkage distance [37] as a measure of dissimilarity between two fused objects (clusters) of onsets $C_a$ and $C_b$, defined as follows:

$$D(C_a, C_b) = \max_{o \in C_a,\ o' \in C_b} \left| o - o' \right| \qquad (5)$$

When clusters $C_a$ and $C_b$ are merged into a new cluster $C_{ab}$, the value $D(C_a, C_b)$ will correspond to its diameter, that is, the maximum dissimilarity (distance) between two entities within $C_{ab}$ [40,41].
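For one-dimensional data such as arrival times, the complete linkage distance and the cluster diameter are straightforward to compute. The sketch below uses the standard definition of complete linkage (the largest pairwise distance between members of the two clusters); the function names are illustrative.

```python
def complete_linkage(cluster_a, cluster_b):
    """Largest absolute time difference between an onset of cluster_a and one of cluster_b."""
    return max(abs(a - b) for a in cluster_a for b in cluster_b)


def diameter(cluster):
    """Largest absolute time difference between any two onsets of the same cluster."""
    return max(cluster) - min(cluster)


# Usage: the linkage distance of two clusters equals the diameter of their union
# whenever it is not smaller than the diameters of the two clusters themselves.
a, b = [1.0, 1.5], [2.0, 4.0]
d_ab = complete_linkage(a, b)
d_union = diameter(a + b)
```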
A critical point of hierarchical agglomerative clustering is to establish the optimal number of final clusters or, equivalently, the maximum allowed dissimilarity between fused objects [34]. Given the nature of our objects (seismic arrival times), we fixed the maximum intra-cluster distance as the largest physically possible time difference between P-wave arrival times at receivers of array $i$, which we indicate as $\Delta t_i^{\max}$. This parameter is related to the maximum spatial distance $d_i^{\max}$ between two generic receivers $R_{ij}$ and $R_{ik}$ of the same array and the P-wave propagation velocity $v_P$ in the following way:

$$\Delta t_i^{\max} = \frac{d_i^{\max}}{v_P} + \epsilon \qquad (6)$$

where $\epsilon$ is a tolerance term, which takes into account any possible uncertainty or fluctuation in the considered physical quantities.
Thus, we stopped the clustering procedure before the fusion step that would create a new cluster with a diameter larger than this maximum time difference. In this way, the clustering step returns a partition of the onsets of array $i$ into a set of clusters whose largest intra-cluster distance is less than or equal to the maximum time difference.
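The stopping rule can be sketched with a small, unoptimized agglomerative procedure in plain Python. The additive form of the tolerance in `max_time_difference`, the units, and all names and numerical values are assumptions made for illustration; a production code would more likely rely on an existing clustering library.

```python
def max_time_difference(max_distance_km, vp_km_s, tolerance_s):
    """Largest admissible arrival-time difference within an array (additive tolerance assumed)."""
    return max_distance_km / vp_km_s + tolerance_s


def cluster_onsets(onsets, max_diameter):
    """Complete-linkage agglomerative clustering of onset times, stopped at max_diameter."""
    clusters = [[o] for o in sorted(onsets)]
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest complete-linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > max_diameter:
            break  # the next fusion would exceed the allowed diameter
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(clusters)


# Usage with hypothetical onset times (in seconds) and a 0.1 s maximum diameter.
groups = cluster_onsets([10.00, 10.02, 10.05, 10.90, 10.93, 12.40], max_diameter=0.1)
```

In this example the six onsets split into three clusters, the largest of which contains the three mutually consistent arrivals around 10 s.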
Figure 4 illustrates some possible scenarios arising from the clustering procedure in a schematic way.
Once the clusters of onsets at array
i are formed, a selection of suitable onsets is performed as follows. Let
be the total number of onsets in
, which is, in general, less than or equal to 10; that is, the number of receivers for our arrays. If the onsets are all grouped in the same cluster, we accept them in the case
; otherwise, they are discarded. If more than one cluster is present, we denote with
and
the numbers of onsets present in the two most populated clusters,
and
. We accept the onsets of
(and reject the other) if all the conditions of Equation (
7) are met (
indicates the integer division).
In this case, we claim that
represents a “predominant cluster”.
If any of the conditions of Equation (7) is not met, no predominant cluster can be defined, indicating widely scattered arrivals (and, thus, large uncertainty in the solutions). In this case, all onsets of the array are rejected.