Unravelling the Relationship Between Microseisms and Spatial Distribution of Sea Wave Height by Statistical and Machine Learning Approaches

: Global warming is making extreme wave events more intense and frequent. Hence, the importance of monitoring the sea state for marine risk assessment and mitigation is increasing day-by-day. In this work, we exploit the ubiquitous seismic noise generated by energy transfer from the ocean to the solid earth (called microseisms) to infer the sea wave height data provided by hindcast maps. To this aim, we use a combined approach based on statistical analysis and machine learning. In particular, a random forest model shows very promising results in the spatial and temporal reconstruction of sea wave height by microseisms. The observed dependence of input importance from the distance sea grid cell-seismic station suggests how the reliable monitoring of the sea state in a wide area by microseisms needs data recorded by dense networks, comprising stations evenly distributed along the coastlines.


Introduction
The monitoring of sea state is a fundamental task for economic activities in coastal areas, such as transportation, tourism, and design of infrastructures, e.g., [1,2]. In particular, the importance of monitoring sea wave height for marine risk assessment and mitigation is increasing day-by-day. This is partially due to global warming that is making sea waves stronger and, hence, the extreme wave events more intense and frequent [3].
Among the different instruments used to monitor the sea state, e.g., [4], wave buoys can be currently considered the most used and reliable instruments for in situ measurements of the offshore and coastal wave climate. However, they show serious drawbacks because of the high costs of both installation and maintenance and the permissions needed for their installation [5]. Other useful tools to monitor the sea state are remote sensing instruments such as radar altimeters and synthetic aperture radar, providing data from large marine areas. Despite their good spatial coverage, they suffer from limited temporal resolution [6][7][8]. An important improvement in sea state monitoring is represented by high-frequency (HF) radar measurements, providing directional wave spectra at a high temporal resolution, with a spatial resolution that depends on both bandwidth and antenna [5,9,10].
Finally, seismic signals recorded by seismometers can also be used as a sea "wavemeter" [11]. Indeed, ocean gravity waves cause pressure fluctuations that transfer energy from the ocean to the solid earth, generating a seismic noise called "microseisms," e.g., [12]. Based on both spectral content and source mechanism, it is possible to distinguish two main types of microseism: primary and secondary. The former, also called "single-frequency" microseisms, shares the same spectral content as ocean waves (0.05-0.077 Hz; e.g., [13]) and is likely generated by interactions of ocean waves with the sloping seafloor and/or by ocean waves breaking/shoaling against the shoreline, e.g., [14]. The latter, also known as "double-frequency" microseisms, is characterized by most energy in the band 0.1-0.2 Hz, e.g., [13], shows higher amplitude than the primary microseism, and its source is associated with interactions of waves of the same frequency travelling in opposite directions, e.g., [15,16]. Finally, a short-period secondary microseism, exhibiting a frequency content between 0.2 and 0.4 Hz, e.g., [13], can also be identified. Its source is generally related to local wave-wave interactions [17,18].
On the basis of its source mechanism, microseisms have been used as a proxy of the sea state. For instance, it has been used to reconstruct the spatial and temporal distribution of sea ice around Antarctica [19]; but, above all, the relationship between microseism amplitude and sea wave height has been explored. For instance, [20] made use of seismic and buoy data acquired in California to empirically derive site-specific seismic-to-wave transfer functions, allowing reliable estimations of ocean wave height to be obtained from broadband seismometer data. [21] presented the first numerical model of seismic noise, allowing all the classes of sea states to be reproduced, which give rise to significant seismic noise levels. By comparing seismic data and sea wave height recorded by buoys near the Ligurian coast (Italy), [22,23] calibrated empirical equations allowing microseism energy to be turned into sea wave height.
In this paper, by both statistical and machine learning approaches, we will explore the relationship between microseism amplitude, recorded by seismic stations located in Eastern Sicily (Italy), and the sea wave height in the Ionian Sea, Tyrrhenian Sea, and Sicily Channel. Following the idea by [2], the latter data have been provided by hindcast maps, which show several advantages with respect to other measured data, such as high spatial resolution, widespread coverage on large areas, and temporal continuity of the information (hindcast maps are always available, as they are not affected by instrument breakage). In particular, we will try to answer the question of whether, and to what extent, it is possible to reconstruct the spatial distribution of sea wave height by using microseisms recorded at distinct stations and in different frequency bands.

Data
To analyze microseisms, seismic signals recorded by the vertical component of six stations, located close to the Eastern Sicilian coast and belonging to the seismic permanent network run by Istituto Nazionale di Geofisica e Vulcanologia, Osservatorio Etneo, Sezione di Catania (INGV-OE), were used ( Figure 1 and Table 1). The seismometers equipping these stations are broadband threecomponent Trillium 40 s (Nanometrics TM ). The digitizers, acquiring at a sampling rate of 100 Hz, are Trident (Nanometrics TM ).
In particular, the time interval 2010-2017, offering a very good data continuity in most of the considered stations, was chosen.  As for the sea data, the "MEDSEA_HINDCAST_WAV_006_012" product, shared by the Copernicus Marine Environment Monitoring Service (CMEMS) http://marine.copernicus.eu/, was used. Such a product is the hindcast product of the Mediterranean Sea Waves forecasting system, based on the third-generation wave model WAM Cycle 4.5.4 [24]. In particular, the hourly significant wave height data of a portion of the Mediterranean Sea (shown in Figure 1) at 1/24° horizontal resolution was extracted and used to make comparisons to the microseism. To speed up the computation, one grid cell every two in both the east-west and north-south directions was taken into account for the analyses performed in this work. The significant wave height is defined in terms of moments of the omnidirectional spectrum, S(f), where f is the frequency. The nth moment of the spectrum is defined as Wave height is defined as follows: It is demonstrated that Hs is approximately equal to the average of the highest one-third of the waves.

Spectral and Amplitude Analysis
To characterize the spectral content of the seismic signals recorded by the vertical component of the six considered stations, the short-time Fourier transform was carried out as follows: (i) The signal was divided into 81.92 s long non-overlapping windows and a spectrum was computed for each window; (ii) a daily spectrum was calculated by applying Welch's overlapped segment averaging estimator [25] on all the spectra of the same day; (iii) finally, all the daily spectra were gathered and visualized as a spectrogram (Figure 2a-f). In addition, a single spectrum was computed for each station as the median of all the daily spectra (Figure 2g-l). In particular, the band (i) corresponds to the primary microseism, (ii) to the secondary microseism, (iii) to the shortperiod secondary microseism, (iv) has a frequency lower than the primary microseism, (v) is characterized by a frequency higher than the short-period secondary microseism, and, finally, (vi) comprises primary, secondary, and short-period secondary microseisms. Bands (iv) and (v) have been introduced to verify the presence of sea dynamics-related seismic signals at frequencies different from the classical primary, secondary, and short-period secondary microseism bands. In Figure 3, the variation over time of the seismic RMS amplitude in band (vi) is shown.

Correlation Analysis
Following the idea of several authors, e.g., [26][27][28][29], maps showing the spatial variability of correlation coefficients, computed between the time series of seismic RMS amplitudes and the time series of significant wave height, obtained in each grid cell of the hindcast maps, were calculated. As such time series should correlate well at or near the source region of the microseism [28], this analysis provides an idea of the location of the main sources of microseisms recorded by the considered seismic stations.
Following [28,19], in place of using the more common Pearson correlation coefficient, we made use of the Spearman correlation coefficient. The former should be applied to explore the linear dependence between normally distributed parameters. As both significant wave height and seismic RMS amplitude do not have a normal distribution, as well as their relationship is likely nonlinear [27,28], the Spearman correlation coefficient was preferred. In addition, unlike the Pearson correlation coefficient, the Spearman correlation coefficient is less affected by outliers. The Spearman correlation coefficient, defined as a nonparametric measure of rank correlation, can be computed as follows [28]: where di is the difference between the ranks of the two parameters and n is the number of observations. As both seismic RMS amplitudes and significant wave heights are characterized by a sampling period of 1 h (much longer than any physically possible delays between the two time series), a simple zero-lag correlation was computed. Each considered time series of seismic RMS amplitude and significant wave height has a duration of 8 years and an hourly sampling period, thus totaling to ~70,000 samples per series. As for the spatial resolution, 101 × 251 grid cells belonging to the hindcast maps (corresponding to 25,351 correlation coefficients computed for each station and each frequency band) were taken into account.
The correlation maps between significant wave height and seismic RMS amplitudes, computed at distinct seismic stations and different frequency bands, are shown in Figure 4. Furthermore, in order to highlight relationships between the correlation coefficient and distance from the station recording the microseism to the sea grid cell providing the significant wave height data, cross-plots showing the distance in the x-axis and the correlation coefficient in the y-axis were obtained ( Figure  5).

Machine Learning
In order to explore the possibility of predicting the spatial distribution of sea wave height by microseisms, the random forest (RF) technique was applied. RF is a supervised machine learning technique, capable of both classification and regression, making use of an ensemble of decision trees to allow for superior performance and lower sensitivity to over-fitting compared to single classifiers [30,31]. RF has had several applications in geoscience, such as data-driven modeling of mineral prospectivity [32], lithological classification of underexplored areas by geophysical and remote sensing data [33], and atmospheric chemistry transport models [34].
We tried to link the microseisms recorded by different seismic stations in distinct frequency bands to the spatial distribution of significant wave height. In particular, we trained a regression model by using the seismic RMS amplitude time series as an input and the temporally corresponding hindcast maps of significant wave height as the output. To avoid overfitting problems in building the model, we applied the following method divided into three main steps: (a) Training; (b) validation; (c) testing.
First, to evaluate the capability of generalization of the final model [35], we deleted from the 8 year long time series (2010-2017) two consecutive months of data (November-December 2017), reserving them for the final test. Then, by using the remaining data, a k-fold cross-validation technique was applied to evaluate the spatial variability of the model predicting capability. Such a technique consists of [31]: 1. Dividing the original input dataset into two complementary subsets; 2. building a model on one subset (called the "training set") (step a); 3. validating the model performance on the other subset ("validation set") (step b); 4. repeating (i-iii) "k" times, selecting different time intervals as the training set and validation set each time.
In the applied k-fold cross-validation, the input dataset was partitioned into 10 distinct and complementary pairs of training and validation sets (k = 10). The parameters computed to evaluate the predicting capability of the model were: (i) The mean absolute error (MAE) between the hindcast significant sea wave height and the predicted one; (ii) the mean absolute percentage error (MAPE), given by the MAE divided by the mean value of sea wave height in the considered sea grid cell, multiplied by 100. The maps shown in Figure 6 exhibit the space distribution of the average values of MAE and MAPE, obtained by the 10-fold cross-validation technique, together with the space distribution of the mean significant wave height.  Figure  7 are referred to.
The final model was trained with the whole dataset January 2010-October 2017 and tested on the test set November-December 2017 (step (c) "testing"). In particular, comparisons between hindcast maps and predicted maps were carried out (three days are reported in Figure 8 as examples). Moreover, the cross-plots of hindcast significant wave height versus the predicted significant wave height for three grid cells located nearby the coastlines of the Sicily Channel (Figure 7a), Ionian Sea (Figure 7b), and Tyrrhenian Sea (Figure 7c) (see Figure 6 for the locations of the three grid cells) were also obtained.  Finally, the RF technique also provides an index of input importance (indicated in %), which is shown by the histograms of Figure 9a-c. Aggregation through summation of the input importance was also carried out to rank the microseism bands (Figure 9d-f) and the seismic station (Figure 9g-i) for significant wave height reconstruction purposes. Figure 9. (a-c) Index of importance of all the inputs taken into account for the three grid cells labelled "a," "b," and "c" in Figure 6, respectively. (d-f) Aggregation through summation of the input importance allowing the ranking of the frequency bands for three grid cells labelled "a," "b," and "c" in Figure 6, respectively. (g-i) Aggregation through summation of the input importance allowing the ranking of seismic stations for three grid cells labelled "a," "b," and "c" in Figure 6, respectively.

Results
Concerning the spectral analysis, it is worth noting that in most of the considered stations (EPOZ is the only exception), it is possible to clearly distinguish three peaks in the spectra, corresponding to primary, secondary, and short-period secondary microseisms (PM, SM, and SPSM, respectively, in Figure 2g-l). Furthermore, as observed in the seismic recordings around the world, e.g., [36], the spectral analysis of the microseism acquired in Eastern Sicily shows that most of the energy is recorded in the secondary and short-period secondary microseism bands, while the primary microseism exhibits a much weaker spectral amplitude ( Figure 2). In particular, the short-period secondary microseism peak is the strongest among the three in all the considered stations (Figure 2gl). With the exception of AIO station, high values of spectral amplitude can also be observed at frequencies above the short-period secondary microseism band (>0.4 Hz). In addition, in both spectrograms and RMS amplitude time series, it is possible to observe a seasonal modulation with maxima during the winters and minima during the summers (Figures 2a-f and 3).
As for the correlation maps, variable Spearman correlation values were obtained at the different stations and frequency bands. In particular, in agreement with [29], the highest correlation values (up to 0.8) were observed in the short-period secondary microseism band (0.2-0.4 Hz), and successively (up to 0.7) in the highest considered frequency band (0.4-0.8 Hz) (Figures 4 and 5). The primary microseism band (0.05-0.077 Hz), as well as the lower frequency band (0.025-0.05 Hz), show very low correlation values (up to 0.3). Intermediate correlation values were computed in the secondary microseism band (0.1-0.2 Hz). Finally, high values of correlation are also observed in the band comprising primary, secondary, and short-period secondary microseisms (0.05-0.4 Hz). In addition, Figure 5 clearly shows a decrease in the correlation values with increasing distance from the station recording the microseism to the sea grid cell providing the significant wave height data.
Furthermore, it is evident how the spatial distribution of correlation is affected by the station location ( Figure 4). Indeed, the correlation maps, obtained by the microseism recorded by the northernmost station (MSRU), exhibit the highest correlation values in the Tyrrhenian Sea. On the other hand, the southernmost stations (HLNI and HAGA) show the highest correlation values in both the Ionian Sea and Sicily Channel.
Regarding the machine learning analysis, as suggested by the spatial distribution of MAE and MAPE (Figure 6), by the cross-plots of hindcast significant wave height versus the predicted significant wave height (Figure 7) and by the comparisons between hindcast maps and predicted maps (Figure 8), promising results have been obtained. In particular, it is evident how the lowest values of MAE and MAPE were obtained in the Mediterranean portion closest to Sicily and then to the seismic stations providing the microseism information. In particular, the areas with low values of MAE (down to ~0.1 m) mainly coincide with the coastlines of Sicily (especially the Ionian and Tyrrhenian coastlines), as well as with the southernmost part of the Tyrrhenian Sea (Figure 6a), which is also characterized by very low mean values of significant wave height (Figure 6c). On the other hand, the areas with the lowest MAPE (minimum values equal to ~25%) comprise the southernmost part of the Tyrrhenian Sea, a part of the Ionian Sea and Sicily channel (Figure 6b).
The estimated aggregated importance of the different frequency bands in the three considered points nearby the Sicilian coastlines (see black dots with labels "a," "b," and "c" in Figure 6) reflects the results obtained by the correlation analysis (Figure 9d-f). Indeed, the frequency band with the highest importance in the sea state prediction is the one characterized by the highest Spearman correlation coefficient, that is, the short-period secondary microseism band (0.2-0.4 Hz; Figure 9d-f). Furthermore, the values of aggregated importance of the stations generally exhibit a dependence on the distance of the considered sea point from the station providing microseism data (Figure 9g-i). This is especially evident for points "b" and "c." As for the former, the stations with the highest importance are those closest to such a point, HAGA, EPOZ, and HLNI (Figure 9h). Regarding point "c," the station showing the highest importance is that closest to the Tyrrhenian coastline, MSRU (Figure 9i).

Discussion
Spectra of the microseism recorded by stations installed in Eastern Sicily show very high amplitudes of the short-period secondary microseism (Figure 2). This can be due to the fact that the considered stations are very close to the sea coastlines (maximum distance equal to ~19 km for HLNI) and, hence, record well the short-period secondary microseism, whose sources are linked to local sea state and wave activity, and influenced by local winds, e.g., [17,18]. According to [17], the clear split between the spectral peaks of short-period secondary microseisms and secondary microseisms, observed in most of the computed spectra (Figure 2g-l), is a common feature for stations located nearby the shoreline.
Spectral analysis also shows relatively high amplitudes at frequencies above the short-period secondary microseism band (Figure 2g-l). As indicated by the high correlation values with the sea wave height (Figures 4 and 5), such high seismic amplitudes at high frequencies (>0.4 Hz) can be related to energy transfer from the hydrosphere/atmosphere to the solid earth. According to previous studies, e.g., [37,38], such a high-frequency microseism could be related to wind-driven ocean waves breaking on the shoreline. In any case, we cannot exclude in this band (>0.4 Hz) small contributions from anthropogenic seismic noise, as well as from the continuous volcanic tremors recorded at the seismic stations located on or nearby the Etna volcano edifice (such as EPOZ and EFIU) and characterized by a frequency band 0.5-5.0 Hz [39].
Seasonal modulation, identified in both spectrograms and seismic RMS amplitude time series (Figures 2a-f and 3), has been observed in all the areas worldwide with temperate latitudes, e.g., [40,41], and is due to the more efficient energy transfer from the sea to the solid earth in winters, when the seas are stormier. Areas located close to the Equator, as well as regions at very high latitudes, do not show such a pattern: In the former, noise amplitude is mostly stable over the year, e.g., [41], while in the latter, the microseism seasonal modulation is different, as during the winters the oceanic waves cannot efficiently excite seismic energy because of the sea ice, e.g., [42,19].
By gathering all the information provided by the correlation maps (Figures 4 and 5), it can be inferred that the microseisms recorded by the considered stations are mostly generated by sources located close to the stations, and then to the Eastern Sicilian coastlines. This was expected for the short-period secondary and partially for primary microseisms. Indeed, the former is characterized by high frequencies and then by quick attenuation with distance, e.g., [17]. As for the latter, shallow water conditions are necessary for primary microseism generation: Focusing on primary microseisms at the low frequency of 0.05 Hz and following the Airy linear wave theory, most primary microseism generations should take place at water depths less than ~150 m, e.g., [17,12].
However, as also stated by [43], it must be underlined that the correlation analysis (i) shows only the dominant source areas over the entire analyzed time period 2010-2017, (ii) has no temporal resolution, and (iii) is characterized by a limited space resolution of 1/24°. Hence, we cannot rule out the possibility that part of the recorded microseism is generated in open sea far away from the Sicilian coasts. This could be the case for part of the secondary microseism, whose source can partially be associated with wave-wave interaction mechanisms in the deep ocean, e.g., [44,45], assuming that losses from both scattering and transfer to the solid earth are not too effective [46]. In addition, [47] have recently shown that primary microseisms at low frequency (~0.05 Hz) can have source regions located nearby coastlines thousands of kilometers away from the seismic stations.
Machine learning analysis shows promising results in the spatial reconstruction of sea state in terms of significant wave height by microseism. Indeed, low values of both MAE (down to ~0.1 m) and MAPE (down to ~25%) were computed in a portion of the Mediterranean Sea close to Sicily and then to the seismic stations providing microseism information (Figure 6a,b). This finding, together with the observed dependence of both Spearman's correlation coefficient ( Figure 5) and input importance (Figure 9h,i) from the distance sea grid cell-seismic station, suggests that a reliable reconstruction of the sea state in a wide area needs microseism data recorded by dense networks, comprising stations evenly distributed along the coastlines. By doing so, microseisms can become a very useful tool for significant wave height monitoring, even as effective as other instruments routinely used for that purpose. Indeed, seismic stations have lower costs of both installation and maintenance with respect to wave buoys. In addition, seismic signals are acquired continuously with a sampling rate from tens to hundreds of Hz, and then have a much higher temporal resolution than radar altimeters and synthetic aperture radar. As for the spatial resolution, it depends on the number of the stations, installed nearby the coastlines.
Finally, a further advantage of the seismic signals is represented by the fact that, in most cases, it is not necessary to install a seismic network specifically for microseism monitoring, but it is possible to make use of the broadband stations today installed worldwide to monitor seismic and volcanic activities.
In the future, effort will be taken in the reconstruction of wave directional information, which is essential in the overall description of the sea state.

Conclusions
We analyzed the microseism recorded by seismic stations, located close to the Eastern Sicily coastlines, and quantitatively compared it to significant wave height data, provided by the hindcast maps, by using both statistical and machine learning approaches. These are the main findings: • Short-period secondary microseisms showed the highest amplitude values among the considered frequency bands. In addition, high seismic amplitudes were observed at frequencies above the short-period secondary microseism band (0.4-0.8 Hz), likely related to energy transfer from the hydrosphere/atmosphere to the solid earth. • Seasonal modulations of microseism amplitudes, due to the more efficient energy transfer from the sea to the solid earth in winters, when the seas are stormier, were identified. • Correlation analysis suggests how the microseism recorded by the considered stations is mostly generated by sources located close to the stations, and then to the Eastern Sicilian coastlines.

•
Machine learning analysis shows promising results in the spatial reconstruction of sea state in terms of significant wave height by microseism data. In the future, a reliable reconstruction of the sea state in wide areas will need microseism data recorded by dense networks, comprising stations evenly distributed along the coastlines. Funding: This research was funded by Programma Nazionale di Ricerca in Antartide, grant number PNRA14_00011, called ICE-VOLC project ("MultiparametrIC Experiment at antarctica VOLCanoes: data from volcano and cryosphere-ocean-atmosphere dynamics", www.icevolc-project.com), and by Istituto Nazionale di Geofisica e Vulcanologia, Osservatorio Etneo -Sezione di Catania. S.M. has been partially supported by funds from FESR Interreg Italia-Malta, SIMIT-Tharsy Project.