Article

Enhancing Ocean Monitoring for Coastal Communities Using AI

by Erika Spiteri Bailey 1, Kristian Guillaumier 1,* and Adam Gauci 2,*
1 Department of Artificial Intelligence, Faculty of Information & Communication Technology, University of Malta, 2080 Msida, Malta
2 Department of Geosciences, Faculty of Science, University of Malta, 2080 Msida, Malta
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10490; https://doi.org/10.3390/app151910490
Submission received: 30 July 2025 / Revised: 12 September 2025 / Accepted: 14 September 2025 / Published: 28 September 2025
(This article belongs to the Special Issue Transportation and Infrastructures Under Extreme Weather Conditions)

Abstract

Coastal communities and marine ecosystems face increasing risks due to changing ocean conditions, yet effective wave monitoring remains limited in many low-resource regions. This study investigates the use of seismic data to predict significant wave height (SWH), offering a low-cost and scalable solution to support coastal conservation and safety. We developed a baseline machine learning (ML) model and improved it using a longest-stretch algorithm for seismic data selection and station-specific hyperparameter tuning. Models were trained and tested on consumer-grade hardware to ensure accessibility and availability. Applied to the Sicily–Malta region, the enhanced models achieved up to a 0.133 increase in R2 and a 0.026 m reduction in mean absolute error compared to existing baselines. These results demonstrate that seismic signals, typically collected for geophysical purposes, can be repurposed to support ocean monitoring using accessible artificial intelligence (AI) tools. The approach may be integrated into conservation planning efforts such as early warning systems and ecosystem monitoring frameworks. Future work may focus on improving robustness in data-sparse areas through augmentation techniques and exploring broader applications of this method in marine and coastal sustainability contexts.

1. Introduction

Marine and coastal ecosystems are central to the sustainability of over three billion people worldwide, impacting their livelihoods, food security, and safety [1,2]. These regions support rich biodiversity and serve as buffers against natural hazards, yet they are increasingly threatened by climate change, rising sea levels, and extreme weather events. Accurate and timely knowledge of sea conditions, particularly SWH, is essential for informed conservation planning, marine spatial governance, and coastal risk mitigation. However, real-time ocean monitoring systems remain limited, particularly in low-resource settings where high-cost instrumentation and data infrastructure are not viable.
Conventional wave monitoring approaches, such as ocean buoys and weather satellites, face logistical, financial, and technical challenges. Devices deployed at sea are often dislodged and set adrift, contributing to marine pollution. This debris, including discarded mooring lines, poses a serious threat to marine life; for example, sea turtles can become entangled, leading to injury or death [3]. Moreover, buoys are susceptible to damage from marine life, vessel collisions, and extreme weather, while satellite-based methods are constrained by fuel limitations and the growing issue of orbital debris, or ‘space junk’. These limitations hinder the development of sustainable and scalable ocean observation systems.
Ocean waves, predominantly driven by weather systems, can induce ground motion when they reach coastlines. These motions generate continuous low-frequency seismic signals, known as microseisms. Despite being historically considered noise, microseisms are now a valuable data source for studying oceanographic and geophysical processes [4,5]. Microseisms mainly occur in two frequency bands: primary (0.05–0.1 Hz) and secondary (0.1–0.5 Hz) [6,7]. The lower-frequency microseisms result from direct pressure on the ocean floor, while secondary ones stem from wave–wave interactions. These signals correlate with ocean wave energy, typically measured through SWH, derived from spectral wave data [8].
Recent research has explored alternative proxies for wave monitoring, including seismic signals generated by ocean wave activity [6,7,9,10]. These signals, particularly microseisms, are detectable by seismometers located on land and therefore offer a low-cost, low-maintenance means of continuous data acquisition. While studies have confirmed a correlation between microseismic amplitude and SWH [6], few have systematically applied AI to model this relationship in a way that supports environmental monitoring and conservation outcomes. Moreover, existing AI-based approaches often rely on large, spatially diverse datasets that overlook local variability in seismic–oceanic interactions. Many also depend on extensive interpolation to address data gaps that sometimes span hundreds of days. The impact of such extensive gaps on model accuracy and ecological relevance remains unclear.
In response to these limitations, this study investigates whether seismic signals can reliably predict SWH using accessible, regionally tuned AI models. Focusing on the Sicily–Malta region, we developed a reproducible baseline and improved upon it using efficient algorithms and station-specific tuning strategies. Models were trained on consumer-grade hardware with minimal preprocessing, promoting equitable access to ocean monitoring tools. Our findings show that low-frequency seismic amplitude can serve as a dependable proxy for SWH, enabling the development of lightweight and cost-effective systems for real-time wave estimation. The study demonstrates the potential for scalable, AI-driven solutions in marine sensing, with implications for coastal conservation, risk assessment, and sustainable resource management.

1.1. Main Contributions

Building upon the work of Minio et al. [10], this study makes several important advances, including the following:
  • Individual models are built per seismic station rather than a single one covering the entire region of interest. This approach allows fitting models using local and more relevant data and allows for hyperparameters to be tuned on a per-station basis.
  • Building separate models allows for better identification of data capture problems, such as instrument calibration issues, at individual stations.
  • Training and inference require considerably fewer computational resources. Compared to previous approaches, which used multi-CPU and multi-GPU workstations, this system runs comfortably on consumer-grade hardware.
  • A ‘longest stretch’ method was used, filtering the data to minimise dependencies on missing values and the subsequent need for complex gap-filling or interpolation. In contrast to previous methods, which interpolated missing data for up to 5000 data points per feature, this allows for model training to rely primarily on real-world observations; specifically, using this method, less than 1% of the training data required any imputation.
  • We show that a noise threshold is not necessary with correct algorithm selection and hyperparameter tuning for each seismic station.
  • Feature importance is studied and the characteristics of better-performing stations (in terms of model performance) are identified. This contributes to a better understanding of data quality and of the relationship between seismic signals and SWH.
Overall, these contributions result in an improvement over the current state-of-the-art in terms of data processing, computational requirements, and predictive performance.

1.2. Literature Review

This section provides an overview of the current technologies used for estimating wave parameters, covering both AI-based methods and other approaches, and identifies areas for further improvement.

1.2.1. Numerical Methods

Several traditional numerical approaches have been used to estimate wave parameters from seismic data, offering valuable insights and benchmarks. Ferretti et al. [6] investigated the relationship between the microseism and SWH during a major storm event in the Ligurian Sea. Their approach involved pre-processing seismic signals, converting them to the frequency domain using Fourier transforms, and establishing a statistical relationship between microseism power spectral density (PSD) and SWH. A Markov chain Monte Carlo (MCMC) method was used to estimate parameters in an empirical model. Their refined model achieved a high cross-correlation (93%) with observed wave heights, though errors up to 1.75 m were noted in extreme cases. While not AI-based, their method demonstrates the feasibility of inferring wave parameters from land-based seismic data and offers valuable baseline metrics.
In a more recent study, Borzi et al. [7] examined seismic signatures during Medicane Helios, a 2023 Mediterranean cyclone. Using spectral and correlation analysis across over 100 seismic stations, they established links between microseism signal characteristics and wave field variations, supported by satellite and radar observations. Their spatial analysis confirmed that higher frequency seismic bands exhibited stronger correlations with SWH, consistent with Ferretti’s earlier findings.
These studies show that wave–seismic relationships can be reliably quantified using signal processing and statistical modelling. However, numerical methods can be complex, computationally intensive, and site-specific, motivating the exploration of more scalable, generalisable AI-based alternatives.

1.2.2. Artificial Intelligence Methods

Early research linking ocean microseisms to SWH laid the foundational groundwork for ML-based models in ocean state monitoring. Cannata et al. [9] were among the first to explore this relationship using a random forest (RF) regression model trained on the root mean square (RMS) amplitude of seismic signals, paired with hindcast SWH maps as targets. Their use of k-fold cross-validation ensured a measure of generalisability, and their results were strong, with mean absolute error (MAE) values as low as 0.1 m along the Sicilian coast. This suggests strong local correlations between seismic activity and sea state.
Building upon this, Minio et al. [10] significantly extended the scope of AI-based ocean monitoring by training three supervised models: RF, k-nearest neighbours (KNN), and light gradient boosting (LGB). These models were trained on four years of seismic and oceanographic data (2018–2021). Unlike earlier efforts, their work aimed to construct a comprehensive and scalable solution, leveraging publicly available seismic data from the European Integrated Data Archive (EIDA) [11] and sea state data from Copernicus Marine Environment Monitoring Service (CMEMS) [12]. Notably, they incorporated an earthquake catalogue to exclude periods influenced by tectonic events, ensuring microseismic origins were predominantly oceanic.
Their seismic dataset comprised 14 coastal stations, each recording in three directions (vertical, North-South, East-West), resulting in 588 features (14 stations × 14 frequency bands × 3 components). The region of interest is depicted in Figure 1. Pre-processing steps included linear interpolation for missing data, Box–Cox transformation for skewed distributions, and min–max normalisation, all aimed at enhancing model compatibility. Linear interpolation was carried out on features having up to 5000 missing data points. While ensemble models such as RF and LGB are generally robust to skewed data, the application of the Box–Cox transformation may offer limited added value in this context and could introduce unnecessary computational overhead [13]. While the interpolation threshold applied by the authors helps maintain data continuity, the literature suggests that gap-filling in seismic datasets is a complex task that often benefits from specialised methods [14,15].
Given the temporal autocorrelation in seismic signals, naive random splits can potentially lead to data leakage between training and testing sets. Minio et al. [10] addressed this by applying temporal chunking prior to random shuffling to reduce such risks. In their study, RF was found to perform the best (R2 = 0.89; MAE = 0.21 ± 0.23 m). Such models are known to be resilient to noise, have a low sensitivity to hyperparameter tuning, and have the capability to model non-linear interactions; these properties likely contributed to the superior performance [16].
In a recent study, Baranbooei et al. [17] investigated the link between secondary microseisms and SWH near the Irish coast. Using data from a single buoy and five seismic stations, they applied a methodology similar to that of Minio et al. [10], including signal filtering, seismic event exclusion, and microseism amplitude computation.
One distinction in their approach was the reliance on a single buoy for sea state data. While this setup offers practical advantages, the spatial separation between the buoy and seismic stations may affect the reliability of the data, especially due to local variations in bathymetry and seismic wave propagation [18].
Their study used approximately four years of valid data and trained artificial neural networks with five hidden layers, employing Bayesian regularisation to help mitigate overfitting. Two models, one using buoy-measured SWH (Scenario 1) and the other using wave model hindcast data (Scenario 2), were assessed. Results indicated slightly better performance for the buoy-based model, especially for wave heights below 10 m, though generalisation may still be influenced by region-specific geophysical factors.

1.2.3. Summary of Literature Gaps

Table 1 shows the results obtained from past approaches. A direct one-to-one comparison between methods is not possible, partly because, with the exception of Minio et al. [10], the underlying code was not made available, preventing a consistent evaluation of implementation details. These results are therefore presented primarily to provide context.
Existing research has reported encouraging R2 values above 0.8 and errors below 0.7 m. However, some limitations remain. These include the absence of standardised benchmarks, varied preprocessing approaches, limited explanations for certain data transformations, and relatively little attention to spatial variability around seismic stations. Additionally, gap-filling methods are not always clearly described, and AI-based approaches, while promising, are still in early stages of development and assessment. These observations suggest an opportunity to further strengthen the field through more consistent methodologies and comprehensive evaluation frameworks.

1.3. Aims and Objectives

The aim of this study is to investigate the relationship between lower-frequency seismic amplitude and SWH, with a particular focus on the coastal regions of Sicily and Malta. The central objective is to establish a foundational baseline for future research in this domain. This study presents a baseline based on the work of Minio et al. [10], against which the performance of a set of models with an improved methodology is compared using a diverse set of evaluation metrics, offering evidence that complexity does not always equate to performance in this context. Pipeline efficiency was improved through methodological clarity and the use of minimal synthetic data, foregoing more elaborate gap-filling techniques in favour of practical simplicity. Only seismic stations with sufficient data coverage—at least one full year of data—were included, ensuring the models were trained on data that captures seasonal variability. The following objectives have been addressed:
  • Recreate and evaluate the work of Minio et al. [10] to establish the relationship between seismic RMS amplitude and SWH, using comprehensive evaluation metrics for fair comparison.
  • Develop a cost-effective modelling approach, deployable on consumer-grade hardware, promoting accessibility in resource-constrained settings and supporting ethical AI practices.
  • Design an efficient and deployable data pipeline that minimises preprocessing to ensure practical real-time inference with low system complexity.
  • Apply location-specific hyperparameter tuning to optimise model performance across varying environmental and geographical conditions.
  • Prioritise high-integrity, real-world data over interpolated or gap-filled datasets to improve model reliability and generalisability.
This study contributes a reproducible baseline for predicting wave height from seismic data, supporting future research and applications. It shows that effective models can be trained on consumer-grade hardware, aiding deployment in low-resource settings. An algorithm that selects long, continuous data stretches was developed, improving model reliability. Tailored tuning led to an improvement of up to 0.2566 in R2 over baseline RF models. Moreover, error analysis and feature importance evaluation highlighted data issues impacting performance.

2. Materials and Methods

Similarly to what was described by Minio et al. [10], seismic and sea state data from January 2018 to December 2021 were collected from the European Integrated Data Archive (EIDA) Seismic Network ‘IV’ [11] and the Copernicus Marine Environment Monitoring Service (CMEMS) MEDSEA_MULTIYEAR_WAV_006_012 [12], respectively. Additional seismic records for Malta’s WDD and MSDA stations were obtained via the University of Malta. These Maltese stations were included to extend the geographical scope and contextual relevance of the study. Seismic data was sourced through the INGV network on EIDA [11], while sea state data came from CMEMS [12]. Fourteen seismic stations in total were considered, spanning Sicily, Pantelleria, and Malta. The region of interest is depicted in Figure 2.
Preprocessing begins with detrending each seismic signal via mean and linear trend removal, followed by bandpass filtering into 13 frequency bands based on Minio et al.’s implementation [10]. The signals are scaled by station sensitivity and converted to hourly RMS using a more computationally efficient sliding-window method in NumPy (https://numpy.org/), replacing the original ObsPy-based approach (https://docs.obspy.org/). The sea state coverage is also improved by expanding the region of interest, ensuring that no relevant oceanic grid points are missed, particularly north of Sicily.
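The hourly RMS computation can be sketched as follows. This is an illustrative reimplementation with our own function and variable names, assuming a detrended, bandpass-filtered, sensitivity-corrected trace and non-overlapping hourly windows; it is not the exact routine used in the study.

```python
import numpy as np

def hourly_rms(trace: np.ndarray, sampling_rate: float) -> np.ndarray:
    """Collapse a 1-D filtered seismic trace into hourly RMS amplitudes.

    trace         : detrended, bandpass-filtered, sensitivity-corrected samples
    sampling_rate : samples per second (e.g., 100.0 for a 100 Hz channel)
    """
    samples_per_hour = int(sampling_rate * 3600)
    n_hours = len(trace) // samples_per_hour          # drop any trailing partial hour
    trimmed = trace[: n_hours * samples_per_hour]
    windows = trimmed.reshape(n_hours, samples_per_hour)
    return np.sqrt(np.mean(windows ** 2, axis=1))     # one RMS value per hour

# Example: one day of a synthetic 100 Hz signal -> 24 hourly RMS values
rng = np.random.default_rng(0)
synthetic = rng.normal(scale=1e-7, size=100 * 3600 * 24)
print(hourly_rms(synthetic, 100.0).shape)             # (24,)
```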
To avoid the information loss of using distant and poorly correlated sea state data, each seismic station is paired with its five nearest sea grid points, highlighted in Figure 2. This pairing is established using distance calculations within a narrow radius of each station. To identify the longest continuous stretch of data for each station, a dynamic data selection algorithm was developed. The core idea is to traverse the time series of each station and locate the segment with the maximum length of consecutive non-missing (non-null) data, while tolerating occasional small gaps, which are eventually filled via linear interpolation. A variable threshold is used, representing the maximum number of consecutive null values that are to be tolerated within an otherwise continuous segment. For instance, a threshold of four allows up to four consecutive null values to be treated as part of a continuous valid segment. The algorithm operates as follows (a Python sketch is provided after these steps) and is shown as a flowchart in Figure 3:
  • Initialisation: Counters are initialised to track the length of the current valid segment, the length of the longest segment, the current number of consecutive null values, and the indices marking the start and end of the segment. A flag is also set to indicate when the first valid data point has been encountered.
  • For each data point:
    • If the value is not a null value, the current segment length is incremented. If this is the first non-null value encountered, the start of a new segment is recorded, and the flag is raised.
    • If the value is a null value, the null counter is incremented.
  • Threshold check: If the value is null after the first valid value has been found, the algorithm checks whether the number of consecutive null values exceeds the pre-defined threshold, and subsequently:
    • If within threshold, the algorithm continues, treating the missing data as part of the ongoing segment.
    • If the threshold is exceeded, the algorithm compares the current segment length with the longest valid segment recorded so far. If the current segment is longer, it updates the stored start and end indices for the longest segment. In either scenario, the current segment counters are reset.
  • Final comparison: After iterating through the entire dataset, the final check confirms whether the end of the longest segment is the end of the dataset, recording it accordingly.
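A minimal Python sketch of the longest-stretch selection is given below. The function name, the pandas-based data layout, and the usage lines are our own illustrative assumptions; the study’s actual implementation is available in the repository linked in Section 2.5.

```python
import pandas as pd

def longest_stretch(series: pd.Series, threshold: int = 4) -> tuple[int, int]:
    """Return (start, end) positional indices of the longest run of valid data,
    tolerating up to `threshold` consecutive nulls inside the run."""
    best_len, best_start, best_end = 0, 0, 0
    cur_len, cur_start, null_run = 0, 0, 0
    started = False                        # becomes True at the first valid sample

    for i, value in enumerate(series):
        if pd.notna(value):
            if not started:                # first valid point opens a new segment
                started, cur_start = True, i
            null_run = 0
            cur_len = i - cur_start + 1
        elif started:
            null_run += 1
            if null_run > threshold:       # gap too long: close the current segment
                seg_len = i - null_run - cur_start + 1
                if seg_len > best_len:
                    best_len, best_start, best_end = seg_len, cur_start, i - null_run
                started, cur_len, null_run = False, 0, 0

    if started and cur_len > best_len:     # final comparison at the end of the data
        best_start, best_end = cur_start, cur_start + cur_len - 1
    return best_start, best_end

# Usage (illustrative): keep only the longest stretch, then fill the small internal gaps
# start, end = longest_stretch(df["rms_Z_0.2-0.35"], threshold=8)
# segment = df["rms_Z_0.2-0.35"].iloc[start:end + 1].interpolate(method="linear")
```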
The algorithm was implemented for several interpolation thresholds. The selection of an interpolation threshold involved a trade-off between two key considerations. On one hand, it was important to minimise the number of interpolated hours, as weather conditions—and by extension, SWH—can vary considerably even over short periods. On the other hand, the approach needed to retain as many stations as possible to ensure a broad region of interest and analysis. By optimising this balance, the resulting models became minimally dependent on interpolated data, thereby enhancing their reliability and alignment with real-world conditions.
Figure 4 illustrates how the number of available data points for each station increased with higher interpolation thresholds. For example, by interpolating gaps of up to eight hours, the number of data points for station CAVT increases from 7142 to 11,201, surpassing the one-year threshold required for inclusion in subsequent analysis. Similarly, station CSLB met the inclusion criteria after allowing up to eight hours of interpolation, with its data count rising from 6058 to 9636 points. This contrasts with the methodology employed by Minio et al. [10], where up to 5000 data points (corresponding to 208 days’ worth of data) were interpolated.
The second subplot in Figure 4 shows that more than one year of continuous data was available for stations AIO, HAGA, MSDA, MUCR, and WDD, without requiring any interpolation. By interpolating up to eight hours’ worth of data, stations CAVT and CSLB met the eligibility threshold of one year and were included in the study, further diversifying the results. Based on this analysis, models were trained on these seven stations, allowing for up to eight hours of interpolation where necessary. The data ranges for the stations AIO, CAVT, CSLB, HAGA, MSDA, MUCR, and WDD used in this study can be seen in Table 2.
The preprocessing strategy, combining the selection of the nearest grid cells to each station and the identification of the longest continuous stretch of data per station, ensured a more spatially precise and computationally efficient dataset. Linear interpolation was applied only at stations CAVT and CSLB to fill fewer than 1% of missing values in each dataset. This approach improved upon the work of Minio et al. [10] by narrowing down the region of interest relevant to each station and by minimising linearly interpolated data.
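The station-to-grid-cell pairing amounts to selecting the five sea grid points with the smallest great-circle distance from each station. The sketch below is illustrative only; the coordinate arrays, function names, and example values are assumptions rather than the study’s actual code.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between one point and arrays of points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def nearest_grid_cells(station_lat, station_lon, grid_lats, grid_lons, k=5):
    """Indices of the k sea grid cells closest to a seismic station."""
    distances = haversine_km(station_lat, station_lon,
                             np.asarray(grid_lats), np.asarray(grid_lons))
    return np.argsort(distances)[:k]

# Hypothetical station near Malta and a handful of CMEMS-like grid points
print(nearest_grid_cells(35.90, 14.52,
                         [35.8, 36.0, 36.2, 35.7], [14.4, 14.6, 14.2, 14.9], k=2))
```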

2.1. Exploratory Data Analysis

This research explores how SWH can be inferred from coastal seismic signals more reliably using AI, based on the physical coupling between ocean waves and the Earth’s crust. Analysis began by comparing seismic stations to nearby sea state grid cells. While the stations in Malta showed strong correlations, others like HAGA, despite being closest to the coast, did not outperform more distant sites like MUCR, indicating that proximity alone does not determine predictive strength.
Temporal and statistical analyses followed. Spearman correlation peaked in the 0.2–0.5 Hz bands, consistent with known ocean microseism activity. Autocorrelation showed faster decay at lower frequencies, suggesting sensitivity to sea state changes, while higher frequencies reflected more persistent signals, which were likely anthropogenic. Seismic RMS distributions were skewed toward zero, supporting the use of ML models suited for non-normal data.
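As an illustration of this band-by-band analysis, the Spearman correlation between each hourly RMS feature and the SWH target can be computed as below; the dataframe layout and column names are hypothetical.

```python
import pandas as pd
from scipy.stats import spearmanr

def band_correlations(df: pd.DataFrame, target: str = "swh_mean") -> pd.Series:
    """Spearman correlation of every RMS feature column against the SWH target."""
    correlations = {}
    for column in df.columns:
        if column == target:
            continue
        rho, _ = spearmanr(df[column], df[target], nan_policy="omit")
        correlations[column] = rho
    return pd.Series(correlations).sort_values(ascending=False)

# band_correlations(station_df).head()   # in this study, 0.2-0.5 Hz bands ranked highest
```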

2.2. Model Selection

The model selection process was driven by both empirical observations and practical constraints. Exploratory analysis revealed high data skewness and low autocorrelation decay, especially in lower frequency bands, while higher frequencies were often contaminated by human activity. These properties made traditional linear or distance-based models unsuitable due to their sensitivity to skewed distributions and noise. In this study, an RF regression model was chosen for its robustness to data skew, its lack of reliance on distributional assumptions, and its resilience against persistent anthropogenic signals. RF was particularly advantageous because of its inherent resistance to noise, which allowed us to forgo a noise threshold and the corresponding filtering steps, thereby enabling a more direct use of raw observational data and increased computational efficiency. Hyperparameter tuning of tree depth, sample splits, leaf size, and feature subsampling regularised the RF models, reducing overfitting to noisy fluctuations. As a result, the model handled noise effectively without requiring explicit signal filtering. Furthermore, RF models are computationally efficient and can be trained and deployed on consumer-grade hardware, which is critical for real-world applicability. Given these advantages and the strong benchmark performance reported by Minio et al. [10], this study adopted an RF regressor, using Scikit-learn (v1.5.2) as the core model (https://scikit-learn.org/stable/whats_new/v1.5.html (accessed on 29 July 2025)).

2.3. Creation of a Baseline

To evaluate model performance meaningfully, a baseline is established, closely inspired by the methodology of Minio et al. [10], which is further refined at a later stage to produce a final set of models. For the baseline, the pipeline begins by extracting hourly seismic RMS from raw waveform data, then applying a noise threshold (defined by Minio et al. [10] as 1 × 10⁻⁹) below which values are replaced with null values. Features with excessive null values (>5000) are discarded, and the remaining missing data are filled via linear interpolation. To address data skewness, features with skewness greater than 0.7 are transformed using the Box–Cox method. Additionally, data points affected by major seismic events (having a magnitude > 5.5 in the Mediterranean or >7.0 globally) are removed, which requires the integration of an earthquake catalogue (https://earthquake.usgs.gov/earthquakes/search/ (accessed on 29 July 2025)).
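A hedged sketch of these baseline preprocessing steps is shown below, assuming the hourly RMS features sit in a pandas DataFrame; column handling and constant names are our own, with the threshold and skewness values taken from the text above.

```python
import pandas as pd
from scipy.stats import boxcox

NOISE_THRESHOLD = 1e-9   # values below this are treated as missing (per the baseline)
MAX_NULLS = 5000         # features with more missing points than this are dropped
SKEW_LIMIT = 0.7         # features more skewed than this are Box-Cox transformed

def baseline_preprocess(features: pd.DataFrame) -> pd.DataFrame:
    df = features.mask(features < NOISE_THRESHOLD)           # apply the noise threshold
    df = df.loc[:, df.isna().sum() <= MAX_NULLS]              # drop overly sparse features
    df = df.interpolate(method="linear", limit_direction="both")
    for column in df.columns:
        if abs(df[column].skew()) > SKEW_LIMIT:               # Box-Cox needs positive input
            df[column], _ = boxcox(df[column].clip(lower=1e-12))
    return df
```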
Target variables are constructed by identifying each station’s five nearest ocean grid cells and calculating both mean and median SWH values across them, resulting in seven targets per station. In total, each dataset comprised 39 input features (13 frequency bands × 3 channels). For training and testing, data are split into 40 non-consecutive chunks, with 70% randomly selected for training and 30% for testing. This chunking approach, adapted from Minio et al. [10], preserves temporal variability, ensuring that the train and test sets contain seasonal variability. Each station’s RF model is trained using the hyperparameters Minio et al. [10] identified as optimal (200 trees, maximum depth of 15, 40 max features), forming the baseline against which further experiments are evaluated.
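The chunk-based split and the baseline RF configuration can be sketched as follows, under the hyperparameters stated above; variable names, the random seed, and the shuffling details are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def chunked_split(n_samples, n_chunks=40, train_frac=0.7, seed=42):
    """Split time-ordered indices into chunks, then assign whole chunks to train/test."""
    chunks = np.array_split(np.arange(n_samples), n_chunks)
    order = np.random.default_rng(seed).permutation(n_chunks)
    n_train = int(train_frac * n_chunks)
    train_idx = np.concatenate([chunks[i] for i in order[:n_train]])
    test_idx = np.concatenate([chunks[i] for i in order[n_train:]])
    return train_idx, test_idx

n_features = 39  # 13 frequency bands x 3 components for one station
baseline_rf = RandomForestRegressor(
    n_estimators=200, max_depth=15,
    max_features=min(40, n_features),   # "40 max features", capped to what exists
    random_state=42)

# train_idx, test_idx = chunked_split(len(X))
# baseline_rf.fit(X[train_idx], y[train_idx])
```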

2.4. Experimental Setup and Hyperparameters

All experiments were conducted on consumer-grade equipment running Microsoft Windows 10, equipped with an Intel Core i5-8250U CPU (1.6 GHz) (Intel, Santa Clara, CA, USA), 8 GB of RAM, and integrated Intel UHD Graphics 620. This hardware setup aligns with the research objective to develop models that are practical and deployable without access to specialised high-performance computing resources. Notably, no discrete GPU was utilised during model training, emphasising the focus on computational efficiency and broad accessibility.
For the final set of models, a hyperparameter grid search for each station is performed. This balances model complexity and predictive accuracy, while preventing overfitting. This station-specific tuning ensures that the models can adapt to local environmental characteristics, and the approach remains transferable to other regions, as hyperparameters can be re-adjusted to reflect site-specific conditions. Key hyperparameters included
  • Number of features considered when making a decision: This defines the feature subset to consider (50%, log2, or the square root of the total number of features) when deciding how to split the data.
  • Number of trees in the model: This refers to how many decision trees are combined to make predictions—100, 200, or 300 trees.
  • Maximum depth of each tree: Limits how many layers of decisions each tree can make, with possible hyperparameter values being 10, 20, or 30 levels deep.
  • Minimum number of data points at a final decision point: A terminal node (or ‘leaf’) must contain at least 1, 3, or 5 data samples.
  • Minimum number of data points needed to split a branch: A decision within the tree requires at least 2, 5, or 10 samples to be considered.
These parameters are systematically varied to explore trade-offs between tree diversity, depth, and generalisation capability. Bootstrapping is used to enhance model robustness to noise and overfitting.
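The resulting grid search can be reproduced along these lines with Scikit-learn’s GridSearchCV. The parameter grid mirrors the values listed above, while the internal cross-validation setting, scoring choice, and names are assumptions on our part.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_features": [0.5, "log2", "sqrt"],
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 20, 30],
    "min_samples_leaf": [1, 3, 5],
    "min_samples_split": [2, 5, 10],
}

def tune_station(X_train, y_train):
    """Exhaustive grid search for a single station's RF model."""
    search = GridSearchCV(
        RandomForestRegressor(bootstrap=True, random_state=42),
        param_grid, scoring="r2", cv=3, n_jobs=-1)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_

# best_model, best_params = tune_station(X_train, y_train)
```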

2.5. Evaluation Metrics and Performance Analysis

Model evaluation is carried out using a comprehensive suite of regression metrics to capture different aspects of predictive performance. These include MAE, mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination (R2). MAE and RMSE quantify average prediction errors, with RMSE placing greater emphasis on larger deviations, while R2 assesses how well the model explains variance in the observed data. Using multiple metrics enables a nuanced understanding of accuracy, error distribution, and model fit.
To ensure reliability and generalisability, this study applies k-fold cross-validation with k = 5 to the best-performing stations. Data are split into 40 temporal chunks and randomly shuffled with a fixed seed to avoid seasonal bias in training and testing folds. This procedure guarantees that each fold contains diverse data from across the year, mitigating risks of overfitting to particular time periods.
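A sketch of the metric computation and the chunk-shuffled five-fold evaluation is given below; the function names and the exact fold assembly are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold

def regression_report(y_true, y_pred):
    """MAE, MSE, RMSE, and R2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {"MAE": mean_absolute_error(y_true, y_pred), "MSE": mse,
            "RMSE": np.sqrt(mse), "R2": r2_score(y_true, y_pred)}

def chunk_shuffled_kfold(model, X, y, n_chunks=40, k=5, seed=42):
    """Five-fold CV where whole temporal chunks, shuffled with a fixed seed, form the folds."""
    chunks = np.array_split(np.arange(len(X)), n_chunks)
    order = np.random.default_rng(seed).permutation(n_chunks)
    reports = []
    for train_pos, test_pos in KFold(n_splits=k).split(order):
        train_idx = np.concatenate([chunks[order[i]] for i in train_pos])
        test_idx = np.concatenate([chunks[order[i]] for i in test_pos])
        model.fit(X[train_idx], y[train_idx])
        reports.append(regression_report(y[test_idx], model.predict(X[test_idx])))
    return reports
```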
The full source code and a sample of the dataset used is publicly accessible at https://github.com/erikasbailey/seismowave/tree/main (accessed on 29 July 2025) and can be run by setting the working directory to the main project folder.

3. Results

3.1. Baseline Model

To enable a meaningful comparison with previous work such as that by Minio et al. [10], this study recreates their approach as a baseline, with slight modifications. Separate RF models are trained for each station using SWH data from the five closest grid cells. The original preprocessing steps are largely retained, including the application of a noise threshold, the Box–Cox transformation, removal of seismic event periods based on a global earthquake catalogue (30 events), and linear interpolation of missing RMS values. Stations with missing data exceeding 5000 data points (CAVT, PZIN, CLTA, HPAC, and MSRU) are not included.
Linear interpolation introduces synthetic patterns inconsistent with real-world dynamics. One such example is shown in Figure 5. The performance metrics achieved across stations are summarised in Table 3.

3.2. Model Performance

To align with the study’s objectives, the modelling approach deviates from the reproduced baseline in several ways: minimal preprocessing is applied, station-specific data segments with minimal interpolation are used, independent models are trained per station to enable deployment on consumer-grade hardware, and hyperparameters are optimised for each station. The specific stations included in the analysis differ slightly from the baseline model, since different preprocessing methods are applied, involving different feature selection techniques.

3.2.1. Hyperparameter Tuning

A grid search over five key hyperparameters produced 11,907 models. The hyperparameters selected and corresponding performance metrics are shown in Table 4. The variation in results between stations confirms the need for station-specific modelling and hyperparameter tuning, rather than a single model covering the entire region of interest.

3.2.2. K-Fold Cross Validation

Five-fold cross-validation assesses the generalisability and robustness of the optimal RF models selected for each station, using hyperparameters derived from prior tuning. This procedure confirms a strong predictive relationship between seismic RMS values and SWH, with varying degrees of success across stations. Figure 6 and Figure 7 show the predicted and actual time series of SWH across all seven stations at different time periods, covering all seasons.
The performance consistency across the folds, based on R2, across all stations is summarised in Figure 8. These results provide a measure of stability in model performance with respect to various folds of data.

3.2.3. Feature Importance

To extract further insight from the numerical results, this study presents a feature importance analysis. Table 5 shows the top five features identified for each station, where ‘Z’, ‘N’, and ‘E’ indicate the seismometer component, and the accompanying values represent the corresponding frequency bands in Hz.
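The rankings in Table 5 correspond to Scikit-learn’s impurity-based importances of the fitted RF regressors, which can be extracted as follows; the feature naming convention shown is illustrative.

```python
import pandas as pd

def top_features(fitted_rf, feature_names, n=5):
    """Rank features of a fitted RandomForestRegressor by impurity-based importance."""
    importances = pd.Series(fitted_rf.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(n)

# Feature names combine component and frequency band, e.g. "Z/0.2-0.35"
# top_features(best_model, feature_names)
```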

4. Discussion

Within the recreated baseline, the strongest result was achieved at CSLB (R2 = 0.868, MAE = 0.137 m), shown in Table 3. These are comparable to Minio et al.’s [10] best results (R2 = 0.89, MAE = 0.21 m). Across all stations, a mean R2 of 0.697 and a mean MAE of 0.161 m are observed, indicating slightly lower peak performance but higher average accuracy. The average results are heavily impacted by substantially poorer performance at stations AIO and MMGO. The RMSE consistently exceeds the MAE, indicating the presence of outliers.
These findings support the feasibility of using seismic RMS to estimate SWH, but also highlight limitations of a uniform model configuration. The need for improved preprocessing and station-specific adaptation motivates the enhanced pipeline introduced in this research.
The final models incorporate hyperparameter tuning at each station. This allows for precise tuning and improved prediction accuracy, and suggests applicability to different geographical regions, for which tuning may be applied accordingly. The hyperparameter tuning exercise suggested that maximum tree depth and the number of estimators had the greatest impact on the results, considerably influencing R2 and MAE values across stations. Other hyperparameters had minimal impact, indicating overall model stability.
Station AIO consistently underperformed, with only 3% of its models achieving R2 > 0.6, suggesting potential data or instrument-related issues. Conversely, stations CAVT, WDD, MSDA, and CSLB achieved strong performance, with over 60% of models yielding R2 > 0.8. Stations HAGA and MUCR exhibited moderate, but stable, model performance.
Optimal hyperparameters were selected based on the highest R2, balancing goodness-of-fit with acceptable error levels. In cases where error metrics improved slightly at the expense of explanatory power, the configuration with stronger generalisability (higher R2) was prioritised.
The K-fold cross-validation results suggest a generally strong and consistent predictive relationship between seismic RMS values and corresponding SWH measurements across all stations. Figure 8 shows that station WDD is the most reliable and high-performing model, demonstrating both high R2 values and minimal error across folds. MSDA also performed exceptionally well, displaying remarkable consistency with the lowest variability among all stations. Similarly, CAVT and MUCR have strong and stable predictive performance, reinforcing their reliability. Conversely, stations AIO, CSLB, and HAGA showed greater variability across folds, suggesting sensitivity to data partitioning and possible quality issues within the training data. Station AIO, in particular, remains the weakest model, characterised by both low average R2 and significant prediction errors during abrupt shifts in SWH. CSLB and HAGA, while achieving mid-range average R2 scores, were hindered by at least one poorly performing fold each, indicating occasional overfitting or external influences not captured in the model design.
A common challenge observed across stations is the underestimation of peak SWH during extreme sea conditions, which occurred both in summer (for example, at station MSDA around 2021-06-23) and in winter (for example, at station MUCR between 2019-12-05 and 2019-12-25). These examples are shown in Figure 7. This outcome is likely influenced by the limited representation of extreme wave events in the training data. To investigate this further, the SWH records at each station are segmented by key percentiles (25%, 50%, 75%, 90%, and 95%). As shown in Figure 9, stations such as CAVT, MSDA, and WDD exhibit broader SWH distributions, with their 95th percentile values lying within the 1.75–2.0 m range, and WDD extending just beyond 2.0 m. In contrast, AIO, CSLB, HAGA, and MUCR have more compressed distributions, with their 95th percentiles contained within 1.5 m. These patterns highlight that high wave events were relatively rare, particularly at the latter stations, where most observations remained confined to lower ranges.
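The percentile segmentation underlying Figure 9 is a direct computation per station; the helper below is a minimal sketch with an assumed input array of hourly mean SWH values.

```python
import numpy as np

def swh_percentiles(swh, levels=(25, 50, 75, 90, 95)):
    """Key percentiles of a station's mean-SWH record, as summarised in Figure 9."""
    return {p: float(np.percentile(swh, p)) for p in levels}

# swh_percentiles(station_df["swh_mean"].to_numpy())
```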
Moreover, the feature importance analysis suggests that stations whose top features lie in higher frequency bands generally exhibit stronger model performance. This indicates that high-frequency seismic components carry more predictive information for the target variable. Conversely, lower performance at stations with predominantly low-frequency features may indicate either station-specific noise characteristics or reduced informativeness of low-frequency signals. In fact, prior evidence shows that microseisms observed at a seismic station can originate from distant coastlines, particularly at around 0.05 Hz, which may further explain the reduced performance of stations where low frequency features were considered most important by the RF regressor, specifically, station AIO [19]. This insight may guide model refinement in future research.
Despite these issues, the models consistently demonstrate reliable spatial and temporal performance, with low grid-cell-level errors and good seasonal generalisation. Collectively, the cross-validation results affirm the feasibility of seismic data as a viable source for ocean wave height estimation, while also highlighting the need for enhancements in capturing rare-event dynamics.

4.1. Comparison with Baseline Models

Although reproducing the results of all the methods reviewed in the literature was not possible due to code and data availability constraints, improved face-value performance is noted. Previous studies reported MAE values of up to 0.68 m, while the models developed in this study showed lower errors. The average MAE in this study is 0.14 m, with improvements also observed in RMSE and R2 scores across all stations, indicating more consistent predictive performance.
These gains are attributed to methodological refinements. Unlike the replicated baseline, which applied linear interpolation to long gaps and used fixed hyperparameters, the final models limit interpolation to short gaps and optimise parameters per station. This station-specific tuning improves model fit while maintaining low computational cost.
Overall, the results confirm that even with minimal preprocessing, seismic data can be reliably mapped to sea state conditions. The models generalise well across locations, outperforming traditional approaches and demonstrating the feasibility of low-cost, onshore seismic-based monitoring of marine environments.

4.2. Summary of Key Findings

A reliable baseline model was recreated, with substantial performance improvements observed in the final models. These achieved a mean R2 of 0.83 and a mean MAE of 0.14 m across seven stations. These results were obtained using a cost-effective approach, with all models trained on consumer-grade hardware. The pipeline required minimal preprocessing and benefited from station-specific model training and hyperparameter tuning. The use of the ‘longest stretch’ algorithm also improved robustness by avoiding excessive interpolation and preserving the integrity of real-world data.
Additional insights revealed potential data quality issues at station AIO, where performance inconsistencies suggest either sensor calibration problems or localised anomalies. Furthermore, consistent underestimation of higher sea states highlighted a class imbalance in the dataset—only a small fraction of data captured significant wave heights above 2.5 m. These findings, while not tied directly to core objectives, offer direction for future research and model refinement.

5. Conclusions

This research establishes a robust and cost-effective method for modelling the relationship between seismic RMS amplitude and SWH, improving prior work through station-specific models, tailored hyperparameter tuning, and careful preprocessing. The models show enhanced accuracy and efficiency while remaining deployable on consumer-grade hardware. A key innovation was the ‘longest stretch’ algorithm, which prioritises continuous, high-quality data and reduces reliance on interpolation.
Despite limitations such as data quality issues at some stations and reduced performance during extreme wave events due to class imbalance, the study offers a replicable framework adaptable to broader environmental modelling tasks. Its focus on accessibility and regional specificity makes it especially relevant for resource-limited settings, contributing to sustainable marine monitoring and alternative wave measurement strategies.
The principal practical implication of this research lies in its potential contribution to hindcasting, particularly given that seismometer records extend further back in time than accelerometer buoy observations. Leveraging these long-term seismic datasets opens the possibility of reconstructing historical wave conditions, including extreme events, thereby improving our understanding of past coastal and oceanic hazards. Furthermore, the use of onshore seismic instrumentation provides a valuable alternative means of estimating wave heights in regions where direct in situ measurements are unavailable or infeasible.
Future work should explore advanced gap-filling and data augmentation to better handle extreme conditions and possibly extend the usable range of the dataset. One method for consideration is the synthetic minority over-sampling technique (SMOTE), commonly used for unbalanced datasets. Although station AIO shared similar wave height distribution characteristics with station HAGA, its performance was substantially worse. The poor performance may be a limitation of the model; however, it cannot be excluded that factors such as instrument calibration or seismic noise contamination affect the seismic signal quality at this station.

Author Contributions

Conceptualisation: E.S.B., K.G. and A.G.; methodology: E.S.B. and K.G.; software: E.S.B.; validation: E.S.B., K.G. and A.G.; formal analysis: E.S.B.; investigation: E.S.B.; resources: E.S.B.; data curation: E.S.B. and A.G.; writing—original draft: E.S.B.; writing—review and editing: K.G. and A.G.; visualisation: E.S.B.; supervision: K.G. and A.G.; project administration: K.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from EIDA, CMEMS, and the Department of Geosciences, University of Malta, and are available at https://www.orfeus-eu.org/data/eida/, https://data.marine.copernicus.eu/product/MEDSEA_MULTIYEAR_WAV_006_012/services (accessed on 29 July 2025), and geo.sci@um.edu.mt, respectively.

Acknowledgments

The authors thank the Malta Seismic Network [20] for providing the seismic data in relation to stations MSDA and WDD.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial intelligence
CMEMS: Copernicus Marine Environment Monitoring Service
EIDA: European Integrated Data Archive
KNN: k-nearest neighbours
LGB: Light gradient boosting
MAE: Mean absolute error
MARE: Mean average relative error
MCMC: Markov chain Monte Carlo
ML: Machine learning
MSE: Mean squared error
PSD: Power spectral density
RF: Random forest
RMS: Root mean square
RMSE: Root mean squared error
SWH: Significant wave height

References

  1. United Nations Department of Economic and Social Affairs. United Nations Sustainable Development Goals. 2015. Available online: https://sdgs.un.org/goals (accessed on 25 April 2025).
  2. IOC-UNESCO. Global Ocean Science Report 2020—Charting Capacity for Ocean Sustainability; Isensee, K., Ed.; UNESCO Publishing: Paris, France, 2020. [Google Scholar]
  3. Orós, J.; Montesdeoca, N.; Camacho, M.; Arencibia, A.; Calabuig, P. Causes of stranding and mortality, and final disposition of loggerhead sea turtles (Caretta caretta) admitted to a wildlife rehabilitation center in Gran Canaria Island, Spain (1998–2014): A long-term retrospective study. PLoS ONE 2016, 11, e0149398. [Google Scholar] [CrossRef] [PubMed]
  4. Ardhuin, F.; Gualtieri, L.; Stutzmann, E. How ocean waves rock the Earth: Two mechanisms explain microseisms with periods 3 to 300 s. Geophys. Res. Lett. 2015, 42, 765–772. [Google Scholar] [CrossRef]
  5. Besedina, A.N.; Tubanov, T.A. Microseisms as a tool for geophysical research. A review. J. Volcanol. Seismol. 2023, 17, 83–101. [Google Scholar] [CrossRef]
  6. Ferretti, G.; Zunino, A.; Scafidi, D.; Barani, S.; Spallarossa, D. On microseisms recorded near the Ligurian coast (Italy) and their relationship with sea wave height. Geophys. J. Int. 2013, 194, 524–533. [Google Scholar] [CrossRef]
  7. Borzì, A.M.; Minio, V.; De Plaen, R.; Lecocq, T.; Alparone, S.; Aronica, S.; Cannavò, F.; Capodici, F.; Ciraolo, G.; D’Amico, S.; et al. Integration of microseism, wavemeter buoy, HF radar and hindcast data to analyze the Mediterranean cyclone Helios. Ocean Sci. 2024, 20, 1–20. [Google Scholar] [CrossRef]
  8. Sverdrup, H.U.; Munk, W.H.; Scripps Institution of Oceanography; United States Hydrographic Office. Wind, Sea and Swell: Theory of Relations for Forecasting; United States Hydrographic Office: Washington, DC, USA, 1947. [Google Scholar]
  9. Cannata, A.; Cannavò, F.; Moschella, S.; Di Grazia, G.; Nardone, G.; Orasi, A.; Picone, M.; Ferla, M.; Gresta, S. Unravelling the relationship between microseisms and spatial distribution of sea wave height by statistical and machine learning approaches. Remote Sens. 2020, 12, 761. [Google Scholar] [CrossRef]
  10. Minio, V.; Borzì, A.M.; Saitta, S.; Alparone, S.; Cannata, A.; Ciraolo, G.; Contrafatto, D.; D’Amico, S.; Di Grazia, G.; Larocca, G.; et al. Towards a monitoring system of the sea state based on microseism and machine learning. Environ. Model. Softw. 2023, 167, 105781. [Google Scholar] [CrossRef]
  11. Istituto Nazionale di Geofisica e Vulcanologia (INGV). Rete Sismica Nazionale (RSN); [Data set]; Istituto Nazionale di Geofisica e Vulcanologia (INGV): Rome, Italy, 2005. [Google Scholar] [CrossRef]
  12. E.U. Copernicus Marine Service Information (CMEMS). Mediterranean Sea Waves Reanalysis; Marine Data Store (MDS). Available online: https://data.marine.copernicus.eu/product/MEDSEA_MULTIYEAR_WAV_006_012/description (accessed on 31 March 2025).
  13. Khan, A.A.; Chaudhari, O.; Chandra, R. A review of ensemble learning and data augmentation models for class imbalanced problems: Combination, implementation and evaluation. Expert Syst. Appl. 2024, 244, 122778. [Google Scholar] [CrossRef]
  14. Guo, Y.; Fu, L.; Li, H. Seismic data interpolation based on multi-scale transformer. IEEE Geosci. Remote Sens. Lett. 2023, 20, 7504205. [Google Scholar] [CrossRef]
  15. Kaur, H.; Pham, N.; Fomel, S. Seismic data interpolation using CycleGAN. In SEG Technical Program Expanded Abstracts; Society of Exploration Geophysicists: Houston, TX, USA, 2019; pp. 2202–2206. [Google Scholar]
  16. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  17. Baranbooei, S.; Bean, C.J.; Rezaeifar, M.; Donne, S.E. Determining offshore ocean significant wave height (SWH) using continuous land-recorded seismic data: An example from the northeast Atlantic. J. Mar. Sci. Eng. 2025, 13, 807. [Google Scholar] [CrossRef]
  18. Moni, A.; Craig, D.; Bean, C.J. Separation and location of microseism sources. Geophys. Res. Lett. 2013, 40, 3118–3122. [Google Scholar] [CrossRef]
  19. Gualtieri, L.; Stutzmann, E.; Juretzek, C.; Hadziioannou, C.; Ardhuin, F. Global scale analysis and modelling of primary microseisms. Geophys. J. Int. 2019, 218, 560–572. [Google Scholar] [CrossRef]
  20. University of Malta. Malta Seismic Network [Data set]. In International Federation of Digital Seismograph Networks; University of Malta: Msida, Malta, 2014. [Google Scholar] [CrossRef]
Figure 1. The region of interest considered by Minio et al. [10].
Figure 2. The region of interest considered in this research, showing the identified nearest grid cells to each station.
Figure 3. Flowchart for the longest stretch algorithm.
Figure 4. The number of data points and corresponding years of data available for varying interpolation thresholds, for each station.
Figure 5. An example of problems arising from linearly interpolating extended periods, at one frequency band and channel for station AIO, covering more than one full day.
Figure 6. Predicted and actual time series at stations AIO, CAVT, CSLB, and HAGA. Vertical lines separate time periods shown below the x-axis.
Figure 7. Predicted and actual time series at stations MSDA, MUCR, and WDD. Vertical lines separate time periods shown below the x-axis.
Figure 8. Mean, standard deviation, minimum and maximum R2 from K-fold cross validation per station.
Figure 9. The percentiles of the mean significant wave height of the five nearest grid cells to each station.
Table 1. Performance metrics reported in the literature, where “–” indicates that no such information was available.
| Method | MAE (m) | RMSE (m) | R2 |
| Numerical Methods | | | |
|     Ferretti et al. [6] | – | 0.19 | – |
| AI-Based Solutions | | | |
|     Cannata et al. [9] | ~0.1 | – | – |
|     Minio et al. [10] | 0.21 ± 0.23 | – | 0.89 |
|     Baranbooei et al. [17] Scenario 1 | 0.6132 | 0.8780 | 0.8363 |
|     Baranbooei et al. [17] Scenario 2 | 0.6816 | 0.9505 | 0.8059 |
Table 2. Details on the final dataset chosen for each station. Null data was linearly interpolated.
| Station | Start Date and Time | End Date and Time | Total Data | Null Data |
| AIO | 2019-04-24 04:00 | 2020-12-06 12:00 | 14,217 | 0% |
| CAVT | 2019-06-24 08:00 | 2020-10-03 00:00 | 11,201 | 0.38% |
| CSLB | 2019-08-09 08:00 | 2020-09-13 19:00 | 9636 | 0.26% |
| HAGA | 2019-02-18 01:00 | 2020-09-11 16:00 | 13,720 | 0% |
| MSDA | 2019-12-15 22:00 | 2021-09-29 11:00 | 15,686 | 0% |
| MUCR | 2019-05-17 02:00 | 2021-08-01 09:00 | 19,376 | 0% |
| WDD | 2018-05-21 12:00 | 2019-06-18 00:00 | 9421 | 0% |
Table 3. Baseline model performance and final model performance, where the target variable is the mean significant wave height (SWH) of the five nearest grid cells, and “–” indicates that a station was not included in that set of models.
| Station | Baseline R2 | Baseline MSE | Baseline MAE | Baseline RMSE | Final R2 | Final MSE | Final MAE | Final RMSE |
| AIO | 0.350 | 0.089 | 0.209 | 0.298 | 0.607 | 0.071 | 0.182 | 0.267 |
| CAVT | – | – | – | – | 0.892 | 0.023 | 0.101 | 0.151 |
| CSLB | 0.868 | 0.044 | 0.137 | 0.210 | 0.881 | 0.055 | 0.143 | 0.235 |
| HAGA | 0.639 | 0.065 | 0.156 | 0.255 | 0.784 | 0.064 | 0.153 | 0.252 |
| MMGO | 0.330 | 0.252 | 0.243 | 0.502 | – | – | – | – |
| MPNC | 0.861 | 0.030 | 0.109 | 0.174 | – | – | – | – |
| MSDA | 0.843 | 0.056 | 0.147 | 0.237 | 0.862 | 0.033 | 0.122 | 0.182 |
| MUCR | 0.840 | 0.052 | 0.156 | 0.228 | 0.862 | 0.041 | 0.141 | 0.202 |
| SOLUN | 0.698 | 0.054 | 0.135 | 0.233 | – | – | – | – |
| WDD | 0.841 | 0.067 | 0.157 | 0.258 | 0.921 | 0.021 | 0.102 | 0.144 |
Table 4. Optimal hyperparameters selected for each station and corresponding evaluation metrics.
| | AIO | CAVT | CSLB | HAGA | MSDA | MUCR | WDD |
| RF_max_depth | 30 | 30 | 10 | 20 | 30 | 30 | 10 |
| RF_n_estimators | 200 | 200 | 200 | 100 | 100 | 100 | 100 |
| RF_max_features | log2 | sqrt | log2 | sqrt | log2 | 0.5 | log2 |
| RF_min_samples_split | 2 | 2 | 5 | 2 | 10 | 10 | 5 |
| RF_min_samples_leaf | 1 | 1 | 3 | 1 | 1 | 1 | 3 |
| MAE | 0.18243 | 0.10066 | 0.14298 | 0.15251 | 0.12207 | 0.14089 | 0.10175 |
| MSE | 0.07107 | 0.02291 | 0.05519 | 0.06361 | 0.03282 | 0.04067 | 0.02073 |
| RMSE | 0.26659 | 0.15137 | 0.23492 | 0.25221 | 0.18116 | 0.20166 | 0.14398 |
| R2 | 0.60686 | 0.89238 | 0.88108 | 0.78357 | 0.86198 | 0.86200 | 0.92060 |
Table 5. Feature importance rankings obtained from RF regression using impurity-based importance. Values indicate component followed by frequency range (Hz).
| Station | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 |
| AIO | E/0.2–0.35 | Z/0.2–0.35 | N/0.2–0.35 | E/0.35–0.5 | Z/0.35–0.5 |
| CAVT | N/0.8–0.95 | N/0.95–1.1 | N/1.1–1.25 | E/0.8–0.95 | N/0.2–0.35 |
| CSLB | Z/1.1–1.25 | N/1.1–1.25 | E/1.1–1.25 | Z/0.2–0.35 | Z/0.95–1.1 |
| HAGA | E/0.5–0.65 | N/0.5–0.65 | E/0.65–0.8 | Z/0.5–0.65 | E/0.2–0.35 |
| MSDA | Z/1.25–1.4 | Z/1.55–1.7 | Z/1.1–1.25 | Z/1.4–1.55 | E/1.4–1.55 |
| MUCR | Z/0.2–0.35 | E/0.2–0.35 | N/0.2–0.35 | N/0.65–0.8 | N/0.5–0.65 |
| WDD | E/1.4–1.55 | N/1.1–1.25 | Z/0.95–1.1 | E/1.1–1.25 | E/1.25–1.4 |
