LiDAR Observations of Multi-Modal Swash Probability Distributions on a Dissipative Beach

: Understanding swash zone dynamics is of crucial importance for coastal management as the swash motion, consisting of the uprush of the wave on the beach face and the subsequent downrush, is responsible for driving changes in the beach morphology through sediment exchanges between the sub-aerial and sub-aqueous beach. Improved understanding of the probabilistic characteristics of these motions has the potential to allow coastal engineers to develop improved sediment transport models which, in turn, can be further developed into coastal management tools. In this paper, novel descriptors of swash motions are obtained by combining ﬁeld data and statistical modelling. Our results indicate that the probability distribution function (PDF) of shoreline height timeseries ( p ( ζ ) ) and trough-to-peak swash heights ( p ( ρ ) ) measured at a high energy, sandy beach were both inherently multimodal. Based on the observed multimodality of these PDFs, Gaussian mixtures are shown to be the best method to statistically model them. Further, our results show that both offshore and surf zone dynamics are responsible for driving swash zone dynamics, which indicates unsaturated swash. The novel methods and results developed in this paper, both data collection and analysis, could aid coastal managers to develop improved swash zone models in the future.


Introduction
The swash zone encompasses the transition region between the sub-aqueous and the sub-aerial beach [1].It is a highly dynamic environment with alternating wet and dry conditions.Over the past five decades, this region has attracted increased research interest due to the significant role it plays in sediment dynamics and beach erosion [2].The cross-shore shoreline oscillation, globally referred to as swash, can be divided into two main components: the uprush of the wave on the beach face and the subsequent downrush.Each of these two divisions can be described by their horizontal and vertical (height) components (Figure 1).A large proportion of swash zone research has focused on obtaining empirical parametric formulae to describe extreme runup heights (see Atkinson et al. [3] and Power et al. [4] for recent reviews); however, as highlighted by Hughes et al. [5], this approach may not be fully satisfactory to provide information to coastal managers to develop operational tools other than inundation models.
The natural variability of the probability distribution functions (PDFs) of swash motions has received little research attention to date.To the authors' knowledge, only two studies have attempted to describe the variability of these PDFs in detail [5,6], both of which compared measured swash maxima PDFs to Cartwright and Longuet-Higgins' [7] theoretical PDF for the maxima of a random variable.This theoretical PDF is a direct function of the spectral width (ϕ) of the analysed timeseries and reduces to the Rayleigh PDF for narrow bandwidth processes (ϕ = 0) or to the Gaussian PDF for wide bandwidth processes (ϕ = 1).Holland and Holman [6] found that their measured swash maxima PDFs matched Cartwright and Longuet-Higgins' [7] PDF for some values of ϕ, but they could not directly correlate the variability in their PDFs to environmental parameters.It has been observed [8], however, that the spectral width parameter ϕ does not correlate with wave height PDFs in the surf zone and, by induction, should not correlate with swash heights either.More recently, Hughes et al. [5] investigated the PDFs of the shoreline height timeseries (p(ζ)) and trough-to-peak runup heights (p(ρ)), in addition to swash maxima PDFs.They observed that, on average, observed p(ζ) PDFs were consistently right skewed when compared to the Gaussian PDF predicted by Longuet-Higgins [9], possibly due to the broad-band wave spectrum observed on natural beaches.Further, these authors compared p(ρ) with both the Rayleigh and Gaussian PDFs but neither of these PDFs seemed to satisfactorily describe their observations (see Figure 6 in Hughes et al. [5], for example).In this paper, we provide novel field observations of swash motion PDFs obtained from a high-resolution LiDAR system deployed at a high-energy sandy beach and test the hypothesis that well defined, uni-variate and uni-modal PDFs (for example, the Gaussian PDF) are able to describe observed swash zone data.Specifically, the statistical properties of shoreline height timeseries (ζ) and trough-to-peak swash height (ρ) PDFs are investigated in detail.In addition, detailed surf zone and offshore data are used to the assess the patterns observed in the swash zone.The results obtained here deviate significantly from the theoretical predictions from Longuet-Higgins [9] for the analysis of p(ζ) and do not support the concept of swash saturation [10].The present data indicate that a combination of surf zone and offshore forcing control swash zone dynamics.Finally, an approach that allows offshore and surf zone parameters be linked to the variability of p(ζ) is investigated.This paper is organised as follows.Section 2 describes the data collection methods and pre-processing, Section 3 presents the results of the field data collection with a focus on probabilistic descriptors for swash, Section 4 discusses the results, and Section 5 concludes.

Data Collection
Data were collected at Seven Mile Beach, Gerroa, New South Wales, Australia, hereafter SMB.This beach is classed as modally dissipative in the Australian morphodynamic beach model [11,12] and for the duration of the present experiment, was characterised by a gently sloping profile (with slope β = 0.03) with no significant barred morphology, beach cusps, or alongshore variability.Video imagery, pressure transducer (PT), offshore spectral wave, Light imaging Detection And Ranging (LiDAR), acoustic Doppler velocimeter (ADV), and topographic data were collected during a field data collection experiment over six days in June 2018.In this paper, the focus will be on the LiDAR , surf zone (PT) and offshore data.The experimental design is shown in Figure 2 and is summarised below.See Stringari et al. [13] and Stringari and Power [14] for further details.
The PT data collection consisted of 30 PTs (RBR Solo and INW P2X ) deployed on the seabed in a cross-shore orientation.The LiDAR (SiCK LMS511) was mounted on a scaffold frame and recorded in the same cross-shore orientation as the PT transect (see Figure 2a,b).The ADV (Sontek Hydra) was deployed approximately 50 m seaward of the LiDAR.The Datawell waverider buoy was deployed offshore of the transect line at the 10 m isobath.Two video cameras were used: a Sony camera (HDR-CX240) mounted at Gerroa headland and a Raspberry-Pi-based system [15] mounted directly facing the LiDAR (shown in Figure 2b).The beach was surveyed several times each day using a Trimble S5 total station and a Trimble R4 RTK GPS, and the representative beach profile is shown in Figure 2c. Figure 3 shows the offshore conditions for the duration of the field campaign.This dataset was ultimately chosen for the analyses presented in this paper because it overcomes the limitations of classical remote-sensing datasets (e.g., pixel misregistration; see Vousdoukas et al.'s [16] Figure 7), it has precise and unique offshore conditions, and it has a high degree of offshore wave variability for comparable tidal water levels and beach slopes.

Data Processing
From the raw LiDAR data (Figure 4a), timeseries of the cross-shore evolution of the water surface elevation were extracted at 10 Hz and stacked in time, resulting in a dataset similar to a video-derived timestack [17] (Figure 4b, color scale).Incoming waves were tracked using a modified version of the method from Stringari, Harris, and Power [13].For each LIDAR timestack, the vertical Sobel [18] edge detector was applied to the timestack, and pixel intensity peaks in the resulting image were extracted and then clustered using the DBSCAN algorithm [19].Unlike the original method, no colour-based machine learning was applied to the dataset.The absence of the colour-learning step resulted in a significant increase in the number of false-positive cases of wave crests being detected.These erroneous wave crests were manually corrected in QGIS (version 10.4) to ensure that no errors were propagated into subsequent analyses.Optimal wave paths were then obtained as per the original tracking algorithm [13].The tracking algorithm was set to stop tracking waves if the water elevation above the bed was less than 0.015 m, which is significantly lower than other recent works [20,21].Note that in this paper, we have not quantified uncertainties related to the raw LiDAR data; future research should, however, focus on developing methods to quantify and correct for such uncertainties.
The temporal evolution of the shoreline position was obtained in three steps: (1) the uprush was obtained from the tracked wave paths as described above, (2) the downrush was obtained via edge detection using the horizontal Sobel [18] operator, and (3) the continuous shoreline timeseries was obtained by combining the results of steps ( 1) and ( 2) and interpolating the data to a regular time vector with a sample frequency of 5 Hz using a Gaussian radial basis function (RBF) interpolation.Finally, horizontal shoreline excursion timeseries were converted to shoreline height (ζ) timeseries and to trough-to-peak swash heights (ρ) using the measured beach profiles and a local minima analysis (see Figure 1 for definitions).The Australian Height Datum (AHD, [22]) was used as the vertical reference.

Surf Zone Dynamics
The cross-shore variation of surf zone significant wave heights (H m 0 ) and the fraction of broken waves (Q b ) were used to assess the surf zone dynamics (Figure 5).In this paper, Q b was calculated as follows: for each data run in which there were unique offshore, surf zone, and swash zone data available, 10 min of PT data were extracted from the raw records and, from these records, individual waves were extracted using a local minima analysis as per Power et al. [23] and classified as broken or unbroken using the neural network from Stringari and Power [14].The neural network was updated with field data from Seven Mile Beach to increase the classification performance for the present dataset.The updated neural network accuracy score reached 95% when classifying waves in a test dataset (that is, data that the neural network had never seen) and correlation scores (r 2 ) were >0.95 for Q b predictions (not shown).
Data from each Q b curve were segmented into three clusters using the k-means algorithm (Figure 5a): one cluster that was representative of the outer surf zone, one cluster representative of the mid surf zone, and one cluster representative of the inner surf zone.The probability distribution of Q b (p(Q b )) was then calculated for each class (Figure 5b), which showed that in the outer surf zone, most of the waves were unbroken (p(Q b ) < 0.2); in the mid surf zone, about half of the waves were broken (p(Q b ) ≈ 0.5); and in the inner surf zone, most waves were broken (p(Q b ) > 0.8).This result is consistent with the conceptual hydro-kinematic model for gently sloping beaches [24].Interestingly, Q b values close to the surf-swash boundary were never Q b = 1, which indicates that small unbroken waves reach the swash zone, even on a dissipative beach such as SMB (average Iribarren Number [25] , where L ∞ is the wave length calculated as T m 01∞ W s = 3.77, where W s is the sediment fall velocity).Based on the observed distributions of Q b , three locations in the surf zone were chosen to assess wave heights: Q b = 0.95 (inner surf zone), Q b = 0.50 (mid surf zone), and Q b = 0.05 (outer surf zone).
The analysis of the correlation between offshore (H m0 ∞ ) and surf zone (H m 0 ) wave heights showed that there was a direct correlation between H m 0 and H m0 ∞ across the full width of the surf zone (Figure 5c-f).Following the definition of surf zone saturation from Power et al. [26], the observed correlations strongly suggest that the surf zone was unsaturated during the experiment, despite the dissipative nature of the beach.Finally, the wave-height-to-water-depth ratio (γ sig ) was compared to the offshore wave height normalised by averaged water depth for each 10 min data run (H m0 ∞ /h) (Figure 5g).The results from this analysis are analogous to Figure 11 in Power et al. [23] and indicate that (1) the surf zone was unsaturated and (2) there was a terminal bore height reaching the surf-swash boundary.Following from the analysis in Figure 5a that Q b = 1, this terminal bore height could represent either broken or unbroken waves.These results are significant because if the surf zone is unsaturated, it is probable that the swash zone is also unsaturated.This is discussed further in Section 4. The number of bins for these histograms was calculated using the Freedman and Diaconis [27] rule.Correlations between offshore (H m0 ∞ ) and surf zone (H m 0 ) wave heights in (c) the outer surf zone, (d) the mid surf zone, and (e) the inner surf zone.The red swath shows the 95% confidence interval for the linear regression.(f) Comparison between offshore conditions (H m0 ∞ and T m01 ∞ ) and break point wave height (H b ).The crosses show all the measured offshore data, and the dashed lines show the long-term H m0 ∞ and T m01 ∞ averages for the nearest offshore wave buoy (Port Kembla) [28].(g) Analysis of γ sig against H m0 ∞ /h (analogous to Figure 11 in Power et al. [23]).The coloured swaths show the 95% confidence interval for the regressions.

Shoreline Height Timeseries PDFs
For each data run in which there were unique offshore and LiDAR data, 10 min timeseries were extracted from the raw LiDAR record and the time-varying shoreline was obtained using the method described in Section 2.2.The PDFs of normalised (p((ζ − µ)/σ)) and non-normalised (p(ζ)) shoreline height timeseries were then obtained via histograms and kernel density estimations (KDEs).The use of KDEs to obtain PDFs is advantageous over more traditional histogram methods because (1) they are a fully non-parametric approach, and (2) they are able to identify fluctuations in the data's distribution that are usually not seen when using histograms with potentially non-ideal bin sizes.
Analysis of individual normalised shoreline height PDFs indicated a high degree of variability between runs and that the majority of PDFs were multimodal (97.5%).The Shapiro and Wilk [29] test at the 95% confidence interval confirmed that none of the analysed timeseries were normally distributed.The observed PDFs were then grouped into three clusters based on the observed offshore conditions (Figure 6a).Cluster A represents average wave height conditions with short periods, Cluster B represents calm conditions (low wave heights and short wave periods), and cluster C represents calm conditions with long wave periods.Figure 6b shows the PDFs for cluster A, Figure 6c shows the PDFs for cluster B, and Figure 6d shows the PDFs for cluster C. By averaging the PDFs in each cluster, a right-skewed PDF similar to Hughes et al.'s [5] ensemble PDF was observed (see their Figure 2).To assess the effect of the normalisation strategy and offshore conditions on the shape of the shoreline height PDFs, each PDF was compared to every other PDF in the same cluster, and then to every PDF in each of the other two clusters using the Kullback and Leibler [30] divergence as the similarity measurement.The results from this analysis indicated that (1) non-normalised PDFs (p(ζ)) are dissimilar within and between clusters (Figure 6h), except PDFs in cluster C, which are strongly similar to each other, and (2) normalised PDFs (p((ζ − µ)/σ)) are strongly similar within and between clusters (Figure 6i).For further discussion see Section 4.
Two other methods were assessed for obtaining a function (or combination of functions) to describe the observed PDFs.This was done because KDE is a non-parametric method that requires prior knowledge of the input timeseries, thus preventing an assessment of correlations between descriptors of the analysed PDFs and environmental parameters.Note that the results presented below were invariant regardless of which PDF (p((ζ − µ)/σ) or p(ζ)) was being modelled.The first method consisted of fitting all PDFs available in the SciPy library [31] to the observed data (96 PDFs were available as of December 2020) and using three metrics to evaluate the fitted PDFs: the sum of squared errors, the Akaike information criterion [32], and the Kullback-Leibler divergence [30].The results from these analyses indicated that none of the best-fit PDFs were able to statistically satisfactorily describe the majority (>50%) of the observed PDFs, regardless of the metric adopted to rank them.The analytical PDF that best fitted the greatest number of observed PDFs (≈35%) was the non-central Student's T (NCT) PDF, which is a complicated four-parameter function [33] and thus is impractical.Examples of the NCT fit to the data are shown in Figure 6e-g (blue lines).Given the poor overall performance of the NCT and given that this PDF cannot describe the multimodal characteristics of the data, this strategy was not pursued further.The number of bins for these histograms was calculated using the Freedman-Diaconis rule [27].Analysis was performed using the Kullback and Leibler [30] divergence (KL-Div) for (h) non-normalised PDFs ((p(ζ)), and (i) normalised (p((ζ − µ)/σ)) to assess PDF similarity within each cluster and between pairs of clusters.In (h,i) lower values indicate more similar Note that the KL-Div has no upper-bound value.
To account for the multimodality observed in the data, a second approach to obtain analytical descriptions of p(ζ) and p((ζ − µ)/σ) was used.In this method, the analysed PDFs were approximated by the sum of a number of Gaussian PDFs, each described individually by their mean (µ), standard deviation (σ), and mixing weight (α), that is, a Gaussian mixture model (GMM) [34].This approach was able to precisely reproduce the observed multimodality in all the shoreline height PDFs (see Figure 5e-g, for example) and the Kolmogorov-Smirnov test confirmed that the PDFs predicted by the GMMs were statistically similar to the observed PDFs at the 95% confidence level (which is expected, given the characteristics of the method).A Gaussian mixture model is, however, a parametric method that requires prior knowledge of the number of mixtures to be used.By using the Akaike Information Criterion [32] as an evaluation metric, a mixture with three components was found to be the optimal value to statistically satisfactorily represent the majority of the observed data (≥90%) whilst maintaining model simplicity.As GMMs provide the parameters µ, σ, α, and the optimal number of mixtures (N mix ), it becomes possible to correlate these parameters to known variables in a predictive way, thus overcoming the major limitation of KDEs.A model using surf zone and offshore parameters to assess the variability observed in p(ζ), assuming that such variability is directly correlated to the optimal number of Gaussian mixtures (N mix ) for each PDF, is discussed in Section 4.

Trough-To-Peak Swash Height PDFs
In the previous section, p(ζ) was observed to be multimodally distributed and, consequently, to deviate from the expected Gaussian PDF.Based on this, it is therefore reasonable to assume that PDFs derived from a swash-by-swash analysis would follow a similar multimodal pattern.In this section, the trough-to-peak swash height (ρ) was used as a proxy variable for such analysis (see Figure 1 for definitions).By applying the wavelet decomposition method detailed in Stringari and Power [35] (see their Appendix A) it was possible to classify each swash event as occurring under infragravity or sea-swell wave dominant forcing.For each timestack, ρ was calculated for each individual swash cycle and compared to the time-varying infragravity and sea-swell energy levels obtained using data from the PT in the surf zone that was closest to the surf-swash boundary in each data run.If energy in the infragravity band was greater than energy in the sea-swell band (that is, E ig (t) > E sw (t)) during the time of swash excursion, the swash event was considered to be dominated by an infragravity wave, otherwise, the swash event was considered to be dominated by a sea-swell wave.Due to the characteristic long-period of infragravity motions, there was no need to account for time offsets between the shoreline and nearest surf zone PT timeseries.Finally, it is worth noting that the approach used here is equivalent to Guza and Thornton's [10] classical approach, only more robust, as it considers both time and frequency domains whereas the classical approach only works in the frequency domain.
As with the analyses shown in Section 3.1, both p(ρ) and (p((ρ − µ)/σ)) in both seaswell and infragravity frequency bands presented great variability, were mostly multimodal (>95%), and were, consequently, significantly statistically different (p < 0.05 using the Kolmogorov-Smirnov test) from both the Rayleigh and Gaussian PDFs previously tested by Hughes et al. [36].The mean PDF for each frequency band in each cluster was also obtained (black lines in Figure 7a-f) and these mean PDFs also deviated from the two theoretical PDFs tested by Hughes et al. [36].It is worth noting, however, that non-normalised PDFs (p(ρ)) in cluster C closely approached but were not statistically similar to a Rayleigh PDF as assessed by the Kolmogorov-Smirnov test (p ≈ 0.05).Further, when all the data in each frequency band were aggregated (Figure 7g,f), the observed PDFs in both the sea-swell and infragravity bands were right-skewed and statistically similar to the Beta PDF, which is partially consistent with Hughes et al.'s [5] results (their Figure 7).Similar results to Figure 7g,f were observed when aggregating the data in each frequency band based on the offshore clusters (not shown).See Section 4 for further discussion on the correlation between offshore parameters and the observed swash height PDFs.
Finally, Figure 7i shows an analysis similar to that of Guza and Thornton's [10] (their Figure 7), which has been widely used in the literature to support the concept of swash saturation in the sea-swell frequency band.For each data run, the trough-to-peak significant swash height (ρ sig ) in each frequency band was calculated and compared to the observed offshore wave height.In contrast to Guza and Thornton's [10] data, the data analysed here showed a correlation between increases in the offshore wave height and increases in the significant trough-to-peak swash height in both the sea-swell and infragravity frequency bands.These results do not, therefore, support the assumption of swash saturation in the sea-swell band.For further discussion on swash saturation, see Section 4. at infragravity (red) and sea-swell (blue) frequency bands for offshore (d) cluster A, (e) cluster B, and (f) cluster C. The black lines in (a-f) show the mean PDF for each frequency band.(g) Normalised trough-to-peak swash height PDF for all data from panels (a-c) in the sea-swell frequency band.(e) Normalised trough-to-peak swash height PDF for all data from panels (a-c) in the infragravity frequency band.In (g,h), the black lines show the KDE approximation to the data, the red dashed lines show the NCT PDF fit to the data, the blue lines show the Gaussian PDF fit to the data, and the green lines show the Beta PDF fit to the data.The number of bins in the histograms was calculated according to the Freedman-Diaconis rule [27].(i) Correlation between offshore wave height and significant trough-to-peak swash height (ρ sig ) for infragravity and sea-swell frequency bands.The coloured swaths in e) show the 95% confidence intervals.The regression lines are ρ sig = 0.48H m0 ∞ + 0.07 (r xy = 0.64, p 0.05) in the infragravity band and ρ sig = 0.17H m0 ∞ + 0.24 (r xy = 0.35, p = 0.02) in the sea-swell band.Note that ρ sig = 0 at H m0 ∞ = 0.

Discussion
In this paper we have presented a novel, data-driven approach for analysing the probability distribution functions of swash motions.Both shoreline height timeseries (ζ) and trough-to-peak swash height (ρ) PDFs were observed to be strongly multimodal, highly variable, and systematically statistically different from expected theoretical PDFs.Previous research [5,6] has shown that p(ζ) can deviate from the expected Gaussian PDF [9] but, to the authors' knowledge, multimodal p(ζ) and p(ρ) have not been previously reported.Given the observed multimodality of shoreline height timeseries PDFs, Gaussian Mixture Models (GMMs) were shown to be the best method to approximate p((ζ − µ)/σ) (e.g., Figure 6), and have the benefit of being easily transferable to model p(ζ), p(ρ), and p((ρ − µ)/σ).Interestingly, when the data were normalised, the shoreline height PDFs (p((ζ − µ)/σ)) collapsed into very similar PDFs, indicating that environmental forcing directly correlates with the shape of the non-normalised PDFs, further supporting the clustering approach based on offshore conditions.The influence of offshore wave conditions on swash motion PDFs is further supported by three other observations: (1) that shoreline height timeseries PDFs in cluster C, which had a narrow offshore wave height band, were very similar to each other regardless of data normalisation (see Figure 6h); (2) that the width of p(ρ) directly increased with increasing offshore height in both frequency bands; and (3) that the mean p(ρ) PDFs in cluster C were only marginally statistically different from the expected Rayleigh PDF (see Figure 7c), which is consistent with the narrow offshore wave height band of this cluster.Ultimately, these results suggest that the swash zone was unsaturated in both infragravity and sea-swell frequency bands for the data analysed here.
The multimodality observed in both shoreline height and trough-to-peak swash height PDFs can theoretically be linked to the observation by Guza and Thornton [10] that energy in different frequency bands will result in distinct density peaks at different swash height elevations.This assumption is consistent with the analysis presented in Section 3.3, in which clear density peaks in p(ρ) are observed at different frequency bands (e.g., note the separation between the mean PDFs in Figure 7d-f).Therefore, the fact that GMMs were the only method that satisfactorily reproduced the observed PDFs may be a direct consequence of this (physical) phenomenon and not necessarily a result of pure statistical inference.In contrast to the observations of Guza and Thornton [10], however, the data analysed here do not support swash saturation in the sea-swell frequency band (see Figure 7f).It is worth noting, however, that Guza and Thornton's (1982) data were from a beach more dissipative than SMB and, therefore, the present results may not be directly comparable to theirs.The results in this paper showed, nonetheless, that as a consequence of the surf zone being unsaturated, the swash zone was also unsaturated, which is supported by the correlations between the offshore clusters and swash motion PDFs.This result is consistent with recent results from Hughes et al. [37], who also showed that swash saturation is not always the case on natural beaches.Future swash zone research should focus on better linking surf and swash zone dynamics with particular emphasis on swash-by-swash approaches that have the potential to elucidate surf-swash interactions (for example, bore-bore capture) and their impact on runup and beach morphodynamics, which are currently poorly understood.
Finally, an investigation into which environmental parameters best explained the variability seen in p(ζ) was conducted.Assuming that the optimal number of mixtures (N mix ) is a direct proxy for the degree of variability and, consequently, the complexity of p(ζ), a model that ranks which environmental parameters best explained N mix was constructed.This analysis provides an initial insight into which variables are most important for describing the trends seen in the data and aims to further support our observations that the observed surf zone dynamics were directly controlling the swash zone.A random forest model was chosen to accomplish this task (see Appendix A for details).Note that, in contrast to Section 3.2, the maximum number of mixtures was not restricted to three and was, therefore, chosen based on the lowest AIC for each 10 min data run (although the N mix is unbounded here, the highest number of optimal mixtures observed was six because models with too large a number of mixtures get heavily penalised by AIC).As inputs for the model, wave heights and periods at four cross-shore locations were used (offshore, Q b = 0.05, Q b = 0.50, and Q b = 0.95).The model was trained one hundred times to account for statistical variability and the feature importance for each variable was obtained.The same approach can be used to predict which parameters best explain µ, σ, and α but this was not attempted here due to the small size of the dataset (see Section 6.4 in Stringari [38] for an attempt at using these model data).The results shown in Figure 8 indicate that a combination of several parameters was responsible for best explaining N mix , with the wave height at the seaward end of the surf zone (H m0 5% ) consistently being the most important parameter for the model.In general, this result agrees with the results from Section 3.1 as N mix directly correlates with surf zone wave heights, which implies that, as a consequence of the surf zone being unsaturated, the swash zone is unsaturated and, therefore, driven by incoming bores with non-negative terminal heights, as previously shown by two recent studies [21,23].As more data become available in the future, models based on the present approach could provide a robust predictor for shoreline statistical properties based solely on known parameters, which will be valuable tools for coastal managers.

Conclusions
In this paper, analysis of swash motions from a gently sloping sandy beach under varying offshore forcing showed that the majority of observed PDFs (both the shoreline height timeseries PDF (p(ζ)) and the trough-to-crest swash height PDF (p(ρ))) were multimodal, which, to our knowledge, has not previously been reported.Hence, Gaussian mixtures were shown to be the best approach to model p((ρ − µ)/σ), which could easily be extended to other swash processes.The parameters of the Gaussian mixtures that described these swash motions were closely correlated to wave conditions in the surf zone and further offshore, which had also not previously been directly shown and is indicative of unsaturated swash.Analysis of the correlation between significant trough-to-peak swash heights (ρ sig ) and offshore wave heights further confirmed unsaturated swash in both shortand long-wave frequency bands.The field data collection and statistical methods used in this paper were shown to overcome the limitations of more traditional methods and allowed for novel statistical descriptions of swash motions.Future research on swash zone dynamics should leverage the recent developments on LiDAR technology to further explore wave-swash interactions and the impact that these phenomena have on shoreline dynamics and beach morphology.The approaches used in this paper, although preliminary (for example, LiDAR uncertainties were not quantified here) and limited by a small dataset, should provide a robust basis for coastal managers when developing improved swash zone models in the future.
The model is then trained using the greedy algorithm know as adaptive training [40].The loss function for the model was the mean absolute error (MAE): where y i is the predicted number of mixtures and x i is the observed number of mixtures.
For the training step, the data were randomly split into training (70%) and testing (30%) datasets and the model was run 100 hundred times for each combination to account for statistical variability.The R 2 for all models always reached values greater than 95%.

Figure 1 .
Figure1.Swash zone definitions.Note that runup is defined relative to the still water level (SWL), the shoreline height (or elevation, ζ) is defined centred on the mean water level (MWL), and the trough-to-peak swash height (ρ) is defined for each swash cycle.Here, each swash cycle was defined by a local minima analysis.

Figure 3 .
Figure 3. Offshore data (spectral significant wave heights, H m0 ∞ , and periods, T m01 ∞ ), for the duration of the field campaign.The filled blue regions indicate periods of simultaneous offshore, Light imaging Detection And Ranging (LiDAR), and pressure transducer (PT) data collection.

Figure 4 .
Figure 4. (a) Raw LiDAR data showing a bore running up the beach profile.(b) Example of LiDAR timestack showing the tracked wave paths (coloured dashed lines) and the resulting time-varying shoreline position (thick red line).The grey scale indicates the bore height (that is, water depth) in relation to the measured profile in (a).(c) Flowchart indicating the methodological steps used in this paper.

Figure 5 .
Figure 5. (a) Example of Q b curve segmentation using k-means for one 10 min data run.(b) Algorithm clustered Q b probability distribution functions (PDFs).The number of bins for these histograms was calculated using the Freedman and Diaconis[27] rule.Correlations between offshore (H m0 ∞ ) and surf zone (H m 0 ) wave heights in (c) the outer surf zone, (d) the mid surf zone, and (e) the inner surf zone.The red swath shows the 95% confidence interval for the linear regression.(f) Comparison between offshore conditions (H m0 ∞ and T m01 ∞ ) and break point wave height (H b ).The crosses show all the measured offshore data, and the dashed lines show the long-term H m0 ∞ and T m01 ∞ averages for the nearest offshore wave buoy (Port Kembla)[28].(g) Analysis of γ sig against H m0 ∞ /h (analogous to Figure11in Power et al.[23]).The coloured swaths show the 95% confidence interval for the regressions.

Figure 6 .
Figure 6.(a) Clustering analysis of offshore wave conditions (H m0 ∞ , T m01 ∞ ).The dashed lines show the long-term H m0 ∞ and T m01 ∞ averages for Port Kembla wave buoy [28].Kernel density estimation (KDE) approximations, mean KDEs, and standard Gaussian PDF (N (0, 1)) for (b) cluster A , (c) cluster B, and (d) cluster C. Representative examples of PDFs for (e) cluster A, (f) cluster B, and (g) cluster C showing the KDE approximations (black), non-central Student's T (NCT) fits (blue), and Gaussian mixture model (GMM) approximations (red).The number of bins for these histograms was calculated using the Freedman-Diaconis rule[27].Analysis was performed using the Kullback and Leibler[30] divergence (KL-Div) for (h) non-normalised PDFs ((p(ζ)), and (i) normalised (p((ζ − µ)/σ)) to assess PDF similarity within each cluster and between pairs of clusters.In (h,i) lower values indicate more similar Note that the KL-Div has no upper-bound value.

Figure 7 .
Figure 7. Non-normalised trough-to-peak swash height PDFs (p(ρ)) at infragravity (red) and sea-swell (blue) frequency bands for offshore (a) cluster A, (b) cluster B, and (c) cluster C. Normalised trough-to-peak swash height PDFs (p((ρ − µ)/σ))at infragravity (red) and sea-swell (blue) frequency bands for offshore (d) cluster A, (e) cluster B, and (f) cluster C. The black lines in (a-f) show the mean PDF for each frequency band.(g) Normalised trough-to-peak swash height PDF for all data from panels (a-c) in the sea-swell frequency band.(e) Normalised trough-to-peak swash height PDF for all data from panels (a-c) in the infragravity frequency band.In (g,h), the black lines show the KDE approximation to the data, the red dashed lines show the NCT PDF fit to the data, the blue lines show the Gaussian PDF fit to the data, and the green lines show the Beta PDF fit to the data.The number of bins in the histograms was calculated according to the Freedman-Diaconis rule[27].(i) Correlation between offshore wave height and significant trough-to-peak swash height (ρ sig ) for infragravity and sea-swell frequency bands.The coloured swaths in e) show the 95% confidence intervals.The regression lines are

Figure 8 .
Figure 8. Feature importance of the random forest model.In this plot, H m0 ∞ and T m01 ∞ are the significant wave height and significant wave period offshore of the surf zone; H m0 5% , T m01 5% , H m0 50% , T m01 50% , H m0 95% , and T m01 95% are the significant wave height and significant wave period at the Q b value indicated by indexes where Q b = 5% is representative of the outer surf zone, Q b = 50% is representative of the mid surf zone, and Q b = 95% is representative of the inner surf zone.