Modeling and Monitoring Erosion of the Leading Edge of Wind Turbine Blades

Abstract: Leading edge surface erosion is an emerging issue in wind turbine blade reliability, causing a reduction in power performance, aerodynamic loads imbalance, increased noise emission, and, ultimately, additional maintenance costs; if left untreated, it compromises the functionality of the blade. In this work, we first propose an empirical spatio-temporal stochastic model for simulating leading edge erosion, to be used in conjunction with aeroelastic simulations, and subsequently present a deep learning model to be trained on simulated data, which aims to monitor leading edge erosion by detecting and classifying the degradation severity. This could help wind farm operators to reduce maintenance costs by planning cleaning and repair activities more efficiently. The main ingredients of the model include a damage process that progresses at random times, across multiple discrete states characterized by a non-homogeneous compound Poisson process, which is used to describe the random and time-dependent degradation of the blade surface, thus implicitly affecting its aerodynamic properties. The model allows for one, or more, zones along the span of the blades to be independently affected by erosion. The proposed model accounts for uncertainties in the local airfoil aerodynamics via parameterization of the lift and drag coefficients’ curves. The proposed model was used to generate a stochastic ensemble of degrading airfoil aerodynamic polars, for use in forward aero-servo-elastic simulations, where we computed the effect of leading edge erosion degradation on the dynamic response of a wind turbine under varying turbulent input inflow conditions. The dynamic response was chosen as a defining output as this relates to the output variable that is most commonly monitored under a structural health monitoring (SHM) regime.
In this context, we further proposed an approach for spatio-temporal dependent diagnostics of leading edge erosion, namely, a deep learning attention-based Transformer, which we modified for classification tasks on slow degradation processes with long sequence multivariate time-series as inputs. We performed multiple sets of numerical experiments, aiming to evaluate the Transformer for diagnostics and assess its limitations. The results revealed Transformers as a potent method for diagnosis of such degradation processes. The attention-based mechanism allows the network to focus on different features at different time intervals for better prediction accuracy, especially for long time-series sequences representing a slow degradation process.


Introduction
This work was motivated by current efforts on development of a novel micro-electromechanical monitoring system based on aerodynamic surface pressure and aero-acoustic measurements for structural health monitoring of wind turbine blades, as part of the Aerosense project [1]. Sensing nodes are distributed along the span of the blade, delivering measurements of local sectional aerodynamic pressure distributions and aero-acoustic and acceleration signals (see the illustration in Figure 1). Assuming the availability of such a monitoring system, we proposed an approach to the modeling and diagnostics of erosion of the leading edge for wind turbine blades. Leading-edge erosion (LEE) is caused by environmental variations in the blade surface, temperature oscillations, moisture, UV radiation, raindrops, sand, and hailstones or other particles impacting the leading edge of the blade. It could also be initiated by surface cracks due to global strain from blade flexing [2], errors during manufacturing in terms of paint deposits, consistency of the gel coat thickness and bonding strength, or initiating surface damage during blade handling in transport or installation. This causes the surface material to be removed from the blade surface, leaving a rough profile that degrades the aerodynamic performance, and, in the long term, if left untreated, it impacts the structural integrity of the blade. Outboard blade regions, where the relative velocity of the flow is higher, may exhibit more prominent erosion levels as impacts with eroding elements are more energetic. Moreover, the outboard sectional regions have a more significant role in power production, which further exacerbates rotor performance degradation. Bogdanoff and Kozin (1985) [3] define cumulative damage as the "irreversible accumulation of damage throughout life that ultimately leads to failure." 
Leading edge erosion of a wind turbine blade represents such a process, albeit at slow degradation rates, with effects slow to manifest and even harder to detect early on. Most diagnostic techniques today rely on direct visual inspection [4], image processing [5], and statistical analysis, e.g., regression or clustering of supervisory control and data acquisition (SCADA) output, such as blade pitch, rotor speed, or electrical power [6,7]. Visual inspection and image acquisition generally imply occasional stoppage of the turbines, resulting in loss of power production. More importantly, they give a local (both spatial and temporal) snapshot of the erosion condition without a direct link to the historical progression of the degradation and how it relates to environmental and operational conditions. Even though it is better positioned to discriminate the environmental and operational conditions versus the degradation process, long-term SCADA-based statistical analysis invariably does not possess the necessary resolution to determine accurately the severity and spatial extent of erosion on the blades. A gap therefore exists in the field of spatio-temporal modeling and diagnostics of leading edge erosion for wind turbines rotor blades. In this work, we were particularly interested in leveraging time series emanating from the novel Aerosense-based aerodynamic surface pressure monitoring system for wind turbines' blades. The aim was to (1) detect the occurrence of erosion and (2) classify the severity (intensity) of erosion. This could be regarded as either the inverse problem of reconstruction for the lift and drag coefficients or a problem of finding the optimal discriminants for erosion versus other effects from time-series response signals. To tackle this problem and build meaningful diagnostics models, realistic time-series data were required for training. 
Our options were either to collect multi-year field data, where the rotor blades are known to undergo leading edge erosion, or to develop a numerical model of erosion coupled to an aeroelastic simulator. As data availability for such systems is often limited and restricted by data-sharing protocols, we here opted for the route of numerical modeling and simulations, with the additional benefit of delivering a model that can be used for simulating such slow leading edge erosion degradation processes.
Towards accomplishing the goal of diagnostics and inference of erosion of the leading edge of wind turbine blades, this work made the following contributions:
• We developed a stochastic spatio-temporal erosion model of the leading edge of wind turbine blades, which is characterized by a non-homogeneous compound Poisson process across discrete states, embedded in a generator of a stochastic ensemble of degrading airfoil aerodynamic polars for use in forward aero-servo-elastic simulations. The coupled model is able to compute the aeroelastic non-stationary response of a wind turbine, thus reflecting its behavior under the effect of leading edge erosion and varying turbulent input inflow conditions over a long period of degradation.
• We adapted a deep-learning multivariate time-series-based Transformer, which employs attention mechanisms to detect and classify long-term and slow leading edge erosion processes assuming the availability of on-blade sectional monitoring data, under short- and long-term wind inflow uncertainties and aerodynamic uncertainties on the lift and drag coefficients of the airfoil sections along the span of the blade.
A graphical abstract of this research is presented in Figure 2. The remainder of this article is organized as follows. In Section 2, we offer a review of the state of the art in the modeling and diagnostics of leading-edge erosion. In Section 3, we present the details of the spatio-temporal stochastic model for leading-edge erosion. In Section 4, we present the uncertainty modeling and the aeroelastic simulations setup. In Section 5, we elaborate the theory of the Transformer model for diagnostics and inference. In Section 6, we illustrate the novelty and principles of the proposed framework on a simulations-based application and further provide discussions and list the limitations in Section 7.

Review of Prior Art
We review modeling approaches of leading edge erosion in conjunction with aeroelastic simulations and subsequently assess the state-of-the-art in diagnostics methods.

Modelling Leading-Edge Erosion
Several studies have quantified the impact of leading-edge erosion on the aerodynamic and aeroelastic performance of wind turbines.
The general thread in the first set of studies amounts to describing the constitutive laws, physical processes, and instigators at the micro-scale resulting in surface erosion/degradation and the various stages and types of erosion severity, e.g., [2,8-11]. Such models incorporate the effect of wind speed, air density, particle size, incubation time, and erosion intensity due to rainfall, snowfall, sea spray, and fog. It is unclear, however, how the micro-scale models couple to the macro-scale aspects of leading-edge erosion and, consequently, to the effect on turbine performance.
The second set of studies quantified the impact on wind turbine performance via computational approaches (aeroelastic or computational fluid dynamics simulations) or extensive experimental campaigns in wind tunnels, coupled to uncertainty propagation schemes, e.g., [12-17]. The limitation of all these works is that they do not allow for multiple zones along the span of the blades to be independently affected by stochastic erosion processes.
Bortolotti et al. [18] proposed a simple stochastic model to describe the extent of blade span-wise degradation due to leading-edge erosion via a single factor that was assumed to follow a truncated beta distribution. However, the long-term temporal degradation process of the blade's leading-edge was not accounted for. Dimitrov [19] describes a first attempt to introduce temporal progression to the damage growth for leading-edge erosion due to rainfall events indirectly via loss of annual energy production.
To the best of our knowledge, the existing literature lacks comprehensive models that combine stochastic, spatial, and temporal elements to describe the erosion process of the leading edge of wind turbine blades and its effects on their performance over long degradation time horizons.

Diagnosing Leading Edge Erosion
Different approaches exist to diagnose the extent of leading-edge erosion on wind turbine blades. Traditionally, manual human inspection was used, although modern visual approaches are being developed to utilize drones paired with computer vision [5]. Our aim was to use data-driven continuous monitoring methods to perform diagnostics without resorting to turbine shutdown. We review different machine learning and probabilistic approaches that have been put forth and which, given data recorded by sensors placed on the wind turbine (e.g., novel aerodynamic pressure sensors, SCADA, accelerometers, strain gauges, etc.), would be able to estimate the extent of leading-edge erosion on wind turbine blades. In this context, the diagnostics task can be described as a multivariate time-series (MTS) classification/clustering problem; an algorithm is tasked with predicting the degradation status/class of a blade by analyzing multiple time-dependent signals. We therefore included MTS methods, which have not been specifically applied to wind turbine blades in the past, allowing us to explore promising machine learning approaches from diverse fields unrelated to structural health monitoring: activity recognition, anomaly detection, sound analysis, human health signal (EEG and ECG), interpretation, etc. If the forward model of the degradation process is known, the diagnostics problem becomes a supervised classification task, seeing that each MTS data point is associated with its degradation ground truth. Data-driven MTS classification is a field that is currently generating a significant amount of attention, with many promising methods that have been recently developed [20]. Within these approaches, we can differentiate between those that use the raw time-series data [21], known as end-to-end approaches, and those that first extract features that are then passed to a classification algorithm [22].
Extracting features from time-series data can be accomplished in many ways, with domain-dependent feature engineering or statistical methods often being the first choices. Algorithms based on empirical mode decomposition (EMD) represent MTS as a superposition of simpler well-behaved components called intrinsic mode functions (IMFs). EMD approaches are suitable for damage detection and degradation state recognition in the analysis of nonlinear and non-stationary signals. However, they have shown low performance in highly stochastic signals with spike pulses and jumps [23,24]. Distance-based metrics, which evaluate the similarity between time series, form another popular class of conventional feature construction methods [25]. Alternatively, in [26] and [27], feature vectors are built from MTS data by passing the inputs through numerous random convolutional kernels. This approach, named ROCKET, has a relatively low computational cost, yet, when combined with a standard ridge classifier, it achieves state-of-the-art accuracy on the UCR time-series archive dataset [28]. Other methods turn the time-series inputs into feature images [29], which are then fed into convolutional neural networks (CNNs) [30] designed for image classification, thus taking advantage of the recent advances in computer vision performance. Although feature engineering and feature extraction can potentially lead to accurate diagnostics with good explainability, we were interested in implementing end-to-end methods that use raw MTS data. This is because feature-based methods often require considerable human intervention, which limits generalization. Robustness is another potential downside: a change in the input, due to sensor malfunction for instance, might require re-selecting relevant features, whereas end-to-end approaches can accommodate such changes, provided they are represented in the training data.
Conventional ("shallow") data-driven classification methods do not often use untransformed MTS as inputs, due to the difficulty associated with extracting relevant information from noisy, complex data. In contrast, deep-learning methods often excel at learning hidden discriminative features, making them particularly well suited for end-to-end approaches [31]. CNNs have been applied to end-to-end MTS classification in the context of prognostic health management [32], surgeon skill classification [33], and other tasks [34]. CNNs are limited by their receptive field, making their use limited to relatively short time-series. Recurrent neural networks (RNNs) [35] are, by design, effective at solving tasks relating to sequential data. Long short-term memory (LSTM) networks [36], currently the most commonly used type of RNNs, have been used to successfully classify clinical measurements [37] and action recognition tasks [38]. However, RNNs are also limited to relatively short input sequences due to memory requirements and are difficult to train. In recent years, Transformers [39] have attracted a lot of attention in the Natural Language Processing (NLP) community and have mostly replaced RNNs. These neural networks leverage the power of attention mechanisms to learn the relevant interdependencies within sequences. Transformers have increasingly been applied in other domains, such as computer vision [40], time-series forecasting [41], and multivariate time-series classification [42]. Graph neural networks [43], a class of deep data-driven methods that utilize graph-structured data, have also been applied to end-to-end MTS classification [44]. This is done by first constructing an adjacency matrix from the untransformed MTS data. In [45], spatio-temporal graph Transformer neural networks (GTNN) were implemented in order to effectively capture both dynamic spatial and temporal trends. 
We could envisage using a GTNN model to leverage temporal and spatial correlations in the degrading sections of the three blades on a wind turbine.
As previously mentioned, deep end-to-end models are, in general, better suited for generalization than feature-extraction approaches. Moreover, they can be further optimized for transferability [46], for instance by applying transfer learning methods, such as using domain adaption layers [47].
In the context of data-driven wind turbine blade condition monitoring, a number of methods have been reported in the literature [48]. In [49], vibrational frequency response functions were used as input features to neural networks, in order to detect changes in the structural health of a blade. LSTMs were trained to classify faults, based on multivariate sensor recordings of a model wind turbine test rig in [50], while [51] implemented a CNN to detect possible wind turbine blade breakages based on SCADA data. In [52], support vector machines were trained on features built from active acoustic recordings to diagnose healthy and damaged states of a model wind turbine. Another feature-extraction approach was combined with decision trees in [53] to diagnose different fault scenarios for a model wind turbine. Finally, ice buildup on wind turbine blades was predicted with random forests trained on SCADA data in [54], while Weijtjens et al. [55] used Gaussian processes trained on data emanating from low-cost tower-mounted sensors to do so.
Considering their extensive use in other domains, Transformers bear great potential for SHM damage detection. Some researchers have made use of attention mechanisms similar to those found in Transformers for bearing remaining useful life prediction [56], but to the best of the authors' knowledge, only in [57] have proper Transformer models been used in this context. In this work, however, an end-to-end approach was not used; instead, the model was fed a pre-processed input composed of time series features such as Mel frequency Cepstral coefficients (MFCCs) and short-time Fourier transformation (STFT). We chose Transformers as the base of our end-to-end approach for diagnostics. However, we implemented additional modifications so as to be able to handle multivariate sequences on the order of several 10s of thousands of data points while still being able to track long-term degradation progression, reflected by newly sampled sequences with possible discontinuities. Few, if any, of the existing MTS learning models discussed above are capable of this.

Non-Homogeneous Compound Poisson Process
Erosion initiates on the blades in the form of pits near the leading edge on the pressure side. These pits develop gradually over time into gouges, then steadily grow in their size and density to coalesce as delaminations. Gaudern [9] identified five categories of erosion severity classes. Sareen et al. [8] analyzed the effect of leading-edge erosion on the aerodynamic performance of the DU 96-W-180 airfoil and categorized erosion, from pits to delamination, into different types and stages (9 in total), as shown in Table 1. Each combination of type and stage of erosion has associated with it aerodynamic polar curves ($C_L$, $C_D$, $C_M$) derived via measurements in a wind tunnel [8,9,12]. Throughout the article we adopt a similar labeling of the leading-edge erosion severity, as shown in Table 2. We proposed a two-step empirical spatio-temporal stochastic model of LEE.
Step 1 of the model includes a damage process that occurs at random times, emulating the nonstationary time of arrival of degradation and its magnitude, with 10 discrete states based on the non-homogeneous compound Poisson process (NHCPP) to describe the random degradation of the blade surface and implicitly its aerodynamic properties. In step 2 of the model, the local leading edge degradation manifests by physical changes in the aerodynamic polars. This step generates an ensemble of stochastic aerodynamic polars for a given erosion severity of LEE in NHCPP by parameterizing the lift and drag coefficients curves [58].
A non-homogeneous Poisson process (NHPP) is a stochastic degradation process over a finite time horizon. The NHPP $\{N(t), t \geq 0\}$ has rate of arrivals $\lambda(t)$ on the time interval $t \in [0, T]$. A sample path of an NHCPP emulates two main properties of the proposed LEE model: (1) the random times of arrival, $0 \leq \tau_1 \leq \tau_2 \leq \ldots$, which are the time instances corresponding to a change in the degradation level of an airfoil's aerodynamic properties (shocks), and (2) the magnitude of change $\{Y_n, n \geq 1\}$ in the degradation level of an airfoil's aerodynamic properties at the nth time of arrival. In practice, for a probabilistic assessment, many NHPP paths are sampled. The model incorporates one or more zones (adjacent or disjoint) along the span of the blade(s) affected by erosion, but they are assumed to undergo independent degradation processes. The NHPP modifies the definition of a Poisson process so that it can incorporate a time-dependent rate. The NHPP is suitable because the rate of LEE is not constant in time: on short time scales (hourly), it depends on the rotor speed due to variations in the wind speed, while, on longer time scales, it depends on seasonal variations such as rain intensity and changes in environmental temperatures. We thus inferred that the rate of occurrence of shocks is periodic, as shown in Figure 3. In addition, LEE consists of a finite number of stages resulting in varied and nonstationary incremental severities on the airfoils' aerodynamic polars. It is in fact pragmatic to assume that the severity of LEE has a finite number of stages, because these could be categorized (e.g., via field observations) and their impact on aerodynamic properties emulated and quantified via measurements in wind tunnels, as demonstrated in [8].
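The impact of LEE on the lift and drag coefficients of the NACA 64-618 airfoil [12,15,59] for 10 severity classes is shown in Figure 4a.

[Figure: rate of occurrence, λ(t)]

The NHPP is similar to an ordinary Poisson process, except that the average rate of arrivals per period λ(t) is allowed to vary with time (Figure 3). A classic definition of the NHPP is formulated as follows [60]. Let $\lambda(t): [0, \infty) \to [0, \infty)$ be an integrable function. The counting process $\{N(t), t \geq 0\}$ is an NHPP with rate $\lambda(t)$ if $N(0) = 0$, $N(t)$ has independent increments, and, for any $t \geq 0$:

$$P\big(N(t+s) - N(t) = 0\big) = 1 - \lambda(t)\,s + o(s),$$
$$P\big(N(t+s) - N(t) = 1\big) = \lambda(t)\,s + o(s),$$
$$P\big(N(t+s) - N(t) \geq 2\big) = o(s),$$

where s is a very short interval of time. Here, o(s) is a function that is negligible compared to s, as s → 0. Assuming o(s) = g(s), then:

$$\lim_{s \to 0} \frac{g(s)}{s} = 0.$$

One approach for generating an NHPP is the "process analogue" of acceptance-rejection called thinning, which is the scheme we adopted in our model. The procedure is as follows: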

1. Choose $\lambda_u$ such that $\lambda(t) \leq \lambda_u$ for all $t \in [0, T]$.
2. Initialize t = 0 and I = 0.
3. Generate $U_1 \sim \mathcal{U}(0, 1)$.
4. Set $t = t - \ln(U_1)/\lambda_u$.
5. If t > T, stop; else go to next.
6. Generate $U_2 \sim \mathcal{U}(0, 1)$.
7. If $U_2 \leq \lambda(t)/\lambda_u$, set I = I + 1 and S(I) = t.
8. Go to Step 3.

In the above procedure, the ratio $\lambda(t)/\lambda_u$ is the thinning probability. In our model, $\lambda_u$ was chosen as the maximum of the rate function λ(t) (Figure 3). For long degradation periods, it might be necessary to break [0, T] into small intervals and pick a $\lambda_u$ for each interval in order to avoid a high rejection rate. This is known as piece-wise thinning. At the end of this procedure, the event times S(I) (arrival times) and the counting process are yielded according to the non-homogeneous process. Finally, for an NHPP with rate λ(t), the number of arrivals in any interval is a Poisson random variable; however, its parameter can depend on the location of the interval. More specifically, we can write:

$$N(t+s) - N(t) \sim \mathrm{Poisson}\left(\int_t^{t+s} \lambda(\xi)\, d\xi\right). \qquad (1)$$

Introducing the compounding effect to the NHPP yields the non-homogeneous compound Poisson process (NHCPP). The compound Poisson process replaces the unit jumps of a Poisson process with random jump sizes [61]. The jump at the nth arrival time has magnitude $Y_n$, $n \geq 1$, attached to it. The successive magnitudes $Y = (Y_1, Y_2, \ldots)$ are assumed to be independent, identically distributed, real-valued random variables, and are also assumed independent of the underlying Poisson counting process of shocks. The compound Poisson process associated with the given Poisson process $\{N(t), t \geq 0\}$ and the sequence Y is the stochastic process $Z = \{Z_t, t \geq 0\}$, where:

$$Z_t = \sum_{n=1}^{N(t)} Y_n. \qquad (2)$$

The damage is always positive, meaning that $P(Y_n \geq 0) = 1$, and the damage accumulates additively according to $Z_t$ in Equation (2). In our model, $Z_t$ represents the cumulative amount of erosion incurred on the blade section at time t, after N(t) shocks.
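The thinning procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the sinusoidal rate function, its parameters (`base`, `amplitude`, `period`), and the function names are assumptions chosen only to mimic a periodic rate of occurrence in events per month.

```python
import math
import random


def rate(t, base=0.5, amplitude=0.4, period=12.0):
    """Hypothetical periodic shock rate lambda(t), in events per month."""
    return base + amplitude * math.sin(2.0 * math.pi * t / period)


def sample_nhpp_thinning(rate_fn, horizon, rate_max, rng):
    """Return sorted NHPP arrival times on [0, horizon] obtained by thinning.

    rate_max must bound rate_fn from above on the whole interval.
    """
    arrivals = []
    t = 0.0
    while True:
        # Candidate arrival from a homogeneous Poisson process of rate rate_max.
        t -= math.log(1.0 - rng.random()) / rate_max
        if t > horizon:
            return arrivals
        # Accept the candidate with the thinning probability rate(t) / rate_max.
        if rng.random() <= rate_fn(t) / rate_max:
            arrivals.append(t)


rng = random.Random(0)
arrivals = sample_nhpp_thinning(rate, horizon=240.0, rate_max=0.9, rng=rng)
print(len(arrivals), "shocks over 240 months")
```

With these illustrative values, the expected number of shocks equals the integral of the rate over the horizon, 0.5 × 240 = 120, since the sinusoidal term averages out over whole periods.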
In our model, we categorized 10 erosion severity levels reflected in the lift and drag coefficients, as shown in Figure 4a. The final level of degradation (level 9 of erosion severity) leads to the loss of function of the blade, not in the sense of a catastrophic loss (e.g., rupture) but rather implying that the blade loses its ability to generate lift efficiently. We adopted a classical approach whereby $Y_1, Y_2, \ldots, Y_n$ are each exponentially distributed according to the density:

$$f_Y(y) = \frac{1}{\mu} \exp\left(-\frac{y}{\mu}\right), \quad y \geq 0, \qquad (3)$$

where µ is the mean jump (shock) magnitude. It follows that the sum $Y_1 + Y_2 + \ldots + Y_n$ follows a gamma distribution with shape n and scale µ. For details, we refer the interested reader to [62,63]. Formally, we adopted a truncated exponential distribution for $Y_n$ in Equation (3), the logic being that large jumps in erosion severity within a single shock event are physically hard to justify. As a result, the upper truncation limit for the shock was set to a magnitude of 4 in our model. Furthermore, as more events arrived in [0, T] and damage was compounded, $Y_1 + Y_2 + \ldots$, we adjusted the upper truncation limit for the shock magnitude from its initial value of 4. Finally, sampled jumps of magnitude $Y_n > 0$ for $\tau_n < T$ were not allowed once the compounded damage had reached its highest severity class ($Z_{max} = 9$) before the end of the time horizon T. This is reasonable, as any additional shock cannot incur additional erosion once no easily eroded surface material remains.
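To make the compounding step concrete, the following sketch accumulates truncated exponential jumps along a given sequence of arrival times. It rests on several assumptions that are not taken from the study: the arrival times are a fixed illustrative list, the mean jump is an arbitrary value, and the truncation limit is assumed to shrink to min(4, 9 − Z) so that the cumulative damage never exceeds the top severity class.

```python
import math
import random

Z_MAX = 9.0     # highest compounded severity class stated in the text
JUMP_CAP = 4.0  # upper truncation limit for a single shock stated in the text


def truncated_exponential(mean, upper, rng):
    """Inverse-CDF sample of an exponential (given mean) truncated to [0, upper]."""
    u = rng.random()
    return -mean * math.log(1.0 - u * (1.0 - math.exp(-upper / mean)))


def compound_path(arrival_times, mean_jump, rng):
    """Return (arrival time, cumulative damage Z_t) pairs for one path."""
    z = 0.0
    path = []
    for t in arrival_times:
        if z >= Z_MAX:
            break  # no further damage once the top class is reached
        # Assumption: the truncation limit shrinks as damage accumulates.
        upper = min(JUMP_CAP, Z_MAX - z)
        z = min(Z_MAX, z + truncated_exponential(mean_jump, upper, rng))
        path.append((t, z))
    return path


rng = random.Random(42)
arrival_times = [6.0 * k for k in range(1, 41)]  # hypothetical shocks, in months
path = compound_path(arrival_times, mean_jump=0.5, rng=rng)
for t, z in path[:5]:
    print(f"t = {t:5.1f} months, Z_t = {z:.2f}")
```

Whether the cumulative damage is then rounded to one of the 10 discrete severity classes, and how, is left open here.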
Finally, we introduced wind speed and blade location dependencies to the model. For a given time of arrival of a shock, our model associated higher damage (1) in the event when the shock is concurrent with higher inflow wind speeds due to higher momentum (and thus kinetic energy) of the impacting particles anywhere on the blade and (2) in the outboard sections of the blade due to the local higher relative speed of the flow. We required the mean jump (shock) magnitude µ to comprise a linearly increasing function of blade radial location and quadratic with wind speed. The final outcome was thus an NHCPP degradation path as shown in Figure 5, where five zones along the span of blade 1 were chosen to undergo leading-edge degradation. Even though the exact pattern and location of surface changes during operation is a random process, we observed that our model tended to provide higher damage rates of the leading edge on the outboard sections of the blade compared to the inboard sections.
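The dependence of the mean jump magnitude on blade location and wind speed can be illustrated with a small function. The functional form below (an affine term in the normalized radial position multiplied by the squared wind speed ratio) and all coefficients (`mu_root`, `mu_slope`, `u_ref`) are hypothetical; the text only states that µ increases linearly with radial location and quadratically with wind speed.

```python
def mean_jump_magnitude(radial_fraction, wind_speed,
                        mu_root=0.2, mu_slope=0.6, u_ref=10.0):
    """Hypothetical mean shock magnitude mu.

    Linear in the radial position (0 = root, 1 = tip) and quadratic in the
    inflow wind speed, normalized by a reference speed u_ref in m/s.
    """
    radial_term = mu_root + mu_slope * radial_fraction
    return radial_term * (wind_speed / u_ref) ** 2


zones = [0.5, 0.6, 0.7, 0.8, 0.9]  # five illustrative span-wise zones
for r in zones:
    mu = mean_jump_magnitude(r, wind_speed=12.0)
    print(f"r/R = {r:.1f}: mu = {mu:.3f}")
```

Under this form, outboard zones receive larger expected jumps at any given wind speed, which is consistent with the higher outboard damage rates reported above.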

Stochastic Aerodynamic Polars Model
In this section, we generated a stochastic ensemble of degrading airfoil aerodynamic polars for use in forward aero-servo-elastic simulations. The outputs of the NHCPP were the degradation paths describing different degradation zones along the span of the rotor blades. Each arriving shock at a given time along these paths corresponds to a jump of a certain magnitude in the leading edge erosion severity. A jump in the erosion severity corresponds implicitly to a degradation of the aerodynamic properties of the airfoils affected by erosion along the span of the blade(s). However, uncertainties in the degrading airfoil aerodynamics lift and drag coefficients curves still remained to be accounted for. Our proposed LEE model hence takes into account the inherent uncertainty in airfoil static lift and drag coefficients during the erosion period via a stochastic model of static airfoil lift and drag polar curves. The details of this model can be found in Abdallah et al. [58], with a short summary presented herein for brevity.
Lift and drag coefficients are affected by several sources of uncertainty: uncertainties related to wind tunnel airfoil testing, 3D-flow correction uncertainties, surface-roughness uncertainties, uncertainties related to deformations in the blade geometry (during manufacturing and handling or deflections induced under load), uncertainties stemming from Reynolds number effects, uncertainties related to post-stall extrapolation of airfoil characteristics, and finally uncertainties resulting from prototype testing. The joint distribution of all these random variables cannot be quantified, and, therefore, a stochastic model was used as suggested in [58]. In this model, the lift coefficient was parametrized by the slope in the linear range $\partial C_L / \partial \alpha$, the point of maximum lift ($\mathrm{AoA}_{max}$, $C_{L,max}$), the point indicating the start of the trailing edge separation ($\mathrm{AoA}_{TES}$, $C_{L,TES}$), and the point where the stall recovery is initiated ($\mathrm{AoA}_{SR}$, $C_{L,SR}$). This model also includes the parametrization of the drag coefficient through a bias at low angles of attack and by the maximum drag coefficient point at AoA = ±90°, where the largest variation was observed. Figure 6 shows samples of the stochastic lift and drag coefficients for the NACA 64-618 airfoil for an erosion of severity class 5. Based on these perturbations, modified $C_L$ and $C_D$ curves were produced that preserved the main characteristics of the reference curves but yielded magnitudes and features that reflect possible uncertainties related to the aerodynamic properties of the airfoil section under actual conditions.
Putting it all together, the NHCPP coupled to the generator of stochastic ensembles of aerodynamic polars yields realistic degrading aerodynamic polars over the degradation time horizon, for every degrading zone and airfoil section on the rotor blades. For example, Figure 7a shows the degrading $C_L$ over a 20-year period (240 months) for the NACA 64-618 airfoil, and Figure 7b represents the corresponding degrading maximum lift coefficient. Figure 7c,d show the degrading angle of attack at maximum lift coefficient and the degrading lift coefficient at α = 5°, respectively.

Overview of the Algorithm
The main ingredients of the model included (1) a spatio-temporal stochastic damage process and (2) a generator of a stochastic ensemble of degrading airfoil aerodynamic polars. The algorithm is summarized in Algorithm 1.

Uncertainty Modeling and Aeroelastic Simulations
Two categories of uncertain random variables (RV) were considered: wind inflow and aerodynamic effects. We elaborated on the aerodynamic uncertainties in the previous section. In this section, we detail the wind inflow uncertainties and the aeroelastic simulations setup. The conventional wind turbine structural damage computation process utilizes aeroelastic load simulations for wind turbines under normal and extreme operation and wind inflow turbulence conditions, whereby the extreme and fatigue load cycles determined over a short period of time at each mean wind speed are extrapolated over the full expected lifetime. Such a process does not consider the changes in inflow and environmental changes over long time periods, which are indeed relevant to the LEE process.

Stochastic Models of Inflow RVs
In turbulent inflow conditions, the structural dynamic response of wind turbines is highly influenced by factors such as average wind speed, turbulence intensity, wind shear, and skewness of the inflow. Taking these influences into account, our simulations were set up with the following RVs: turbulence intensity, T i , mean wind speed, U, wind shear, α, horizontal inflow skewness, Ψ, and the vertical inflow skewness, Σ.
A Weibull distribution describes the mean wind speed, truncated to (4-25) m/s, with the following parameters:

The normal turbulence model, defined in the wind turbine design standard [64], describes the conditional dependence between the mean wind velocity $U$ and the turbulence $\sigma_U$. We chose to utilize $I_{ref} = 0.16$ as the value for the reference ambient turbulence intensity, which was the expected value of the turbulence intensity at 15 m/s. The local statistical moments of $\sigma_U \sim \mathcal{LN}(\mu_{\sigma_U}, \sigma^2_{\sigma_U})$ determine this conditional dependency as:

As a result, we can express the turbulence intensity as $T_i = \sigma_U / U$. A power-law relationship describes the wind profile, by defining the average wind velocity $u$ at height $Z$ above ground with respect to the reference mean wind speed $u_h$, measured at hub height $Z_h$:

$$u(Z) = u_h \left(\frac{Z}{Z_h}\right)^{\alpha},$$

where $\alpha$ is the shear exponent. The conditional dependence between the wind shear exponent $\alpha \sim \mathcal{N}(\mu_\alpha, \sigma^2_\alpha)$ and the mean wind speed $U$ is expressed as [65]:

We introduced a custom conditional dependence between the average wind velocity $U$ and the inflow horizontal skewness $\Psi$, truncated to (−11, 11) deg., $\Psi \sim \mathcal{N}(\mu_\Psi, \sigma^2_\Psi)$:

The conditional dependence between the average wind velocity $U$, the turbulence intensity $T_i$, and the vertical inflow skewness $\Sigma$, truncated to (−6, 6) deg., was proposed such that:

The blade leading-edge degradation scenario occurred over a 240-month period; consequently, we introduced an additional element of uncertainty by allowing the inflow conditions to vary in time, namely, the mean wind speed $E(U)$, the shape parameter $K_U$ of the Weibull distribution, and the reference ambient turbulence intensity $I_{ref}$, as shown in the example in Figure 8. By generating samples with Sobol quasi-random sequences, the wind inflow and aerodynamic RVs were sampled uniformly over the unit hypercube, i.e., as evenly as possible over the multi-dimensional input space [66].
Over the degradation period, 1200 joint samples of the wind inflow RVs were drawn, as shown in Figures 9 and 10. We generated a realization of the inflow turbulent wind field time-series by sampling $U$, $\sigma_U$, $\alpha$, $\Psi$, and $\Sigma$, which, together with a sample of the stochastic lift and drag coefficients, formed the input to the OpenFAST aero-servo-elastic simulator. Using this simulator, we could then compute the aeroelastic response of the wind turbine structure under continuous erosion of the blade's leading edge.
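To make the mapping from a point of the unit hypercube to physical inflow values more concrete, the following minimal sketch converts a uniform sample into a truncated Weibull mean wind speed by inverse-CDF sampling, and evaluates the power-law profile and the turbulence intensity ratio. The Weibull scale and shape values, the shear exponent, and the function names are hypothetical; the uniform sample `q` stands in for one coordinate of a Sobol point.

```python
import math


def truncated_weibull(q, scale, shape, low=4.0, high=25.0):
    """Map a uniform sample q in [0, 1] to a Weibull wind speed truncated to [low, high]."""
    def cdf(x):
        return 1.0 - math.exp(-((x / scale) ** shape))

    p = cdf(low) + q * (cdf(high) - cdf(low))  # rescale q into the admissible CDF band
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


def power_law_speed(z, u_hub, alpha, z_hub=90.0):
    """Mean wind speed at height z from the power-law profile (90 m hub height)."""
    return u_hub * (z / z_hub) ** alpha


def turbulence_intensity(sigma_u, u):
    """Turbulence intensity as the ratio of turbulence to mean wind speed."""
    return sigma_u / u


# Hypothetical Weibull parameters and shear exponent, for illustration only.
u = truncated_weibull(q=0.5, scale=10.0, shape=2.0)
u_top = power_law_speed(z=90.0 + 63.0, u_hub=u, alpha=0.2)  # top of a 126 m rotor
```

Because the inverse CDF is monotonic, evenly spread quasi-random points in the unit hypercube translate into evenly spread quantiles of each inflow variable.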

Setup of the Aero-Servo-Elastic Simulations
Our coupled aero-servo-elastic simulations with the proposed NHCPP leading-edge degradation model were based on the OpenFAST simulator. OpenFAST [67] is a coupled aero-hydro-servo-elastic analysis tool for modeling wind turbines, primarily used to run nonlinear time-domain simulations. We elected to simulate in OpenFAST the NREL reference 5 MW wind turbine [68], which is a well-documented three-bladed upwind horizontal-axis wind turbine, with a rotor diameter of 126 m and a 90 m hub height. The layout of the airfoils along the span of the blade is shown in Figure 11. Using blade element momentum theory as the basis for the aerodynamic model, the OpenFAST simulator includes dynamic stall, skewed inflow, and generalized dynamic wake. Aerodynamic forces are computed by interpolating from lookup tables composed of provided aerodynamic polars. In our case, the aerodynamic polars correspond to the degrading lift and drag coefficients generated in the NHCPP, such as the one shown in Figure 7a. The Kaimal turbulence model [69] was utilized in OpenFAST to compute the stochastic input wind field. In our setup, we assigned a maximum of five independent zones that undergo leading-edge erosion per blade. The locations of the degradation zones along the span of the blades, normalized by the blade length, were:

Retained Output from Aero-Servo-Elastic Simulations
As mentioned in Section 1, this research was motivated by the development of an aero-acoustic measurement system that can be used for SHM tasks. In this simulated setup, we assumed one measurement node at 0.96 of the blade length, within Eroding Zone 5 (near the tip of the blade). From our aero-servo-elastic simulations, we elected to retain those output signals that are expected to emanate from the aero-acoustic measurement nodes, namely, lift and drag coefficients, angle of attack, and wind speed, as listed in Table 3.

Table 3. Output sensors of the aero-servo-elastic simulations retained for diagnostics.

| Sensor Name | Description |
|---|---|
| Time | Time steps of the simulations |
| Wind1VelX | X-direction wind velocity at hub-height |
| B1N9Cl | Lift force coefficient at Blade 1, Aerosense Node at 0.96R |
| B1N9Cd | Drag force coefficient at Blade 1, Aerosense Node at 0.96R |
| B1N9Alpha | Angle of attack at Blade 1, Aerosense Node at 0.96R |

Diagnosing LEE via Transformers
The simulation pipeline described above allowed us to generate data that replicate the sensory output of Aerosense nodes. Thus, we were able to produce datasets to train a deep learning method, with the end goal being to detect and estimate the extent of leading-edge erosion on wind turbine blades. We chose to implement a modified version of a popular class of sequential deep learning models: Transformers [39].
To motivate the diagnostics task, we considered the case where $C_L$ time-series signals emanate from two sections along the span of the blade. Figure 13a shows a scatter plot of time series of $C_L$ at sections 0.96R vs. 0.75R of the blade for wind speeds varying between 6 and 16 m/s and no LEE (clean blade). Figure 13b shows a similar plot for a fixed wind speed of 11 m/s and evolving severity of leading-edge erosion at section 0.96R. An important property is that the relational dependency of the $C_L$ between the two sections will change over time either due to changes in the severity of the leading-edge erosion or due to short-term variation in inflow and operating conditions. The problem is thus to detect leading-edge erosion and its severity, which could be regarded as a problem of finding the optimal discriminants for erosion versus other effects (inflow, operational conditions, aerodynamic uncertainties, etc.) from multivariate time-series response signals. The following are aspects we built into the diagnostics method for LEE classification, taking into account the circumstances of an operating wind turbine in the field:
• Labeled data for LEE are scarce and hard to acquire; as a result, any method must be designed not to suffer from over-fitting under scarce labeled data availability.
• Diagnostics shall be done with remote streaming of sensor data. Human intervention and turbine down-time should be minimized.
• The data used in the diagnostics method are intended to emulate the sensory output of a single MEMS-based aerodynamic and aero-acoustic measurement node positioned in proximity to the tip of the wind turbine blade.
• The method shall be capable of ingesting 10-min-long multivariate time-series (industry standard SCADA recording length), sampled at 100 Hz, resulting in 60,000 data-point-long sequences. A sampling rate of 100 Hz may seem excessive, but the aim is to capture even small turbulence scales.
• A supervised scheme shall be used to train a Transformer-based network by utilizing labeled data resulting from aeroelastic simulations of a turbine combined with the degradation model, as presented earlier in the article.
• Physics constraints shall be built into the loss or likelihood function.
• The output predictions should be probabilistic in nature.

Experiments and Datasets
We utilized multivariate time-series data generated via the coupled simulation pipeline previously described. The designation multivariate stems from treatment of multiple variables, referring to signals pertaining to lift, drag, angle of attack, and inflow velocity, which are recorded via an Aerosense node located near the blade tip. On the other hand, the term multivariate could further refer to the use of multiple Aerosense stations at different locations along the span of the blade(s); this, however, lies beyond the scope of the present work. After some initial testing, we opted to utilize the four variables shown in Table 3 as inputs to the Transformer. For instance, no performance benefit was found when including simulated acceleration data.
Three sets of experiments, each with different data requirements, were performed aiming to answer the following questions:
1. Are we able to learn the LEE severity classes from aerodynamic MTS data, in the general machine learning sense, with balanced data classes and no prior knowledge?
2. In a continuous monitoring context, are we able to diagnose jumps in LEE severity and therefore identify the degradation path that the system takes?
3. Are we able to do so in a realistic setting, with all previously described sources of uncertainty present in the simulations?
Thus, the goal of the first set of experimental tasks was to train the neural network on a generalized dataset with reduced uncertainty and where each degradation class appeared in equal measure. This is not representative of a real degradation path, where the inherent stochasticity of the NHCPP may result in some classes appearing very briefly, but we aimed to avoid biasing the network towards a random predominant class. Moreover, this set of experiments also aimed to understand how data availability affects classification performance and how separable the classes are. Two datasets, corresponding to Experiments 1.1 and 1.2, were generated: one with the full 10 severity classes and one with a reduced number of severity classes (levels 0, 1, 6, and 9). For this set of experiments, no information on previous states was used. In each of the sub-experiments, 4800 aeroelastic simulations were gathered and then divided into training, validation, and test sets in a 70/20/10 split.
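The 70/20/10 partition of the 4800 simulations can be worked out with a few lines of arithmetic. The sketch below is only a bookkeeping aid, assuming exactly balanced classes; the function name and the choice to let the test set absorb any rounding remainder are illustrative.

```python
def split_counts(n_total, percents=(70, 20, 10)):
    """Return (train, validation, test) sizes; the test set absorbs any rounding remainder."""
    n_train = n_total * percents[0] // 100
    n_val = n_total * percents[1] // 100
    return n_train, n_val, n_total - n_train - n_val


n_train, n_val, n_test = split_counts(4800)

# Training samples per class if the classes are exactly balanced.
per_class_10 = n_train // 10  # 10-class setup (Experiment 1.1)
per_class_4 = n_train // 4    # 4-class setup (Experiment 1.2)
```

With the same total number of simulations, the 4-class setup therefore sees 2.5 times as many training samples per class as the 10-class setup.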
The second set of experiments aimed to assess whether using full degradation paths in a continuous monitoring setup to train the diagnostics method is a suitable strategy. Here, the datasets comprised full NHCPP degradation paths, albeit with reduced uncertainty. Due to the stochastic degradation, the datasets did not have balanced classes. Assuming that a continuous monitoring system is in place, we therefore had access to the previous degradation states. This allowed for degradation monotonicity to be enforced in the prediction, thus constraining the output to physically possible solutions (i.e., it is physically impossible to have a state that is less degraded than previous states, except for the case of direct service intervention for repair and maintenance, which is out of scope in this work). Another objective was to evaluate whether severity grouping is a viable strategy. Indeed, grouping the LEE stages by type (see Table 1) could be beneficial as a coarse predictor, if it proves to be sufficiently accurate. Thus, in Experiment 2.1, all 10 degradation levels were used as labels, while, in Experiment 2.2, the levels were grouped by type. Three full NHCPP degradation paths (new sample every six days, over a 20-year period) with 1200 simulations each were separated into training and validation sets in an 80/20 split. A full degradation path, with 1200 data points, was reserved as the test dataset for final evaluation.
The goal of the third set was to evaluate how the model would perform in a realistic continuous monitoring situation. Here, the dataset was also made of full NHCPP simulations, but in this case, all sources of stochasticity were turned on. Again, we assumed that we had access to the previous degradation states. We also evaluated class grouping, separating the experiment into two. Three full NHCPP degradation paths (new sample every six days, over a 20-year period), each consisting of 1200 MTS data points, were divided into training and validation sets based on an 80/20 split. The test dataset comprised one full degradation path, with 1200 data points, unused in training or validation. Table 4 summarizes the differences between the datasets of the three sets of experiments, including the partition of data between the different subsets.

Transformer Architecture
Here, we introduce the architecture of our Transformer neural network, used to classify long-sequence multivariate time-series data and infer the degradation status of the leading edge.
Transformers are models that rely on self-attention mechanisms to highlight and learn dependencies within sequences. Typically, self-attention is implemented by parsing the input sequence into key (K), query (Q), and value (V) vectors. The attention weights on the values are then obtained by taking the scaled dot products of the query with all keys and then by applying a softmax function. Using the classical notation [39], this is written as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V,$$

where $d_k$ is the dimension of the key and query vectors and is used to scale the dot product. In practice, $h$ scaled dot-product attention operations are used in parallel, within multi-head attention layers.
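For readers less familiar with the mechanism, a minimal pure-Python sketch of single-head scaled dot-product attention is given here. It operates on small lists of row vectors and is meant only to make the operation explicit; it is not the implementation used in the experiments, and the toy inputs are made up.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of query, key, and value row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # one weight per key/value pair
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


# A zero query is equally similar to both keys, so the output is the mean of the values.
result = attention(Q=[[0.0, 0.0]], K=[[1.0, 0.0], [0.0, 1.0]], V=[[2.0, 0.0], [0.0, 4.0]])
```

Since every query is compared with every key, the number of scores grows with the square of the sequence length.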
A multi-head model is able to attend jointly to information gathered from different parts of the input, providing an improvement in representation over a single attention mechanism. The final attention vector is obtained by concatenating each of the dot-product results:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,$$

with

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V),$$

where $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are learnable projection matrices.

Multi-head attention layers are combined with fully connected networks and layer normalization to form the base architecture of each layer of the Transformer stack. We used a novel time-windowing Transformer model, which is based on the patch system used in the Vision Transformer (ViT) model [40]. This architecture, as shown in Figure 14, aims to address the quadratic attention bottleneck, an obstacle that limits the length of sequences a Transformer can use. Indeed, the computational complexity of the self-attention layers in Transformer models scales as $O(L^2)$, where $L$ is the length of the input sequence. Given our requirements, the input time-series had a length of $L = 60{,}000$, which is prohibitively resource-intensive for standard Transformer models. To alleviate this issue, we proposed to divide the data into $N$ windows along the time dimension before passing each window through the learnable input embedding. This allowed us to use the full input time-series without resorting to downsampling, which would entail losing information.
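The effect of time-windowing on the attention cost can be illustrated with a short sketch. The number of windows used below (100) is a hypothetical value picked for the example, not the setting of the proposed architecture, and the all-zero series is a stand-in for a real multivariate time-series.

```python
def split_into_windows(series, n_windows):
    """Split a (time, channels) sequence into n_windows equal-length windows along time."""
    length = len(series)
    if length % n_windows != 0:
        raise ValueError("sequence length must be divisible by the number of windows")
    size = length // n_windows
    return [series[i * size:(i + 1) * size] for i in range(n_windows)]


L, CHANNELS = 60_000, 4  # 10 min at 100 Hz; four input variables
N_WINDOWS = 100          # hypothetical window count, for illustration only
series = [[0.0] * CHANNELS for _ in range(L)]
windows = split_into_windows(series, N_WINDOWS)

tokens = N_WINDOWS + 1   # embedded windows plus the class token
cost_ratio = (L ** 2) / (tokens ** 2)  # reduction in attention score-matrix entries
```

Under this assumption, each window of 600 time steps and 4 channels is embedded into a single token, so self-attention operates on 101 tokens instead of 60,000 time steps, which shrinks the attention score matrix by more than five orders of magnitude.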
Before dividing the MTS into windows, each of the individual channels/variables was normalized in time. Then, the input embedding encodes each time-series window into a vector of size $d_{model}$, which is the latent size used throughout the Transformer. The type of input embedding used has a considerable impact on performance, and many embedding types can be envisaged. We chose to use a learnable linear embedding for simplicity, based on initial testing and the literature [40]. A class token is concatenated to the embedded sequence such that its state at the output of the Transformer is used to infer the degradation severity. Class tokens are commonly used in NLP tasks [70] and are an effective way to retrieve categorical information from Transformers. Traditionally, positional encodings are then added to the sequence in order to instill directional information into the model. Contrary to standard NLP Transformer models, we did not use a positional encoding. Two positional encoding methods were tested (classical sine functions and simple linear encoding); however, both of these methods hindered classification performance.
The classification head consists of a single-hidden-layer multilayer perceptron (MLP) that processes the class token at the output of the Transformer. It outputs a vector with a size equal to the number of possible degradation classes, which, when passed through a softmax activation, gives the likelihood scores of each class. For each class output $x_i$, the softmax likelihood is written as:

$$\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}.$$

If we aim to predict the degradation at multiple zones along the blade, we can simply stack multiple MLPs, one for each zone. Table 5 summarizes the values used in the Transformer architecture.

Figure 14. Time-windowing architecture of the novel Transformer model. The input multivariate time-series is split along the time dimension into windows, which are then individually passed through the learnable linear embedding. The learnable class token that is added to the sequence is used in the classification MLP head to predict the level of degradation.
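The softmax likelihood of the classification head doubles as a prediction confidence. The following sketch, with made-up head outputs for a 4-class setup, shows how a predicted class and its confidence can be read from the output vector, and how a minimum-confidence filter such as the 70% threshold used when reporting high-confidence accuracy could be applied.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the head outputs."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return (predicted class index, softmax likelihood of that class)."""
    probs = softmax(logits)
    c = max(range(len(probs)), key=probs.__getitem__)
    return c, probs[c]


logits = [0.1, 3.0, 0.2, -1.0]  # hypothetical MLP head output
c_hat, confidence = predict(logits)
is_confident = confidence >= 0.7  # keep only predictions above the confidence threshold
```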

Loss Functions
We first trained and evaluated the Transformer model on the datasets with equal amounts of degradation classes, in the sense of a traditional classification problem. Thus, the objective function for this first task was the standard cross-entropy loss:

$$\mathcal{L}_{CE}(\hat{y}, y) = -\sum_{i=1}^{n} y_i \log(\hat{y}_i),$$

where $y$ is a one-hot label vector indicating the correct class, $\hat{y}$ is a vector containing the predicted softmax probabilities for each class, and $n$ is the number of classes.
In the second and third sets of experiments, it was assumed that the state of the system is known for the previous sampling period. We used this information to enforce physicality; we added a second term to the loss function that penalizes a predicted degradation class that is lower than the known previous class. This second term is the margin ranking loss:

$$\mathcal{L}_{MR}(\hat{c}, c_{prev}) = \max\left(0, -(\hat{c} - c_{prev}) + m\right), \qquad (15)$$

where $\hat{c}$ is the predicted class, $c_{prev}$ is the degradation class of the previous known state, and $m$ is the margin hyperparameter. Overall, the objective function for the second and third experiments was therefore:

$$\mathcal{L} = \mathcal{L}_{CE}(\hat{y}, y) + \alpha\, \mathcal{L}_{MR}(\hat{c}, c_{prev}), \qquad (16)$$

where $\alpha$ is a hyperparameter used to balance the two components.
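A minimal numerical sketch of the combined objective follows. It assumes the standard hinge form of the margin ranking loss, which is zero once the predicted class is at least the previous class plus the margin, and an additive weighting by $\alpha$. Taking the predicted class as the argmax of the softmax output is also an assumption made for illustration; it is not differentiable, so a trainable implementation would need a suitable surrogate. The values of $\alpha$ and $m$ are placeholders.

```python
import math


def cross_entropy(y_onehot, y_prob):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, y_prob) if y > 0)


def margin_ranking(c_hat, c_prev, margin=0.0):
    """Hinge penalty, positive only when c_hat falls below c_prev + margin."""
    return max(0.0, -(c_hat - c_prev) + margin)


def total_loss(y_onehot, y_prob, c_prev, alpha=1.0, margin=0.0):
    # Assumption: the predicted class is the most likely class index.
    c_hat = max(range(len(y_prob)), key=y_prob.__getitem__)
    return cross_entropy(y_onehot, y_prob) + alpha * margin_ranking(c_hat, c_prev, margin)


y = [0, 0, 1, 0]              # true class is 2
p = [0.1, 0.2, 0.6, 0.1]      # predicted probabilities, most likely class is 2
physical = total_loss(y, p, c_prev=1)      # prediction at or above the previous state
non_physical = total_loss(y, p, c_prev=3)  # prediction below the previous state
```

In the second call the prediction lies one class below the previously known state, so the penalty adds one unit (times $\alpha$) to the cross-entropy term.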

Training Regime
During training, we used a batch size of 20 along with the Adam optimizer [71] with parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and a learning rate of $lr = 5 \cdot 10^{-5}$ to perform stochastic gradient descent. After 40 epochs of training, the learning rate decayed by a factor of 0.2. To ensure that the trained model is able to generalize well, and does not overfit, dropout with a 0.3 rate was used in the attention, fully-connected, and embedding layers. Furthermore, training and validation metrics (loss and accuracy) were monitored throughout the training sequence, allowing us to halt training before overfitting. We found that for most experiments, training for around 60 to 120 epochs was sufficient.
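The schedule and the model selection rule can be summarized in a few lines. The sketch assumes a single decay step at epoch 40, which is one possible reading of the description above, and selects the epoch with the highest validation accuracy; the function names and the validation curve are illustrative.

```python
def learning_rate(epoch, base_lr=5e-5, decay_epoch=40, factor=0.2):
    """Step schedule: base rate before decay_epoch, base_lr * factor afterwards."""
    return base_lr if epoch < decay_epoch else base_lr * factor


def best_epoch(val_accuracy):
    """Index of the epoch with the highest validation accuracy."""
    return max(range(len(val_accuracy)), key=val_accuracy.__getitem__)


val_curve = [0.50, 0.70, 0.65]  # made-up validation accuracies
selected = best_epoch(val_curve)
```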

Results
For each diagnostic experiment, a Transformer model was trained on the training dataset and monitored on the validation dataset, with the final evaluation taking place on the test dataset. All model hyperparameters were kept fixed throughout the experiments; only the number of training epochs varied, as we chose the epoch with the best validation performance. We report in the following sections the results for all three sets of experiments gathered on the previously unseen test dataset.

Experiment Set 1
The first set of experiments aimed to assess the potential of the Transformer for diagnostics given ideal conditions: a balanced dataset where each individual degradation class appears in equal measure. Table 6 reports the testing accuracy scores gathered for Experiments 1.1 and 1.2. We note a significant difference between the results of the two experiments, where the 4-class setup outperformed the 10-class setup by 30% in terms of test accuracy. Knowing that Transformers are data-intensive models that respond well to large datasets, this discrepancy can be partly explained by accounting for the number of simulations per individual class. Indeed, as both training sets have the same total number of samples, there are 2.5 times as many samples per severity class in Experiment 1.2. To factor out the disparity in the amount of data per class, we generated a larger version of the training dataset of Experiment 1.1, with a total of 10,176 training samples and 2544 validation samples. In this experiment (1.1 large), the number of training samples per class was larger than for Experiment 1.2 (1017 versus 840), yet the 4-class model still outperformed the 10-class model by over 16%.

In general, it is advantageous to train on a larger number of severity classes from a risk management perspective, as it allows for more optimal maintenance intervention schemes. However, as highlighted above, a fine-grained approach requires more training data in order to diagnose accurately with high confidence and could result in more false positives, which could then lead to an increase in operation and maintenance (O&M) costs. Thus, there is a trade-off between risk minimization via fine-grained diagnostics of LEE severity and increased costs due to more false positives and the need for larger training datasets, which can be optimized to meet risk management policies and budgetary specifications.
The difference in accuracy can also be explained by the separability of the individual severity classes. We show in Figure 15 the confusion matrices generated on the test sets for Experiments 1.1 and 1.2. It can be noted that the Transformer model was able to successfully differentiate between the lowest and highest erosion severities, as demonstrated by the lack of misclassified samples with large differences in severity. Similar LEE classes exhibited more wrongly classified samples. This aligned with our expectations: small increases in the level of erosion only marginally increase the roughness of the leading edge, which has a minor impact on blade aerodynamics. It is therefore more challenging to distinguish between similar LEE severities based on integrated pressure quantities.

Experiment Set 2
Here, we assess the use of stochastic degradation paths as diagnostic training points. Table 7 shows accuracy scores gathered on an unseen degradation path for Experiments 2.1 and 2.2. Figure 16a plots the predicted states of each MTS data point alongside the true degradation path for the test dataset that contains a full NHCPP path with up to 10 possible LEE severity classes. In addition, we show the prediction confidence of each model output based on its softmax likelihood. Figure 16b displays the median predicted severity using a three-month rolling window. These results show that this approach is viable and that we were able to successfully detect the jumps in degradation severity with some limitations. The main issue with training on stochastic degradation paths is the data imbalance. Indeed, the training dataset contained a majority of points belonging to Classes 0, 8, and 9, while Classes 5, 6, and 7 were underrepresented. This is reflected in the results, where the accuracy was high for Class 0 but low for Classes 5 and 6. Moreover, the prediction confidence was, in general, lower for underrepresented classes. A possible solution is to train on grouped severity classes. We evaluated this approach in Experiment 2.2 and plotted the resulting predictions in Figure 17. Overall, this approach yielded a higher prediction accuracy, but it is still limited. Grouping artificially creates a somewhat balanced set and reduces class sparsity; however, it also leads to increased intra-class heterogeneity. As a result, learning to set the boundaries between the classes becomes more difficult and dependent on the number and the separability of the groups. This is reflected in the disparity of the minimum 70% confidence accuracy results between Experiments 2.1 and 2.2.
Indeed, the low intra-class heterogeneity of the 10-class approach results in low confidence predictions for ambiguous samples, while the same ambiguous samples are incorrectly classified with a high likelihood in the grouped approach due to the network accounting for high intra-class variance. There is a balance to be found between the accuracy gain and the increase in intra-class heterogeneity induced by class grouping, depending on the number of groups. Should grouping be desired, unsupervised clustering approaches should be envisaged, in order to find the optimal number of groups.
In Experiments 2.1 and 2.2, non-physical predictions were penalized via the additional loss term (see Equation (16)). The result of using this extra component is apparent in Figures 16a and 17a: there are noticeably fewer misclassified points below the true degradation path than above it (11.5% vs 23.08% for Experiment 2.1 and 6.67% vs 25.58% for Experiment 2.2).
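The two post-processing quantities used in this section, the rolling median of the predicted severity and the share of misclassified points below and above the true path, can be computed as in the sketch below. The trailing window, the use of the lower median to keep integer class labels, and the toy prediction sequence are assumptions made for illustration. With one sample every six days, a three-month window holds about 15 samples; a window of 3 is used here only to keep the example short.

```python
from statistics import median_low


def rolling_median(values, window):
    """Trailing rolling median; early points use the samples available so far."""
    return [median_low(values[max(0, i - window + 1):i + 1]) for i in range(len(values))]


def misclassified_below_above(pred, true):
    """Fractions of predictions that fall below / above the true severity."""
    n = len(true)
    below = sum(p < t for p, t in zip(pred, true)) / n
    above = sum(p > t for p, t in zip(pred, true)) / n
    return below, above


true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
pred = [0, 1, 0, 1, 3, 1, 0, 2, 2, 3]  # made-up predictions with isolated errors
smoothed = rolling_median(pred, window=3)
below, above = misclassified_below_above(pred, true)
```

In this toy sequence the rolling median removes all isolated errors and only lags the true jump to severity 2 by one sample, which illustrates why occasional erroneous predictions are tolerable in a continuous monitoring setting.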

Experiment Set 3
In Experiments 3.1 and 3.2, we assessed how the diagnostics method performs on more uncertain degradation paths, with multiple sources of variability. Not only were there large class imbalances due to the stochastic degradation, but aerodynamic uncertainty (see Figure 6) and long-term weather fluctuations also made it challenging to distinguish between different LEE severities. Weather variability is not commonly used for long-term aeroelastic modeling in the wind energy industry and is somewhat unrealistic, but the goal was to understand the limitations of the Transformer model by making inference extremely challenging. We report in Table 8 the accuracy scores gathered on an unseen degradation path for Experiments 3.1 and 3.2. Overall, classification accuracy was much lower than in the previous experiments, but using only high-confidence predictions yielded a larger boost in accuracy. This can be explained by the inherent stochasticity and the high intra-class variance of this dataset, which makes high-likelihood predictions rarer. Figure 18 shows how challenging this final task is, where we see many non-physical misclassified points and an overall low prediction confidence. Here, the data imbalance had a big impact on classification performance. As the training set does not contain any samples of LEE Severity Class 9, the network was unable to predict this severity on the given test set. Given these harsh training conditions, it is unsurprising that performance was underwhelming. To mitigate this, class grouping by LEE type was again tested in Experiment 3.2 (see Figure 19). Compared to Experiment Set 2, the benefit of grouping classes was clear here: performance was improved by almost 20%. This indicates that class grouping is a viable approach if intra-class heterogeneity is high by default and some classes are severely underrepresented.
Nevertheless, even the results for grouped classes are lackluster, indicating that other strategies should be examined to deal with the very imbalanced datasets induced by the stochastic degradation paths. For instance, one could consider conditional retraining methods. By modifying the architecture so that it outputs confidence metrics based on how similar a sample is to the training set, as implemented in [72], one could then launch re-training procedures if multiple test samples that are very dissimilar to the training space are encountered.

Transferability and Curriculum Learning
Here, we evaluated the transferability of the different experiments and assessed the use of curriculum learning [73] as a method to improve accuracy on difficult datasets. In a curriculum learning training setup, the network is progressively exposed to harder datasets. In our implementation, we first trained the model on the dataset from Experiment 1.1, then we progressively added the training data from Experiments 2.1 and 3.1. The benefits of this approach should be twofold: (1) the data-intensive Transformer model is exposed to a larger dataset, and (2) the weights of the model are fine-tuned on more challenging samples, which helps to avoid local minima during gradient descent. Tables 9 and 10 show the results of the different transferability and curriculum learning experiments. 1. This was to be expected due to the data-intensive nature of Transformers and the similarity between these sets.
• The first curriculum learning experiment reduced transferability to Set 3.1. This is an indication of a loss of capacity to generalize to ambiguous data.
• The best test results on Set 3.1 were obtained in the second curriculum learning experiment. This suggests that pretraining on sets with reduced stochasticity followed by fine-tuning on uncertain data is an effective approach.
• The second curriculum learning experiment yielded the best high-confidence accuracy for Sets 1.1 and 2.1. Thus, adding difficult, stochastic data points of Experiment 3.1 to the training dataset helps with regularization, enabling the model to construct a better internal representation of each severity class.
The curriculum learning test, which includes the data from all experiments, led to the highest high-confidence accuracy for the 2.1 test set. As this set is representative of a degradation path with standard aero-elastic simulation practices, we have plotted these results in Figure 20.

Limitations and Discussion
Given comprehensive historical data, such as those we aim to collect via the Aerosense monitoring system, it would be possible to tune the parameters of the proposed NHCPP model, calibrate its assumptions and modeling choices, and further develop our probabilistic model so that it more closely emulates deterioration due to leading-edge erosion on operating wind turbines in the field over long time horizons [74,75]. In addition, the impact of field data quantity and quality on model calibration, and its implications for the generated NHCPP, will need to be closely evaluated.
Similar to the NHCPP, the gamma process has also been traditionally used for modeling deterioration. The increments of a gamma process follow a gamma distribution and are assumed to be independent of each other. Unlike Poisson processes, gamma processes exhibit infinitely many jumps in any finite time period. This is why the former are suited for modeling sporadic shock-induced damage, while the latter suit continuous, monotonic, and gradual deterioration [76].
One interesting extension to our NHCPP macro degradation model is to further couple it to a micro-model that describes the relative velocity between particles (including their density and size) and the blade [11]. Subsequently, based on the site's atmospheric conditions and the wind turbine's operational settings, the erosion severity is calculated from the number of particles impacting the blade surface.
An important outcome of the diagnostic experiments is the fact that the Transformer model is data-intensive. Our results indicate that more data almost always improves classification performance. Furthermore, training on balanced datasets is always desirable, but obtaining these is difficult in real monitoring scenarios as it would require gathering data from many systems undergoing different degradation paths. To overcome this difficulty, the coupled NHCPP aeroelastic simulation setup can be used to replicate realistic degradation and produce balanced training sets, which can then be augmented with real field data, in a similar manner to the curriculum learning experiments.
As mentioned in the literature review, feature-extraction learning methods are often preferred for MTS classification tasks owing to their higher precision. In this context, we tested the state-of-the-art MINIROCKET method [27] with 10k kernels, combined with a standard linear classifier at the output for our diagnostics problem. This yielded an accuracy of only 55.21% on the test set of Experiment 1.1 and 68.96% on the test set of Experiment 1.2. This highlights the fact that very long sequences are challenging to deal with, while many methods found in the literature are not optimized for such tasks. Furthermore, feature-extraction methods may encounter difficulty finding optimal features, due to the fact that this problem features a very slow degradation rate with no flagrant traits from one stochastic sample to another. On the contrary, Transformers are well suited for this kind of input, as the self-attention mechanisms are effective at filtering out unnecessary data while focusing on the parts of the signal that are important.
While the loss term added to penalize non-monotonicity of the degradation process does help to reduce non-physical predictions, monotonicity is not strictly enforced. Another option would be to enforce this condition in the softmax output layer. Although this would most likely improve accuracy, it would lead to error accumulation when implemented in a continuous monitoring setup, since a single overestimated severity could never be corrected by later predictions. With our approach, erroneous predictions can be discarded, for instance by using a rolling window. The proposed model could, however, be further improved by additional tuning of the objective function. One such improvement involves incorporating an inflow-dependent multiplier that reduces the importance of samples with highly unlikely inflow conditions. Furthermore, while the leading edge degradation process is inherently time-dependent, time was not factored into the final diagnostics approach. Initial tests, which included a global time stamp in the model input, showed the Transformer overfitting to this variable. A probabilistic loss term that penalizes unlikely severities given the amount of operational time could be a potential solution to this.
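The two ideas discussed above, a soft penalty on non-monotonic predictions and a rolling window that discards isolated erroneous predictions, can be illustrated with a minimal sketch. The function names `monotonicity_penalty` and `rolling_mode`, the use of the expected severity index, and the window length are all assumptions made for illustration; they are not the loss term or post-processing used in our experiments.

```python
import numpy as np


def monotonicity_penalty(probs):
    """Soft penalty on decreasing severity along one degradation path.

    probs: array of shape (T, K) holding softmax outputs for T consecutive
    monitoring windows and K ordered severity classes (assumed layout).
    Returns the mean drop in expected severity between consecutive windows;
    zero means the predicted severity never decreases.
    """
    probs = np.asarray(probs, dtype=float)
    classes = np.arange(probs.shape[1])
    expected = probs @ classes                      # expected severity index per window
    drops = np.clip(expected[:-1] - expected[1:], 0.0, None)
    return float(drops.mean()) if drops.size else 0.0


def rolling_mode(labels, window=5):
    """Replace each predicted label by the most frequent label in a trailing window.

    Ties are resolved toward the lower severity class, because
    np.bincount(...).argmax() returns the first maximum.
    """
    labels = np.asarray(labels, dtype=int)
    smoothed = np.empty_like(labels)
    for t in range(len(labels)):
        start = max(0, t - window + 1)
        smoothed[t] = np.bincount(labels[start:t + 1]).argmax()
    return smoothed


# Hypothetical sequence of predicted severity classes with two outliers.
raw = [0, 0, 0, 2, 0, 1, 1, 1, 0, 1]
print(rolling_mode(raw, window=3))  # [0 0 0 0 0 0 1 1 1 1]
```

Because the penalty is soft and the filter is applied only after prediction, a wrong prediction at one time step does not constrain later ones, which is the property that hard enforcement in the output layer would lose.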
Another possible improvement could be the modification of the training procedure to a bi-directional procedure, similar to the BERT model [70]. In this approach, a pairwise input would be fed into the Transformer, and during the masked pre-training phase, the model would try to predict the class of either the first or second MTS at random. This could lead to a better representation of the data and allow the use of known reference states for comparative diagnostics.
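A minimal sketch of how such pairwise inputs could be assembled is given below. The helper `make_pair_sample`, the segment identifiers, and the concatenation along the time axis are assumptions chosen for illustration, loosely following the sentence-pair layout of BERT; this procedure has not been implemented or evaluated in this work.

```python
import numpy as np


def make_pair_sample(x_a, y_a, x_b, y_b, rng):
    """Build one pairwise training sample from two labelled MTS.

    x_a, x_b: arrays of shape (T_a, C) and (T_b, C) with the same channels C.
    y_a, y_b: integer severity classes of the two series.
    Returns the concatenated series, a per-time-step segment id (0 or 1),
    the randomly chosen slot whose class must be predicted, and that class.
    """
    x_a, x_b = np.asarray(x_a), np.asarray(x_b)
    if x_a.shape[1] != x_b.shape[1]:
        raise ValueError("both series must have the same number of channels")
    x_pair = np.concatenate([x_a, x_b], axis=0)
    segment_ids = np.concatenate([
        np.zeros(len(x_a), dtype=int),
        np.ones(len(x_b), dtype=int),
    ])
    target_slot = int(rng.integers(0, 2))           # 0: first series, 1: second series
    target = y_a if target_slot == 0 else y_b
    return x_pair, segment_ids, target_slot, target


rng = np.random.default_rng(0)
reference = np.zeros((4, 3))   # hypothetical known reference state
current = np.ones((6, 3))      # hypothetical series to be diagnosed
x_pair, segment_ids, slot, target = make_pair_sample(reference, 0, current, 2, rng)
```

Fixing the first slot to a series with a known state at inference time is one way the comparative diagnostics mentioned above could be realized.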
Our proposed approach does not take positional information into account. However, as shown in Figure 13b, including it may allow the LEE severity classes along the span of the blade to be distinguished more clearly. Positional information can be inserted into our Transformer model in different ways; one possible approach is to encode a graph structure in the Transformer architecture, as implemented in [77].

Conclusions
In this work, we tackled the problem of leading-edge erosion of wind turbine blades on two fronts. Firstly, we developed an appropriate model for the simulation of such degradation processes, and, secondly, we proposed a monitoring-driven method for the diagnostics of such damage processes. On the first front, we proposed a stochastic spatio-temporal empirical model of leading-edge erosion degradation based on a non-homogeneous compound Poisson process and coupled it to the nonlinear time-marching OpenFAST aeroelastic wind turbine simulator. The coupled model was able to compute the non-stationary aeroelastic dynamic response of a wind turbine, reflecting its behaviour under the effect of leading-edge erosion, varying inflow conditions and aerodynamic uncertainties over a long degradation time horizon. On the diagnostic front, we adapted a deep neural network, namely, a Transformer, for use on very long sequence multivariate time-series. This allowed us to solve the problem of spatio-temporal diagnostics of leading-edge erosion on wind turbine blades, using the data emanating from the non-homogeneous compound Poisson process setup, a scheme that is intended to emulate data recorded by aero-acoustic sensors placed on wind turbine blades. We showed that the diagnostics model effectively captured the temporal trends induced by long-term degradation of the leading edge. An attractive feature of this method is that it is well suited for spatio-temporal degradation problems with a very long time horizon.

Data Availability Statement: The data that were generated through the coupled NHCPP-OpenFAST simulation pipeline and used for the classification experiments will be made available for download from the following repository: https://zenodo.org/record/5544043.
