Towards Fleet-wide Sharing of Wind Turbine Condition Information through Privacy-preserving Federated Learning

Terabytes of data are collected by wind turbine manufacturers from their fleets every day. Yet a lack of data access and sharing impedes exploiting the full potential of these data. We present a distributed machine learning approach that preserves data privacy by leaving the data on the wind turbines while still enabling fleet-wide learning on those local data. We show that, through federated fleet-wide learning, turbines with little or no representative training data can benefit from more accurate normal behavior models. Customizing the global federated model to individual turbines yields the highest fault detection accuracy in cases where the monitored target variable is distributed heterogeneously across the fleet. We demonstrate this for bearing temperatures, a target variable whose normal behavior can vary widely across turbines. We show that no turbine experiences a loss in model performance from participating in the federated learning process, resulting in superior performance of the federated learning strategy in our case studies. The distributed learning increases the normal behavior model training times by about a factor of ten due to increased communication overhead and slower model convergence.


Introduction
Wind energy plays a pivotal role in climate change mitigation. A massive growth in installed wind power capacity and grid infrastructure is required to decarbonize the power supply (Barthelmie & Pryor, 2021; Edenhofer et al., 2011). New wind farms are being planned and commissioned on an unprecedented scale in many countries (IEA, 2021, 2022; OECD et al., 2018). Access to condition monitoring information from wind farms is an important requirement for reducing downtimes and enabling condition-based maintenance (Carroll et al., 2016; Faulstich et al., 2011; Pinar Pérez et al., 2013). Despite this, manufacturers have been guarding condition data and reliability information from their turbines and are reluctant to share them due to business strategic interests (Kusiak, 2016). The result is a severe lack of data (Clifton et al., 2022; Leahy et al., 2019), which has been hampering the development, large-scale validation, and operational deployment of data-driven models for wind turbine monitoring and diagnostic tasks. Our study addresses this problem by proposing a privacy-preserving approach for sharing wind turbine (WT) condition information within a fleet of WTs of different owners without sharing any data from the WTs. In the context of this study, a fleet is the set of all WTs of the same model; the WTs within a fleet are identical in design. We demonstrate how data-driven condition monitoring models can be trained collaboratively by a WT fleet in a manner that allows sharing condition information among the WTs without sharing the WTs' condition data. Specifically, we propose to train accurate turbine-specific models of each WT's normal operation behavior for fault detection tasks by making use of the condition monitoring data of the entire WT fleet in a privacy-preserving manner. This is a highly relevant scenario because, in practice, only the manufacturer can access the condition data from all WTs of a fleet, whereas other stakeholders only have access to
the small share of the fleet's data from their own WTs, or even to no data at all (Kusiak, 2016). Other stakeholder groups concerned may include operators, owners, third-party companies, regulators, and researchers. Thus, our study demonstrates a path towards privacy-preserving sharing of condition information without any manufacturer, operator, or owner having to grant anyone access to their WTs' operation and condition data. Wind farm operators usually have no access to WT data from other operators and are, therefore, not able to make use of condition information from other operators' WTs for their own wind farms. The lack of data sharing (among wind farms of different owners) within fleets is particularly unfortunate in situations where the relevant data are scarce: for example, when an operator or other stakeholder seeks to establish a damage database but has only few (or even no) fault events of each fault type in their database, or when a new WT has been commissioned and the stakeholder has no condition data available yet for that WT type. In such situations, it would be highly desirable to benefit from fleet-wide information sharing. Manufacturers, on the other hand, usually have access to the operation and condition data of all operating WTs produced by them but do not share these data. To address this data imbalance, we propose and investigate the potential of privacy-preserving federated learning (McMahan et al., 2017) for condition monitoring and diagnostics tasks in wind farms based on WT data distributed among multiple owners. Federated learning has received growing interest in the field of mobile devices and Internet-of-Things applications after McMahan et al.
(2017) presented the FedAvg algorithm. There have been numerous recent improvements of and contributions to FedAvg, for instance, in enhancing security and privacy (Mothukuri et al., 2021; Yin et al., 2022) and in improving its efficiency (Acar et al., 2021; Asad et al., 2020). Comprehensive reviews of federated learning algorithms have been provided in (Aledhari et al., 2020; Kairouz et al., 2021; L. Li et al., 2020; T. Li et al., 2020; Lim et al., 2020; Mothukuri et al., 2021; Yang et al., 2019). An application of federated learning that is in operational use is next-word prediction for virtual keyboards in mobile apps (Hard et al., 2018; Pichai, 2019). First applications have also been proposed in other fields, such as automotive systems (Y. Liu et al., 2020; Lu et al., 2020; Thorgeirsson et al., 2021). The capabilities of federated learning are still largely unexplored in renewable energy applications. Recently, Zhang et al. (2021) proposed a federated learning case study for probabilistic solar irradiance forecasting. Their FedAvg-based framework, enhanced by secure aggregation with differential privacy, was shown to achieve performance advantages over a setting in which data sharing between participants was unavailable. However, the authors noted that the shared federated learning model resulted in slightly inferior performance compared to a centralized setting with data sharing, as it is susceptible to data distribution deviations between clients. Lin et al.
(2022) presented a federated learning approach for community-level disaggregation of behind-the-meter photovoltaic power production. To address the data heterogeneity of each community, a layer-wise aggregation was introduced: only the parameters of the shallow layers, which learn community-invariant features, were exchanged, while the community-specific parameters of the deep layers remained local. This customization step was shown to result in improvements compared to a completely shared global model. With a focus on efficiency, Q. Liu et al. (2022) demonstrated a successful federated learning application to collaborative fault diagnosis of photovoltaic stations. To address the inefficiencies of FedAvg, especially when computing capabilities and dataset sizes differ between the participants, the authors proposed asynchronous decentralized federated learning. This framework without a central server resulted in significant reductions in communication overhead and training time. In the field of wind energy, Cheng et al. (2022) presented the first and, to our knowledge, so far only study of a federated learning model for wind farms. The authors proposed an approach for detecting blade icing by classification. A blockchain-based architecture with a cluster-based learning module was introduced to address concerns regarding privacy and malicious attacks, as well as the data imbalance. The authors remarked that, while not considered in their study, existing data heterogeneity may negatively affect the performance of the model. Classification methods such as that of Cheng et al.
(2022) are, however, relatively uncommon for fault detection tasks in wind farms in practice due to the typically small number (or even absence) of fault observations. In contrast, fault detection based on normal behavior models is more common because it relies on learning an accurate representation of only the normal behavior of the WT and does not require a comprehensive amount of representative fault condition data, unlike fault classification approaches (Tautz-Weinert & Watson, 2017). Normal behavior modelling involves modelling the behavior of the monitored WT under normal, fault-free operation conditions. The resulting normal behavior models (NBMs) characterize the normal operation behavior of the monitored WT as expected under the prevailing operating conditions. For example, NBMs can predict the bearing temperatures or the active power output expected under the current normal conditions. NBMs are very useful because they enable the detection of significant deviations from the normal behavior that may indicate operation faults and trigger further investigation (Bilendo, Badihi, et al., 2022; Bilendo, Meyer, et al., 2022; Schlechtingen et al., 2013a). Such deviations can be detected based on the residuals between the measured and the expected state. SCADA-based NBMs have been proposed for single and for multiple monitored state variables (Meyer, 2021; Schlechtingen et al., 2013b; Zaher et al., 2009). Multiple sensor systems are usually available in a WT for condition monitoring and normal behavior modelling for fault detection and diagnosis. They include temperature sensors, accelerometers for monitoring the vibration responses in the drivetrain and tower, oil quality sensors, and environmental sensors such as anemometers (Badihi et al., 2022; García Márquez et al., 2012; Tchakoua et al., 2014; Wymore et al., 2015). Condition monitoring can also be performed based on data from the supervisory control and data acquisition (SCADA) system of the WT (e.g., Dao, 2022; Tautz-Weinert & Watson,
2017; A. Wang et al., 2022; Zaher et al., 2009; Y. Zhu et al., 2022) and based on combinations of SCADA and vibration data (e.g., Sun et al., 2022). SCADA-based condition monitoring can be considered a low-cost approach since no additional sensor systems need to be installed. On the other hand, the WT health information provided by SCADA data may be less informative in that it can be less component-specific, less timely, and less accurate with regard to the fault diagnostics task than dedicated sensing systems such as accelerometers. For example, gearbox faults can be identified from vibration measurements at an early stage of fault development (e.g., Jonas et al., 2023), whereas associated SCADA data, such as the gearbox temperature, would allow the fault to be detected only once it has resulted in an unusual increase of the gearbox temperature, i.e., at a late development stage. Such temperature increases typically result from abnormal heat generation that can originate from excessive friction. Therefore, in SCADA-based fault detection, a fault can often be detected only at a relatively advanced fault development stage at which initial damage may already have occurred. Nevertheless, SCADA-based fault detection is a popular monitoring technique due to its simplicity, low cost, and complementarity to other condition monitoring techniques in WTs. Comprehensive reviews of data-driven approaches in condition monitoring and diagnostics for wind farms were provided by (Black et al., 2021; Nunes et al., 2021; Pandit et al., 2023; Stetco et al., 2019; Tautz-Weinert & Watson, 2017). The potential of collaborative fleet-wide learning of normal behavior models for fault detection tasks in WTs based on SCADA data has not been discussed or investigated, despite its high relevance for practical applications. Our study addresses this research gap by proposing federated learning of normal behavior models in a data-privacy-preserving manner. We propose a solution to an important practical
problem in wind farm monitoring and diagnostics: how to train NBMs for detecting developing faults in WT subsystems when the SCADA and sensor data needed for training the NBMs are missing or not representative of the WT's current operation. This is a major challenge in newly commissioned wind turbines and in turbines whose operation behavior has changed, for example, due to large hardware or software updates. We demonstrate the federated learning of NBMs in two case studies for gear bearing temperatures and power curves in two wind farms. The main contributions of our study are:
1. a new privacy-preserving approach to wind turbine condition monitoring,
2. a customization approach that tailors the federated model to individual WTs if the target variable distributions deviate across the WTs participating in the federated training, and
3. a demonstration of federated training and customization for condition monitoring of bearing temperatures and active power.
Our study contributes to resolving a major problem: the "lack of data sharing in the renewable-energy industry [which] is hindering technical progress and squandering opportunities for improving the efficiency of energy markets" (Kusiak, 2016). This study is structured as follows. Section 2 details our proposition for collaborative privacy-preserving learning for condition monitoring and diagnostics tasks in WT fleets. Section 3 presents two case studies of federated learning of normal behavior models of bearing temperatures and active power. We report and discuss our results in section 4. Section 5 summarizes the conclusions from our study.

Federated learning of wind turbine conditions

Federated learning
In conventional machine learning, all data on which a model is trained need to be available and accessible in a central system. If the data belong to different owners, such a centralized setting requires that the data owners give up their data privacy by sharing their data with others. In contrast, federated learning is a machine learning approach that learns a task from the joint data of different data owners without disclosing the data or sacrificing their privacy. In a federated learning environment, multiple industrial systems (clients, in our case: wind turbines) train a machine learning model in a collaborative distributed manner such that each client's training data remain on its local client system, thereby preserving the privacy of the training data (McMahan et al., 2017; Smith et al., 2017). With federated learning, the training data are distributed across multiple client systems and are not located in one central system, as is the case with conventional machine learning. The parameters of a collaboratively trained model are learned from the distributed data without exchanging the training data among the client systems or transmitting them to a central system. Only updates of the locally computed model parameters are shared with and aggregated by the central system. The model training is collaborative in the sense that each client contributes to the joint model training task by using its locally stored data for that task. We adopt the FedAvg federated learning approach of McMahan et al.
(2017) in our study. For a formal definition, it is assumed that a fixed number K of client WTs participate in the federated learning process. Each client WT k has a fixed dataset D_k of size n_k = |D_k|. In our case study, this is the dataset from the SCADA system used for training a normal behavior model of the WT's normal operation behavior. Each dataset D_k is stored locally in the client WT and is not accessible to the other client WTs or the central system. The FedAvg training proceeds in iteration rounds, at the start of which a central server transmits the initial model parameters of the current round to the K client WTs (Table 1). Then, each client WT k updates the received model parameters by training on its local dataset D_k and transmits the update back to the central server. The server updates the parameters of the global model by aggregating the updates received from all client WTs by averaging. The objective of the iterative FedAvg training process is to arrive at model parameters w that minimize the weighted sum of the prediction losses L_k of the K client WTs on all data points (x_i, y_i) of their local datasets D_k,

    min_w sum_{k=1}^{K} (n_k / n) L_k(w),   with   L_k(w) = (1 / n_k) sum_{i in D_k} l(f(x_i; w), y_i),

where n = sum_k n_k is the total number of data points and l denotes the prediction loss of the model f on a single data point. In our case study, the model parameters w are the weights of a feed-forward neural network, and we compute the prediction losses L_k in terms of the mean squared error. In each algorithm round t, the update step involves the K client WTs performing local weight updates in parallel, so each client WT k performs gradient descent steps on its local data,

    w_{t+1}^k = w_t - eta * grad L_k(w_t),

with learning rate eta, after which the server forms the new global weights as the data-weighted average w_{t+1} = sum_{k=1}^{K} (n_k / n) w_{t+1}^k. In addition to preserving the data privacy, further advantages of federated learning result from the fact that it does not require all client data to be stored in a central location. This can be highly beneficial when applied to complex remotely monitored power infrastructure such as wind farms. Modern WTs are equipped with hundreds of sensors that can collect hundreds of gigabytes of data every day (Siemens Gamesa, 2022). Transmitting and storing all those data in a central system (as would be common in conventional machine learning) is
expensive and requires a high transmission bandwidth and data buffer. If the data are stored centrally, the data center managers of the central storage system are also responsible for protecting the data privacy and preventing unwanted third-party access, which entails an additional burden.
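The FedAvg round described above can be sketched in a few lines of code. This is a minimal illustration under simplifying assumptions: a linear model stands in for the paper's feed-forward network, full-batch gradient descent replaces mini-batch training, and all names (`client_update`, `server_aggregate`) are ours, not from the FedAvg reference implementation.

```python
import numpy as np

def client_update(weights, X, y, lr=0.01, epochs=3):
    """One client's local update: gradient steps on its private data.
    A linear model with MSE loss stands in for the neural network."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE loss
        w -= lr * grad
    return w

def server_aggregate(client_weights, client_sizes):
    """FedAvg aggregation: average client updates weighted by n_k / n."""
    n = sum(client_sizes)
    return sum((nk / n) * w for w, nk in zip(client_weights, client_sizes))

# Toy fleet: three clients whose data share one underlying relationship.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for nk in (200, 300, 500):
    X = rng.normal(size=(nk, 2))
    y = X @ true_w + 0.01 * rng.normal(size=nk)
    clients.append((X, y))

w_global = np.zeros(2)
for round_ in range(100):  # iteration rounds of the FedAvg training
    updates = [client_update(w_global, X, y) for X, y in clients]
    w_global = server_aggregate(updates, [len(y) for _, y in clients])
```

Only the weight vectors cross the client boundary here; the arrays `X` and `y` of each client never leave its scope, which is the privacy property the text describes.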

Federated learning for condition monitoring
Condition monitoring of wind turbines is often based on normal behavior models in practice (Schlechtingen et al., 2013b; Tautz-Weinert & Watson, 2017). Normal behavior models (NBMs) can be used for applications in fault detection and diagnostics. We propose and demonstrate the federated learning of normal behavior models for such condition monitoring tasks. In the following, we analyze how NBMs for condition monitoring can be trained collaboratively by a fleet of WTs in a manner that allows information sharing within the WT fleet without disclosing the data of any of the WTs. NBMs have been proposed for fault detection tasks based on SCADA and sensor data (Meyer, 2021; Schlechtingen et al., 2013b; Zaher et al., 2009). Our case studies explore the application of the FedAvg method (McMahan et al., 2017) for training accurate NBMs for fault detection applications in WTs that have few or no representative data. We focus on fault detection based on NBMs of drivetrain component temperatures and of the active power production (Meyer & Brodbeck, 2020). The drivetrain component temperatures exhibit more heterogeneous distributions across WTs than the active power. We investigate how federated learning can still be applied to extract accurate NBMs for condition monitoring despite significant inter-turbine differences in the distribution of the target variable, in our case the gear bearing temperature.
The temperature behavior of components and the active power form the basis of NBMs that are key for condition monitoring in WTs (Kusiak et al., 2009; Lydia et al., 2014; Marvuglia & Messineo, 2012; Schlechtingen et al., 2013a; Shokrzadeh et al., 2014; Y. Wang et al., 2019). We demonstrate federated learning of NBMs for these applications.
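As a schematic illustration, fault detection based on the residuals between measured values and NBM predictions can look as follows. The threshold rule (k standard deviations of the fault-free training residuals) and all names are our own illustrative choices, not the detection criterion used in this study.

```python
import numpy as np

def detect_anomalies(measured, expected, train_residuals, k=3.0):
    """Flag points whose residual (measured minus expected NBM prediction)
    deviates more than k standard deviations from the residuals observed
    during fault-free training."""
    mu, sigma = train_residuals.mean(), train_residuals.std()
    residuals = measured - expected
    return np.abs(residuals - mu) > k * sigma

# Fault-free residuals collected during the training period (toy numbers).
rng = np.random.default_rng(1)
train_res = rng.normal(0.0, 0.5, size=1000)

# Monitoring period: the NBM expects ~60 degC; the last reading drifts up,
# as it might during abnormal heat generation in a gear bearing.
expected = np.full(5, 60.0)
measured = np.array([60.2, 59.7, 60.4, 60.1, 63.5])
flags = detect_anomalies(measured, expected, train_res)
```

Only the final, strongly deviating reading is flagged; the small residuals are consistent with normal operation.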
Policies involved in the practical implementation of a federated learning process are beyond the scope of this study. There is certainly more than one setup and distribution of roles in the federated training process that can work in practice. For example, the federated learning process can be orchestrated by a regulatory entity who might define the process, the machine learning model structure, and the aggregation, and distribute the software needed for the implementation. Federated learning can be organized in a centralized way, as presented here, but also in decentralized ways. Federated learning processes can be orchestrated by a central agency, such as a regulator. They may also be implemented and orchestrated by operators to enable data access across the fleet. Federated learning can, in principle, even be implemented by the manufacturer for customers who prefer not to give the manufacturer access to their turbines' data. In the centralized learning process proposed in our study, the client WTs only need to be equipped with a client computer that can train neural networks on their local data, with computing capacity and storage similar to that of a laptop computer.

Federated model training process

The server selects the model architecture and initial weights, and iterates the steps below.
1. The clients participating in the training are selected.
2. The central server transmits the starting weights of the current iteration to the participating clients.
3. Each client updates the received weights by training on its local dataset and transmits its update back to the server.
4. The server aggregates the received client weights by averaging to form the updated global model.

Customizing federated models to individual WTs
A possible limitation of global federated learning models is that a single global model is trained for application in all client WTs of the fleet. Having a single non-customized model for all fleet members can limit the model's performance in the fault detection task, especially in cases in which the client WTs' SCADA and sensor datasets follow somewhat different statistical distributions in normal operation, requiring NBMs that are customized to each WT. Previous research investigating the effects of non-identically distributed data on the FedAvg algorithm has shown that data distribution differences can negatively impact the convergence and the performance of the global FedAvg model (Q. Li et al., 2022; Zhao et al., 2018; H. Zhu et al., 2021). We investigate NBM customization in our case studies.
The NBM resulting from the federated training process (Table 1) is a global model trained on the data of all client WTs, so it is not customized to a specific client WT. We demonstrate the limitations of a single non-customized model in our case studies based on the example of WT gear bearing temperatures and active power. Despite all WTs being the same model, each WT's local dataset can present a somewhat different data distribution. The arising data heterogeneity can be described as domain shift (Huang et al., 2023; Kouw & Loog, 2018; Quinonero-Candela et al., 2008), where the WTs' local datasets form diverse domains with different feature distributions. For example, the temperature behavior of the gear bearing can differ across WTs because of differing thermal behaviors. A single global NBM without customization learned through the FedAvg training process can lead to poor generalizability across domains (i.e., WTs) and to situations where for some client WTs the global NBM outperforms a locally trained one, whereas for other client WTs the global NBM performs worse than a model trained only on their local data. A lack of generalizability can become especially critical when WTs that have little or no representative data depend on information contained in the data of other WTs with distinct domains. Shared global models may be inadequate under these circumstances. Customized federated learning aims at alleviating this issue by customizing the global model to each client WT while that WT still participates in the distributed learning process. Customization techniques that have been proposed for federated learning models range from customization layers in neural networks (Arivazhagan et al., 2019) to meta-learning with hypernetworks (Shamsian et al., 2021). We refer to Kulkarni et al. (2020) and Tan et al. (2022) for an overview and taxonomy of customization techniques.
In this study, we customize the global FedAvg model by means of local finetuning updates (Collins et al., 2022; Tan et al., 2022), which ensure that all participating client WTs can benefit from the federated learning process.
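Local finetuning of the converged global model can be sketched as follows. This is illustrative only: a linear model is used in place of the neural network, and the learning rate, epoch count, and all names are our own assumptions.

```python
import numpy as np

def finetune(global_weights, X_local, y_local, lr=0.01, epochs=10):
    """Customize the global FedAvg model to one client WT by a few
    additional gradient steps on that client's local data only."""
    w = global_weights.copy()
    for _ in range(epochs):
        grad = 2 * X_local.T @ (X_local @ w - y_local) / len(y_local)
        w -= lr * grad
    return w

# A client whose local input-output relationship deviates from the fleet
# average (e.g., a warmer-running gear bearing).
rng = np.random.default_rng(2)
w_global = np.array([2.0, -1.0])   # stands in for the converged global model
w_client = np.array([2.5, -1.0])   # this WT's true local behavior
X = rng.normal(size=(400, 2))
y = X @ w_client + 0.01 * rng.normal(size=400)

w_custom = finetune(w_global, X, y, epochs=200)
```

The finetuned weights start from the fleet-wide solution and move toward the client's own characteristics, which is the intended effect of the customization step.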

Case studies: Federated learning of fault detection models
The goal of our case studies is to estimate WT-specific normal behavior models for WTs that lack representative observations, and to perform the estimation in a collaborative privacy-preserving manner.A WT can suffer from a lack of representative training data for various reasons.A lack of representative data arises at the commissioning of a WT but can also occur after events that can affect the WT's normal operation behavior, such as control software updates or hardware replacements.
In the first case study, normal behavior models of the active power are developed: Some of the WTs participating in the federated learning process have representative local training data covering all wind conditions, whereas the training data of other WTs are dominated by low wind speeds.The second case study focuses on federated learning of normal behavior models of bearing temperatures.Unlike in the first case study, the bearing temperatures exhibit heterogeneous distributions across the WTs participating in the federated training.We show that customizing the trained global model to individual WTs yields the highest fault detection accuracy under such conditions.
The case studies are performed with data from two wind farms. The two wind farms are in separate locations (with a distance of at least 900 km) with different geographical and environmental factors. In the following sections 3 and 4, we describe, present, and discuss our case studies with regard to data from the first wind farm. We then apply the same case study design and validate our results on the dataset from the second wind farm, which is presented in appendix A4. SCADA data from ten commercial onshore wind turbines are analyzed for the case studies. All ten WTs are of the same manufacturer and model. The WTs are a horizontal-axis variable-speed model with pitch control and share the same technical specifications (Table 2); the gearbox comprises two planetary stages and one helical stage. The data were acquired from the WTs' SCADA systems at a sampling rate of ten minutes over the course of 13 months. Each WT holds around 50'000 valid data points that contain wind speeds measured at the nacelle, the corresponding power generation, measured rotor speeds, and gear bearing temperatures. The measurements are provided as average values over 10-minute periods. All WTs are from the same wind farm, and we assume that no data sharing is allowed between the WTs. One randomly selected turbine out of the ten WTs is used only to define the network architecture with optimal hyperparameters, as explained in Appendix A1. The NBMs of the remaining nine client WTs are estimated based on the SCADA data.
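The fleet setup described above can be sketched as follows. The 30% chronological hold-out per WT (used later in the case studies) is included for completeness; the seed, identifiers, and toy record are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed

# Ten WTs of the same model; one randomly chosen "public" turbine is used
# only for the architecture/hyperparameter search, while the remaining
# nine act as client WTs in the federated training.
wt_ids = [f"WT{i:02d}" for i in range(1, 11)]
public_wt = rng.choice(wt_ids)
client_wts = [wt for wt in wt_ids if wt != public_wt]

def chronological_split(series, test_fraction=0.3):
    """Hold out the last `test_fraction` of a client's record as its
    test set, preserving the temporal order of the SCADA time series."""
    n_test = int(len(series) * test_fraction)
    return series[:-n_test], series[-n_test:]

# Toy stand-in for one client's ~50'000-point, 13-month SCADA record.
record = np.arange(100.0)
train_val, test = chronological_split(record)
```

Splitting chronologically rather than randomly mimics deployment: the NBM is evaluated on data recorded after its training period.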

Federated learning of active power models
The first case study demonstrates the privacy-preserving collaborative learning of NBMs of a wind turbine's active power generation. The trained NBMs enable the detection of underperformance faults in the monitored WTs. The normalized 10-minute average wind speed serves as the regressor in the normal behavior model of the power generation. The wind speed was min-max normalized such that all normalized wind speeds lie in the range [0, 1]. We investigate a scenario in which five of the nine WTs are affected by a lack of representative SCADA data in the sense that their data are dominated by low and moderate wind speed observations, whereas observations from time periods of high wind speeds are lacking. In practice, this scenario may arise when the existing WT data were recorded during extended periods of low wind speeds, which are not uncommon in many regions (e.g., in central Europe in summertime) and can last for several weeks and even months (Ohlendorf & Schill, 2020). For each WT, we set aside the last 30% of its SCADA data, i.e., the data gathered in months ~9-13 of the 13-month data collection period, as that turbine's test set. Further, we assign nine randomly selected WTs as "client" turbines. The remaining WT is treated as a public turbine in the sense that its SCADA data will serve us for the model selection. The remaining 70% of the data of each client WT are split into a training set and a validation set in a manner that represents the data-scarce conditions discussed above: We define the training set of each of the five WTs to be composed of the 10-minute average

A. Conventional machine learning
We evaluate the training of an NBM in a conventional, non-distributed machine learning environment. Each client WT individually learns an NBM based on its own past operation data and without any access to data from other WTs of the fleet. This constitutes the default situation in practice: operators typically lack access to data from other fleet members because those turbines have other owners and no data sharing is in place.

B. Federated learning of a single global model
Our second training strategy for the NBM is a federated learning environment. In this setting, a central server communicates with the client WTs in a privacy-preserving manner. We implement the federated averaging approach of McMahan et al. (2017), see Table 1. First, in the initialization step, the server broadcasts the model architecture, determined with the model search over the server-accessible public WT, and further information such as the optimizer, loss, and metrics to the client WTs. The iterative update step consists of the client WTs first updating their models in parallel (which we implemented as running three epochs over their private local training sets) and then sending their model weights back to the server. Next, the server averages the collected client weights and broadcasts the averaged model weights to the client WTs. The averaged model weights represent the global FedAvg model. In an additional side step, all clients evaluate the updated global model on their validation sets and send their validation losses to the server. We repeat the update step until the average validation loss of the clients has not improved within five repetitions, corresponding to 15 local epochs per client. The global federated learning model is then evaluated by calculating the root mean squared error (RMSE) on the test set of each client WT.
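The stopping criterion and the evaluation metric described above can be expressed compactly as follows (a sketch; the function names are ours, and `history` is made-up data):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, used to evaluate the NBM on a test set."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def should_stop(val_losses, patience=5):
    """Stop the federated training once the clients' average validation
    loss has not improved over the best earlier value for `patience`
    consecutive rounds."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# Average validation losses over the rounds (illustrative values):
history = [0.40, 0.31, 0.25, 0.24, 0.26, 0.25, 0.27, 0.24, 0.25]
```

With these values, training continues after round 6 (the loss was still improving within the patience window) and stops after round 9, since no round in the last five improved on the earlier best of 0.24.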

C. Customized federated learning of turbine-specific models
A possible disadvantage of the presented federated learning approach (B) is that it results in a single global model that is not customized to a specific client WT. The individual client WTs may exhibit somewhat different data distribution characteristics depending, for example, on their sites, maintenance history, or local ambient conditions. The feature distributions of the monitored target variable may differ significantly across the fleet (Figure 3). For the client WTs with scarce high wind speed observations, conventional machine learning (strategy A) does not capture the power curve behavior correctly at higher wind speeds (Figure 7.3). For the four WT clients with representative wind speed data, the power curve can be fit accurately even with conventional machine learning on only the local training data. The results of the global federated learning model (strategy B) show a contrast between the client WTs with scarce high wind speed observations and the client WTs with representative wind speeds. For the WTs with scarce high wind speed observations, the RMSEs of the active power NBMs are significantly reduced by the global federated learning (mean: 0.125) compared to the conventional machine learning setting. By receiving shared model parameters from all client WTs through the server aggregation step, the client WTs with scarce high wind speed observations are now able to also model the upper wind speed ranges by means of the shared global model. Therefore, these client WTs benefit from the federated learning process through a significant improvement in model performance. Panel 7.4 shows the accordingly improved power curve of one of the five client WTs with few or no high wind speed data, with a realistic behavior in the upper ranges despite not having any reference data points available in its own local training set. Conversely, the model performance has slightly but noticeably decreased for all but one of the four client WTs with representative wind speed observations (mean: 0.113) under the global federated learning as compared to the conventional machine learning
setting.The averaging step of the global federated learning leads to a loss of individual characteristics contained in the local models of those client WTs.Therefore, as these clients were already locally capable of fitting a model tailored to their individual turbine characteristics, also in the upper wind speed ranges, the averaged global federated learning model leads to a performance loss by incorporating individual information from other turbines.Such performance losses could discourage operators of client WTs with sufficiently representative training data from joining the federated learning process.These client WTs should not drop out of the federated learning though because they are essential to the performance increase of the client WTs with scarce data in this example.Indeed, our results show that a customized federated learning implementation can counteract this issue.The local finetuning of the global federated learning model manages to revert the impact of the global averaging and re-introduces individual characteristics into the models.Thus, the active power NBMs include both global information as well as customized adjustments.Panel 7.5 shows that the active power NBM from the customized federated learning model is very similar to but somewhat deviating from the global federated learning model to correct for local dataset characteristics.Comparing the average performances of the three learning strategies (A-C), the customized federated learning approach (C) accomplished the lowest RMSEs for the clients with scarce high wind speed observations (mean: 0.117) and achieves the same performance as the conventional machine learning strategy (A, mean: 0.104) for clients with representative wind speed observations.Our results suggest that a customization method should be applied for possible performance improvements of the trained NBMs and as an incentive for all client WTs to join the federated training process.Compared to conventional machine learning (A), 
a distributed learning process such as federated learning incurs additional computational costs due to the communication between server and clients, overhead operations, and slower model convergence. Figure 6 shows the measured computational time taken to accomplish the training process for the three learning strategies. All client WTs finish training within less than three minutes in the conventional machine learning setting (A). With the global federated learning strategy (B), the clients require more than 9 minutes for the learning to be accomplished. Given this increase, the training time needs to be investigated when considering federated learning applications with more complex models and with a larger number of client WTs. In customized federated learning (C), the computational costs are dominated by the global learning step, as the customization step requires a finished federated learning process. The added time taken by the actual customization step, i.e., the local finetuning, is negligible in comparison (on average +10.4 seconds). Thus, the local finetuning step is a very cost-efficient improvement.

Federated learning of Bearing Temperature models
The privacy-preserving learning strategies A-C are also investigated in a second case study. A feedforward neural network is trained to model the normal behavior of the gear bearing temperature using SCADA data. To ensure fair comparisons between the strategies, all client WTs and federated learning strategies use the same feedforward MLP model architecture, outlined in Table 4.

Model performance. The accuracies of the NBMs trained with the non-collaborative strategy A are shown in Figure 8. With this strategy, WTs with scarce datasets (mean RMSE: 4.29) have a higher average RMSE than WTs with representative datasets (mean: 3.93). The models trained on scarce datasets with strategy A are not capable of fully capturing the temperature behavior. An example is shown in Figure 9, in which the trained model is unable to adequately estimate temperatures in underrepresented ranges (very low and high temperatures, as shown in 9.1), which leads to larger errors at the lowest and highest observed temperature values on the unseen test dataset (9.2). Comparing the performance of global federated learning (B) to conventional machine learning (A) for the five client WTs with scarce datasets, the global model leads to performance increases in only two client WTs but raises the prediction errors of the NBMs in the three other WTs. The global model results in worse NBM performance even though these three WTs lack representative data and receive shared model parameters. This result suggests that the substantially differing bearing temperature behavior across clients strongly affects the generalizability of the global model, such that one global model trying to combine all individual characteristics cannot always offer a satisfactory fit. Therefore, despite receiving information about temperature ranges not represented in their training sets, these values do not necessarily reflect the actual bearing temperature behavior of a given WT. An example is shown in Figure 9.3, where the global model introduces a strong overestimation of the lower bearing temperatures. For the four clients with a fully representative training set, the global federated learning model leads to a noticeable increase of the RMSE in all cases. The global model incorporates information from all turbines, leading to a loss of individual characteristics within the model and thus to a loss in performance, as already observed in case study 1. A customized federated learning strategy can encourage operators of client WTs without data scarcity to participate in the federated learning process because it can revert potential performance degradation introduced by the global model. Both case studies show that the customization step is a necessity to encourage clients without data scarcity to join the federated learning process. For clients with scarce datasets, the customized federated learning strategy achieves the best performance across all strategies. The local finetuning enables the customized models to retain and transfer usable knowledge from the global model (for data not represented in the scarce dataset) and additionally to incorporate individual characteristics from the private local dataset. An example is shown in Figure 9.4, where the bearing temperature estimates from the customized model are improved in the unseen low and high temperature ranges. Our results suggest that a customized federated learning strategy can enable fleet-wide learning of condition information even in the presence of a significant domain shift.
The computational times taken to train the NBMs following the three learning strategies (Figure 8) confirm the results of case study 1. We observe a strong increase in training time for the global federated learning model compared to conventional machine learning. Training a model according to the conventional machine learning strategy takes 43 seconds on average, while the federated learning process requires more than 10 minutes. In contrast, the increase in time for the local finetuning of the global model (the customization part of strategy C) remains negligible, as it only requires an additional 12.9 seconds of training on average. The results of case study 2 reinforce that a disadvantage of the federated learning process is its additional computational cost, and that customized federated learning (strategy C) is a very time-efficient model improvement strategy. Detailed results for all WTs are shown in Appendix A2. All experiments were run on an Intel Xeon CPU @ 2.20 GHz with implementations using TensorFlow v2.8.3, Keras v2.8, and the tensorflow-federated v0.20.0 framework (Abadi et al., 2016; Chollet et al., 2015).

Second wind farm
We further validate our findings by replicating our case studies using data from a second wind farm. The wind turbines in the two farms belong to different fleets: they have different manufacturers, different rated powers, and major constructional differences. Details and results are provided in Appendix A4. Transfer across different fleets is not within the scope of our study.

Conclusions
A wealth of data is constantly being collected by manufacturers from their wind turbine fleets. Stakeholders interested in those data include operators, owners, manufacturers, third-party companies, regulators, and researchers. There are various reasons why different stakeholders want access to the information contained in a fleet's operation data. Benefits of making the information accessible include technological progress, for example through new and improved data-driven applications, and economic advantages resulting from increased transparency and competition. For example, improved machine learning models can be trained based on a fleet's data to provide better decision support to wind farm operators. This may involve improved predictions of failure events and estimations of the remaining useful lifetime of critical parts. Conventional machine learning on local wind turbine datasets is often applied in practice, but it cannot exploit the information contained in the operation data of distributed wind turbine fleets. Conventional machine learning cannot overcome the lacking access to fleet-wide data because it is incompatible with data privacy needs. We have demonstrated a distributed machine learning approach that enables fleet-wide learning on the locally stored data of the participants in the federated learning process, without sacrificing the privacy of those data. We have investigated the potential of federated learning in case studies in which a subset of wind turbines was affected by a lack of representative data in their training sets. The case studies involve the collaborative learning of normal behavior models of bearing temperatures and power curves for condition monitoring and fault detection applications. The results of our case studies suggest that a conventional machine learning strategy fails to adequately train normal behavior models for fault detection when representative training data are lacking. The presented privacy-preserving federated learning strategy significantly improves the accuracy of normal behavior models for wind turbines lacking representative training data, as they can benefit from the training on the data of other turbines. However, when the distributions of the monitored variable differ strongly across the fleet, a single global model shared by all turbines can deteriorate the performance of the normal behavior models compared to conventional machine learning, even if representative training data are lacking. We have presented a customized federated learning strategy to address this challenge of heterogeneously distributed target variables. By customizing the global model to each client WT through local finetuning of neural network layers, we successfully revert the performance losses of the global model, so that no turbine suffers a performance loss by participating in the federated learning process. Customized federated learning yields the best model performance across all compared learning strategies. Our case studies suggest that fleet-wide learning and sharing of condition information can be achieved even where the monitored target variable is distributed heterogeneously across the fleet. Client WTs with scarce training sets were able to extract and customize knowledge from other fleet members. The federated learning process increased the average model training time by factors of 7 and 14 in the presented case studies, which can be attributed to more comprehensive communication and overhead operations and slower model convergence in the federated learning process. Our federated learning method offers a solution to a major problem in energy and power system fleets: the lack of data sharing, which "is hindering technical progress (...) in the renewable energy industry" (Kusiak, 2016). Future research directions may involve investigating further applications of federated learning in renewable energy domains, various customization strategies, and different characteristics and effects of heterogeneously distributed target variables. Future work should also investigate how model training times scale with fleet size for large fleets and with possibly more complex models such as multi-target normal behavior models.

A3. Customized federated learning
We employed a customization approach by finetuning the global federated learning model, as outlined in section 4. This finetuning process, resembling a transfer learning approach, involves freezing the weights of chosen layers and only training the weights of the remaining layers for several epochs with a smaller learning rate, to adjust the pretrained weights to the local dataset of the client. The model consists of three layers with trainable weights (Tables 3 and 4). Thus, we have evaluated the options of 1) only finetuning the last layer (1 finetuned layer), 2) finetuning the last two layers (2 finetuned layers), and 3) finetuning all trainable layers (3 finetuned layers), while keeping the weights of the other layers fixed at their states in the global federated learning model. For each of the three options, we trained the model using a smaller learning rate (half of the learning rate used in the conventional and standard federated learning process) until the validation loss, defined as the root mean squared error on the validation set, did not improve for 5 epochs. Tables A3 and A4 show the results on the validation set for each client WT in the case studies. For each client turbine, we choose the best-performing model of the three options, that is, the model with the lowest validation loss, as the customized federated learning model used for the evaluation in Tables A1 and A2.
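The final selection step described above, picking the finetuning option with the lowest validation RMSE, can be sketched in a few lines. This is an illustrative sketch only: the candidate predictors below are hypothetical stand-ins for the three finetuned MLPs, and the helper name is our own.

```python
import numpy as np

def select_finetuned_model(candidates, x_val, y_val):
    """Pick the customization option (1, 2, or 3 finetuned layers)
    with the lowest validation RMSE, as done for Tables A3 and A4.

    candidates: dict mapping option name -> prediction function.
    Returns the name of the best option and all validation losses.
    """
    def rmse(predict):
        return float(np.sqrt(np.mean((predict(x_val) - y_val) ** 2)))

    losses = {name: rmse(f) for name, f in candidates.items()}
    best = min(losses, key=losses.get)
    return best, losses

# Toy validation data and hypothetical stand-ins for the three
# finetuned models (NOT the models trained in the study)
x_val = np.linspace(0.0, 1.0, 50)
y_val = 2.0 * x_val
candidates = {
    "1 finetuned layer": lambda x: 2.0 * x + 0.3,
    "2 finetuned layers": lambda x: 2.0 * x + 0.1,
    "3 finetuned layers": lambda x: 1.5 * x,
}
best, losses = select_finetuned_model(candidates, x_val, y_val)
print(best)  # → 2 finetuned layers
```

The same per-client selection is applied independently for every WT, so different clients may end up finetuning a different number of layers.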
Table A3. The root mean squared errors calculated over the respective client WT's validation set for the three evaluated customization experiments in case study 1. "WT": wind turbine; "FL": federated learning.

A4. Second wind farm dataset
We additionally investigated our two presented case studies (section 3) using data from the publicly available Penmanshiel wind farm dataset (Plumley, 2022). The onshore wind farm consists of 14 identical WTs of the same configuration (Table A5). The dataset comprises 10-minute averages of SCADA data recorded across a time span of 5 years. Each WT's local dataset contains around 150,000 valid data points per variable, including wind speeds measured at the nacelle, power generation, and gear bearing temperatures. We assume that no data sharing between WTs is allowed.

A4.1 Case Studies
We apply the identical case study designs as outlined in sections 3.1 and 3.2. For the first case study, the federated learning of active power models, the normalized 10-minute average wind speed serves as the regressor of the power generation. Seven WTs, that is, 50% of the turbines in the wind farm, are affected by a lack of representative training data in our scenario. For each WT, the last 30% of its SCADA data are set aside as the test set. The remaining 70% are split into a training and a validation set. For the seven WTs affected by data scarcity, only the four weeks with the lowest average wind speeds comprise the training set, with the remainder belonging to the validation set. For the other half, the remaining 70% of SCADA data are split into a training set (the first 70% of data) and a validation set (the last 30% of data). For the second case study, the federated learning of bearing temperature models, the normalized 10-minute rotor speeds and power generation are the regressor inputs to the model predicting the (front) bearing temperature. The WTs' datasets were split according to the same scheme as in the first case study, with the difference that the training sets of the seven randomly chosen WTs affected by the data scarcity scenario now consist of only four randomly selected consecutive weeks of data. Figure A3 shows the distributions of the monitored variables in each case study (power generation, bearing temperature) for all 14 WTs in the wind farm. While the active power exhibits almost identical distributions across the wind farm, the temperature distributions show significant differences. These characteristics are in accordance with the setting discussed in section 3.3.
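The split described above (last 30% as test set; for data-scarce WTs, only the four weeks with the lowest mean wind speed as training data) can be sketched as follows. This is an illustrative pandas sketch under assumed column names (e.g. `wind_speed`) and synthetic data, not the code used in the study.

```python
import numpy as np
import pandas as pd

def split_scada(df, scarce):
    """Split one WT's time-indexed SCADA frame into train/val/test.

    scarce=True keeps only the four calendar weeks with the lowest
    mean wind speed as training data; the rest of the first 70%
    becomes the validation set (helper and column name are assumed).
    """
    n = len(df)
    dev, test = df.iloc[: int(0.7 * n)], df.iloc[int(0.7 * n):]
    if scarce:
        # rank 7-day blocks by mean wind speed, keep the 4 calmest
        groups = dev.groupby(pd.Grouper(freq="7D"))
        calm_weeks = groups["wind_speed"].mean().nsmallest(4).index
        train = pd.concat([g for key, g in groups if key in calm_weeks])
        val = dev.drop(train.index)
    else:
        # representative WT: first 70% of the dev data train, rest validate
        m = int(0.7 * len(dev))
        train, val = dev.iloc[:m], dev.iloc[m:]
    return train, val, test

# Synthetic 10-minute SCADA series spanning ten weeks
idx = pd.date_range("2020-01-01", periods=10 * 7 * 144, freq="10min")
rng = np.random.default_rng(0)
df = pd.DataFrame({"wind_speed": rng.uniform(0, 25, len(idx))}, index=idx)
train, val, test = split_scada(df, scarce=True)
```

For the bearing temperature case study, the same scheme applies, except that the scarce training sets are four randomly chosen consecutive weeks rather than the calmest weeks.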

A4.2 Results
We evaluate the presented strategies A-C (conventional machine learning, global federated learning, customized federated learning) from section 4.1 for both case studies.

A4.2.1 Federated Learning of Active Power models
In case study 1, a NBM of the power generation is trained. We use the same model architecture and configuration summarized in Table 4. The results, shown in Figure A4 and Table A6, validate our previous findings of case study 1 discussed in section 4.2. For WTs lacking representative training data, the conventional machine learning strategy results in a poor fit (mean RMSE: 0.188), as the local training sets lack representative data for high wind speed ranges. These WTs benefit from a significant error reduction by participating in the global federated learning process (mean: 0.039). The global model, however, results in a performance loss for WTs with representative training sets (mean: 0.038) compared to strategy A (mean: 0.035). Customized federated learning reverts these performance losses back to the original level (mean: 0.034) by enabling the WTs to adjust the global model to their local datasets, again resulting in the overall best-performing strategy. In terms of computational time, the average training time of the federated learning strategy increased by a factor of 18 compared to the conventional machine learning strategy. The additional training time for the customized federated learning strategy, i.e., the local finetuning, remains negligible (on average +29.4 seconds).
A4.2.2 Federated Learning of Bearing Temperature models
In line with our observations from section 4.3, a customized federated learning strategy can not only revert the performance losses for WTs with representative training data (mean RMSE by strategy: A: 5.82, B: 7.04, C: 5.82), it also enables data-scarce WTs to retain and transfer knowledge from the global model, such that this strategy results in the lowest error for these WTs in this scenario (mean: 5.91). The average training time of the global federated learning strategy increased by a factor of 7 compared to the conventional machine learning strategy, while the efficient local finetuning step only required an average additional training time of 30.9 seconds.
3. Each client trains a local model with stochastic gradient descent on its local training data. The local training is performed by all clients in parallel.
4. The updated weights are transmitted to the server, which averages them. The resulting average weights are sent to the clients to serve as their new model weights and as the starting weights of the next iteration.
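The server-side averaging in step 4 can be sketched in a few lines. This is an illustrative numpy sketch of the sample-size-weighted FedAvg aggregation (McMahan et al., 2017), not the tensorflow-federated implementation used in our experiments; the function name and the toy client weights are our own.

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Server-side FedAvg step: sample-size-weighted average of the
    clients' model weights (one list of per-layer arrays per client)."""
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    averaged = []
    for layer in range(n_layers):
        # weight each client's layer by its share of the training samples
        layer_avg = sum(
            (n_k / total) * w[layer]
            for w, n_k in zip(client_weights, client_sizes)
        )
        averaged.append(layer_avg)
    return averaged

# Two toy clients, each with a single-layer "model"
w_a = [np.array([0.0, 2.0])]
w_b = [np.array([4.0, 6.0])]
new_global = fedavg_aggregate([w_a, w_b], client_sizes=[1, 3])
# client b holds 3 of the 4 samples, so its weights dominate
print(new_global[0])  # → [3. 5.]
```

The averaged weights are then broadcast back to the clients as the starting point of the next round.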
The training sets of the five data-scarce WTs contain only the wind speed and power generation values of the four weeks with the lowest average wind speeds out of the considered 9-month measurement period. Thus, the training sets of the five WTs are characterized by low and moderate wind speed conditions. All other time periods form the validation set of the respective client WT. The training and validation sets of the remaining four WTs comprise all wind resources, including low, moderate, and high wind speeds. The last 30% of the training set data form the validation set for these clients. An illustration of the training, validation, and test datasets is given in Figure 1 for one of the five data-scarce WTs and for one of the four WTs with representative training data. The accuracy of the power curves of the five WTs is limited due to the lack of observations of high wind speeds in their local training data. Note that the data from the four WTs with representative training data are inaccessible to the data-scarce wind turbines. It is therefore not possible to derive and transfer a power curve from any of those four WTs to any data-scarce WT, because the data are local and thus unavailable to standard (non-federated) learning approaches.

Figure 1 .
Figure 1. Datasets of two different client turbines. First row: Only data from the four weeks with the lowest average wind speed were kept for the training set of this client turbine. The training set does not contain sufficient data to represent the true power curve behavior in high wind speed situations (upper left panel). Second row: Wind speed and power data from a client WT whose training data contain representatively distributed wind speed observations.

Figure 2 .
Figure 2. Datasets of two client WTs. First row: Only data from four randomly chosen consecutive weeks were kept for the training set of this client turbine. In this case, the training set contains insufficient data to represent the temperature behavior in low temperature situations (upper left panel). Second row: Gear bearing temperature and rotor speed data from a client WT whose training data contain representative temperature observations.

3.3 Heterogeneously distributed target variables
Deviations in the data distributions across the local datasets of the participating client WTs can negatively affect the FedAvg learning process, as discussed in section 2. Our case studies exhibit different degrees of distribution shifts in the target variables, enabling us to investigate the effects of deviating distributions of the monitored variables. Figure 3 shows the distributions of the active power generation and the gear bearing temperatures across the nine WTs participating in the federated training. The distributions of active power (i.e., the target variable of the first case study) display only minimal differences across the client WTs. One therefore expects that a global federated model should be able to capture information that is generalizable across WTs. We assess how this global knowledge can be shared and utilized by WTs with scarce training datasets in the case studies. We also evaluate how the loss of WT-specific information in the global model affects WTs with representative training datasets, and the utility of customized models under these conditions. The distributions of the gear bearing temperatures show distinct differences across all client WTs. A globally shared model may have difficulties capturing global information that is generalizable across WTs. Customization to individual WTs may improve the model performance in the case of non-identically distributed datasets. We assess the effect of the observed distribution shifts on the performance of the global model in the second case study, and whether collaborative condition information sharing across the fleet is still possible and beneficial for the participating WTs under these conditions.
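The degree of distribution shift discussed above can be quantified with a simple histogram overlap coefficient (1 for identical empirical distributions, 0 for disjoint ones). The following numpy sketch is our own illustrative helper, not part of the study; the synthetic samples merely mimic the contrast between the near-identical active power distributions and the strongly differing bearing temperature distributions of Figure 3.

```python
import numpy as np

def histogram_overlap(a, b, bins=30):
    """Overlap coefficient of two empirical distributions, a simple
    numeric stand-in for visually comparing kernel density estimates."""
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    pa, _ = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    pb, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    width = (hi - lo) / bins
    # integrate the pointwise minimum of the two densities
    return float(np.sum(np.minimum(pa, pb)) * width)

rng = np.random.default_rng(1)
# near-identical "active power" distributions of two clients
power_a = rng.normal(0.5, 0.1, 10_000)
power_b = rng.normal(0.5, 0.1, 10_000)
# strongly shifted "bearing temperature" distributions of two clients
temp_a = rng.normal(40.0, 2.0, 10_000)
temp_b = rng.normal(50.0, 2.0, 10_000)
print(histogram_overlap(power_a, power_b))  # close to 1
print(histogram_overlap(temp_a, temp_b))    # close to 0
```

A low overlap across clients signals the heterogeneous setting in which a single global model is expected to struggle and customization becomes valuable.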

Figure 3 .
Figure 3. Kernel density estimates of the distributions of the monitored variables on the test set of all nine client WTs.

Figure 4 .
Figure 4. Learning strategies applied in the case studies: Conventional machine learning (A), Federated learning of a single global NBM (B), Customized federated learning of WT-specific NBMs (C).
Such differences are not represented by the global NBM, which may result in performance losses of the model for some client WTs. Some turbine operators might be incentivized to opt out of the federated learning process if they find that a local NBM trained only on their local data with conventional machine learning (A) outperforms the global NBM (B). Training WT-specific NBMs can make it attractive for all client WTs to join the training, so we customize the MLP that represents the global NBM to specific client WTs. After training a single global MLP model based on FedAvg (B), we achieve the customization by having each client WT finetune a subset of the trained layers of the global MLP on its local dataset (Figure 5 and appendix A3). This turbine-specific finetuning resembles transfer learning methods in which neural network layers of a previously trained model are finetuned on a separate dataset (Collins et al., 2022; Kulkarni et al., 2020; Pan & Yang, 2010; Tan et al., 2022; Zhuang et al., 2021). Based on the validation set losses, each client WT optimizes the number of layers to finetune in its customized MLP. The weights of the other layers remain fixed at the weights of the global federated learning model. The resulting model performances are presented in Appendix A3. The customized model with the lowest RMSE on the client WT's validation set was finally evaluated on each test set.

Figure 5 .
Figure 5. Illustration of the federated learning process with customization in step 5. Step 1: The server initializes an empty model and broadcasts the architecture to the clients. Step 2: Each client updates its model weights by running training epochs over its private local dataset. Step 3: The clients broadcast their model weights to the server, which aggregates them into a server model. Step 4: The server broadcasts the calculated model to the clients. Steps (2)-(4) are repeated until a training stop criterion is satisfied. At the end of step 4, the server and clients share the same ("global") model weights. The customization step 5 involves the finetuning of layer weights of the global model trained in steps 1-4.

4.2 Federated learning of Active Power models
In case study 1, a feedforward neural network is trained as a NBM of the power generation. The inputs to the model are the normalized SCADA wind speeds. The model outputs a prediction of the active power generation in MW. In each case study, all client WTs and federated learning strategies make use of the same feedforward multilayer perceptron (MLP) model architecture to ensure fair comparisons among experiments. The respective MLP architecture is determined by applying a random search model selection algorithm on the SCADA dataset of the public turbine. For the first case study, the resulting model architecture is summarized in Table 3. The search algorithm and model architecture selection are outlined in Appendix A1. In the conventional machine learning according to strategy A, each client WT minimizes the mean squared error loss over the training set by applying stochastic gradient descent (SGD). Training is stopped once the client WT's validation set loss has not improved within 15 epochs. The model performance is finally evaluated as the RMSE on the test set of the client WT. The results are summarized in Figure 6. Detailed results for each WT are presented in Appendix A2.
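The strategy-A training loop with patience-based early stopping can be sketched as follows. For a dependency-free illustration, we substitute a linear model and plain numpy for the paper's Keras MLP; the function name, learning rate, and synthetic data are our own assumptions, while the stopping criterion (no validation improvement within 15 epochs) matches the description above.

```python
import numpy as np

def train_with_early_stopping(x_tr, y_tr, x_val, y_val,
                              lr=0.05, patience=15, max_epochs=500):
    """Minimal stand-in for the strategy-A loop: per-sample SGD on the
    local training set, stopped once the validation RMSE has not
    improved for `patience` epochs; returns the best weights found."""
    rng = np.random.default_rng(0)
    w, b = 0.0, 0.0
    best_rmse, best_params, stale = np.inf, (w, b), 0
    for _ in range(max_epochs):
        for i in rng.permutation(len(x_tr)):  # one SGD pass per epoch
            err = (w * x_tr[i] + b) - y_tr[i]
            w -= lr * err * x_tr[i]
            b -= lr * err
        val_rmse = float(np.sqrt(np.mean((w * x_val + b - y_val) ** 2)))
        if val_rmse < best_rmse:
            best_rmse, best_params, stale = val_rmse, (w, b), 0
        else:
            stale += 1
            if stale >= patience:  # early stopping criterion
                break
    return best_params, best_rmse

# Synthetic noiseless "power curve" data for illustration
x = np.linspace(0.0, 1.0, 200)
y = 2.0 * x + 0.5
(w, b), val_rmse = train_with_early_stopping(x[:150], y[:150],
                                             x[150:], y[150:])
```

In the actual experiments, the same effect is obtained with a Keras early-stopping callback on the MLP; the final evaluation is then the RMSE on the held-out test set.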

Figure 6 .
Figure 6. Left: Performances of the training strategies on the test set in terms of mean RMSE between the NBM-predicted power and the actual monitored variable in case study 1. Right: Mean training time in seconds for all three learning strategies. The error bars display the standard deviation.

Figure 7 .
Figure 7. Training set (7.1) and test set (7.2) for a randomly selected one of the five WTs with few or no high wind speed data in their training sets, and the power curve models trained for that WT based on conventional machine learning (7.3), the global federated learning model (7.4), and the customized federated learning model (7.5). As the training set of the WT contains only few data points for high wind speeds, the conventional machine learning model fails at modeling the true power curve behavior for higher wind speeds, as shown by the underlying test set data. By privacy-preserving learning from other WTs, the global federated learning model (7.4) can model these higher ranges. The finetuning step in the customized approach slightly adjusts the global model (dashed line) to the private training set (7.5).

Table 4. The model architecture of the bearing temperature NBM used in all experiments of the second case study.

Figure 8. Left: Performances of the training strategies on the test set in terms of mean RMSE between the NBM-predicted temperature and the actual monitored variable in case study 2. Right: Mean training time in seconds for all three learning strategies. The error bars display the standard deviation.

Figure 9 .
Figure 9. Actual versus predicted gear bearing temperatures based on a NBM of a WT with scarce training data. With a perfect model, all data points would be located on the diagonal line. Panels 9.1 and 9.2 show predictions using conventional machine learning on the training set and test set, respectively. Panels 9.3 and 9.4 show the test set predictions by the global federated learning NBM and by the customized federated learning NBM.
Figure A1 illustrates the training, validation, and test set for one of the seven data-scarce WTs and one of the seven WTs with representative training sets.

Figure A1 .
Figure A1. Datasets of two different client turbines from the Penmanshiel wind farm. First row: Only data from the four weeks with the lowest average wind speed were kept for the training set of this client turbine. The training set does not contain sufficient data to represent the true power curve behavior in high wind speed situations (upper left panel). Second row: Wind speed and power data from a client WT whose training data contain representatively distributed wind speed observations.
Figure A2 illustrates the training, validation, and test set for one of the seven data-scarce WTs and one of the seven WTs with representative training sets.

Figure A2 .
Figure A2. Datasets of two client WTs from the Penmanshiel wind farm. First row: Only data from four randomly chosen consecutive weeks were kept for the training set of this client turbine. Second row: Gear bearing temperature and rotor speed data from a client WT whose training data contain representative temperature observations.

Figure A3 .
Figure A3. Kernel density estimates of the distributions of the monitored variables on the test set of all 14 client WTs of the Penmanshiel wind farm.

Table 1 .
Steps in the training of a federated learning model, based on McMahan et al., 2017. In each communication round $t$, every client WT first updates the received global model weights $w^t$ by gradient descent on its local dataset,
$$w_k^{t+1} = w^t - \eta \, \nabla \mathcal{L}_k(x_k, y_k, w^t),$$
wherein $\eta$ is the learning rate and $\nabla \mathcal{L}_k(x_k, y_k, w^t)$, $k = 1, \dots, K$, denotes the gradients on $(x_k, y_k)$ of client WT $k$ with regard to the model weights $w^t$. The central server then aggregates the received weights and returns an updated model state
$$w^{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} \, w_k^{t+1},$$
where $n_k$ is the number of local training samples of client $k$ and $n = \sum_k n_k$.

Table 2 .
Technical specifications of the wind turbines employed in the case studies.
This case study demonstrates just one of several possible scenarios of lacking representative training data for NBMs. It serves to illustrate the advantages of federated learning approaches for condition monitoring and fault diagnostics tasks. A lack of representative SCADA data from a particular WT means that accurate NBMs can hardly be estimated for that WT with conventional machine learning approaches. It may take up to several months of SCADA data collection until a sufficiently representative dataset has been collected for training a new NBM from the WT's own SCADA data. Active power monitoring and detection of underperformance faults are hardly possible during this time period. We demonstrate that collaborative learning of the nine WTs can mitigate this lack of training data and allows learning accurate power curve NBMs in a privacy-preserving manner.

Table 3 .
The model architecture of the Active Power NBM used in all experiments of the first case study.

Table A2 .
Training strategies' performance on the test set in terms of RMSE between the NBM-predicted and actual gear bearing temperatures in °C in the second case study. "Scarce" and "Repres." denote the WTs whose training sets consist of four randomly chosen consecutive weeks and of representative gear bearing temperature observations, respectively. "Conv. ML": conventional non-collaborative machine learning; "Global FL": federated learning with global model; "Cust. FL": customized federated learning. "Training Time" is the time required for the model training to finish, in seconds.

Table A4 .
The root mean squared errors calculated over the respective client WT's validation set for the three evaluated customization experiments in case study 2. "WT": wind turbine; "FL": federated learning.

Table A5 .
Technical specifications of the wind turbines from the Penmanshiel wind farm employed in the case studies.

Table A7 .
Training strategies' performance on the test set in terms of RMSE between the NBM-predicted and actual gear bearing temperatures in °C in the second case study using data from the Penmanshiel wind farm dataset. "Scarce" and "Repres." denote the WTs whose training sets consist of four randomly chosen consecutive weeks and of representative gear bearing temperature observations, respectively. "Conv. ML": conventional non-collaborative machine learning; "Global FL": federated learning with global model; "Cust. FL": customized federated learning. "Training Time" is the time required for the model training to finish, in seconds.