1. Introduction
Wind and solar photovoltaic power plants are now the cheapest options for electricity generation in most parts of the world [
1], paving the way for the extensive decarbonization [
2,
3,
4] of the power sector and the economy. However, in the case of wind energy, the accurate assessment of the wind resource at a prospective wind farm site continues to be key to successful development and competitive pricing. While traditional modeling approaches [
5,
6] for wind project development [
7] are often sufficiently accurate in flat or mildly rolling landscapes, wind power predictions in complex terrain are considerably more challenging [
6,
8,
9,
10,
11]. A 5% mean error in predicted wind speed can be the threshold for the approval or rejection of a wind project. Although traditional modeling approaches based on linearized flow solvers are nowadays often complemented with Reynolds-averaged Navier–Stokes (RANS) flow models and sometimes even large-eddy simulations (LES), the benefit is not always clear. Well-tuned linearized flow models, complemented by expert knowledge, may often yield similar or better results compared to advanced computational fluid dynamics (CFD) models.
In an attempt to evaluate and compare the performance of numerical wind flow models, different validation studies have been reported in the literature. Flow modeling approaches include assessments using wind industry-standard linearized flow models such as the one included in the Wind Atlas Analysis and Application Program (WAsP) [
5], and more complex CFD approaches, e.g., RANS and LES. Beaucage et al. [
6] performed a validation study over different terrain features, where the linearized flow model in WAsP and the Meteodyn CFD model were compared. In contrast to what is normally assumed, the results of WAsP were similar to or better than those of the CFD model, with an overall root mean squared error (RMSE) of 0.62 m/s (corresponding to an 8.0% error) for WAsP and 0.76 m/s (9.4%) for Meteodyn. Another comparative analysis was performed between the linearized WAsP model and the RANS model also available within the WAsP suite (WAsP CFD) at a complex site in Brazil [
8]. In this study, it was pointed out that the WAsP CFD model did not present a clear advantage over the linearized version. The authors conjectured that thermal effects not considered by either model contributed substantially to the uncertainty of both results. Other studies have suggested that CFD models can outperform linear models [
9,
10,
11]; Hristov et al. [
11] suggested improvements of about 8% when predicting the annual energy production (AEP).
More recently, statistical learning methods have made their appearance in wind resource assessment. Such methods have the potential for taking advantage of a larger amount of input data than conventional approaches. In such data-based methods, the learning process commonly relies on terrain and meteorological features to estimate the target variable. This provides greater flexibility for considering microclimatic effects that are generally not accounted for by flow models. Examples of machine learning methods include ensembles of regression trees [
12], support vector regression (SVR) [
13] and neural networks (NN) [
14].
The combination of physics-based and data-based methods may be a natural solution to the problem of accurate wind resource estimation in complex terrain. Such hybrid approaches take advantage of each method’s strengths to increase robustness and predictive performance. However, for wind resource assessment, these methods are less popular. According to a literature review, only Tang et al. [
15] appear to have used a method that combines flow (CFD) simulations with a data-driven technique for assimilating multiple on-site measurements in complex terrain, in addition to a more traditional inverse-distance weighting (IDW) method [
16].
Here, we address the continuing challenges of accurate wind resource estimation in complex terrain by designing a method which taps into the capabilities of modern wind resource modeling suites such as WAsP or WindSim, but simultaneously provides the capability of accounting for microclimatic effects, which are generally not properly addressed in flow models. The proposed new method is based on a conceptually simple machine learning approach, the k-nearest neighbors (
kNN) method (see, e.g., [
17]). To the best of our knowledge, this approach has not been used before. The basic idea was to take advantage of the
similarity between different locations at a prospective wind farm site in terms of a set of classifiers or
features (
Section 2.1). Ideally, it should be possible to determine such features with the terrain information alone (elevation and roughness) and possibly wind resource information from one reference location, very much like in conventional flow models. It will be argued below that all feature parameters required for this work can actually be determined with flow models such as WAsP or WindSim alone, although an improved prediction can be obtained in the case of power density estimates if the wind rose at one reference location is known. The new
kNN method can then be used in complete analogy to a conventional flow model to predict the wind resource at an arbitrary location for the purpose of turbine yield calculation or wind mapping. It should be noted that while the new method essentially uses the same input information as a conventional flow model, it provides greater flexibility, since it is not restricted by the deterministic relationships between a target and a reference location, which necessarily consider only the properties of the flow model but not the microclimate.
In order to build a reference case against which the new method can be compared, flow simulations based on both the linearized model within the WAsP suite (
Section 2.6) and the RANS-CFD model (WindSim) (
Section 2.7) were conducted first and the results were analyzed by cross-validation. The setup of both models was explored in a detailed manner, ensuring that the models were optimally configured and not artificially underperforming. This included the fine-tuning for atmospheric stability (in the case of WAsP), the use of both standard and high-resolution roughness maps, the exploration of a detailed forest model (in the case of WindSim), and the study of two different turbulence closure models (in the case of WindSim).
A number of different implementations were studied for the
kNN approach (
Section 2.4). The basic approach consists of directly using the observed wind speed or power density at the available met tower locations, with the exception of the target location used for validation in each turn. Alternatively, the local wind climates can be first transferred to the target location, and the
kNN can then be applied to the transferred climates; this is what we call the
hybrid approach below. Given that the hybrid method is a statistically independent implementation, the linear combination of both methods, i.e., an
ensemble, bears the potential of providing more accurate predictions and was implemented as well.
2. Methods and Data
2.1. The kNN Method Applied to Wind Resource Modeling: The General Concept
The general idea of the
kNN approach is to determine similarity between different locations, using a certain number
${N}_{f}$ of
features as variables in a generalized coordinate space. Each site can then be represented as a point in this
${N}_{f}$-dimensional space, and its distance from any other site can be determined by an appropriate norm. Here, the L2-norm
${L}_{2}\left(x\right)={(\sum {x}_{i}^{2})}^{1/2}$ was used throughout, after first standardizing each feature variable
x through
$x\to (x-{\mu}_{x})/{\sigma}_{x}$. Feature candidates include strictly terrain-related variables, such as the terrain complexity parameter RIX [
18], speedup and (geometric) distance between sites; variables related to local atmospheric conditions, such as temperature and turbulence intensity; and mixed variables such as the vertical wind shear, which depend on both terrain-dependent flow and atmospheric stability.
The key assumption of the
kNN method is that the variable of interest, e.g., the wind speed
${v}_{n}=v\left({t}_{n}\right)$ at time step
${t}_{n}$, can be predicted from the values of its
k nearest neighbors in the
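${N}_{f}$-dimensional feature space, either by a simple or a weighted average:
$${\widehat{v}}_{n}\left({x}_{0}\right)=\frac{\sum _{{x}_{i}\in {N}_{k}\left({x}_{0}\right)}{w}_{i}\,{v}_{n}\left({x}_{i}\right)}{\sum _{{x}_{i}\in {N}_{k}\left({x}_{0}\right)}{w}_{i}}\qquad (1)$$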
where
${x}_{0}$ is the target site and
${N}_{k}\left({x}_{0}\right)$ is the set of nearest neighbors identified by the algorithm. The weights
${w}_{i}$ were taken either as constant (uniform weighting) or as
${w}_{i}=1/d({x}_{0},{x}_{i})$ (inverse distance weighting, IDW), where
$d({x}_{0},{x}_{i})$ is the distance between features (=
${L}_{2}\left({x}_{0}-{x}_{i}\right)$). It should be noted that the number
k of nearest neighbors can be dynamically determined at each time step by the algorithm, which allows time-dependent features to be accounted for.
The
kNN procedure is illustrated in
Figure 1 for the case of two features, which in this case are the RIX number and the speedup. The differences in RIX between the predictor and the predicted site ($\Delta $RIX) can be used to correct results obtained with the linearized flow model in WAsP for complex terrain analysis [
5]. Here, RIX is used as one of the potential features determining similarity. In
Figure 1, the target site is shown as a diamond-shaped location in the RIX-speedup plane, together with a number of potential predictor sites. Note that both features may be taken as constant or time dependent; in the more general case of time-dependent features,
Figure 1 shows a snapshot of a set of locations at a given time step
${t}_{n}$. The methodology can also be extended to include the
time as an additional feature.
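The prediction step described in this section can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation used in the study: the function names `standardize` and `knn_predict`, the feature values, and the wind speeds are hypothetical, and the sketch assumes that no predictor site coincides exactly with the target in feature space and that every feature varies between sites.

```python
import math


def standardize(rows):
    """Apply x -> (x - mu) / sigma to every feature column of `rows`."""
    n = len(rows)
    cols = list(zip(*rows))
    mus = [sum(c) / n for c in cols]
    sigmas = [math.sqrt(sum((v - m) ** 2 for v in c) / n)
              for c, m in zip(cols, mus)]
    return [[(v - m) / s for v, m, s in zip(r, mus, sigmas)] for r in rows]


def knn_predict(features, values, target, k, weighting="uniform"):
    """Predict the value at `target` from its k nearest predictor sites.

    features: standardized feature vectors of the predictor sites
    values:   observed values (e.g. wind speed at one time step)
    target:   standardized feature vector of the target site
    """
    dists = [math.dist(f, target) for f in features]  # L2 norm
    nearest = sorted(range(len(features)), key=lambda i: dists[i])[:k]
    if weighting == "uniform":
        weights = [1.0] * len(nearest)
    else:
        # Inverse-distance weighting; assumes all distances are non-zero.
        weights = [1.0 / dists[i] for i in nearest]
    return sum(w * values[i] for w, i in zip(weights, nearest)) / sum(weights)


# Hypothetical (RIX, speedup) pairs: three predictor sites, then the target.
raw = [[5.0, 1.02], [12.0, 1.10], [20.0, 1.25], [8.0, 1.05]]
scaled = standardize(raw)
speeds = [7.1, 7.8, 8.6]  # hypothetical wind speeds in m/s
print(knn_predict(scaled[:3], speeds, scaled[3], k=2, weighting="distance"))
```

For time-dependent features, the same call would simply be repeated at every time step with the feature values valid at that step.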
2.2. Hyperparameter Estimation in kNN Models
In a
kNN regression, the number of nearest neighbors in feature space
k is a
hyperparameter that determines the accuracy of the model. The other hyperparameter used in this work is the type of weighting
${w}_{i}$ (uniform or inverse-distance weighting); see Equation (
1). Both hyperparameters are evaluated systematically in a process called
parameter tuning, which determines the optimal value of
k so that prediction errors are minimized. The standard methodology is based on splitting the data set into a training and a testing set, in such a way that the training set allows the optimal
${k}^{\ast}$ value to be estimated (usually by cross-validation to avoid overfitting). Then, the selected
${k}^{\ast}$ value is evaluated on the testing set (unseen data) to obtain an unbiased estimate of the model’s performance. In the present work, this approach was used for reference purposes (see below); however, the main contribution of this work to the
kNN methodology is the estimation of the optimal hyperparameters without the target location, while still using the full observational period, thereby avoiding seasonality biases; a detailed description can be found in
Section 2.4. The two conceptually different approaches used in the work for hyperparameter estimation can then be described as follows:
In the first approach, the complete annual set of wind measurements was used to obtain the
${k}^{\ast}$ minimizing the prediction errors for each site. In a first step (termed method kNN0 in
Section 2.4), data from all locations, including the target site, were used for that purpose. This step provides a fit to the data, not a prediction. It does, however, allow creating prior knowledge of the best hyperparameters possible for the reference sites. An estimation of the optimal hyperparameters
without using data from the target site, required for wind resource estimation at a location without measured data, is then conducted in a separate step, as described in
Section 2.2.1.
The second approach consists of the implementation of a nested cross-validation method, similar to the standard procedure of splitting the data set into a training and evaluation set. As opposed to the standard procedure, however, biases are avoided by using training periods of variable lengths, while maintaining validation and testing periods at constant lengths (see
Section 2.2.2).
2.2.1. Hyperparameter Estimation with the Full Data Set
The parameter estimation for the full observational period is based on the following four algorithms. Algorithm 1 estimates the wind speed at each time step i using multiple combinations of features and a maximum of k neighbors. The uniformly weighted kNN average ${\widehat{y}}_{i}$ and the inverse-distance weighted kNN prediction ${\widehat{y}}_{i}^{\left(w\right)}$ were used to estimate the wind speed. The mean percentage error MPE was used as the main error metric.
Algorithm 2 was used to determine the best hyperparameter
k and its associated regression type
c, either uniform or inverse-distance weighted. Those best
ks were selected for each site and for each set of features based on their MPE values. A ranking of the best set of features for the study site was determined by
cardinality (i.e., the number of features in a set) based on the average of the MPE of all sites. Here, the four highest-ranking sets of features (i.e., the four best indicators of similarity) are reported in all cases.
Algorithm 1 kNN regression with subset and number of neighbors selection
Let X be the matrix of predictors and y be the vector of target values
Let ${x}_{0}$ be the vector of query predictors
Let $\mathcal{F}$ be the set of all possible combinations of features
Let K be the maximum number of neighbors to consider
Let $\overline{z}\equiv \frac{1}{N}\sum _{i=1}^{N}{z}_{i}$
for each ${f}_{\alpha}\subseteq \mathcal{F}$ do
    for $k\leftarrow 1$ to K do
        for $i\leftarrow 1$ to N do
            ${D}_{i,j}\leftarrow {\left\Vert {x}_{j}-{x}_{0i}\right\Vert}_{2},\forall {x}_{j}\in X$ ▹ Variable D stores the L2 norm between query site ${x}_{0i}$ and each predictor site ${x}_{j}$ for each instance i.
            ${w}_{i,j}\leftarrow \frac{1}{{D}_{i,j}}$
            Let ${N}_{0}$ be an empty set
            for $i\leftarrow 1$ to k do
                ${N}_{i+1}\leftarrow {N}_{i}\cup \arg \underset{{x}_{p}\in X\setminus {N}_{i}}{\min}\left(D\right)$ ▹ The set N stores the nearest neighbors to the query site for each instance i.
            end for
            ${\widehat{y}}_{i}\leftarrow \frac{1}{k}\sum _{{x}_{i}\in {N}_{k}}^{k}{y}_{i}$
            ${\widehat{y}}_{i}^{\left(w\right)}\leftarrow \frac{\sum _{{x}_{i}\in {N}_{k}}^{k}{y}_{i}{w}_{i}}{\sum _{{x}_{i}\in {N}_{k}}^{k}{w}_{i}}$
        end for
        $MP{E}_{\alpha ,k}\leftarrow \left(\frac{\overline{\widehat{y}}-\overline{y}}{\overline{y}}\right)\times 100\%$ ▹ The MPE variable stores the percentage error for each set of features $\alpha $ and number of neighbors k
        $MP{E}_{\alpha ,k}^{\left(w\right)}\leftarrow \left(\frac{\overline{{\widehat{y}}^{\left(w\right)}}-\overline{y}}{\overline{y}}\right)\times 100\%$
    end for
end for
$MP{E}_{\alpha ,k}\leftarrow MP{E}_{\alpha ,k}\cup MP{E}_{\alpha ,k}^{\left(w\right)}$
return ${f}_{\alpha},k,MP{E}_{\alpha ,k}$

Algorithm 2 Selection of the optimal k and best predictors for cardinality 2 to $\#\left\{{X}_{0}\right\}$
Let X be the matrix of predictors and y be the vector of target values
Let ${x}_{0}$ be the vector of query predictors
Let $\mathcal{F}$ be the set of all possible combinations of features
Let K be the maximum number of neighbors to consider
Let $Sites$ be the set of sites that will be predicted
for each $site\in Sites$ do
    $MP{E}_{site,\alpha ,k}\leftarrow$ Algorithm 1 ($X,{x}_{0},y,\mathcal{F},K$)
end for
return $MP{E}_{site,\alpha ,k}$
for each $site\in Sites$ do
    for each ${f}_{\alpha}\subseteq \mathcal{F}$ do
        ${k}_{site,\alpha}^{\left(MPE\right)}\leftarrow \arg \underset{k}{\min}\left(MP{E}_{site,\alpha ,k}\right)$ ▹ Selection of optimal hyperparameter ${k}^{\ast}$ for each site and set of features indexed by $\alpha $
    end for
end for
Let ${F}_{0}^{\left(MPE\right)}$ be an empty set
for $c\leftarrow 2$ to $\#\left\{{X}_{0}\right\}$ do
    ${F}_{c+1}^{\left(MPE\right)}\leftarrow {F}_{c}^{\left(MPE\right)}\cup \arg \underset{{f}_{\alpha}\in F\setminus {F}_{c}^{\left(MPE\right)},\,\#\left\{{f}_{\alpha}\right\}=c}{\min}\left(\frac{1}{\#\left\{sites\right\}}\sum _{site=1}^{\#\left\{sites\right\}}\left|MP{E}_{site,\alpha ,{k}_{site,\alpha}^{\left(MPE\right)}}\right|\right)$ ▹ Selection of features ${f}_{\alpha}$ that minimize the average MPE of all sites per cardinality
end for
return ${F}_{\#\left\{{X}_{0}\right\}}^{\left(MPE\right)}$
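The search performed by Algorithms 1 and 2 for a single target site can be illustrated with the following Python sketch. It is a simplified, hedged illustration rather than the authors' code: the data layout (`feat[site][name]` and `speed[site]` as per-time-step lists), the function names, and the small floor on the distance are assumptions made here, and the absolute value of the MPE is used as the selection criterion.

```python
import math
from itertools import combinations


def mpe(predicted, observed):
    """Mean percentage error between mean predicted and mean observed value."""
    p_bar = sum(predicted) / len(predicted)
    o_bar = sum(observed) / len(observed)
    return (p_bar - o_bar) / o_bar * 100.0


def knn_series(feat, speed, target, subset, k, weighted):
    """Predict the target's wind speed series from all other sites."""
    refs = [s for s in speed if s != target]
    series = []
    for t in range(len(speed[target])):
        dist = {s: math.sqrt(sum((feat[s][f][t] - feat[target][f][t]) ** 2
                                 for f in subset)) for s in refs}
        near = sorted(refs, key=dist.get)[:k]
        # Floor on the distance avoids division by zero (an assumption).
        w = [1.0 / max(dist[s], 1e-12) if weighted else 1.0 for s in near]
        series.append(sum(wi * speed[s][t] for wi, s in zip(w, near)) / sum(w))
    return series


def best_setup(feat, speed, target, names, k_max):
    """Return (|MPE|, feature subset, k, weighted) with the smallest |MPE|."""
    n_refs = len(speed) - 1
    results = []
    for size in range(2, len(names) + 1):  # cardinality 2 to number of features
        for subset in combinations(names, size):
            for k in range(1, min(k_max, n_refs) + 1):
                for weighted in (False, True):
                    pred = knn_series(feat, speed, target, subset, k, weighted)
                    results.append(
                        (abs(mpe(pred, speed[target])), subset, k, weighted))
    return min(results)
```

Because the observed series at the target site enters the error calculation, this search yields a fit in the sense of method kNN0, not an independent prediction.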

As mentioned previously, the optimal hyperparameters (
${k}^{\ast}$ and
${c}^{\ast}$) determined by Algorithm 2 (also referred to as method kNN0, as can be seen in
Section 2.4) were determined using the wind measurements of all sites and are therefore not suitable for the estimation of the wind resource at a target location without on-site measurements. To overcome this limitation, the kNNa method (
Section 2.4) was designed to estimate the hyperparameters of an arbitrary target site using the optimal hyperparameters of neighboring reference sites, where the estimation is based on the similarity between sites. An illustration of this approach is given in
Figure 2. A
kNN classifier was used to predict the two target variables
${\widehat{k}}_{site}$ and
${\widehat{c}}_{site}$ at the target site by the majority or weighted majority class of its
k nearest neighbors. This procedure was conducted using Algorithm 3, where an in-depth exploration of the parameters was performed using multiple combinations of features and a number of neighbors in the classifier. Each combination of parameters is called a different classifier. Given that hyperparameters in the present study are not time-dependent, all the features used to estimate
$\widehat{k}$ and
$\widehat{c}$ must be constant in time. Therefore, a mean value was calculated for time-dependent variables.
Algorithm 3 kNN classification with subset and number of neighbors selection
Let $\overline{\overline{X}}$ be the matrix of mean predictor values for all sites
Let ${k}^{\ast}$ and ${c}^{\ast}$ be the vectors of optimal parameters for k and c for all sites
Let ${\overline{x}}_{0}$ be the vector of mean predictor values at query location
Let $\mathcal{F}$ be the set of all possible combinations of features
Let $\#\left\{Sites\right\}-1$ be the maximum number of neighboring sites
for each ${f}_{\alpha}\subseteq \mathcal{F}$ do
    for $j\leftarrow 1$ to $\#\left\{Sites\right\}-1$ do
        for each $site\in Sites$ do
            ${D}_{site,q}\leftarrow {\left\Vert {x}_{q}-{\overline{x}}_{0}\right\Vert}_{2},\forall {x}_{q}\in \overline{\overline{X}},\;{x}_{q}\ne {\overline{x}}_{0}$
            ${I}_{site,q}\leftarrow \frac{1}{{D}_{site,q}}$
            ${W}_{site}\leftarrow \sum _{q=1}^{j}{I}_{site,q}$
            ${w}_{site,q}\leftarrow \frac{{I}_{site,q}}{{W}_{site}}$
            Let ${Q}_{0}$ be an empty set
            for $i\leftarrow 1$ to j do
                ${Q}_{i+1}\leftarrow {Q}_{i}\cup \arg \underset{q\in Sites\setminus {Q}_{i}}{\min}\left({D}_{site,q}\right)$
            end for
            Let ${k}_{\gamma}^{\ast}$ be the set of unique ${k}_{l}^{\ast}$ denoted as ${\{{k}_{1}^{\ast},...,{k}_{L}^{\ast}\}}_{\ne}$ ▹ The subset of unique values of ${k}^{\ast}$ is given by ${k}_{\gamma}^{\ast}$
            $vot{e}_{{k}_{l}^{\ast}}\leftarrow \sum _{q\in {Q}_{j}}I({k}_{q}^{\ast}={k}_{l}^{\ast})$, $\forall l\in {k}_{\gamma}^{\ast}$ ▹ $vot{e}_{{k}_{l}^{\ast}}$ saves the votes for each class ${k}_{l}^{\ast}$ from neighborhood ${Q}_{j}$
            ${\widehat{k}}_{\alpha ,j,site}\leftarrow \arg \underset{{k}_{l}^{\ast}}{\max}\left(vot{e}_{{k}_{l}^{\ast}}\right)$ ▹ The majority class is the predicted number of neighbors $\widehat{k}$, estimated with the classifier ${f}_{\alpha}$ and number of neighbors j
            $vot{e}_{{k}_{l}^{\ast}}^{\left(w\right)}\leftarrow \sum _{q\in {Q}_{j}}I({k}_{q}^{\ast}={k}_{l}^{\ast}){w}_{site,q}$, $\forall l\in {k}_{\gamma}^{\ast}$
            ${\widehat{k}}_{\alpha ,j,site}^{\left(w\right)}\leftarrow \arg \underset{{k}_{l}^{\ast}}{\max}\left(vot{e}_{{k}_{l}^{\ast}}^{\left(w\right)}\right)$
            Let ${c}^{\ast}$ be a set of two labels $\{{c}_{1},{c}_{2}\}=\{uniform,distance\}$
            $vot{e}_{{c}_{l}^{\ast}}\leftarrow \sum _{q\in {Q}_{j}}I({c}_{q}^{\ast}={c}_{l}^{\ast})$, $\forall l\in {c}_{l}^{\ast}$ ▹ $vot{e}_{{c}_{l}^{\ast}}$ saves the votes for each class ${c}_{l}^{\ast}$ from neighborhood ${Q}_{j}$
            ${\widehat{c}}_{\alpha ,j,site}\leftarrow \arg \underset{{c}_{l}^{\ast}}{\max}\left(vot{e}_{{c}_{l}^{\ast}}\right)$ ▹ The majority class is the predicted type of regression $\widehat{c}$, estimated with the classifier ${f}_{\alpha}$ and number of neighbors j
            $vot{e}_{{c}_{l}^{\ast}}^{\left(w\right)}\leftarrow \sum _{q\in {Q}_{j}}I({c}_{q}^{\ast}={c}_{l}^{\ast}){w}_{site,q}$, $\forall l\in {c}_{l}^{\ast}$
            ${\widehat{c}}_{\alpha ,j,site}^{\left(w\right)}\leftarrow \arg \underset{{c}_{l}^{\ast}}{\max}\left(vot{e}_{{c}_{l}^{\ast}}^{\left(w\right)}\right)$
        end for
    end for
end for
${\widehat{k}}_{\alpha ,j,site}\leftarrow {\widehat{k}}_{\alpha ,j,site}\cup {\widehat{k}}_{\alpha ,j,site}^{\left(w\right)}$
${\widehat{c}}_{\alpha ,j,site}\leftarrow {\widehat{c}}_{\alpha ,j,site}\cup {\widehat{c}}_{\alpha ,j,site}^{\left(w\right)}$
return ${\widehat{k}}_{\alpha ,j,site},{\widehat{c}}_{\alpha ,j,site}$
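The voting step at the core of Algorithm 3 can be sketched as follows. This is a minimal, hedged illustration for one feature set and one neighborhood size j; the function name `vote_hyperparameter`, the dictionary layout, and the numbers in the example are hypothetical. The normalization of the weights by their sum is omitted because it does not change which class receives the most votes.

```python
import math
from collections import defaultdict


def vote_hyperparameter(mean_feat, optimal, target, j, weighted=False):
    """Predict a hyperparameter at `target` from its j nearest reference sites.

    mean_feat[site] -> tuple of time-averaged, standardized feature values
    optimal[site]   -> optimal label found at that reference site (k* or c*)
    """
    refs = [s for s in optimal if s != target]
    dist = {s: math.dist(mean_feat[s], mean_feat[target]) for s in refs}
    near = sorted(refs, key=dist.get)[:j]
    votes = defaultdict(float)
    for s in near:
        # Majority vote, or inverse-distance weighted vote (non-zero distances
        # are assumed).
        votes[optimal[s]] += 1.0 / dist[s] if weighted else 1.0
    # On a tie, the class of the nearest site wins (insertion order).
    return max(votes, key=votes.get)


# Hypothetical example: one mean feature per site, target "T" without k*.
features = {"A": (0.0,), "B": (1.0,), "C": (1.2,), "T": (0.1,)}
k_star = {"A": 1, "B": 3, "C": 3}
print(vote_hyperparameter(features, k_star, "T", j=3))
print(vote_hyperparameter(features, k_star, "T", j=3, weighted=True))
```

The same function applies unchanged to the regression type, with the labels `"uniform"` and `"distance"` in place of the integer values of k.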

To determine which of the
kNN classifiers evaluated in Algorithm 3 had the best performance, wind speed predictions were conducted using the predicted hyperparameters
$\widehat{k}$ and
$\widehat{c}$. The classifier minimizing the average error of the reference sites (leaving out the target site) was selected. Each site in turn was assumed to be the target site, so that
${N}_{\mathrm{site}}$ (number of sites) average errors were calculated; the classifier that appeared most often as the best performer in those rankings was then selected. A pseudocode description of this method is shown in Algorithm 4.
Algorithm 4 kNN regression to estimate $\widehat{y}$ using estimated hyperparameters
Let X be the matrix of predictors and y be the vector of target values
Let ${x}_{0}$ be the vector of query predictors
Let $\mathcal{F}$ be the set of all possible combinations of features
Let ${F}_{\#\left\{{X}_{0}\right\}}^{\left(MPE\right)}$ be the set of best predictors
for each ${f}_{\beta}\subseteq {F}_{\#\left\{{X}_{0}\right\}}^{\left(MPE\right)}$ do ▹ The kNN regression uses the set of selected features (${F}_{\#\left\{{X}_{0}\right\}}^{\left(MPE\right)}$) by Algorithm 2, and indexed by $\beta $
    for each $site\in Sites$ do
        for each ${\widehat{k}}_{r}\in {\widehat{k}}_{\alpha ,k,site},\;{\widehat{c}}_{r}\in {\widehat{c}}_{\alpha ,k,site}$ do ▹ The hyperparameters $\widehat{k}$ and $\widehat{c}$ predicted by the kNN classifiers are indexed by r
            for $i\leftarrow 1$ to N do
                ${D}_{i,j}\leftarrow {\left\Vert {x}_{j}-{x}_{0i}\right\Vert}_{2},\forall {x}_{j}\in X,\;{x}_{j}\ne {x}_{0}$
                ${w}_{i,j}\leftarrow \frac{1}{{D}_{i,j}}$
                Let ${J}_{0}$ be an empty set
                for $i\leftarrow 1$ to ${\widehat{k}}_{r}$ do
                    ${J}_{i+1}\leftarrow {J}_{i}\cup \arg \underset{{x}_{p}\in X\setminus {J}_{i}}{\min}\left(D\right)$
                end for
                if ${\widehat{c}}_{r}=uniform$ then
                    ${\widehat{y}}_{i,\beta ,site,{\widehat{k}}_{r}}\leftarrow \frac{1}{{\widehat{k}}_{r}}\sum _{{x}_{i}\in {J}_{{\widehat{k}}_{r}}}^{{\widehat{k}}_{r}}{y}_{i}$
                else
                    ${\widehat{y}}_{i,\beta ,site,{\widehat{k}}_{r}}\leftarrow \frac{\sum _{{x}_{i}\in {J}_{{\widehat{k}}_{r}}}^{{\widehat{k}}_{r}}{y}_{i}{w}_{i}}{\sum _{{x}_{i}\in {J}_{{\widehat{k}}_{r}}}^{{\widehat{k}}_{r}}{w}_{i}}$
                end if
            end for
            $MP{E}_{\beta ,site,{\widehat{k}}_{r}}\leftarrow \left(\frac{{\overline{\widehat{y}}}_{\beta ,site,{\widehat{k}}_{r}}-{\overline{y}}_{site}}{{\overline{y}}_{site}}\right)\times 100\%$ ▹ The $MP{E}_{\beta ,site,{\widehat{k}}_{r}}$ variable stores the errors using features $\beta $, at a given $site$ with the estimated $\widehat{k}$ and $\widehat{c}$
        end for
        ${r}_{\beta ,site}^{\left(MPE\right)}\leftarrow \arg \underset{r}{\min}\left(\frac{1}{\#\left\{Sites\right\}-1}\sum _{\substack{i=0\\ i\ne site}}^{\#\left\{Sites\right\}-1}\left|MP{E}_{\beta ,site,{\widehat{k}}_{r}}\right|\right)$ ▹ Selection of the classifier that belongs to the index r that minimizes the avg. error leaving one site out
    end for
    Let ${\eta}_{\alpha}$ be the set of unique ${r}_{\beta ,l}^{\left(MPE\right)}$ denoted as ${\{{r}_{\beta ,1}^{\left(MPE\right)},...,{r}_{\beta ,L}^{\left(MPE\right)}\}}_{\ne}$
    $vot{e}_{{r}_{\beta ,l}^{\left(MPE\right)}}\leftarrow \sum _{site\in Sites}I({r}_{\beta ,site}^{\left(MPE\right)}={r}_{\beta ,l}^{\left(MPE\right)})$, $\forall l\in {\eta}_{\alpha}$ ▹ $vot{e}_{{r}_{\beta ,l}^{\left(MPE\right)}}$ saves the votes for each classifier
    ${r}_{\beta ,opt}^{\left(MPE\right)}\leftarrow \arg \underset{{r}_{\beta ,l}^{\left(MPE\right)}}{\max}\left(vot{e}_{{r}_{\beta ,l}^{\left(MPE\right)}}\right)$ ▹ ${r}_{\beta ,opt}^{\left(MPE\right)}$ saves the classifier that repeats the most for each set of features $\beta $
end for
return ${\overline{\widehat{y}}}_{\beta ,site,{r}_{\beta ,opt}^{\left(MPE\right)}}$
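The final selection step of Algorithm 4 (leave-one-site-out averaging followed by a vote) can be sketched as below for a single feature set. The function name `select_classifier` and the error table are hypothetical; the sketch assumes that the absolute MPE obtained at each site with each classifier has already been computed.

```python
from collections import Counter


def select_classifier(abs_mpe):
    """Pick the classifier that most often wins the leave-one-site-out ranking.

    abs_mpe[r][site] -> |MPE| obtained at `site` with classifier r
    """
    sites = list(next(iter(abs_mpe.values())))
    winners = []
    for target in sites:
        # Average error over the reference sites, leaving the target out.
        avg = {r: sum(err[s] for s in sites if s != target) / (len(sites) - 1)
               for r, err in abs_mpe.items()}
        winners.append(min(avg, key=avg.get))
    # The classifier that appears most often among the per-target winners.
    return Counter(winners).most_common(1)[0][0]


# Hypothetical |MPE| values (in %) for two classifiers at three sites.
errors = {
    "r1": {"A": 1.0, "B": 1.0, "C": 9.0},
    "r2": {"A": 2.0, "B": 2.0, "C": 2.0},
}
print(select_classifier(errors))
```

In this toy table, "r1" only wins when site C is the target, so the more robust "r2" is selected even though "r1" is better at two individual sites.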

2.2.2. Hyperparameter Selection and Testing through Nested Cross-Validation
As described above,
kNN methods often deal with time series data, which exhibit different characteristics when analyzed over different periods. For hyperparameter tuning and for an unbiased validation of a given method for modeling or predicting a variable (such as wind speed), cross-validation is commonly carried out. In the present work, the nested cross-validation approach illustrated in
Figure 3 was implemented. With such an approach, the testing period (green square in
Figure 3) appears chronologically after the training period (blue squares). The training data are used to systematically evaluate the number of neighbors and type of regression, and the setup that minimizes the MPE is selected and used in the testing set to provide an unbiased measure of the model prediction error. By repeating this process with multiple testing sets, a better estimate of model performance is obtained. In this work, the data set was split into five nested subsets, using the setup shown in
Table 1. The testing period was constant in all runs (2 months). As mentioned before, this implementation was conducted for reference purposes only.
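The chronological splitting described above can be sketched as follows. The split lengths are an assumption made for illustration (twelve monthly steps, five folds, two-month test windows, and a training window that grows by one test length per fold); the actual setup of the study is the one given in Table 1. The callable `predict(setup, t)` and the setup labels are likewise hypothetical.

```python
def mpe(predicted, observed):
    """Mean percentage error between mean predicted and mean observed value."""
    p_bar = sum(predicted) / len(predicted)
    o_bar = sum(observed) / len(observed)
    return (p_bar - o_bar) / o_bar * 100.0


def expanding_splits(n_steps, n_folds, test_len):
    """Yield (train, test) index ranges; the test window follows the training."""
    first_train = n_steps - n_folds * test_len
    if first_train <= 0:
        raise ValueError("not enough data for the requested folds")
    for fold in range(n_folds):
        train_end = first_train + fold * test_len
        yield range(train_end), range(train_end, train_end + test_len)


def nested_validation(predict, observed, setups, n_folds, test_len):
    """Tune on each training window, then score the winner on the test window."""
    scores = []
    for train, test in expanding_splits(len(observed), n_folds, test_len):
        def error(setup, idx):
            return abs(mpe([predict(setup, t) for t in idx],
                           [observed[t] for t in idx]))
        best = min(setups, key=lambda s: error(s, train))
        scores.append((best, error(best, test)))
    return scores


for train, test in expanding_splits(12, 5, 2):
    print(list(train), list(test))
```

Each setup would typically be a pair (k, weighting type); averaging the test errors over the folds gives the overall performance estimate.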
2.3. Feature Selection for kNN Methods
A number of features were considered for use with kNN-based methods. A basic requirement for all candidates was the possibility of calculating their values for each location of interest from geographic information alone, in complete analogy to the way in which flow modeling tools such as WAsP or WindSim construct wind maps for a site or region of interest. The following feature candidates were found to be suitable:
The terrain ruggedness index RIX. Although the
$\Delta $RIX correction procedure [
18] did not provide a significant improvement of the WAsP flow modeling results for the site, a weak correlation between the RIX index and the WAsP flow modeling error for the wind speed indicated a possible suitability of the RIX index as a similarity feature.
The orography-induced speedup (i.e., the relative increase in wind speed due to the terrain slope). A modest correlation between the prediction error and the difference between the WAsP-calculated topography-induced speedups was found (
Figure 4), indicating that the speedup is a possible similarity measure candidate.
The distance between measurement locations (met towers). As shown in
Figure 4, the distance between locations bears a similar impact on the flow modeling results as the difference between speedup values.
The vertical wind shear. The wind shear was found to have a significant (negative) correlation with the prediction error, and was therefore expected to be among the important predictor variables of the
kNN method. Note that the degree of correlation between the flow modeling error and the feature candidates shown in
Figure 4 does not have any impact on the
kNN methodology, beyond the decision whether or not to include a given candidate feature in the list of
kNN predictors. The
kNN method determines the optimal set of predictors in an automated fashion, as described above.
The following additional feature parameters were used for the assessment of power density:
The Weibull scale and shape parameters. Determining Weibull parameters requires on-site measurements (or simulated winds from atmospheric models) for at least one location. Using such a reference location, a transferred wind rose (the latter understood as the set of angular wind speed histograms) can be constructed using a wind flow modeling tool. Since local flow conditions are different for different wind directions, the average transferred wind rose will generally have different Weibull parameters.
The ${R}^{2}$-value, or coefficient of determination, of the Weibull fit. The transfer of wind roses may not only change the Weibull parameters, but may also distort the underlying histogram, resulting in varying degrees of goodness of fit.
${R}^{2}$-related parameters can be constructed for their use as similarity features; additionally, a modified weighting scheme (similar to the one proposed in [
19]) was explored as well, where the
${R}^{2}$-value was used to build a confidence level matrix as part of a generalized inverse-distance averaging scheme.
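How Weibull parameters and an associated ${R}^{2}$-value might be obtained from a wind speed sample is sketched below. The fitting procedure is an assumption made for illustration (a least-squares fit on the linearized cumulative distribution, $\ln(-\ln(1-F)) = k\ln v - k\ln A$, with midpoint plotting positions); the source does not state which fitting method was used, and flow modeling tools may apply a different one. The function name `weibull_fit` is hypothetical.

```python
import math


def weibull_fit(speeds):
    """Return (scale A, shape k, R^2) from a linearized least-squares fit.

    All wind speeds must be strictly positive.
    """
    v = sorted(speeds)
    n = len(v)
    xs = [math.log(vi) for vi in v]
    # Midpoint plotting positions keep the empirical F strictly inside (0, 1).
    ys = [math.log(-math.log(1.0 - (i + 0.5) / n)) for i in range(n)]
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    shape = sxy / sxx                      # slope of the fitted line
    intercept = y_bar - shape * x_bar      # equals -k * ln(A)
    scale = math.exp(-intercept / shape)
    ss_res = sum((y - (intercept + shape * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return scale, shape, 1.0 - ss_res / ss_tot


# Hypothetical 10-min mean wind speeds in m/s.
sample = [3.2, 4.8, 5.5, 6.1, 6.9, 7.4, 8.0, 8.8, 9.9, 12.3]
print(weibull_fit(sample))
```

A transferred wind rose whose histogram is distorted by the flow model would show up in this sketch as a lower ${R}^{2}$-value for the same fitting procedure.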
2.4. Setup of the kNN Simulations
The current work consists of the following sequence of assessments:
 (1)
KNN0. All available wind speed data for the full one-year period (see
Section 2.5) were used to perform a
kNN regression for each of the five towers as target site. The optimal hyperparameters
${k}^{\ast}$ and
${c}^{\ast}$ were calculated from the full regression. Algorithms 1 and 2 were used for this purpose. This is a baseline case, constructed for reference purposes.
 (2)
KNNa. In order to conduct an independent test of the methodology, the optimal hyperparameters were estimated from all towers, excluding the target tower itself. The corresponding procedure is described in Algorithm 3.
 (3)
KNNb. Instead of using the measured wind speed information directly as a predictor for a given target site, an alternative method consists of using the WAsP predictions for the target site, prepared with each of the predictor sites. The same set of features and feature combinations as before were used in this step.
 (4)
KNNc. Since methods KNNa (driven exclusively with observed wind speed data) and KNNb (working on WAsP-processed observational data) represent independent assessments, an ensemble version was performed, where methods KNNa and KNNb were combined linearly.
 (5)
KNNd. For an independent validation of the
kNN approach, an additional method was implemented, where optimal hyperparameters were determined and validated in the nested approach described in
Section 2.2.2.
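The motivation for the ensemble in method KNNc can be illustrated with a short calculation. The combination weight $\lambda$ and the error variances ${\sigma}_{a}^{2}$ and ${\sigma}_{b}^{2}$ are notation introduced here for illustration only; the weights actually used are not stated in this section. Assuming a convex combination of the two predictions and unbiased, uncorrelated errors ${e}_{a}$ and ${e}_{b}$:

$${\widehat{v}}_{c}=\lambda \,{\widehat{v}}_{a}+(1-\lambda )\,{\widehat{v}}_{b},\qquad \mathrm{Var}({e}_{c})={\lambda}^{2}{\sigma}_{a}^{2}+{(1-\lambda )}^{2}{\sigma}_{b}^{2}$$

For equal variances ${\sigma}_{a}^{2}={\sigma}_{b}^{2}={\sigma}^{2}$ and $\lambda =1/2$, this gives $\mathrm{Var}({e}_{c})={\sigma}^{2}/2$, i.e., the error variance is halved. Any correlation between the errors of the two methods reduces this gain.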
2.5. Validation Data
On-site tall-tower meteorological data from the development phase of a commercial wind farm in Mexico were used for model construction and validation. The data are proprietary and are not in the public domain. Each of the five towers at the site (“site B”) was equipped with three pairs of redundant cup anemometers (class I for primary sensors, standard for redundant) placed at 80, 60, and 40 m above ground level. The wind direction was measured at two levels (42 and 78 m). Temperature measurements were taken at 12 and 80 m above ground level. Data were recorded at 10 min intervals; for each variable, the mean, maximum, minimum and standard deviation were recorded. One full year of concurrent information with only minor data gaps was selected to avoid seasonal biases. Initial quality assurance was conducted in a semi-automatic way using Windographer. Overall data recovery after quality assurance was 99.9%. All reported results for the wind speed modeling accuracy in this study refer to the 80 m wind speed.
In order to build a continuous observational period, three reconstruction methods were used: (1) replacement of missing or invalid 80 m data with those from the redundant sensor, after tower shadow correction where necessary; (2) vertical extrapolation; and (3) principal component analysis (PCA). Vertical extrapolation was used when 40 m and 60 m wind speed data were available. PCA was used to take advantage of concurrent data from other towers where no concurrent wind speed data at the same tower were available.
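The vertical extrapolation step can be illustrated with a short sketch. The extrapolation law is an assumption: the text does not specify it, and the commonly used power-law shear profile is taken here, with the shear exponent estimated from the concurrent 40 m and 60 m readings. The function name and the example speeds are hypothetical.

```python
import math


def extrapolate_power_law(v_low, v_mid, z_low=40.0, z_mid=60.0, z_top=80.0):
    """Estimate the wind speed at z_top from speeds at z_low and z_mid.

    Assumes v(z) proportional to z**alpha, with alpha taken from the two
    lower measurement levels. Speeds must be strictly positive.
    """
    alpha = math.log(v_mid / v_low) / math.log(z_mid / z_low)
    return v_mid * (z_top / z_mid) ** alpha


# Hypothetical 10-min means at 40 m and 60 m, in m/s.
print(extrapolate_power_law(6.0, 6.5))
```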
In order to verify that the reconstructed continuous data records for each met tower were statistically indistinguishable from the quality-controlled original data, two tests were conducted: a ${\chi}^{2}$ test for the observed (${f}_{O}\left({x}_{i}\right)$) and reconstructed (${f}_{S}\left({x}_{i}\right)$) wind speed distributions, with ${\chi}^{2}=\sum_{i}{\left({f}_{O}\left({x}_{i}\right)-{f}_{S}\left({x}_{i}\right)\right)}^{2}/{f}_{O}\left({x}_{i}\right)$, and a Kolmogorov–Smirnov test for the cumulative distribution functions (${F}_{O}\left({x}_{i}\right)$ and ${F}_{S}\left({x}_{i}\right)$, respectively). All tower measurements at site B passed both tests, indicating that the quality-controlled original data sets and the reconstructed full data sets were statistically indistinguishable.
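The two test statistics can be sketched as follows. This is a minimal illustration, assuming the usual Pearson form of the ${\chi}^{2}$ statistic (squared difference normalized by the observed frequency) and the two-sample Kolmogorov–Smirnov statistic (largest distance between the empirical cumulative distributions). The function names are hypothetical, and the comparison against critical values at a chosen significance level is omitted.

```python
from bisect import bisect_right


def chi_square(f_obs, f_rec):
    """Sum of (f_O - f_S)^2 / f_O over bins; bins with f_O == 0 are skipped."""
    return sum((fo - fs) ** 2 / fo for fo, fs in zip(f_obs, f_rec) if fo > 0)


def ks_statistic(obs, rec):
    """Largest absolute difference between the two empirical CDFs."""
    obs_sorted, rec_sorted = sorted(obs), sorted(rec)
    d_max = 0.0
    for x in obs_sorted + rec_sorted:
        cdf_obs = bisect_right(obs_sorted, x) / len(obs_sorted)
        cdf_rec = bisect_right(rec_sorted, x) / len(rec_sorted)
        d_max = max(d_max, abs(cdf_obs - cdf_rec))
    return d_max


# Illustrative bin counts and samples (made-up numbers)
chi2 = chi_square([10, 20], [12, 18])
d_ks = ks_statistic([1.0, 2.0], [3.0, 4.0])
```

Small values of both statistics indicate that the reconstructed record reproduces the observed wind speed distribution; the decision itself requires the corresponding critical values, which depend on the number of bins and samples.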
The wind regime at the site is essentially bimodal, with southerly winds predominating for most of the year and northerly winds accounting for the remainder (see Figure 5). All towers were located at either the southern or the northern edge of a plateau structure, with three towers (S01_B, S02_B, S03_B) located at the southern edge and two (S04_B, S05_B) at the northern edge. S01_B and S05_B are somewhat more exposed, sitting on protruding portions of the plateau, whereas the other towers sit between a steep slope and the flat upper part of the plateau.
2.6. Flow Modeling: WAsP
WAsP [5] was used for (1) wind flow modeling and cross-prediction of the wind resource at all locations with observational data; and (2) the determination of the feature parameters (Section 2.4). Orography inputs were taken from the Shuttle Radar Topography Mission (SRTM) digital elevation data [20] with a resolution of 1 arcsec (approximately 30 m). The SRTM 1 Arc-Second Global version [21] was downloaded from the U.S. Geological Survey (USGS) website. The domain extends 10 km beyond the limits of the site of interest in order to create a buffer large enough to avoid boundary effects; contour lines with a spacing of 10 m were used. The GlobeLand30 (GLC30) database [22], with a resolution of 1 arcsec (approximately 30 m), was used to create a roughness map. Using the GLC30 version 2020 database improved the cross-validation results for the wind speed compared to using the WAsP default database (GlobCover2009 map [23]); therefore, the GLC30 roughness data are used hereafter. A detailed discussion of potentially more accurate roughness maps and their impact on flow modeling accuracy is beyond the scope of the present work.
In the WAsP modeling chain, the observed wind is generalized using predefined heights and roughness lengths, which can be modified. Fine-tuning these values was found to reduce the error in the predicted wind. Here, the predefined heights were set to 10, 30, 60, 80 and 100 m above ground level and the standard roughness lengths were set to 0.0, 0.05, 0.11, 0.23 and 0.5 m.
In order to account for atmospheric stability, a systematic assessment of the impact of the heat flux on the vertical wind shear profiles was conducted (not shown). With the WAsP default configuration, corresponding to slightly stable conditions with a heat flux offset of $-40$ W/m${}^{2}$, a good fit with the observed profiles was found at all tower locations. Allowing for heat flux offset variations did not improve the results. Another option in the WAsP software is the geostrophic shear model, which is turned on by default. The geostrophic shear is obtained from coarse reanalysis data, which cannot resolve the slopes in mountainous terrain [24], and can therefore give unphysically large geostrophic wind shear values. This was also observed for the study site in this work (0.02 m/s/m). The model was therefore turned off in all cases.
An attempt was made to further improve the cross-validation predictions using the delta-RIX methodology [18]. However, as mentioned in Section 2.3, no statistically significant improvement was obtained.
As an additional reference, the WAsP software was run in (RANS) CFD mode; the results were found to be similar to the ones obtained with WindSim, discussed in the next subsection. Unless mentioned otherwise, all WAsP results were obtained with the standard (linear) flow solver.
2.7. Flow Modeling: WindSim
A RANS-based solver, the commercial software package WindSim [25], was used in an attempt to improve cross-predictions, given the terrain complexity of the site. The same topography data (SRTM and GLC30) as those in WAsP were used. The WindSim mesh was set up with an initial horizontal resolution of 50 m in the refinement area for convergence and initial performance tests; the resolution in the refinement area used in the final selected configuration was then set to 20 m for a better representation of the steep parts of the terrain. Setup parameters for WindSim include (1) a toggle parameter for including the forest model (on/off); (2) the free parameters of the forest model; (3) the turbulence closure model ($k$-$\epsilon$ or $k$-$\omega$); and (4) the type of atmospheric stability. A Taguchi experiment design [26] was set up in order to assess the different free parameters of the forest model. The assumption of neutral stability was found to produce good agreement with the observed wind speed profiles; moreover, simulations under stable and unstable conditions led to convergence problems, so only neutral stability conditions were considered for the final model setup. The $k$-$\omega$ turbulence closure model without the forest model generally produced the best results, so this configuration is used hereafter. A detailed discussion of the setup, experiment design and results of the WindSim study is beyond the scope of the present work and will be reported elsewhere.
4. Summary and Conclusions
Here, a novel method for modeling the wind resource with improved accuracy was proposed and evaluated using on-site data from a site with complex terrain characteristics. As opposed to industry-standard approaches, which mainly rely on flow models, the new method uses a simple and intuitive machine learning algorithm based on the k-nearest neighbors concept. Although this is a well-established method in the literature on advanced statistical modeling and machine learning, to the best knowledge of the authors, no such approach has previously been proposed or demonstrated in the field of wind resource assessment.
The kNN method in this work was based on the concept of similarity between met tower locations. To assess similarity, the method uses classifiers or features of each location. Features related to the terrain characteristics, flow-related quantities and microclimate parameters were selected. All location features can be readily obtained with a microscale flow model. All kNN runs performed in this work use features generated with WAsP. In order to generate a baseline for comparisons, cross-validations between the five tall (80 m) tower locations were conducted with both WAsP (using its linear flow solver) and WindSim (using a RANS implementation). Both software suites were fine-tuned in order to strive for optimal performance.
Given that all met towers are located at edge locations, a clear improvement of the RANS-CFD predictions over the linear flow model was expected. Some improvement with the best-tuned RANS model was indeed observed. Both flow models failed, however, to transfer the measured climatologies from the northern edge to the southern edge locations and vice versa with the accuracy required by modern competitive wind farms.
Four conceptually different kNN approaches were proposed and assessed, with one of the methods (kNNd) being used for reference purposes. The baseline model (kNNa) works directly with the quality-controlled 10 min time series of wind speed or power density retrieved from all met towers other than the target location used for validation. The hyperparameters required by the kNN model were estimated with a novel methodology based on the similarity between locations, which takes advantage of the full observational period. This is somewhat different from a typical machine-learning approach (implemented as method kNNd) and more suitable for the context of this work.
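The similarity-based prediction idea can be illustrated with a generic kNN sketch. This is not the implementation used in the study: the Euclidean distance in feature space, the inverse-distance weighting, the value of k, and the feature and wind speed values are all assumptions chosen for illustration, and the function name `knn_predict` is hypothetical.

```python
import math


def knn_predict(target_features, reference_towers, k=2):
    """Predict a value at the target location from its k most similar towers.

    reference_towers: list of (feature_vector, observed_value) pairs.
    Similarity is measured here by Euclidean distance between feature
    vectors (an assumption); neighbors are weighted by inverse distance.
    """
    ranked = sorted(
        (math.dist(target_features, features), value)
        for features, value in reference_towers
    )
    nearest = ranked[:k]
    weights = [1.0 / (distance + 1e-9) for distance, _ in nearest]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Made-up, already normalized feature vectors and mean wind speeds (m/s)
towers = [((0.0, 0.0), 8.0), ((1.0, 0.0), 6.0), ((10.0, 10.0), 2.0)]
prediction = knn_predict((0.5, 0.0), towers, k=2)
```

In a cross-validation setting, the target tower would be excluded from `reference_towers`, and the features would typically be standardized beforehand so that no single feature dominates the distance.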
Two other variants of the kNN concept were implemented as well. In kNNb, the observed wind data were first transferred to the target site using WAsP, and the kNN method was then applied to those transferred climatologies. This hybrid approach has the advantage of being an independent assessment of the wind resource at the target site, allowing for the construction of an ensemble method (kNNc) by combining the predictions of kNNa and kNNb.
The kNN concept, in its different implementations, provided significant improvements over flow modeling results. While the average WAsP prediction error was around 5% for the wind speed and 7% for the wind power density, the corresponding figures for the kNN method were only approximately 1.5% and 3%, respectively. It should be noted that, from the perspective of a wind resource modeler, the kNN method works similarly to a flow model such as WAsP or WindSim; only terrain-related information and wind time series for different measurement locations are needed.
While providing improved accuracy over wind flow models, the kNN approach does have some limitations. First, a flow model is needed to determine the feature parameters of each location of interest. Therefore, the kNN method can be viewed as an add-on to existing wind modeling suites such as WAsP and WindSim, rather than a standalone method. However, wind resource modelers typically have access to flow modeling tools, so this is not a practical limitation. Second, the kNN method needs more than one measurement location, as opposed to WAsP or WindSim. This is, however, not a strong limitation, since nowadays a number of met towers are routinely installed at prospective wind farm sites, and additional measurement locations can be created dynamically using remote sensing devices such as lidar and sodar. The proposed method is therefore believed to have good prospects for applications in a practical wind resource assessment context.