#### 6.1. Neural Network Regression

Neural network models have gained considerable research attention due to their ability to model nonlinear input–output relations. Loosely inspired by the human brain, a “neuron” in a neural network is a mathematical function that collects and transforms information according to a pre-determined architecture, achieving statistical objectives such as curve fitting and regression analysis.

In terms of architecture, a neural network contains layers of interconnected nodes.

Figure 21 illustrates the tether force prediction problem as a neural network with two hidden layers. Each node is a perceptron and is similar to a multiple linear regression. The perceptron feeds the signal produced by a multiple linear regression into an activation function that may be nonlinear. In a multi-layered perceptron (MLP), perceptrons are arranged in interconnected layers. The input layer collects input patterns. The output layer holds the classifications or output signals to which input patterns may map; in our work, the predicted output is the tether force. Hidden layers fine-tune the input weightings until the neural network’s margin of error is minimal. It is hypothesized that hidden layers extract salient features in the input data that have predictive power regarding the outputs.
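As a concrete illustration of a single perceptron, a weighted sum of the inputs is passed through an activation function. The weights, inputs, and `tanh` activation below are hypothetical, chosen only to show the computation:

```python
import numpy as np

def perceptron(x, w, b, activation=np.tanh):
    """Single perceptron: weighted sum of inputs passed through an activation.
    (Illustrative names and values; not from the paper's implementation.)"""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # hypothetical input features
w = np.array([0.2, 0.4, -0.1])   # hypothetical weights
b = 0.1                          # bias term
out = perceptron(x, w, b)        # tanh(0.1 - 0.4 - 0.2 + 0.1) = tanh(-0.4)
```

A full layer simply evaluates many such perceptrons in parallel on the same input vector.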

We used the TensorFlow and Keras [48] libraries to create a regression-based neural network with linear activation functions. At a high level, an activation function determines the output of a learning model, its accuracy, and the computational efficiency of training the model. It can generally be designed to be linear or nonlinear to reflect the complexity of the function being predicted.

For exploration, we used two hidden layers of 12 and 8 neurons, respectively, over 500 optimization iterations (epochs, i.e., forward and backward passes). A model summary is reported in Table 4, highlighting the dimensions of the dense layers, the number of parameters to be optimized per layer in each epoch, and the total number of trainable parameters.

For a small network of two layers with 12/8 neurons, a total of 281 parameters must be trained. This number grows quickly as the number of layers and neurons per layer increases. Although adding layers or neurons can improve prediction accuracy, it comes with an added computational cost. Trade-off studies are often used to find a practical implementation with acceptable accuracy for a given training data set.
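The architecture above can be sketched in Keras as follows. The input dimension of 13 is an assumption, chosen because it reproduces the 281 trainable parameters noted above: (13 + 1) × 12 + (12 + 1) × 8 + (8 + 1) × 1 = 281.

```python
from tensorflow import keras

# Two hidden layers (12 and 8 neurons) with linear activations, and one
# output neuron for the predicted tether force. The input dimension of 13
# is an assumption made to match the 281 trainable parameters.
model = keras.Sequential([
    keras.Input(shape=(13,)),
    keras.layers.Dense(12, activation="linear"),
    keras.layers.Dense(8, activation="linear"),
    keras.layers.Dense(1, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")

# Training would then run for 500 epochs, e.g.:
# history = model.fit(X_train, y_train, epochs=500, validation_split=0.2)
```

Calling `model.summary()` prints the per-layer dimensions and parameter counts reported in Table 4.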

Figure 22 highlights the decreasing training and validation losses over the training epochs. Once the model was trained to a satisfactory error metric, we used it for predicting tether force values of new input vectors.

#### 6.2. Comparing Regression Models

To further demonstrate the value of machine learning regression models for accurately predicting the power output of airborne wind energy systems, we evaluated several standard regression models against the quality metrics in Section 5.2, along with the training time. We used standard Scikit-Learn implementations. Results are reported in Table 5. For this study, we split the full data set into random train and test subsets, using $70\%$ for training and reserving $30\%$ for testing.
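The split can be reproduced with Scikit-Learn's `train_test_split`; the random data below is only a placeholder standing in for our feature matrix and tether-force vector:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real features and tether force.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)

# Random 70/30 train/test split, as used in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
```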

A key remark at this point is that no single model scores best across all data sets in terms of all quality metrics. Multiple iterations and hyper-parameter tuning operations would be needed for further model optimization. We note the different trade-offs highlighted in Table 5, e.g., between training time and accuracy [49].

For example, linear regression is one of the simplest algorithms, trained using gradient descent (GD), an iterative optimization approach that gradually tweaks the model parameters to minimize the cost function over the training set. A linear model might not have the best accuracy, but it is simple to implement and hence well suited for quick domain exploration. It makes a prediction by computing a weighted sum of the input features plus a constant called the bias term, $\widehat{y}={h}_{\theta}\left(\mathbf{x}\right)=\theta \cdot \mathbf{x}$, where ${h}_{\theta}\left(\mathbf{x}\right)$ is the hypothesis function and $\theta$ is the model’s parameter vector containing the bias term ${\theta}_{0}$ and the feature weights ${\theta}_{1}$ to ${\theta}_{n}$.
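A minimal batch gradient descent sketch for this linear model, using a synthetic noiseless data set whose true weights are known (all names and values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = rng.normal(size=(m, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1]  # true parameters: [3.0, 2.0, -1.5]

X_b = np.c_[np.ones((m, 1)), X]  # prepend a column of 1s for the bias theta_0
theta = np.zeros(3)
eta = 0.1  # learning rate

for _ in range(1000):
    # Gradient of the MSE cost: (2/m) * X^T (X theta - y)
    gradients = (2.0 / m) * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients
```

After training, `theta` converges to the true parameter vector, since the data is noiseless and the cost function is convex.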

Regularization is often used to further improve the loss-function optimization. On the one hand, ridge regression is a regularized version of linear regression in which a regularization term equal to $\alpha {\sum}_{i=1}^{n}{\theta}_{i}^{2}$ is added to the cost function. This forces the learning algorithm not only to fit the data but also to keep the model weights as small as possible. The hyper-parameter $\alpha$ controls how much the model is regularized. If $\alpha =0$, ridge regression reduces to plain linear regression. If $\alpha$ is very large, all weights end up very close to zero and the result is a flat line through the data’s mean. On the other hand, lasso regression is another regularized version of linear regression that adds a regularization term to the cost function, but it uses the ${l}_{1}$ norm of the weight vector, $\alpha {\sum}_{i=1}^{n}\left|{\theta}_{i}\right|$, instead of half the square of the ${l}_{2}$ norm. Lasso regression tends to completely eliminate the weights of the least important features (i.e., set them to zero); in other words, it automatically performs feature selection and outputs a sparse model (i.e., one with few nonzero feature weights). Elastic net regression is a middle ground between ridge and lasso regression: its regularization term is a simple mix of both, controlled by a mix ratio $r$. When $r=0$, elastic net is equivalent to ridge regression, and when $r=1$, it is equivalent to lasso regression.
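All three regularized variants are available directly in Scikit-Learn. In the hypothetical example below, only the first feature drives the target, so lasso zeroes out the irrelevant weight while ridge merely shrinks it:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0]  # target depends only on the first feature

ridge = Ridge(alpha=1.0).fit(X, y)   # l2 penalty: shrinks all weights
lasso = Lasso(alpha=0.5).fit(X, y)   # l1 penalty: sparse weights
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)  # mix ratio r = 0.5
```

Here `lasso.coef_[1]` comes out exactly zero (feature selection), while `ridge.coef_[1]` is merely small; note that Scikit-Learn exposes the mix ratio $r$ as `l1_ratio`.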

Despite their longer training times, nonlinear models are expected to perform better for our data set. As we noticed in Figure 11, Figure 12, Figure 13, Figure 14, Figure 15, Figure 16 and Figure 17, the input features and output force are not linearly related. To start, polynomial regression introduces nonlinearity by adding powers of each feature as new features. It then trains a linear model on this extended set of features.

Alternatively, ensemble learning methods use a group of predictors and vote among them for the best performance; hence, they are often called voting regression. The accuracy of voting regression depends on how powerful each predictor in the group is and on their independence. Finally, boosting refers to any ensemble method that combines several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor, often resulting in the best performance compared to individual models. Due to the limited training data, voting among multiple regressors yielded higher accuracy (lower MSE) than individual models, as shown in Table 5. If more training data were available, optimizing a single model to outperform ensemble models would be feasible.
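Both approaches map to Scikit-Learn's `VotingRegressor` and `GradientBoostingRegressor`; the data set and choice of base predictors below are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1])  # nonlinear toy target

# Voting: average the predictions of several independent regressors.
voter = VotingRegressor([
    ("lin", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
    # Boosting: trees trained sequentially, each correcting its predecessor.
    ("gbr", GradientBoostingRegressor(random_state=0)),
])
voter.fit(X, y)
r2 = voter.score(X, y)  # training R^2 of the averaged prediction
```

Averaging tends to dampen the systematic error of the weak linear model, which is the effect exploited by voting regression.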

From our machine learning experiments, we conclude that a neural network model applied to AWE clearly succeeds in predicting tether force, even without hyper-parameter tuning. The model’s main drawback is that it requires a longer training time than the other algorithms, despite its overall accuracy.

A major advantage of our ML model is cost. Once a model is trained, there is no need to physically run new experiments (with the same test setup shown in Figure 5) to predict the tether force. Instead, we can simply rely on our current NN model to predict the estimated tether force for new input combinations. Alternatively, we could use our gradient boosting model if we care about evaluation/prediction time rather than model accuracy. Note that the evaluation time is the time required to calculate the predicted tether force from our model (prediction formula). The neural network generates a more accurate formula, but one that is more complex and takes more time to evaluate.