#### 4.6.1. Performance Evaluation Techniques

The overall performance evaluation process is summarized in Figure 8. For testing purposes, a separate test set, to which the learning algorithm was not exposed at training time, is used to evaluate the performance of the hypothesis, $h\left(x\right)$, trained in the previous step. This is often done through k-fold cross-validation, as illustrated in Figure 9 for the case of k = 5. The general dataset is first permuted and split into five folds. Then, cross-validation is run through five rounds. In each round, one of the folds is held out for testing while the others are used for training. The test error, $\widehat{\epsilon}_{i}$, is calculated and used as an estimate of the prediction error for that round. At the end, the average error over all folds is computed as: $\widehat{\epsilon}=\frac{1}{k}\sum_{i=1}^{k}\widehat{\epsilon}_{i}$.
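The permute-split-rotate procedure above can be sketched in plain Python; this is a minimal illustration, and the helper names (`k_fold_cv`, `train_fn`, `error_fn`) are ours rather than from any particular toolbox:

```python
import random

def k_fold_cv(data, labels, train_fn, error_fn, k=5, seed=0):
    """Estimate the prediction error by k-fold cross-validation.

    Permute the data, split it into k folds, and in each round hold one
    fold out for testing while training on the remaining k-1 folds.
    Returns the average test error over all k rounds.
    """
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)          # permute before splitting
    folds = [idx[i::k] for i in range(k)]     # k (near-)equal folds
    errors = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        h = train_fn([data[j] for j in train], [labels[j] for j in train])
        errors.append(error_fn(h, [data[j] for j in test],
                               [labels[j] for j in test]))
    return sum(errors) / k                    # average error over the folds
```

Any learner can be plugged in through `train_fn` (which returns a hypothesis) and `error_fn` (which scores it on the held-out fold).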

In practical situations the case of k = 10, i.e., 10-fold cross-validation, has become the standard way of measuring the prediction error of a learning algorithm. Although several tests and theoretical results support 10 as a good choice for obtaining reliable prediction error estimates, data mining and machine learning circles still debate the best choice for the number of folds k [183]. For more reliable estimates of the prediction error, the author of [184] recommends using stratified 10-fold cross-validation, an approach where, prior to cross-validation, the data is arranged so that in every fold each class comprises approximately the same number of instances. Individual 10-fold cross-validation runs generate different results due to the randomization of data instances before selecting the folds. As such, another improvement, proposed by [185], is to repeat 10-fold cross-validation 10 times, i.e., $10\times 10$ cross-validation, and take the average over the individual results.
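A stratified fold assignment can be sketched as follows. This is a simplified illustration, not the implementation of [184]; in practice the instances within each class would also be shuffled before dealing them out:

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign each instance index to one of k folds so that every fold
    contains approximately the same proportion of each class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        for j, i in enumerate(members):   # deal each class round-robin
            folds[j % k].append(i)
    return folds
```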

#### 4.6.2. Performance Evaluation Metrics

Figure 8 also shows that an adequate error metric has to be selected to estimate the prediction or generalization error that the model will tend to make on future unseen examples. As mentioned in Section 3.5, for a classification task the confusion matrix and the metrics derived from it are commonly used.

Figure 10 depicts a typical confusion matrix where the rows indicate the actual class an instance belongs to, while the columns indicate the class it was assigned to by the prediction algorithm. For instance, it can be seen that items from class b have been correctly classified 486 times (i.e., TP, true positives) and have been classified as class a or d 124 times (i.e., FN, false negatives). At the same time, one instance of class a, four instances of class c and one instance of class d have been classified as belonging to class b, thus representing FP, false positives. All the other cells of the matrix represent the TN, true negatives.

Four widely used metrics are computed based on the confusion matrix: accuracy, precision, recall and the F1 score.

The accuracy of a classification system is defined as $accuracy=\frac{TP+TN}{TP+TN+FP+FN}$, where TP, TN, FP and FN are, respectively: true positives, instances that are correctly classified as the actual class; true negatives, instances that are correctly classified as not being the actual class; false positives or Type I errors, instances that are misclassified as the actual class; and false negatives or Type II errors, instances from the actual class that are misclassified as another class [169]. In the machine learning and data mining literature, accuracy is also referred to as the overall recognition rate of the classifier [169] and gives the percentage of correctly classified instances.

Precision, defined as $precision=\frac{TP}{TP+FP}$, represents the fraction of correctly classified instances within all instances that were classified (correctly and incorrectly) as a particular class. In other words, it is the percentage of true positive instances within all positively labeled instances. Hence, it can be thought of as the exactness of the classifier.

Recall, also called sensitivity or the true positive rate, defined as $recall=\frac{TP}{TP+FN}$, is the fraction of correctly classified instances of a particular class within all instances that belong to that class.

An alternative way of using precision and recall is to combine them into the F1 or F-score, defined as the harmonic mean of precision and recall: $F1=\frac{2\times precision\times recall}{precision+recall}$.
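The four metrics above can be computed directly from the confusion-matrix counts. A minimal sketch (the function name is ours; real toolboxes such as Weka or scikit-learn compute these for you):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts,
    following the four definitions given in the text."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)          # exactness: of all predicted positives
    recall    = tp / (tp + fn)          # completeness: of all actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1
```

For the class b example from Figure 10 (TP = 486, FN = 124, FP = 6), this gives a precision of about 0.988 and a recall of about 0.797.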

In the case of imbalanced classes, the last three metrics (precision, recall and the F1 score) give a more accurate picture of a classifier’s performance. For instance, if 98% of the instances belong to class A and only 2% to class B, an accuracy of 98% is not an adequate measure, because it cannot indicate whether the classifier correctly classifies only the instances of class A while misclassifying all instances of B. For this purpose, precision and recall are used. In general, a good classifier should have both high precision and high recall. For some applications it is important to be able to control the precision and recall values, for example when it is more important to reduce the false negative rate (e.g., in predicting whether a patient has cancer) than the false positive rate. Such scenarios require a precision-recall trade-off.
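The 98%/2% example can be made concrete with toy data: a degenerate classifier that always predicts the majority class A reaches 98% accuracy while its recall for class B is zero, which accuracy alone would never reveal:

```python
# Degenerate "always predict A" classifier on a 98/2 imbalanced toy dataset.
y_true = ["A"] * 98 + ["B"] * 2
y_pred = ["A"] * 100                      # always the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp_b     = sum(t == "B" and p == "B" for t, p in zip(y_true, y_pred))
recall_b = tp_b / y_true.count("B")       # recall for the minority class

print(accuracy, recall_b)  # 0.98 0.0
```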

The baseline any learning algorithm should be compared against is random guessing. If the algorithm performs the same as or worse than random guessing, then no learning has been realized. Similarly, algorithms that achieve 100% precision and recall are most likely the result of a fault in one of the KD steps. When evaluating several algorithms at once, their performance should be ranked against the baseline and against each other.

#### 4.6.3. Improving the Performance of a Learning Algorithm

The previous paragraphs discussed methods and metrics to evaluate the performance of a learning algorithm. This section discusses how to improve the performance of learning algorithms, which in practice requires a two-step process: (i) diagnose why a learning algorithm performs poorly, and (ii) apply the correct set of actions to improve its performance. To perform a correct diagnosis, it is important to understand that one of the main reasons preventing a learning algorithm from performing well on new unseen examples is the bias-variance problem. To understand the bias-variance problem, first note that in machine learning, more complex models are not always better at predicting or classifying. Complex models tend to "over-fit" the training data (i.e., over-fitting occurs), meaning that the model matches the dataset too closely, thereby modeling not only the underlying relationships but also the random error or noise present in the data. Models that over-fit the data are said to have a high variance problem.

Figure 11a shows an example of a model with high variance. Such models are too complex, and it is to be expected that they will perform less well at predicting on new unseen data than on the original dataset used for training. In machine learning and data mining, such a model is said to fail to generalize well. In contrast, simple models, such as the one shown in Figure 11c, tend to "under-fit" the training dataset (i.e., under-fitting occurs), meaning that the statistical model is too simple with regard to the data it is trying to model and as such fails to capture crucial underlying particularities of the dataset. As indicated in Figure 11c, models that under-fit the data are said to have a high bias problem.

How does one know when a model has struck the right balance and is neither under-fitting nor over-fitting the data it seeks to model? Figure 11b shows an example of an optimally trained model, i.e., one that is accurate and expected to generalize well.

The generalization ability of a model can be tested by diagnosing whether it suffers from high bias or high variance. One approach to diagnosing the bias-variance problem is using learning curves [180]. The idea behind learning curves is to observe the test or cross-validation (CV) error and the training error at the same time with respect to a parameter that represents a measure of varying amounts of learning effort [186]. The training error is the estimated error that the model makes against the same dataset on which it was trained. As such, it is not a reliable estimate of the prediction error that the model will make on future unseen examples, but it is a useful measure for diagnosing the bias-variance problem, as explained below.

Typically, learning curves are plots of the training and the cross-validation error as a function of the number of training examples or the number of iterations used for training.

Figure 12 shows two characteristic learning curve shapes when plotting the training and the validation error as a function of the training set size, m. The bias is highlighted in red, the variance in green, while blue denotes the desired prediction error. In the case of high bias (Figure 12a) the model is under-fitting, and as such increasing the number of training examples causes the training and the validation error to converge to an unacceptably high error (above the desired error). In contrast, in the case of high variance (Figure 12b) the training error stays below the acceptable value while the testing error tends to be much higher. In this case, adding more training data can help decrease the validation error closer to the desired value. The intuition behind the phenomena in Figure 12 is that for small values of m it is easy to fit the training data, which is why the training error $Err_{train}$ is small, but the produced model will not generalize well, which is why the test error $Err_{CV}$ is high. By increasing the training set size m, it gets more difficult to fit the data perfectly but the model tends to generalize better to new instances, which is why the training error increases while the test error decreases with m. However, for even higher values of m, in the case of high bias both training and testing error fail to satisfy the desired error threshold (blue line), while in the case of over-fitting both errors approach the desired value and typically a gap appears between them, as illustrated in Figure 12b.
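The data behind a Figure-12-style plot can be generated generically; `train_fn` and `error_fn` are placeholders for any learner and error metric, and the function name is our own:

```python
def learning_curves(train_fn, error_fn, X, y, X_cv, y_cv, sizes):
    """For each training set size m, train on the first m examples and
    record (m, Err_train, Err_CV): the raw points of a learning curve
    used to diagnose high bias vs. high variance."""
    points = []
    for m in sizes:
        h = train_fn(X[:m], y[:m])
        points.append((m,
                       error_fn(h, X[:m], y[:m]),  # training error
                       error_fn(h, X_cv, y_cv)))   # cross-validation error
    return points
```

Plotting the second and third components against m and comparing them to the desired error line reproduces the diagnostic pictures of Figure 12.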

Additionally, the bias-variance problem can also be addressed by controlling the complexity of the trained model through tweaking its internal and general parameters. Typically, the regularization parameter λ [123] is controlled as an internal parameter for parametric models. Examples of general model parameters that can be tweaked are: the number of neighbours in k-NN classification, the number of hidden layers or the number of sigmoid nodes in the hidden layer of neural networks, etc. Choosing a too simple model (e.g., a neural network with two nodes in the hidden layer) increases the risk of high bias, because a simple model tends to under-fit the training data, leading to high training and high validation error. Choosing a too complex model (e.g., a neural network with 100 nodes in the hidden layer) increases the risk of high variance, because a complex model tends to fit the noise in the training data and does not generalize well, leading to high validation error but small training error. To decide on optimal configurations, it is helpful to plot the training and cross-validation error with regard to the model parameters, as shown in Figure 13. The optimal choice, with the minimal validation error, is denoted in dotted blue.
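Picking the complexity at the dotted-blue point amounts to an argmin over the cross-validation errors. A sketch with hypothetical names (`train_fn` here takes the complexity parameter as its first argument):

```python
def select_complexity(candidates, train_fn, error_fn, X, y, X_cv, y_cv):
    """Train one model per candidate complexity (e.g. number of hidden
    nodes) and return the one with the lowest cross-validation error,
    together with the full error table for plotting."""
    cv_err = {c: error_fn(train_fn(c, X, y), X_cv, y_cv)
              for c in candidates}
    return min(cv_err, key=cv_err.get), cv_err
```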

Finally, another configuration aspect that can cause a learning algorithm to perform poorly is the convergence of the internal optimization algorithm used by the learning algorithm during the training phase, i.e., the rate at which the algorithm stabilizes. Typically, gradient descent is the internal algorithm integrated into a machine learning algorithm and is used for optimizing the model coefficients. Often, convergence can be achieved by increasing the number of training iterations, also known as epochs, that the algorithm is allowed to run. One epoch corresponds to one complete sweep over the training data while calculating/optimizing the model coefficients. Using too few epochs results in inaccurate models. Another tunable parameter influencing the convergence rate is the learning rate α, which was introduced in Section 4.5.
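The effect of the learning rate α can be demonstrated on a one-dimensional gradient descent sketch: a small α converges to the minimum, while a too-large α overshoots and diverges. This is purely illustrative; the backpropagation used inside a real learner is more involved:

```python
def gradient_descent(grad, theta, alpha, epochs):
    """Repeatedly step against the gradient; alpha is the learning rate
    and each iteration plays the role of one epoch."""
    for _ in range(epochs):
        theta -= alpha * grad(theta)
    return theta

# Minimize f(theta) = theta^2, whose gradient is 2 * theta.
good = gradient_descent(lambda t: 2 * t, 1.0, alpha=0.1, epochs=100)  # -> ~0
bad  = gradient_descent(lambda t: 2 * t, 1.0, alpha=1.1, epochs=100)  # diverges
```

With α = 0.1 each step multiplies θ by 0.8, so the iterate shrinks toward the optimum; with α = 1.1 each step multiplies θ by -1.2, so the iterate grows without bound.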

To summarize, the goal of tuning a learning algorithm, also referred to as model selection, is to avoid high bias and high variance while keeping the test error ($Err_{CV}$) below the desired prediction error. Table 9 summarizes the discussed actions one can take to improve the performance of learning algorithms based on the diagnosed problem.

#### 4.6.4. Performance Evaluation for the Fingerprinting Problem

For the device fingerprinting classification problem, we evaluate the performance of the data mining algorithms that were selected in the previous step: k-NN, decision trees, logistic regression and neural networks. We use the 10-fold cross-validation implementation available in the Weka toolbox to obtain performance results for each model. As anticipated in Section 4.2, the GaTech dataset is quite unbalanced and as such per-class performance evaluations should be used. Thus, we present results in the form of a confusion matrix and derive per-class metrics such as precision and recall from it. We start our discussion by presenting partial results for the device classification problem (i.e., identifying the exact device ID), and then demonstrate detailed results, analysis and evaluation procedures for the device type classification problem (i.e., identifying the device type). We intentionally select the default parameters for the neural networks algorithm in Weka, to demonstrate how the recommended actions proposed in Section 4.6.3 can be applied to improve a learning algorithm’s performance.

Table 10 presents the confusion matrix for the k-NN learning algorithm trained for device classification. The model performs generally well, as the majority of the predictions are located on the diagonal of the confusion matrix, which corresponds to correctly classified instances. For instance, for the dell1 class 164 instances were correctly classified as dell1, for dell2 190 instances were correctly classified as dell2, etc. By dividing the sum of the diagonal elements (i.e., 4033) by the total number of instances (i.e., 4926) we obtain the rate of correctly classified instances of the model, i.e., ∼82%. It can be seen that the remaining ∼18% of misclassified instances occurred mostly because of confusing dell2 with dell3, dell4 with dell5, ipad1 with ipad2 and dell2, and nokia1 with nokia2, which was already hinted at by the high ${R}^{2}$ scores in Table 5 of Section 4.3.
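The correctly-classified rate is simply the diagonal sum of the confusion matrix divided by the total number of instances; a small helper (ours, not Weka's), shown on a toy 3-class matrix:

```python
def correct_rate(cm):
    """Fraction of correctly classified instances from a confusion
    matrix given as a list of rows (rows = actual, cols = predicted)."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

# Toy matrix for illustration (not the full Table 10):
cm = [[164,   3,  0],
      [  2, 190,  8],
      [  0,   5, 70]]
rate = correct_rate(cm)
```

Applied to the full Table 10 matrix, this computation yields the 4033/4926 ≈ 82% quoted above.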

Similar conclusions can be derived from the results of the remaining algorithms, available at [188]. Decision trees demonstrated the best performance with 91% correctly classified instances, while logistic regression achieved 88%. Only neural networks demonstrated poor performance, i.e., 25%, which was not expected from the data analysis ${R}^{2}$ results. However, it turned out that neural networks showed similarly poor performance for the device type classification problem; hence, the performance of neural networks will be discussed in more detail in the subsequent part of this section. The goal is to exemplify how to diagnose whether an algorithm can be improved or should be discarded in the model selection phase.

Table 11 presents the summarized confusion matrix, the precision and recall, and their weighted average values over all classes for the k-NN, decision trees, logistic regression and neural networks learning algorithms trained for device type classification.

Table 11 shows that the k-NN trained model performs generally well, confusing mostly ipads with dells and dells with nokias, as anticipated by the high ${R}^{2}$ scores, especially for dell and nokia, obtained in Section 4.3. The highest precision is for the iphone class, i.e., 1, as no instances from other classes were confused with iphones ($FP=0$), followed by ipad and nokia with a precision score of $0.98$, and dell with precision $0.935$. The highest recall was obtained for nokia, i.e., $0.996$, followed by dell, iphone and ipad with scores of $0.993$, $0.979$ and $0.784$, respectively.
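Per-class precision and recall as reported in Table 11 can be derived from the multi-class confusion matrix: for class c, FP are the off-diagonal entries of column c and FN the off-diagonal entries of row c. A sketch (function name ours, shown on a toy two-class matrix):

```python
def per_class_metrics(cm, classes):
    """Precision and recall per class from a confusion matrix with
    rows = actual class and columns = predicted class."""
    n, out = len(cm), {}
    for i, c in enumerate(classes):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp   # predicted c, actually other
        fn = sum(cm[i]) - tp                        # actually c, predicted other
        out[c] = {"precision": tp / (tp + fp) if tp + fp else 0.0,
                  "recall":    tp / (tp + fn) if tp + fn else 0.0}
    return out
```

The weighted averages in Table 11 then follow by weighting each class's scores by its number of instances.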

For decision trees, Table 11 also shows very good results, with only a few misclassified instances. As can be seen, both precision and recall are above $0.99$ for all classes, meaning that the model is resistant to false positive and false negative errors and that this learning algorithm was able to capture differences between device type classes based on the training instances better than k-NN.

The logistic regression model performs slightly worse than k-NN and decision trees, making more classification errors and also confusing classes that had lower ${R}^{2}$ scores, like iphones with nokias. The precision and recall are slightly lower than in the previous models, with average precision and recall both at 95.3%, which is ∼4% lower than the average precision and recall of decision trees, and ∼1% lower than in the case of k-NN.

The performance of the neural networks is significantly poorer compared to the other models, confusing many more dells with nokias (and vice versa), and ipads with iphones, as anticipated by the high ${R}^{2}$ scores obtained in Section 4.3, but also iphones with dells, which had lower ${R}^{2}$ scores. The precision and recall scores are drastically lower for each class, with an average precision of 47.9% and an average recall of 46.2%, which is at least 48% lower than the average precision and recall of the previous models. This is suspicious, as the performance of neural networks should be comparable to or better than decision trees [189]. The main advantage that neural networks have over decision trees is their flexibility to train simple to very complex models by tuning the large parameter space they have (recall that neural networks are known to be universal approximators!). As a result, we have to look into the details of the mining algorithm and diagnose what is going wrong.

As discussed earlier, to gain insight into the behavior of an algorithm, besides the validation error we also need information about the training error. For this purpose, we run the validation of the neural networks in Weka again using the training set instead of cross-validation for testing, to obtain the training error. The training error turned out to be as high as the cross-validation error. As discussed, a high training error in combination with a high cross-validation error might indicate the existence of a high bias problem. According to the guidelines from Table 9, increasing the complexity of the algorithm (in this case the number of sigmoid nodes in the hidden layer) might help decrease the cross-validation error. Hence, an option is to calculate the training and cross-validation error again for more complex models, e.g., 15, 20 or 50 nodes, and plot them similarly to Figure 13. Unfortunately, the outcomes showed that this is not likely to solve the problem and that high error rates persist. Similarly, the classification errors are also unacceptably high for simpler models: 1, 2, 3, 5 and 6 nodes in the hidden layer. Returning to Table 9, a high bias problem can also be addressed by either reducing the number of training instances, adding more features or decreasing regularization. The first option did not influence the accuracy of the model but did improve the training speed, which is practical for computationally expensive algorithms. Adding more features might solve the problem; however, our model already considers a considerable number of features, i.e., the 500 bins of the histograms, so adding more features will likely not have a big influence on the performance. The last option is also not a relevant candidate, as no regularization was used.

Returning to Table 9, we see that there is a third issue that can prevent a model from performing well, namely the convergence problem, which can be solved by using more training iterations or by reducing the learning rate. (i) To evaluate the influence of the training iterations, we plot learning curves relating the performance to the number of training iterations, i.e., epochs. For this purpose, we calculate the training and validation error for several numbers of epochs: 200, 500, 800, 1000, etc. In order to keep the diagnosis process time-efficient, we selected a simpler model with one hidden layer with two sigmoid nodes, $HLsigmoid=2$, whereas all remaining parameters stay the same.

Figure 14a shows the corresponding learning curves. As can be seen, both the training and the test error are relatively high in all cases, meaning that increasing the training iterations does not solve the problem. (ii) Alternatively, a last possible issue could be the use of a learning rate α that is too high, causing the algorithm to fail to converge to the best solution. Namely, with a higher learning rate the algorithm takes bigger steps during its convergence process. If the step is too big, the algorithm will likely never approach the optimal solution. To verify this option, we recalculated the training and validation error for the same parameter set ($HLsigmoid=2$, $epochs=500$) but with $\alpha =0.1$. The results showed that both the training and validation error decreased below 20%. Hence, tuning the learning rate solved the convergence problem.

Now that the convergence problem is solved, we can re-iterate the previous processes to further fine-tune the parameters of the learning algorithm. For this purpose, we calculated the training and validation error one last time with regard to the number of sigmoid nodes in the hidden layer of the neural network, i.e., the model complexity. Figure 14b shows how the errors vary with the model complexity ($HLsigmoid=1,2,3,5,6,10$). The best model was the one with 6 hidden nodes. Increasing the complexity of the model further may introduce over-fitting, and makes no sense since the 6-sigmoid-node configuration already has optimal training and validation errors.

Table 12 presents the confusion matrix for the new neural networks model. Although the neural network model initially seemed to perform worst of all investigated algorithms, with the optimized configuration settings it now correctly classifies instances from all classes.

#### **Practical Considerations for Performance Evaluation**

The models trained in the previous KD step can be evaluated using some of the following data mining toolboxes/libraries: Weka, RapidMiner, Orange, KNIME, Rattle GUI, ELKI, Vowpal Wabbit, Shogun, scikit-learn, libsvm, Pybrain, etc.