3.1. Classification of Disease Emergence Risk
For the purposes of fitting the model and analysis, we divided our data randomly (to reduce bias) into training, validation and test sets as defined in the seminal text [27]. We set aside 1000 observations (approximately two thirds of the entire dataset) for training our SVM model. Of the remainder, we selected 400 observations for validating the SVM model and left aside 100 observations for testing. The training set of D2 was initially examined to develop a statistically reliable method for classifying the risk of freshwater fish disease emergence for each cell. We were interested in achieving a risk categorization of low, medium and high for freshwater fish disease emergence in England. The classification we developed in this study relies on a combination of statistical reasoning and logical perceptions. Based on the accepted assumption (see [28]) that increased numbers and frequencies of live freshwater fish introductions in an area increase the risk of fish disease emergence, this database was used to train, validate and test a model for the risk of freshwater fish disease emergence in England. According to [28], and information provided by experts, the following factors and assumptions contribute towards freshwater fish disease emergence:
Native and non-native fish movements into a cell increase the diversity of fish species in that cell.
The diversity of fish in a cell contributes towards the likelihood of one of the fish carrying a pathogen; that is, the more varied the fish, the more likely they are to carry a pathogen.
The higher the number of fish movements into a cell, the higher the possibility of a freshwater fish disease emerging in that cell.
Taking these factors and assumptions into consideration, we developed the following methodology for classifying the risk of freshwater fish disease emergence. For classification purposes, we were mainly interested in the variables titled "Number of varieties", "Native species moves to" and "Non-native species moves to", which are found in D2. Next, we introduced a new column titled "Sum" into the database, purely for classification purposes:

$$\text{Sum} = \text{``Native species moves to''} + \text{``Non-native species moves to''}. \tag{1}$$

The creation of the "Sum" column was influenced by the assumption that a high number of fish movements (regardless of whether they are native or non-native) would increase the chances of a freshwater fish disease emerging.
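To make the construction of the "Sum" column concrete, the sketch below builds it from per-cell movement counts as in Equation (1). This is illustrative only (not the authors' code); the record field names and values are made up, mirroring the D2 variable titles described above.

```python
# Illustrative sketch: building the "Sum" column of Equation (1).
# Field names mirror the D2 variables; the counts are invented.
records = [
    {"cell": "A1", "native_moves_to": 0, "non_native_moves_to": 0},
    {"cell": "B7", "native_moves_to": 12, "non_native_moves_to": 4},
    {"cell": "C3", "native_moves_to": 30, "non_native_moves_to": 15},
]

for rec in records:
    # Equation (1): total movements into the cell, regardless of
    # whether they are native or non-native.
    rec["sum"] = rec["native_moves_to"] + rec["non_native_moves_to"]

print([rec["sum"] for rec in records])  # the new "Sum" column
```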
Following its introduction, we analyzed the distribution of the "Sum" column to determine the cut-off points for the proposed risk classification. The cumulative distribution function (c.d.f.) was used for this purpose. The c.d.f. describes the relationship

$$F_X(x) = P(X \le x), \tag{2}$$

which is the probability that a real-valued random variable $X$ with a given probability distribution will be found at a value less than or equal to $x$. As such, the c.d.f. for a continuous variable $X$ can be defined as

$$F_X(x) = \int_{-\infty}^{x} f(t)\,dt, \tag{3}$$

where $f$ is the probability density function.
Figure 1 shows the c.d.f. for the "Sum" column. We analyzed this c.d.f. to identify statistically reliable, optimal cut-off points for the low, medium and high risk classification of the database. The optimal cut-off points shown in Figure 1 were generated based solely on the "Sum" variable, which combines native and non-native fish movements into a particular cell. Prior to adopting these cut-off points as optimal, we also evaluated modelling with different points to ascertain the sensitivity and robustness of the adopted points.
The determination of the risk classifications can be further explained as follows. It is visible that, up to the first marked point of the c.d.f., the cumulative probability corresponds to zero fish movements (Figure 1). When compared with the actual data, this converts into a cut-off point of 1. Likewise, at the second marked point of the c.d.f., we arrive at the next cut-off point, which is 28. Using such key information in combination with the logical perceptions relating to the variety of fish in a particular cell, we arrived at the final risk classification.
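The procedure above can be sketched in a few lines: an empirical c.d.f. for the "Sum" values, and the resulting three-way risk labels. The cut-off values 1 and 28 come from the text; whether each boundary is inclusive, and the sample values, are our assumptions for illustration.

```python
# Sketch: empirical c.d.f. of "Sum" and the low/medium/high labels.
# Cut-offs 1 and 28 are from the text; boundary inclusiveness is assumed.

def ecdf(values, x):
    """Empirical c.d.f.: fraction of observations <= x."""
    return sum(1 for v in values if v <= x) / len(values)

def risk_class(s, low_cut=1, high_cut=28):
    if s <= low_cut:
        return "low"
    elif s <= high_cut:
        return "medium"
    return "high"

sums = [0, 0, 1, 3, 10, 28, 40, 120]      # invented "Sum" values
print(ecdf(sums, 1))                       # cumulative probability at cut-off 1
print([risk_class(s) for s in sums])
```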
3.2. Support Vector Machine (SVM)
The foundations of SVM were developed by [
7] and those interested in a detailed elaboration of the theory underlying SVM are referred to [
29]. In brief, SVM separates two classes by a function, which is induced from the available data observations, with the ultimate goal of producing a classifier that can be generalized. Note that, determining a class boundary using a separating hyperplane is adequate where classes are linearly separable, but there exists other less complex methods, which could provide satisfactory results in such situations. Therefore, SVM is most appropriate where classes are not linearly separable [
30].
An initial analysis of D2 showed that the classes were not linearly separable, prompting the use of an appropriate non-linear model in the form of SVM. Furthermore, there is evidence suggesting that, in general, freshwater ecological variables and their underpinning processes are very complicated and non-linear [20], further supporting the adoption of a non-linear model such as SVM.
Following [7,29], the theory underlying SVM starts with the problem of separating a set of training vectors belonging to two separate classes,

$$\{(x_1, y_1), \ldots, (x_n, y_n)\}, \quad x_i \in \mathbb{R}^p, \quad y_i \in \{-1, 1\},$$

with a hyperplane

$$\langle w, x \rangle + b = 0,$$

where $\langle w, x \rangle$ denotes the inner product of the vectors $w$ and $x$.
The simple solution to the problem is to find the hyperplane for which the minimum distance between the hyperplane and the points $x_i$ is maximized in both classes. In other words, one may solve the following optimization problem:

$$\max_{w, b} \; M \quad \text{subject to} \quad \|w\| = 1, \quad y_i(\langle w, x_i \rangle + b) \ge M, \quad i = 1, \ldots, n. \tag{4}$$

The parameter $M$ is called the margin and shows the minimum distance between the observation points $x_i$ and the hyperplane (the margin between the two classes). Once the optimization problem in Equation (4) is solved, the classification function classifies a new observation $x$ as follows:

$$f(x) = \operatorname{sign}(\langle w, x \rangle + b). \tag{5}$$
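The classification rule of Equation (5) can be sketched directly: the sign of $\langle w, x \rangle + b$ picks the side of the hyperplane. The hyperplane parameters below are made-up values for illustration.

```python
# Sketch of the linear support vector classifier of Equation (5):
# f(x) = sign(<w, x> + b), with hypothetical w and b.

def inner(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def classify(w, b, x):
    # The side of the hyperplane <w, x> + b = 0 determines the class.
    return 1 if inner(w, x) + b >= 0 else -1

w, b = [1.0, -2.0], 0.5             # hypothetical hyperplane parameters
print(classify(w, b, [3.0, 1.0]))   # 3 - 2 + 0.5 = 1.5  -> +1
print(classify(w, b, [0.0, 2.0]))   # 0 - 4 + 0.5 = -3.5 -> -1
```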
The classification function in Equation (5) is called the linear support vector classifier. In the optimization problem (Equation (4)), the second constraint guarantees that all observations lie on the correct side of the hyperplane. This constraint comes from the assumption that "the observations are linearly separable (i.e., there exists a hyperplane which separates the two classes)". However, in real-world problems (e.g., the problem at hand), linear classification is not always possible, which means the optimization problem (Equation (4)) does not have a solution. To handle this, one needs to allow some of the points to lie on the wrong side of the hyperplane. In this case, the optimization problem is formulated as follows:

$$\max_{w, b} \; M \quad \text{subject to} \quad \|w\| = 1, \quad y_i(\langle w, x_i \rangle + b) \ge M(1 - \xi_i), \quad \xi_i \ge 0, \quad \sum_{i=1}^{n} \xi_i \le C. \tag{6}$$
The error term $\xi_i$ allows the observation $x_i$ to lie on the wrong side of the hyperplane. The parameter $C$ is a nonnegative constant called the tuning parameter. Once the optimization problem (Equation (6)) is solved, one may use the classification function (Equation (5)) to classify new observations.
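To make the role of the slack variables concrete, the sketch below computes the smallest $\xi_i$ that satisfies the Equation (6) constraint $y_i(\langle w, x_i \rangle + b) \ge M(1 - \xi_i)$ for a given observation; rearranging gives $\xi_i \ge 1 - y_i(\langle w, x_i \rangle + b)/M$, together with $\xi_i \ge 0$. The hyperplane and margin values are illustrative, not fitted.

```python
# Sketch: smallest slack xi_i satisfying the Equation (6) constraint
# for each observation, given an illustrative hyperplane (w, b) and margin M.

def inner(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def min_slack(w, b, M, x, y):
    # xi_i >= 1 - y(<w, x> + b)/M, and xi_i >= 0.
    return max(0.0, 1.0 - y * (inner(w, x) + b) / M)

w, b, M = [1.0, 0.0], 0.0, 1.0
print(min_slack(w, b, M, [2.0, 1.0], +1))   # well inside its side: slack 0.0
print(min_slack(w, b, M, [0.5, 0.0], +1))   # inside the margin:    slack 0.5
print(min_slack(w, b, M, [-1.0, 0.0], +1))  # wrong side:           slack 2.0
```

The sum of these slacks is what the budget $C$ in Equation (6) bounds.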
Solving the optimization problem (Equation (6)), it turns out that the optimal solution to the linear classification problem involves the observation vectors only through their inner products $\langle x_i, x_{i'} \rangle$ [31], which implies that one can reformulate the linear support vector classifier as follows:

$$f(x) = b + \sum_{i=1}^{n} \alpha_i y_i \langle x, x_i \rangle, \tag{7}$$

where the coefficients $\alpha_i$ and the parameter $b$ are estimated by solving Equation (6), based on all inner products of the observation vectors (see [32], Chapter 12 for more details on the solution).
Using the reformulated classification function (Equation (7)), the linear support vector classifier can be extended to nonlinear problems by using a nonlinear function in place of the inner product [29]:

$$f(x) = b + \sum_{i=1}^{n} \alpha_i y_i K(x, x_i), \tag{8}$$

where $K(\cdot, \cdot)$ is a symmetric, positive semi-definite function. The function $K$ is called the kernel and allows the support vector classifier (Equation (8)) to discriminate between two classes even if they are not linearly separable. Some popular choices for the kernel function are:
Linear kernel: $K(x, x') = \langle x, x' \rangle$.
$d$th-degree polynomial: $K(x, x') = (1 + \langle x, x' \rangle)^d$.
Gaussian: $K(x, x') = \exp\left(-\tfrac{1}{2}(x - x')^{\top} H^{-1} (x - x')\right)$.
Radial basis: $K(x, x') = \exp\left(-\gamma \|x - x'\|^2\right)$.
Neural network: $K(x, x') = \tanh\left(\kappa_1 \langle x, x' \rangle + \kappa_2\right)$.
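These kernels are one-liners to implement. The sketch below gives scalar-parameter forms; the parameter values ($d$, $\gamma$, $\kappa_1$, $\kappa_2$) are illustrative defaults, not tuned choices.

```python
# Sketch implementations of the listed kernels (scalar-parameter forms).
import math

def inner(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def linear_kernel(x, xp):
    return inner(x, xp)

def poly_kernel(x, xp, d=2):
    return (1 + inner(x, xp)) ** d

def rbf_kernel(x, xp, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xp)))

def nn_kernel(x, xp, k1=1.0, k2=0.0):
    return math.tanh(k1 * inner(x, xp) + k2)

x, xp = [1.0, 0.0], [0.0, 1.0]
print(linear_kernel(x, xp))   # 0.0 for orthogonal vectors
print(poly_kernel(x, xp))     # (1 + 0)^2 = 1
print(rbf_kernel(x, x))       # always 1.0 at x = x'
```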
As can be seen, each kernel has its own extra parameters (e.g., the polynomial kernel has the degree $d$ and the Gaussian kernel has the bandwidth matrix $H$). Cross-validation is a common method to select the appropriate kernel function and estimate its extra parameters [32].
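The cross-validation loop is the same regardless of the model being tuned. As a minimal sketch, the code below selects a radial-basis bandwidth $\gamma$ by k-fold cross-validation using a 1-nearest-neighbour classifier under the kernel-induced distance $K(x,x) + K(x',x') - 2K(x,x')$; tuning the SVM itself would follow the same pattern, with only the fitting step replaced. The toy data and candidate grid are invented.

```python
# Sketch: selecting a kernel parameter by k-fold cross-validation.
import math

def rbf(x, xp, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xp)))

def kernel_dist(x, xp, gamma):
    # Distance induced by the kernel's feature map.
    return rbf(x, x, gamma) + rbf(xp, xp, gamma) - 2 * rbf(x, xp, gamma)

def knn_predict(train, x, gamma):
    # 1-nearest neighbour in the kernel-induced distance.
    return min(train, key=lambda t: kernel_dist(t[0], x, gamma))[1]

def cv_score(data, gamma, k=3):
    folds = [data[i::k] for i in range(k)]   # simple interleaved folds
    correct = 0
    for i, fold in enumerate(folds):
        train = [t for j, f in enumerate(folds) if j != i for t in f]
        correct += sum(knn_predict(train, x, gamma) == y for x, y in fold)
    return correct / len(data)

data = [([0.0], -1), ([0.2], -1), ([0.4], -1),
        ([2.0], 1), ([2.2], 1), ([2.4], 1)]     # toy, well-separated data
best_gamma = max([0.1, 1.0, 10.0], key=lambda g: cv_score(data, g))
print(best_gamma, cv_score(data, best_gamma))
```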
As described in Section 3.1, we divided our data randomly, following [27], into training (1000 observations), validation (400 observations) and test (100 observations) sets. Out of the various SVM variants, we selected the "nu-svc" classification variant (http://scikit-learn.org/stable/modules/svm.html#nusvc) for modeling the risk of freshwater fish disease emergence. Here, the $\nu$ parameter sets an upper bound on the training error and a lower bound on the fraction of data points that become support vectors (default: 0.2). A further interesting property of $\nu$ is that it is related to the ratio of support vectors and the ratio of the training error. We then used the risk categorization developed in Section 3.1, along with the following variables, to develop the proposed SVM model.
Dependent Variable = Risk (classified as low, medium and high).
Independent Variables = Area, Population, Pop Per Ha, Pet Shops, Garden Centers, Fish Farms, No. Fish Importers, No. Fish Exporters, No of Varieties, Varieties per Ha, Origins, Total Imports, Total Species, Native species Fish Moves to, Non Natives Fish Moves to, Native species Fish Moves From, Non Natives Fish Moves From, and Species Group.
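The model inputs above can be sketched as a design matrix and label vector. The field names follow the variable titles listed in the text; the record values here are placeholders for illustration.

```python
# Sketch: assembling (X, y) for the SVM from the listed variables.
# Field names follow the text; the record values are invented.
FEATURES = [
    "Area", "Population", "Pop Per Ha", "Pet Shops", "Garden Centers",
    "Fish Farms", "No. Fish Importers", "No. Fish Exporters",
    "No of Varieties", "Varieties per Ha", "Origins", "Total Imports",
    "Total Species", "Native species Fish Moves to",
    "Non Natives Fish Moves to", "Native species Fish Moves From",
    "Non Natives Fish Moves From", "Species Group",
]

def to_xy(records):
    X = [[rec[f] for f in FEATURES] for rec in records]
    y = [rec["Risk"] for rec in records]   # "low" / "medium" / "high"
    return X, y

record = {f: 0.0 for f in FEATURES}        # placeholder cell record
record["Risk"] = "low"
X, y = to_xy([record])
print(len(X[0]), y)                        # 18 features, one "low" label
```

With scikit-learn available, such an (X, y) pair would then be passed to `sklearn.svm.NuSVC(nu=0.2).fit(X, y)`, matching the "nu-svc" variant named above.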