GASVeM: A New Machine Learning Methodology for Multi-SNP Analysis of GWAS Data Based on Genetic Algorithms and Support Vector Machines

: Genome-wide association studies (GWAS) are observational studies of a large set of genetic variants in an individual’s sample in order to ﬁnd if any of these variants are linked to a particular trait. In the last two decades, GWAS have contributed to several new discoveries in the ﬁeld of genetics. This research presents a novel methodology to which GWAS can be applied to. It is mainly based on two machine learning methodologies, genetic algorithms and support vector machines. The database employed for the study consisted of information about 370,750 single-nucleotide polymorphisms belonging to 1076 cases of colorectal cancer and 973 controls. Ten pathways with different degrees of relationship with the trait under study were tested. The results obtained showed how the proposed methodology is able to detect relevant pathways for a certain trait: in this case, colorectal cancer.


Introduction
The results of the Human Genome Project [1] and the International HapMap Project [2] made it possible to find genes linked to traits and health problems. Genome-wide association studies (GWAS) have contributed to several new discoveries in human genetics. GWAS exploit the fact that genetic variants that are close together tend to be statistically correlated, something which in genetics is known as linkage disequilibrium [3]. The advances in genome arrays of genetic variations have led to the discovery of many DNA variants associated with complex traits such as those related to diseases.
Nowadays, one of the main criticisms of GWAS is that to date, most of the discoveries have not been applied in a clinical practice [4], but despite this obvious drawback, GWAS do have a great relevance. As an example, it can be said that until the development of GWAS it was not possible to find any gene linked to schizophrenia [5].
Not only are GWAS of interest for the discovery of robust associations, but they also give information about the nature of variations in traits and have contributed to the discovery of new biological knowledge about how DNA variations can affect gene regulation. The variations in the human genome are mainly down to two causes: point

Support Vector Machines
Support vector machines (SVM) are a supervised-learning classification technique that has shown its ability in dealing with classification [16] and regression problems [17,18]. In the case of the present research, SVM is employed for binary classification in cases and controls. Let us suppose they are denoted by {−1, +1}. Therefore, predictors of the form f : R D → {−1, +1} are considered.
The SNPs of each member of the population are represented by a vector x n ∈ R D where D is the number of SNPs employed in the study and the case or control label is given by y n . Given a training dataset consisting of pairs {(x 1 , y 1 ), (x 2 , y 2 ) . . . (x n , y n )} with the objective being to train an SVM model with the lowest classification error. Let x i ∈ R D be a data sample, and consider the function f , R D → R in such a way that x → w, x + b are w ∈ R D and b ∈ R the hyperplane that separates the two classes in the binary classification problem can be written as x ∈ R D : f (x) = 0 .
Please note that w is a normal vector for the hyperplane and b the intercept. When training the classifier, we want to ensure that the examples with positive labels (cases) are on the positive side of the hyperplane, while those with negative labels are on the negative side. These two conditions can be expressed as y n ( w, x n + b) ≥ 0 where y n = −1 or 1.
To find a unique solution, one idea is to choose the separating hyperplane that maximizes the margin between the cases and controls. The margin represents the distance of the hyperplane to the closest examples of cases and controls, respectively, in those cases in which it would be possible to assume that the dataset is linearly separable [19], but this is not the case of the present research.
Let us consider one individual x i , which without loss of generality can be considered as a case and labeled as +1. This case observation should be on the positive side of the hyperplane w, x i + b > 0. The distance of x i to the hyperplane is given by the distance of the orthogonal projection of the point to the plane. Using vector addition, x i can be expressed as follows: In other words, all the case observations must be at least a distance r from the hyperplane. It can be expressed by the following equation: And the optimization problem to be solved can be expressed as: maxr Subject to: y n ( w, x n + b) ≥ r ω = 1, r > 0 Can then be interpreted as the maximization of r while ensuring that the case observations lie on the correct side of the hyperplane. This is usually called the margin maximization parameter.
In these cases, like in the one in the present problem where data are not linearly separable, some examples would fall into the margin region or even on the wrong side of the hyperplane. This model is called soft margin SVM and makes use of the following equations: Subject to: y n ( w, x n + b) = 1 − ξ n , ξ n ≥ 0. ξ n is the slack variable corresponding to each observation. The parameter C > 0 trades off the size of the margin and the total amount of slack that we have, and is called the regularization parameter.
The margin term ω 2 is called the regularizer. The convex duality via Lagrange Multipliers of the previous formula can be expressed as follows: When the Lagrangian is differentiated with respect to the three primal variables, ω, b and ξ, the following result is obtained: Setting the three partial derivatives to zero, it would be possible to express the previous equations by means of their dual negative equations. This converts the problem from a maximization into a minimization one: Although SVM requires linearity for classification, the use of kernels makes this methodology also useful for non-linear problems. A kernel can be defined as a function K : X × X → R for which there is a Hilbert space H and φ : X → H is feature map such that: Kernels must be symmetric and positive semidefinite functions in such a way that the kernel matrix K is symmetric and positive semidefinite. The use of different kernels has different effects on the separating hyperplanes. In this research lineal, polynomial, radial basis functions and sigmoid kernels have all been tested. These kernel functions can be expressed as follows [19]:

Genetic Algorithms
Genetic algorithms can be defined as biologically inspired methods for optimization [20]. The foundations of genetic algorithms can be found in the works of Holland [21], Rechenberg [22] and Schwefel [23].
For their initialization, genetic algorithms require an initial set of candidate solutions for the optimization problem to be solved. Table 1 shows the pseudocode of a genetic algorithm. As can be observed in the table, the first step involves creating an initial population. Data representation and how the initial population is created both have a great importance on the genetic algorithm performance. The second operation performed is the crossover. A non-deterministic crossover function can be defined as X : Ω × Ω → Ω . The result of X x i , x j gives a new population member with the same length as x i and x j and as such, all their elements belong either to x i or x j with a certain probability.
Mutation is employed to inject new strings into the next generation [24], which gives the genetic algorithm the ability to search beyond the confines of the initial population. The mutation function can be expressed as: µ : Ω → Ω . It is like a crossover: a nondeterministic function that assigns to each string member a certain probability of being randomly changed. The fitness computation assigns a value to each member of the population that represents how well they fit to the problem to be solved. Those individuals with the most favorable results of the fitness function are more likely to be selected as parents of the next-generation offspring.
This process is repeated until any termination condition is reached. In the case of the present research, there are certain key parameters to be controlled in order to obtain a fine-tuned version of the algorithm. These parameters are the number of iterations, the population size, crossover and probability of mutation.

The Proposed Algorithm
The algorithm proposed in the present research makes use of both genetic algorithms and support vector machines in order to find out whether a certain pathway, which in this context can be considered in the same way as a set of SNPs, is able to identify cases and controls for a certain trait or illness. Figure 1 shows the flowchart of the proposed algorithm. The first step involves selecting the subset of SNPs that belongs to the pathway under analysis. This means that from the total number of SNPs included in the database, the information required for the analysis is reduced to a selected subset of SNPs of all the members of the population. In other words, the SNPs chosen are only those that belong to the pathway to be studied.
The members of the genetic algorithm (GA) population for this analysis are strings of "1s" and "0s" that indicate which SNPs will form a part of the SVM model to be computed. Please note that "1" means that the SNP will take part of the SVM model and "0" that it will not. In the case of the present research, each member of the GA population has the same length as the number of SNPs that constitute the pathway under analysis. Please note that each GA population has several members, and an SVM model is trained for each one.
All the classification SVM models are trained using as input variables the SNPs with the "1" value and as output, the variable trait that indicates which elements are cases and which are controls. As may be seen in the flowchart, the initial population is formed by rows from an identity matrix selected in a random way up to the completion of the total number of individuals required for the GA population. This means that in the initial population, only one SNP is active in each population member. In other words, it means that after selecting as input information only those SNPs that belong to the pathway under analysis, the initial population is formed by individuals in which only one of those SNPs is active and, afterwards, the different SNPs that belong to the subset that is being employed are switched on and off with the aim of improving the results of the fitness function. The reason for choosing only one SNP in each member of the initial population is that the goal is to get the maximum values of the fitness function while making use of the minimum number of SNPs required and to allow for the importance of every single SNP to be taken into account individually.
In the following populations, the number of SNPs selected in each population member can be more than one as each one of the population members evolves, taking into account genetic algorithms rules in search of the maximization of the value of the fitness function. The fitness function consists of calculating the area under the ROC curve [25] that is obtained when data are classified, making use of the SVM calculated for that member of the population (string of "0s" and "1s") using the active SNPs as independent variables and as dependent variables whether or not the individual suffers from a certain trait, which in the case of the present research is colorectal cancer. In order to avoid problems related to epistasis, members of the population that choose more than one SNP from the same gene have a value of 0 assigned to their fitness function.
When the stop criterion is reached, i.e., the maximum number of cycles allowed, the area under the ROC curve (AUC )value achieved is recorded and the SNPs employed are deleted from the set of SNPs. The process starts again looking for a new SNPs subset for which the AUC value can be as high as possible. The process is then repeated until the SNP set is empty or until a total of 80 cycles have been completed. Afterwards, the same process is repeated 1000 times, making use of permutations of cases and controls labels. Please note that in this field, some of the existing literature recommends 10,000 permutations [26] while in classical literature the number of permutations considered in most of the papers for estimating the power of a permutation test was 1000 [27][28][29] and in some of them only 500 [30,31]. It is also worth noting that previous research [32,33] stated that 1000 is a reasonable number of permutations for a test at the 5% level of significance. Finally, in the field of genomic studies there is software that considers that permutation values from 1000 can feasibly be employed in GWAS [34].  When the stop criterion is reached, i.e., the maximum number of cycles allowed, the area under the ROC curve (AUC )value achieved is recorded and the SNPs employed are deleted from the set of SNPs. The process starts again looking for a new SNPs subset for which the AUC value can be as high as possible. The process is then repeated until the SNP set is empty or until a total of 80 cycles have been completed. Afterwards, the same process is repeated 1000 times, making use of permutations of cases and controls labels. Please note that in this field, some of the existing literature recommends 10,000 permutations [26] while in classical literature the number of permutations considered in most of the papers for estimating the power of a permutation test was 1000 [27][28][29] and in some of them only 500 [30,31]. It is also worth noting that previous research [32,33] stated that 1000 is a reasonable number of permutations for a test at the 5% level of significance. Finally, in the field of genomic studies there is software that considers that permutation values from 1000 can feasibly be employed in GWAS [34].

Design of Experiments
Nowadays, it is well-known that there are no optimal parameter values valid for all problems [20]. This statement is part of the no-free-lunch theorem [35], which states that there is no overall superior optimization algorithm capable of solving every kind of optimization problem.
Parameter tuning strategies treat the parametrization of GA as an optimization problem. In the case of the present research, the GA parameters are tuned with the help of design of experiments methodology (DOE), but there are other possible approaches. For example, the parameter calibration process can be supported on statistical methods [36], a deterministic control whereby an extended scheme is used to control the parameters [37], adaptive control strategies like Rechenberg's mutation rate [22] or self-adaption [20]. In the case of the present research, DOE was chosen.
DOE is a statistical methodology that is employed in order to find relationships among variables that affect certain processes [38]. In other words, DOE makes it possible to see how a simultaneous change in more than one variable will affect the output variable. Although DOE can be applied to both categorical and continuous variables, in the case of the present research all the variables under study with DOE are continuous.
A design of experiments was performed to select the most suitable values of the GA parameters for the algorithm proposed in the present research, using the colorectal cancer pathway as a reference. Furthermore, for designing the experiment, a full factorial design with center points was employed. In a full factorial design, all possible combinations of factor levels are used [39]. For each point, experiments were repeated three times. Please note that each individual experimental setting is referred to as a run and the response measured is called an observation. The present work made use of DOE for the fine-tuning of the GA algorithm. For the DOE analysis, the continuous variables of the research are considered, namely, the number of iterations of the algorithm, the population size and the values of mutation and crossover. The response value is the area under the ROC curve obtained with the SVM model that makes use of the variables selected. Each variable was measured at three different levels and, therefore, a 3 4 full factorial design with 81 experiments, each one repeated three times, was obtained. The variables employed and their corresponding level are shown in Table 2.

Datasets
The database employed in this study belongs to the Colorectal Cancer Transdisciplinary Study (CORECT) project. This was an observational multicentric multi-case control study performed from September 2008 to December 2013. For this research, the subset of information belonging to Leon University Hospital and Hospital of Bellvitge was employed. It contains 1076 cases of, and 973 controls for, colorectal cancer, for which the information from 370,570 SNPs was available.
The cases are incidental and histologically confirmed, with ages between 20 and 85. Those with communication disabilities, physically unable to participate or with a previous diagnosis of colorectal cancer were excluded. For their recruitment, the study staff contacted them at the selected hospitals.
Controls were randomly selected from the population lists assigned to family physicians in the catchment area of the hospitals where the cases were recruited, and with the same sex and age distribution (±5 years). All had been residing in the area of the hospital where the cases were recruited for at least 6 months before and did not present physical or communication impediments.
The protocol of the project was approved by the Ethics Committees of the institutions that took part in the study. The participation of the subjects in the study was voluntary, after signing an informed consent. The confidentiality of the data is guaranteed by eliminating the personal identifiers in the datasets and all the files that include information about the subjects, complying with Spain's Organic Law 15/1999. Please also note that the files have been registered with the Spain Data Protection Agency (Number 2102672171). Access to this information for other researchers is allowed on request.
For the present study, ten different pathways were selected from the KEGG database [40][41][42] so as to include, on the one hand, pathways for which there was already significant scientific evidence that they were associated with the trait analyzed; on another hand, pathways for which the evidence indicated that their association with the trait was improbable, and finally others for which the evidence was inconclusive. Once the ten pathways to be included had been chosen, all the SNPs that belonged to the genes considered in each one of the pathways under analysis were retrieved from the database.

Results
After having fixed the GA parameters to a population size of 5500, 6000 iterations for each cycle and a 100% crossover with a mutation rate of 1%, the proposed algorithm was applied to 10 different pathways.

Design of Experiments
The variables employed in this design of experiments were number of iterations, population size, crossover, and mutation rate. The values tested are presented in Table 2. Each of the combinations of variables combinations was tested three times. Figure 2 shows the main effect plots of these four variables. According to the results obtained, the number of iterations of the GA was fixed at 6000, as there is only a very small increase in the AUC value from 6000 to 8000 (about 0.1%). In the case of the population size, the value of 5500 individuals was considered to be sufficient, as increasing the number of population members to 10,000 only meant an improvement of less than 0.2% in the AUC results. For the mutation rate whose values are presented in logarithmic scale, the maximum of the three values tested was achieved for 1%, which was the center point. Finally, in the case of the crossover rate, 100% was chosen due to it giving the highest performance. For all the pathways analyzed in this research, the number of iterations is 6000, with a population size of 5500, a mutation rate of 1% and a crossover of 100%.

Application of the Algorithm to Different Pathways
After having fixed the GA parameters at a population size of 5500, 6000 iterations for each cycle and a 100% crossover with a mutation rate of 1%, the proposed algorithm was applied to 10 different pathways with different degrees of relationship with the trait under study.
As can be seen in Figure 1, the first step in all the cases involves selecting as part of the initial set only those SNPs of the database taking part in the pathway under study. For

Application of the Algorithm to Different Pathways
After having fixed the GA parameters at a population size of 5500, 6000 iterations for each cycle and a 100% crossover with a mutation rate of 1%, the proposed algorithm was applied to 10 different pathways with different degrees of relationship with the trait under study.
As can be seen in Figure 1, the first step in all the cases involves selecting as part of the initial set only those SNPs of the database taking part in the pathway under study. For example, in the case of the adipocytokine signaling pathway, it has a total of 752 SNPs and in the case of the AMP-activated protein kinase (AMPK) signaling pathway the number is 1812. The initial population of 5500 individuals is always formed by vectors in which only one element is different from zero. It indicates the SNP employed for the SVM classification in the first iteration. Starting from such an SNP, new individuals that combine zeros and ones are formed in search of those SNPs that can provide the maximum AUC value while at the same time taking into account that no more than one SNP from the same gene can be included in the same population member. This iterative process is repeated 6000 times. In the case of the AMPK signaling pathway, the AUC value obtained after this process was 0.584023, and for the adipocytokine signaling pathway it was 0.565382. When the iterations are finished, the SNPs employed in the subset with the maximum AUC obtained are removed and the process is repeated in search of the best remaining SNPs. Although this process would normally be repeated while there were still SNPs pending employment, it was halted after 80 cycles. There are two reasons for this. On the one hand, the time required for these iterations, which is quite high (34.51 s per iteration on average) and on the other hand that although we do not know in advance the exact number of SNPs involved in 80 iterations, in order to compare the results of pathways of different lengths, a number of repetitions was chosen that would be feasible for any of the pathways.
After having run the algorithm for all the pathways included in the study, the same process was repeated 1000 times for each one but permuting phenotypes of cases and controls while preserving the total number of 1076 cases of colorectal cancer and 973 controls. The results obtained were compared.
The main results are detailed in Table 3. This table presents the total number of SNPs that are included in each of the 10 pathways under analysis. As was mentioned before, the algorithm was repeated 80 times in all cases. This means that not all the SNPs were employed for the process of classification of cases and controls. For example, for the adipocytokine signaling pathway, which has a total of 752 SNPs, only 496 were employed in any of the iterations for the classification of individuals in cases and controls. Figure 3 shows the AUC values of the 80 iterations performed for the adipocytokine signaling pathway in the case of cases and controls (phenotype) and for 5 different permutations of the 1000. For the graphical representation and for the comparison of wins in pathway versus permutated, AUC values are ordered from highest to lowest. In this case, the phenotype curve (in green) does not seem to classify cases and controls in a better way than the permutated ones. Please note that as can be observed in Table 3, the average AUC value of the 80 cycles of phenotypes is 0.535858, while in the case of case of permutated cases and controls it is 0.537543, which means it is 0.31% lower. The column called win subsets indicates the percentage of times when the AUC value of the phenotype is higher than the permutated ones.
Something similar happens in the case of the insulin resistance pathway, where the percentage of the win pathways is 29.75% and the values of the AUC and the average permutated AUC are 0.555483 and 0.556201, respectively (−0.13%). Therefore, in this case, as in the adipocytokine signaling pathway, there does not seem to be any relationship between the two pathways and colorectal cancer. Please also note how in Figure 4 the curve of Phenotype does not seem to be higher than the permutated ones. In addition, for the longevity regulating pathway, whose curve is represented in Figure 5, the situation is similar and no significant influence of those pathways on colorectal cancer can be reported. Table 3. Pathways under analysis. Total number of single nucleotide polymorphism (SNPs) per pathway (Tot. SNPs), SNPs employed in the 80 iterations by the non-permutated phenotypes (SNPs employed), average area under the receiver operation curve (AUC) obtained in the 80 iterations by the non-permutated phenotypes (AUC), AUC obtained by the permutated phenotypes (AUC perm.), percentage of non-permutated AUC values than are higher than the maximum permutated AUC value (win subsets).  Something similar happens in the case of the insulin resistance pathway, where the percentage of the win pathways is 29.75% and the values of the AUC and the average permutated AUC are 0.555483 and 0.556201, respectively (−0.13%). Therefore, in this case, as in the adipocytokine signaling pathway, there does not seem to be any relationship between the two pathways and colorectal cancer. Please also note how in Figure 4 the curve of Phenotype does not seem to be higher than the permutated ones. In addition, for the longevity regulating pathway, whose curve is represented in Figure 5, the situation is similar and no significant influence of those pathways on colorectal cancer can be reported. Something similar happens in the case of the insulin resistance pathway, where the percentage of the win pathways is 29.75% and the values of the AUC and the average permutated AUC are 0.555483 and 0.556201, respectively (−0.13%). Therefore, in this case, as in the adipocytokine signaling pathway, there does not seem to be any relationship between the two pathways and colorectal cancer. Please also note how in Figure 4 the curve of Phenotype does not seem to be higher than the permutated ones. In addition, for the longevity regulating pathway, whose curve is represented in Figure 5, the situation is similar and no significant influence of those pathways on colorectal cancer can be reported.   The opposite case applies for apelin signaling, mitochondrial biogenesis and colorectal cancer. In these three cases, the average AUC value obtained for cases and controls are clearly higher than for the permutated phenotypes. In the case of the apelin signaling pathway, it is 5.15% higher in the case of phenotype when compared with the permutated solutions. In the case of mitochondrial biogenesis, it is 3.23% and for the colorectal cancer pathway, the value is 2.45%. In all the cases, the AUC values obtained are higher in the phenotypes than in the permutated cases with 100% of winning cases for apelin signaling pathway, colorectal cancer pathway and mitochondrial biogenesis. Figures 6-8 clearly show how the phenotype curves are higher than the permutated ones. Although in the case of Figure 9, where the AMPK signaling pathway is represented, it does not seem to be as clear as in the three previous cases, the AUC value is 2.26% higher when compared with permutated cases. Please note that in 89.75% of cases the values obtained are higher in the phenotypes than in the permutated cases which, from our point of view, would mean that there is certain influence of this pathway on the trait under analysis.  The opposite case applies for apelin signaling, mitochondrial biogenesis and colorectal cancer. In these three cases, the average AUC value obtained for cases and controls are clearly higher than for the permutated phenotypes. In the case of the apelin signaling pathway, it is 5.15% higher in the case of phenotype when compared with the permutated solutions. In the case of mitochondrial biogenesis, it is 3.23% and for the colorectal cancer pathway, the value is 2.45%. In all the cases, the AUC values obtained are higher in the phenotypes than in the permutated cases with 100% of winning cases for apelin signaling pathway, colorectal cancer pathway and mitochondrial biogenesis. Figures 6-8 clearly show how the phenotype curves are higher than the permutated ones. Although in the case of Figure 9, where the AMPK signaling pathway is represented, it does not seem to be as clear as in the three previous cases, the AUC value is 2.26% higher when compared with permutated cases. Please note that in 89.75% of cases the values obtained are higher in the phenotypes than in the permutated cases which, from our point of view, would mean that there is certain influence of this pathway on the trait under analysis. The opposite case applies for apelin signaling, mitochondrial biogenesis and colorectal cancer. In these three cases, the average AUC value obtained for cases and controls are clearly higher than for the permutated phenotypes. In the case of the apelin signaling pathway, it is 5.15% higher in the case of phenotype when compared with the permutated solutions. In the case of mitochondrial biogenesis, it is 3.23% and for the colorectal cancer pathway, the value is 2.45%. In all the cases, the AUC values obtained are higher in the phenotypes than in the permutated cases with 100% of winning cases for apelin signaling pathway, colorectal cancer pathway and mitochondrial biogenesis. Figures 6-8 clearly show how the phenotype curves are higher than the permutated ones. Although in the case of Figure 9, where the AMPK signaling pathway is represented, it does not seem to be as clear as in the three previous cases, the AUC value is 2.26% higher when compared with permutated cases. Please note that in 89.75% of cases the values obtained are higher in the phenotypes than in the permutated cases which, from our point of view, would mean that there is certain influence of this pathway on the trait under analysis.   The last three pathways under study were glucagon signaling (Figure 10), Huntington's disease ( Figure 11) and insulin signaling ( Figure 12). In these three pathways, in a similar way to the case of the AMPK signaling pathway, the AUC value of the phenotype is slightly higher than the average value obtained for the permutated ones. Additionally, in most of the cases (from 82.50% to 96.50%), the AUC obtained in the phenotype iteration is higher than in the permutated one.  The last three pathways under study were glucagon signaling (Figure 10), Huntington's disease ( Figure 11) and insulin signaling ( Figure 12). In these three pathways, in a similar way to the case of the AMPK signaling pathway, the AUC value of the phenotype is slightly higher than the average value obtained for the permutated ones. Additionally, in most of the cases (from 82.50% to 96.50%), the AUC obtained in the phenotype iteration is higher than in the permutated one.  The last three pathways under study were glucagon signaling (Figure 10), Huntington's disease ( Figure 11) and insulin signaling ( Figure 12). In these three pathways, in a similar way to the case of the AMPK signaling pathway, the AUC value of the phenotype is slightly higher than the average value obtained for the permutated ones. Additionally, in most of the cases (from 82.50% to 96.50%), the AUC obtained in the phenotype iteration is higher than in the permutated one.

Pathway
In summary, taking into account the results obtained with the algorithm proposed in the present research, it can be said that there is a clear relationship linking apelin signaling, colorectal cancer and mitochondrial biogenesis pathways with colorectal cancer. A weak relationship with colorectal cancer was found for AMPK signaling, glucagon signaling, Huntington's disease and insulin signaling pathways. Finally, no relationship with colorectal cancer was found for adipocytokine signaling, insulin resistance or longevityregulating pathways.  In summary, taking into account the results obtained with the algorithm proposed in the present research, it can be said that there is a clear relationship linking apelin signaling, colorectal cancer and mitochondrial biogenesis pathways with colorectal cancer. A weak relationship with colorectal cancer was found for AMPK signaling, glucagon signaling, Huntington's disease and insulin signaling pathways. Finally, no relationship with colorectal cancer was found for adipocytokine signaling, insulin resistance or longevityregulating pathways.

Discussion
From the authors' point of view, the results obtained are coherent with previous results to be found in existing literature. Regarding the AUC values, it must be stated that although they would seem to be low, they are in line with other research [43]. In the case of the AMPK signaling pathway, previous studies found that AMPK promotes the survival of colorectal cancer stem cells [44]. This study found that colorectal cancer stem cells show a higher level of antioxidant genes and have a lower level of reactive oxygen species than non-colorectal cancer stem cells. This would be because the colorectal cancer stem cells also possess more mitochondria mass and show higher mitochondrial activity. In the case of this study, higher AMP-activated protein kinase (AMPK) activity was observed in these colorectal cancer stem cells.
Another study that included patients with stage II or III of colorectal cancer, amongst whom the 5-year survival rate was between 50% and 87%, found that the AMPK encoded

Discussion
From the authors' point of view, the results obtained are coherent with previous results to be found in existing literature. Regarding the AUC values, it must be stated that although they would seem to be low, they are in line with other research [43]. In the case of the AMPK signaling pathway, previous studies found that AMPK promotes the survival of colorectal cancer stem cells [44]. This study found that colorectal cancer stem cells show a higher level of antioxidant genes and have a lower level of reactive oxygen species than non-colorectal cancer stem cells. This would be because the colorectal cancer stem cells also possess more mitochondria mass and show higher mitochondrial activity. In the case of this study, higher AMP-activated protein kinase (AMPK) activity was observed in these colorectal cancer stem cells.
Another study that included patients with stage II or III of colorectal cancer, amongst whom the 5-year survival rate was between 50% and 87%, found that the AMPK encoded in gene α1 was overexpressed in patients who suffered from colorectal cancer. For its authors, the AMPK encoded in gene α1 regulate the glutathione reductase (GSR) phosphorylation, possibly through residue Thr507, which enhances its activity. They suggested the suppression of AMPK expression in gene α1 by using nano-sized polymeric vector to induce a favorable therapeutic effect.
In the case of the apelin signaling pathway, which was also found to be relevant to colorectal cancer, there are some studies linking it to colorectal cancer [45]. Apelin is an endogenous ligand of the apelin receptor (APJ), a seven-transmembrane G protein-coupled receptor [45]. It can be found in the brain and also in peripheral organs like the heart, the lungs, blood vessels, and adipose tissue. It is involved in regulating cardiac and vascular function, heart development, and vascular smooth muscle cell proliferation. According to previous research, apelin is not only related to colorectal cancer, but also to others like lung cancer, gastroesophageal, hepatocellular carcinoma, prostate cancer, endometrial cancer, oral squamous cell carcinoma, brain cancer, and tumor neoangiogenesis. This means that Apelin/APJ may be a potential anticancer therapeutic target. A study suggested that the APJ receptor antagonist F13A significantly reduced cellular proliferation [46]. Another study [47] found that Apelin receptor is co-expressed in colorectal cancer cell lines and its activation leads to adenylyl cyclase inhibition and Akt phosphorylation. For the authors of that research, apelin and its receptor might be co-expressed in the tumor compartment where this co-expression would underlie a constitutive activation of apelin signaling and create a functional autocrine loop. It was the first study that reported that apelin peptide is highly expressed in human colon adenomas and tumors [47]. This co-expression was also observed in several colorectal cancer cell lines. In the LoVo cell line, quantitative real-time polymerase chain reaction (qRT-PCR) experiments and apelin-induced Akt phosphorylation confirmed the concomitant expression of both ligand and receptor. In addition, apelin behaved as an anti-apoptotic peptide, by reversing caspase activation and poly ADP ribose polymerase protein (PARP) degradation induced by theMG132 proteasome inhibitor. Another study [48] that measured apelin, and its receptor mRNA, and protein expression levels in tumor tissue of 56 surgically treated colorectal adenocarcinoma patients and compared them with 27 healthy controls, found that serum levels of apelin and its receptor were increased in colorectal cancer patients in comparison to controls, which leads to the conclusion that apelin could be an important factor in the progression of colorectal carcinoma. The finding of the colorectal cancer pathway as being significant by our algorithm is not surprising, as it can be considered as the pathway of reference.
Mitochondria are semiautonomous organelles that participate in energy metabolism, free radical production, and apoptosis. Apart from the nucleus, the mitochondrion is the only cellular organelle that contains its own genome and genetic machinery [49,50]. Mitochondrial biogenesis is an essential process by which new mitochondria are obtained, and is one which requires coordination between the nuclear and mitochondrial genomes [51]. Mitochondria, as well as most of the processes related to them, are closely linked to the genesis of cancer [52]. For this reason, it is essential to study mitochondrial biogenesis, as well as to find out what happens to these organelles during tumor processes. The progression of CRC in humans is closely linked to mitochondrial alteration, increased production of free mitochondrial oxide radicals, and oxidative stress [53].
An article in a literature review article [54], aimed at evaluating whether increased or decreased peroxisome proliferator-activated receptor gamma coactivator 1-α (PPARGC1A or PGC1α) expression affects the development of colorectal cancer, found that an altered expression of PGC1α modifies colorectal cancer risk and mitochondrial biogenesis is regulated by PGC1α. According to this study, it seems plausible that the proposed algorithm found a relationship between the mitochondrial biogenesis pathway and colorectal cancer.
Glucagon increases the production of glucose by increasing glycogenolysis and gluconeogenesis in the liver, and by reducing glycogenesis and glycolysis. The release of glucagon in response to food consumption depends on the type of meal that has been eaten. If a meal is rich in carbohydrates, blood glucagon levels fall to prevent an undue rise in the level of circulating glucose. Conversely, when a protein-rich meal is eaten, the blood glucagon level rises. Nowadays cancer is known to be one of the major causes of death in diabetic patients, and an association between antidiabetic drugs and the risk of cancer has been reported [55]. Glucagon is nowadays recognized as a pivotal factor implicated in the pathophysiology of diabetes. A recent study has found [56] expression of the glucagon receptor in colon cancer cell lines and in colon cancer tissue obtained from patients. According to this study, glucagon significantly promoted colon cancer cell growth. Molecular assays showed that glucagon acted as an activator of cancer cell growth through deactivation of AMPK and activation of mitogen-activated protein kinase (MAPK). Another study [57] found the relationship between glucagon signaling pathway and endometrial cancer.
A study published in 2002 [58] stated that Huntington's disease provides clues about cancer and that it would be a marker of certain cancers like colorectal cancer. It can be said that, in general, people with Huntington's disease have been observed to have lower rates of cancers [59] and although the relationship of Huntington's disease with prostate cancer has been reported [60], no similar study has been found for colorectal cancer.
The insulin signaling pathway is another of the pathways where a moderate relationship with colorectal cancer was found. It has been reported that the modification in the individual values of plasma insulin levels due to diet may affect the risk of suffering from colorectal cancer [61]. A similar result was found by other researchers in a study performed with a sample of postmenopausal women [62]. Although there are many studies in this line [63,64] and it is known that genetic variants in metabolic signaling pathways may interact with lifestyle factors such as dietary fatty acids influencing colorectal cancer risk, these interrelated pathways are not fully understood [65].
As mentioned before, the proposed algorithm did not find any relationship linking adypocitokine signaling, insulin resistance and longevity-regulating pathways with colorectal cancer. In the case of the adipocytokine signaling pathway, no relationship has been found in the existing literature with colorectal cancer. However, its relationship with atherosclerosis, diabetes and breast cancer [66] has been reported. In the case of insulin resistance, as far as the authors know, the relationship is not clear enough at this point in time [67]. Finally, in the case of the longevity-regulating pathway, there were no studies found linking it with colorectal cancer.

Conclusions
This paper presents a novel method called GASVeM, which is based on two wellknown machine learning methodologies-genetic algorithms, and support vector machines. Although the results achieved appear promising, as usually happens in machine learning methodologies applied to GWAS, it is difficult to find a direct biological link between the SNPs involved in the results and the trait under study. In spite of this, through studying existing literature it has been possible for the authors to find previous well-known relationships between the relevant pathways and the trait under analysis.
From the authors' point of view and due to a lack of a machine learning gold standard for GWAS analysis, the present method could be of interest for future GWAS. In this direction, based on the results obtained, it would be interesting, on the one hand, to work on studying the classification capacity when information from several pathways is combined and, on the other hand, to try to replicate the results obtained in other databases and in the analysis of other pathologies, to validate the usefulness of the method under different conditions of use.
We also agree with those authors that consider that we are in the infancy of the use of machine learning in GWAS [68] as we are still quite a long way from achieving gold standard methods producing consistently validated biological insights. In addition, we would like to highlight the characteristics of this kind of database, whereby a high number of SNPs (columns) when compared with the number of cases (rows) cause a kind of problem with GWAS that is difficult to deal with from a machine learning point of view and that, in the case of the present research, is present in the need for an a priori SNPs selection based on pathways. It is the authors' opinion that this problem has a great impact on the reproducibility of results when the same algorithm is applied to a different database.
Finally, it must be said that we are aware that the translation of the results obtained with this method to a population-based clinical practice to carry out a personalization of interventions based on genomic data still requires further steps to be taken before the selection capacity can be refined. However, we consider that the method has demonstrated, as was our objective, a good ability to discriminate which pathways are associated with the event and which are not, through the choice of a limited set of SNPs. We consider the ability to classify individuals to be a second step to be taken in the development of the model, through lines of study such as the inclusion of several pathways in the model. Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.