
This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Genomic microarrays are powerful research tools in bioinformatics and modern medicinal research because they enable massively parallel assays and the simultaneous monitoring of thousands of gene expression levels in biological samples. However, even a simple microarray experiment produces very high-dimensional data, and this vast amount of information challenges researchers to extract the important features and reduce the dimensionality. In this paper, a nonlinear dimensionality reduction method, kernel method based locally linear embedding (LLE), is proposed, and a fuzzy K-nearest neighbors algorithm that denoises data sets is introduced as a replacement for the classical LLE's KNN step. In addition, a kernel method based support vector machine (SVM) is used to classify the genomic microarray data sets. We demonstrate the application of these techniques to two published DNA microarray data sets. The experimental results confirm the effectiveness and high success rates of the presented method.

The recent sequencing of the human genome has opened a new era in biomedical research; genomic microarray data have attracted a great deal of attention, as reflected by the ever-increasing number of publications on this technology in the past decade. The application of microarray technology encompasses many fields of study. From the search for differentially expressed genes onward, genomic microarray data present enormous opportunities and challenges for machine learning, data mining, pattern recognition, and statistical analysis, among others. In particular, microarray technology is a rapidly maturing technology that provides the opportunity to assay the expression levels of thousands or tens of thousands of genes in a single experiment [

LLE is considered one of the most effective dimensionality reduction algorithms for preprocessing high-dimensional and streaming data, and it has been used to solve various problems in information processing, pattern recognition, and data mining [

The purpose of this paper is to fill these gaps by presenting a kernel method based LLE algorithm (KLLE). The kernel method [ ] maps data from the input space R^n into a high-dimensional feature space R^N, in which nonlinear relationships in R^n can be treated with linear techniques in R^N.

Recently, the support vector machine (SVM) has been extensively used by the machine learning community because it deals effectively with high-dimensional data, provides good generalization properties, and defines the classifier architecture in terms of the so-called support vectors [

This paper focused on genomic microarray analysis, which enables researchers to monitor the expression levels of thousands of genes simultaneously [

The remainder of this paper is organized as follows. In Section 2, we introduce the kernel method. The kernel method based LLE algorithm is constructed in Section 3. In Section 4, the kernel method based SVM is introduced. In Section 5, we apply the proposed dimensionality reduction method to the lymphoma and SRBCT genomic microarray data sets; experiments and comparisons are conducted and presented. Conclusions are drawn in the final section.

The kernel method [ ] implicitly maps the input space R^n into a feature space R^N via a mapping Φ. The kernel function takes the form k(x_i, x_j) = <Φ(x_i), Φ(x_j)>, and the kernel (Gram) matrix has entries K_ij = k(x_i, x_j).

The kernel method's solution comprises two parts: a module that performs the mapping into the embedding or feature space, and a learning algorithm designed to discover linear patterns in that space. Firstly, we need to create a complicated feature space; then work out what the inner product in that space would be; and finally find a direct method for computing that value in terms of the original inputs. In fact, the kernel function computes this inner product directly from the original inputs, so the mapping never has to be carried out explicitly.

However, an explicit mapping Φ is not required: to obtain the inner products <Φ(x_i), Φ(x_j)> for all pairs of points x_i, x_j, it suffices to evaluate the kernel function k(x_i, x_j) on the original inputs, which yields the entries K_ij of the kernel matrix.

The kernel matrix acts as an information bottleneck: all the information the learning algorithm can glean from the training data and the chosen feature space is contained in the kernel matrix. The kernel matrix is not only the central concept in the design and analysis of kernel methods, but can also be regarded as the central data structure in their implementation. It is perhaps not surprising that some properties of the kernel matrix can be used to assess the performance of a learning algorithm.
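As a concrete illustration of the kernel-matrix computation described above, the following sketch builds the Gram matrix for a toy set of expression profiles using a Gaussian (RBF) kernel; the kernel choice, the `gamma` value, and the toy data are illustrative assumptions, not the paper's exact settings.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def gram_matrix(points, kernel=rbf_kernel):
    """Kernel matrix K with K[i][j] = k(x_i, x_j); symmetric and PSD."""
    n = len(points)
    return [[kernel(points[i], points[j]) for j in range(n)] for i in range(n)]

# Toy expression profiles (rows = samples, columns = genes)
X = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
K = gram_matrix(X)
```

Note that the learning algorithm only ever touches `K`; the feature space the RBF kernel induces is never constructed explicitly, which is exactly the information-bottleneck role described above.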

LLE [ ] recovers global nonlinear structure from locally linear fits: each data point and its neighbors are assumed to lie on or near a locally linear patch of the manifold, each point is reconstructed as a linear combination of its nearest neighbors, and the same reconstruction weights are then used to compute a low-dimensional embedding.

The conventional fuzzy KNN algorithm assigns an unlabeled pattern x a membership value in each of the M classes {c_1, c_2, …, c_M} rather than a hard label. The membership u_i(x) in class c_i is computed from the memberships u_ij of the k nearest labeled neighbors x_j, weighted by the inverse distance between x and x_j:

u_i(x) = Σ_j u_ij ||x − x_j||^{−2/(m−1)} / Σ_j ||x − x_j||^{−2/(m−1)},

where m > 1 is a fuzzifier parameter controlling how strongly the distance weighting decays.
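A minimal sketch of the fuzzy KNN membership computation above, simplified by treating the neighbors' crisp labels as their class memberships (a common special case of the formulation); the function name, the `eps` guard, and the toy data are illustrative assumptions.

```python
import math

def fuzzy_knn_memberships(x, neighbors, labels, num_classes, m=2.0):
    """Fuzzy KNN: class memberships for x, weighted by inverse distance
    to the k labeled neighbors; the returned memberships sum to 1."""
    eps = 1e-12  # guards against division by zero for coincident points
    weights = []
    for nb in neighbors:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, nb)))
        weights.append(1.0 / (d ** (2.0 / (m - 1.0)) + eps))
    total = sum(weights)
    u = [0.0] * num_classes
    for w, lab in zip(weights, labels):
        u[lab] += w / total
    return u

# Toy example: two nearby neighbors of class 0, one distant one of class 1
u = fuzzy_knn_memberships([0.0, 0.0],
                          [[0.1, 0.0], [0.0, 0.2], [1.0, 1.0]],
                          [0, 0, 1], num_classes=2)
```

Points whose largest membership is weak or ambiguous can then be treated as noise when neighborhoods are formed, which is the denoising role the fuzzy KNN step plays in KLLE.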

The KLLE extends LLE to work with the kernel method: the kernel maps the nonlinear data into a feature space in which each point is best reconstructed as a linear combination of its neighbors. Moreover, the kernel matrix is positive semi-definite, and its properties and eigen-decomposition are used to optimize the KLLE's objective function. In addition, the larger candidate neighborhood produced by the fuzzy KNN step makes the neighborhood selection more robust to noisy points.

Mapping. Let X = {x_1, x_2, …, x_n} be a set of n points in the high-dimensional input space R^D. The data are mapped by Φ from R^D into the feature space R^N.

The fuzzy neighborhood for each point. Assign neighbors to each data point x_i using the fuzzy KNN algorithm: the candidate neighbors of x_i are ranked by their membership values, and the k points with the highest memberships are taken as the neighborhood of x_i, which suppresses noisy neighbors.

The kernel method based manifold reconstruction error. The KLLE's reconstruction error is similar to that of LLE, measured by the cost function

ε(W) = Σ_i || Φ(x_i) − Σ_j W_ij Φ(x_j) ||²,

where the weights W_ij summarize the contribution of the j-th point to the reconstruction of the i-th point and satisfy Σ_j W_ij = 1.

Considering reconstruction weights subject to the constraint Σ_j W_ij = 1, the cost for each point x_i reduces to w_i^T G_i w_i, where G_i is the local Gram matrix of x_i's neighbors η_j with entries (G_i)_jl = k(x_i, x_i) − k(x_i, η_j) − k(x_i, η_l) + k(η_j, η_l), computable from kernel evaluations alone. Minimizing under the sum-to-one constraint gives the closed-form solution w_i = G_i^{−1} 1 / (1^T G_i^{−1} 1), where 1 denotes the all-ones vector.
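The closed-form weight computation can be sketched as follows, forming the local Gram matrix from kernel evaluations only; the `solve_linear` helper and the small ridge term added for numerical stability are implementation conveniences assumed here, not part of the original formulation.

```python
def solve_linear(G, b):
    """Gaussian elimination with partial pivoting for small systems G x = b."""
    n = len(G)
    A = [row[:] + [b[i]] for i, row in enumerate(G)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

def reconstruction_weights(k, xi, nbrs):
    """LLE-style weights for point xi from its neighbors, using only kernel
    evaluations: (G)_jl = k(xi,xi) - k(xi,nb_j) - k(xi,nb_l) + k(nb_j,nb_l)."""
    kii = k(xi, xi)
    n = len(nbrs)
    G = [[kii - k(xi, nbrs[j]) - k(xi, nbrs[l]) + k(nbrs[j], nbrs[l])
          for l in range(n)] for j in range(n)]
    for j in range(n):
        G[j][j] += 1e-6          # small ridge for numerical stability
    w = solve_linear(G, [1.0] * n)
    s = sum(w)                   # enforce the sum-to-one constraint
    return [wj / s for wj in w]

# Toy usage with a linear kernel: the midpoint of two neighbors should be
# reconstructed with equal weights.
k_lin = lambda a, b: sum(p * q for p, q in zip(a, b))
w = reconstruction_weights(k_lin, [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
```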

The kernel method computes the low-dimensional embedding Y = {y_1, …, y_n} by minimizing the embedding cost

Φ(Y) = Σ_i || y_i − Σ_j W_ij y_j ||² = tr(Y^T M Y), with M = (I − W)^T (I − W),

subject to the constraints Σ_i y_i = 0 and (1/n) Y^T Y = I, which remove translational and scaling degeneracies.

In this step, we propose a method to yield the KLLE embedding. The matrix M is sparse, symmetric, and positive semi-definite; its smallest eigenvalue is 0, with the constant vector 1 as the corresponding eigenvector, which is discarded. The embedding coordinates are given by the eigenvectors corresponding to the d smallest remaining eigenvalues of M.

The KLLE algorithm thus finds global coordinates y_i that preserve the local geometry encoded in the reconstruction weights, while the kernel mapping and the fuzzy neighborhoods make the local fits robust to nonlinearity and noise.

The original support vector machine can be characterized as a powerful learning algorithm based on recent advances in statistical learning theory [

Let {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)} be the training set, where x_i ∈ R^n is the i-th input sample and y_i ∈ {−1, +1} is its class label.

The objective of SVM is to maximize the margin of separation and minimize the training errors. The problem can then be transformed into the following Lagrange dual formulation:

maximize W(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j)

subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C, where the α_i are the Lagrange multipliers and C is the regularization parameter; the training points with α_i > 0 are the support vectors.
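The dual objective above can be evaluated directly from a precomputed kernel matrix; the sketch below (names hypothetical) checks a feasible `alpha` on a two-point toy problem rather than running a full SVM solver.

```python
def svm_dual_objective(alpha, y, K):
    """Dual objective W(a) = sum_i a_i - 1/2 sum_{i,j} a_i a_j y_i y_j K_ij,
    which SVM training maximizes subject to sum_i a_i y_i = 0, 0 <= a_i <= C."""
    n = len(alpha)
    linear = sum(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
               for i in range(n) for j in range(n))
    return linear - 0.5 * quad

# Two separable points with a linear kernel; alpha = [0.5, 0.5] satisfies
# the equality constraint sum_i alpha_i y_i = 0.
K = [[1.0, 0.0], [0.0, 1.0]]
val = svm_dual_objective([0.5, 0.5], [1, -1], K)
```

Because only kernel values k(x_i, x_j) appear in the objective, the same solver works unchanged in the feature space induced by any kernel, which is what lets SVM classify the KLLE-reduced data nonlinearly.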

One widespread use of microarrays is classification: for example, the prediction of the phenotype of a biological sample based on its pattern of gene expression. The analysis of gene expression profiles, which serve as molecular signatures for cancer classification and for the identification of differentially expressed groups of genes, provides a high-level view of functional classes or pathways, and has become a challenging and significant topic in bioinformatics research. In order to do this, one needs a 'training set' of samples that have well-defined phenotypic differences and that can be used to generate reproducible profiles. A wide range of algorithms has been used for classification, such as artificial neural networks [

In this section, we evaluate the performance of our kernel based dimensionality reduction algorithm and classifier on two published DNA microarray data sets: one is the small round blue cell tumors (SRBCTs) dataset, and the other is the lymphoma dataset. The procedure comprises two steps. In the first step, KLLE is applied to project the training sets from the original high-dimensional gene space into a low-dimensional space. In the second step, the kernel based SVM is trained on the reduced data and its classification accuracy is evaluated on the test sets.

In the first series of computational experiments, we considered a data set on SRBCTs presented in the work of [

In this paper, considering that some genes are irrelevant for diagnosis and would degrade the performance of the classifier, we followed Khan's and [

By our proposed method, the classifier accuracy was 100% when the 20 genes were reduced to a space of at least 5 dimensions, and the 96 genes were reduced to a space of at least 7 dimensions. By the LLE method, the classifier accuracy was 100% when the 20 genes were reduced to at least 9 dimensions and the 96 genes to at least 14 dimensions. Finally, by the PCA method, the classifier accuracy was 100% when the 20 genes were reduced to at least 11 dimensions and the 96 genes to at least 29 dimensions. The implementation (at 100% classifier accuracy) returned a ranked list in about 1307 sec for the SRBCTs, versus 1743 sec by LLE and 1933 sec by PCA, all much faster than the 4127 sec by SVM without dimensionality reduction.

In fact, although the previous studies showed that linear classifiers are good enough to achieve almost perfect classification [

The second data set includes samples originating from the lymphoma dataset [Alizadeh et al.], which can be obtained from

To find the genes that contribute most to the classification, the T-test, which has been used in gene selection [
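The T-score ranking used for gene selection above can be sketched as follows; a Welch-style two-sample t-statistic is assumed here, and the function names and toy data are illustrative.

```python
import math

def t_score(group_a, group_b):
    """Welch t-statistic for one gene's expression values across two classes."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    va = sum((v - ma) ** 2 for v in group_a) / (na - 1)  # sample variances
    vb = sum((v - mb) ** 2 for v in group_b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def top_genes(expr_a, expr_b, n_top):
    """Rank genes by |t| and return the indices of the n_top most
    discriminative; expr_a/expr_b are per-class lists of samples,
    each sample a list of gene expression values."""
    n_genes = len(expr_a[0])
    scores = []
    for g in range(n_genes):
        a = [s[g] for s in expr_a]
        b = [s[g] for s in expr_b]
        scores.append((abs(t_score(a, b)), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:n_top]]

# Toy data: gene 0 separates the classes, gene 1 is pure noise
expr_a = [[5.0, 1.0], [5.1, 0.9], [4.9, 1.1]]
expr_b = [[1.0, 1.0], [1.1, 1.2], [0.9, 0.8]]
selected = top_genes(expr_a, expr_b, 1)
```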

We followed the same procedure as for the SRBCT dataset. We performed KLLE, LLE, and PCA on the dataset selected by T-score, consisting of the expression levels of the top 165 genes, and on Tibshirani's data set, in which only 48 genes were selected based on the nearest shrunken centroids method for gene selection.

With dimensionality reduced by the KLLE method, the KLLE-SVM classifier accuracy was 100% when the 48 genes were reduced to a space of at least 7 dimensions and the 165 genes to at least 10 dimensions. By the LLE method, the classifier accuracy was 100% when the 48 genes were reduced to at least 11 dimensions and the 165 genes to at least 15 dimensions. Finally, by the PCA method, the classifier accuracy was 100% when the 48 genes were reduced to at least 18 dimensions and the 165 genes to at least 22 dimensions. The implementation (at 100% classifier accuracy) returned a ranked list in about 1766 sec for the lymphoma dataset, versus 2247 sec by LLE and 3105 sec by PCA, all much faster than the 5343 sec by SVM without dimensionality reduction.

From these results it is clear that KLLE performs excellently at dataset dimensionality reduction. Two facts demonstrate this capability. On the one hand, on datasets with nonlinear structure, the kernel based nonlinear KLLE preserves intrinsic properties better than LLE and the linear PCA. On the other hand, the time consumption of KLLE-SVM is smaller than that of the other methods, and far smaller than that of SVM without dimensionality reduction. The results also show that the proposed KLLE enhances SVM's classification competence on high-dimensional data.

The application of machine learning to data mining and analysis in the area of microarray analysis is rapidly gaining interest in the community. The large number of gene expressions, coupled with analysis over a time course, provides an immense space for genomic dimensionality reduction and selection. In this paper, we presented an effective approach to reducing high dimensionality and classifying genes in genomic microarray experiments. In our approach, the kernel method is demonstrated to be able to extract the complicated nonlinear information embedded in the data sets via a nonlinear mapping. This paper proposed an improved kernel locally linear embedding algorithm for dimensionality reduction, based on the traditional LLE, the kernel method, and fuzzy KNN. The proposed algorithm compresses and denoises the redundant information in manifolds while preserving most intrinsic properties. It is confirmed that our proposed KLLE overcomes some primary shortcomings of the original algorithm, stimulating further applications of LLE.

The experimental results indicate that the proposed method performs well in dimensionality reduction and achieves high classification accuracies on the SRBCT and lymphoma datasets. The results also show that this approach preserves the datasets' intrinsic nonlinear relationships and performs better than the currently popular LLE and PCA approaches. We conclude that KLLE not only helps biological researchers classify cancers that are difficult to differentiate because of high dimensionality, but also helps researchers focus on a small number of important genes and find the nonlinear relationships among them.

This work is supported by the National Natural Science Foundation under Grant No. 10671030.

The experiments on the SRBCTs dataset: (a) the classifier accuracy of dimensionality reduction on the 96 genes selected by Khan; (b) the test error of dimensionality reduction on the 96 genes; (c) the classifier accuracy of dimensionality reduction on the 20 genes; (d) the test error of dimensionality reduction on the 20 genes selected by Nikhil.

The experiments on the lymphoma dataset: (a) the classifier accuracy of dimensionality reduction on the 165 genes selected by T-score; (b) the test error of dimensionality reduction on the 165 genes; (c) the classifier accuracy of dimensionality reduction on the 48 genes; (d) the test error of dimensionality reduction on the 48 genes selected by nearest shrunken centroids.

The comparison of three methods on SRBCT dataset

| Algorithms | Dimensions (96 genes) | Support vectors | Time (sec) | Dimensions (20 genes) | Support vectors | Time (sec) |
|---|---|---|---|---|---|---|
| SVM | 96 | - | - | 20 | 106 | 4127 |
| PCA-SVM | 29 | 87 | 2672 | 11 | 64 | 1933 |
| LLE-SVM | 14 | 63 | 2102 | 9 | 42 | 1743 |
| KLLE-SVM | 7 | 42 | 1934 | 5 | 31 | 1307 |

The comparison of three methods on lymphoma dataset

| Algorithms | Dimensions (165 genes) | Support vectors | Time (sec) | Dimensions (48 genes) | Support vectors | Time (sec) |
|---|---|---|---|---|---|---|
| SVM | 165 | - | - | 48 | 124 | 5343 |
| PCA-SVM | 18 | 104 | 2672 | 22 | 83 | 3105 |
| LLE-SVM | 15 | 74 | 2133 | 9 | 56 | 2247 |
| KLLE-SVM | 7 | 56 | 1934 | 5 | 41 | 1766 |