Reanalysis of Non-Small-Cell Lung Cancer Microarray Gene Expression Data

Abstract: Cancer is one of the leading causes of death in many countries, in part because sufficiently effective treatments are lacking. One of the most common types is non-small-cell lung cancer (NSCLC). The increasingly large and diverse public datasets on NSCLC constitute a rich source of data on which analyses can be performed to find candidate oncogenic drivers or therapeutic targets. The aim of this study is to reanalyze an existing NSCLC NCBI GEO dataset (accession GSE19804) to see whether novel genes involved in the disease can be found. To this end, we preprocessed the microarray data and compared the 10 most important genes reported by random forest, support vector machine and C5.0 decision tree models. The study was carried out with R-Studio 4.0.2 and Bioconductor 3.11. In conclusion, EFNA4 and the genes KANK3, GRK5, CLIC5, SH3GL3, ACACB, LIN7A, JCAD and NEDD1 are thought to be potential genes that may play a role in NSCLC, and we recommend that researchers working in wet laboratories focus on these genes.


Introduction
Lung cancer is the leading cause of cancer death in many countries around the world [1]. Non-small-cell lung cancer (NSCLC) is the most widespread form, accounting for approximately 85% of lung cancers, with a five-year survival rate of approximately 5% [2]. Many studies have been conducted and several methods have been developed to fight this disease, but the main obstacles they face are the development of drug resistance and the late detection of the disease [3]. Identifying the genes involved in NSCLC and their roles may therefore help to overcome it.
DNA microarray analysis is a technology that measures the expression levels of a large number of genes simultaneously on chips. With it, the gene expression profile of a tumor can be defined [4]. Gene expression analysis is used to classify cancers, predict clinical outcomes and discover disease-associated biomarkers [5]. Microarray technology has been used to study several types of cancer, including esophageal [6,7], prostate [8], breast [9] and gastric cancer [10]. However, a major limitation of gene expression experiments is that their analysis is usually done in isolation, is carried out with a very small number of samples, and is not easy to conduct.
In this article, we reanalyze an existing GEO non-small-cell lung cancer dataset retrieved from NCBI (accession GSE19804) [11,12]. For this, we used the R programming language through R-Studio version 4.0.0 and Bioconductor version 3.11. First, the dataset was downloaded with the GEOquery package, and differential gene expression analysis was carried out with the limma package. The resulting differentially expressed genes were filtered with the genefilter package. To identify important candidate genes, random forest [13], support vector machine (SVM) [14] and C5.0 decision tree [15] models were used.

Datasets
Many studies have published lists of differentially expressed genes, but these lists often cannot be verified because of issues such as overfitting to a small discovery dataset and the lack of a sufficient validation set. Over the years, public databases have continued to collect data. The work in this study is based on one of these public databases, the Gene Expression Omnibus (GEO) dataset repository at https://www.ncbi.nlm.nih.gov/geo/ (accessed 11 November 2020). The DNA microarray dataset was downloaded from GEO under the accession number GSE19804 using the getGEO function of the GEOquery package [16].
The dataset comes from an analysis of paired tumor and adjacent normal lung tissue specimens obtained from nonsmoking female non-small-cell lung carcinoma patients in Taiwan [11,12]. The gene expression profile comprises 120 samples, 60 NSCLC samples and 60 control samples, which makes it a balanced dataset. The microarray platform was GPL570 (HG-U133_Plus_2), the Affymetrix Human Genome U133 Plus 2.0 Array, and the enrolled patients were between 37 and 80 years old.

Setup and Visualization of the Dataset
The microarray dataset was downloaded and read into the R statistical environment with the help of Bioconductor, a project that provides tools for the analysis and comprehension of high-throughput genomic data [17]. A boxplot of the dataset is shown in Figure 1, with NSCLC and control samples grouped on two sides; it indicates that the dataset was already normalized and thus ready for further analyses.
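The reasoning behind a boxplot check can be illustrated with a minimal sketch: if the per-sample five-number summaries (and the medians in particular) line up across samples, the data look normalized. The sample names, expression values and the tolerance of 0.5 below are hypothetical and chosen only for illustration; they are not taken from GSE19804, and the study itself was carried out in R.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot draws."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

def medians_aligned(samples, tolerance=0.5):
    """True if all per-sample medians lie within `tolerance` of each other."""
    meds = [median(values) for values in samples.values()]
    return max(meds) - min(meds) <= tolerance

# Hypothetical log-scale expression values for four samples.
samples = {
    "tumor_1": [4.1, 5.0, 6.2, 7.9, 9.3],
    "tumor_2": [4.0, 5.1, 6.1, 8.0, 9.5],
    "normal_1": [4.2, 4.9, 6.0, 7.8, 9.1],
    "normal_2": [3.9, 5.2, 6.3, 8.1, 9.4],
}

summaries = {name: five_number_summary(v) for name, v in samples.items()}
for name, summary in summaries.items():
    print(name, summary)
print("medians aligned:", medians_aligned(samples))
```

A sample whose whole distribution is shifted relative to the others would fail this check, which is the pattern a boxplot makes visible at a glance.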

Gene Expression and Identification of Candidate Genes
genefilter, a Bioconductor package, provides methods for filtering genes from high-throughput experiments [18]. Its nsFilter function reduces the number of features in an ExpressionSet by removing features that exhibit little variation or a consistently low signal across samples, and it also removes duplicate probes that correspond to the same gene [17].
The ExpressionSet for our dataset consists of 54,675 features (probe sets). Applying the nsFilter function with a cut-off of 0.9 reduced the number of features from 54,675 to 2018. To identify the important genes in the dataset, feature selection was then applied to the reduced dataset using random forest, support vector machine and C5.0 decision tree algorithms available through the caret package [19,20]. For each algorithm, the 10 most important candidate genes were recorded, and the results of the three algorithms were compared.
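The filtering idea can be sketched as follows. This is a simplified analogue written in Python for illustration, not the nsFilter implementation: it assumes that unannotated probes are dropped, that one probe per gene is kept (the one with the largest interquartile range, IQR), and that the top (1 - cut-off) fraction of genes by IQR is retained. The function name variance_filter, the probe names and the gene names are hypothetical.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range of one probe's expression values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

def variance_filter(expr, probe_to_gene, cutoff=0.9):
    """Return probe IDs kept after a simplified non-specific filter.

    expr: dict mapping probe ID -> list of expression values
    probe_to_gene: dict mapping probe ID -> gene symbol (missing = unannotated)
    cutoff: fraction of genes with the lowest IQR to discard
    """
    # Step 1: drop unannotated probes; keep the highest-IQR probe per gene.
    best = {}
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        spread = iqr(values)
        if gene not in best or spread > best[gene][1]:
            best[gene] = (probe, spread)
    # Step 2: keep the most variable (1 - cutoff) fraction of genes.
    ranked = sorted(best.values(), key=lambda item: item[1], reverse=True)
    n_keep = max(1, round(len(ranked) * (1 - cutoff)))
    return [probe for probe, _ in ranked[:n_keep]]

# Toy data: probe_i has IQR equal to i; probe_12 is unannotated,
# and probe_10 and probe_11 both map to GENE_10.
expr = {f"probe_{i}": [5.0 - i, 5.0, 5.0 + i] for i in range(1, 13)}
probe_to_gene = {f"probe_{i}": f"GENE_{i}" for i in range(1, 11)}
probe_to_gene["probe_11"] = "GENE_10"

kept = variance_filter(expr, probe_to_gene, cutoff=0.9)
print(kept)
```

In this toy example, 12 probes collapse to 10 unique genes, and a cut-off of 0.9 keeps only the single most variable one. The same two-step logic is consistent with a reduction from 54,675 probe sets to about 2000 genes, although the exact count depends on the annotation used.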

Results
The normalized NSCLC dataset was downloaded from the GEO database. With the method described in the previous section, 54,675 differentially expressed genes were identified. After filtering these genes with genefilter's nsFilter function at a cut-off of 0.9, the dataset was reduced to 2018 differentially expressed genes. Feature selection on the reduced dataset with random forest, support vector machine and C5.0 decision tree algorithms yielded a ranking of important features. The features were sorted from the most to the least important and, for each model, the 10 most important genes were recorded, as presented in Tables 1-3.

Tables 1-3 (partial rows; columns: rank | probe ID | gene symbol | gene name; "-" marks a rank that is not recoverable):
- | 210081_at | AGER | advanced glycosylation end-product specific receptor
10 | 209904_at | TNNC1 | troponin C1, slow skeletal and cardiac type
- | 210081_at | AGER | advanced glycosylation end-product specific receptor
- | 204533_at | CXCL10 | C-X-C motif chemokine ligand 10
7 | 204469_at | PTPRZ1 | protein tyrosine phosphatase receptor type Z1
8 | 1552417_a_at | NEDD1 | NEDD1 gamma-tubulin ring complex targeting factor
9 | 1569003_at | VMP1 | vacuole membrane protein 1
10 | 204475_at | MMP1 | matrix metallopeptidase 1

As seen in Tables 1-3, the genes listed in order of importance may differ according to the algorithm used. However, algorithms applied to the same dataset are expected to produce similar lists, so the presence of the same gene in more than one table suggests that it may be a good candidate. As shown in Figure 2, COL10A1 is present in all three models; EFNA4, FUT2, AGER, RTKN2 and SPTBN1 are present in the SVM and random forest models; and SPOCK2 is common to the SVM and C5.0 models.
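The comparison across models amounts to counting how many top-10 lists each gene appears in. The sketch below uses only the shared genes named in the text; the sets are therefore partial, not the complete top-10 lists, and the helper name shared_genes is hypothetical.

```python
from collections import Counter

# Partial gene sets: only the genes reported as shared between models.
top_genes = {
    "random_forest": {"COL10A1", "EFNA4", "FUT2", "AGER", "RTKN2", "SPTBN1"},
    "svm": {"COL10A1", "EFNA4", "FUT2", "AGER", "RTKN2", "SPTBN1", "SPOCK2"},
    "c5.0": {"COL10A1", "SPOCK2"},
}

def shared_genes(lists_by_model, min_models=2):
    """Map each gene found in at least `min_models` lists to its models."""
    counts = Counter(
        gene for genes in lists_by_model.values() for gene in genes
    )
    return {
        gene: sorted(m for m, genes in lists_by_model.items() if gene in genes)
        for gene, n in counts.items()
        if n >= min_models
    }

shared = shared_genes(top_genes)
for gene in sorted(shared, key=lambda g: (-len(shared[g]), g)):
    print(gene, shared[gene])
```

Raising min_models to 3 leaves only COL10A1, which matches the centre of the three-way overlap described for Figure 2.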

Discussion
In this study, a GEO dataset was downloaded and analyzed, and 54,675 differentially expressed genes were identified. A filter applied to the dataset reduced the number of features to 2018. Feature selection was performed on the reduced dataset with random forest, SVM and decision tree algorithms, and the 10 most important genes from each were recorded and compared.
The aim of the present study was to reanalyze an existing dataset to see whether novel genes could be found. Microarray data analysis, filtering and feature selection revealed that the differentially expressed genes COL10A1, SPOCK2, SPTBN1, RTKN2, FUT2 and AGER may be involved in NSCLC, a result that other studies have also reported [21-27]. COL10A1 [22] was common to all three algorithms, suggesting that it may play an important role in NSCLC. The EFNA4 gene was found in two models but, to our knowledge, has not been reported in the NSCLC literature. Moreover, SPOCK2 [23], present in Tables 1 and 2, and SPTBN1 [21], RTKN2 [24] and AGER [25,26], present in Tables 2 and 3, were each identified by at least two algorithms. Other genes, such as GOLM1 [28] in Table 2, MMP11 [29] in Table 1 and MMP1 [30] in Table 3, also appeared among the top-ranked genes and are recognized to be involved in NSCLC.

Conclusions
In conclusion, this study identified 54,675 differentially expressed genes; 2018 of them, selected by a filter method with a cut-off of 0.9, were evaluated by feature selection. Comparison of the feature selection methods showed that COL10A1, SPOCK2, SPTBN1, RTKN2 and AGER, which are known to play a role in NSCLC, were also detected in our study. GOLM1, MMP1, MMP11, CXCL10, PTPRZ1, TNNC1, FUT2 and VMP1 are already well-known genes. KANK3, GRK5, CLIC5, SH3GL3, ACACB, LIN7A, JCAD and NEDD1 can be suggested as candidate genes even though each was found in only one model. EFNA4 is thought to be a stronger candidate, as it was detected in both the SVM and random forest models.
In the future, the approach used in this study could be applied to detect possible candidate genes by reanalyzing existing datasets with different algorithms.

Data Availability Statement:
Publicly available datasets were analyzed in this study. The data are available in a publicly accessible repository at https://www.ncbi.nlm.nih.gov/sites/GDSbrowser (reference number: GSE19804).

Conflicts of Interest:
The authors declare no conflict of interest.