1. Introduction
Advances in next-generation sequencing (NGS) methods, coupled with cell sorting and culturing, have made it possible to study the precise transcriptomic profiles of individual cells. The first account of single-cell expression profiles captured by NGS, in 2009, was a significant advancement over traditional bulk expression analysis [1]. RNA samples collected from bulk cells provide only the average gene expression value of an ensemble of single cells and hence fail to capture the temporal and spatial variability across multiple similar groups of cells. Single-cell RNA sequencing (scRNAseq) has helped unveil new cell types and subpopulations of cells that were hitherto unknown [2,3]. This technology has provided novel insights into cellular composition and biological processes, including the dynamic processes involved in development and differentiation. Moreover, scRNAseq studies have helped uncover cellular heterogeneity in complex tissues and even in populations of seemingly similar cells. Spatial heterogeneity and mapping are now widely studied issues in the field of scRNAseq data analysis.
Through technological and computational advances, it has been realized that one of the major limitations of scRNAseq is the loss of spatial information during cell isolation; hence, much effort has been made to overcome this limitation, both in experimental technology [4,5,6,7] and in computational data analytics [8,9,10,11,12,13,14]. With advances in the scalability of scRNAseq [15,16,17], along with major improvements in spatial mapping and cellular heterogeneity analysis, researchers are making progress in investigating expression levels across entire organs and organisms to analyze cellular composition and variability. These large-scale studies have constructed cellular maps of, for example, C. elegans [18], D. melanogaster [13,19] and mouse organs [20,21]. In the human context, the Human Cell Atlas [22] is a highly ambitious single-cell mapping project already underway. However, deep challenges in realizing the full potential of scRNAseq remain. For example, the primary method of identifying cells in scRNAseq is to cluster their expression profiles using a similarity metric [23]. Clustering, in turn, depends considerably on how the similarity metric is specified and on the feature sets used to define a profile [24]. A well-known challenge in scRNAseq analysis is that expression values are available for as few as 10–20 per cent of genes in a given cell [25], while most of the rest are missing values or dropouts owing to the small amount of RNA obtained from each cell. Background noise further complicates the analyses. These problems make it difficult to decipher the heterogeneity and identity of cells in these populations and pose major analytical barriers to realizing the full application potential of scRNAseq data.
Many of the above challenges in scRNAseq data analysis can be addressed by improved computational strategies that cluster single-cell expression profiles in the absence of reliable values for all genes in most of the entities to be clustered. The development of such strategies requires rigorous benchmarking on datasets and systems with well-characterized biological contexts. For this purpose, evaluating computational methods for their ability to reproduce topological associations between cells can come in handy, as one can safely assume that biologically similar cell populations are enriched in terms of their topological similarity to each other. Researchers had hoped that transcriptome data collected from similar topological locations could be easily labeled by using standard clustering techniques, but it is now clear that such analyses constitute a highly non-trivial problem. As described above, gene dropouts and missing values make these datasets extremely sparse. The missing values belong to different gene sets in each cell and for each measurement of an expression profile, further complicating the problem of reconstructing them. To alleviate this problem, and other undesirable attributes of the high-dimensional feature space of scRNAseq data, a priori feature-selection methods are applied before clustering and downstream analysis to identify informative genes and thereby improve clustering results.
The problem of identifying marker genes that together carry the information of all genes combined, and that are sufficient for downstream analysis and interpretation, has long remained challenging and of great interest. For example, a study published in 2017 by Subramanian et al. [26] identified approximately 1000 genes that were shown to be sufficient to predict the expression values of the remaining genes. Gene-selection methods, such as highly variable genes, highly expressed genes and deviance, help identify informative gene sets that lead to better clustering of scRNAseq data. Many of the computational strategies for spatial mapping of single cells from scRNAseq data also make use of a reference in situ gene expression atlas of several marker genes to recover the lost positional information [10,11,12,13,14,27].
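As an illustration of the simplest of these gene-selection methods, the following minimal sketch ranks genes by dispersion (variance divided by mean) and keeps the top few. The function name `top_variable_genes`, the dispersion definition and the toy data are assumptions made for this example; dedicated single-cell toolkits use more refined, normalized variants of this idea.

```python
import random
import statistics


def top_variable_genes(expr, n_genes):
    """Return sorted indices of the n_genes genes with the highest dispersion.

    expr: list of cells, each a list of non-negative per-gene values.
    Dispersion is taken here as variance / mean (an illustrative choice).
    """
    n_total = len(expr[0])
    dispersion = []
    for g in range(n_total):
        values = [cell[g] for cell in expr]
        mean = statistics.fmean(values)
        var = statistics.pvariance(values)
        # Genes that are never detected get dispersion 0 instead of a
        # division by zero.
        dispersion.append(var / mean if mean > 0 else 0.0)
    ranked = sorted(range(n_total), key=lambda g: dispersion[g], reverse=True)
    return sorted(ranked[:n_genes])


# Toy data: 200 cells x 50 genes of low-level counts, with genes 7 and 21
# strongly elevated in the first 100 cells (a hypothetical subpopulation).
rng = random.Random(0)
expr = [[rng.randint(0, 4) for _ in range(50)] for _ in range(200)]
for cell in expr[:100]:
    cell[7] += 20
    cell[21] += 15

selected = top_variable_genes(expr, n_genes=5)
print(selected)
```

Such a filter is unsupervised: it uses no information about the downstream task, which makes it fast but blind to any specific objective.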
Gene (feature) selection is a combinatorial optimization problem, and an exhaustive search of the feature space would have to evaluate approximately 2^N different combinations, where N is the number of features. Such a course of action requires either substantial computational power or an algorithm that can traverse the vast solution space intelligently. The genetic algorithm (GA) [28] is a well-established method for solving combinatorial problems efficiently and effectively. It is a metaheuristic based on the mechanics of natural genetics and Darwin’s theory of evolution. GA works on a population of individuals to produce successively better “offspring” by recombining parts of existing solutions (crossover) and by occasionally making slight random changes to them (mutation). One of the significant advantages of GA is its ability to search a large solution space and avoid getting trapped in local minima, along with a good convergence rate.
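To give a sense of scale, N = 84 features already yields 2^84, or roughly 1.9 × 10^25, possible subsets. The sketch below shows how a GA can search such a space for a subset of fixed size. It is a generic illustration, not the implementation used in this work: the operators (binary tournament selection, a size-preserving crossover, swap mutation, elitism), all parameter values and the toy fitness function are assumptions chosen for brevity.

```python
import random


def run_ga(n_features, subset_size, fitness, pop_size=40, generations=60,
           mutation_rate=0.3, seed=0):
    """Search for a high-fitness subset of exactly `subset_size` features.

    `fitness` is any function mapping a frozenset of feature indices to a
    number (higher is better). All parameter values are illustrative.
    """
    rng = random.Random(seed)
    all_features = list(range(n_features))

    def random_subset():
        return frozenset(rng.sample(all_features, subset_size))

    def crossover(a, b):
        # Keep features shared by both parents; fill the remaining slots
        # with a random draw from the features held by only one parent.
        shared = a & b
        pool = sorted((a | b) - shared)
        rng.shuffle(pool)
        return frozenset(shared | set(pool[:subset_size - len(shared)]))

    def mutate(s):
        # Occasionally swap one selected feature for an unselected one.
        if rng.random() < mutation_rate:
            dropped = rng.choice(sorted(s))
            added = rng.choice([f for f in all_features if f not in s])
            s = (s - {dropped}) | {added}
        return s

    def tournament(scored):
        # Binary tournament: the fitter of two random individuals wins.
        (score_a, a), (score_b, b) = rng.sample(scored, 2)
        return a if score_a >= score_b else b

    population = [random_subset() for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scored = [(fitness(s), s) for s in population]
        next_population = [best]  # elitism: never lose the best subset
        while len(next_population) < pop_size:
            child = crossover(tournament(scored), tournament(scored))
            next_population.append(mutate(child))
        population = next_population
        best = max(population, key=fitness)
    return sorted(best), fitness(best)


# Toy problem: 20 hidden "informative" features among 84; the fitness
# counts how many of them a candidate subset contains.
informative = frozenset(range(0, 80, 4))
best_subset, best_score = run_ga(84, 20, lambda s: len(s & informative))
print(best_subset, best_score)
```

In a real application the toy fitness would be replaced by a task-specific score, for example the quality of clustering or of position prediction obtained with the candidate gene subset.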
GA has applications in many fields of science, for example, in solving NP-hard problems, in machine learning and in evolving simple programs. Other applications include neural network design (for example, recurrent neural networks), classifier systems and classification algorithms; the Traveling Salesman Problem (TSP) and sequence scheduling; robotics; design; and economics, for example, the cobweb model and equilibrium resolution. GA is also widely applied in medicine, in the areas of proteomics, radiology, infectious diseases, cardiology, healthcare management, haplotype assembly [29], magnetic resonance images [30] and biochemical-parameter estimation and optimization [31]. In general, GA belongs to a battery of evolutionary techniques used in feature selection and parameter optimization. In essence, GA creates combinations of constituent features (genes) in a population and applies a selection strategy, thereby doing away with superfluous and ineffective combinations. This significantly reduces the solution space, and the model “evolves” or “learns” to select the most desired combination as defined by a fitness function. In this way, GA can be used to solve feature-selection as well as optimization problems. The use of GA for feature selection in gene-expression analysis has been reported several times in the past. For example, Li L. et al. [32], in 2001, used a coupled Genetic Algorithm/K-nearest neighbor (GA/KNN) strategy to identify genes that can distinguish between several classes of samples in gene-expression data. In 2003, Ooi C.H. et al. [33] used GA in an attempt to determine gene sets, and their optimal sizes, that maximized classification success. Dolled-Filhart M. et al. [34], in 2006, identified a set of tissue biomarkers for breast cancer from mRNA profiles using GA. In 2013, Lin T.C. et al. [35] sought to distinguish six subtypes of pediatric acute lymphoblastic leukemia from microarray data by using GA for feature selection, followed by silhouette statistics to classify them. Lastly, Latkowski T. et al. [36], in 2014, applied GA to select genes that could help recognize autism with high accuracy.
Recognizing the scope of the genetic algorithm for improving clustering and classification, along with spatial mapping, from single-cell transcriptomics data, we participated in the Dialogue for Reverse Engineering Assessments and Methods (DREAM) Single-Cell Transcriptomics Challenge [37] to assess whether GA would be a useful technique. DREAM [38] has run open scientific contests in areas of biology and medicine since its beginning in 2006. The DREAM Single-Cell Transcriptomics Challenge (SCTC) is one such challenge, in which participating teams were asked to predict the positions of cells in the Drosophila melanogaster embryo using single-cell sequencing data of 1297 cells from [13] and a reduced number of marker genes from the BDTNP database [39], i.e., 60 genes (sub-challenge one), 40 genes (sub-challenge two) and 20 genes (sub-challenge three). The challenge thus attempts to use fewer marker genes to infer the spatial locations of cells. These sub-challenges build on a computational mapping strategy called DistMap [13], developed by Karaiskos et al. [13] in 2017 to reconstruct the Drosophila melanogaster embryo. The embryo studied is at developmental stage 6, has bilateral symmetry and consists of approximately 6000 cells that express unique gene combinations. DistMap uses the in situ hybridization patterns of 84 genes from the Berkeley Drosophila Transcription Network Project (BDTNP) database [39] with their corresponding 3000 locations, i.e., one-half of the bilaterally symmetric embryo. The algorithm tries to maximize the correlation between the single-cell RNA sequencing data and the in situ hybridization patterns of the 84 genes in BDTNP. It has been shown that the combinatorial expression of these 84 genes is enough to map every cell to its position. In order to identify the most important genes out of these 84 and to help model topological locations based on a gold-standard mapping, we present in this manuscript the use of a genetic algorithm, followed by gene-ontology analysis of the selected features. It may be noted that, in recognition of the limitations of the benchmarking presented in this challenge, the term “silver standard” was finally employed in the published consortium paper. In this paper we use the terms gold standard and silver standard interchangeably; both refer to the DREAM challenge benchmarks, on which the different methods competed.
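The correlation-based mapping described above can be sketched as follows. This is a simplified illustration and not the DistMap code: it assumes that both the single-cell data and the reference atlas have already been binarized to 0/1, that the correlation is the Matthews correlation coefficient (MCC), and that each cell is assigned to its single best-scoring bin. The function names and the toy data are hypothetical.

```python
import math


def mcc(a, b):
    """Matthews correlation coefficient between two equal-length 0/1 vectors."""
    tp = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    tn = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    fp = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    fn = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when the coefficient is undefined
    # (one of the vectors is constant).
    return (tp * tn - fp * fn) / denom if denom else 0.0


def map_cells_to_bins(cells, bins, genes=None):
    """Assign each cell to the index of the bin with the highest MCC.

    cells, bins: lists of binarized expression vectors over the same genes.
    genes: optional list of gene indices, to score on a reduced gene set.
    """
    genes = list(range(len(bins[0]))) if genes is None else list(genes)
    assignments = []
    for cell in cells:
        cell_sub = [cell[g] for g in genes]
        scores = [mcc(cell_sub, [b[g] for g in genes]) for b in bins]
        assignments.append(max(range(len(bins)), key=scores.__getitem__))
    return assignments


# Toy reference atlas: 4 bins x 6 genes (binarized in situ patterns).
bins = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0],
]
# Toy cells: the second one carries one spurious "on" gene.
cells = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 1, 0],
]

print(map_cells_to_bins(cells, bins))                      # all six genes
print(map_cells_to_bins(cells, bins, genes=[0, 1, 2, 3]))  # reduced gene set
```

The optional `genes` argument mirrors the question posed by the sub-challenges: how well the cell-to-bin assignment obtained with all genes is preserved when only a subset is scored.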
4. Discussion
Gene sets of reduced size were able to recover the original cell–bin relationship efficiently, as indicated by strong MCC–MCC and topology-assignment scores. The reduced gene set is able to capture the overall intrinsic relationships between these functionally related genes. Applying genetic algorithms to identify the reduced gene sets helps us explore the whole, highly complex sample space of possible solutions in an efficient manner. Gene expression interactions are known to be complex and deeply interconnected. Such intrinsic correlation in gene expression can be exploited to mine unique patterns that can help solve similar challenges that still exist. The method has a few limitations. The base algorithm of DistMap is used to generate the optimal cell–bin relationship with the complete gene set and with gene subsets, but it has its weaknesses. One of them is that it uses binarized data, which causes a loss of information contained in both the single-cell gene expression data and BDTNP. The classical genetic algorithm can be used to select the optimum number of genes; in this case, however, we hard-coded the subset size, so the user defines the number of features they wish to have. The selection made by the genetic algorithm can be improved further by scanning a wider sample space, using a broad range of parameters, and implementing cross-validation techniques.
One question that may arise about the current GA-based approach is that it is a supervised technique, as were the approaches of a few other top-rated teams in the DREAM challenge. For example, Random Forest and Particle Swarm Optimization were used effectively by teams that placed in the top 10 in some of the sub-challenges. On the other hand, purely unsupervised feature selection based on expression variability (and also on principal components analysis) has been shown to be particularly successful in this DREAM challenge [45]. In Supplementary Table S3, we show that, as expected, the time taken by unsupervised feature selection is much less than that taken by GA or particle swarm optimization. However, while unsupervised methods can produce the most informative features out of a set in a general context, they cannot be tailored to a specific objective. Supervised methods incorporate an objective function (the fitness function in GA). We have clearly observed that optimizing one scoring function leaves others in a sub-optimal state. For example, optimizing models for location prediction may not produce the desired topologies. Similarly, if one would like to know the best feature set that describes a given phenotype, such as one disease or another, unsupervised techniques will not be able to produce distinct sets without additional steps. Therefore, the development of both supervised and unsupervised techniques for feature selection and goal-seeking models is needed, of which GA was discussed in this work as one example.