# Multi-Objective Genetic Algorithm for Cluster Analysis of Single-Cell Transcriptomes

## Abstract

## 1. Introduction

## 2. Materials and Methods

**Multi-objective clustering:** Given d-dimensional transcriptomes, we aim to find k cluster prototypes (cluster centers) by optimizing two objective functions (Equation (1)) as follows.
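As a rough illustration of the two-objective setup, the sketch below scores a set of prototypes with one inter-cluster and one intra-cluster objective. The exact form of Equation (1) is not reproduced in this text, so `f1_inter` and `f2_intra` are assumed stand-ins (sum of pairwise prototype distances, and sum of cell-to-nearest-prototype distances), and all function names are hypothetical.

```python
import math


def nearest_prototype(cell, prototypes):
    """Index of the prototype closest to a cell (Euclidean distance)."""
    return min(range(len(prototypes)), key=lambda j: math.dist(cell, prototypes[j]))


def f1_inter(prototypes):
    """Assumed inter-cluster objective: sum of pairwise prototype distances (to maximize)."""
    k = len(prototypes)
    return sum(
        math.dist(prototypes[a], prototypes[b])
        for a in range(k)
        for b in range(a + 1, k)
    )


def f2_intra(cells, prototypes):
    """Assumed intra-cluster objective: sum of cell-to-nearest-prototype distances (to minimize)."""
    return sum(
        math.dist(cell, prototypes[nearest_prototype(cell, prototypes)])
        for cell in cells
    )


# Toy two-dimensional data with two obvious groups.
cells = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
prototypes = [(0.0, 0.5), (4.0, 0.5)]
inter = f1_inter(prototypes)
intra = f2_intra(cells, prototypes)
```

The two values pull in different directions: moving prototypes apart raises `inter` but can also raise `intra`, which is why a set of trade-off solutions, rather than a single optimum, is expected.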

**Chromosome encoding:** In MOGA, clustering solutions are encoded by real-valued chromosomes of size $l=k\times d$, where k is the number of clusters and d is the number of dimensions of the dataset. Here, d denotes the number of Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) components derived during preprocessing of the datasets, and it is set to 2. Thus, the first d values of a chromosome represent the prototype of the first cluster, the next d values represent the prototype of the second cluster, and so on. At the beginning of the evolution, a population of chromosomes is created. Each chromosome is initialized with random values, bounded by the range of values of each dimension of the dataset.
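The encoding can be sketched in a few lines. The example below is a minimal illustration using only the Python standard library, not the DEAP-based implementation; the helper names `init_chromosome` and `decode` and the toy UMAP coordinates are assumptions made for the example.

```python
import random


def init_chromosome(data, k, rng):
    """Create a flat chromosome of length k * d, each gene drawn within its dimension's range."""
    d = len(data[0])
    lows = [min(row[j] for row in data) for j in range(d)]
    highs = [max(row[j] for row in data) for j in range(d)]
    return [rng.uniform(lows[j], highs[j]) for _ in range(k) for j in range(d)]


def decode(chromosome, d):
    """Split a flat chromosome into k prototypes of d values each."""
    return [chromosome[i:i + d] for i in range(0, len(chromosome), d)]


rng = random.Random(0)
umap_points = [(-1.0, 2.0), (3.0, 5.0), (0.5, 4.0)]  # toy 2-D UMAP coordinates
chrom = init_chromosome(umap_points, k=3, rng=rng)
prototypes = decode(chrom, d=2)
```

With k = 3 and d = 2 the chromosome has six genes, and `decode` recovers three two-dimensional prototypes in the order in which they are stored.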

**Evolution:** After the initialization of the population, the fitness values of the individual chromosomes, namely ${f}_{1}\left(x\right)$ and ${f}_{2}\left(x\right)$, are computed, and evolution begins. Parent chromosomes are selected using tournament selection [37], with the tournament size equal to 20% of the population. Specifically, 20% of the chromosomes are randomly chosen from the population pool, and the ones with the best fitness values are selected for breeding, so chromosomes with better fitness values have a greater probability of being selected. The tournament selection is repeated to create a population of size P, the same size as the initial population.
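Tournament selection with a tournament size of 20% of the population can be sketched as follows. This is a simplified illustration: it assumes a single scalar fitness per chromosome where lower is better, whereas the text does not spell out how the two objectives are compared inside a tournament. The function names are hypothetical.

```python
import random


def tournament_select(population, fitness, rng, frac=0.2):
    """Pick one parent: sample a fraction of the population and return the fittest contestant."""
    size = max(2, round(frac * len(population)))
    contestants = rng.sample(range(len(population)), size)
    winner = min(contestants, key=lambda i: fitness[i])  # lower fitness wins (assumption)
    return population[winner]


def select_parents(population, fitness, rng, frac=0.2):
    """Repeat the tournament until the mating pool has the same size P as the population."""
    return [tournament_select(population, fitness, rng, frac) for _ in population]


rng = random.Random(42)
population = [[float(i)] for i in range(10)]  # toy one-gene chromosomes
fitness = [float(i) for i in range(10)]       # toy scalar fitness, lower is better
parents = select_parents(population, fitness, rng)
```

Because each tournament samples distinct contestants, the worst chromosome can never win, while better chromosomes are selected more often, which is the selection pressure described above.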

**Non-dominance solution sorting (NDSS):** After the genetic operations of crossover and mutation, the parent population is merged with the offspring population, and NDSS is performed to select chromosomes for the next iteration of the algorithm. NDSS is based on the notion of dominance between solutions. Specifically, one solution dominates another if and only if all of its objective values are no worse than those of the other solution, and at least one of its objective values is strictly better [38,39]. NDSS may select solutions that are very close to each other. To diversify the population, crowding thresholds are applied to select solutions that are evenly distributed across the entire Pareto front [38,39].
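The dominance relation translates directly into code. The sketch below assumes that both objectives are expressed as quantities to minimize (an objective that is maximized can be negated first) and extracts only the first non-dominated front; the crowding step is omitted, and the function names are illustrative.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def first_front(objectives):
    """Indices of solutions that are not dominated by any other solution."""
    return [
        i
        for i, a in enumerate(objectives)
        if not any(dominates(b, a) for j, b in enumerate(objectives) if j != i)
    ]


# Toy objective pairs, both to be minimized.
objs = [(-4.0, 2.0), (-3.0, 1.0), (-3.0, 3.0), (-5.0, 4.0)]
front = first_front(objs)
```

Here the third solution is dominated by the first one (worse on both objectives), so only the remaining three form the front.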

**Selection of the final solution:** In the last step of MOGA, the final clustering solution is selected from the Pareto front. For each Pareto-optimal solution, the Davies Bouldin Index (DBI) is computed [40]. DBI measures clustering quality as the mean, over all clusters, of the ratio between the intra-cluster and inter-cluster distances (Equation (2)). DBI values are non-negative and have no fixed upper bound; lower values imply a better clustering quality.
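A minimal sketch of DBI-based selection is given below. It uses the standard definition of the index (for each cluster, the largest ratio of summed within-cluster scatter to centroid distance, averaged over clusters) and assumes that the Pareto-optimal solution with the lowest DBI is the one retained. The candidate labelings are hypothetical, and the code assumes at least two clusters with distinct centroids.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(values) / len(points) for values in zip(*points))


def davies_bouldin(cells, labels):
    """Davies Bouldin Index: mean over clusters of the worst (s_i + s_j) / d_ij ratio."""
    clusters = {}
    for cell, label in zip(cells, labels):
        clusters.setdefault(label, []).append(cell)
    cents = {lab: centroid(pts) for lab, pts in clusters.items()}
    scatter = {
        lab: sum(math.dist(p, cents[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }
    labs = list(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in labs
            if j != i
        )
        for i in labs
    ]
    return sum(worst) / len(labs)


cells = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
pareto_labelings = [[0, 0, 1, 1], [0, 1, 0, 1]]  # two hypothetical Pareto-optimal solutions
scores = [davies_bouldin(cells, labels) for labels in pareto_labelings]
best = min(range(len(scores)), key=scores.__getitem__)
```

The first labeling groups nearby cells and scores 0.25, while the second splits them across clusters and scores 4.0, so the first is selected.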

**Hyperparameter tuning:** Population size, number of generations, crowding values, individual mutation rate, mutation probability, and crossover probability are determined by a grid search. Notably, to ensure a fair comparison with baseline methods, including a single-objective GA (SOGA), we tune the parameters for the second objective only, and do not tune the parameters for the first objective (Equation (1)).
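An exhaustive grid search over such hyperparameters can be sketched as follows. The grid values and the `toy_evaluate` function are placeholders: the values actually searched are not given in this text, and in practice the evaluation would run the GA and return the best value of the second objective.

```python
import itertools


def grid_search(grid, evaluate):
    """Evaluate every combination in the grid and return the one with the lowest score."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid, for illustration only.
grid = {
    "population_size": [50, 100],
    "generations": [100, 200],
    "crossover_prob": [0.6, 0.8],
    "mutation_prob": [0.1, 0.2],
}


def toy_evaluate(params):
    """Stand-in for running the GA and returning its best fitness (smaller is better)."""
    return abs(params["population_size"] - 100) + abs(params["crossover_prob"] - 0.8)


best_params, best_score = grid_search(grid, toy_evaluate)
```

The cost grows multiplicatively with each added hyperparameter (16 runs for this small grid), which is one reason to restrict tuning to a single objective.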

**Evaluation of cluster validity:** Two validations of clustering solutions are performed, namely internal and external validation. In internal validation, ground truths are not known, and the quality of clustering solutions is measured using a widely accepted internal validity metric, the Silhouette Coefficient (Sil) [42]. Sil is the mean silhouette width of all cells (Equation (3)), where $a\left({x}_{i}\right)$ refers to the mean distance between cell ${x}_{i}$ and the other cells in the same cluster, and $b\left({x}_{i}\right)$ refers to the smallest mean distance between ${x}_{i}$ and the cells of any other cluster. Sil ranges from −1 to 1, and a higher Sil implies better clustering, with a clear separation and good cohesiveness of clusters. Notably, singletons could exist in clustering solutions, where a cluster contains only one data instance. Sil handles singletons by setting ${S}_{i}$ equal to 0, which is the value obtained when $b\left({x}_{i}\right)=a\left({x}_{i}\right)$.
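The silhouette computation, including the singleton convention, can be sketched with the standard per-cell width $(b - a)/\max(a, b)$ from Rousseeuw [42]. The function below is an illustrative standard-library version, not the implementation used in the study, and it assumes at least two clusters and no cluster made up entirely of identical points.

```python
import math


def silhouette(cells, labels):
    """Mean silhouette width over all cells; cells in singleton clusters get a width of 0."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:  # singleton cluster
            widths.append(0.0)
            continue
        # a: mean distance to the other cells in the same cluster
        a = sum(math.dist(cells[i], cells[j]) for j in own) / len(own)
        # b: smallest mean distance to the cells of any other cluster
        b = min(
            sum(math.dist(cells[i], cells[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


cells = [(0.0,), (1.0,), (10.0,), (11.0,)]
sil = silhouette(cells, [0, 0, 1, 1])
sil_singletons = silhouette([(0.0,), (5.0,)], [0, 1])
```

For the two well-separated pairs the score is close to 0.9, while the solution made only of singletons scores exactly 0.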

**Metamorphic evaluation of clustering stability:** The stability of clustering results is validated using metamorphic perturbations of transcriptomes. The purpose of this validation is to ensure that small perturbations of the input data do not change cluster memberships when the perturbed transcriptomes are reclustered [47,48]. Using the scrnabench package (version 1.0), six metamorphic perturbations of the experimental transcriptomes are generated. They include permutation of the order of cells (MR1), modification of counts of a single gene (MR2), duplication of a transcriptome of a single cell (MR3), permutation of the order of genes (MR4), addition of a pseudo-gene with zero-variance expression counts (MR5), and negation of gene counts (MR6). Metamorphic datasets are reclustered, and cluster validity metrics are computed and compared with the metrics of the original clusterings.
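Four of the six perturbations are simple enough to sketch on a toy count matrix. The functions below are illustrative only and do not reproduce the scrnabench API; MR2 and MR6 are left out because the text does not specify how counts are modified or negated. Rows are assumed to be cells and columns genes.

```python
import random


def permute_cells(counts, rng):
    """MR1: shuffle the order of cells (rows)."""
    rows = list(counts)
    rng.shuffle(rows)
    return rows


def duplicate_cell(counts, index):
    """MR3: append a copy of one cell's transcriptome."""
    return counts + [list(counts[index])]


def permute_genes(counts, rng):
    """MR4: apply the same random gene (column) order to every cell."""
    order = list(range(len(counts[0])))
    rng.shuffle(order)
    return [[row[g] for g in order] for row in counts]


def add_constant_gene(counts, value=0):
    """MR5: add a pseudo-gene with the same count in every cell (zero variance)."""
    return [row + [value] for row in counts]


counts = [[1, 0, 3], [0, 2, 5], [4, 4, 0]]  # toy matrix: 3 cells x 3 genes
rng = random.Random(1)
mr1 = permute_cells(counts, rng)
mr3 = duplicate_cell(counts, 0)
mr4 = permute_genes(counts, rng)
mr5 = add_constant_gene(counts)
```

Each function returns a new matrix and leaves the original untouched, so the original and perturbed datasets can be clustered side by side and their validity metrics compared.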

**State-of-the-art and baseline methods:** MOGA is compared with three state-of-the-art and two baseline methods. To demonstrate the value of the multi-objective formulation, MOGA is compared with SOGA, which optimizes only the second objective function (Equation (1)). Therefore, NDSS and DBI-based selection are not performed in SOGA, and its evolution consists of initialization, fitness evaluation, tournament selection, crossover, and mutation. The final solution of SOGA is encoded by the chromosome with the best fitness value in the last iteration of the algorithm. Notably, all hyperparameters of MOGA are set to the same values as the hyperparameters of SOGA. Both MOGA and SOGA are implemented using the DEAP package (version 1.3.1) [49].

**Datasets:** Two types of single-cell transcriptomic datasets are used, namely experimental and synthetic. These datasets vary in size, sparsity, dimensionality, quality, and the availability of known cluster memberships.

**Data preprocessing:** The same standard preprocessing workflow is used to prepare the experimental and synthetic datasets [18]. Specifically, preprocessing comprises five steps: filtering, highly variable gene selection, transformation, scaling, and dimensionality reduction. In filtering, cells with fewer than 200 expressed genes and genes expressed in fewer than three cells are removed. Additionally, cells with mitochondrial content greater than 10% are filtered out. Moreover, outlier cells and genes are removed. The mRNA counts and gene counts are bounded by $10^{\mathrm{mean}\left(\log_{10}\left(x\right)\right)\pm 2\times \mathrm{std}\left(\log_{10}\left(x\right)\right)}$ to ensure that each cell has meaningful expression data, where x is the total mRNA count or the total gene count per cell.
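The outlier bounds can be illustrated with a short sketch. It assumes strictly positive per-cell totals and uses the population standard deviation; the text does not say whether the population or the sample estimate is used, and the function names are hypothetical.

```python
import math
import statistics


def log10_bounds(totals, n_std=2.0):
    """Return (lower, upper) = 10 ** (mean(log10 x) -/+ n_std * std(log10 x))."""
    logs = [math.log10(x) for x in totals]
    mean = statistics.mean(logs)
    std = statistics.pstdev(logs)  # population standard deviation (assumption)
    return 10 ** (mean - n_std * std), 10 ** (mean + n_std * std)


def keep_within_bounds(totals):
    """Indices of cells whose total count falls inside the bounds."""
    low, high = log10_bounds(totals)
    return [i for i, x in enumerate(totals) if low <= x <= high]


# Toy per-cell totals; the last cell has an unusually low total.
totals = [1000, 1200, 900, 1100, 1050, 950, 1000, 10]
low, high = log10_bounds(totals)
kept = keep_within_bounds(totals)
```

Working on the log scale makes the bounds multiplicative, so a cell is flagged when its total is many times smaller or larger than the typical cell rather than a fixed number of counts away.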

**Estimation of compute time:** Run-time data are collected during the experimentation, including dataset size, number of clusters, and the combination of high-performance computing (HPC) resources, such as the number of CPUs, the number of tasks per node, and the number of CPUs per task. These data are used to train a random forest regressor [56], an ensemble tree-based algorithm for supervised learning. The accuracy of run-time estimates is validated by training a predictor on the synthetic datasets and testing it on the reference datasets. The mean square error (MSE) between the estimated and actual values is used to evaluate the accuracy of the predictions.
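The evaluation metric itself is simple to state in code. The sketch below assumes that MSE denotes the mean of squared differences between actual and predicted run times, in line with the abbreviation list; the run times shown are made-up numbers, and the regressor is not reproduced here.

```python
def mean_squared_error(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical run times in seconds, for illustration only.
actual_seconds = [120.0, 300.0, 45.0]
predicted_seconds = [110.0, 320.0, 50.0]
mse = mean_squared_error(actual_seconds, predicted_seconds)
```

Because errors are squared, a single badly mispredicted run dominates the score, and the result is expressed in squared seconds rather than seconds.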

## 3. Results

## 4. Discussion

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
ARI | Adjusted Rand Index |
DBI | Davies Bouldin Index |
GA | Genetic Algorithm |
ML | Machine Learning |
MOGA | Multi-Objective Genetic Algorithm |
MOO | Multi-Objective |
MSE | Mean Square Error |
NDSS | Non-dominance Solution Sorting |
NMI | Normalized Mutual Information |
Sil | Silhouette Coefficient |
SOGA | Single Objective Genetic Algorithm |

## References

1. Rood, J.E.; Maartens, A.; Hupalowska, A.; Teichmann, S.A.; Regev, A. Impact of the Human Cell Atlas on medicine. Nat. Med. **2022**, 28, 2486–2496.
2. Yau, C. pcaReduce: Hierarchical clustering of single cell transcriptional profiles. BMC Bioinform. **2016**, 17, 140.
3. Yang, L.; Liu, J.; Lu, Q.; Riggs, A.D.; Wu, X. SAIC: An iterative clustering approach for analysis of single cell RNA-seq data. BMC Genom. **2017**, 18, 9–17.
4. Kiselev, V.Y.; Kirschner, K.; Schaub, M.T.; Andrews, T.; Yiu, A.; Chandra, T.; Natarajan, K.N.; Reik, W.; Barahona, M.; Green, A.R.; et al. SC3: Consensus clustering of single-cell RNA-seq data. Nat. Methods **2017**, 14, 483–486.
5. Marco, E.; Karp, R.L.; Guo, G.; Robson, P.; Hart, A.H.; Trippa, L.; Yuan, G.C. Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. Proc. Natl. Acad. Sci. USA **2014**, 111, E5643–E5650.
6. Zhang, H.; Lee, C.A.; Li, Z.; Garbe, J.R.; Eide, C.R.; Petegrosso, R.; Kuang, R.; Tolar, J. A multitask clustering approach for single-cell RNA-seq analysis in recessive dystrophic epidermolysis bullosa. PLoS Comput. Biol. **2018**, 14, e1006053.
7. Grün, D.; Muraro, M.J.; Boisset, J.C.; Wiebrands, K.; Lyubimova, A.; Dharmadhikari, G.; van den Born, M.; Van Es, J.; Jansen, E.; Clevers, H.; et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell **2016**, 19, 266–277.
8. Zeisel, A.; Muñoz-Manchado, A.B.; Codeluppi, S.; Lönnerberg, P.; La Manno, G.; Juréus, A.; Marques, S.; Munguba, H.; He, L.; Betsholtz, C.; et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science **2015**, 347, 1138–1142.
9. duVerle, D.A.; Yotsukura, S.; Nomura, S.; Aburatani, H.; Tsuda, K. CellTree: An R/bioconductor package to infer the hierarchical structure of cell populations from single-cell RNA-seq data. BMC Bioinform. **2016**, 17, 363.
10. Lin, P.; Troup, M.; Ho, J.W. CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol. **2017**, 18, 1–11.
11. Zhang, J.M.; Fan, J.; Fan, H.C.; Rosenfeld, D.; Tse, D.N. An interpretable framework for clustering single-cell RNA-Seq datasets. BMC Bioinform. **2018**, 19, 1–12.
12. Olsson, A.; Venkatasubramanian, M.; Chaudhri, V.K.; Aronow, B.J.; Salomonis, N.; Singh, H.; Grimes, H.L. Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature **2016**, 537, 698–702.
13. Li, H.; Courtois, E.T.; Sengupta, D.; Tan, Y.; Chen, K.H.; Goh, J.J.L.; Kong, S.L.; Chua, C.; Hon, L.K.; Tan, W.S.; et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat. Genet. **2017**, 49, 708–718.
14. Ntranos, V.; Kamath, G.M.; Zhang, J.M.; Pachter, L.; Tse, D.N. Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts. Genome Biol. **2016**, 17, 112.
15. Wang, B.; Zhu, J.; Pierson, E.; Ramazzotti, D.; Batzoglou, S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat. Methods **2017**, 14, 414–416.
16. Xu, C.; Su, Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics **2015**, 31, 1974–1980.
17. Traag, V.A.; Waltman, L.; Van Eck, N.J. From Louvain to Leiden: Guaranteeing well-connected communities. Sci. Rep. **2019**, 9, 1–12.
18. Hao, Y.; Hao, S.; Andersen-Nissen, E.; Mauck, W.M., III; Zheng, S.; Butler, A.; Lee, M.J.; Wilk, A.J.; Darby, C.; Zagar, M.; et al. Integrated analysis of multimodal single-cell data. Cell **2021**, 184, 3573–3587.
19. Jiang, L.; Chen, H.; Pinello, L.; Yuan, G.C. GiniClust: Detecting rare cell types from single-cell gene expression data with Gini index. Genome Biol. **2016**, 17, 1–13.
20. Qiu, X.; Mao, Q.; Tang, Y.; Wang, L.; Chawla, R.; Pliner, H.A.; Trapnell, C. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods **2017**, 14, 979–982.
21. Levine, J.H.; Simonds, E.F.; Bendall, S.C.; Davis, K.L.; Amir, E.a.D.; Tadmor, M.D.; Litvin, O.; Fienberg, H.G.; Jager, A.; Zunder, E.R.; et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis. Cell **2015**, 162, 184–197.
22. Wolf, F.A.; Angerer, P.; Theis, F.J. SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. **2018**, 19, 1–5.
23. Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. **2008**, 2008, P10008.
24. Petegrosso, R.; Li, Z.; Kuang, R. Machine learning and statistical methods for clustering single-cell RNA-sequencing data. Brief. Bioinform. **2020**, 21, 1209–1223.
25. Deb, K. Multi-Objective Optimization Using Evolutionary Algorithms; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2001.
26. Holland, J.H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence; Complex Adaptive Systems, A Bradford Book: Cambridge, MA, USA, 1992.
27. Fogel, D.B.; Fogel, L.J. An introduction to evolutionary programming. In Proceedings of the Artificial Evolution; Alliot, J.M., Lutton, E., Ronald, E., Schoenauer, M., Snyers, D., Eds.; Springer: Berlin/Heidelberg, Germany, 1996; Lecture Notes in Computer Science; pp. 21–33.
28. Khuri, S.; Bäck, T.; Heitkötter, J. An evolutionary approach to combinatorial optimization problems. In Proceedings of the 22nd Annual ACM Computer Science Conference on Scaling Up: Meeting the Challenge of Complexity in Real-World Computing Applications, Phoenix, AZ, USA, 8–10 March 1994; pp. 66–73.
29. Bhandari, D.; Murthy, C.D.; Pal, S.K. Genetic algorithm with elitist model and its convergence. Int. J. Pattern Recognit. Artif. Intell. **1996**, 10, 731–747.
30. Gliesch, A.; Ritt, M.; Moreira, M.C.O. A genetic algorithm for fair land allocation. In Proceedings of the Genetic and Evolutionary Computation Conference, London, UK, 7–11 July 2017; Association for Computing Machinery: New York, NY, USA, 2017. GECCO ’17. pp. 793–800.
31. Wang, J.; Luo, P.; Zhang, L.; Zhou, J. A Hybrid Genetic Algorithm for Weapon Target Assignment Optimization. In Proceedings of the 2nd International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence, Phuket, Thailand, 23–25 March 2018; Association for Computing Machinery: New York, NY, USA, 2018. ISMSI’18. pp. 41–47.
32. Burak, J.; Mengshoel, O.J. A multi-objective genetic algorithm for jacket optimization. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, Lille, France, 10–14 July 2021; Association for Computing Machinery: New York, NY, USA, 2021. GECCO’21. pp. 1549–1556.
33. Barbareschi, M.; Barone, S.; Bosio, A.; Han, J.; Traiola, M. A Genetic-algorithm-based Approach to the Design of DCT Hardware Accelerators. ACM J. Emerg. Technol. Comput. Syst. **2022**, 18, 50:1–50:25.
34. Peng, C.; Wu, X.; Yuan, W.; Zhang, X.; Zhang, Y.; Li, Y. MGRFE: Multilayer Recursive Feature Elimination Based on an Embedded Genetic Algorithm for Cancer Classification. IEEE/ACM Trans. Comput. Biol. Bioinform. **2021**, 18, 621–632.
35. Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. **2002**, 6, 182–197.
36. Kim, M.; Hiroyasu, T.; Miki, M.; Watanabe, S. SPEA2+: Improving the performance of the strength Pareto evolutionary algorithm 2. In Proceedings of the International Conference on Parallel Problem Solving from Nature, Birmingham, UK, 18–22 September 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 742–751.
37. Goldberg, D.E.; Korb, B.; Deb, K. Messy Genetic Algorithms: Motivation, Analysis, and First Results. Complex Syst. **1989**, 3, 493–530.
38. Deb, K.; Jain, H. An Evolutionary Many-Objective Optimization Algorithm Using Reference-Point-Based Nondominated Sorting Approach, Part I: Solving Problems With Box Constraints. IEEE Trans. Evol. Comput. **2014**, 18, 577–601.
39. Jain, H.; Deb, K. An Evolutionary Many-Objective Optimization Algorithm Using Reference-Point Based Nondominated Sorting Approach, Part II: Handling Constraints and Extending to an Adaptive Approach. IEEE Trans. Evol. Comput. **2014**, 18, 602–622.
40. Coelho, G.P.; Barbante, C.C.; Boccato, L.; Attux, R.R.F.; Oliveira, J.R.; Von Zuben, F.J. Automatic feature selection for BCI: An analysis using the davies-bouldin index and extreme learning machines. In Proceedings of the 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, Australia, 10–15 June 2012; IEEE: Brisbane, QLD, Australia, 2012; pp. 1–8.
41. Hassanat, A.; Almohammadi, K.; Alkafaween, E.; Abunawas, E.; Hammouri, A.; Prasath, V.B.S. Choosing Mutation and Crossover Ratios for Genetic Algorithms—A Review with a New Dynamic Approach. Information **2019**, 10, 390.
42. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. **1987**, 20, 53–65.
43. Rand, W.M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. **1971**, 66, 846–850.
44. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. **1985**, 2, 193–218.
45. Vinh, N.X.; Epps, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. **2010**, 11, 2837–2854.
46. Studholme, C.; Hill, D.L.G.; Hawkes, D.J. An overlap invariant entropy measure of 3D medical image alignment. Pattern Recognit. **1999**, 32, 71–86.
47. Segura, S.; Fraser, G.; Sanchez, A.B.; Ruiz-Cortés, A. A Survey on Metamorphic Testing. IEEE Trans. Softw. Eng. **2016**, 42, 805–824.
48. Yang, S.; Towey, D.; Zhou, Z.Q. Metamorphic Exploration of an Unsupervised Clustering Program. In Proceedings of the 2019 IEEE/ACM 4th International Workshop on Metamorphic Testing (MET), Montréal, QC, Canada, 26 May 2019; IEEE: Montreal, QC, Canada, 2019; pp. 48–54.
49. Fortin, F.A.; Rainville, F.M.D.; Gardner, M.A.; Parizeau, M.; Gagné, C. DEAP: Evolutionary Algorithms Made Easy. J. Mach. Learn. Res. **2012**, 13, 2171–2175.
50. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. **2011**, 12, 2825–2830.
51. Zappia, L.; Phipson, B.; Oshlack, A. Splatter: Simulation of single-cell RNA sequencing data. Genome Biol. **2017**, 18, 174.
52. Whitener, N. Scrnabench: A Package for Metamorphic Benchmarking of scRNA-seq Data Analysis Methods; GitHub: San Francisco, CA, USA, 2022.
53. Stuart, T.; Butler, A.; Hoffman, P.; Hafemeister, C.; Papalexi, E.; Mauck, W.M.; Hao, Y.; Stoeckius, M.; Smibert, P.; Satija, R. Comprehensive Integration of Single-Cell Data. Cell **2019**, 177, 1888–1902.
54. Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. **1901**, 2, 559–572.
55. McInnes, L.; Healy, J.; Saul, N.; Großberger, L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. **2018**, 3, 861.
56. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. **2006**, 63, 3–42.
57. Chen, X.; Yang, Z.; Chen, W.; Zhao, Y.; Farmer, A.; Tran, B.; Furtak, V.; Moos, M.; Xiao, W.; Wang, C. A multi-center cross-platform single-cell RNA sequencing reference dataset. Sci. Data **2021**, 8, 1–11.
58. Chen, W.; Zhao, Y.; Chen, X.; Yang, Z.; Xu, X.; Bi, Y.; Chen, V.; Li, J.; Choi, H.; Ernest, B.; et al. A multicenter study benchmarking single-cell RNA sequencing technologies using reference samples. Nat. Biotechnol. **2021**, 39, 1103–1114.
59. Fina, E. Signatures of Breast Cancer Progression in the Blood: What Could Be Learned from Circulating Tumor Cell Transcriptomes. Cancers **2022**, 14, 5668.
60. Moore, C.M.; Seibold, M.A. Possibilities and Promise: Leveraging advances in transcriptomics for clinical decision making in allergic diseases. J. Allergy Clin. Immunol. **2022**, 150, 756–765.
61. Handl, J.; Knowles, J. An Evolutionary Approach to Multiobjective Clustering. IEEE Trans. Evol. Comput. **2007**, 11, 56–76.
62. Li, X.; Zhang, S.; Wong, K.C. Deep embedded clustering with multiple objectives on scRNA-seq data. Brief. Bioinform. **2021**, 22, bbab090.
63. Jin, K.; Li, B.; Yan, H.; Zhang, X.F. Imputing dropouts for single-cell RNA sequencing based on multi-objective optimization. Bioinformatics **2022**, 12, 3222–3230.
64. Liu, Q.; Luo, X.; Li, J.; Wang, G. scESI: Evolutionary sparse imputation for single-cell transcriptomes from nearest neighbor cells. Brief. Bioinform. **2022**, 23, bbac144.
65. Liu, Q.; Zhao, X.; Wang, G. A Clustering Ensemble Method for Cell Type Detection by Multiobjective Particle Optimization. IEEE/ACM Trans. Comput. Biol. Bioinform. **2021**, 14, 1545–5963.
66. Hwang, C.L.; Masud, A.S.M. Multiple Objective Decision Making—Methods and Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1979; Volume 164.
67. Sipper, M.; Fu, W.; Ahuja, K.; Moore, J.H. Investigating the parameter space of evolutionary algorithms. BioData Min. **2018**, 11, 2.
68. Das, S.; Chaudhuri, S.; Das, A.K. Cluster analysis for overlapping clusters using genetic algorithm. In Proceedings of the 2016 Second International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), Kolkata, India, 23–25 September 2016; pp. 6–11.
69. Rocha, M.; Neves, J. Preventing premature convergence to local optima in genetic algorithms via random offspring generation. In Proceedings of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Cairo, Egypt, 31 May–3 June 1999; Springer: Berlin/Heidelberg, Germany, 1999; pp. 127–136.
70. Oliva, D.; Rodriguez-Esparza, E.; Martins, M.S.R.; Abd Elaziz, M.; Hinojosa, S.; Ewees, A.A.; Lu, S. Balancing the Influence of Evolutionary Operators for Global optimization. In Proceedings of the 2020 IEEE Congress on Evolutionary Computation (CEC), Glasgow, UK, 19–24 July 2020; pp. 1–8.

**Figure 1.** **Cluster analysis of single-cell transcriptomes.** Shown are the steps of cluster analysis of high-dimensional single-cell transcriptomes, including data preprocessing and dimensionality reduction.

**Figure 2.** **Cluster analysis as a multi-objective optimization problem.** A solution to a cluster analysis problem is a set of k cluster prototypes ${c}_{1},\dots ,{c}_{k}$. The solution space of two objective functions and the corresponding objective function space are shown. Objective function ${f}_{1}\left(x\right)$ maximizes inter-cluster distances, and ${f}_{2}\left(x\right)$ minimizes intra-cluster distances. The Pareto front encompasses optimal solutions that are not dominated by any other feasible solutions.

**Figure 3.** **Architecture of the multi-objective Genetic Algorithm.** Shown are the main steps of the proposed GA. The initial population of chromosomes is randomly created and passed to the GA optimizer. After Pareto-optimal solutions are found, the best solution is selected using a predefined criterion.

**Figure 4.** **Hyperparameter tuning of population size, number of iterations, individual mutation rate, mutation probability, and crossover probability for objective ${f}_{2}\left(x\right)$.** Shown are (**A**) a line plot of fitness values obtained by varying population size and number of iterations and (**B**) a three-dimensional heatmap of fitness values obtained by varying mutation probability, crossover probability, and individual mutation rate. A small fitness value is preferred.

**Figure 5.** **Internal validation and comparison of MOGA, SOGA, KMeans, PhenoGraph, Seurat, and Scanpy.** Shown are the box plots of (**A**) Sil of 48 scRNA-seq reference datasets with six algorithms and (**B**) Sil scores of MOGA-based clustering by sequencing technology. Experiments are repeated 30 times, and the best Sil values of each dataset are shown. Cell line A: breast cancer; Cell line B: normal B lymphocytes.

**Figure 6.** **Cluster stability in metamorphic testing.** Shown are the box plots of the distributions of the Sil of (**A**) MOGA, (**B**) SOGA, (**C**) KMeans, (**D**) PhenoGraph, (**E**) Seurat, and (**F**) Scanpy in the original clustering as well as in metamorphic tests (MR1 to MR6). Experiments are repeated 30 times, and the best Sil of each dataset is shown.

**Figure 7.** **Internal and external validation of MOGA on synthetic datasets.** Shown are the bar graphs of NMI, ARI, and Sil of (**A**) 60 synthetic datasets with different numbers of cells and the same number of clusters and (**B**) 60 synthetic datasets with different numbers of clusters and the same number of cells. Experiments are repeated 30 times, and the best metrics are retained.

**Figure 8.** **Running time analysis of MOGA and SOGA.** Shown are time comparisons of MOGA and SOGA on (**A**) 48 scRNA-seq reference datasets and (**B**) 60 synthetic datasets. Experiments are repeated 30 times, and the average computational time is retained.

**Table 1.** **Internal validity of MOGA, SOGA, KMeans, PhenoGraph, Seurat, and Scanpy.** Shown are Silhouette scores of clustering of 12 reference transcriptomes. MOGA achieved the highest score, alone or tied, on 11 of the 12 datasets and was within 0.01 of the best score on the remaining one.

Dataset | MOGA | SOGA | KMeans | PhenoGraph | Seurat | Scanpy |
---|---|---|---|---|---|---|
C1_FDA_HT_A_featureCounts | 0.60 | 0.55 | 0.55 | 0.51 | 0.51 | 0.50 |
C1_FDA_HT_A_kallisto | 0.59 | 0.48 | 0.47 | 0.43 | 0.45 | 0.43 |
C1_FDA_HT_A_rsem | 0.61 | 0.55 | 0.55 | 0.47 | 0.47 | 0.45 |
C1_LLU_A_featureCounts | 0.68 | 0.54 | 0.50 | 0.67 | 0.68 | 0.68 |
C1_LLU_A_kallisto | 0.69 | 0.54 | 0.49 | 0.68 | 0.70 | 0.69 |
C1_LLU_A_rsem | 0.80 | 0.62 | 0.57 | 0.79 | 0.80 | 0.80 |
ICELL8_PE_A_featureCounts | 0.68 | 0.41 | 0.41 | 0.39 | 0.45 | 0.36 |
ICELL8_PE_A_kallisto | 0.83 | 0.58 | 0.41 | 0.41 | 0.44 | 0.40 |
ICELL8_PE_A_rsem | 0.77 | 0.42 | 0.42 | 0.40 | 0.44 | 0.40 |
ICELL8_SE_A_featureCounts | 0.81 | 0.57 | 0.41 | 0.41 | 0.43 | 0.39 |
ICELL8_SE_A_rsem | 0.55 | 0.43 | 0.44 | 0.37 | 0.42 | 0.37 |
ICELL8_SE_A_kallisto | 0.71 | 0.53 | 0.42 | 0.41 | 0.44 | 0.43 |

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Zhao, K.; Grayson, J.M.; Khuri, N.
Multi-Objective Genetic Algorithm for Cluster Analysis of Single-Cell Transcriptomes. *J. Pers. Med.* **2023**, *13*, 183.
https://doi.org/10.3390/jpm13020183
