Discordant Genome Assemblies Drastically Alter the Interpretation of Single-Cell RNA Sequencing Data Which Can Be Mitigated by a Novel Integration Method

Advances in sequencing and assembly technology have led to the creation of genome assemblies for a wide variety of non-model organisms. The rapid production and proliferation of updated, novel assembly versions can create vexing problems for researchers when multiple-genome assembly versions are available at once, requiring researchers to work with more than one reference genome. Multiple-genome assemblies are especially problematic for researchers studying the genetic makeup of individual cells, as single-cell RNA sequencing (scRNAseq) requires sequenced reads to be mapped and aligned to a single reference genome. Using the Astyanax mexicanus, this study highlights how the interpretation of a single-cell dataset from the same sample changes when aligned to its two different available genome assemblies. We found that the number of cells and expressed genes detected were drastically different when aligning to the different assemblies. When the genome assemblies were used in isolation with their respective annotations, cell-type identification was confounded, as some classic cell-type markers were assembly-specific, whilst other genes showed differential patterns of expression between the two assemblies. To overcome the problems posed by multiple-genome assemblies, we propose that researchers align to each available assembly and then integrate the resultant datasets to produce a final dataset in which all genome alignments can be used simultaneously. We found that this approach increased the accuracy of cell-type identification and maximised the amount of data that could be extracted from our single-cell sample by capturing all possible cells and transcripts. As scRNAseq becomes more widely available, it is imperative that the single-cell community is aware of how genome assembly alignment can alter single-cell data and their interpretation, especially when reviewing studies on non-model organisms.


Introduction
The use of single-cell RNA sequencing (scRNAseq) technology has greatly increased since it was first developed in 2009 [1]. scRNAseq provides transcriptome information about individual cells, enabling researchers to answer a wide variety of biological questions about topics such as cell-cell heterogeneity, tissue composition and cell-specific gene expression responses to disease and/or injury [2]. Commercialised scRNAseq kits have helped to lower the costs of single-cell experiments, making these experiments more readily accessible to researchers. Indeed, scRNAseq is becoming a routine investigatory approach

Animal Husbandry
All experimental procedures were performed in accordance with the UK Animals (Scientific Procedures) Act 1986 and institutional guidelines, and conform to the guidelines from Directive 2010/63/EU of the European Parliament on the protection of animals used for scientific purposes. Adult male and female A. mexicanus surface fish (1-year) were maintained in the laboratory on a 14/10 h photo-period at 22-25 • C.

3 UTR Extension
The v1.0.2 and v2.0 A. mexicanus genome assemblies are poorly annotated in the 3 UTR (Supplementary Figure S1). To enable the maximum capture of transcripts during data exploration pending the availability of a better genome annotation, a terminal exon extension algorithm was applied to extend the 3 UTR annotation. The extension algorithm used a full-length poly(A) RNA-seq sample as a reference and applied the following heuristic:

1.
Identify transcripts without 3 UTR annotation in the Ensembl GTF file 2.
Compare fragment coverage over 100 bp flanking the terminal exon 3.
If median 3 coverage > median 5 coverage, extend last exon 100 bp in the 3 direction and repeat steps 2 & 3 until no further extension occurs.
The extension algorithm was used to create two custom extended GTF files that were used for read counting with the corresponding v1.0.2 and v2.0 genome assemblies (available in Supplementary Materials). In all, 5077/25,489 gene-level annotations on 2530 contigs were extended for v1.0.2, and 8721/27,420 on 25 chromosomes and 1363 contigs were extended for v2.0.

10x Single-Cell RNA-Sequencing and Analysis
Single-Cell RNAseq libraries were generated using the 10x Chromium Next GEM Single-Cell 3 v3.1 kit (10x Genomics, Leiden, Netherlands, cat. no. 1000092) and sequenced using the Illumina NextSeq ® 500/550 High Output Kit v2 (Illumina, San Diego, United States, cat.no. FC-404-2005). After sequencing, FASTQ files were generated using Cell Ranger mkfastq (v3.0.2). The raw reads were mapped to Astyanax mexicanus v1.0.2 and v2.0 genome assemblies using Cell Ranger count with the corresponding 3 UTR-extended annotations, and two filtered feature matrices were produced. Downstream analysis was performed using the Seurat R package (v4.0.6) [25]. Initial quality control thresholds removed all cells with <50 captured genes and all genes present in <2 cells. Cell filtering thresholds were set based on the average nFeatures and nCounts present in each genome assembly dataset.
SCTransform [26] was used to normalise, find variable features and scale v1.0.2 and v2.0 datasets individually. The dimensions of the datasets were reduced using Principal Component analysis (PCA) and uniform manifold approximation projection (UMAP). Cells were assigned to clusters using the FindNeighbours and FindClusters functions and the appropriate resolution was chosen using the Clustree package (version 0.4.3) [27]. Marker genes for each cluster were found using. Cell-type annotations were based on marker genes and canonical markers present in v1.0.2 and v2.0 annotations. The v1.0.2 and v2.0 datasets were integrated using 3000 integration features in the Seurat SCTIntegration pipeline. Differential expression analysis was performed using the LR test in the FindMarkers function and visualised using EnhancedVolcano (v1.8.0). Gene Set Enrichment Analysis was carried out by converting A. mexicanus genes to their correspondent mouse homologs via BiomaRt (v2.46.3). Any genes which did not have a mouse homolog or mapped to multiple mouse genes were removed, and the final mouse gene lists was tested using the fgsea package (v1.16.0). Conserved genes in the integrated dataset were found using FindConservedMarkers.

Datasets Generated from the Same Sample Change in Their Fundamental Structure Depending on Genome Assembly and Annotation
To investigate the influence of genome assembly on a single-cell dataset, cardiac ventricular cells from three A. mexicanus surface fish were pooled and sequenced. The sequenced reads were aligned to both available genome assemblies with their corresponding gene annotations using Cell Ranger to generate two filtered feature matrices (gene x cell): a v1.0.2 and a v2.0 dataset. To investigate differences in the two datasets, we performed an initial comparison of the Cell Ranger outputs (Table 1). We found that the v1.0.2 and v2.0 datasets had different matrix dimensions, representing differing numbers of cells and expressed genes detected by Cell Ranger. The v2.0 assembly had a >10% higher percentage of sequenced reads that mapped to the transcriptome, resulting in a higher average number of reads and genes detected per cell. The difference in genes present in the matrix had an unexpected impact on cell capture rates; the v2.0 assembly captured an additional 148 cells compared to the v1.0.2. Further comparison between the captured cells showed that both datasets had assembly-specific cells (16 in v1.0.2 and 225 in v2.0). These fundamental differences in cell and feature capture rates required quality control thresholds to be set according to genome-assembly (Figure 1), resulting in a difference of 209 cells available for analysis post-filtering. Therefore, we found that genome assembly choice can alter the fundamental structure of scRNAseq datasets, impacting the number of genes and cells available for downstream analysis.

Dimensional Reduction and Data Visualization Is Robust to Differences in Genome Assembly
To determine whether genome assembly alignment impacted how cells clustered together, the two datasets were separately normalised and scaled using the nonlinear normalisation method SCTransform (Seurat, v3.1.5). The dimensionality of both datasets was reduced using Principal Component Analysis (PCA) and a Uniform Manifold Approximation Projection (UMAP), and cells were clustered together using a graph-based clustering approach. Cells were first embedded in a K-Nearest Neighbour (KNN) graph and then iteratively grouped together with the number of modules optimised using the Louvain algorithm. We found that the resultant datasets are very similar in their structure and number of clusters identified (Figure 2), and thus, genome assembly did not materially impact cell clustering and dimensional reduction.

Dimensional Reduction and Data Visualization Is Robust to Differences in Genome Assembly
To determine whether genome assembly alignment impacted how cells clustered together, the two datasets were separately normalised and scaled using the nonlinear normalisation method SCTransform (Seurat, v3.1.5). The dimensionality of both datasets was reduced using Principal Component Analysis (PCA) and a Uniform Manifold Approximation Projection (UMAP), and cells were clustered together using a graph-based clustering approach. Cells were first embedded in a K-Nearest Neighbour (KNN) graph and then iteratively grouped together with the number of modules optimised using the Louvain algorithm. We found that the resultant datasets are very similar in their structure and number of clusters identified (Figure 2), and thus, genome assembly did not materially impact cell clustering and dimensional reduction.

Incomplete Reference Genomes, When Used in Isolation, Create Problems in Cell-Type Identification and Differential Gene Expression Analysis and Have the Potential to Miss Data
We next sought to identify the cells present in both datasets. We used the receiver operating characteristic (ROC) test to produce a list of gene markers for each cluster in each dataset. Based on these gene markers, we were able to identify the expected different cell types present within the heart, including myocardial, endocardial and epicardial cells, fibroblasts and blood circulating cells (see Supplementary Figure S2 for top cell marker genes used during cell annotation). However, the discordant annotations of the v1.0.2 and the v2.0 genome assemblies presented problems during cell-type identification. We found that the results of differential gene expression analysis were very different between the two datasets, producing two diverging top marker lists for each cardiac cell type, and very few genes were identified as top cell-type markers in both datasets ( Table 2). Cell-type identification was further confounded as many canonical cell-type markers, such as α-smooth muscle actin (acta2: a marker of pericytes, smooth muscle and myofibroblasts [28][29][30]), are only annotated in one genome assembly.

Incomplete Reference Genomes, When Used in Isolation, Create Problems in Cell-Type Identification and Differential Gene Expression Analysis and Have the Potential to Miss Data
We next sought to identify the cells present in both datasets. We used the receiver operating characteristic (ROC) test to produce a list of gene markers for each cluster in each dataset. Based on these gene markers, we were able to identify the expected different cell types present within the heart, including myocardial, endocardial and epicardial cells, fibroblasts and blood circulating cells (see Supplementary Figure S2 for top cell marker genes used during cell annotation). However, the discordant annotations of the v1.0.2 and the v2.0 genome assemblies presented problems during cell-type identification. We found that the results of differential gene expression analysis were very different between the two datasets, producing two diverging top marker lists for each cardiac cell type, and very few genes were identified as top cell-type markers in both datasets ( Table 2). Cell-type identification was further confounded as many canonical cell-type markers, such as αsmooth muscle actin (acta2: a marker of pericytes, smooth muscle and myofibroblasts [28][29][30]), are only annotated in one genome assembly.   In addition to the discordant annotation between assemblies resulting in assemblyspecific genes, we found that even when genes were present in both genome assemblies, they could show different patterns of expression. We found that arhgap27, a gene we have previously shown to be linked to cardiac regeneration [31], has different patterns of expression in the two datasets. The v1.0.2 dataset shows arhgap27 to be expressed in very few leukocyte cells. On the other hand, the v2.0 dataset shows arhgap27 to have higher expression levels and suggests that it is also expressed in endothelial cells and cardiomyocytes ( Figure 3). Therefore, genome assembly alignment can produce datasets that are inconsistent in terms of the genes present in the matrix and the expression pattern of shared genes. they could show different patterns of expression. We found that arhgap27, a gene we have previously shown to be linked to cardiac regeneration [31], has different patterns of expression in the two datasets. The v1.0.2 dataset shows arhgap27 to be expressed in very few leukocyte cells. On the other hand, the v2.0 dataset shows arhgap27 to have higher expression levels and suggests that it is also expressed in endothelial cells and cardiomyocytes ( Figure 3). Therefore, genome assembly alignment can produce datasets that are inconsistent in terms of the genes present in the matrix and the expression pattern of shared genes.

Genome Assembly Alignment Can Distort Data Interpretation from Specific Cell Types Due to the Problems Created by Underlying Differences in Genome Assemblies
To investigate whether the observed discrepancies between the A. mexicanus genome assemblies would impact functional data interpretation, we sought to determine whether any cell types showed assembly-specific differences. Despite similarities in the median number of reads captured per cell, we found that in the v2.0 alignment, there is a higher median number of features detected (Table 1). We therefore sought to determine whether this caused any cell-specific differences in feature-detection rates between the two datasets. We found that although both v1.0.2 and v2.0 datasets showed a similar distribution of features and counts across all clusters (Figure 4a,b), when we compared the number of unique genes detected in each cell type between genome assemblies, a number of v2.0 endothelial cells had a greater number of genes/cell than their v1.0.2 counterparts (Figure  4c,d). To investigate how this increase in median genes/cell might impact the interpretation of endothelial cells we performed differential gene expression analysis between v1.0.2 and v2.0 endothelial cells. We found that when we compared v1.0.2 and v2.0 endothelial

Genome Assembly Alignment Can Distort Data Interpretation from Specific Cell Types Due to the Problems Created by Underlying Differences in Genome Assemblies
To investigate whether the observed discrepancies between the A. mexicanus genome assemblies would impact functional data interpretation, we sought to determine whether any cell types showed assembly-specific differences. Despite similarities in the median number of reads captured per cell, we found that in the v2.0 alignment, there is a higher median number of features detected (Table 1). We therefore sought to determine whether this caused any cell-specific differences in feature-detection rates between the two datasets. We found that although both v1.0.2 and v2.0 datasets showed a similar distribution of features and counts across all clusters (Figure 4a,b), when we compared the number of unique genes detected in each cell type between genome assemblies, a number of v2.0 endothelial cells had a greater number of genes/cell than their v1.0.2 counterparts (Figure 4c,d). To investigate how this increase in median genes/cell might impact the interpretation of endothelial cells we performed differential gene expression analysis between v1.0.2 and v2.0 endothelial cells. We found that when we compared v1.0.2 and v2.0 endothelial cells by logistic regression testing, 2512 genes were upregulated in an assembly-specific manner (Figure 4e). Gene Set Enrichment Analysis (GSEA) of these differentially expressed endothelial genes showed that genes downregulated in response to UV radiation were significantly enriched in the v2.0 dataset (Figure 4f) but not in the v1.0.2 dataset, thus showing that genome alignment can distort the functional interpretation of scRNAseq results, as different results can be generated from the same sample.

Integration of the Two Datasets Improves Cell-Type Annotation and Maximises the Information That Can Be Obtained from a Single-Cell Dataset
To overcome the problems posed by the discordant genome assemblies, we integrated both datasets together using the SCTIntegration pipeline. This produced an integrated dataset in which cells that originated from either the v1.0.2 or the v2.0 were present and treated as unique. We found that SCTIntegration produced a UMAP with all of the expected cell types (Figure 5a) and that cells clustered together regardless of genome alignment (Figure 5b). The integrated dataset included all 241 genome-specific cells (Figure 5c). This increased the size of a wide range of cell-type clusters such as erythrocytes, endothelial cells, cardiomyocytes, leukocytes, smooth muscle and fibroblasts, ensuring the maximal amount of data were captured in the final integrated dataset.
During cluster annotation, the integrated dataset enabled accurate cell-type identification, as canonical markers that are only annotated in one assembly, such as acta2, could be used simultaneously with the integrated dataset cluster marker genes to annotate each cluster (Figure 5d). We found that the integrated dataset resulted in the inclusion of 4311 assembly-specific genes for v1.0.2, and 5638 for v2.0, that can be utilised during cell annotation. Additionally, we found that 525 cells from the v1.0.2 dataset and 765 cells from the v2.0 dataset were annotated differently in the integrated dataset. Specifically, we found that the integrated dataset allowed more accurate annotation of doublets (see Supplementary Figure S3 for transcriptional profile of doublets), as many of the cells that changed annotation were found in doublet clusters in the integrated dataset (25.5% for v1.0.2 cells and 68.5% for v2.0 cells) (Figure 5e). During cluster annotation, the integrated dataset enabled accurate cell-type identification, as canonical markers that are only annotated in one assembly, such as acta2, could be used simultaneously with the integrated dataset cluster marker genes to annotate each cluster (Figure 5d). We found that the integrated dataset resulted in the inclusion of 4311 assembly-specific genes for v1.0.2, and 5638 for v2.0, that can be utilised during cell annotation. Additionally, we found that 525 cells from the v1.0.2 dataset and 765 cells from the We finally used the integrated dataset to calculate a list of conserved gene symbols present in both genome alignments that will act as a robust list of cell-type markers of previously uncharacterised A. mexicanus cell types (Supplementary Table S1). The integrated dataset provided a solution in which all possible genes and transcripts could be probed simultaneously for optimal cell-type annotation.

Discussion
In this study, we use single-cell RNA sequencing data and two published reference assemblies from A. mexicanus to show, for the first time, that the same set of scRNAseq reads can produce different results when aligned to different genome assemblies, generating differences in matrix dimensions, gene-expression patterns and cell-type identification. Critically, the finding that only v2.0 endothelial cells show significant enrichment for genes responsive to UV radiation confirms our hypothesis that genome assembly choice can impact data interpretation. To overcome the problems posed by multiple-genome assemblies that are discordantly annotated, we propose the alignment of scRNAseq samples to all available assemblies, followed by integration, to create a finalised dataset for use in downstream analysis.
Our proposed methodology will be a useful tool for researchers using non-model organisms with more than one available genome assembly. Within the A. mexicanus field, there is no consensus as to which genome assembly should be used for sequencing experiments. Recent publications have used v1.0.2 [31,32], v2.0 [33], an archived version of the genome (astmex1, Ensembl 87 gene model) [34], and even a new Pachón build only available on NCBI [35]. Such inconsistent use of the A. mexicanus genome assemblies will hinder research progress by introducing artefacts. Our finding that the pattern of arhgap27 expression completely changes with genome alignment emphasises that an integrated use of all genome assemblies is essential when characterizing novel cell types and gene expression patterns. Furthermore, our finding that v2.0 endothelial cells were significantly enriched for genes annotated in the UV response hallmark pathway suggests that if assemblies continue to be used in isolation, this could lead to a problem wherein results found using v1.0.2 are not transferable to v2.0 (and vice versa). Therefore, we propose that both v1.0.2 and v2.0 assemblies be used for all A. mexicanus genomic studies until a better-curated, accurate and well-annotated reference genome is available.
This work is relevant not only to A. mexicanus researchers but also to the wider scRNAseq community, in particular for groups that work with non-model organisms with genomes that continue to be produced at a rapid pace using a variety of different technologies. It is likely that the observed genome discordancies that arise in the v1.0.2 and v2.0 A. mexicanus assemblies are a result of the different sequencing technologies used during genome construction and the post-sequencing assembly and annotation algorithms. The Pachón cavefish assembly (v1.0.2) was assembled from DNA from the heart, liver, spleen and gill of a single 7-year-old adult female using short reads and mate-paired libraries, with a de Bruijn graph assembler [36], in 2014. The surface fish assembly is a more recent assembly from a single adult female surface fish that was constructed using singlemolecule-long reads, optical mapping, and a genetic linkage map. Long read constructions are better able to resolve heterogeneous, highly repetitive regions of the genome, which may explain why v2.0 endothelial cells showed significant enrichment of genes responsive to UV radiation that was absent in v1.0.2 cells. Although it is beyond the scope of individual labs to resolve genome assembly or gene annotation inconsistencies, single-cell researchers should be aware of how the assembly and annotation of their genome might be impacting their data interpretation and capture of specific cell types and genes of interest. Therefore, we recommend that researchers working with multiple incomplete reference genomes align their scRNAseq data to all constructed genome assemblies to ensure that no cells or genes are unnecessarily excluded from the final dataset. We propose that researchers use our methodology until a high-quality reference genome is available for their non-model organism of choice.
Finally, our applicable integration methodology not only provides a unifying approach to the problems posed by multiple-genome assemblies, but it also ensures that the maximum amount of information can be extracted from a single-cell sample. Integrating the v1.0.2 and v2.0 datasets allowed the expression of all assembly-specific genes to be explored during analysis of the final dataset. This led to the inclusion of more than 9k assemblyspecific genes that could be used during cluster annotation, thereby ensuring that cell types can be attributed as accurately as possible. Additionally, the integrated dataset includes 241 cells that would have been lost if the A. mexicanus genome assemblies were used in isolation. This approach would therefore be very beneficial to researchers trying to detect rare cell types, as maximising the number of cells that are retained for analysis decreases the chance that any important small cell populations are lost from the final dataset.
We do suggest that our integrated approach should mainly be used for cell-type annotation and marker-gene identification. Once the integrated dataset has been created and cell types have been accurately identified, researchers should subset the integrated dataset into its component genome-assembly datasets to perform downstream analysis. Additionally, we used a 3 UTR extension algorithm to create a custom 3 UTR extended gtf file for read counting, to try to offset the incomplete 3 UTR annotation of the A. mexicanus genome assemblies. This maximised our ability to capture transcripts and identify possible leads from our available data. However, genes of interest identified using our 3 UTR extension algorithm should be treated with caution, and would require confirmation of gene expression levels using additional methods such as qPCR and in situ hybridization.
In conclusion, we propose a novel solution to address the vexing problems posed by multiple-genome assemblies that are discordantly annotated. Our methodology opens the door to applying scRNAseq to non-model organisms, even those with multiple, fragmentary genome assemblies. This could change how we approach answering biological questions as, all too often, model organisms that are not ideally suited to dissecting a particular disease or biological process are used for pragmatic reasons, as they possess polished and well-annotated reference genomes. We hope that our approach will help researchers to design their scRNAseq using their organism of choice based on how well it recapitulates the question at hand, rather than being limited by incomplete genome assemblies.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/cells11040608/s1, Figure S1: Custom 3 UTR extension enables transcript capture for genes with incomplete 3 UTR annotation; Figure S2: Identification of the A. mexicanus major cardiac cell types; Figure S3: The integrated dataset enabled more accurate identification of doublets; Table S1: Consistent Gene Symbol Markers for the major cardiac cell types.  Institutional Review Board Statement: The animal study protocol was approved by Oxford University's central Committee on Animal Care and Ethical Review (PPL 30/3258).

Informed Consent Statement: Not applicable.
Data Availability Statement: The single-cell RNAseq data files have been deposited in GEO under accession number GSE194093. The code used to generate the plots in the paper can be found in github (https://github.com/helenpotts/Astyanax_mexicanus_genome_assembly_analysis, accessed on 22 January 2022).