The Transcriptome of SH-SY5Y at Single-Cell Resolution: A CITE-Seq Data Analysis Workflow

Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) is a recently established multimodal single cell analysis technique combining the immunophenotyping capabilities of antibody labeling and cell sorting with the resolution of single-cell RNA sequencing (scRNA-seq). By simply adding a 12-bp nucleotide barcode to antibodies (cell hashing), CITE-seq can be used to sequence antibody-bound tags alongside the cellular mRNA, thus reducing costs of scRNA-seq by performing it at the same time on multiple barcoded samples in a single run. Here, we illustrate an ideal CITE-seq data analysis workflow by characterizing the transcriptome of SH-SY5Y neuroblastoma cell line, a widely used model to study neuronal function and differentiation. We obtained transcriptomes from a total of 2879 single cells, measuring an average of 1600 genes/cell. Along with standard scRNA-seq data handling procedures, such as quality checks and cell filtering procedures, we performed exploratory analyses to identify most stable genes to be possibly used as reference housekeeping genes in qPCR experiments. We also illustrate how to use some popular R packages to investigate cell heterogeneity in scRNA-seq data, namely Seurat, Monocle, and slalom. Both the CITE-seq dataset and the code used to analyze it are freely shared and fully reusable for future research.


Introduction
Single-cell studies are becoming more popular across the field of biomedical research for the level of resolution they can offer, especially in the field of transcriptome analysis. Single-cell RNA-sequencing (scRNA-seq) was first described by Tang et al. in 2009 [1] as a method for transcriptome analysis with higher sensitivity that allowed the study of samples with a very small number of cells. In this study, they focused on the early-phases of embryonic development, specifically the blastomere, a stage in which the embryo is composed by around 30 cells, too few for standard bulk RNA-Seq, which requires hundreds of thousands of cells as starting material [1]. Since then, scRNA-seq has irreversibly transformed the field of cell biology research, by making it possible to capture the complexity of the transcriptome of higher eukaryotes within each cell forming a specific tissue through different developmental stages [2]. In recent years, the improvement of scRNA-seq, together with the development of different technical approaches, made it more affordable and widely used. This technology provides a detailed view of the sample heterogeneity and genes to be used for qRT PCR experiments. The current protocol can be extended to analyze more complex CITE-seq data, e.g., derived from the sequencing of heterogeneous tissues, organs, or tumors.

Cell Hashing and Library Preparation for 10X Genomics Sequencing
Cells reaching > 80% confluence coming from two different culture flasks were harvested for sequencing. Cell hashing was performed following manufacturer's instructions (TotalSeq™-A Cell Hashing and TotalSeq™-A Antibodies Protocol for Simultaneous Proteomics and Transcriptomics with 10X Single Cell 3' Reagent Kit v2; BioLegend, San Diego, CA, USA). Briefly, after cells counting and viability assessment, single cells suspension samples were incubated 10 min on ice with Human TruStain FcX™ Fc Blocking reagent (BioLegend). Next, sample-specific TotalSeq A antibodies (TotalSeq-A0251 anti-human hashtag 1 Antibody and TotalSeq-A0252 anti-human Hashtag 2 Antibody, BioLegend) were added with subsequent incubation on ice for 30 min. Cells were washed three times with Cell Staining Buffer (BioLegend, Cat# 420201) and finally resuspended in PBS. Cells were counted in an automatic cell counter (Cell counter Thermo Scientific) and samples pooled together. Primers to cDNA PCR (HTO PCR additive primer) were added to increase yield of hashtag oligonucleotide (HTO, cDNA derived from the TotalSeq™ hashtag antibodies). After cDNA amplification, HTO-derived cDNAs (180 bp) and mRNA-derived cDNAs (>300 bp) 2.0 Fluorometer (Invitrogen ® , Waltham, MA, USA) and Agilent Bioanalyzer DNA assay (Agilent ® , Santa Clara, CA, USA). Libraries were prepared for sequencing following manufacturer's instructions (Illumina ® , San Diego, CA, USA) and then sequenced in 150 bp paired-end mode on Illumina ® NovaSeq6000.

Data Processing and Analysis
Raw reads were mapped to the human genome version hg38/GRCh38 using the STAR aligner version 2.7 [16]. The resulting aligned reads data were saved in BAM format, which also included unaligned reads. This BAM was then processed with Cell Ranger v4.0.0, in order to obtain matrices of gene counts per cell in CSV (comma-separated value) format. The dataset has been deposited to GEO under the following accession: GSE171391.
Gene count matrices were loaded in the R statistical software version 4.0.4 running Bioconductor version 3.12. Where not differently indicated, all plots were produced by using base R functions or ggplot2 package version 3.3.3. Data log normalization and scaling was performed using the Seurat package version 4.0.0 [17], which was also used for data dimensionality reduction, cell-cycle regression analysis, and differential expression analysis among conditions. Assignment of cells to cell cycle phases was performed using the cell cycle genes defined in [18,19]. Cluster and state trajectory analysis was performed with the Monocle3 package version 0.2.3.0 [20], and single-cell heterogeneity investigated by factorial single-cell latent variable model implementation in slalom (version 1.12.0) [21] using WikiPathways gene annotations [22]. The corto package version 1.1.4 [23] was used to infer a SY5Y specific co-expression network and Cytoscape was used for network visualization [24]. All code to reproduce the analysis is given in a fully reproducible R Markdown document in Supplementary Material. Alternatively, it can be accessed on GitHub at the following address: github.com/N0toriou5/CITE-seq_pipeline (accessed on 9 April 2021).

Quality Check and Cell Selection
Cells reaching 90% confluence ( Figure 1A) were split into two culture flasks (tagged as SplitA and SplitB) and left growing until reaching > 80% confluence before sequencing. Cells were not synchronized during cell culture. A raw counts sparse matrix in .h5 format, generated by the 10X Genomics software Cell Ranger, was converted to a comma-separated value file (.csv) and imported in R for further analysis. The information on hashtag counting derived from the CITE-seq experiment, hashtag 1 and 2, is already present in this raw counts matrix, as the two hashtags were counted alongside other genes, on a cell-by-cell basis. using WikiPathways gene annotations [22]. The corto package version 1.1.4 [23] was used to infer a SY5Y specific co-expression network and Cytoscape was used for network visualization [24]. All code to reproduce the analysis is given in a fully reproducible R Markdown document in Supplementary Material. Alternatively, it can be accessed on GitHub at the following address: github.com/N0toriou5/CITE-seq_pipeline.

Quality Check and Cell Selection
Cells reaching 90% confluence ( Figure 1A) were split into two culture flasks (tagged as SplitA and SplitB) and left growing until reaching > 80% confluence before sequencing. Cells were not synchronized during cell culture. A raw counts sparse matrix in .h5 format, generated by the 10X Genomics software Cell Ranger, was converted to a comma-separated value file (.csv) and imported in R for further analysis. The information on hashtag counting derived from the CITE-seq experiment, hashtag 1 and 2, is already present in this raw counts matrix, as the two hashtags were counted alongside other genes, on a cellby-cell basis.  We used the hashtag information to assign the cells to the correct sample, namely SplitA and SplitB, on the basis of hashtag counts ( Figure 1B). We chose an arbitrary threshold (in our case, threshold was ≈2.3) and a ratio above 3 to assign cells to one split, or the other, while assigning all cells with ambiguous hashtag assignation to the "multiplet" group. A multiplet is the result from two different cells being tagged by the same bead barcode, making them indistinguishable from one another and not suitable for further analysis (specific software has been specifically designed to identify potential multiplets in scRNA-seq data, such as DoubletDecon [25] and Scrublet [26]). The presence of the antibody barcodes makes it easier to identify multiplets, thus excluding this potential source of bias from downstream analysis. Out of a total of 2879 single cells mapped, 1179 were uniquely assigned to SplitA, 1605 to SplitB and 95 tagged as multiplets ( Figure 1B,C). Quality check (QC) analysis is given in Figure 1C-E. Briefly, a similar mean amount of Unique Molecular Identifiers (UMI), corresponding to uniquely mapped reads, was detected in both SplitA and B (5300 and 5400, respectively), while a higher amount was observed in multiplets (6300). A mean of 1600 genes/cell were measured in both splits. Mitochondrial genes were similarly detectable in both splits (mtUMI). The mean mitochondrial genes proportion (mitoRatio), i.e., the fraction of mitochondrial transcript counts out of the total transcript counts, was lower than 1% in all the three groups. The mitoRatio is used as an index to identify apoptotic, stressed, or low-quality cells in the data, since these cells show higher proportion of mitochondrial transcripts compared to total. Usually, cells having a mitoRatio higher than 5% are considered as low quality samples and excluded from analysis, but appropriate thresholds can be established during exploratory data analysis.
Raw counts matrix was then filtered to exclude cells expressing less than 500 UMI, 300 genes, and showing a mitochondrial genes proportion of more than 15%. We then took advantage of Seurat function CreateSeuratObject to further refine our dataset by keeping only cells where at least 1000 genes were measured by considering all the genes detected in at least three cells. Standard Seurat LogNormalize method was applied to obtain an expression matrix as follows: data are multiplied by a scale factor (default is 10,000) and log-transformed. The Seurat object is a convenient way to store both expression data (e.g., the count matrix, log normalized expression matrix) and analysis steps (such as dimensionality reduction, or clustering results) for a single-cell dataset.

Global Gene Expression Analysis
A common first step into single-cell RNA-seq analysis is to identify genes showing a high cell-to-cell expression variation, which are usually those genes better explaining the complexity of the dataset. The Top 2000 variable genes are highlighted in Figure 2A by using default FindVariableFeatures Seurat function. Among variable genes, the ten most variable were GAL, H4C3, CHGB, S100A11, MT1X, TFPI2, CHGA, MT2A, DLK1, NDUFA4L2. In a previous report of our laboratory, phenotyping of MYCN-amplified Kelly and SK-N-BE-2-C neuroblastoma cell lines, a high expression variance of metallothionein genes, such as MT2A and MT1X, was already reported, especially in Kelly cells [14]. Methods Protoc. 2021, 4, x FOR PEER REVIEW 6 of 17 An open and frequent debate is how to select proper genes for qPCR analysis normalization (i.e., Housekeeping Genes, HK). Generally speaking, to be considered as a good HK, a gene should be highly expressed in all cells, showing a very low variance. We therefore applied these criteria to identify potential HK genes for SY5Y cells, also checking the expression of commonly used HK genes for qPCR normalization in neuroblastoma experiments involving cell lines, such as ACTB, B2M, and GAPDH ( Figure 2B-D). We applied these criteria to identify potential HK genes for SY5Y cells. As showed in Figure  2B,C, a substantial similar expression of B2M, ACTB, GAPDH, EEF1A1, RPL13A, RPL10, An open and frequent debate is how to select proper genes for qPCR analysis normalization (i.e., Housekeeping Genes, HK). Generally speaking, to be considered as a good HK, a gene should be highly expressed in all cells, showing a very low variance. We therefore applied these criteria to identify potential HK genes for SY5Y cells, also checking the expression of commonly used HK genes for qPCR normalization in neuroblastoma experiments involving cell lines, such as ACTB, B2M, and GAPDH ( Figure 2B-D). We applied these criteria to identify potential HK genes for SY5Y cells. As showed in Figure 2B,C, a substantial similar expression of B2M, ACTB, GAPDH, EEF1A1, RPL13A, RPL10, ND4, RPS18, and COX1 can be observed when analyzing the cells from different flasks separately, while 13 potential HK genes emerge when considering the entire dataset, ( Figure 2D), summarized for expression and geographic location on human genome in Figure 3A. We also investigated the presence of potentially interesting HK candidates not only considering absolute expression and variance thresholds for subsetting genes, but also considering the ratio between average LogNorm Exp on Variance, since high ratios may indicate good HK candidates. As showed in Figure 3B, while commonly used for qPCR data normalization, both ACTB and B2M showed the lowest ratios among 13 HK candidates, thus being poorly suitable as expression normalizers. We therefore extended our search including all genes showing an average LogNorm Exp/Var ratio higher than 15. In Figure 3C, 49 genes were selected. Most of the candidate HKs in Figure 3C code for ribosomal proteins (39), while some of them (8) are mitochondrial genes. Among classically used HK genes for qPCR, such as ACTB and B2M, GAPDH seems the best choice, showing a strong expression/variance ratio. Other interesting non-ribosomal, non-mitochondrial genes emerge, such as EEF1A1 (Eukaryotic Translation Elongation Factor 1 Alpha 1) and H3-3A (H3.3 Histone A). Both EEF1A1 and H3-3A have been previously described as good HK genes [27,28]. ND4, RPS18, and COX1 can be observed when analyzing the cells from different flasks separately, while 13 potential HK genes emerge when considering the entire dataset, (Figure 2D), summarized for expression and geographic location on human genome in Figure  3A. We also investigated the presence of potentially interesting HK candidates not only considering absolute expression and variance thresholds for subsetting genes, but also considering the ratio between average LogNorm Exp on Variance, since high ratios may indicate good HK candidates. As showed in Figure 3B, while commonly used for qPCR data normalization, both ACTB and B2M showed the lowest ratios among 13 HK candidates, thus being poorly suitable as expression normalizers. We therefore extended our search including all genes showing an average LogNorm Exp/Var ratio higher than 15. In Figure 3C, 49 genes were selected. Most of the candidate HKs in Figure 3C code for ribosomal proteins (39), while some of them (8) are mitochondrial genes. Among classically used HK genes for qPCR, such as ACTB and B2M, GAPDH seems the best choice, showing a strong expression/variance ratio. Other interesting non-ribosomal, non-mitochondrial genes emerge, such as EEF1A1 (Eukaryotic Translation Elongation Factor 1 Alpha 1) and H3-3A (H3.3 Histone A). Both EEF1A1 and H3-3A have been previously described as good HK genes [27,28].

Dataset Dimensionality Reduction and Cell-Cycle Regression
We further processed data by applying the data scaling process from Seurat's Scale-Data function, which applies a linear transformation so that highly expressed genes do not dominate downstream analysis. Data scaling is a standard pre-processing step prior to dimensional reduction. Since not all detected genes (13,927 in our case) are meaningful to classify cells into biologically relevant clusters based on their expression profiles, dimensionality reduction techniques are employed to reduce data complexity and for data visualization. Furthermore, computational requirements for several downstream analysis techniques (e.g., cell clustering) benefit from dimensionality reduction, because several clustering and classification algorithms require large computational resources to analyze the original dataset space. Principal Component Analysis (PCA) is a very popular technique to reduce the original space in few meaningful dimensions. PCA performs an orthogonal transformation of the original dataset to create a set of new, uncorrelated variables (i.e., the principal components). These principal components are linear combinations of variables in the original dataset, ranked in decreasing order of variance. In Figure 4A, cells are plotted along the first two principal components of the dataset. No difference between cells belonging to SplitA, SplitB, or multiplet group was detectable. We determined which PC exhibited cumulative percent greater than 90% and a percentage of variation associated with the PC as less than 5. Twelve PCs were enough to capture more than 90% of the complexity of the dataset.

Dataset Dimensionality Reduction and Cell-Cycle Regression
We further processed data by applying the data scaling process from Seurat's Scale-Data function, which applies a linear transformation so that highly expressed genes do not dominate downstream analysis. Data scaling is a standard pre-processing step prior to dimensional reduction. Since not all detected genes (13,927 in our case) are meaningful to classify cells into biologically relevant clusters based on their expression profiles, dimensionality reduction techniques are employed to reduce data complexity and for data visualization. Furthermore, computational requirements for several downstream analysis techniques (e.g., cell clustering) benefit from dimensionality reduction, because several clustering and classification algorithms require large computational resources to analyze the original dataset space. Principal Component Analysis (PCA) is a very popular technique to reduce the original space in few meaningful dimensions. PCA performs an orthogonal transformation of the original dataset to create a set of new, uncorrelated variables (i.e., the principal components). These principal components are linear combinations of variables in the original dataset, ranked in decreasing order of variance. In Figure 4A, cells are plotted along the first two principal components of the dataset. No difference between cells belonging to SplitA, SplitB, or multiplet group was detectable. We determined which PC exhibited cumulative percent greater than 90% and a percentage of variation associated with the PC as less than 5. Twelve PCs were enough to capture more than 90% of the complexity of the dataset. We expected that a cell line would show very little genomic variability with most of the differences in transcriptome that may be explained by cell cycle phase, which is a common source of variability in asynchronous cell cultures. Seurat contains a set of useful functions to assign a cell cycle score to cells on the basis of cell cycle markers defined by [18]. As showed in PCA in Figure 4B, cells tend to occupy the plot space according to cell cycle phase. To mitigate the effect of cell cycle phase and to investigate transcriptome differences among cells relying on, for example, differentiating processes, proliferation rates or aggressiveness behavior, we regressed out the difference between the G2M and S phase scores by ScaleData function to keep the source of transcriptome variability separating non-cycling and cycling cells, while regressing out differences in cell cycle phase among proliferating cells. As showed in Figure 4C, partial cycle regression resulted in a clear separation of cells belonging to G1, while keeping together cells in G2M/S phases, or actively dividing cells. To note, we also totally regressed out cell cycle scores (Supplementary Figure S1, right We expected that a cell line would show very little genomic variability with most of the differences in transcriptome that may be explained by cell cycle phase, which is a common source of variability in asynchronous cell cultures. Seurat contains a set of useful functions to assign a cell cycle score to cells on the basis of cell cycle markers defined by [18]. As showed in PCA in Figure 4B, cells tend to occupy the plot space according to cell cycle phase. To mitigate the effect of cell cycle phase and to investigate transcriptome differences among cells relying on, for example, differentiating processes, proliferation rates or aggressiveness behavior, we regressed out the difference between the G2M and S phase scores by ScaleData function to keep the source of transcriptome variability separating non-cycling and cycling cells, while regressing out differences in cell cycle phase among proliferating cells. As showed in Figure 4C, partial cycle regression resulted in a clear separation of cells belonging to G1, while keeping together cells in G2M/S phases, or actively dividing cells. To note, we also totally regressed out cell cycle scores (Supplementary Figure S1, right panel), but this procedure does not have an impact on downstream analyses. This is because our cell model was sequenced when culturing cells reached a confluence >80%, and for SY5Y cells that grow as a monolayer ( Figure 1A) this means a decrease in growth rate. However, we strongly recommend applying partial regression of the cell cycle, particularly when the focus of the analysis is to dissect differentiating processes (e.g., hematopoiesis), where we expect a quiescent, and an actively proliferating population to coexist in culture.

Differential Expression and Cluster Analysis
The Seurat FindMarkers function offers a simple way to test for differentially expressed genes for each of the identity classes in a dataset. In our case, we were interested in understanding if transcriptome differences existed between SplitA and B. As expected, no differentially expressed gene was detectable between the two flasks, as they came from a single split from the same original culture. Thus, to investigate if a source of heterogeneity among cells could be captured, we transferred the cell cycle regressed expression matrix (keeping only cells belonging to SplitA and B) to a Monocle cell data set object using convenient functions from the SeuratWrappers package, which is a collection of wrapper functions to transfer data processed in Seurat to other single-cell analysis packages. After data normalization and applying graph based Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction in monocle, we visualized again the cells in a 2D plot coloring cells by split identity. As showed in Figure 5A, cells from the two splits are distributed uniformly in the reduced space, meaning that no substantial difference came from splitting. However, point distribution was suggestive of different cluster forming groups among cells. We therefore applied the cluster_cells function implemented in Monocle3, choosing a community detection algorithm relying on Leiden clustering approach [29]. Applying a low-resolution threshold (e.g., 10 −3 ), we assigned cells to the four cluster-forming communities in Figure 5B. Then, we asked whether differences among communities may be explained by transition from different functional states. Monocle3 offers some functions to investigate transient transcriptomic changes among cell states by learning the sequence of gene expression changes each cell should go through when placed in a dynamic biological process. Once a trajectory of gene expression changes has been built, each cell can be placed at its appropriate position through the trajectory as showed in Figure 5C, where trajectories were traced in an unsupervised manner. When multiple trajectories are equally possible, monocle3 build branches, as showed clearly in cluster 2. Trajectory analysis usually performs very well when different states driven by a defined subset of genes exists, e.g., during hematopoiesis or embryogenesis processes. A similar kind of analysis can be performed by using the spathial R package [30], which takes advantage of an unsupervised learning method to navigate cells' features to identify transition states among communities. This is a slightly different approach from monocle, because spathial is able to reconstruct transcriptome states that explain possible transcriptomes, which are representative of real intermediate states between cells belonging to different clusters. Differential analysis can be performed through the top_markers function to identify genes most characterizing each cell state through the trajectory. As showed in Figure 5D, major sources of variation are detectable among cluster 4 and the other three clusters, as cluster 4 showed a higher expression of TUBA1A (Tubulin Alpha 1a) and HSP90AA1 (Heat Shock Protein 90 Alpha Family Class A Member 1), while expression of S100A6 (S100 Calcium Binding Protein A6) was barely detectable in a minimal fraction of cells. Higher PTMA (Prothymosin Alpha), HMGB2 (High Mobility Group Box 2), and H2AZ1 (H2A.Z Variant Histone 1) expression characterized cluster 1, while cluster 2 and 3 were substantially homogeneous, despite a marked difference in GAL (galanin and GMAP pre-propeptide) and DLK1 (delta like non-canonical notch ligand 1) expression, mostly characterizing cluster 3. Methods Protoc. 2021, 4, x FOR PEER REVIEW 10 of 17 Differential analysis can be performed through the top_markers function to identify genes most characterizing each cell state through the trajectory. As showed in Figure 5D, major sources of variation are detectable among cluster 4 and the other three clusters, as cluster 4 showed a higher expression of TUBA1A (Tubulin Alpha 1a) and HSP90AA1 (Heat Shock Protein 90 Alpha Family Class A Member 1), while expression of S100A6 (S100 Calcium Binding Protein A6) was barely detectable in a minimal fraction of cells. Higher PTMA (Prothymosin Alpha), HMGB2 (High Mobility Group Box 2), and H2AZ1 (H2A.Z Variant Histone 1) expression characterized cluster 1, while cluster 2 and 3 were substantially homogeneous, despite a marked difference in GAL (galanin and GMAP prepropeptide) and DLK1 (delta like non-canonical notch ligand 1) expression, mostly characterizing cluster 3.

Dissection of Cell Heterogeneity by Factorial Single-cell Latent Variable Model Method
Factorial single-cell latent variable model (f-scLVM) is a method to investigate sources of heterogeneity in scRNA-seq datasets by using pathway annotations to drive the inference of interpretable factors explaining variability among cells, facilitating the

Dissection of Cell Heterogeneity by Factorial Single-Cell Latent Variable Model Method
Factorial single-cell latent variable model (f-scLVM) is a method to investigate sources of heterogeneity in scRNA-seq datasets by using pathway annotations to drive the inference of interpretable factors explaining variability among cells, facilitating the discovery of meaningful subpopulations [21]. We applied the f-scLVM method implementation in the R package slalom to analyze our data, coupled with human gene annotations deposited in WikiPathways [22] to define annotated terms of heterogeneity. Figure 6A shows top pathways explaining the heterogeneity in the dataset. The highest relevance pathway was the WP TRANSLATION FACTORS, whose top relevant genes were several elongation factors responsible for the enzymatic delivery of aminoacyl tRNAs to the ribosome (Supplementary Figure S2), including EEF1A1. Among top 10 pathways, the most interesting factors are gene sets related to copper homeostasis ( Figure 6B) and cell cycle ( Figure 6C). In Figure 6B, several metallothioneins are listed as most varying genes in the dataset, while Figure 6C highlighted enhanced DNA Topoisomerase II Alpha expression in some cells, an enzyme involved in DNA transcription. Since we run the model on the original dataset (i.e., before cell cycle score regression), this may evidence that a small fraction of actively dividing cells was present in our dataset (as anticipated in Figure 4B). However, running the monocle pipeline on the same original dataset gave exactly the same results as the regressed dataset analysis (Supplementary Figure S3), meaning that cycling cells could explain the dataset heterogeneity only in minimal part, while major differences depended on genes, such as TUBA1A and HSP90AA1, which were found to characterize cluster 4 ( Figure 5D), and the unannotated hidden factor 05 ( Figure 6D). tRNAs to the ribosome (Supplementary Figure S2), including EEF1A1. Among top 10 pathways, the most interesting factors are gene sets related to copper homeostasis ( Figure  6B) and cell cycle ( Figure 6C). In Figure 6B, several metallothioneins are listed as most varying genes in the dataset, while Figure 6C highlighted enhanced DNA Topoisomerase II Alpha expression in some cells, an enzyme involved in DNA transcription. Since we run the model on the original dataset (i.e., before cell cycle score regression), this may evidence that a small fraction of actively dividing cells was present in our dataset (as anticipated in Figure 4B). However, running the monocle pipeline on the same original dataset gave exactly the same results as the regressed dataset analysis ( Supplementary Figure S3), meaning that cycling cells could explain the dataset heterogeneity only in minimal part, while major differences depended on genes, such as TUBA1A and HSP90AA1, which were found to characterize cluster 4 ( Figure 5D), and the unannotated hidden factor 05 ( Figure 6D).  While providing a simple way to analyze variability in scRNA-seq datasets with useful integrated plotting functions, the slalom analysis may take a lot of time to be performed, depending both on the number of pathways and genes that are given in input, and the size of the dataset. In our case, following the standard slalom pipeline as indicated in the package vignette, setting a range for the number of interactions between 1000 and 10,000, and considering 211 annotated factors and 4266 genes, it took almost 6 days to finish the analysis on R v4.0.4, running on a 64 Bit Windows 10 Pro (v20H2) system with an Intel ® Core™ i7-9600 CPU @ 3.00 GHz and 64.0 GB RAM. Our model converged after 3050 iterations. We therefore suggest to carefully evaluate and consider performing this kind of analysis on a powerful workstation, or a computational cluster. Furthermore, we strongly recommend saving the results of the analysis to a SingleCellExperiment object as explained in the package vignette, as saving the model in .rData format may incur to data loss.

Use of Single-Cell Data for Co-Expression Network Inference
Gene regulatory networks (GRNs) are representations of gene-gene relationships in a given phenotypic state [31]. Usually, GRNs summarize relationships between transcription factors and target genes, and have become a very popular way to describe disease specific signatures or drug responses [32]. Most common GRNs inference methods rely on coexpression, such as ARACNe, CorTo, and many others [23,31]. Such networks can be used to investigate network rewiring following drug treatments, or to infer the so-called Master Regulators, i.e., Transcription Factors (TFs) driving a particular phenotypic state. It was described that more than 200 gene-expression profiles were required to assemble and analyze regulatory networks [33], consistent with the requirement of~100 expression profiles for Mutual Information-based analyses [34]. Single-cell RNA-seq are an excellent source of individual samples to build context-specific GRNs. Here, we used the SY5Y single-cell transcriptomes (n = 2143 cells) to build a TF-Target network with the corto algorithm. The network was saved into a regulon object, whose information about TF-Target relationships was exported in tabular form to be loaded in Cytoscape for network visualization (Supplementary Figure S4). A focus of the E2F1 sub-network, which is a key regulator of cell cycle emerged as a source of heterogeneity in Figure 6C, is given as Supplementary Figure S5). The "regulon" object, containing all SY5Y specific TF-Target relationships, is given as supplementary file.

Discussion
We produced an original untreated SY5Y cell line CITE-seq dataset to illustrate a standard workflow using basic R functions and very popular R packages to analyze single-cell transcriptomic data. SY5Y cells are usually used as a neuronal function and differentiation model. Considering that neuroblastoma is a highly heterogeneous type of cancer, with different stages that largely vary between each other, it is important to identify both major sources of variability among cells and also the most stable genes that can serve as reliable housekeeping genes, since they are highly relevant as expression controls and for normalization purposes when doing differential expression analysis, especially in qRT PCR analysis. Previous studies with qPCR have found HPRT1 and SDHA to show low expression variability in primary neuroblastoma [35] and also, specifically for the SY5Y line, GAPDH, M-RIP, and POLR2F were found to be reliable genes to be used as normalization factors [36]. Our analysis showed that most widely used HK genes for qPCR analysis normalization, such as ACTB and B2M, may not be the best HK genes suitable in experiments involving SY5Y cells, while confirming GAPDH as an optimal HK gene. We also identified a set of 49 genes that are expressed at high levels among SY5Y cells showing a very low variance (thus, maximizing the Expression/Variance ratio). Most of these genes encode for ribosomal or mitochondrial proteins, while some, such as EEF1A1 and GAPDH, encode cytoplasmic genes. EEF1A1 was previously reported as a good HK in bone marrow derived mesenchymal stem cells [27], while suitability of GAPDH was questioned. Our 49 HK candidates selection may be useful for wet lab scientists in need of transcriptionally stable genes when planning qRT-PCR experiments involving SY5Y as cellular model. On the other hand, a major source of variation is given by some genes encoding for metallothionein proteins such as MT2A and MT1X. A high variance in metallothioneins expression was already detected in neuroblastoma cell lines [14], and this could reflect different metabolic states among cultured cells. Neuroblastoma has been demonstrated to be a highly sensitive tumor to heavy metal ion fluctuations, such as zinc, iron, and copper, and to hypoxic conditions or oxidative stresses, all representing stimuli able to induce metallothioneins transcription [37]. Furthermore, as highlighted in Figure 6B, copper homeostasis is a critical source of heterogeneity in SY5Y and neuroblastoma in general [38], thus influencing metallothioneins expression. This variability can be possibly explained by local regions of hypoxia-induced oxidative stresses in the 2D monolayer flask environment, caused for example by the formation of densely populated and hypoxic areas.
The two samples came from two separate flasks derived from the same split, and were kept in homogenous conditions until the sequencing process started; it was therefore reassuring that no significant transcriptional difference could be detected between the two SY5Y biological replicates. A similar single-cell experimental setup, comparing two different neuroblastoma cell lines, highlighted thousands of differentially expressed genes [14]. This shows that the CITE-seq + 10X Genomics procedure, and the subsequent analysis pipeline described here, possesses a negligible false discovery rate for differential expression, and that SY5Y cell line transcriptome is highly constant, making SY5Y a very strong experimental model.
Within both samples, we could however find clusters of cells representing distinct transcriptomic states, partially overlapping with different cell cycle phases (Figures 4C and 5B). The genes characterizing these clusters are involved in neuroblastoma differentiation (TUBA1A, DLK1) [39,40], survival/tumor growth/proliferation (S100A6, HSP90AA1) [41,42], or MYCN expression (PTMA) [43]. Among genes characterizing different cell communities, we also found some ribosomal proteins (RPL15, RPL35A, RPS7), and the mitochondrial gene ND3. Our data showed that RPS7 and RPL15 are highly expressed genes with very low variance throughout the SY5Y dataset, which makes these genes possible HK candidates for this cell line. However, cluster 4 ( Figure 5B) is characterized by low levels of these ribosomal proteins but higher levels of TUBA1A, HSP90AA1, and PTMA. While this difference is very little, it can be due to a little population subset characterized by a more aggressive/stressed transient behavior, which can be determined by some stresses casually occurring during 2D culture conditions, rather than a true difference among cells. Several single-cell designed methods, such as Seurat, Monocle, single-cell latent variable model (scLVM) from slalom, or spathial can help dissecting some source of heterogeneity still present even in commonly used tumor cell line models. In particular, the slalom model indicates pathways linked to copper metabolism as a major source of heterogeneity, consistently with the variability in metallothionein levels observed [14,38,44].
A typical single-cell experiment measures thousands of transcripts in thousands of cells. This can be exploited to meet sample size requirements for GRN inference by co-expression-based methods, as we performed using the corto package [23].
In our paper, we illustrated an example workflow for CITE-seq data analysis including cutting-edge methods that can be adapted in several manners to analyze different experimental settings. To this end, fully reproducible and re-usable code is given as Supplementary Material to provide a guide for CITE-seq data analysis. Our dataset on SY5Y confirmed that this cell line is transcriptionally homogeneous and suggested 49 potential candidates to be used as good HK genes for qPCR normalization, with GAPDH or EEF1A1 as best choice to this purpose. A small source of heterogeneity can be identified by in-depth data analysis, possibly caused by fluctuations in genes involved in response to hypoxic stress, metal responses, differentiation, survival, or tumor staging in neuroblastoma, identifying cell communities that adapt to the local environment (e.g., a local increase of cell density) by modulating their transcriptome to alter their metabolism/proliferation rates.

Conclusions
In conclusion, our paper provides a clear reference guide for CITE-seq data analysis, while re-evaluating the commonly used housekeeping genes and proposing a new set of 49 specific, single-cell tested candidate HK genes for experiments involving SY5Y cells. No significant difference in transcripts levels can be attributed to the splitting of a cell culture. We identified however, even within the homogeneity of the SY5Y cell line, distinct populations independent from cell cycle phases, and characterized by different stress and metabolic states. These subpopulations were found consistently in both biological replicates.
Currently, huge efforts are being undertaken by the Human Cell Atlas [45] and similar projects [46,47] in characterizing the complexity of the human Transcriptome at the single-cell level in living tissues. We also believe that a deep characterization at single-cell resolution of human cell lines will allow for a better usage of these omnipresent models in molecular/cell biology and pharmacology. Many single-cell datasets measured on cell lines exist beyond the one described here (SY5Y cells) on other cell lines from neuroblastoma [14], breast cancer [48], and lymphoblasts [49,50]. Fueled by current technological advances provided by, e.g., CITE-seq, cell line single-cell datasets can be produced in even higher numbers and for more experimental contexts, towards the generation of an integrated and comprehensive Human Cell Line Atlas.

Data Availability Statement:
The raw data associated with this dataset are stored at the Gene Expression Omnibus under the following accession: GSE171391.