Single-Cell Deconvolution of Head and Neck Squamous Cell Carcinoma

Simple Summary Tumors are not composed of a uniform ball of cells, but rather, a complex set of diverse cells. Unfortunately, most transcriptomic techniques analyze the entire tumor (bulk), and thus represent an average profile of genes expressed across heterogeneous cells. To estimate tumor composition from bulk data, many algorithms have been developed—broadly termed deconvolution. However, with the advent of single-cell RNA sequencing (scRNA-seq), which provides gene expression data for individual cells, a few deconvolution algorithms are now more nuanced. We have used our scRNA-seq data from head and neck tumors along with two cutting-edge deconvolution algorithms to analyze bulk expression data from >500 tumors. With this approach, we find that higher proportions of a class of immune cells (tumor-infiltrating regulatory T-cells) are associated with improved survival in head and neck cancer. Our findings and data establish a generalizable approach that can be applied across oncology to study tumor composition. Abstract Complexities in cell-type composition have rightfully led to skepticism and caution in the interpretation of bulk transcriptomic analyses. Recent studies have shown that deconvolution algorithms can be utilized to computationally estimate cell-type proportions from the gene expression data of bulk blood samples, but their performance when applied to tumor tissues, including those from head and neck, remains poorly characterized. Here, we use single-cell data (~6000 single cells) collected from 21 head and neck squamous cell carcinoma (HNSCC) samples to generate cell-type-specific gene expression signatures. We leverage bulk RNA-seq data from >500 HNSCC samples profiled by The Cancer Genome Atlas (TCGA), and using single-cell data as a reference, apply two newly developed deconvolution algorithms (CIBERSORTx and MuSiC) to the bulk transcriptome data to quantitatively estimate cell-type proportions for each tumor in TCGA. We show that these two algorithms produce similar estimates of constituent/major cell-type proportions and that a high T-cell fraction correlates with improved survival. By further characterizing T-cell subpopulations, we identify that regulatory T-cells (Tregs) were the major contributor to this improved survival. Lastly, we assessed gene expression, specifically in the Treg population, and found that TNFRSF4 (Tumor Necrosis Factor Receptor Superfamily Member 4) was differentially expressed in the core Treg subpopulation. Moreover, higher TNFRSF4 expression was associated with greater survival, suggesting that TNFRSF4 could play a key role in mechanisms underlying the contribution of Treg in HNSCC outcomes.


Introduction
Head and neck squamous cell carcinoma (HNSCC) arises in the upper aerodigestive mucosa of the oral cavity, oropharynx, hypopharynx, larynx, and rarely, in the nasal cavity and nasopharynx [1]. Together, HNSCC is the sixth most common cancer worldwide, representing 90% of cancers that arise in the head and neck region [2,3] and accounting for 650,000 (3.6%) of cancer cases and 330,000 (3.4%) of cancer deaths per year [2]. Unfortunately, the prognosis for HNSCC patients remains poor despite numerous advances in surgical, radiation, and systemic therapies [4]. Advancing new therapeutics for HNSCC will require insight into the cellular and molecular biology underlying HNSCC.
Knowledge of cell-type composition in tumor tissues represents an important step towards identifying cellular targets in cancer. Changes in cell composition underlie diverse physiological states of complex tissues. In malignant tumors, levels of infiltrating immune cells are associated with tumor growth, cancer progression, and patient outcomes [5,6]. However, beyond knowing the proportions, understanding how individual cell types respond may also be important for understanding the course of disease. For example, genes changes within particular cells might provide insights into novel avenues for treatment [6][7][8][9].
Intra-tumoral cell proportions can be estimated by histological techniques [10]. However, such approaches are low throughput, time-consuming methods that are not feasible to use with large sample sizes due to the requirement for specific antibodies and the limited number of cell types that can be simultaneously assessed [11]. By contrast, genomic (whole exome and whole genome sequencing; WES and WGS) and transcriptomic (RNA-seq) sequencing are high-throughput methods for analyzing bulk tumor samples which are suited for large sample sizes, such as The Cancer Genome Atlas (TCGA) dataset [12]. Although bulk sequencing approaches have identified driver mutations and abnormal expression profiles characteristic of HNSCC [13], these methods fail to capture intra-tumoral heterogeneity [1,14,15], which affects clinical outcomes and treatment response in HNSCC and other solid cancers [16][17][18][19][20]. Thus, improved therapies for HNSCC critically depend on the elucidation of the key cellular subpopulations present within these heterogeneous tumors.
Computational methods for the quantitative phenotyping of tumors from bulk RNAseq data have significant potential for efficient and low-cost profiling of a large number of existing samples yet are handicapped by the intrinsic limitations of the bulk data itself. Single-cell RNA-sequencing (scRNA-seq) technology provides a promising alternative, providing high-resolution gene expression data for individual cells within a tumor [9,21]. Indeed, scRNA-seq is increasingly being utilized across oncology, but this method is constrained by the need for freshly acquired patient samples, significant expense, and technical difficulties in tissue processing [14,15,22]. Fortunately, a relatively small subset of existing scRNA-seq data can provide detailed cell-type-specific gene expression profiles that can be inputted into deconvolution algorithms. These algorithms enable cell-type proportion estimates and even a deciphering of cell-type-specific gene expression from bulk sequencing data [8,11,12,14,15,[23][24][25]. Thus, deconvolution algorithms based on scRNAseq-derived gene expression profiles overcome the resource limitations of scRNA-seq by permitting in silico cell-type-specific analyses from bulk tissue. This approach can enable large-scale investigation of novel or poorly characterized cell states in bulk tissue profiles and enable us to test new hypotheses in the substantial existing bulk profile collections. In human cancers, cellular states of interest may include subpopulations of activated, resting, Cancers 2021, 13, 1230 3 of 20 or exhausted T-cells [26][27][28], cancer-associated fibroblasts [8], or malignant cells [29,30], including tumor-initiating cells or cancer stem cells [31].
Here, we sought to test the hypothesis that the proportions of individual cell types influence disease progression and outcome. We leveraged bulk transcriptomic data from >500 HNSCC samples profiled by TCGA. We derived the signature matrix from our previously profiled transcriptomes of~6000 single cells from 21 head and neck squamous cell carcinoma (HNSCC) samples, including 4 matched pairs of primary tumors and lymph node metastases [8]. With this scRNA-seq reference, we used two recently developed deconvolution algorithms-CIBERSORTx [32] and MuSiC [33]-which allow the use of scRNA-seq as reference to characterize cell-type compositions from bulk RNA-seq data in complex tissues, to estimate immune and non-immune cell-type proportions in HNSCC TCGA bulk RNA-seq data. By correlating proportions of each cell type with patient-matched overall survival, we identified that high proportions of T-cells are associated with improved overall survival from HNSCC. By further characterizing T-cell subpopulations, we found that regulatory T-cells (T regs ) are the major contributor to this superior survival. Lastly, we assessed the specific gene expression in the T reg subpopulation and found that TNFRSF4 (Tumor Necrosis Factor Receptor Superfamily Member 4) is differentially expressed in core T regs and is correlated with significantly better survival, raising the possibility that this gene could play a key role in the mechanisms underlying T regs in HNSCC.

Overview of Deconvolution Approach
A schematic overview of our deconvolution approach is illustrated in Figure 1. First, we derived a cell-type expression matrix from our previously profiled transcriptomes of 6000 single cells using the SMART-seq2 protocol [34] from 21 HNSCC samples [8]. This matrix established a benchmark for cell-type proportions in heterogeneous HNSCC tissue. Second, we obtained the bulk RNA-seq data from >500 HNSCC samples within TCGA and then used both CIBERSORTx [32] and MuSiC [33] to deconvolve the bulk RNA-seq data based on the derived cell-type expression matrix. profiles and enable us to test new hypotheses in the substantial existing bulk profile collections. In human cancers, cellular states of interest may include subpopulations of activated, resting, or exhausted T-cells [26][27][28], cancer-associated fibroblasts [8], or malignant cells [29,30], including tumor-initiating cells or cancer stem cells [31].
Here, we sought to test the hypothesis that the proportions of individual cell types influence disease progression and outcome. We leveraged bulk transcriptomic data from >500 HNSCC samples profiled by TCGA. We derived the signature matrix from our previously profiled transcriptomes of ~6000 single cells from 21 head and neck squamous cell carcinoma (HNSCC) samples, including 4 matched pairs of primary tumors and lymph node metastases [8]. With this scRNA-seq reference, we used two recently developed deconvolution algorithms-CIBERSORTx [32] and MuSiC [33]-which allow the use of scRNA-seq as reference to characterize cell-type compositions from bulk RNA-seq data in complex tissues, to estimate immune and non-immune cell-type proportions in HNSCC TCGA bulk RNA-seq data. By correlating proportions of each cell type with patientmatched overall survival, we identified that high proportions of T-cells are associated with improved overall survival from HNSCC. By further characterizing T-cell subpopulations, we found that regulatory T-cells (Tregs) are the major contributor to this superior survival. Lastly, we assessed the specific gene expression in the Treg subpopulation and found that TNFRSF4 (Tumor Necrosis Factor Receptor Superfamily Member 4) is differentially expressed in core Tregs and is correlated with significantly better survival, raising the possibility that this gene could play a key role in the mechanisms underlying Tregs in HNSCC.

Overview of Deconvolution Approach
A schematic overview of our deconvolution approach is illustrated in Figure 1. First, we derived a cell-type expression matrix from our previously profiled transcriptomes of ~6000 single cells using the SMART-seq2 protocol [34] from 21 HNSCC samples [8]. This matrix established a benchmark for cell-type proportions in heterogeneous HNSCC tissue. Second, we obtained the bulk RNA-seq data from >500 HNSCC samples within TCGA and then used both CIBERSORTx [32] and MuSiC [33] to deconvolve the bulk RNA-seq data based on the derived cell-type expression matrix.  A schematic overview of deconvolution analysis is illustrated. First, we derived a cell-type expression matrix from our previously profiled transcriptomes of~6000 single cells by SMART-seq2 protocol from 21 HNSCC (Head and Neck Squamous Cell Carcinomas) samples, including four matched pairs of primary tumors and lymph node metastases. This established a benchmark for cell-type proportions in heterogenous HNSCC tissue. Second, we obtained the bulk RNA-seq data from~500 HNSCC samples within TCGA. Lastly, we used either CIBERSORTx or MuSiC to derive a signature matrix and then deconvolve the bulk RNA-seq data to get cell proportions.

CIBERSORTx Analysis with scRNA-seq Reference for Nine Major Cell Types
CIBERSORTx is a computational framework to infer cell-type abundance and celltype-specific gene expression from RNA profiles of intact tissues. CIBERSORTx had been validated in HNSCC using simulated tumors reconstructed from single cells [32]. To further utilize this algorithm to deconvolute bulk RNA-seq data in HNSCC, two inputs are required. One input is the gene expression dataset representing a bulk admixture of different cell types, which we obtained from TCGA [35]. The second input is the singlecell reference that enumerates the genes defining the expression profile for each cell type of interest, for which we used our previously profiled transcriptomes of~6000 single cells [8]. To assign cell types to single cells, we generated a t-SNE (t-distributed Stochastic Neighbor Embedding) projection of scRNA-seq data revealing nine major cell clusters, which were further annotated by the expression of known marker genes for T-cells, B cells, macrophages, dendritic cells, mast cells, endothelial cells, fibroblasts, malignant cells, and myocytes ( Figure 2A). Next, we used the cell-type versus gene matrix to generate the signature matrix by CIBERSORTx (Supplementary Materials Figure S1A). With this signature matrix, we estimated the cell fractions for each sample of bulk RNA-seq in TCGA using the CIBERSORTx algorithm, which revealed a wide range of cell-type proportions from~80% malignant cells to~2% T-cells (Supplementary File S1). The relative percentages of each cell type (normalized by the corresponding mean within each cell type) are shown via heatmap ( Figure 2B).
To examine the association of tumor origin and stage with cell-type proportions, we performed hierarchical clustering of samples with nine type proportions and found no clear separation of origin or stage ( Figure 2B). We then explored the association with individual cell proportions and found that oropharynx tumors were highly correlated with immune cells' proportions, especially B cells and T cells (Supplementary Figures S2 and S3), which has been previously reported [12]. To further explore the association of tumor subtype with cell-type proportions, we first merged the subtype information [12] with our deconvolution result by TCGA ID and then we did the same hierarchical clustering analysis. We observed a strong association of the atypical subtype with three immune cell proportions: B cell, T cell, and Dendritic (Supplementary Figure S4). This association is expected because the vast majority of atypical subtype tumors are HPV (Human Papillomavirus) -positive oropharynx patients, which tend to have more immune infiltrate [36]. We also found that fibroblast and endothelial proportions were positively correlated with mesenchymal subtype and negatively correlated with basal subtype (Supplementary Figure S5), while malignant proportions were positively correlated with basal subtype and negatively correlated with mesenchymal subtype (Supplementary Figure S5), consistent with observations we made previously [8].
Next, we investigated the association of various cell-type proportions with survival in HNSCC patients. We began by classifying cell-type proportions into two groups, high and low, with an equal number of patients in each group. We calculated the survival for each group using the Kaplan-Meier log-rank test with corresponding Kaplan-Meier curves. A higher proportion of T-cells and B-cells was associated with more favorable survival with a p-value of less than 0.005 ( Figure 2C). By contrast, there was no significant difference in survival among patients with high and low proportions of fibroblasts, endothelial cells, malignant cells, macrophages, or dendritic cells. Interestingly, a higher proportion of myocytes was associated with inferior survival with a p-value less than 0.05. This is consistent with a recent study [37] demonstrating that patients of tongue squamous cell carcinoma with high relative MEF2C (myocyte enhancer factor 2C) expression had an inferior overall survival. To address potential confounders of patient survival, a multivariate Cox proportional-hazards model was performed to identify independent prognostic factors for HNSCC survival. After adjusting for tumor stage, race, smoking status, and age, we confirmed that T-cell and B-cell proportions serve as independent prognostic parameters for overall survival (T-cell HR (Hazard Ratio): 0.63, p < 0.05; B-cell HR: 0.59, p < 0.05) in HNSCC patients, as shown in Tables 1 and 2. To rule out that the effect of T cell proportion on survival is secondary to B cell, we correlated the estimated proportion of B-cell with that of T-cell (Supplementary Figure S6). The Pearson correlation coefficient is 0.42, indicating there is not a linear relationship between these two cell populations. Therefore, T cell and B cell are independent proxies.

CIBERSORTx Analysis with T-Cell Subtypes/Subpopulations
Given that a higher T-cell proportion was associated with improved survival in HN-SCC patients and that our prior scRNA-seq dataset included a relatively large number of T-cells (~1000 T-cells), we next examined T-cell subtypes by finer clustering, producing four sub-clusters annotated by marker genes [8] as conventional CD4 T-cells (CD4 conv ; CCR7, TCF7), regulatory T-cells (T regs ; FOXP3, CD25), conventional CD8 T-cells (CD8 conv ; GZMA/B/H/K, PRF1), and exhausted CD8 T-cells (CD8 exhausted ; PD1, LAG3, TIGIT, CTLA4) ( Figure 3A). With the addition of T-cell subtypes, we created and derived a new signature matrix of 12 cell types (eight major cell types and four T-cell subtypes) by CIBER-SORTx (Supplementary Figure S1B). We estimated the prevalence of these 12 cell types in the same TCGA bulk RNA-seq data as previously used ( Figure 3B and Supplementary File S2) and then associated the proportions (low or high) of T-cell subtypes with patients' survival outcomes by Kaplan-Meier curves and log-rank test ( Figure 3C). We specifically examined the four T-cell subtypes: CD4 conv and T reg under the umbrella of CD4 T-cells, and CD8 conv and CD8 exhausted under the category of CD8 T-cells. These results showed that higher proportions of CD4 T-cells were associated with favorable survival outcomes with a much lower p-value than that of CD8 T-cells (p = 0.00016 for CD4 T-cells, p = 0.04 for CD8 T-cells). Interestingly, T regs demonstrated a significantly lower p-value (p = 0.00003) than that of the other three subtypes, suggesting that T regs have a stronger association with improved survival in HNSCC than other T cell subsets. To rule out that T regs serve as an indirect or secondary contributor since T regs negatively regulate CD8 T cells, we noted that the p-value for CD8 T-cells' effect on survival was not as significant as for T regs . In addition, if the effect on survival was primarily driven by T regs negatively regulating CD8 T-cells, we might expect the CD8 and T reg effects on survival to be opposed (i.e., high T regs and low CD8 T-cells associated with improved survival). These observations directed our primary motivation for focusing on T regs . We next performed multivariate analyses with the Cox proportional-hazards model to identify the independent prognostic factors for HNSCC survival. After adjusting for tumor stage, patient race, smoking status, and age, we found that T reg proportion is an independent predictor of overall survival (Table 3, HR: 0.61, p < 0.05).

MuSiC Deconvolution Based on T-Cell Subtypes/Subpopulations
To confirm that the Treg subpopulation is associated with improved survival in an orthogonal approach, we used a separate but similar deconvolution algorithm known as

MuSiC Deconvolution Based on T-Cell Subtypes/Subpopulations
To confirm that the T reg subpopulation is associated with improved survival in an orthogonal approach, we used a separate but similar deconvolution algorithm known as multi-subject single-cell deconvolution (MuSiC) to validate our results. MuSiC incorporates cross-subject and cross-cell consistency of marker genes into the deconvolution algorithm, which allows for scRNA-seq datasets to serve as effective references for independent bulk RNA-seq datasets involving distinct patients. We first tested the deconvolution performance of MuSiC on simulated bulk-RNA-seq data reconstructed in silico from scRNA-seq that predetermines ground truth cell proportions. To have all 21 samples validated and avoid overlapping samples in each validation run, we utilized two combinations: 18 (reference) versus 3 (validation) and 14 (reference) versus 7 (validation). For example, as we have a total of 21 scRNA-seq samples, we used 18 of them to create the signature matrix and the remaining three to construct in silico bulk RNA-seq data. This approach yielded a total of seven (21/3 = 7) and three (21/7 = 3) validation runs for combinations of 14 versus 7 and 18 versus 3, respectively. We then ran each of the cases through MuSiC and compared the estimated cell proportions with the ground truth ( Figure 4A and Supplementary Figure  S7A). We found that the estimates aligned very well with the ground truth cell proportions: Patients were highly concordant with ground truth by Pearson correlation ( Figure 4B and Supplementary Figure S7B). Strong performance was also maintained when considering deconvolution results across distinct cell types (except myocytes possibly due to limited scRNA-seq representation) (Supplementary Figure S7C,D).
After validation of the MuSiC method in our previous HNSCC dataset, we deconvolved the TCGA bulk RNA-seq data with nine major cell types (Supplementary Figure S8A and File S3). To evaluate how well the cell type compositions from these two algorithms align with each other, we calculated the Pearson correlation between the estimated cell proportions from CIBERSORTx with that of MuSiC. Patients were highly concordant between these two algorithms, with a median Pearson correlation coefficient of 0.96 (Supplementary Figure S9A). However, when checking the cell types, we found correlation coefficients are in discrepancies ranging from Myocyte 0.95 to Mast 0.18 (Supplementary Figure S9B). We focused on results most consistent across algorithms (i.e., correlation coefficient > 0.8). Therefore, four cell types (myocyte, T cell, Malignant, and Fibroblast) are cross-validated by these two algorithms.
Next, we further deconvolved the TCGA bulk RNA-seq data with 12 cell types (eight major cell types and four T-cell subtypes), as described above ( Figure 4C and Supplementary File S4). Similar to the results obtained by CIBERSORTx, we found that a higher T-cell and B-cell proportion was associated with improved overall survival ( Supplementary Figure S8B). The proportions of other cell types, including fibroblasts, macrophages, dendritic cells, endothelial cells, malignant cells, myocytes, and mast cells, were not associated with a significant difference in survival. Within the T-cell subtypes, we again observed that a higher proportion of T regs was associated with improved survival ( Figure 4D). We then performed multivariable analyses again and found similar results to CIBERSORTx, namely, that T reg proportion is an independent predictor of overall survival (Table 3,     Estimated cell proportions were stratified by a half-half split, and the separation between survival curves was evaluated using a log-rank test.

CIBERSORTx Analysis of Gene Expression of Regulatory T-Cells
To complement our cell-proportion-centric analyses, we conducted gene-centric differential expression analysis, survival analysis, and identified prognostic associations of marker genes that could potentially define specific subtypes and states of T regs [38]. We first obtained the genome-wide expression values from bulk RNA-seq data as previously described and split them into halves by the T reg proportion. We used the Wilcoxon signed rank test to identify differentially expressed genes and the adjusted p-values are plotted in Figure 5A for all 33 marker genes. We then split the genes into halves by expression values and used Kaplan-Meier curves to display survival distributions for each marker gene (Supplementary Figure S10 and Figure 5C). The log-rank test was used to assess the difference between patients with high and low values of the corresponding genes and the p-values are plotted in Figure 5A. Two genes, CTLA4 and TNFRSF4, passed the threshold (p < 0.001). We further examined the expression of all 33 marker genes, specifically among T regs (rather than among all cells), using CIBERSORTx high-resolution mode. This analysis can impute genes' expression to define distinct subpopulations, although marker genes with continuously high expression may not be imputed due to lack of statistical power. We were able to impute the expression for seven genes with this approach, and two genes, TNFSF4 and RELB, were differentially expressed in core and effector T regs , respectively (Supplementary File S5). We then associated the expression of these seven genes with overall survival ( Figure 5A,D) and found that TNFRSF4 is the only one that passed the p-value threshold. We also show the differential expression of TNFRSF4 between the high and low group defined by T reg proportions in Figure 5B. Combined with the results from bulk RNA-seq, TNFRSF4 may be a marker gene of particular interest. Next, we estimated the hazard ratios (HRs) for the risk of disease progression and mortality associated with high and low expression of TNFRSF4 in all cells and T regs using the Cox proportional-hazards model. The results from all cells and T regs agreed with each other and showed that high expression of TNFRSF4 is correlated with a significantly lower hazard of death (Table 4). Taken together, these data suggest that TNFRSF4 is differentially expressed in the core T reg subset and is correlated with significantly better survival, indicating that this gene could play a key role in the mechanisms underlying the contribution of T reg in HNSCC outcomes.

Discussion
In this study, we deconvoluted bulk RNA-seq data from >500 HNSCC samples profiled by TCGA using scRNA-seq data to define cell-type proportions and determine their association with survival. Heterogeneity among HNSCC patients appears highly relevant, with the proportions of several cell types strongly associated with survival. Specifically, a higher proportion of infiltrating T regs is associated with improved outcomes, suggesting T reg fraction may be an independent prognostic factor in HNSCC outcomes.
We have implemented two distinct but similar deconvolution algorithms, CIBER-SORTx and MuSiC, to cross-validate the deconvolution results. Both methods allow the integration of sorted bulk data or scRNA-seq to derive a signature matrix. Compared with fluorescent-activated cell sorting (FACS)-purified or in vitro cell subsets, scRNA-seq does not rely on predetermined cell-type specific genes based on a priori knowledge, and therefore, enables unbiased transcriptional profiling of thousands of individual cells to guide cell-type-specific gene signatures. Moreover, it is now known that even 'sorted or purified' cells may still contain significant cellular heterogeneity [39]. The scRNA-seqderived signature matrix thus captures a more comprehensive picture of cell diversity in heterogeneous HNSCC tissue. This approach represents a major advantage over past deconvolution techniques, which primarily rely on using a sorted bulk expression profile from one tissue type to analyze bulk data usually from an entirely different tissue type. By contrast, we used scRNA-seq from HNSCC patients to deconvolute the bulk RNA-seq data from an orthogonal cohort of HNSCC samples, providing consistency in the tissue analyzed for both the reference matrix as well as the orthogonal, deconvoluted dataset, and thereby avoiding errors from changes in expression profiles under different tumor microenvironments (even for the same cell type) [32,40,41].
Since these two deconvolution methods are based on different algorithms (support vector regression for CIBERSORTx and weighted non-negative least squares regression for MuSiC), we might observe minor discrepancies in the estimated cell proportions. In terms of the cell types that have a discrepancy between these two methods, we weighed the results from CIBERSORTx higher than that from MuSiC because of its improved performance in our benchmarking validation tests. Specifically, these two algorithms were both compared to the ground truth of our single-cell HNSCC dataset. CIBERSORTx showed a strong concordance between the deconvolution results and ground truth cell proportions [32], while MuSiC also indicated a good alignment between estimated and ground truth proportion (Figure 4 and Supplementary Figure S7). However, CIBERSORTx slightly outperformed MuSIC with median values of 0.97 and 0.98 for correlation coefficiency in cell level and sample level validation tests respectively, compared to MuSiC, which showed best median values of 0.95 versus 0.97 under the same conditions. These evaluations suggest that CIBERSROTx outperforms MuSiC by a small margin, which guided us to use CIBERSORTx as the main exploratory tool and MuSiC as a secondary validation check. In addition, while MuSiC was unable to statistically separate CD8 conv , CD8 exhaust , and T reg populations in terms of their effect on survival, CIBERSORTx analyses revealed a much more significant association of T reg proportion with survival compared to these other T-cell subtypes. Thus, in toto, the combination of these two approaches emphasized a focus on T reg for additional analyses.
The role of T regs in HNSCC is somewhat controversial. Generally, T regs are thought to suppress the anti-tumor immunity of T-cells and natural killer (NK) cells in some solid tumors and generally help to establish an immunosuppressive microenvironment [42][43][44]. Thus, T reg infiltration is generally associated with a poor prognosis in many human carcinomas. For example, studies have reported that a high proportion of tumor-infiltrating T regs was significantly associated with worse outcomes in breast cancer [45], hepatocellular carcinoma [46], lung cancer [42], gastric cancer [47], and ovarian cancer [48]. However, contradicting conclusions have been drawn recently concerning the prognostic value of T regs in oncology, where it has been suggested that T reg infiltration may have a positive effect on anti-tumor response. For example, high densities of tumor-infiltrating T regs in colorectal carcinoma, malignant melanoma, and lymphoma are reported to be associated with improved outcomes [49][50][51][52]. In HNSCC, several studies have suggested that patients with high T reg infiltration have significantly better overall survival [53][54][55][56][57]; however, these analyses utilized immunohistochemistry in a limited cohort to reach this conclusion. In addition, a few clinical correlative studies that used deconvolution methods in HNSCC have uncovered a favorable association between increased levels of T reg infiltration and prognosis [11,58,59]. These studies, however, utilized older versions of CIBERSORT, and the signature matrix was based on sorted bulk data as opposed to single-cell sequencing data. Thus, there is precedent to support the findings of our study and to suggest that the effect of T reg infiltration on tumor biology may be quite complex and specific to the disease context. Importantly, our findings offer a more rigorous approach to deconvolution, allowing for greater sample size than histology-based studies while offering improved precision over existing informatics approaches to date.
A possible explanation for the paradoxical observation of T regs favorably affecting survival in HNSCC may relate to the potential translocation of microbial flora from the upper aerodigestive tract to HNSCC tissues [57], similar to theories proposed by Ladoire et al. for colorectal tumors [51]. It is possible that this microbiological hazard provokes a T-cell-mediated anti-microbial inflammatory response that involves Th17 cells and can thereby promote cancer growth. The Th17-cell-dependent pro-inflammatory and tumorenhancing response can be attenuated by T regs , providing a possible explanation for the favorable role of T regs in HNSCC prognosis. Certainly, future studies must rigorously examine the importance and relevance of T regs and the oral microbiome using a wellcontrolled animal model to further develop and test these hypotheses. However, our data, which are validated using two orthogonal deconvolution approaches, seem to broadly belie the notion of cancer-specific effects of infiltrating immune cells.
We observed a strong clinical association between high expression of TNFRSF4 and improved survival. The TNF-receptor superfamily (TNFRSF) serves various key immunoregulatory functions and includes death receptors that trigger apoptosis in cancer cells and receptors that provide co-stimulatory signals to anti-tumor T-cells [60]. Targeting the TN-FRSF is somewhat of a recent development among tumor immunotherapy approaches and shows promise for the treatment of cancer in preclinical studies when used in combination with chemotherapy or irradiation, which can induce immunogenic cell death and stimulate anti-tumor T-cell responses [61]. Indeed, various agonistic TNFRSF antibodies and recombinant forms of TNFSF ligands are under clinical evaluation [62][63][64]. With the success of immune checkpoint blockade therapies and promise of programmed death pathwaytargeted agents, TNFRSF4 could represent a potential target for future therapeutics to control tumor progression in HNSCC.
Although the results from our deconvolution are useful, some limitations are noted. First, this study relies on scRNA-seq as well as bulk RNA-seq data acquired from representative biopsy specimens rather than entire tumors themselves. Thus, inherent heterogeneity of HNSCC tumors may prevent a single biopsy from adequately representing the full biological behavior of the tumor. However, it is noteworthy that despite this limitation, we appreciated highly significant associations between certain cell-type proportions and patient outcomes. Second, the deconvolution methods used herein do not take spatial distributions of cells into consideration. Thus, subtle localization differences that are more appreciable with immunohistochemistry approaches may be missed; yet, our approach also offers a statistical power that is not feasible to obtain with histology slides. Third, our approach provides relative proportions of cells rather than absolute cell counts, therefore not capturing the total lymphocytic infiltration in any given tumor. With advances in single-cell and genomic approaches [14], such as spatial transcriptomics, we expect these limitations of our analyses to be overcome and provide yet more robust data.

Bulk RNA-Seq Data and Clinical Information of HNSCC Tumors from TCGA
Bulk RNA-sequencing data of HNSCC tumors was obtained from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/ (accessed on 1 March 2020)). A total of 545 cases were selected using the following filtering criteria: Primary site = head and neck, disease type = squamous cell neoplasms, sample type = primary tumor, experimental strategy = RNA-seq, and workflow type = HTseq (High Throughput sequencing)counts or FPKM (Fragments Per Kilobase Million). The HTseq-FPKM normalized expression data was used as the input file for CIBERSORTx according to the published protocol [32], while HTseq-Counts expression data was used as the input file for MuSiC according to the previously described protocol [33]. Clinical information for HNSCC patients, including gender, diagnosis, age, clinical stage, T stage, lymph node involvement, pathological grade, smoking, survival status, and survival duration in months, were downloaded from cBioPortal for Cancer Genomics (http://www.cbioportal.org/ (accessed on 1 March 2020)). According to the publication guidelines, datasets may be used for publication without restriction or limitation, and accordingly, no IRB (Institutional Review Boards) approval was required for use of these de-identified data.

Single-Cell RNA-seq Data of HNSCC Tumors
scRNA-seq data was obtained from our previous publication [8]. We excluded one patient (MEEI5), which is one of the samples with matched primary and LN, in an effort to have a more uniform set of single-cell samples to generate signature matrices because the eventual pathology was determined to be a spindle cell carcinoma (SCC with spindle cell features). Briefly, we profiled transcriptomes of ∼5600 single cells by SMART-seq2 method [34] from 21 HNSCC samples, including 4 matched pairs of primary tumors and lymph node metastases [8]. Expression levels were quantified as Ei,j = log2(TPMi,j/10 + 1), where TPMi,j refers to transcript-per-million for gene i in sample j, as calculated by RSEM (RNA-Seq by Expectation-Maximization) [65]. Malignant cells were identified by a set of potential epithelial markers consisting of all cytokeratins, EPCAM (Epithelial Cellular Adhesion Molecule), and SFN (Stratifin), as well as the copy number variation (CNV) analysis. t-SNE analysis of the remaining non-malignant cells identified eight major clusters. Briefly, we defined the clusters by DBSCAN (Density-Based Spatial Clustering of Applications with Noise: parameters Epsilon = 3 and MinPoints = 5) using the normalized gene-count matrix. Clusters were assigned to cell types based on strong differentiation expression of known marker genes. The T-cell cluster was further subdivided into four subtypes, which were annotated based on the differential expression of T-cell markers to represent the main patterns of variability: CCR7 and TCF7 for conventional CD4 (CD4 conv ), GZMA/B/H/K and PRF1 for conventional CD8 (CD8 conv ), PD1, LAG3, TIGIT, and CTLA4 for exhausted CD8 (CD8 exhaust ), and FOXP3 and CD25 for T regulatory cells (T regs ).

CIBERSORTx Deconvolution Analysis
We utilized the CIBERSORTx online tool implementation. We used the single-cell reference matrix file as previously described [8] to create a custom signature matrix. We applied two analysis modules to the bulk RNA-seq data, namely, cell fractions and gene expression. For cell fractions, we enumerated the proportions of distinct cell subpopulations in TCGA bulk tissue expression profiles. We imputed cell-type-specific expression profiles from bulk tissue transcriptomes using the High-Resolution mode for gene expression. This analysis provided estimates of sample-level gene expression variation among distinct cell types, which allowed exploring gene expression changes among distinct cellular subpopulations.

MuSiC Deconvolution Analysis
We inputted HTseq counts of the bulk RNA-seq data from TCGA as well as the multisubject single-cell profiles from our scRNA-seq data in the MuSiC algorithm, implemented in R. Cell types from scRNA-seq were based on prior categorizations [8]. Genes in bulk data use Ensembl gene ID (Ensembl version 84) as their identifiers, whereas genes in the single-cell profiles use gene symbols. To be consistent with the bulk data, the Ensembl gene IDs of the genes in single-cell profiles were queried by using biomaRt package (version 2.42) [39,66]. The single-cell profiles were further filtered to keep the genes with a unique Ensembl gene ID. The filtered profiles then served as reference for estimating cell-type proportions of bulk data. By following the tutorial available on Github (https://xuranw. github.io/MuSiC/articles/MuSiC.html (accessed on 1 January 2020)), we obtained the estimated cell-type proportions for each sample by using the function music_prop. The estimated proportions were normalized to sum to 1 across included cell types.

Single-Cell RNA-seq Data of HNSCC Tumors
Patients were split at the median for each estimated cell type. The prognosis of each group of patients was examined by Kaplan-Meier survival, and log-rank tests compared the survival outcomes. Kaplan-Meier plots are presented for all the cell-type proportions, and the cell types with log-rank p-values less than 0.05 were defined as a prognostic cell type. Hazard ratios (HRs) and corresponding 95% confidence intervals (CIs) for risk of disease progression and mortality associated with high and low percentage of cell types were estimated using the Cox proportional-hazards model. Multivariable Cox model was adjusted for tumor stage, race, smoking status, and age. All statistical analyses were performed using R version 3.6.

Conclusions
In summary, we have shown through two distinct deconvolution methods of bulk RNA-seq data from >500 TCGA samples that higher proportions of tumor-infiltrating regulatory T-cells are associated with improved outcomes in HNSCC. Future studies may investigate the possibility of further deconvolution of regulatory T-cell subtypes as well as subtypes of other cells, such as dendritic cells, macrophages, and natural killer cells. Additionally, a more substantial understanding of the implications of regulatory T-cells in HNSCC may reveal unique prognostic approaches as well as potential therapeutic targets for more precise and effective treatments.  Figure S8: (A) Heatmap of the relative cell fractions of the 9 major cell types for each patient estimated by MuSiC. (B) Association between cell proportions and overall survival in patients with HNSCC profiled by TCGA. Estimated cell proportions were stratified by a half-half split, and the separation between survival curves was evaluated using a log-rank test. Figure S9: (A) Histogram of Pearson correlation coefficient (r) between cell-type proportions estimated by CIBERSORTx and MuSiC for all samples in the study. (B) Bar plot of the Pearson correlation coefficient (r) between cell-type proportions estimated by CIBERSORTx and MuSiC for nine cell types. Figure S10: Association between gene expression and overall survival in patients with HNSCC profiled by TCGA. The gene expression is measured by bulk-RNA-seq and was stratified by a half-half split. The separation between survival curves was evaluated using a log-rank test. File S1: Cell proportions of the nine major cell types estimated by CIBERSORTx.