Since the start of the COVID-19 pandemic in December 2019, researchers around the world have made major efforts to better understand the immune response to its causative agent, SARS-CoV-2. Although an impressive amount of scientific information has been generated in a very short period of time, there remain significant gaps in our understanding of SARS-CoV-2 immune control. In particular, it remains unclear what kind of adaptive immunity should be triggered by vaccination in order to achieve sterile immunity, or at least to ameliorate the disease course in cases where vaccination cannot provide absolute protection from infection. We know from the available literature on other coronaviruses (mainly SARS-CoV-1 and MERS) that antibodies can neutralize the infection, although these humoral responses are short-lived in many individuals, and that long-lived T cell responses are present in people with less severe disease outcomes [1
]. The emerging data on the immune response to SARS-CoV-2 demonstrate the essential contribution of the virus-specific T-cell responses, possibly in addition to the action of neutralizing antibodies, in viral control [3
]. Thus, improved tools to assess host T cell immunity in detail are urgently needed to better identify these responses and to define their role in the outcome of SARS-CoV-2 infection.
Ex-vivo immune analyses of samples from infected individuals can identify T cell responses to specific pathogens such as viruses. Such analyses can help to better understand the role of host immunity in virus control and to guide successful vaccine development. However, they rely on the use of the correct recall antigens that can elicit specific responses in vitro. The urgency of the current SARS-CoV-2 pandemic has led researchers to tackle the problem of screening the 10,000 amino acids of the SARS-CoV-2 proteome for T cell responses by selecting viral sequences based on different criteria: (i) bioinformatically predicted epitopes, (ii) homology of SARS-CoV-2 sequences with epitopes defined in other coronaviruses (mainly SARS-CoV-1) or (iii) preference for some specific SARS-CoV-2 proteins over others [5
]. However, all these approaches have intrinsic limitations. Bioinformatic prediction tools are trained on sets of previously described epitopes, but since the available epitope repertoire for many human leukocyte antigen (HLA) alleles is limited, their predictive capacity is also limited [20
]. Inferences based on epitope sequence homology with other coronaviruses are hampered because past studies on SARS-CoV-1 and MERS included only a few selected viral proteins. This is of concern, since screening only a part of the SARS-CoV-2 proteome will potentially miss an important portion of the virus-specific T cell response. Indeed, recent data indicate the existence of T cell responses against structural and non-structural proteins [5
] for SARS-CoV-2 and other viral infections [22
]. Finally, no study has considered the existence of T cell responses to epitopes encoded by open reading frames (ORF) in alternative frames, as reported for other viral infections [23
In order to reliably measure total virus-specific T cell immunity, the recall antigens used need to be as representative as possible of the worldwide viral sequences, even for genetically more stable viruses like coronaviruses. T cell recognition of epitopes is very sensitive to mismatches, and failure to match the recall antigen to the autologous virus can lead to missed responses [27
]. For this reason, different test antigen design strategies, trying to cope with the diversity of circulating viral isolates in a single sequence, have been developed in the past. These strategies include central sequence designs such as Center of Tree (COT) [28
], Ancestral [33
] or Consensus sequences [29
], which may (Ancestral, COT) or may not (Consensus) represent naturally occurring sequences of replication-competent viruses. All these designs are sensitive to the underlying sequence database and may change over time as new sequence information on additional isolates becomes available. Direct comparisons of these different central sequence approaches have been performed for a highly variable pathogen (human immunodeficiency virus, HIV) and have shown that the different designs yielded comparable results when synthetic peptides covering these sequences were used to measure virus-specific T cell responses [42
]. However, the additional costs in terms of peptide synthesis and cells needed for ex-vivo experiments may not warrant inclusion of all the different variants into a single test set.
Thus, the characterization of the complete T cell responses to SARS-CoV-2 urgently needs T cell antigens that cover the whole SARS-CoV-2 proteome while covering sequence diversity, and which can be combined in different experimental set-ups and immune assays. To this end, we created a consensus sequence to cover the genetic diversity of SARS-CoV-2 (CoV-2-cons) for all ORF, including those described in alternative open reading frames. Given the computational ease for its initial generation and periodic updates, we designed a consensus sequence using more than 1700 CoV-2 full-genome sequences and designed overlapping peptide (OLP) sets as recall antigens in T cell assays. The CoV-2-cons OLP sets are presented here in different designs, balancing costs for synthesis with the sensitivity of detecting T cell responses and with the intention to provide a common test antigen that will allow data comparability across laboratories.
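As an illustration of why a consensus is computationally easy to generate and update, the following is a minimal sketch of a majority-rule consensus computed column by column from an existing alignment. It is not the pipeline used in this study: the function name `consensus`, the alphabetical tie-breaking rule, the dropping of gap-majority columns and the toy sequences are all assumptions made for the example.

```python
from collections import Counter


def consensus(aligned_seqs, gap="-"):
    """Majority-rule consensus of equal-length, pre-aligned sequences.

    Hypothetical helper for illustration only. Columns whose most
    frequent symbol is the gap character are dropped (an assumption).
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    out = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        # Highest count first; ties are broken by character order so the
        # result is deterministic (an assumption, not a biological rule).
        residue = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
        if residue != gap:
            out.append(residue)
    return "".join(out)


# Toy alignment (made-up sequences, not SARS-CoV-2 data).
toy_alignment = ["MKTAY", "MKTAY", "MRTAY", "MKSAY"]
print(consensus(toy_alignment))
```

Because each column is evaluated independently, rerunning such a function on an enlarged alignment is enough to refresh the consensus as new isolates are deposited.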
We here report the design of a CoV-2-cons sequence and the matched OLP sets for the comprehensive analysis of the adaptive T cell immune response against SARS-CoV-2. The three sets of OLP reported here provide enough flexibility to balance exhaustive screening for T cell responses against available resources. The wide use of such a CoV-2-cons sequence and of a specific OLP set (ideally 15-mer peptides with an 11 amino acid overlap) would ensure the comparability and reproducibility of immunological data across laboratories worldwide and accelerate SARS-CoV-2 immunological studies.
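The trade-off between overlap and peptide number can be made concrete with a small sketch that tiles a protein sequence with fixed-length peptides. This is an illustrative implementation under stated assumptions, not the tool used to build the OLP sets: the function name `overlapping_peptides`, the rule of anchoring one extra peptide at the C-terminus when the tiling does not reach the end, and the toy sequence are all hypothetical.

```python
def overlapping_peptides(protein, length=15, overlap=11):
    """Tile `protein` with peptides of `length` residues that share
    `overlap` residues with their neighbour.

    Hypothetical helper for illustration only. If the regular tiling
    stops short of the C-terminus, one extra peptide anchored at the
    end is added so that every residue is covered (an assumption).
    """
    if not 0 <= overlap < length:
        raise ValueError("overlap must be >= 0 and smaller than length")
    if len(protein) <= length:
        return [protein]

    step = length - overlap
    starts = list(range(0, len(protein) - length + 1, step))
    if starts[-1] + length < len(protein):
        starts.append(len(protein) - length)
    return [protein[s:s + length] for s in starts]


# Toy 30-residue sequence (made up, not a SARS-CoV-2 protein).
toy = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"
for ovl in (11, 10):
    peptides = overlapping_peptides(toy, length=15, overlap=ovl)
    print(f"15-mers, overlap {ovl}: {len(peptides)} peptides")
```

A larger overlap means a smaller step between consecutive peptides and therefore more peptides per protein, which is the synthesis cost that has to be weighed against sensitivity.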
Fifteen-mer designs allow sensitive screens for both CD4+ and CD8+ T cell responses, while 18-mer designs allow for cheaper peptide synthesis and require fewer cells for comprehensive screenings. However, longer test peptides tend to yield fewer responses and imply greater efforts for subsequent epitope mapping. For the 15-mer design, an alternative 10 amino acid overlap was proposed to reduce peptide synthesis while maintaining sensitivity. This approach may be valuable, but may miss epitopes restricted by HLA class I molecules known to present longer peptides (such as HLA-B*27, -B*57 and others). Regardless of the final OLP design, the use of large OLP sets for immune screening raises several challenges. How to pool peptides in suitable numbers may depend on the downstream analyses, on whether or not subsequent epitope identification is planned, on the experimental setup and on whether long incubation periods will be required. The latter may be especially important, as pooling of a large number of peptides will possibly require lyophilization of the pooled peptides to eliminate dimethyl sulfoxide (DMSO), which can be toxic for the cells during culture [11
]. Also, as we gain more insights into the distribution of virus-specific T cell responses across the full proteome, more or less reactive regions can be pooled based on expected reactivity, protein expression level, and/or degree of conservation [46
Canonical and alternative frame ORF were considered in the present CoV-2-consensus sequence design to ensure as broad a screening as possible for all potentially expressed protein sequences. Whether all these putative ORF are indeed expressed remains to be confirmed. Should some of these sequences prove not to be expressed, the OLP set could be reduced by some 65 peptides, focusing exclusively on the canonical ORF. Consensus sequence design is highly dependent on the sequences included in the alignments used to construct them. We used publicly available sequences in the growing SARS-CoV-2 NCBI repository as a representative set of worldwide sequences. As noted, coverage of sequence diversity for in-vitro antigen test sets is critical, as responses to autologous viral variants may be missed if these variant sequences are not matched [27
]. This may be most critical for highly variable pathogens, such as HCV and HIV, where it has been shown that sequence entropy was directly related to the frequency of OLP reactivity in vitro and essential to identify the potential emergence of immune escape variants [59
]. However, even genetically more stable pathogens such as DNA viruses (for instance Epstein-Barr virus, EBV) have been reported to exist as a swarm of quasi-species and to lose specific T cell epitopes over time [61
]. This is also supported by recent data showing some degree of adaptation to host immunity and sequence variability for SARS-CoV-2 as it moves through the global human population [63
]. To cover these variant sites, variant OLP can be synthesized. An alternative approach to the synthesis of individual variant peptide sequences is the use of “toggled peptides”, where the sequence variation is directly incorporated into the peptide synthesis. To achieve this, peptide synthesis uses mixes of amino acids at variable positions, so that the resulting OLP resembles a mini-peptide library that can achieve an a-priori set coverage of circulating viral variants [64
]. This would readily allow coverage of more sequence diversity beyond the 25% frequency cut-off that was applied in the present study.
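To make the notions of per-site variability and a frequency cut-off concrete, the sketch below computes the Shannon entropy of a single alignment column and lists the non-consensus residues that reach a given frequency. It is an illustration only and makes several assumptions that are not taken from the study: the function names, the use of base-2 logarithms, the treatment of the cut-off as inclusive (frequency at or above 25%), and the toy columns.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits, an assumed unit) of one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def variant_residues(column, cutoff=0.25):
    """Non-consensus residues whose frequency is at or above `cutoff`.

    Hypothetical helper: the inclusive comparison and the alphabetical
    tie-break for the consensus residue are assumptions.
    """
    counts = Counter(column)
    total = len(column)
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
    return sorted(r for r, n in counts.items()
                  if r != top and n / total >= cutoff)


# Toy columns (made-up residues, one character per isolate).
for col in ("AAAAAAAA", "AAAAAAAC", "AAAC", "AACC"):
    print(col, round(column_entropy(col), 3), variant_residues(col))
```

Columns with high entropy and at least one residue above the cut-off are the positions where a variant peptide, or a mixed position in a toggled peptide, would add coverage.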
The existence of protein fragments conserved among different coronavirus species has several implications. For the interpretation of T cell responses, it has to be taken into account that some degree of cross-reactivity can exist among human coronaviruses [5
]. This implies that responses to these regions could be associated with previous infections by other human coronaviruses, some of which cause much milder infections that can pass unnoticed, such as the coronaviruses causing the common cold. This observation will need to be taken into consideration when interpreting immune data on SARS-CoV-2. On the other hand, the existence of sequences conserved among betacoronaviruses, or even across the whole coronavirus family, suggests that T cell responses to these regions could provide broad protection and that the creation of a pan-coronavirus vaccine may be feasible. Such a vaccine could prevent infection not only with SARS-CoV-2, but also with other clinically relevant coronaviruses such as SARS-CoV-1 and MERS, and even with new coronaviruses jumping the species barrier to humans. However, the design of a pan-coronavirus vaccine will critically depend on the identification of epitopes shared among them. These pan-coronavirus epitopes are likely to exist in conserved sequences, but need to be experimentally validated. At the same time, the existence of SARS-CoV-2 homologous regions in the human genome, together with the existence of described epitopes in these regions, raises some concern that coronaviruses could be involved in a molecular mimicry process triggering autoimmune diseases such as Guillain-Barré syndrome [66
The present study is currently limited to the design of the CoV-2 consensus sequence, without functional immune analyses of the OLP sets in samples from infected individuals. However, the principal aim here was to provide a SARS-CoV-2 T cell test reagent, including all described ORF and covering as much viral variability as possible, for its implementation in future screening efforts. In addition, the OLP sets will certainly elicit T cell responses in vitro as partial evaluation has been performed by others in studies using peptides spanning some of the regions covered by the present consensus sequence [5
] and since the current peptide designs (length, overlap) have been shown to be effective in the past [55
]. Thus, the present peptide designs will afford a high-resolution analysis of the T cell response to SARS-CoV-2, the nature of the targeted epitopes and the functionality and T cell receptor use of the T cells targeting these epitopes, thereby increasing our knowledge of factors that drive COVID-19 disease progression and which could be implemented in vaccine development.
We here present the first SARS-CoV-2 consensus sequence for all described SARS-CoV-2 ORF, including those in alternative frames, covering the SARS-CoV-2 sequence variability represented by 1700 available sequences. The description of this sequence and of the matching OLP sets will aid further immune analyses in SARS-CoV-2 infection and help ensure reproducibility between laboratories. In light of recent studies, the T cell response to SARS-CoV-2 can be crucial to control SARS-CoV-2 infection. To date, published studies are generally limited to a few viral proteins and use recall antigens that reflect neither sequence diversity nor alternative ORF. To overcome these limitations, the description of the global landscape of T cell responses to SARS-CoV-2 urgently needs unbiased, comparable, full-proteome screens for virus-specific T cell responses. The CoV-2-cons and matched OLP sets described here will allow data to be integrated globally, generating crucial information for vaccine development. We also include measures of sequence entropy to identify the most variable segments and to design additional OLP sequences that cover these sites. Of note, these entropy analyses, together with sequence alignments across a wide range of coronaviruses, also allowed the identification of regions that are highly conserved among different coronaviruses. T cells targeting these regions could recognize a wide range of coronaviruses, making them potentially relevant targets for T cell vaccine design.