IMGT® Biocuration and Analysis of the Rhesus Monkey IG Loci

The adaptive and innate immune systems are the two main biological processes that protect an organism from pathogens. The adaptive immune system is characterized by the specificity and extreme diversity of its antigen receptors. These antigen receptors are the immunoglobulins (IG) or antibodies of the B cells and the T cell receptors (TR) of the T cells. The IG are proteins that have a dual role in immunity: they recognize antigens and trigger elimination mechanisms to rid the body of foreign cells. The synthesis of the immunoglobulin heavy and light chains requires gene rearrangements at the DNA level in the IGH, IGK, and IGL loci. The rhesus monkey (Macaca mulatta) is one of the most widely used nonhuman primate species in biomedical research. In this manuscript, we provide a thorough analysis of the three IG loci of the Mmul_10 assembly of rhesus monkey, integrating previously existing IMGT data. Detailed characterization of IG genes includes their localization and position in the loci, the determination of the allele functionality, and the description of the regulatory elements of their promoters as well as the sequences of the conventional recombination signals (RS). This complete annotation of the genomic IG loci of the Mmul_10 assembly and the highly detailed IG gene characterization could be used as a model, in additional rhesus monkey assemblies, for the analysis of the IG allelic polymorphism and structural variation, which have been described in rhesus monkeys.


Introduction
The immune system, known as the biological defense of the body, is made up of two parts: the innate immune system or non-specific immune system and the adaptive immune system. The adaptive immune system, also referred to as the acquired immune system, is characterized by the remarkable specificity and the extreme diversity of its antigen receptors [1]. These antigen receptors of the adaptive immune response are the immunoglobulins (IG) or antibodies of the B cells [2] and the T cell receptors (TR) of the T cells [3]. IG are proteins that have a dual role in immunity: they recognize antigens on the surface of foreign bodies such as bacteria and viruses and trigger elimination mechanisms such as cell lysis and phagocytosis, to rid the body of these cells and particles [2]. An IG comprises two identical heavy chains (IGH), associated with two identical light chains, kappa (IGK) or lambda (IGL). The synthesis of the immunoglobulin heavy and light chains requires gene rearrangements at the DNA level in the IGH, IGK, and IGL loci during B cell differentiation [4]. The IGH locus comprises four types of genes, variable (V), diversity [...]
[...] quently extracted from NCBI assembly [36] Mmul_10 in GenBank format. The delimitation of the locus was performed by searching for the "IMGT bornes", which are coding genes (other than IG or TR) conserved among species, located upstream of the first or downstream of the last gene of an IG or TR locus (http://www.imgt.org/IMGTindex/IMGTborne.php, (accessed on 11 November 2021)). The IMGT 5' borne of the IGK locus is the paired box 8 (PAX8, Gene ID: 701906) gene and the IMGT 3' borne of the locus is the ribose 5-phosphate isomerase A (RPIA, Gene ID: 699694) gene. The IMGT 5' borne of the IGL locus is the DNA topoisomerase III (TOP3B, Gene ID: 698317) gene and the IMGT 3' borne of the locus is the radial spoke head 14 homolog (RSPH14, Gene ID: 706814) gene.
Similar to the Homo sapiens locus, the IMGT 5' and IMGT 3' bornes of the Macaca mulatta IGH locus could not be identified; therefore, the sequences of the V genes and C genes of the Homo sapiens IGH locus were used to localize the V-D-J-C-CLUSTER on the rhesus monkey genome assembly. The locus orientation on a chromosome can be either forward (FWD) or reverse (REV). Therefore, the REV locus sequences were placed in the 5' to 3' locus orientation. Each locus sequence thus obtained was assigned an IMGT® accession number (IGL: IMGT000062, IGK: IMGT000063, IGH: IMGT000064).
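Placing a REV locus in the 5' to 3' locus orientation amounts to taking the reverse complement of the chromosomal interval. The following is a minimal sketch, assuming the interval has already been extracted as a plain nucleotide string; the function name `to_locus_orientation` and the toy sequence are illustrative and are not part of the IMGT biocuration pipeline.

```python
# Complement table covering upper- and lower-case nucleotides plus N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def to_locus_orientation(seq: str, orientation: str) -> str:
    """Return seq in 5'->3' locus orientation.

    orientation is the chromosome orientation of the locus, "FWD" or "REV".
    REV sequences are reverse-complemented; FWD sequences are returned as-is.
    """
    if orientation == "FWD":
        return seq
    if orientation == "REV":
        return seq.translate(_COMPLEMENT)[::-1]
    raise ValueError(f"unknown orientation: {orientation!r}")


# Toy example (not real locus sequence).
print(to_locus_orientation("ATGCCN", "REV"))  # NGGCAT
```

Applying the function twice to a REV sequence returns the original string, which is a convenient sanity check.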
According to the "CLASSIFICATION" axiom of IMGT-ONTOLOGY, the nomenclature of all V genes of each IG locus was characterized based on the human V genes by using IMGT/V-QUEST [37] and NGPhylogeny.fr [38] to define the subgroups. All the V genes are designated by a number for the subgroup, followed by a hyphen and a number for their localization from 3' to 5' in the locus [2]. Two genes are assigned to the same subgroup if their V-REGIONs show a percentage of identity greater than 75% at the nucleotide level. The V genes which were pseudogenes and did not match any subgroup were assigned to a clan, also characterized according to the human clans (http://www.imgt.org/IMGTindex/Clan.php, (accessed on 11 November 2021)). Duplicated genes share the same name with an additional 'D' at the end of the second occurring gene. The nomenclature of the J genes and of the IGHD genes comprises a number for the sets defined according to the human sets [2], while the number corresponding to the localization increases from 5' to 3' within the locus. Finally, C genes are designated according to their isotype, followed by a number if there is more than one gene, a number which also increases from 5' to 3' in the locus. An allele is a polymorphic variant of a gene which is characterized by mutations at the nucleotide level in its core sequence (V-REGION, D-REGION, J-REGION, and C-REGION for V, D, J, and C genes, respectively). Alleles are designated by the gene name followed by an asterisk and a two-digit number starting from 01. The identification of an allelic polymorphism is performed by comparison of the IMGT reference sequence with a newly annotated genomic sequence and relies on the following rule: for a given mapped gene with the same IMGT position, the same IMGT allele name is assigned if the two core nucleotide sequences have 100% identity, and the new genomic sequence is qualified as "sequence from the literature" for that allele.
In case of less than 100% identity in the core, a new IMGT allele name is assigned, and the new genomic sequence becomes the "IMGT reference sequence". The IMGT® reference directories comprise the reference sequences of all gene alleles.
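The allele naming rule above can be sketched as follows. This is an illustration only: it assumes 100% identity can be tested as exact string equality of the core sequences (case-insensitive, same length, no ambiguity codes), and the function name `assign_allele` and the gene name `GENE1` are hypothetical.

```python
def assign_allele(gene: str, core_seq: str, reference: dict[str, str]) -> tuple[str, bool]:
    """Assign an allele name to a newly annotated core sequence.

    reference maps existing allele names of this gene (e.g. "GENE1*01")
    to their core nucleotide sequences.
    Returns (allele_name, is_new_reference).
    """
    query = core_seq.lower()
    # 100% identity in the core: reuse the existing allele name
    # (the new sequence is then a "sequence from the literature").
    for name, ref_seq in reference.items():
        if ref_seq.lower() == query:
            return name, False
    # Otherwise create the next two-digit allele number, starting from 01
    # (the new sequence becomes the reference sequence of that allele).
    numbers = [int(name.split("*")[1]) for name in reference]
    next_number = max(numbers, default=0) + 1
    return f"{gene}*{next_number:02d}", True


# Toy core sequences, not real alleles.
known = {"GENE1*01": "acgt", "GENE1*02": "acga"}
print(assign_allele("GENE1", "ACGA", known))  # ('GENE1*02', False)
print(assign_allele("GENE1", "aaaa", known))  # ('GENE1*03', True)
```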
The functionality of the genes and alleles was defined according to the IMGT 'functionality' concept, more precisely the 'IDENTIFICATION' axiom of IMGT-ONTOLOGY, described in http://www.imgt.org/IMGTScientificChart/SequenceDescription/IMGTfunctionality.html (accessed on 11 November 2021). An allele is considered as functional (F) if its coding region has an open reading frame without stop codons and no defect in the splicing sites, recombination signals, and/or regulatory elements. An allele is considered as open reading frame (ORF) if its coding region has an open reading frame without stop codons but shows alterations in the splicing sites, recombination signals, regulatory elements, and/or changes of conserved amino acids. A gene allele is considered as pseudogene (P) if the coding region has stop codon(s) and/or frameshift mutation(s).
The main concepts of the 'DESCRIPTION' axiom of IMGT-ONTOLOGY correspond to IMGT® standardized labels in the tools and databases used to describe the organization of the IG genes in the IGH, IGL, and IGK loci [34]. The standardized annotation of nucleotide sequences (IMGT reference sequences and sequences from literature) is performed using IMGT® labels (http://www.imgt.org/ligmdb/label#, (accessed on 11 November 2021)) and integrated in IMGT/LIGM-DB [39]. This allows data entry of genes and alleles in IMGT/GENE-DB [40] and in the IMGT® reference directory, and the entry of amino acid sequences in IMGT/3Dstructure-DB and IMGT/2Dstructure-DB [41]. IMGT® reference directories are also used in the sequence analysis tools (IMGT/V-QUEST [37], IMGT/HighV-QUEST [42], and IMGT/DomainGapAlign [43]). The synthesis of the annotation of genomic data is integrated in dedicated sections of IMGT® web resources: Locus representation, Locus description, Locus in genome assembly, Locus gene order, Locus Borne, Gene tables, Potential germline repertoire, Protein displays, Alignments of alleles, Colliers de Perles [44,45], and germline [CDR1-IMGT.CDR2-IMGT.CDR3-IMGT] lengths (http://imgt.org/IMGTrepertoire/, accessed on 11 November 2021).
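The three functionality definitions given above can be expressed as a small decision function. This is a simplified sketch: the `AlleleFeatures` fields and the `classify` function are hypothetical names, the features are assumed to have been determined beforehand, and the full IMGT rules cover more cases than the ones stated here.

```python
from dataclasses import dataclass


@dataclass
class AlleleFeatures:
    """Pre-computed observations for one allele (hypothetical structure)."""
    has_stop_codon: bool
    has_frameshift: bool
    splice_sites_ok: bool = True
    recombination_signals_ok: bool = True
    regulatory_elements_ok: bool = True
    conserved_aa_ok: bool = True


def classify(features: AlleleFeatures) -> str:
    """Return "F", "ORF", or "P" following the definitions in the text."""
    # Stop codon(s) and/or frameshift(s) in the coding region: pseudogene.
    if features.has_stop_codon or features.has_frameshift:
        return "P"
    # Intact reading frame and no defect anywhere: functional.
    if all((features.splice_sites_ok,
            features.recombination_signals_ok,
            features.regulatory_elements_ok,
            features.conserved_aa_ok)):
        return "F"
    # Intact reading frame but at least one alteration: ORF.
    return "ORF"


print(classify(AlleleFeatures(False, False)))                         # F
print(classify(AlleleFeatures(False, False, conserved_aa_ok=False)))  # ORF
print(classify(AlleleFeatures(True, False)))                          # P
```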
After integration of data in IMGT/GENE-DB, the 5' UTR of the IG V genes and alleles were extracted. Those sequences were trimmed to ~500 bp upstream of the initiation codon (atg), a region which includes all of the core promoter elements, as reported in the literature regarding Homo sapiens [46][47][48][49][50][51][52][53]. Then, a progressive multiple sequence alignment (MSA) was performed for the V genes 5' UTR of each locus and each subgroup separately. The MSA analysis was performed using MATLAB bioinformatics toolbox [54], as previously described in several studies [55,56], along with Clustal Omega tool/EMBL-EBI [57], MAFFT version 7 [58] and NGPhylogeny.fr [59]. The results of the MSA analysis were visualized through the Jalview platform [60]. Thus, guided by the IMGT® reference of the Homo sapiens promoter sequences and distances between elements, the motif identification and extraction, along with the calculation of the distance between elements, were carried out in Macaca mulatta loci. For further investigation and element validation, the study was assisted by bioinformatics tools and biological databases concerning eukaryotic promoter elements, including PROMO [61], gene-regulation [62], TRANSFAC database [63], Sequence Manipulation Suite [64], Transcription factor Affinity Prediction (TRAP) Web Tools [65], GPMiner [66], and SoftBerry/Nsite [67].
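The trimming step can be illustrated with a short sketch. It assumes the genomic sequence is available as a string in locus orientation and that the index of the initiation codon is already known from the annotation; the function name `upstream_region` and the toy sequence are hypothetical.

```python
def upstream_region(seq: str, atg_index: int, max_len: int = 500) -> str:
    """Return up to max_len nucleotides immediately upstream of the atg.

    atg_index is the 0-based index of the 'a' of the initiation codon.
    If fewer than max_len nucleotides are available, all of them are returned.
    """
    if seq[atg_index:atg_index + 3].lower() != "atg":
        raise ValueError("no initiation codon at the given index")
    start = max(0, atg_index - max_len)
    return seq[start:atg_index]


# Toy sequence: 600 nt of filler, then an initiation codon.
toy = "c" * 600 + "atg" + "ggg"
print(len(upstream_region(toy, 600)))  # 500
```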

Overview of the Locus Genomic Organization of IGH Locus
The Macaca mulatta (rhesus monkey) IGH locus is localized on chromosome 7 from position 167,900,000 to 169,868,564 in CM014342.1, Mmul_10. The orientation of the locus on the chromosome is REV and it spans 1969 kilobases (kb) (Table 1). The locus representation in Figure 1 encompasses 2000 kb from the most 5' V gene IGHV(II)-202 (P) to the most 3' C gene IGHA (F).
The IGH locus consists of 228 IGHV genes; among these genes, 208 are localized on the IMGT® reference sequence (IMGT000064) whereas 20 unlocalized genes come from previously annotated sequences. The 228 IGHV genes of the rhesus monkey belong to eight subgroups and three clans. The 45 IGHD genes belong to seven IGHD sets. The seven IGHJ genes belong to six IGHJ sets, and the eight IGHC genes belong to eight isotypes (Table 1). The IGHV genes span 1600 kb, the IGHD genes span 103 kb, the IGHJ genes span 6 kb, and the IGHC genes span 260 kb. The IMGT® reference sequence (IMGT000064) has two gaps from position 48,631 to 48,730 and from position 398,207 to 449,403.

Figure 1. Locus representation of the Macaca mulatta (rhesus monkey) IGH deduced from the genome assembly Mmul_10. Reproduced with permission from IMGT®, the international ImMunoGeneTics information system®, http://www.imgt.org (accessed on 11 November 2021). The diagram shows the IGH genes and positions on the locus according to the IMGT nomenclature [1]. The arrows indicate an inverse transcriptional orientation in the locus. The V-D-J-C-CLUSTER is composed of IGHV (V6-1 to V(II)-202)-IGHD(D1-1 to D7-45)-IGHJ(J1 to J6)-IGHC(IGHM-IGHD-IGHG1-IGHG2-IGHG3-IGHG4-IGHE-IGHA). The 20 unlocalized V-GENE have a provisional nomenclature and are not present

Characterization of the IGH Genes
Briefly, 288 genes and 447 alleles of the IGH locus have been annotated and integrated in the IMGT® databases. Among the 288 genes, we have identified for the IGHV: 79 F, two ORF, 140 P, and seven genes that have alleles with different functionalities (F or ORF 'FO': IGHV3-186, IGHV3-103; F or P 'FP': IGHV3-46, IGHV3-119, IGHV4-57, IGHV4-92 and IGHV4-106) (Table 2); for the IGHD: 44 F and one ORF; for the IGHJ: seven F; and for the IGHC: eight F. Six duplicated V genes were found within the IGH locus: IGHV(II)-33D (P), IGHV3-153D (F), IGHV(II)-154D (P), IGHV(III)-155D (P), IGHV1-156D (F), and IGHV3-157D (P). IGHV3 is the most represented subgroup [68] with 78 genes and 124 alleles. It also presents the greatest number of F with 39 genes and 62 alleles, followed by the IGHV1 and IGHV4 subgroups with 11 F and 19 F genes, respectively. The IGHV7 subgroup has 14 genes, most of which are P (12 P), whereas the IGHV2 subgroup has fewer genes (seven) than IGHV7 but only one is P. All genes having alleles with different functionalities belong to the subgroups IGHV3 (two FO, two FP) and IGHV4 (three FP). All genes of the clans are P by definition. The IGHV(II) and IGHV(III) are the most represented clans with 35 and 32 genes, respectively, while IGHV(I) has six genes (Table 3). Each IGHD set comprises seven genes except for the IGHD1 set, which is the most represented with nine genes, and the IGHD7 set, which is the least represented with one gene. All IGHD genes are F except for one gene (IGHD5-18) which is ORF (Table 4). The IGHJ5 set comprises two genes (IGHJ5-1 and IGHJ5-2), while the remaining IGHJ sets comprise a single gene. All IGHJ genes are F (Table 5). For the IGHC genes, four genes belong to the IGHG (IGHG1, IGHG2, IGHG3 and IGHG4) while the other four genes encode the isotypes IGHA, IGHD, IGHE and IGHM, respectively. The IGHC genes are F except for the IGHM and IGHG3 genes, which have alleles with two different functionalities (F or P).
In fact, both have one allele P, while the other two IGHM and six IGHG3 alleles are F (Table 6). Finally, the IMGT® databases count, for the IGHV, 228 genes and 336 alleles; for the IGHD, 45 genes and 49 alleles; for the IGHJ, seven genes and 10 alleles; and for the IGHC, eight genes and 52 alleles. Further, 95 additional (26 F, 69 P) IGHV genes have been annotated from the IMGT® reference sequence IMGT000064 (Mmul_10) compared to the IGH locus sequences in assembly Mmul_051212. However, two IGHD genes (IGHD1-1-1 and IGHD4-41) were only found in the scaffolds of the assembly Mmul_051212 with accession numbers NW_001121239 and NW_001121238, respectively.

Table 5. For each IGHJ set, number of IGHJ genes per functionality and, between parentheses, number of alleles.

CDR-IMGT Distributions and IMGT Proteins Displays
The three largest subgroups, IGHV1, IGHV3, and IGHV4, have several different CDR-IMGT lengths. However, there are some CDR-IMGT lengths that are more frequently represented than others: [8.8.2] for IGHV1 with 14 genes (11 F, three in-frame P), for IGHV3 with 22 genes (17 F, five in-frame P), for IGHV4 with eight genes (eight F), for IGHV5 with three genes (two F, one in-frame P) and IGHV7 with nine genes (two F, seven in-frame P); [8.10.2] for IGHV3 with 15 genes (12 F, one ORF, two in-frame P). The CDR-IMGT lengths [8.8.2] are found in almost every subgroup except in IGHV2 and IGHV6 subgroups (Table 7) [69]. The IGHV8 subgroup is not shown in the table because all its genes are out-of-frame P.
The protein displays of some IGHV genes are presented with examples of CDR-IMGT lengths (Figure 2). There are cases where a gene has alleles with different CDR-IMGT lengths, the genes IGHV1-130 and IGHV4-122 being examples of that. The allele IGHV1-130*01 has [8.8.3] as CDR-IMGT lengths, whereas the allele IGHV1-130*02 has [7.8.3]. The allele IGHV4-122*01 has [9.8.2], whereas the CDR-IMGT lengths [9.7.2] are found on the allele IGHV4-122*02. This is due to a deletion of an amino acid (AA) in the CDR, but there are also cases of insertion, for example the allele IGHV3-176*01, which has an additional position (15A) according to the IMGT unique numbering [70], or the alleles IGHV3-162*01 and IGHV3-162*02, which have an insertion (A) at position 26A. The four conserved AA are highlighted: the two cysteines at positions 23 and 104 (C23 and C104), the tryptophan at position 41 (W41), and a conserved hydrophobic AA at position 89 [70]. Most of the time, the hydrophobic AA is a leucine. However, for the IGHV1 subgroup, it is methionine.
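The bracketed CDR-IMGT lengths can be derived from a gapped amino acid sequence once it is laid out on the IMGT unique numbering, where CDR1-IMGT spans positions 27-38, CDR2-IMGT positions 56-65, and CDR3-IMGT starts at position 105. The sketch below assumes that character i of the string corresponds to IMGT position i + 1, that unoccupied positions are written as '.', and that there are no insertion positions such as 15A or 26A (these would shift the indices). The function names and the toy sequence are illustrative.

```python
# IMGT unique numbering ranges (1-based, inclusive) for the V-DOMAIN CDRs.
CDR_RANGES = {"CDR1": (27, 38), "CDR2": (56, 65), "CDR3": (105, 117)}


def cdr_imgt_lengths(gapped: str) -> list[int]:
    """Count occupied positions in each CDR-IMGT of a gapped AA sequence."""
    lengths = []
    for first, last in CDR_RANGES.values():
        segment = gapped[first - 1:last]  # a germline V-REGION ends within CDR3
        lengths.append(sum(1 for aa in segment if aa != "."))
    return lengths


def format_lengths(lengths: list[int]) -> str:
    """Render lengths in the bracketed notation used in the text."""
    return "[" + ".".join(str(n) for n in lengths) + "]"


# Toy germline V-REGION (placeholder residues, not a real gene):
# FR1 (26) + CDR1 (8 AA, 4 gaps) + FR2 (17) + CDR2 (8 AA, 2 gaps)
# + FR3 (39) + germline CDR3 (2 AA).
toy = "A" * 26 + "GGGG....GGGG" + "A" * 17 + "GGGG..GGGG" + "A" * 39 + "AR"
print(format_lengths(cdr_imgt_lengths(toy)))  # [8.8.2]
```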

RS Sequences
The V-HEPTAMER and V-NONAMER consensus sequences of all functional IGHV genes of the rhesus monkey are 'cacagtg' and 'acacaaacc' (Figure 3). 'cacagtg' is also the consensus sequence of all the IGHV subgroups of the rhesus monkey except for the IGHV2, which is instead 'cacagag' (Supplementary Table S1). The subgroups IGHV3, IGHV4, and IGHV6 share the same nonamer as the consensus sequence of all functional IGHV genes; however, the IGHV1, IGHV2, and IGHV5 subgroups have 'tcagaaacc', 'acaagaacc', and 'ccaaaaacc', respectively, as their consensus sequences. The IGHV7 subgroup does not have a consensus sequence for the nonamer: the two functional genes of this subgroup (IGHV7-114 and IGHV7-193) represent a total of three alleles and each one has a different nonamer (Supplementary Table S1a). The J-HEPTAMER and J-NONAMER consensus sequences of all IGHJ functional genes are 'cactgtg' or 'caatgtg' and 'ggtttttgt', respectively (Figure 3). These J-HEPTAMER are observed on the IGHJ1, IGHJ4 and IGHJ5, whereas this J-NONAMER is observed on the IGHJ4 and IGHJ6 (Supplementary Table S1b). The 5' D-HEPTAMER and 5' D-NONAMER consensus sequences of IGHD functional genes are 'cactgtg' and 'ggtttttgt'. The former is the consensus sequence of the IGHD2 and IGHD7 sets, while the latter is not the motif predominantly found on any set (Supplementary Table S1c). The same pattern holds true for the 3' D-HEPTAMER and 3' D-NONAMER. The consensus sequences of IGHD functional genes are 'cacagtg' and 'tcaaaaacc'. The motif 'cacagtg' is found as consensus sequence in all sets except for the IGHD1, while the motif 'tcaaaaacc' is only found in the IGHD3 set (Supplementary Table S1).
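One simple way to obtain a consensus from a set of equal-length recombination signal motifs is a column-wise majority vote. The text does not state how the consensus sequences were computed, so the following is only a sketch of that general technique; the function name `consensus` and the toy list of heptamers are illustrative and do not reproduce the per-gene data of the supplementary tables.

```python
from collections import Counter


def consensus(motifs: list[str]) -> str:
    """Column-wise majority consensus of equal-length motifs.

    Ties are resolved in favor of the nucleotide seen first in the column.
    """
    if len({len(m) for m in motifs}) != 1:
        raise ValueError("motifs must be non-empty and of equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*motifs)
    )


# Toy input: four heptamers, three of which agree at every position.
toy_heptamers = ["cacagtg", "cacagtg", "cacagag", "cactgtg"]
print(consensus(toy_heptamers))  # cacagtg
```

A majority vote does not report how strongly each position is conserved; a frequency table per column would be needed to show that, for example, no single nonamer dominates within a set.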

5' UTR Analysis of the IGHV Subgroup
A highly conserved octamer motif is identified upstream of the initiation codon (ATG) in all of the IGHV subgroups. Except for the subgroup IGHV6, the octamer consensus sequence found in all other subgroups was 5'-ATGCAAAT-3'. The TATA box is located between the ATG and octamer in all of the IGHV gene promoters and its sequence is characteristic for each subgroup. Upstream of the octamer, a heptanucleotide motif and a pyrimidine-rich region were found in every subgroup. The distance between these two elements and the length of the pyrimidine-rich region are representative of each subgroup. Additional elements are found in the promoter region of some subgroups. Subgroups IGHV3 and IGHV4 have an additional pyrimidine-rich region between the octamer and the heptanucleotide. Two E-box motifs are observed between the core elements in subgroups IGHV4 and IGHV5. Finally, in IGHV1, IGHV6, and IGHV7, an additional TATA box is located between the pyrimidine-rich region and the heptanucleotide, and for IGHV3, it was found between the heptanucleotide and the octamer (Figure 4).

Overview of the Locus Genomic Organization of IGL Locus
The rhesus monkey IGL locus consists of 149 IGLV genes (127 genes localized on the locus + 22 unlocalized) belonging to 11 subgroups and five clans, eight IGLJ genes belonging to eight IGLJ sets and eight IGLC genes (Table 1) [72]. The IGLV genes span 1247 kb, whereas the IGLJ genes and IGLC genes span 26 kb and 28 kb, respectively. TOP3B has been identified 48 kb upstream of VPREB1 and RSPH14 has been identified 165 kb downstream of IGLC7.
The rhesus monkey IGL locus has three distinct V-CLUSTERs (A, B, and C) based on the IGLV gene subgroup content, by comparison with the Homo sapiens IGL locus. Within the V-CLUSTER A, there are three functional non-IG genes (PRAME, ZNF280A and ZNF280B) and within the V-CLUSTER B there is an IGLL1 (P).

Characterization of the IGL Genes
Briefly, 165 genes and 247 alleles have been annotated and integrated in the IMGT® databases. Among the 165 genes, we have identified: for the IGLV, 72 F, two ORF, 74 P, and one gene (IGLV3-16), which has two alleles with different functionalities (F or P) (Table 2); for the IGLJ: nine F, one ORF and one gene (IGLJ7) which also has two alleles with different functionalities (F or ORF); and for the IGLC: six F and two P.
The most represented subgroup in the IGL locus is IGLV3 with 25 F genes and 35 F alleles; one gene and three alleles have been identified as ORF, 16 genes and 23 alleles have been identified as P, and one gene has a double functionality (FP) (Table 8). The IGLV1, IGLV2, and IGLV5 are also well represented with 14 F genes, 11 F genes, and eight F genes, respectively. In contrast, the least represented subgroups are IGLV8, IGLV9, IGLV10, and IGLV11 with only one F gene. In addition to the 11 subgroups, the IGL locus also has five clans (IGLV(I), IGLV(II), IGLV(III), IGLV(IV), and IGLV(V)), all pseudogenes by definition. The most represented clan is IGLV(I) with 18 genes and 24 alleles, whereas the IGLV(V) has only one gene and allele. The J-C-CLUSTER comprises eight IGLJ-IGLC cassettes indicated by the numbers 1-7 (IGLJ1-IGLC1, IGLJ2-IGLC2, IGLJ2A-IGLC2A, IGLJ3-IGLC3, IGLJ4-IGLC4, IGLJ5-IGLC5, IGLJ6-IGLC6, and IGLJ7-IGLC7, respectively) (Supplementary Figure S1). The IGLJ4 gene and one allele of the IGLJ7 are ORF, whereas all other IGLJ genes are F (Table 9). Six IGLC genes and their alleles are F, and the other two, IGLC4 and IGLC5, are P (Table 10).

Table 9. For each IGLJ set, number of IGLJ genes per functionality and, between parentheses, number of alleles are shown.
Finally, the IMGT ® databases count 149 IGLV genes and 225 alleles among which 72 genes and 108 alleles are F, two genes and five alleles are ORF, 74 genes and 110 alleles are P and one gene and two alleles have two functionalities (FP); eight genes and nine alleles for the IGLJ and eight genes and 13 alleles for the IGLC. Further, 27 additional genes were annotated from the IMGT reference sequence IMGT000062 (Mmul_10) compared to the IGL locus sequence NW_001095158 (Mmul_051212 assembly) which has 245 gaps. Among the 27 genes, 17 IGLV genes were P and eight IGLV, one IGLJ, and one IGLC were F.
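The IGL totals reported in this section can be cross-checked by summing the per-gene-type and per-functionality counts quoted in the text. The snippet below is a bookkeeping sketch only; the dictionary and function names are illustrative.

```python
# (genes, alleles) per gene type, as quoted in the text for the IGL locus.
igl_counts = {"IGLV": (149, 225), "IGLJ": (8, 9), "IGLC": (8, 13)}

# (genes, alleles) per functionality for the IGLV genes only.
iglv_by_functionality = {
    "F": (72, 108),
    "ORF": (2, 5),
    "P": (74, 110),
    "FP": (1, 2),
}


def totals(counts: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum genes and alleles over all entries."""
    genes = sum(g for g, _ in counts.values())
    alleles = sum(a for _, a in counts.values())
    return genes, alleles


print(totals(igl_counts))             # (165, 247)
print(totals(iglv_by_functionality))  # (149, 225)
```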

CDR-IMGT Distributions and IMGT Proteins Displays
The CDR-IMGT length is well conserved within the subgroup IGLV2. All the genes within this subgroup (11 F and one in-frame P) have the same CDR-IMGT lengths [9.3.9]. The same conservation is observed for the subgroups IGLV6 (3 F) and IGLV7 (4 F and 3 in-frame P) which have respectively the CDR-IMGT lengths [...]
The IGLV genes have a long CDR3-IMGT which varies from 7 to 12 AA. The four conserved AA (C23, W41, hydrophobic 89 and C104) are present in all functional IGLV genes. Only one or two examples for each subgroup are shown in Figure 6. Whether F, ORF, or in-frame P, all IGLV genes count 25 AA in the FR1-IMGT, with a gap at position 10 according to the IMGT unique numbering [70]. However, there are exceptions, such as the allele IGLV3-18*01, which has an insertion of 1 AA (R) at position 20A, or the alleles IGLV5-57*01 and IGLV5S4*01, which have a deletion of 1 AA at position 15. The IGLV genes of rhesus monkey count 17 AA in the FR2-IMGT (from position 39 to 55), but the allele IGLV4-17*01, which is ORF because of the non-conserved W41, has an insertion of 2 AA (SS) at positions 50A and 50B. The subgroups IGLV5, IGLV6 and the only representative of subgroup 11 (IGLV11-117) have a complete D-STRAND (75-84) of the FR3-IMGT.

RS Sequences
The V-HEPTAMER and V-NONAMER consensus sequences of all IGLV functional genes are 'cacagtg' and 'acaaaaacc' (Figure 3), however they may be distinct for particular subgroups: the consensus sequence of the V-HEPTAMER of IGLV6 subgroup is 'cacagta', and no consensus sequence is defined for IGLV8 subgroup since the two alleles of the single gene IGLV8-125 have two different V-HEPTAMER. Concerning the V-NONAMER, only IGLV5, IGLV9, and IGLV11 share the IGLV 'acaaaaacc' consensus sequence (Supplementary Table S2a). The J-HEPTAMER and J-NONAMER consensus sequences of all IGLJ functional genes are 'cacagtg' and 'ggtttttgt' (Figure 3). However, they may be distinct for particular sets: IGLJ1 and IGLJ7 have the same J-HEPTAMER 'cactgtg' while the IGLJ5 has two different nucleotides 'cacagca'. Concerning the J-NONAMER, only IGLJ2, IGLJ2A, and IGLJ5, share IGLJ nonamer consensus sequence (Supplementary Table S2b).

5' UTR Analysis of the IGLV Subgroup
In the 5' UTR of all IGLV subgroups, a TATA box was detected approximately 60 nucleotides upstream of the initiation codon (ATG) (Figure 7). Upstream of the TATA box, a highly conserved decamer element was identified in all subgroups and its consensus sequence was calculated as 5'-AGATTTGCAT-3'. A pentadecamer element was observed upstream of the decamer in all subgroups except for the IGLV3 subgroup. The position of the pentadecamer, as well as its sequence, varied among the genes of different subgroups in IGLV promoters. However, a noticeable conservation of this element can be observed in the genes of the same subgroup, thus its position and consensus sequence are distinctive and specific for every subgroup. A CCCT element was located between the pentadecamer and the decamer for the majority of subgroups. One or two repeats of the CCCT element were also detected between the decamer and the TATA box in all IGLV subgroups. Finally, an E-box motif (5'-CAnnTG-3') was found within the pentadecamer (for subgroups IGLV2, IGLV6, IGLV8, IGLV9, and IGLV11) and/or between the TATA box and ATG (for subgroups IGLV1, IGLV3, and IGLV9).

Overview of the Locus Genomic Organization of IGK Locus
The Macaca mulatta (rhesus monkey) IGK locus is localized on chromosome 13 from position 16,784,193 to 18,140,859 in CM014348.1, Mmul_10 and the orientation of the locus on the chromosome is FWD. The locus spans 1357 kb, from 10 kb upstream of the most 5' gene in the locus IGKV2-105 (P), to 10 kb downstream of the most 3' gene in the locus IGKC (F).
The locus representation (Figure 8) encompasses 1600 kb including the IMGT 5' borne PAX8 identified 335 kb upstream of IGKV2-105 and the IMGT 3' borne RPIA identified 94 kb downstream of IGKC. The IGK locus consists of 138 IGKV genes (110 genes localized on the locus + 28 unlocalized) belonging to seven subgroups and one clan, five IGKJ genes, and one IGKC gene (Table 1). The IGKV genes span 1331 kb, whereas the IGKJ and IGKC genes span 14 kb and 12 kb, respectively.
The subgroups IGKV1 and IGKV2 are the most represented (≥50 genes for each subgroup) [73]. The IGKV1 subgroup has the greatest number of F genes with 39 genes out of 56, followed by the IGKV2 subgroup with 21 F out of 50 (Table 12). The IGKV3 subgroup has 11 F genes out of 19. In contrast, the least represented subgroups are IGKV4, IGKV5, IGKV6, and IGKV7 with four, two, four, and one genes, respectively. The two genes of the IGKV5 subgroup are F. The IGKV7 subgroup comprises only one gene, IGKV7-13, which has two alleles with different functionalities (functional or pseudogene). In addition to the seven subgroups, the IGK locus also has one clan, IGKV(II), which comprises two truncated pseudogenes.
Finally, the IMGT® databases count 138 IGKV genes and 206 alleles. All five J genes of the IGK locus are F. For now, one allele of each IGKJ gene has been annotated (Table 13). The only IGKC is F and currently two alleles have been described. Moreover, 43 additional (21 F, two ORF, 20 P) IGKV genes have been annotated from the IMGT® reference sequence IMGT000063 (Mmul_10) compared to the IGK locus sequence NW_001099007 (Mmul_051212). However, four IGKV genes missing from the sequence IMGT000063 (IGKV2-13-1, IGKV2-13-2, IGKV1-25-1, and IGKV3-26-1) were found in the accession number NW_001099007.

Table 13. For each IGKJ set, number of IGKJ genes per functionality and, between parentheses, number of alleles.
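The locus sizes quoted in the text follow directly from the chromosome coordinates. The short check below assumes that positions are 1-based and inclusive and that sizes are rounded to the nearest kilobase; the function name `span_kb` is illustrative.

```python
def span_kb(start: int, end: int) -> int:
    """Length of an inclusive interval, rounded to the nearest kb."""
    return round((end - start + 1) / 1000)


# Coordinates as given in the text for the Mmul_10 assembly.
print(span_kb(167_900_000, 169_868_564))  # IGH on chromosome 7: 1969
print(span_kb(16_784_193, 18_140_859))    # IGK on chromosome 13: 1357
```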

CDR Distributions & Proteins Displays
The CDR-IMGT length [6.3.7] is conserved within IGKV1, which is the most represented subgroup with 42 F, one ORF, and three in-frame P. This CDR-IMGT length is also observed in the subgroups IGKV3, IGKV5, and IGKV6, with 11 F, one ORF for the IGKV3, two F for IGKV5, and two F, one in-frame P for the IGKV6. The CDR2-IMGT length is the same for all genes of all subgroups, whereas the CDR1-IMGT length and CDR3-IMGT length are variable (Table 14). The CDR1-IMGT and CDR3-IMGT lengths fluctuate from 6 to 12 AA and from 4 to 8 AA, respectively. The four conserved AA (C23, W41, hydrophobic 89 and C104) are present in all functional IGKV genes. However, due to their numbers, only a few genes are shown in Figure 9. In contrast, for ORF or P alleles shown in Figure 9, some conserved AA are replaced by another AA. For example, the allele IGKV2-87*01 (ORF) has an arginine (R) instead of the C104; the allele IGKV3-88*01 (P) has a leucine (L) instead of the W41 and C104; the allele IGKV2-39*01 (P) has a phenylalanine (F) instead of the C23. Whether they are F, ORF or in-frame P, the IGKV genes count 26 AA in the FR1-IMGT. However, the allele IGKV2-87*01 has an insertion of 1 AA (V) at position 20A in the FR1-IMGT. The IGKV genes of Macaca mulatta have no gaps in FR2-IMGT (17 AA from position 39 to 55), but for the FR3-IMGT there are gaps at positions 73, 81, and 82 according to the IMGT unique numbering [70].

RS Sequences
The V-HEPTAMER and V-NONAMER consensus sequences of all IGKV functional genes are 'cacagtg' and 'acaaaaacc' (Figure 3). However, they may be distinct for particular subgroups: the consensus sequence of the V-HEPTAMER of IGKV6 subgroup is 'cacactg'. Concerning the V-NONAMER, only IGKV5 and IGKV7 subgroups share the IGKV 'acaaaaacc' consensus sequence (Supplementary Table S3a).

5' UTR Analysis of the IGKV Subgroup
In the 5' UTR of the IG kappa chain, right upstream of the initiation codon, the TATA box motif was identified first in all of the subgroups. It was located on average 53 nucleotides upstream of the initiation codon (ATG) and it is composed of five to ten A/T repeats. Upstream of the TATA box, a CCCT element (TCCT for the IGKV5 subgroup) was observed in all subgroups except for IGKV7. The most conserved core promoter element was the decamer (5'-nnATTTGCAT-3'), located upstream of the previously mentioned core elements (TATA box and CCCT element). The pentadecamer motif was located within 11-90 nucleotides upstream of the decamer. The consensus sequence of the pentadecamer is 5'-TGCAnCTGTGnCCAG-3' and it was characterized by an inner E-box motif 5'-CAnnTG-3' except for the subgroups IGKV2 and IGKV4. An additional E-box was observed in subgroups IGKV3 and IGKV4 between the decamer and the TATA box (Figure 10).
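Consensus motifs written with 'n' wildcards, such as the decamer, pentadecamer, and E-box above, can be located in a promoter sequence with regular expressions. The sketch below is not the procedure used in the study (which relied on multiple sequence alignment and dedicated promoter tools); the function name `find_motif` and the toy sequence are illustrative, and exact matching to the consensus is assumed.

```python
import re


def find_motif(seq: str, motif: str) -> list[int]:
    """Return 0-based start positions of all (possibly overlapping) matches.

    'n' in the motif matches any of A, C, G, T; matching is case-insensitive.
    """
    body = "".join("[ACGT]" if ch in "nN" else ch.upper() for ch in motif)
    # A lookahead makes each match zero-width, so overlapping hits are found.
    pattern = re.compile(f"(?={body})")
    return [m.start() for m in pattern.finditer(seq.upper())]


# Toy promoter fragment (not a real IGKV sequence): a pentadecamer
# followed by a short spacer and a decamer.
toy = "GG" + "TGCAACTGTGACCAG" + "TTT" + "GGATTTGCAT"

print(find_motif(toy, "TGCAnCTGTGnCCAG"))  # [2]
print(find_motif(toy, "CAnnTG"))           # [4]
print(find_motif(toy, "nnATTTGCAT"))       # [20]
```

In this toy fragment the E-box starts two nucleotides into the pentadecamer, which mirrors the inner E-box described in the text; the distance between elements is the difference between the reported start positions.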

Discussion
The Macaca mulatta (rhesus monkey) is one of the most widely used primate species in biomedical research and is used extensively as a model for studying human diseases, as it is evolutionarily close to humans [74][75][76]. In this study, an in silico investigation of the heavy and light chain IG genes was conducted based on the IMGT biocuration pipeline by using the "representative genome" (assembly Mmul_10) of the rhesus monkey from NCBI. Moreover, after a benchmarking of all the rhesus monkey assemblies available on NCBI, the Mmul_10 assembly [24] was found to be of better quality in terms of number of gaps and correct order of clusters for each IG locus. For example, within the assembly Mmul_8, the order of clusters from 5' to 3' for the IGH locus is: V-CLUSTER -> D-CLUSTER -> J-CLUSTER -> C-CLUSTER -> V-CLUSTER instead of V-CLUSTER -> D-CLUSTER -> J-CLUSTER -> C-CLUSTER. As another example for the IGH locus, the Mmul_8 assembly has 92 gaps, the Mmul_051212 assembly has 65 gaps, and the rheMacS_1.0 assembly has six gaps, while the Mmul_10 has only two gaps. It is worth noting that the current study focuses on the analysis of the Mmul_10 assembly, including the previously available rhesus monkey data, within IMGT®. However, a considerable IG genetic diversity in the form of allelic polymorphism and structural variation has been shown in previous genomic and germline gene inference studies [27,28], the results of which need to be taken into account for an improved overview of the IG gene repertoires of Macaca mulatta. Future work, within IMGT®, aims at analyzing the additional assemblies (for example RheMacS assembly [27] and the ASM545330 contigs [77]) based on the same model as well as incorporating the inferred alleles validated by the community. This will lead to the description of haplotypes taking into account the localization and the relative order of the genes in each locus in order to highlight new allelic polymorphisms and/or structural variations.
This will also lead to more accurate gene assignments and calculations of somatic hypermutation (SHM) in this species. Extreme caution should be taken to correctly identify this genetic diversity.
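The two benchmarking criteria mentioned in this section (gap count and cluster order for the IGH locus) can be illustrated with a minimal Python sketch. The gap counts and cluster orders are those quoted in the text; the function names and the data layout are hypothetical and do not correspond to any IMGT tool.

```python
# Gap counts for the IGH locus, as quoted in the Discussion.
IGH_GAPS = {"Mmul_8": 92, "Mmul_051212": 65, "rheMacS_1.0": 6, "Mmul_10": 2}

# Expected 5'-to-3' order of clusters in the IGH locus.
EXPECTED_IGH_ORDER = ["V-CLUSTER", "D-CLUSTER", "J-CLUSTER", "C-CLUSTER"]


def has_expected_order(observed, expected=EXPECTED_IGH_ORDER):
    """Return True if the observed cluster order equals the expected one."""
    return list(observed) == list(expected)


def rank_by_gaps(gaps):
    """Return assembly names sorted from fewest to most gaps."""
    return sorted(gaps, key=gaps.get)


# Cluster order reported for Mmul_8 (an extra V-CLUSTER after the C-CLUSTER).
mmul_8_order = ["V-CLUSTER", "D-CLUSTER", "J-CLUSTER", "C-CLUSTER", "V-CLUSTER"]

if __name__ == "__main__":
    print("Assemblies ranked by IGH gaps:", rank_by_gaps(IGH_GAPS))
    print("Mmul_8 has expected order:", has_expected_order(mmul_8_order))
```

This sketch only ranks on the two criteria named in the text; the actual benchmarking may have weighed other aspects of assembly quality.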
Taking advantage of the IMGT biocuration pipeline, the IG germline repertoire and the IMGT® reference directories were established according to the IMGT® nomenclature. The annotation of sequences, genes and structural data was integrated in the IMGT® databases, tools and web resources. As a result of this effort, 597 IG genes and 908 IG alleles have been integrated into IMGT®. Despite the high similarity of the human and rhesus monkey genomes, some differences have been observed. In particular, the genomic organization and characterization of IG genes highlighted that, based on the assembly Mmul_10 of the rhesus monkey and the data available in IMGT/GENE-DB, the rhesus monkey has more V genes than human (excluding orphons). For example, IMGT/GENE-DB contains 228 IGHV genes for rhesus monkeys versus 162 IGHV genes for humans; 149 IGLV genes for rhesus monkeys versus 79 IGLV genes for humans; and 138 IGKV genes for rhesus monkeys versus 77 IGKV genes for humans. This difference is reflected in the number of functional (F) genes found in these species: 86 F IGHV genes for rhesus monkeys versus 57 F IGHV genes for humans; 73 F IGLV genes for rhesus monkeys versus 33 F IGLV genes for humans; and 85 F IGKV genes for rhesus monkeys versus 41 F IGKV genes for humans. It would be interesting to study the genomic organization and characterization of IG genes in the other available Macaca mulatta assemblies, as well as in the genomes of additional nonhuman primates, to determine the proportion of genes for each species and track their evolution.
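As a rough illustration of the proportions mentioned above, the following Python sketch computes the fraction of functional V genes per locus and species from the counts quoted in this paragraph. It assumes that the functional counts and the total counts are directly comparable (same IMGT/GENE-DB release and same inclusion criteria), which the text does not state explicitly; the variable and function names are hypothetical.

```python
# (functional V genes, total V genes) as quoted in the paragraph above.
V_GENE_COUNTS = {
    "IGHV": {"rhesus monkey": (86, 228), "human": (57, 162)},
    "IGLV": {"rhesus monkey": (73, 149), "human": (33, 79)},
    "IGKV": {"rhesus monkey": (85, 138), "human": (41, 77)},
}


def functional_fraction(functional, total):
    """Return the fraction of functional genes among all genes."""
    if total <= 0:
        raise ValueError("total must be positive")
    return functional / total


if __name__ == "__main__":
    for group, by_species in V_GENE_COUNTS.items():
        for species, (functional, total) in by_species.items():
            fraction = functional_fraction(functional, total)
            print(f"{group} {species}: {functional}/{total} = {fraction:.1%}")
```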
It was noted that, even though the germline CDR-IMGT lengths of the rhesus monkey vary between subgroups, and in some cases within a subgroup, certain CDR-IMGT lengths are much more frequent and are found in several subgroups. For example, the CDR-IMGT lengths [8.8.2] are frequently found in subgroups IGHV1, IGHV3, IGHV4, IGHV5, and IGHV7.
The heptamer and nonamer of the recombination signals (RS) of the V, D and J genes play a crucial role in the recombination process. Their first three and last three nucleotides, as well as the poly-A or poly-T tract of the nonamer, are the specific characteristics used to identify them on genomic sequences (Figure 3). To determine whether a heptamer or nonamer is canonical, and therefore whether it could be useful during the rearrangement process, the IMGT® annotation rule for RS is as follows: if a heptamer or a nonamer is found in more than one functional gene in IMGT/GENE-DB for a given locus, whatever the species, it is considered canonical. If not, heptamers with at most one mutation and nonamers with at most two mutations compared with the corresponding consensus sequences in the locus are also considered canonical. However, there are exceptions in which the heptamer or nonamer does not obey this rule but the gene is nevertheless found rearranged in cDNAs available in IMGT/LIGM-DB and generalist nucleotide databases (GenBank and ENA). For example, the J-HEPTAMER of the gene IGHJ2 ('ggctgtg' instead of 'cactgtg' or 'caatgtg') has been found rearranged in 159 cDNAs in IMGT/LIGM-DB [39], and the V-HEPTAMER of the allele IGLV8-125*01 ('cacggcg' instead of 'cacagtg') has also been found in cDNAs. Interestingly, the allele IGLV3-16*01, which is a pseudogene because there is no INIT-CODON (https://www.imgt.org/ligmdb/view?id=IMGT000062, accessed on 11 November 2021), is found rearranged in the cDNA accession number KCWN01002936.1 (https://www.ncbi.nlm.nih.gov/nuccore/KCWN01002936.1, accessed on 11 November 2021).
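The RS annotation rule described above can be sketched in Python as follows. This is a minimal illustration, not the IMGT implementation: "mutation" is interpreted here as a single-nucleotide substitution (Hamming distance), and `functional_gene_count` is a hypothetical input standing for the number of functional genes in IMGT/GENE-DB, for the locus, that carry exactly this motif.

```python
# Maximum number of mutations tolerated relative to a consensus (from the rule).
MAX_MUTATIONS = {"heptamer": 1, "nonamer": 2}


def count_substitutions(motif, consensus):
    """Count positions at which two equal-length sequences differ."""
    if len(motif) != len(consensus):
        raise ValueError("motif and consensus must have the same length")
    return sum(a != b for a, b in zip(motif.lower(), consensus.lower()))


def is_canonical(motif, kind, functional_gene_count, consensus_sequences):
    """Apply the two-step rule: shared by >1 functional gene, else near consensus.

    kind is "heptamer" or "nonamer"; functional_gene_count is a hypothetical
    lookup result (see lead-in); consensus_sequences lists the locus consensuses.
    """
    if functional_gene_count > 1:
        return True
    limit = MAX_MUTATIONS[kind]
    return any(count_substitutions(motif, c) <= limit for c in consensus_sequences)


if __name__ == "__main__":
    # IGHJ2 J-HEPTAMER from the text, assuming it is not shared by more than
    # one functional gene (this count is an assumption for illustration).
    print(is_canonical("ggctgtg", "heptamer", 1, ["cactgtg", "caatgtg"]))
    # IGLV8-125*01 V-HEPTAMER from the text, under the same assumption.
    print(is_canonical("cacggcg", "heptamer", 1, ["cacagtg"]))
```

Under these assumptions both example heptamers differ from their consensus sequences at two or more positions, which is consistent with the text describing them as exceptions to the rule.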
The identification of the core regulatory elements, located in the 5' UTR of the IG genes, reveals their conservation among eukaryotic species and suggests an essential role in the transcriptional activity of the promoter. In this IG 5' UTR analysis of the rhesus monkey, all the core regulatory elements of the promoter region that have been experimentally described in other eukaryotic species [29,30] have also been identified. Each IG chain type, heavy and light, is characterized by different elements; however, a TATA box is always present (Figures 4, 7 and 10). The conservation of this A/T-rich region upstream of the initiation codon supports its important role in transcriptional promoter activity. The rhesus monkey heavy chain promoters contain the octamer motif upstream of the TATA box, whereas light chain promoters contain the same sequence in an inverted and complementary form at the same position, with two additional conserved nucleotides at the 5' end (decamer). This element (octamer/decamer) is highly conserved in all of the subgroups in both heavy and light chain promoters, and its presence seems to be necessary and sufficient for promoter activation [78,79]. In addition, experimental data in eukaryotes have shown that the octamer/decamer binds sequence-specific regulatory factors (Oct family proteins) in order to achieve a high transcriptional stimulation [78,79].
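The relationship between the heavy chain octamer and the light chain decamer (inverted and complementary form, plus two nucleotides at the 5' end) can be illustrated with a short Python sketch. The octamer sequence ATGCAAAT used below is the widely cited consensus and is an assumption here, since the sequence is not given in this section; the decamer example and the function names are likewise hypothetical.

```python
# Translation table for complementing DNA (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def decamer_contains_inverted_octamer(decamer, octamer):
    """Check that a 10-nt decamer is 2 nt (5' end) + the inverted octamer."""
    if len(decamer) != 10 or len(octamer) != 8:
        return False
    return decamer[2:].upper() == reverse_complement(octamer).upper()


if __name__ == "__main__":
    octamer = "ATGCAAAT"      # assumed consensus, not taken from this section
    decamer = "TGATTTGCAT"    # hypothetical light chain decamer for illustration
    print("Inverted octamer:", reverse_complement(octamer))
    print("Decamer matches:", decamer_contains_inverted_octamer(decamer, octamer))
```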
The heptanucleotide and the pentadecamer are two core elements characteristic of heavy chain and light chain promoters, respectively. Their sequences in our dataset are highly similar to the consensus sequences previously identified in experimental studies of eukaryotic species [80,81]. Although they align well with these consensus sequences, in some cases nucleotide changes appear in their motifs and are specific to the subgroup in which they are identified. These two elements seem to remain conserved to ensure a high rate of transcriptional activation, acting synergistically with the octamer or decamer elements [78,79]. However, the precise mechanism underlying this synergism remains to be elucidated for Macaca mulatta.
Our IG promoter analysis revealed additional regulatory elements in the 5' flanking sequences of immunoglobulin V genes, which could increase the rate of transcription [78]. These elements include a pyrimidine-rich region in heavy chain promoters (Figure 4), a CCCT element in light chain promoters (Figures 7 and 10), and additional E-boxes and TATA boxes in both heavy and light chain promoters. More specifically, a CCCT element occurs in one or two repeats in the sequence flanking the decamer of kappa and lambda chain UTRs. It has been shown experimentally in mouse cells that this element is very important for IG expression, and that even in cases where the pentadecamer is absent it acts along with the decamer to stimulate transcription [79]. Moreover, a previous study [78] reports that in promoters where the CCCT element is missing, an E-box is detected and could support transcriptional activity, an observation that we also make in our analysis. For example, in the IGLV3 subgroup (Figure 7), the promoter region lacks the pentadecamer, but the presence of two repeats of the CCCT element in the 5' and 3' sequences flanking the decamer, together with the E-box, could probably activate transcription.
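A simple way to locate such elements in a 5' flanking sequence is a pattern scan, sketched below in Python. The E-box pattern CANNTG is the commonly used consensus and is an assumption here, as is the toy sequence; neither is taken from this study, and the function names are hypothetical.

```python
import re

CCCT = "CCCT"
E_BOX = "CA[ACGT]{2}TG"  # assumed E-box consensus (CANNTG)


def find_motif(sequence, pattern):
    """Return 0-based start positions of all matches, including overlapping ones."""
    lookahead = re.compile(f"(?=(?:{pattern}))")
    return [match.start() for match in lookahead.finditer(sequence.upper())]


def summarize_elements(sequence):
    """Report where CCCT elements and E-boxes occur in a flanking sequence."""
    return {
        "CCCT": find_motif(sequence, CCCT),
        "E-box": find_motif(sequence, E_BOX),
    }


if __name__ == "__main__":
    toy_flank = "ttCCCTaaCAGCTGtt"  # hypothetical sequence for illustration only
    print(summarize_elements(toy_flank))
```

A scan of this kind only reports sequence matches; whether a given match is a functional element would still require the positional context and experimental evidence discussed in the text.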
Overall, our findings support the high degree of conservation of the fundamental regulatory promoter elements upstream of the initiation codon (ATG) of a V gene, as well as the existence of a distinct IG promoter organization. Despite this conservation, the consensus sequences and distances of the core elements vary significantly between subgroups, which suggests that they could serve as an auxiliary criterion for subgroup characterization and gene classification within subgroups. Notably, this observation, along with the results of recent polymorphism studies of human antibody upstream sequences, could be valuable for the annotation of IG variable genes [29,30]. The elements identified in our study were previously described in other eukaryotic species, where they were shown to be necessary for IG transcription and to compensate for one another, as their presence or absence could increase or decrease transcriptional activity, respectively [78,79]. Therefore, based on evo-devo scenarios and conservation, we hypothesize that the Macaca mulatta elements could have a similar fundamental role in IG regulation as in other eukaryotic species; however, further in-depth experimental research is needed to validate this hypothesis. There is increasing evidence that 5' UTRs have a fundamental role in gene expression, and further genomic studies could be performed to clarify and even explain IG expression differences among individuals.
This study focuses on the Mmul_10 assembly of the rhesus monkey in order to provide the scientific community with rich and precise details about the genes and alleles in this genome. For a given gene, up to 200 different information fields may be available for its annotation (IMGT labels of description, IMGT nomenclature, IMGT numbering, functionalities, isolate, reference sequence and sequences from the literature, description of alleles, protein display, Colliers de Perles, FR and CDR lengths, etc.). This includes a detailed characterization of the regulatory elements of their promoters and of their conventional recombination signals (RS), an overview of which is presented in this manuscript.
Taking into account the IG genetic diversity in the form of allelic and structural variations previously described [27,28], the approach described in this manuscript, and the resulting characterization of IG genes in Mmul_10, will be used as a reference and model for the annotation of other published assemblies. The identification of the position and order of genes in the IG loci of Mmul_10 provides a starting point for the characterization of allelic polymorphisms and of structural variations.
A comprehensive understanding of both innate and adaptive immune responses is essential for vaccine design and development [82,83]. Characterization of the Macaca mulatta IG loci in genomic assemblies, as shown in this study, provides important baseline information for this model species. However, the characterization of IG loci of additional assemblies and different animals will be required to generate a more complete IGH reference directory within IMGT, work that is currently underway.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/vaccines10030394/s1, Table S1: Information regarding the IGH RS consensus sequence by subgroup or set of the rhesus monkey (Macaca mulatta). Table S2: Information regarding the IGL RS consensus sequence by subgroup or set of the rhesus monkey (Macaca mulatta). Table S3: Information regarding the IGK RS consensus sequence by subgroup or set of the rhesus monkey (Macaca mulatta). Figure S1: Zoom of the J-C-CLUSTER of the rhesus monkey (Macaca mulatta) IGL locus.