While the library size can be defined by one number (N) and the effective size can be assessed by the percentage of antibody display, the diversity of a library has a somewhat elusive definition. Sometimes the size of an antibody repertoire is used as a proxy to characterize its diversity. However, it assumes that the antibody repertoire is a random set of antigen-binding site shapes and amino acid sidechains. Furthermore, not all antibody repertoires are created equal.
For instance, if one were to generate a library of antibody variants by diversifying all 11 positions of the LCDR1 with the 20 amino acids, the resultant repertoire would be 20¹¹, or ≈2 × 10¹⁴, antibody variants. This would appear to be a diverse repertoire based on its size, and it would thus presumably lead to a large number of functional antibodies. Nevertheless, most of the amino acids in LCDR1 are located in the periphery of the antigen-binding site and have a marginal role in defining the binding properties of antibodies, especially for small and mid-sized targets [40]. Therefore, such a library would very likely be of limited use. In contrast, a library of the same size, e.g., 11 positions mutated to the 20 amino acids, but in the HCDR3, which is located at the center of the antigen-binding site and plays a critical role in recognizing diverse targets, would likely yield antibodies with reasonable affinity [41]. Hence, diversity is not only the number of in-frame and/or non-toxic antibody variants, but also the number of functional molecules capable of recognizing as many diverse targets as possible.
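The library-size arithmetic in this example can be checked with a short calculation. The sketch below is illustrative only: the helper name `theoretical_diversity` is ours and does not belong to any library-design tool. It counts the sequences obtained when a given number of positions is fully randomized with 20 amino acids.

```python
def theoretical_diversity(positions: int, alphabet_size: int = 20) -> int:
    """Count the sequences possible when each of `positions` sites can hold
    any of `alphabet_size` residues (hypothetical helper, illustration only)."""
    if positions < 0 or alphabet_size < 1:
        raise ValueError("positions must be >= 0 and alphabet_size >= 1")
    return alphabet_size ** positions


# Eleven fully randomized positions, whether placed in LCDR1 or in HCDR3.
lcdr1_variants = theoretical_diversity(11)
hcdr3_variants = theoretical_diversity(11)

print(lcdr1_variants)                     # 204800000000000
print(f"{lcdr1_variants:.2e}")            # 2.05e+14
print(lcdr1_variants == hcdr3_variants)   # True
```

The two counts are identical, which is the point of the example: the size of a repertoire by itself says nothing about where the diversity sits in the antigen-binding site.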
The first functional antibody repertoires used as a substrate to build phage display antibody libraries were obtained from immunized animals [42] or humans [43]. Libraries of this type, known as immune libraries, have had limited application as general-purpose platforms for human therapeutic antibody discovery. Those obtained from immunized animals are single-use libraries, as they are biased toward recognition of the antigen used as the immunogen. More importantly, the antibodies obtained from such libraries, being nonhuman proteins, require further engineering (humanization), which increases the cost of the antibody drug development process. Immune human antibody libraries, for ethical reasons, have mostly been used to isolate antibodies against infectious diseases [44]. Nevertheless, it is worth noting that relatively recent studies of the immune repertoire of camelid V regions have shown CDR conformations identical to those found in the human V germline gene repertoire [45]. This similarity has led to the development of a discovery platform by ArgenX based on the immunization of camelids. ArgenX has discovered several promising therapeutic antibodies that are difficult to obtain by other means [46], with some of them in advanced stages of clinical trials (https://www.argenx.com/en-GB/content/argenx-in-short/2/).
The first human universal or general-use library was published in the early 1990s [9]. This library was generated with the repertoire of human genes encoding the circulating antibodies. Libraries generated with this source of antibody genes have been called naïve, as they are not biased toward any particular target. Although successful, naïve libraries included antibody genes toxic to E. coli, which, as discussed above, compromised the effective size of the library. Synthetic antibody libraries followed to partially mitigate this limitation [38]. In this alternative approach, the libraries were carefully designed by making assumptions on the number of scaffolds, the positions to diversify, the types of amino acids to include in the design, and the proportion of each amino acid per diversified position. These assumptions did not always hold true, particularly at the HCDR3. To avoid making assumptions on the structure of the CDRs, a combination of synthetic scaffolds with natural CDRs has been used [47], leading to the construction of the so-called semisynthetic libraries.
Table 1 includes four naïve and four synthetic libraries, as well as one semisynthetic library. Two of the naïve libraries were developed by Xoma [26]: one as Fab (XFab1) and the other as scFv (XscFv2). A third naïve library was published by Kugler et al. [27] at the Technische Universität Braunschweig, Institut für Biochemie, Biotechnologie und Bioinformatik (TU-IB). It combines kappa (HAL10) and lambda (HAL9) light chains displayed as scFvs. The fourth and most recently published naïve library [28] was developed at Kangwon National University (KNU). It contains only kappa-type light chains, and thus we called it KNU-Fab. The synthetic libraries listed in Table 1 include two generations of libraries from MorphoSys: HuCAL PLATINUM [30] and Ylanthia [31]. Both are optimized versions of the initial HuCAL and HuCAL GOLD [38] libraries. Further, Table 1 lists one synthetic library from Janssen Biotherapeutics, namely pIX V3.0 [29]. This was the first Fab antibody library displayed as pIX instead of pIII fusions. The fourth synthetic library is PHILODiamond [32]. This is a minimalist library built on only three scaffolds: one VH and two alternative VLs, one kappa and the other lambda. PHILODiamond is the latest version of a series of antibody libraries of increasing size and complexity generated by Dario Neri’s laboratory over the last 20 years, e.g., ETH2 (Eidgenössische Technische Hochschule 2) [50], ETH2Gold [51], and PHILO-1 and PHILO-2 [52]. The semisynthetic library was called ALTHEA (from the Greek “to heal”) Gold Libraries [33]. It was published by our group in collaboration with Antibody Design Labs (ADL) and The Tri-Institutional Therapeutics Discovery Institute (TDI). ALTHEA Gold Libraries™ combined one VH scaffold with two Vκ scaffolds. The diversity of the HCDR3 came from a large pool of 200 human donors. In the following sections, we describe in detail each of the naïve, synthetic, and semisynthetic libraries listed in Table 1.
4.1. Naïve Libraries
Naïve libraries do not make assumptions about the diversity of the antibody repertoire. The rationale is that the human antibody repertoire evolved to recognize any target with a reasonable specificity and affinity. Therefore, the goal when building naïve libraries is to mirror the diversity of the human antibody repertoire while avoiding biases and redundancy due to the immunological history of a few individuals and/or rare polymorphic antibody genes present in the repertoire of a given ethnic group.
Xoma used 30 ethnically diverse healthy donors and a variety of tissues to amplify the VH and Vκ and Vλ chains using RT-PCR. The tissue samples included 20 peripheral blood mononuclear cell (PBMC) samples, eight bone marrow samples, one spleen sample, and one lymph node sample. The amplification strategy encompassed all the Immunoglobulin (Ig) classes: IgM, IgG, IgA, IgE and IgD. The libraries were displayed in Fab or scFv formats to assess potential differences in the selection of antibodies depending on the display format. Both libraries yielded a similar number of unique antibodies with similar affinity (discussed in more detail below), indicating that the display format did not significantly impact the outcome of the selections.
HAL9/10 also used blood samples from diverse ethnic groups, including Caucasian, African, Indian, and Chinese donors. The libraries, one kappa (HAL10) and the other lambda (HAL9), contained the same VH repertoire obtained from 98 donors but differed in their light chain repertoires. HAL9 included all lambda subfamilies from the 98 donors, whereas HAL10 was generated with 54 donors and contained all kappa families except the IGKV7 pseudogene. The amplification strategy included a reverse primer for V regions derived from IgMs, thus favoring the amplification of naïve antibody genes, i.e., close to the germline configuration and hence with few or no somatic mutations in VH.
The KNU-Fab library was generated with a larger pool of donors than the Xoma and HAL9/10 libraries. It included 803 PBMC donors, two lymph node donors, two spleen donors, and two bone marrow donors, totaling 809 samples. Thirty-three PBMC samples were obtained from healthy Korean human donors and 770 samples were obtained from commercial samples, probably from donors of diverse ethnic backgrounds. The antibody genes were amplified using forward and reverse primers hybridizing in the V regions, and thus agnostic as to the Ig class, isotype, or whether the antibodies were close to the germline gene configuration or had a substantial number of somatic mutations.
The gene usage prior to the selections of the libraries was reported for the KNU-Fab and HAL9/10 libraries, whereas for Xoma’s libraries, only the gene family usage was described. A total of 7373 unique VH sequences and 41,804 unique Vκ sequences from the KNU-Fab library were studied to assess its diversity prior to selections with any target. Likewise, for HAL9/10, a total of 827 full-length scFv sequences were analyzed from HAL9 and 466 sequences from HAL10. Figure 2 shows the frequency of the ten most prevalent IGHV genes (left) and the five most used IGKV genes (right) reported for KNU-Fab and HAL10. As a reference, we added Glanville’s [53] study of the gene usage of Pfizer’s naïve library and of antibody sequences deposited in the databases.
The ten most prevalent IGHV genes are only 20% of the ≈50 potentially functional IGHV genes in the human genome [54]. These ten genes covered 70% of the KNU-Fab sequences studied and close to half (47%) of all the HAL9/10 sequences studied. The frequency of these IGHV genes follows a similar trend in all the samples, except for the genes IGHV4-30, IGHV2-05, and IGHV3-11, which are overrepresented in KNU-Fab but have a low frequency in the other samples. Interestingly, genes from the IGHV4 family have been found to be negatively selected in other libraries [56], as they have been reported to be toxic to B cells [57]. In VL, five of the ≈35 [58] potentially functional IGKV genes in the human genome (14%) accounted for 68% of the KNU-Fab sequences and 58% of the HAL10 sequences. IGKV1-39 was overrepresented in KNU-Fab, comprising nearly 40% of all the sequences, whereas the IGKV4-1 gene was overrepresented in HAL10, with a frequency close to 25%. Interestingly, IGKV2-28 was expressed in all the samples except HAL10. Taken together, the highly skewed nature of the IGV gene usage in the naïve libraries indicates that only a few antibody genes are enough to cover the diversity needed to recognize diverse antigens in a phage display antibody library.
Regarding the HCDR3 diversity, all four naïve libraries showed the typical Gaussian distribution of HCDR3 lengths observed in humans [59]. Both of Xoma’s libraries have a similar HCDR3 length distribution, with the most frequent lengths between 5 and 25 and an average length of 15 amino acids according to the Immunogenetic Database (IMGT®) definition [60], or 13 amino acids based on Kabat’s definition [36]. The HCDR3 lengths of HAL9 and HAL10 ranged from 5 to 35 amino acids, with low frequencies of long lengths (25–35 amino acids) and a median value of 14 amino acids (IMGT definition) or 12 amino acids (Kabat’s definition). The KNU-Fab HCDR3 lengths ranged from 4 to 19 amino acids, with the most frequent HCDR3 loop lengths being 11 or 12 amino acids (Kabat’s definition).
In summary, diverse pools of human donors were used to build the naïve libraries. In some cases, for instance the Xoma and KNU-Fab libraries, the authors cast as wide a net as possible to amplify all the human antibody genes, regardless of the isotype or number of somatic mutations. In other cases, such as HAL9/10, the amplification strategy was constrained to sequences close to the germline gene configuration. In all of the libraries, only a few antibody genes represented the bulk of the studied sequences, consistent with the patterns observed in the repertoire of human antibodies [53]. The HCDR3 length distribution was also similar in all the libraries and mirrored the Gaussian distribution typical of human antibodies.
4.2. Synthetic Libraries
Synthetic repertoires are intended to maximize the functionality of the antibody libraries by using well-expressed and developable scaffolds, targeting positions for diversification that do not disrupt folding of the V regions, and selecting types and frequencies of amino acids that facilitate selection of diverse binders to any given target. The first synthetic libraries, HuCAL and HuCAL GOLD [38], were developed by MorphoSys in the late 1990s. In common with its predecessor HuCAL GOLD, HuCAL PLATINUM (Table 1) was built with seven VH and seven VL master scaffolds, which, when combined, yielded 49 antibody sub-libraries. These scaffolds were designed with consensus sequences representing the IGV gene families of the human repertoire. The sequences of the scaffolds were optimized for high expression in E. coli and display on the phage. The six CDRs were randomized using trinucleotide mutagenesis (TRIM) technology [61]. This synthesis technology is based on trimers encoding the 20 amino acids instead of mixes of oligonucleotides. In this way, the quality of the synthetic genes increased significantly, since TRIM achieved precise combinations of amino acids at the positions targeted for diversification while avoiding stop codons and unwanted amino acids that can disrupt antibody folding.
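To illustrate why mixed-nucleotide synthesis introduces stop codons that codon-by-codon synthesis avoids, the sketch below enumerates the codons produced by two widely used degenerate schemes, NNN and NNK (K = G or T). These schemes are our own example and are not named in the text; the helper names are hypothetical, and the translation table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC ambiguity codes used by the two example schemes.
IUPAC = {"N": "ACGT", "K": "GT"}


def expand(scheme):
    """List every codon encoded by a three-letter degenerate scheme."""
    return ["".join(c) for c in product(*(IUPAC[s] for s in scheme))]


def summarize(scheme):
    """Count codons, stop codons, and distinct amino acids for a scheme."""
    translated = [CODON_TABLE[c] for c in expand(scheme)]
    return {
        "codons": len(translated),
        "stops": translated.count("*"),
        "amino_acids": len(set(translated) - {"*"}),
    }


for scheme in ("NNN", "NNK"):
    info = summarize(scheme)
    # Chance that 11 independently randomized positions contain no stop codon.
    p_clean = (1 - info["stops"] / info["codons"]) ** 11
    print(scheme, info, f"stop-free over 11 positions: {p_clean:.2f}")
```

With trinucleotide building blocks the stop count is zero by construction and the proportion of each amino acid can be set per position, which is the advantage described above.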
One of the main differences between GOLD and PLATINUM is that the latter includes newly designed HCDR3 sequences. The new strategy was based on a systematic analysis of the amino acid usage per position of HCDR3 sequences of different loop lengths. The synthetic HCDR3 fragments were then designed using length-dependent amino acid frequencies, instead of a relatively uniform amino acid distribution for HCDR3s of all lengths. In addition, PLATINUM used loop lengths from 4 up to 23 amino acids, covering over 95% of the naturally occurring HCDR3 lengths. Further, potential N-glycosylation sites generated in HuCAL and GOLD by the NXS/T pattern (where N is asparagine, X is any amino acid, and S/T is serine or threonine) were removed from PLATINUM. As a result of these changes, a side-by-side comparison of HuCAL GOLD and PLATINUM [30] indicated that the latter generated approximately 4-fold more unique and higher-affinity antibodies.
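Potential N-glycosylation sites of this kind can be flagged with a simple pattern search. The sketch below is not the MorphoSys procedure. It is an illustrative scan, the function name `find_sequons` and the test peptide are ours, and it adopts the common convention (an assumption beyond the text) that the middle residue X may be any amino acid except proline.

```python
import re

# N, then any residue except proline, then S or T.
# The lookahead lets overlapping motifs be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return 0-based start positions of potential N-glycosylation motifs."""
    return [match.start() for match in SEQUON.finditer(sequence.upper())]


# Toy peptide: NGS and NVT match, NPT is skipped because X is proline.
print(find_sequons("ARNGSYNPTWNVT"))  # [2, 10]
```

A design pipeline could apply such a filter to every candidate CDR sequence and discard or redesign those that return a non-empty list.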
Ylanthia [31] was yet another step in the MorphoSys path to enhance the performance of its antibody discovery platforms. Rather than consensus VH and VL master scaffolds, Ylanthia used selected VH/VL combinations in the germline gene configuration. The selection process that led to the final design included the human germline genes most prevalent in the human antibody repertoire and those covering the canonical structure repertoire seen in human antibodies. The canonical structures were discovered by Cyrus Chothia and Arthur Lesk [62] in the late 1980s. These authors found that, although the CDRs vary in sequence, five of the six CDRs (LCDR1, LCDR2, LCDR3, HCDR1, and HCDR2) had a limited set of main-chain conformations, or canonical structures. The canonical structure model has been updated over the last two decades [63] and has helped to develop 3D modeling strategies [65]. From a structure–function perspective, the canonical structure model suggested that structural constraints are at work in antigen recognition [66]. Recently, the application of clustering algorithms [67] to 300 non-redundant antibody structures has further stratified the canonical structure combinations by identifying 28 combinations of CDR lengths with canonical structures, whereas a previous analysis [63] covered only 20.
To build Ylanthia [31], this first in silico selection filter, based on antibody gene usage and canonical structures, generated 400 random VH/VL combinations. This set of scaffolds was then experimentally tested in several developability assays. The assays assessed the expression level, thermal and serum stability, and aggregation propensity in Fab and IgG1 formats. Additionally, the initial VH/VL combinations were tested for relative levels of Fab CysDisplay (a platform used by MorphoSys based on the expression of antibody fragments linked to phage particles by a disulfide bond). After the experimental developability screening, 36 VH/VL combinations, including 12 VH, 12 Vκ, and eight Vλ scaffolds, were used to build the libraries (Table 3).
In addition to this improvement with respect to the HuCAL series, Ylanthia’s CDRs were diversified based on a systematic analysis of a large set of rearranged human antibody sequences and of potential developability liabilities. Moreover, a new synthesis technology known as Slonomics [68] was used to generate the HCDR3 sequences. Slonomics is a fully automated gene synthesis platform developed by Sloning Biotechnology and acquired by MorphoSys. This platform is based on sets of double-stranded DNA triplets coding for the twenty amino acids. It enables the highly controlled synthesis of diverse combinatorial gene libraries with high fidelity, i.e., the amino acid frequencies observed in the libraries closely match those specified in the design. Like TRIM technology, this new synthesis method avoided stop codons and unwanted mixes of amino acids at the designed positions.
Janssen Bio’s libraries [29] were the first combinatorial synthetic Fab libraries displayed on pIX instead of the fusion partner pIII that had been used in all other libraries. The libraries were designed with three VH and four VL scaffolds encoded by human germline genes (Table 3 and Figure 3). These scaffolds were chosen based on their high usage in the human antibody repertoire and in naïve phage display libraries (see Section 4.1. Naïve Libraries), as well as on structural considerations. Specifically, the most used canonical structures with a propensity to bind proteins and peptides were represented in the libraries [69]. The CDRs were diversified in positions frequently found in contact with protein and peptide targets [37]. The diversification regime mirrored the variability and frequency of amino acids observed in the human germline genes and in antibodies isolated from natural sources [37]. Two sets of libraries were built. One, called pIX V2.0, had diversity focused on VH, with VL kept in the germline gene configuration. The other, discussed in this review, was called pIX V3.0 and had diversity in both VH and VL.
Recently, Teplyakov et al. [75] determined the structures of the VH/VL combinations used in pIX V3.0, which also happen to be common to other synthetic libraries and to ALTHEA Gold Libraries™ (Figure 3). Two of the VL scaffolds (IGKV1-39 and IGKV3-11) have the shortest LCDR1 loop observed in human antibodies [76]. Another (IGKV3-20) has an insertion in LCDR1, although the loop is still relatively short. The fourth scaffold (IGKV4-01) has the longest LCDR1 seen in the human germline gene repertoire, with an insertion of six residues with respect to IGKV1-39 and IGKV3-11. Altering the length of the LCDR1 from a short to a long loop changes the preference from binding protein targets to binding peptide targets [37]. Therefore, the inclusion of these IGKV genes with distinct LCDR1 lengths provided the pIX V3.0 libraries with the potential to recognize diverse types of targets. In VH, the HCDR1 has the same length in all three scaffolds, and the conformations of the HCDR1 in IGHV3-23 and IGHV5-51 were found to be remarkably similar [75], whereas IGHV1-69 showed large structural variability. This is probably due to two glycine residues in the HCDR1 of IGHV1-69, which provide more conformational freedom. Moreover, HCDR2 has two alternative conformations, one in IGHV3-23*01 and IGHV5-51*01, and another in IGHV1-69. Taken together, the structural variability seen in pIX V3.0 provided the libraries with distinct topographies and structural diversity to recognize diverse targets.
The thermal stability (Tm) of the pIX V3.0 VH/VL scaffold combinations in Fab format was also assessed by Teplyakov et al. [75]. All the VH/VL scaffold combinations except IGHV3-23*01:IGKV*01 had Tm values above 68 °C. Considering that the least stable domain of the human IgG1 (hIgG1) is the CH2, with a Tm of 68 °C [79
], the expectation was that Fabs derived from pIX V3.0 should be developable when converted to hIgG1, which is the therapeutic format most frequently used [12].
PHILODiamond [32] was built with only three scaffolds: one VH (IGHV3-23) and two VLs (Figure 3). The VL scaffolds were either kappa (IGKV3-20) or lambda (IGLV3-19*01). PHILODiamond was a new version of the ETH2Gold library [52]. The improvements with respect to the ETH2Gold library consisted of a new HCDR3 design with lengths of four to seven residues diversified with the 20 amino acids per position. In the ETH2Gold library, the HCDR3 had four and six randomized consecutive amino acids. PHILODiamond also has the LCDR3 fully diversified (20 amino acids) in five or six positions. Furthermore, residue 52 of VH was designed to be an asparagine (N) in order to facilitate hydrogen bonding interactions.
4.3. Semisynthetic Libraries
As discussed in the previous section, several iterations to improve the quality of synthetic libraries have been performed, with the design of the HCDR3 being a constant theme and perhaps the main opportunity for improvement. The HCDR3 is a key element in defining the specificity and affinity of antibodies, but it is also by far the most diverse region of the antigen-binding site and is thus difficult to design. Current 3D modeling methods [80] can predict the structure of all of the CDRs other than the HCDR3 with an accuracy of <1.0 Å [81]. However, no method is currently available to reliably predict the HCDR3 structure, which limits our ability to properly design the diversity of this region of the antigen-binding site.
To avoid any assumption regarding the structure and diversity of the HCDR3, ALTHEA Gold Libraries™ [33] were built with HCDR3 and Joining fragments (H3J fragments) isolated from natural sources. The natural H3J fragments were combined with synthetic scaffolds, which were designed based on human germline genes found to be dominant in the repertoire of human antibodies and in numerous scFv and Fab libraries (Table 3). One universal VH scaffold was paired with two VL scaffolds. As in pIX V3.0, we used one VL scaffold to enable the recognition of proteins (IGKV3-20) and another to bind peptides (IGKV4-01). Therefore, we hypothesized that, by using the proper VL scaffold, antibodies against protein or peptide targets can be selected [37]. When used in combination, the two scaffolds would potentially generate antibodies that bind diverse epitopes on a given target. Also, the HFR3 of the universal VH scaffold, being encoded by the IGHV3-23*01 germline gene, naturally binds Protein A of the bacterium Staphylococcus aureus. The Protein A binding site in the VH domain is formed by discontinuous amino acid stretches that are distant in the primary sequence and brought together by folding. Therefore, Protein A offered a means to select for well-folded scFvs in the construction process of the ALTHEA Gold Libraries™.
We used a three-step strategy (Figure 4) to generate the libraries. In the first step, fully synthetic primary antibody libraries (PLs) were designed, cloned, and displayed as scFvs on the phage surface. Second, we performed a selection process in which the PLs were subjected to a heat shock and further selected with Protein A for in-frame and thermostable variants. We called the product of this step filtered libraries (FLs). Third, highly functional and highly diverse secondary antibody libraries (SLs) were generated by combining FLs with natural H3J fragments obtained from a large pool of 200 donors. By using this three-step construction process, the functionality of ALTHEA Gold Libraries™, assessed as the fraction of randomly selected clones that bind Protein A, increased from ≈65% in the PLs to ≈85% in the SLs, an improvement of 20 percentage points. In terms of the number of functional clones in a library of 10¹⁰ variants, this meant 2 × 10⁹ additional antibody sequences to select from.
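The estimate of additional functional clones follows from simple arithmetic on the library size and the two functional fractions. The sketch below reproduces it; the helper name `functional_clones` is ours, and the percentages are the approximate values quoted in the text. Integer percentages are used to keep the arithmetic exact.

```python
def functional_clones(library_size: int, functional_percent: int) -> int:
    """Number of functional clones, given a whole-number functional percentage."""
    if not 0 <= functional_percent <= 100:
        raise ValueError("functional_percent must be between 0 and 100")
    return library_size * functional_percent // 100


library_size = 10**10                               # total variants
primary = functional_clones(library_size, 65)       # ≈65% functional in the PLs
secondary = functional_clones(library_size, 85)     # ≈85% functional in the SLs

print(secondary - primary)  # 2000000000
```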