Programmable Proteins: Target Specificity, Programmability and Future Directions

Programmable proteins that detect, visualize, modulate, or eliminate biomolecules of choice in vitro and in vivo are essential for studying the recognized targets and the biology that follows. The specificity of programmable proteins can be readily altered by redesigning their sequences and structures. The flexibility and modularity of these proteins are pivotal for synthetic biology and various medical applications. Numerous reviews cover the concept and application of individual programmable proteins, such as programmable nucleases, single-domain antibodies, and other protein scaffolds. This review proposes an expanded conceptual framework of such programmable proteins based on their programming principle and their target specificity for biomolecules (nucleic acids, proteins, and glycans) and overviews their advantages, limitations, and future directions.


Introduction: What Are Programmable Proteins?
The term "programmable" is generally used to describe a computer or machine that can accept a set of instructions or rules to perform a range of tasks as intended. The instructions or rules are usually written in a particular programming language. In biological organisms, "programmable" refers to the artificial modification of biomolecules and intermolecular and intercellular circuits, which leads to the conversion of specificity and functions. The function of nucleic acids and proteins is programmable by changing their sequences. With the superior programmability of DNA molecules, it is possible to confer unique chemical properties on nucleic acids [1]. Aptamers are designable oligonucleotide sequences capable of recognizing various molecules with specificity and affinity that rival those of antibodies [2][3][4]. Riboswitches, ribozymes, and deoxyribozymes are functional nucleic acids potentially designable by changing their nucleotide sequences [5,6]. However, functions such as fluorogenicity, ligand-binding, and catalytic activity are still not entirely predictable without the aid of the systematic evolution of ligands by exponential enrichment (SELEX) [7].
Proteins are important biomolecules that have diverse chemical characteristics and biological roles. Protein-biomolecule interactions play pivotal roles in biology, and approaches to designing proteins that inhibit or change these interactions would have great utility. Recent deep learning approaches, including AlphaFold2, have shown that protein structures can be predicted computationally [8,9]. In some cases, new binding proteins can be designed using only knowledge of the structure of the target, without requiring prior knowledge of binding [10][11][12]. This raises the possibility that diverse proteins will become artificially programmable as functional proteins in the near future. In the meantime, however, the templates used for programmable proteins are limited. As summarized in Figure 1, current programmable proteins can be categorized into two major types: nucleic-acid-guided and protein-guided. They can also be classified based on the targets they interact with: nucleic acids, proteins, and other molecules, including glycans.

Figure 1.
Programmable proteins are categorized into two major types based on their guide moiety: nucleic acid and protein. They can also be classified based on the target they interact with: nucleic acid (Section 2), protein (Section 3), and glycan (Section 4). IgG1 is an intact monoclonal antibody for phenobarbital. Galectin (β-galactoside-binding lectin) from Agrocybe cylindracea (ACG) and fucose-binding lectin from Aleuria aurantia (AAL) are shown as dimers. Ribbon drawings (PDB ID) are based on https://pdbj.org (accessed on 17 October 2022). The size of each protein is arbitrary.

Such programmable proteins are widely used in biology, synthetic biology, and medicine. Numerous applications of the CRISPR/Cas9 (clustered regularly interspaced short palindromic repeats, CRISPR-associated protein 9) system are good examples [13][14][15][16]. Accordingly, there already exist copious reviews of individual programmable proteins such as programmable nucleases and single-domain antibodies. Moreover, the field is rapidly advancing. Thus, any review articles on this topic will be quickly outdated. Instead, this review proposes to extend the idea of programmable proteins to other unique binding molecules and overviews their advantages, shortcomings, and future directions.

Programmable Nucleases, Modifiers, and Nucleic Acid-Binding Proteins
Several types of programmable nucleases of natural and synthetic origin have been reported [17][18][19][20]. Two groups of such enzymes are used for genome engineering: protein-guided nucleases that recognize a specific sequence through protein module-DNA interactions (e.g., TALEN: Section 2.1) and nucleic-acid-guided nucleases that recognize a specific sequence via an attached short complementary DNA or RNA (e.g., Cas9: Sections 2.2-2.6).

ZFNs and TALENs
The first proof-of-concept programmable nuclease was an artificial hybrid deoxyribonuclease produced by connecting the DNA-binding homeodomain (Drosophila Ultrabithorax) to the non-specific DNA cleavage domain of FokI [21]. FokI, discovered in Flavobacterium okeanokoites, is a type IIS restriction endonuclease consisting of an N-terminal DNA-binding domain and a C-terminal DNA cleavage domain (~200 amino acids). However, earlier approaches to creating programmable nucleases, such as meganucleases, did not gain popularity because of their technical limitations (e.g., modification of homing enzymes and FEN1 (Flap structure-specific endonuclease-1)) [19,20,22]. Subsequently, engineered zinc-finger nucleases (ZFNs) and transcription activator-like effector nucleases (TALENs) were invented as two types of protein-guided programmable nucleases.
ZFNs use programmable DNA-binding zinc finger modules, each recognizing ~3 bp of DNA. ZFNs function as dimers of monomers, each composed of a FokI endonuclease domain fused to a zinc finger array programmed to recognize a specific target DNA sequence [23]. The DNA-binding domain of a ZFN is usually composed of an array of 3-4 zinc fingers. Improved methods for creating new ZFNs have addressed many of the technical challenges; however, the difficulty of engineering new zinc finger arrays remains a limitation of ZFNs.
TALENs are constructed similarly to ZFNs, comprising a DNA-binding domain and a FokI DNA cleavage domain [24]. The DNA-binding domain is derived from TALEs, DNA-binding proteins of plant-pathogenic bacteria of the genus Xanthomonas [25]. Each TALE repeat is ~34 amino acids long and recognizes a single base pair of DNA, as opposed to a triplet for ZFNs, giving TALENs greater flexibility than ZFNs. TALE repeats are tandemly connected to form an array capable of targeting a specific DNA sequence. However, constructing a TALE array requires the assembly of multiple, nearly identical repeat sequences, which is technically demanding. This issue has led to the development of several elegant laboratory assembly methods. Although nucleic-acid-guided CRISPR nucleases are widely used programmable nucleases, protein-guided TALENs show fewer off-target effects and can target mitochondrial DNA, into which the guide RNA of CRISPR is difficult to import [26].
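The difference in recognition granularity between the two scaffolds can be made concrete with a little arithmetic. The sketch below is illustrative only: it takes the approximate figures quoted above (a triplet per zinc finger, and 1 bp and about 34 amino acids per TALE repeat unit) and estimates how many modules a target of a given length would need. The function name `modules_needed` and the constant names are hypothetical, and the estimate ignores linkers, flanking regions, and the FokI domain.

```python
import math

# Approximate figures taken from the text; real constructs vary.
BP_PER_ZINC_FINGER = 3    # one zinc finger reads roughly a DNA triplet
BP_PER_TALE_REPEAT = 1    # one TALE repeat reads a single base pair
AA_PER_TALE_REPEAT = 34   # approximate length of one TALE repeat


def modules_needed(target_bp: int) -> dict:
    """Estimate the number of DNA-binding modules for a target of target_bp base pairs."""
    if target_bp < 1:
        raise ValueError("target_bp must be a positive integer")
    fingers = math.ceil(target_bp / BP_PER_ZINC_FINGER)
    repeats = math.ceil(target_bp / BP_PER_TALE_REPEAT)
    return {
        "zinc_fingers": fingers,
        "tale_repeats": repeats,
        # Length of the repeat array alone, without N- and C-terminal regions.
        "tale_array_aa": repeats * AA_PER_TALE_REPEAT,
    }


if __name__ == "__main__":
    for length in (9, 12, 18):
        print(length, modules_needed(length))
```

The single-base resolution of the TALE array is what allows any target length to be addressed, at the cost of a long and highly repetitive coding sequence, which is the assembly problem noted above.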

Cas9 (Type II CRISPR-Cas)
ZFNs and TALENs link endonuclease domains to building-block-guided DNA-binding modules to induce targeted DNA double-stranded breaks (DSBs). In contrast, Cas9 is a nucleic-acid-guided nuclease that recognizes its target DNA through base-pairing, offering a system that is easier to design, specific, efficient, and well suited for high-throughput and multiplexed gene editing in diverse cell types and organisms. The CRISPR-Cas9 strategy originated from a naturally occurring genome editing system that bacteria and archaea use for adaptive immunity to viruses and plasmids [27,28].
The CRISPR/Cas9 system uses a monomeric Cas9 nuclease from various bacterial species (e.g., Streptococcus pyogenes Cas9 (SpCas9), Staphylococcus aureus Cas9 (SaCas9), Streptococcus thermophilus Cas9 (St1Cas9 and St3Cas9), Campylobacter jejuni Cas9 (CjCas9), and Neisseria meningitidis Cas9 (NmCas9)), a specificity-determining CRISPR RNA (crRNA), and an auxiliary trans-activating crRNA (tracrRNA). crRNA and tracrRNA are used as a dual RNA or as a single-guide RNA (sgRNA), which is synthesized in vitro or in vivo [13][14][15][16]. Although SpCas9 is the most widely used Cas9, SaCas9, CjCas9, and NmCas9 are smaller than SpCas9, allowing packaging into adeno-associated viral vectors. Cas9 contains an HNH nuclease domain that cuts the DNA strand complementary to the guide RNA (target strand) and a RuvC nuclease domain that cuts the non-complementary strand (non-target strand), resulting in DSBs. The most widely characterized SpCas9-RNA complex recognizes target sites that include a protospacer adjacent motif (PAM), NGG. Of the 20 bp targeted upstream of the PAM, 8 to 12 bp are critical for recognition. The blunt-ended cleavage site lies 3 bp upstream of the PAM. DSBs are subsequently repaired by non-homologous end joining (NHEJ) or homology-directed repair (HDR) [29]. By electroporating crRNA-Cas9 protein complexes, GFP reporters can be somatically integrated into specific genes in the chick genome by HDR, showing that this approach is feasible in a variety of species [30].
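The SpCas9 targeting rule described above (a 20-nt protospacer immediately followed by an NGG PAM, with a blunt cut 3 bp upstream of the PAM) can be sketched as a simple sequence scan. This is a minimal illustration under stated assumptions, not a guide-design tool: the function names `reverse_complement` and `find_spcas9_sites` are hypothetical, the input is assumed to contain only A, C, G, and T, and the scan ignores mismatch tolerance, off-target sites, and chromatin context.

```python
import re


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_spcas9_sites(seq: str, protospacer_len: int = 20) -> list:
    """List candidate SpCas9 sites as (strand, protospacer, pam, cut_index).

    cut_index is a 0-based slice position on the scanned strand: the blunt
    cut falls between strand[cut_index - 1] and strand[cut_index], i.e.
    3 bp upstream of the PAM. For the "-" strand, coordinates refer to the
    reverse-complemented sequence.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    sites = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for match in pattern.finditer(strand_seq):
            pam_start = match.start() + protospacer_len
            sites.append((strand, match.group(1), match.group(2), pam_start - 3))
    return sites


if __name__ == "__main__":
    demo = "GATTACAGATTACAGATTACAGGTT"  # made-up sequence for illustration
    for site in find_spcas9_sites(demo):
        print(site)
```

Orthologs with other PAMs could be accommodated by changing the PAM part of the pattern; the NGG motif here applies to SpCas9 only.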

Cas12 (Type V CRISPR-Cas)
CRISPR-Cas12a (Cpf1) is another CRISPR/Cas system that has diversified the genome editing toolbox [34,35]. Compared with CRISPR/Cas9, the CRISPR/Cas12a system has a smaller size, requires only crRNA and no tracrRNA, uses a T-rich PAM, and creates sticky ends at the cut site. It functions as both a deoxyribonuclease and a ribonuclease that can process multiple functional crRNAs from a single transcript. Thus, CRISPR/Cas9 uses RNA guides to recognize and cleave DNA (R-D nuclease), but CRISPR/Cas12a uses RNA guides to recognize and cleave both DNA and RNA (R-D/R nuclease). The most commonly used Cas12a originates from Francisella novicida (FnCas12a), Acidaminococcus sp. (AsCas12a), and Lachnospiraceae bacterium (LbCas12a).
Cas12b (also known as C2c1) proteins are smaller than Cas9 and Cas12a. Similar to Cas9, Cas12b requires both crRNA and tracrRNA, which can be combined as an sgRNA, for DNA targeting. Recently developed mesophilic Cas12b proteins from Alicyclobacillus acidiphilus (AaCas12b) and Bacillus hisashii (BhCas12b) can be adapted for mammalian genome editing [36,37].

Cas13 (Type VI CRISPR-Cas)
Cas13 (also known as C2c2) is an RNA-guided ribonuclease (R-R nuclease) that uses a crRNA to identify its target, single-stranded RNA (cis-recognition and cleavage), and exhibits trans-cleavage ribonuclease activity (trans-cleavage/collateral effect), holding promise for RNA gene silencing comparable to RNAi or CRISPRi without changing the genome sequence [38][39][40]. There are four subtypes identified in the Cas13 family, including Cas13a, Cas13b, Cas13c, and Cas13d. All Cas13 family members are smaller than Cas9 and require a crRNA to ensure target specificity. Like Cas12a, this nucleic-acid-guided nuclease has often been used for sequence-specific detection of RNA or DNA targets for diagnostics [41].

OMEGA (TnpB, IscB)
IscB and TnpB are recently characterized RNA-guided deoxyribonucleases (R-D nucleases), likely to be the ancestral forms of Cas9 and Cas12, respectively. IscB was found in a distinct family of prokaryotic IS200/IS605 transposons [42]. TnpB was characterized from the same transposon family in the extremophilic bacterium Deinococcus [42,43]. IscB and TnpB are guided by short non-coding RNAs, ωRNA and right end RNA (reRNA), encoded by the transposon. Both transposon-encoded RNA-guided deoxyribonucleases cleave dsDNA in human cells and expand the genome-editing toolbox by providing a new group of small, programmable non-Cas nucleases. Feng Zhang's group proposed calling these widespread nucleases OMEGA (obligate mobile element-guided activity) [43].

Argonaute
Ago (Argonaute) proteins are the second type of nucleic-acid-guided programmable proteins [20,44]. Similar to bacterial CRISPR, Ago plays a pivotal role in genetic immune systems that protect host cells from invading nucleic acids in eukaryotes and prokaryotes. Eukaryotic Argonaute proteins (eAgos) play a role in RNA interference (RNAi) and use guide RNAs for the recognition of RNA targets (R-R nuclease). In contrast, prokaryotic Ago (pAgo) nucleases have a natural specificity for DNA guides and DNA targets (D-D nuclease), and a small group of CRISPR-associated pAgos are programmable with DNA guides or RNA guides to cleave DNA targets (D/R-D nuclease).
However, no success has been reported using Ago for programmable genome editing in mammalian cells. This is because the pAgos initially characterized from thermophilic prokaryotes are most effective at high temperatures (>65 °C) but not at 37 °C (e.g., pAgos from Thermus thermophilus (TtAgo), Methanocaldococcus jannaschii (MjAgo), and Pyrococcus furiosus (PfAgo)). Nonetheless, the diverse structures and functions of pAgo proteins in various prokaryotes suggest that they may provide next-generation tools for genome editing alongside Cas nucleases. Indeed, some pAgo proteins have been demonstrated to cut DNA sequences at 37 °C in a DNA guide-dependent manner (e.g., pAgos from Clostridium perfringens (CpAgo) and Intestinibacter bartlettii (IbAgo)) [45]. If such efforts succeed, pAgos could extend the reach of CRISPR-Cas9 tools, whose application is often limited by the tolerance of guide-target mismatches, possible RNA secondary structures, and the PAM requirement.
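The guide/target shorthand used throughout this section (R-D, R-D/R, R-R, D-D, D/R-D) can be collected in a small lookup table. The sketch below only restates the classifications given in the text; the names `GuidedNuclease`, `NUCLEASES`, `shorthand`, and `nucleases_targeting` are hypothetical helpers introduced for illustration, and the table is not an exhaustive catalogue.

```python
from typing import List, NamedTuple, Tuple


class GuidedNuclease(NamedTuple):
    guides: Tuple[str, ...]   # nucleic acid type(s) accepted as guide
    targets: Tuple[str, ...]  # nucleic acid type(s) cleaved


# Classifications as stated in the text of this section.
NUCLEASES = {
    "Cas9": GuidedNuclease(("RNA",), ("DNA",)),
    "Cas12a": GuidedNuclease(("RNA",), ("DNA", "RNA")),
    "Cas13": GuidedNuclease(("RNA",), ("RNA",)),
    "IscB/TnpB": GuidedNuclease(("RNA",), ("DNA",)),
    "eAgo": GuidedNuclease(("RNA",), ("RNA",)),
    "pAgo": GuidedNuclease(("DNA",), ("DNA",)),
    "CRISPR-associated pAgo": GuidedNuclease(("DNA", "RNA"), ("DNA",)),
}


def shorthand(name: str) -> str:
    """Build the guide-target label, e.g. 'R-D' for an RNA-guided DNA nuclease."""
    nuclease = NUCLEASES[name]
    guides = "/".join(kind[0] for kind in nuclease.guides)
    targets = "/".join(kind[0] for kind in nuclease.targets)
    return f"{guides}-{targets}"


def nucleases_targeting(target: str) -> List[str]:
    """Return the names of nucleases that cleave the given target type."""
    return sorted(name for name, n in NUCLEASES.items() if target in n.targets)


if __name__ == "__main__":
    for name in NUCLEASES:
        print(f"{name}: {shorthand(name)} nuclease")
```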

Programmable Protein-Binding Proteins
In addition to those targeting nucleic acids, diverse classes of programmable proteins can detect, disrupt, or modulate protein interactions that have essential roles in biology [46][47][48][49]. Antibodies are long-established immunological tools with various applications. In particular, with the advent of hybridoma and recombinant DNA technologies, successful applications of monoclonal antibodies and recombinant antibodies, including single-chain variable fragments (scFvs), have inspired the development of diverse types of immunological reagents and therapeutic drugs [50]. The invention of in vitro selection methods such as phage display, yeast display, mRNA display, and ribosome display, together with directed evolution and affinity maturation, has not only enabled further antibody engineering but also facilitated the development of novel binding proteins [51][52][53][54][55]. However, these binding proteins differ in principle from nucleic-acid-guided programmable proteins in that their precise specificity is not readily predictable. To date, their efficient programming depends on various display methods and directed evolution in vitro. Nonetheless, emerging computational approaches raise the possibility that they will become programmable [10][11][12][56].
The first class of such programmable binding proteins is a single-chain fragment from an unusual antibody, called VHH or nanobody (Section 3.1). The second class is based on a protein scaffold that offers two structural features viewed as the hallmark of immunoglobulins: a variable segment that provides the structural adaptability to design novel binding sites and a constant region that offers folding stability. Such bipartite building blocks are achievable when starting from a domain architecture that already exhibits variable loop motifs. Thus, repetitive domains such as ankyrin repeats and fibronectin type III (FN3) repeats appeared attractive as robust scaffolds for general binding proteins. Accordingly, a series of protein-guided programmable proteins such as DARPins and monobodies were created (Sections 3.2 and 3.3). The third class is based on ligand-binding non-repeat proteins with high thermal and proteolytic stability. Programmable proteins dubbed affibodies and anticalins were derived from Staphylococcus aureus protein A and a family of transport proteins, lipocalins, respectively (Sections 3.4 and 3.5). This third class is typically used as a single protein, whereas the first and second classes are frequently applied as recombinant fusion proteins as well as solitary proteins.

Single-Domain Antibody, Nanobody, VHH
Single-domain antibodies, also called nanobodies, are small antigen-binding polypeptides with a molecular weight of ~15 kDa and a size of ~2-4 nm. They comprise the variable domain of a heavy chain-only antibody (VHH), which was first described in the serum of camelids (camels and llamas) in the 1990s [63]. Similar single-chain antibodies are found in cartilaginous fishes (VNAR, from sharks) [64,65].
Thus, nanobodies are the smallest intact antigen-binding protein fragments derived from an active immunoglobulin [66,67]. Nanobodies are advantageous alternatives to conventional antibodies owing to their tiny size, high solubility, and high stability across a variety of applications. Furthermore, phage display, ribosome display, and mRNA display methods can be used for the efficient generation and optimization of binding molecules in vitro. The nanobodies can be genetically encoded, tagged, and expressed as recombinant intrabodies in cells or reporter fusion bodies for in vivo localization and functional studies of target proteins [68][69][70]. There are currently several nanobodies undergoing clinical trials, and one was approved by the FDA for acquired thrombotic thrombocytopenic purpura in 2019 [71].

DARPin
Designed ankyrin repeat proteins (DARPins) are genetically engineered antibody mimetic proteins that typically exhibit highly specific and high-affinity target protein binding [72,73]. Most natural ankyrin repeat (AR) proteins contain 4-6 ARs stacked onto each other. DARPins contain 2-3 internal ARs sandwiched between the N- and C-terminal capping modules. Each internal AR module consists of 27 defined framework residues and 6 potential protein-binding residues that form a β-turn followed by two antiparallel helices and a loop connecting to the β-turn of the next AR [74]. DARPins are small (14-18 kDa, depending on the number of internal ARs), thermostable, resistant to proteases and chemical denaturants, and can be expressed in bacteria. DARPins have a concave binding interface, which expands the synthetic ligand landscape because it differs from the convex interface of nanobodies [75].
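The modular composition of the internal repeats lends itself to a quick back-of-the-envelope calculation. The sketch below uses only the figures given above (27 framework and 6 potential binding residues per internal AR, 2-3 internal ARs) to estimate the size of the internal repeat region and the theoretical sequence space of the randomized positions. It assumes, purely for illustration, that all 20 amino acids are allowed at every randomized position, which real libraries may not permit; the capping modules are left out because their lengths are not stated here. The function name `internal_repeat_stats` is hypothetical.

```python
FRAMEWORK_RESIDUES_PER_AR = 27  # defined framework residues (from the text)
BINDING_RESIDUES_PER_AR = 6     # potential protein-binding residues (from the text)


def internal_repeat_stats(n_internal_ars: int, alphabet_size: int = 20) -> dict:
    """Summarize the internal ankyrin repeats of a DARPin-like scaffold.

    alphabet_size=20 is an illustrative assumption: every randomized
    position is treated as free to take any of the 20 amino acids.
    """
    if n_internal_ars < 1:
        raise ValueError("n_internal_ars must be at least 1")
    residues_per_ar = FRAMEWORK_RESIDUES_PER_AR + BINDING_RESIDUES_PER_AR
    randomized = BINDING_RESIDUES_PER_AR * n_internal_ars
    return {
        "internal_residues": residues_per_ar * n_internal_ars,
        "randomized_positions": randomized,
        "theoretical_variants": alphabet_size ** randomized,
    }


if __name__ == "__main__":
    for n in (2, 3):
        stats = internal_repeat_stats(n)
        print(n, stats["internal_residues"], stats["randomized_positions"],
              f"{stats['theoretical_variants']:.2e}")
```

Even with only two internal repeats, the theoretical sequence space far exceeds what any display library can sample, which is one reason selection methods rather than exhaustive screening are used.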

Monobody, Adnectin ® , FingR
Monobodies were initially designed based on the tenth FN3 domain, which has an immunoglobulin-like β-sandwich fold with seven strands connected by six loops but no disulfide bonds [76,77]. The original tenth FN3 consists of 94 amino acids, has a molecular mass of ~10 kDa, and contains the adhesive RGD sequence that binds integrins. Three of the six flexible loops on one side of the FN3 are surface-exposed and can serve as a synthetic interface for binding ligands of interest. Subsequently, their commercial equivalent, Adnectin ®, and several monobody-like scaffolds have been developed [77,78], reiterating the robustness of the FN3 approach for creating programmable proteins. Various display techniques have been used to select ligand-binding proteins with binding affinities in the nanomolar to picomolar range. Like nanobody-based intrabodies, monobody-derived proteins can also be used intracellularly (e.g., FingR (=Monobody) fused to GFP, which colocalized with endogenous PSD95 at neuronal synapses [79]).
Note: Adnectin ® is a registered trademark of Adnexus, a Bristol-Myers Squibb R&D Company.

Affibody
Affibodies are designed based on the Z-domain of the immunoglobulin-binding region of Staphylococcus aureus protein A [80,81], which adopts a three-helix scaffold and contains no disulfide bonds. The programmable ligand-binding surface is composed of 13 amino acid residues scattered between two of the helices. Their compactness (58 amino acids, ~7 kDa in size) allows them to be readily expressed in bacteria or produced by chemical synthesis.

Anticalin ®
Anticalin ® was designed based on lipocalins, a large group of secreted proteins that typically transport or store small compounds, including vitamins, steroids, odorants, and various metabolites [82,83]. The anticalin scaffold adopts an eight-stranded antiparallel β-barrel, which is open to the solvent at one end and contains 160-180 amino acids (~20 kDa in size). Anticalins are not glycosylated and possess no disulfide bonds. An anticalin library for screening ligand-binding proteins contains 16-24 randomized amino acids in each loop [82]. Ligand-specific anticalins have been engineered via phage and bacterial surface display and are expressed in either bacteria or yeast [82].
Note: Anticalin ® is a registered trademark of Pieris Pharmaceuticals.

Glycan-Binding Proteins (GBPs)
Glycans (a general term describing carbohydrates, including oligosaccharides and polysaccharides) are the third class of essential biological macromolecules, following nucleic acids and proteins [84]. They exist as free sugars but are more commonly found as complex glycoconjugates, including glycoproteins, proteoglycans, and glycolipids. The broad importance of glycans has driven interest in developing new glycan-binding molecules [85][86][87].
Glycan-binding proteins (GBPs) include antibodies, lectins, pseudoenzymes, and carbohydrate-binding modules. Antibodies against glycans are also seen in nature, and many monoclonal antibodies are remarkably useful reagents for recognizing the specific structure of glycans [88]. However, glycans are poorly immunogenic in general. On the other hand, lectins are non-immunoglobulin proteins containing at least one non-catalytic domain that often displays a specific glycan binding. For example, numerous plant lectins that recognize distinct glycans have been used as tools for detecting glycans [88][89][90]. GBPs also include carbohydrate-binding modules that are similar to lectins but are small binding domains typically found in carbohydrate-active enzymes such as glycosidases and glycosyltransferases [91]. Some carbohydrate-active enzymes have evolved to pseudoenzymes that have lost their enzymatic activity but retain their glycan-binding feature, providing a potential GBP scaffold [92].
Numerous pioneering works in lectin engineering have changed the binding specificity of eukaryotic and prokaryotic lectins. For example, using random mutagenesis and ribosome display techniques, a novel sialic acid-binding protein was created from a galactose-binding R-type lectin (galectin) from earthworms [93]. A sugar-binding spectrum of two different mushroom (Agrocybe cylindracea and Aleuria aurantia) lectins was also changed by mutagenesis [94,95]. These earlier studies demonstrated that new GBPs could be generated [96,97], as in the case of naturally occurring legume lectins [98], supporting the idea that lectins are potentially programmable. Moreover, other common protein scaffolds, including the protein-guided programmable proteins described in Section 3, can be employed to acquire a novel glycan-binding specificity. Lastly, a plethora of new computational and data science approaches can be used to expand the toolboxes to distinguish further complex glycans [90,92,99], along with the state-of-the-art progress in glycoconjugates research [84].

Perspectives
Programmable proteins that bind, label, inhibit, or remove biomolecules of choice in vitro and in vivo are critical for understanding the fundamental roles of the biomolecules recognized. The specificity of programmable proteins can be flexibly designed by changing their sequences in vitro. In particular, with the superb synthesizability of DNA, the programmability of nucleic-acid-guided proteins for recognizing specific nucleic acid sequences is practically established, although significant issues such as off-target effects and inefficiency remain unsolved. In contrast, the programmability of protein-guided programmable proteins is still in development, except for building-block-guided modular nucleases. Nonetheless, the emerging deep learning approaches, including AlphaFold2, demonstrate that many protein structures can be accurately predicted in silico [8]. Indeed, in limited cases, binding proteins have been designed using only information on the structure of the target [10][11][12], raising the possibility that diverse proteins will be designable to recognize specific sequences and structures in proteins or glycans. Developing such computational algorithms and massive, open-source databases will be the key to creating a cutting-edge armamentarium in synthetic biology.
This review focused on programmable proteins that recognize three important biopolymers: nucleic acids, proteins, and glycans. However, there are other types of programmability. For example, a unique group of programmable proteins comprises the enzymes involved in the biosynthesis of bioactive natural products, such as polyketides, non-ribosomal peptides (NRPs), and ribosomally synthesized and post-translationally modified peptides (RiPPs), which are widespread throughout bacteria, fungi, and plants [100][101][102]. These biosynthetic enzymes produce diverse natural products using a modular synthetic scheme that resembles an assembly line. The modularity of these megaenzymes is useful from a synthetic biology viewpoint. The substitution or rearrangement of modules can lead to new natural products, suggesting that the biosynthesis of bioactive small molecules is programmable [101][102][103]. Taken together, the flexibility and modularity of programmable proteins are crucial for industrial and medical applications such as diagnostic reagents and therapeutic drugs.