AI-Aided Design of Novel Targeted Covalent Inhibitors against SARS-CoV-2

The drug repurposing of known approved drugs (e.g., lopinavir/ritonavir) has failed to treat SARS-CoV-2-infected patients. Therefore, it is important to generate new chemical entities against this virus. As a critical enzyme in the lifecycle of the coronavirus, the 3C-like main protease (3CLpro or Mpro) is the most attractive target for antiviral drug design. Based on a recently solved structure (PDB ID: 6LU7), we developed a novel advanced deep Q-learning network with a fragment-based drug design (ADQN–FBDD) for generating potential lead compounds targeting SARS-CoV-2 3CLpro. We obtained a series of derivatives from the lead compounds based on our structure-based optimization policy (SBOP). All of the 47 lead compounds obtained directly with our AI model and related derivatives based on the SBOP are accessible in our molecular library. These compounds can be used as potential candidates by researchers to develop drugs against SARS-CoV-2.


Introduction
The emerging coronavirus SARS-CoV-2 has caused an outbreak of coronavirus disease 2019 (COVID-19) worldwide [1]. By the end of May 2022, the Johns Hopkins Coronavirus map tracker had reported more than 527 million SARS-CoV-2 infections and over six million deaths [2]. The number of infections and deaths is still increasing. To deal with the threat of SARS-CoV-2, it is necessary to develop new inhibitors or drugs. Unfortunately, since the outbreak of severe acute respiratory syndrome (SARS) 18 years ago, there has been no approved treatment for SARS-associated coronavirus (SARS-CoV) [3], which is similar to SARS-CoV-2. SARS-CoV-2-injected patients have also failed to respond to repurposed drugs, such as lopinavir and ritonavir [4]. Structure-based antiviral drug design with a novel artificial intelligence algorithm may be an effective approach to developing SARS-CoV-2-targeted inhibitors or drugs. Owing to the efforts of many researchers, several pieces of important information about the viral genome and protein structures are currently available. We know that non-structural protein 5 (Nsp5), a cysteine protease, is one of the main proteases (M pro ) of SARS-CoV-2, also known as "3C-like protease" (3CL pro ). Moreover, we know that the 3D structure of 3CL pro is very similar to that of SARS-CoV with a sequence identity of >96% and 3D structure superposition RMSD Cα of 0.44 Å, as shown in Supplementary Figures S1 and S2.
Overall, 3CL pro has been reported as an attractive target for developing anti-coronavirus drugs for the following reasons: (1) this protease is highly conserved in both sequences and Biomolecules 2022, 12, 746 2 of 18 3D structures [5], (2) 3CL pro is a crucial enzyme for the replication of related viruses (including SARS and SARS-CoV-2), and (3) it only exists in viruses and not in humans. Developing specific antiviral drugs that target the 3CL pro of viruses has shown significant success; for example, both lopinavir and ritonavir (approved HIV drugs) can completely occupy the substrate-binding site of 3CL pro , thereby disrupting the replication of human immunodeficiency virus (HIV). However, due to the large difference between HIV and SARS-CoV-2 3CL pro , lopinavir and ritonavir are ineffective for inhibiting SARS-CoV-2 [4]. On the other hand, the substrate-binding site of 3CL pro is almost the same between SARS-CoV-2 and SARS, as shown in Supplementary Figure S3. Therefore, the developed potential inhibitors and drug design experience for SARS-3CL pro may also apply to SARS-CoV-2. For example, the recently solved structure of SARS-CoV-2 3CL pro (PDB ID: 6LU7) demonstrated that the developed inhibitor N3 [6], which is a covalent inhibitor derived from non-covalent inhibitors against SARS, can also bind to SARS-CoV-2 3CL pro with a similar binding conformation (Supplementary Figure S4).
The available information from previous research can aid the design of novel targeted covalent inhibitors (TCIs) [7] against SARS-CoV-2. A successful TCI against 3CL pro must first be able to fit in the binding site of 3CL pro with an appropriate pose that keeps its reactive groups close enough to Cys145, which then undergoes a chemical step (nucleophilic attack by Cys145), leading to the formation of a stable covalent bond, as presented in the scheme below: In theory, TCIs usually have a longer target residence time compared with that of non-covalent inhibitors, given the following: (1) for inhibition, k 1 must be larger than k 2 , and non-covalent binding is determined by the equilibrium constant k 1 /k 2 ; (2) TCIs undergo a chemical reaction with the target, where usually k 3 >> k 4 ; and (3) for TCIs, the binding process is controlled by k 1 k 3 /(k 2 k 4 ), which is larger than k 1 /k 2 for non-covalent inhibitors. In some extreme cases, k 4 = 0; thus, these inhibitors covalently bind until the target is no longer detectable [7,8].
Considering that the inhibitors of SARS 3CL pro may exert bioactivity against SARS-CoV-2, we have created a molecular library that contains all of the reported SARS-3CL pro inhibitors (284 molecules) . We will also add new validated molecular structures to this library as our research progresses. To date, there are no clinically approved vaccines or drugs specifically targeting SARS-CoV-2. Therefore, to discover novel candidate drugs targeting SARS-CoV-2, we combine artificial intelligence (AI) with fragment-based drug design (FBDD) to accelerate the generation of potential lead compounds and design TCIs.
AI, especially deep learning, has been used to predict molecular properties [30][31][32][33] and design novel molecules [34]. Table 1 lists some deep-learning-based tools and their features. Among these tools, the tool involving recurrent neural networks relies on a large number of training molecules with known structures to learn the context information as the basis for molecule generation. Over time, various advanced generative tools were proposed, including GENTRL, ORGAN, and ORGANIC. These tools were expected to increase success rates by introducing posterior distributions and adversarial attacks as regularizers. However, using the generative models still does not guarantee that their outputs are valid molecules for drug discovery; to overcome such problems, Molecule Deep Q-Networks (MolDQNs) with reinforcement learning (RL) were proposed. MolDQN can iteratively work at the atomic level to generate molecules fitting all predefined constraints. Hence, all outputs are chemistry-reasonable molecules. Considering the advantages of MolDQN, we combined the framework with more fragment-based rules to optimize the generated molecules in this study. In contrast to earlier deep-learning molecular design by adding a single atom one at a time [30,34], our approach explores new molecules by adding a meaningful molecular fragment one by one, which is computationally more efficient and chemically more reasonable. For our AI model, we initially prepared the molecular fragment library by using the collected SARS-CoV 3CL pro inhibitors (284 molecules) targeting SARS-CoV-2 3CL pro , as shown in Figure 1. Then we split this set of molecules into fragments with a molecule weight of no more than 200 Da. Both of the collected inhibitors and the fragments can be found at [40]. Then we applied an advanced deep Q-learning network with fragment-based drug design (ADQN-FBDD) to generate potential lead compounds. If researchers have sufficient experience and internal lead compounds or biased fragments, they can manually add the lead compounds and biased fragments to the corresponding files of the ADQN-FBDD. Using the same fragments directly from existing bioactivate molecules, our ADQN-FBDD agent could easily access the potential chemical space for the 3CL pro of SARS-CoV-2.
After the ADQN-FBDD automatically generated novel compounds targeting 3CL pro , we obtained a covalent lead compound library with 4922 unique valid structures. A total of 47 of these compounds were selected with high scores with our AI model's reward function. These molecules were further evaluated through docking studies. Among the 47 lead compounds, compound #46, with a high covalent docking score, attracted our attention, showing a small difference between the non-covalent and covalent docking poses. After carefully examining the interaction mode of lead #46 with 3CL pro , we believe that there is still much space for optimization. Subsequently, we designed a series of derivatives from compound #46 based on our chemical biology knowledge and the structure-based optimization policy. All of the generated molecular structures are published in our code library, https://github.com/tbwxmu/2019-nCov [40]. We would encourage researchers interested in finding a potential treatment for COVID-19 to synthesize and evaluate some of these molecules. than 200 Da. Both of the collected inhibitors and the fragments can be found at [40]. Then we applied an advanced deep Q-learning network with fragment-based drug design (ADQN-FBDD) to generate potential lead compounds. If researchers have sufficient experience and internal lead compounds or biased fragments, they can manually add the lead compounds and biased fragments to the corresponding files of the ADQN-FBDD. Using the same fragments directly from existing bioactivate molecules, our ADQN-FBDD agent could easily access the potential chemical space for the 3CL pro of SARS-CoV-2.

Results
Integrating the double-dueling deep-Q-learning network with fixed q-targets and prioritized experience replay allowed our agent to be stable and efficient when learning from the chemical environment. With a combination of the state-of-the-art AI algorithm and FBDD, as shown in Figure 2, the approach was flexible and efficient in accessing the focused chemical space for SARS-CoV-2 3CL pro . Based on the configurations targeting SARS-CoV-2 3CL pro , the ADQN-FBDD generated a lead library containing 4922 unique molecular structures (Supplementary Materials dataset in [40]). To narrow our focus to a smaller set of molecules for analysis, we defined the filter rules (QED > 0.1 and DRL-reward ≥0.6); detailed information on the rules can be found in the Methods section. A total of 47 unique molecules (Supplementary Table S1) were selected for non-covalent docking and covalent docking evaluation. These 47 virtual leads exhibited an appropriate 3D complexity with typical characteristics of peptidomimetics and protein-protein interaction (PPI) inhibitors. They were mainly ranked according to the covalent docking scores, considering that covalent docking scores also include the scores of non-covalent docking [41].
To analyze the common features of these generated molecules, we used the Canvas Similarity and Clustering tool. As shown in Supplementary Table S3, the 47 leads were clustered into five clusters according to the clustering metric (shape). Among the five clusters, all of the molecules in Cluster #1 contained R-shaped pyrrolidin-2-one structures and may be considered optimal segments of the S2 subsite or S1 subsite. In addition, they contained some substituents with hydrophobic cyclic groups, such as aromatic rings, alicyclic rings, and aliphatic hydrocarbons. The molecules in Cluster #2 included 2-pyrrolidone in their structures, including R-shaped, S-shaped, and S'-shaped. Furthermore, their side chains linked to covalent targets contained large substituents (e.g., aromatic rings) and small substituents (e.g., aliphatic hydrocarbons or aliphatic amines). Cluster #3 comprised several straight chain-like molecules with substituted or non-substituted pyrrolidin-2-one, which may also be regarded as optimal segments of the S2 subsite or S1 subsite. In addition, their α-oxo aldehyde group can form a covalent bond with Cys145. Cluster #4 mainly consisted of 2-pyrrolidone, heterocyclic rings, and heteroaromatic rings, whose polarities were considerably higher than those of the molecules from other clusters, which may reduce the affinities of the protein active pockets S1 and S2. Their covalent targets included α, β-unsaturated ketones. Each molecule in Cluster #5 had four extension directions with different covalent targets, matching four sub-pockets of active protein-binding sites. This could contribute to a high binding affinity; hence, molecules in Cluster #5 were the best candidates for further analysis. We also considered the RMSD difference between the covalent and non-covalent binding poses based on all heavy atoms. Finally, we selected lead molecule #46 (Figure 3), with a good covalent docking score and a small RMSD value (docking affinity: −8.722 kcal/mol, KabschRmsd 1.71 Å) in Supplementary Table S1. The FBDD approach was further optimized to obtain a series of derivatives. some of these molecules.

Results
Integrating the double-dueling deep-Q-learning network with fixed q-targets and prioritized experience replay allowed our agent to be stable and efficient when learning from the chemical environment. With a combination of the state-of-the-art AI algorithm and FBDD, as shown in Figure 2, the approach was flexible and efficient in accessing the focused chemical space for SARS-CoV-2 3CL pro . Based on the configurations targeting SARS-CoV-2 3CL pro , the ADQN-FBDD generated a lead library containing 4922 unique molecular structures (Supplementary Materials dataset in [40]). To narrow our focus to a smaller set of molecules for analysis, we defined the filter rules (QED > 0.1 and DRL-reward 0.6); detailed information on the rules can be found in the Methods section. A total of 47 unique molecules (Supplementary Table S1) were selected for non-covalent docking and covalent docking evaluation. These 47 virtual leads exhibited an appropriate 3D complexity with typical characteristics of peptidomimetics and protein-protein interaction (PPI) inhibitors. They were mainly ranked according to the covalent docking scores, considering that covalent docking scores also include the scores of non-covalent docking [41]. The ADQN-FBDD consists of a reinforced agent, a prioritized experience replay algorithm, and a chemical environment to perform chemical structure generation. The agent selects an action (insertion, deletion, or none) for the intermediate molecular fragment at each step to generate a new molecule that can maximize the cumulative rewards. The prioritized experience replay algorithm allows the agent to repeat the molecule generation based on the updated maximization of rewards. The chemical environment assesses the agent's actions according to the predefined chemical rules and provides rewards. (B) Example of fragment-based actions. (C) The solid lines represent taken actions, including the addition or deletion of different fragments during an episode. The dashed lines represent actions that the RL agent was considered but did not take. An exploratory action is represented by the red dashed line, which was taken even though another sibling action, the one leading to S*, was ranked higher. The exploratory action did not result in any learning; however, other actions did, resulting in updates as demonstrated by the curved arrows where estimated values moved up the tree from later nodes to earlier nodes.   Molecule #46 was ranked first based on the covalent docking score, and its interaction model with the binding site was carefully examined, as shown in Figure 3. Although compound #46 had the best covalent docking score, it had an aldehyde group. Interestingly, upon a detailed examination of #46, it was found that its interactions were similar to those of the inhibitor 6XHO (PF-00835231) designed by Pfizer, thus demonstrating the potential of our proposed method. Moreover, 6XHO was reported to have modest levels of irreversible inhibition, which allowed co-crystallization in complex with SARS-CoV-2 3CL [42]. Based on PF-00835231, Pfizer further developed Lufotrelvir (PF-07304814). The phosphate group of PF-07304814 contributed to improved solubility and was cleaved in vitro to release the active antiviral PF-00835231. Similar to remdesivir, PF-07304814 was administered via intravenous infusion. Moreover, Pfizer developed the orally active 3C-like protease inhibitor Nirmatrelvir (PF-07321332), crystallized as 7VH8. We superimposed lead #46 with 6XHO and 7VH8 and compared their 3D pose and interactions, as shown in Figure 4. Most of their binding sites were aligned with their SARS-CoV-2 3CL pro ligands. Considering that there is still much space for compound #46 to fill in the S1 subsite and that α-ketoamides may be suitable to fit into the oxyanion hole ( Figure 5A) of 3CL pro [3], we replaced the aldehyde with formamide and also replaced the 1,4 Michael acceptors with α-ketoamides. Therefore, we optimized compound #46 to obtain compound 46-14-1 ( Figure 5A,B).
interactions between compound #46 and protease under covalent docking. The triazole ring forms an H-bond with His163. The α, β-unsaturated aldehyde forms a covalent bond with Cys145, i.e., the key residue in the catalytic center of the protease, resulting in covalent inhibition. The aldehyde carbonyl group forms an H-bond with His41. The hydroxyl group of chromone at position 7 forms an H-bond with Thr190.
Molecule #46 was ranked first based on the covalent docking score, and its interaction model with the binding site was carefully examined, as shown in Figure 3. Although compound #46 had the best covalent docking score, it had an aldehyde group. Interestingly, upon a detailed examination of #46, it was found that its interactions were similar to those of the inhibitor 6XHO (PF-00835231) designed by Pfizer, thus demonstrating the potential of our proposed method. Moreover, 6XHO was reported to have modest levels of irreversible inhibition, which allowed co-crystallization in complex with SARS CoV-2 3CL [42]. Based on PF-00835231, Pfizer further developed Lufotrelvir (PF-07304814). The phosphate group of PF-07304814 contributed to improved solubility and was cleaved in vitro to release the active antiviral PF-00835231. Similar to remdesivir, PF-07304814 was administered via intravenous infusion. Moreover, Pfizer developed the orally active 3C-like protease inhibitor Nirmatrelvir (PF-07321332), crystallized as 7VH8. We superimposed lead #46 with 6XHO and 7VH8 and compared their 3D pose and interactions, as shown in Figure 4. Most of their binding sites were aligned with their SARS CoV-2 3CL pro ligands. Considering that there is still much space for compound #46 to fill in the S1' subsite and that α-ketoamides may be suitable to fit into the oxyanion hole ( Figure 5A) of 3CL pro [3], we replaced the aldehyde with formamide and also replaced the 1,4 Michael acceptors with α-ketoamides. Therefore, we optimized compound #46 to obtain compound 46-14-1 (Figure 5A,B). compounds, sulfonic groups were introduced to replace the ketone carbonyl groups on the main chain, and sulfonamides 46-14-2 were obtained. The covalent docking model of compound 46-14-2 with SARS-CoV-2 3CL pro is shown in Figure 5C,D.    To make compound 46-14-2 fit the active pocket with a higher affinity, we added a carbon atom to the sulfonic acid group of the original molecule, and it extended the carbon chain to increase the molecule's flexibility and obtained another optimized compound 46-14-3. The mode of covalent docking with SARS-CoV-2 3CL pro is shown in Figure 6. Due to the introduction of carbon atoms and the enhancement of molecular flexibility, the β-lactam ring can be inserted deeper into the S2 pocket, and other fragments of the compound can better adapt to the S1, S1 , and S3 subsites. The α-carbonyl carbon on the α-ketoamide of compound 46-14-3 forms a covalent bond with the key residue Cys145 on the protease; however, the primary non-bonding interaction is still an H-bond (indicated by a yellow dash). The triazole ring mainly forms H-bonds with the Phe140 and Glu166 residues in the S1 pocket. The α-ketoamide covalent binding fragment mainly forms H-bonds with the key amino acid residues Cys145, Gly143, and Ser144 in the S1 subsite, forming an oxyanion hole in the red circle shown in Figure 5. The β-lactam side chain mainly forms H-bonds with residues Tyr54 and AsS187 in the S2 pocket. The chromone scaffold mainly forms H-bonds with the key residues Thr190 and Gln192 in the S3 pocket. In addition, the oxygen on the sulfonyl group of the main chain forms an H-bond with Glu166 in the S1 pocket.
scaffold α-ketoamide forms H-bonds with Leu141 and Gly143 as a hydrogen acceptor or a hydrogen donor. The hydrogen on the nitrogen atom of the triazole ring forms an Hbond with Glu166, and Glu166 forms an H-bond with the carbonyl group on the main chain. The oxygen of the β-lactam ring forms an H-bond with His41. To enhance the polarity of the compounds, sulfonic groups were introduced to replace the ketone carbonyl groups on the main chain, and sulfonamides 46-14-2 were obtained. The covalent docking model of compound 46-14-2 with SARS-CoV-2 3CL pro is shown in Figure 5C,D.
To make compound 46-14-2 fit the active pocket with a higher affinity, we added a carbon atom to the sulfonic acid group of the original molecule, and it extended the carbon chain to increase the molecule's flexibility and obtained another optimized compound 46-14-3. The mode of covalent docking with SARS-CoV-2 3CL pro is shown in Figure 6. Due to the introduction of carbon atoms and the enhancement of molecular flexibility, the βlactam ring can be inserted deeper into the S2 pocket, and other fragments of the compound can better adapt to the S1, S1', and S3 subsites. The α-carbonyl carbon on the αketoamide of compound 46-14-3 forms a covalent bond with the key residue Cys145 on the protease; however, the primary non-bonding interaction is still an H-bond (indicated by a yellow dash). The triazole ring mainly forms H-bonds with the Phe140 and Glu166 residues in the S1 pocket. The α-ketoamide covalent binding fragment mainly forms Hbonds with the key amino acid residues Cys145, Gly143, and Ser144 in the S1' subsite, forming an oxyanion hole in the red circle shown in Figure 5. The β-lactam side chain mainly forms H-bonds with residues Tyr54 and AsS187 in the S2 pocket. The chromone scaffold mainly forms H-bonds with the key residues Thr190 and Gln192 in the S3 pocket. In addition, the oxygen on the sulfonyl group of the main chain forms an H-bond with Glu166 in the S1 pocket.  The optimized compounds (46-14-1, 46-14-2, and 46-14-3) are shown in Figure 7 that may be further evaluated by molecular dynamics simulation to determine the binding free energy and by quantum chemical calculation to determine the reaction energy barrier. Furthermore, 46-14-1, 46-14-2, and 46-14-3 were chosen as candidates for chemical synthesis and anti-SARS-CoV-2 activity testing, which is ongoing.
sent the H-bonds, and the oxyanion hole is in the red circle.
The optimized compounds (46-14-1, 46-14-2, and 46-14-3) are shown in Figure 7 that may be further evaluated by molecular dynamics simulation to determine the binding free energy and by quantum chemical calculation to determine the reaction energy barrier. Furthermore, 46-14-1, 46-14-2, and 46-14-3 were chosen as candidates for chemical synthesis and anti-SARS-CoV-2 activity testing, which is ongoing.

Discussion
Computational approaches are particularly important for emerging diseases, given the need to provide timely solutions. In this study, our robust and efficient computational method and pipeline for designing compounds can provide useful drug candidates for treating SARS-CoV-2 infections. More information about our AI model-generated leads and derivatives can be found at [40]. These candidates or their variants have a high probability of producing valid leads of anti-COVID-19 drugs. Nevertheless, the computational design requires experimental validation. Although we are currently conducting experimental validations, we hope that other researchers could use these candidates to accelerate the development of anti-COVID-19 drugs, given the urgency of treating the disease.
FBDD has become a key technology in the pharmaceutical industry for early stage drug discovery and development in the last two decades. Molecular fragments are always small in size compared with intact molecules. Several fragments are especially favorable to some protein binding pockets. For example, the α-ketoamides can form a covalent bond with CYS145 of 3CL pro in SARS-CoV-2 and fit into the oxyanion hole, as Figure 6B presented. These fragments with high-quality interactions with protein targets can more

Discussion
Computational approaches are particularly important for emerging diseases, given the need to provide timely solutions. In this study, our robust and efficient computational method and pipeline for designing compounds can provide useful drug candidates for treating SARS-CoV-2 infections. More information about our AI model-generated leads and derivatives can be found at [40]. These candidates or their variants have a high probability of producing valid leads of anti-COVID-19 drugs. Nevertheless, the computational design requires experimental validation. Although we are currently conducting experimental validations, we hope that other researchers could use these candidates to accelerate the development of anti-COVID-19 drugs, given the urgency of treating the disease.
FBDD has become a key technology in the pharmaceutical industry for early stage drug discovery and development in the last two decades. Molecular fragments are always small in size compared with intact molecules. Several fragments are especially favorable to some protein binding pockets. For example, the α-ketoamides can form a covalent bond with CYS145 of 3CL pro in SARS-CoV-2 and fit into the oxyanion hole, as Figure 6B presented. These fragments with high-quality interactions with protein targets can more likely grow up or integrate into lead compounds. During this growing or integrating process, ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties could be easier to control than working directly on the intact molecule. Based on the above advantages of FBDD, we believe more and more machine-learning-or deep-learning-based molecular generative models will use FBDD to design molecules in the future.
Compared with other deep RL methods, the ADQN-FBDD has several notable features: (1) It directly modifies and generates molecular structures without format conversion problems; other tools, such as in silico Medicine's GENTRL (https://insilico.com), may generate invalid SMILES. (2) Most generative models require pretraining on a specific dataset and produce molecules with high similarities to a given training set. For example, the best molecule from Insilico Medicine for the target DDR1 is highly similar to the kinase inhibitor Iclusig (ponatinib) [43]. The ADQN-FBDD does not require any pre-training and is capable of generating new molecules. (3) The process of generating molecules is highly efficient, as the ADQN-FBBD is a molecular-fragment-based model with knowledge of chemical reactions, whereas other models are all atom-based, with no rules for chemical reactions [30,38,44]. (4) The ADQN-FBDD is highly flexible and user-friendly for medicinal chemists, who can easily introduce their drug discovery experience into the reward function to guide novel molecule generation. Our ADQN-FBDD and related pipeline may be used not only for designing anti-COVID-19 drugs but also for other structure-based drug discoveries, especially for emerging infectious diseases that require timely treatment.

Markov Decision Process (MDP) for Molecule Generation
Intuitively, the problem of chemical structure graph generation can be formulated as learning a reinforced agent, which performs discrete actions of chemical-reaction-based fragment addition or removal in a chemistry-aware MDP. Formally, the components of the MDP include M = {S, A, P, R, γ}, where each term can be defined as follows: S = {S t } denotes the state space containing all possible intermediate and final generated molecular graphs. Each s t is a tuple of (s, t); s represents a valid molecular structure, and t is the time step. For the initial state, s 0 , the structure may be represented as a specific core, as demonstrated in Figure 2C, or randomly chosen from the prepared fragment library at time t = 0. We limited the maximum number of time steps, T, in the fragment-based MDP, which defines the set of terminal states as {s t |t=T}, containing all the states, with the number of steps reaching the maximum allowed value T. Ac = {A t } denotes a set of actions that describe the modification made on the current molecular structure at each time step, t. Each action can be classified into three categories: fragment addition, fragment deletion, and no modification. P = p (s t+1 |s t . . . s 0 ) = p (s t+1 |s t , a t ) is the basic assumption in the MDP. The state transition probability specifies the next possible state given the current state and action at time step, t. Here, we defined the state transition to be deterministic. For example, for S 0 to S 1 (Figure 2C), by adding a 1H-1,2,3-triazol-4-yl fragment on S 0 , the next state S 1 would be the new structure consisting of added 1H-1,2,3-triazol-4-yl with a probability of 1.
R is the reward function that specifies the reward after reaching state S t, and γ (0, 1] is the discount factor; typically, γ = 0.9 in our study. In our framework, the state always had a valid and complete chemical structure at each step, as shown in Figure 2C. A reward was given not just at the terminal states but also at each step. Both intermediate rewards and final rewards were used to guide the behavior of the RL agent. Therefore, there was no delayed or sparse reward issue as experienced with many other reinforced frameworks [36,45]. Additionally, to ensure that the last state was rewarded the most, we used γ T-t to discount the value of the rewards at state s t . Our reward function can directly integrate the experience of medicinal chemists. For example, given a core of interest, medicinal chemists may add biased fragments of interest. Their input can be used to design a reward function that gives high reward signal values to biased fragments so that the ADQN-FBDD may have a better chance of generating the desired structures.

Chemical Environment Design
In our RL framework, the chemical environment receives action, a t , from the agent and emits scalar reward, r t , and state s t to the agent, as shown in Figure 2A. Notably, the definition of the environment state differed from the general approach such that the environment state consisted of only the environment's private representation invisible to the agent. We defined the state of the chemical environment, s t , as the generated intermediate molecular structure at time step, t, as is fully observable by the RL agent. Basically, the environment's state, s t , was equivalent to the agent's state, s t+1 . For the task of molecule generation, the environment can incorporate the rules of chemistry. In our study, chemistry rules were not only about chemical valency but also about adding and removing the fragments of known inhibitors derived from chemical reactions. The detailed information of 45 defined chemical reaction rules is presented in Supplementary Table S2.

Agent Design
As shown in Figure 2A, the basic model of our ADQN-FBDD was an advanced Q-network. The goal of molecule generation was equally to fit a Q function Q(s t , a t ) to make the agent choose the action a t at state s t to maximize the expected γ-discounted cumulative rewards with policy π. Mathematically, given the agent's policy π, the value of the state-action pair Q π (s t , a t ) and the value of state V π (s t ) can each be defined as follows: where E a t ∼π(s t ) is the expectation within policy π on state s t with a t , and R(s n , a n ) denotes the reward at step n. The value function Q π (s t , a t ) measures the value of taking action a t on state s t . V π (s t ) is the value of being at state s t . V π (s t ) can be seen as a part of Q π (s t , a t ). The remaining part of Q π (s t , a t ) can be defined as the so-called advantage function A π (s t , a t ) [43] as follows: Intuitively, the advantage value shows how advantageous selecting the action is relative to the others at the same given state. Equation (2) can be rewritten according to Equation (4): Obviously, E a t ∼π(s t ) [ A π (s t , a t )] = 0. To avoid the issue of identifiability, we subtracted the mean value from the prediction, and the Q-function of the dueling DQN can be defined as follows: Note that Θ, α, and β come from the dueling Q-network, as shown in Figure 8. Moreover, |Ac| is the size of the action space and a t Ac. To make our RL agent more robust for more stable learning and to handle the problem of the overestimation of Q-values, the double Q-network [46] and fixed Q-targets [47] were also incorporated: where TD is the temporal difference; and Q π tar is another dueling DQN network, as the target network and its parameters (Θ , α , β ) were fixed and copied from the dueling DQN Q π every m step (m = 20). To update the parameters (Θ, α, β) from the dueling DQN as shown in Figure 7, we trained our RL agent by minimizing the loss function: where E is the expectation. As the disadvantage of the L2 loss is the tendency to be dominated by outliers, we used the Huber loss as the loss function f l : where E is the expectation. As the disadvantage of the L2 loss is the tendency to be dominated by outliers, we used the Huber loss as the loss function : ( ) = { | | − 0.5 | | ≥ 1 0.5 * 2 | | < 1 (8) Figure 8. Architecture of the applied deep Q-learning network (ADQN). TD is the temporal difference error.

Prioritized Experience Replay
Prioritized experience replays [48] is a technique to enable the RL agent to remember and reuse experiences from the past and to replay important transitions more frequently. Prioritized experience replay is highly useful for replaying some less frequent experiences. Here, we used the "Prioritized Replay Buffer" code from Open AI Gym (version 0.15.4) [49]. Finally, our RL agent was trained in the double-dueling deep Q-learning network with fixed q-targets and prioritized experience replay.

Fragment Library Design
The fragment-based approach to drug discovery (FBDD) has been established as an efficient tool in searching for new drugs [50]. The idea of FBDD is that proper optimization of each unique interaction in the binding site and subsequent incorporation into a single molecular entity should produce a compound with a binding affinity that is the sum of the individual interactions. However, the widely used fragment libraries consider only the diversity of fragments, such as the ZINC fragment database. Therefore, there is a very low probability of achieving the desired bioactivity for a given protein.
To combine FBDD with our RL framework, we first collected and built a SARS-CoV-2 3CL pro inhibitor dataset containing 284 reported molecules. We adopted the improved BRICS algorithm [46] to split these molecules to obtain the fragment library target on

Prioritized Experience Replay
Prioritized experience replays [48] is a technique to enable the RL agent to remember and reuse experiences from the past and to replay important transitions more frequently. Prioritized experience replay is highly useful for replaying some less frequent experiences. Here, we used the "Prioritized Replay Buffer" code from Open AI Gym (version 0.15.4) [49]. Finally, our RL agent was trained in the double-dueling deep Q-learning network with fixed q-targets and prioritized experience replay.

Fragment Library Design
The fragment-based approach to drug discovery (FBDD) has been established as an efficient tool in searching for new drugs [50]. The idea of FBDD is that proper optimization of each unique interaction in the binding site and subsequent incorporation into a single molecular entity should produce a compound with a binding affinity that is the sum of the individual interactions. However, the widely used fragment libraries consider only the diversity of fragments, such as the ZINC fragment database. Therefore, there is a very low probability of achieving the desired bioactivity for a given protein.
To combine FBDD with our RL framework, we first collected and built a SARS-CoV-2 3CL pro inhibitor dataset containing 284 reported molecules. We adopted the improved BRICS algorithm [46] to split these molecules to obtain the fragment library target on SARS-CoV-2 3CL pro , as demonstrated in the flowchart in Figure 1 (yellow box). An elaborate filtering cascade is accompanied by manual inspection, and the rules can be changed based on the needs of different studies. Our fragment library contained 316 fragments with molecular weights of <200 Da and minimum and maximum numbers of non-hydrogen atoms of 1 and 25, respectively. The fragments that come directly from existing inhibitors based on the chemical retrosynthetic rules are always true substructures and may have high bioactivity in targeting 3CL pro . The quality of the designed fragment library can directly affect the properties of the chemical environment of the ADQN-FBDD.

Core Selection
Previous studies have identified various scaffolds or core structures with advantageous characteristics in terms of the activity for a particular target [51][52][53]. Core structure selection is the starting point in scaffold-based drug discovery. However, choosing or designing a proper initial scaffold is not a simple task, and medicinal chemists may require considerable experience. Nevertheless, there are several reported core structures targeting SARS M pro [54]. We chose 4-aminopent-2-enal and 3-amino-2-oxobutanal as the starting cores ( Figure 9) because both cores have been validated to generate covalent bonds with the Cys145 of SARS or SARS-CoV-2 3CL pro .
have high bioactivity in targeting 3CL pro . The quality of the designed fragment library can directly affect the properties of the chemical environment of the ADQN-FBDD.

Core Selection
Previous studies have identified various scaffolds or core structures with advantageous characteristics in terms of the activity for a particular target [51][52][53]. Core structure selection is the starting point in scaffold-based drug discovery. However, choosing or designing a proper initial scaffold is not a simple task, and medicinal chemists may require considerable experience. Nevertheless, there are several reported core structures targeting SARS M pro [54]. We chose 4-aminopent-2-enal and 3-amino-2-oxobutanal as the starting cores ( Figure 9) because both cores have been validated to generate covalent bonds with the Cys145 of SARS or SARS-CoV-2 3CL pro .

Reward Design
Most reported RL methods use the complete structural information of a positive drug or inhibitor as the template [38,44]. They design a reward function for the RL agent to learn to regenerate the template structure or generate highly similar structures to the template. This approach may be useful in testing the performance of RL methods but may not be suitable in real-world drug design because the complete structural information of the novel molecule is unknown. A more practical approach is to learn the structural features of existing drugs or inhibitors in the focused chemical space for a specific protein target. Instead of focusing only on the diversity of molecular structures, we explored the possibility of generating novel molecules based on the existing knowledge. Here, we designed a deep reinforcement learning reward (DRL-reward) function consisting of the final property score with the specific fragments (CSF) score and pharmacophore score as follows: where represents the weight for the quantitative estimate of drug-likeness (QED), and its default value is 0.1; and represents QED. QED values can range from 0 (all

Reward Design
Most reported RL methods use the complete structural information of a positive drug or inhibitor as the template [38,44]. They design a reward function for the RL agent to learn to regenerate the template structure or generate highly similar structures to the template. This approach may be useful in testing the performance of RL methods but may not be suitable in real-world drug design because the complete structural information of the novel molecule is unknown. A more practical approach is to learn the structural features of existing drugs or inhibitors in the focused chemical space for a specific protein target. Instead of focusing only on the diversity of molecular structures, we explored the possibility of generating novel molecules based on the existing knowledge. Here, we designed a deep reinforcement learning reward (DRL-reward) function consisting of the final property score with the specific fragments (CSF) score and pharmacophore score as follows: R(s) = w pro · f pro (s)+w con · f con (s)+w pha · f pha (s) (9) f pro (s) = 1 i f QED(S) > 0.1 0 else where w pro represents the weight for the quantitative estimate of drug-likeness (QED), and its default value is 0.1; and f pro represents QED. QED values can range from 0 (all properties are unfavorable) to 1 (all properties are favorable), calculated based on eight molecular properties [55]. The score function f cs f of CSF is as follows: The binding site of SARS-CoV-2 3CL pro ( Figure 10) is commonly divided into the catalytic activity center (His41 and Cys145; S1 ) and several subsites, defined as S1 (His163, Glu166, Phe140, Leu141, and Asn142), S1 (His41, Cys145, Gly143, and Ser144), S2 (Tyr54, Asp187, His41, Arg188, His164, and Met49), and S3 (Thr190, Gln192, Glu166, Met165, Leu167, and Gln189). Each subsite may have its favorable binding fragment. When generated structures (including intermediates) contain these favorable fragments, f cs f is equivalent to giving an additional reward to our RL agent. Moreover, w con controls the contribution of the biased fragment to the reward signal, and the default value is 0.6; n match represents the number of biased fragments matched in one generated structure; N total is the number of biased fragments defined based on our knowledge from related work; and f pha represents the score function of the pharmacophore, which mainly depends on the ligand-protein interaction mode from the crystal structure (PDB ID: 6LU7): alytic activity center (His41 and Cys145; S1') and several subsites, defined as S1 (His163, Glu166, Phe140, Leu141, and Asn142), S1' (His41, Cys145, Gly143, and Ser144), S2 (Tyr54, Asp187, His41, Arg188, His164, and Met49), and S3 (Thr190, Gln192, Glu166, Met165, Leu167, and Gln189). Each subsite may have its favorable binding fragment. When generated structures (including intermediates) contain these favorable fragments, is equivalent to giving an additional reward to our RL agent. Moreover, controls the contribution of the biased fragment to the reward signal, and the default value is 0.6; ℎ represents the number of biased fragments matched in one generated structure; is the number of biased fragments defined based on our knowledge from related work; and ℎ represents the score function of the pharmacophore, which mainly depends on the ligand-protein interaction mode from the crystal structure (PDB ID: 6LU7): The pharmacophore plot is shown in Supplementary Figure S5. Moreover, ℎ controls the contribution of the pharmacophore score to the reward, and the default value is 0.4. Figure 10. Binding site of SARS-CoV-2 3CL pro . (A) Surfaces of subsites that complement the substrate-binding pocket that are labeled as S1 (green), S1' (red), S2 (magenta), and S3 (cyan). (B) Key residues of the binding site that are presented as green, red, magenta, and cyan lines (S1, S1', S2, and S3). Images of the binding site were generated by using PyMol (http://www.pymol.org/). Figure 10. Binding site of SARS-CoV-2 3CL pro . (A) Surfaces of subsites that complement the substratebinding pocket that are labeled as S1 (green), S1 (red), S2 (magenta), and S3 (cyan). (B) Key residues of the binding site that are presented as green, red, magenta, and cyan lines (S1, S1 , S2, and S3). Images of the binding site were generated by using PyMol (http://www.pymol.org/).

Molecule Generation and Selection
The pharmacophore plot is shown in Supplementary Figure S5. Moreover, w pha controls the contribution of the pharmacophore score to the reward, and the default value is 0.4.

Molecule Generation and Selection
As discussed in Section 4.7, addressing the reward design, our reward function considered the molecular descriptor threshold (QED > 0.1), the defined pharmacophore mode, and biased fragments. A total of 4922 unique valid structures were automatically generated, and all matched the defined rules, using the ADQN-FBDD without any pre-training, as required by other methods [30,36,45,56,57]. Next, all molecules with a high deep-reinforcementlearning score (DRL score: R(S) >= 0.6) were selected (47 molecules), as they were more drug-like, contained more favorable fragments of SARS-CoV-2 3CL pro , and covered more pharmacophore modes compared with other candidates obtained from multiple learning steps. These 47 unique molecules were prepared to generate at least one conformation with local energy minimization, using the OPLS-2005 force field of the "ligand prepare" module of Schrödinger 2015 software. The 47 unique molecules generated a total of 163 3D conformations before docking into the substrate-binding site of SARS-CoV-2 3CL pro . Considering the balance between precision and calculation time, the stand-precision (SP) Glide [58] was initially used to predict the possible non-covalent binding poses in this binding site and the binding site grid centered on the original ligand N3 [59] with 20 Å buffer dimensions. Following non-covalent docking, we also determined the covalent docking poses and scores for the 47 molecules. We reordered the docking results mainly based on the covalent docking score and the RMSD difference between the covalent and non-covalent poses (Supplementary Table S1).