Article

Multi-Objective Drug Molecule Optimization Based on Tanimoto Crowding Distance and Acceptance Probability

School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
*
Author to whom correspondence should be addressed.
Pharmaceuticals 2025, 18(8), 1227; https://doi.org/10.3390/ph18081227
Submission received: 23 June 2025 / Revised: 27 July 2025 / Accepted: 29 July 2025 / Published: 20 August 2025
(This article belongs to the Section Medicinal Chemistry)

Abstract

Background: Traditional molecular optimization methods struggle with high data dependency and significant computational demands. Additionally, conventional genetic algorithms often produce solutions with high similarity, leading to potential local optima and reduced molecular diversity, thereby limiting the exploration of chemical space. Methods: To address these issues, this paper proposes an improved genetic algorithm for multi-objective drug molecular optimization (MoGA-TA). It uses a Tanimoto similarity-based crowding distance calculation and a dynamic acceptance probability population update strategy. The study employs a decoupled crossover and mutation strategy within chemical space for molecular optimization. The proposed crowding distance calculation method better captures molecular structural differences, enhancing search space exploration, maintaining population diversity, and preventing premature convergence. The dynamic acceptance probability strategy balances exploration and exploitation during evolution. Optimization continues until a predefined stopping condition is met. To assess MoGA-TA’s effectiveness, the algorithm is evaluated using the success rate, dominated hypervolume, geometric mean, and internal similarity. Results: Experimental results show that, compared with the baseline methods, MoGA-TA performs better in drug molecule optimization and significantly improves the efficiency and success rate. Conclusions: The method described in this paper is shown to be an effective and reliable method for multi-objective molecular optimization tasks.

1. Introduction

Optimizing drug molecules is a complex and crucial process focused on enhancing their physicochemical properties, biological activity, and selectivity [1,2]. This process constitutes a crucial element in drug discovery and necessitates ongoing optimization via the Design–Synthesis–Test–Analysis (DSTA) cycle. The DSTA cycle is an iterative strategy commonly used in drug optimization, which refers to the continuous improvement of the structure and properties of a candidate molecule through four steps: molecular design, synthesis, experimental testing, and analysis of results. The traditional drug discovery process is faced with long lead times, high costs, and high risks [3], especially in the huge chemical space, estimated to contain about $10^{60}$ molecules, which makes finding molecules with desirable properties in such a large search space an extremely challenging task [4]. It is crucial to develop efficient lead compound optimization strategies to expedite the identification and development of novel drug candidates.
Computer-aided drug design (CADD) has significantly advanced molecular optimization methodologies, encompassing both generative model-based and evolutionary algorithm-based strategies. Nonetheless, generative model-based methods continue to encounter optimization efficiency issues owing to the absence of prior knowledge in drug molecular design and the constraints imposed by limited training datasets. Specifically, as deep generative models (DGMs) have flourished in the field of molecular design [5,6], models such as variational auto-encoders (VAEs) [7,8], recurrent neural networks (RNNs) [9], and generative adversarial networks (GANs) [10], with different molecular representations (e.g., SMILES strings or molecular graphs), have been widely used in molecular generation, attribute prediction, and optimization tasks [11,12]. These models help medicinal chemists efficiently design molecules and generate candidate molecules that match known good molecular properties by mapping discrete molecular representations (e.g., SMILES, molecular graphs) in chemical space to the continuous latent space and combining them with optimization models [13,14,15]. Despite the significant progress made by DGMs in the field of drug molecule optimization, the limitations of the available training data, the insufficient global search capability, and the lack of molecular data with multiple desirable attributes at the same time remain the major bottlenecks limiting the further development of these models [16,17,18]. Evolutionary algorithm-based methods typically require balancing multiple optimization objectives to achieve an optimal solution. Current approaches often employ an aggregation strategy to merge multiple objectives into a single objective, complicating the effective weighting of these objectives. 
Furthermore, most existing multi-objective optimization techniques are limited to optimizing two or three objectives, underscoring the importance of developing methods suitable for a greater number of objectives. Despite significant advancements in drug molecule optimization, addressing the scarcity of high-quality data and improving the balance of objectives in multi-objective optimization remain critical challenges in this domain.
Evolutionary algorithms (EAs) play a crucial role in drug molecule optimization, especially in multi-objective drug molecule design, where they show excellent performance [19,20]. These algorithms are adept at managing multiple optimization objectives concurrently, utilizing evaluation techniques such as non-dominated sorting and crowding distance for molecule selection, thereby efficiently identifying the optimal solution set along the Pareto frontier. The primary benefits of EAs include their robust global search capabilities and thorough exploration of intricate chemical landscapes, which facilitate exceptional performance in addressing molecular design challenges. EAs are recognized for their minimal reliance on extensive prior knowledge or large-scale training datasets, demonstrating significant flexibility in drug molecular design. Recent research, exemplified by studies like MolFinder and EvoMol [21,22], indicates that in certain scenarios, the efficacy of EAs not only matches but sometimes surpasses that of Deep Generative Models (DGMs) in specific tasks [23,24]. These studies emphasize the great potential of EAs in the field of drug molecule optimization, and in particular, they exhibit significant advantages in dealing with multi-objective optimization problems. Although EAs have been widely used in the development of ab initio drug discovery models, these models mainly focus on designing new molecules from scratch and are more focused on 2–3 objectives [25]. Therefore, there is still room for improving the optimization efficiency in generating structurally similar molecules with target properties based on existing lead compounds.
In the field of multi-objective evolutionary algorithms (MOEA), the non-dominated sorting genetic algorithm II (NSGA-II) has attracted much attention due to its high efficiency and excellent ability to maintain population diversity [26,27]. The algorithm selects individuals by non-dominated sorting and crowding calculations, thus maintaining population diversity and guiding the evolution of the population towards the Pareto front. In addition, the Tanimoto coefficient has an important role in the optimization process of drug molecules. Based on the principle of set theory, this coefficient measures the similarity of two sets by quantifying the ratio of their intersection to their concatenation [28]. Tanimoto coefficients are widely used in molecular similarity metrics and provide a powerful tool for tasks such as molecular clustering, classification, and retrieval [29]. Through these approaches, EAs are not only able to identify new molecules with desirable properties but also significantly enhance the efficiency and innovativeness of the drug discovery process, extend the exploration of chemical space, and enable fast and efficient searches.
We present an algorithm named MoGA-TA, which calculates a Tanimoto similarity-based crowding distance and incorporates a dynamic population updating strategy that adjusts the acceptance probability for molecular optimization. This algorithm integrates the multi-objective optimization capabilities of NSGA-II with the advantages of Tanimoto coefficient similarity measures. It is designed to optimize multiple objectives concurrently, including enhancing efficacy, reducing toxicity, increasing solubility, and improving other performance metrics. Through an iterative search process, it generates a set of non-dominated solutions that balance these objectives [30]. The Tanimoto crowding-based mechanism captures structural differences between molecules, preserving diverse structures and guiding population evolution. The acceptance probability-based population update strategy enables broader exploration of chemical space during early evolution, balancing exploration and exploitation while preventing premature convergence to local optima. In later stages, this strategy effectively retains superior individuals, allowing the population to gradually converge towards the global optimum.

2. Results and Discussion

2.1. Benchmark Evaluation of MoGA-TA

To evaluate the performance of MoGA-TA, we compared MoGA-TA with NSGA-II and GB-EPI on six multi-objective molecular optimization tasks, of which the first five were from the GuacaMol benchmarking platform, and the sixth was aimed at optimizing the biological activity and drug-like properties [31,32,33].

2.1.1. Test Tasks for Molecular Optimization

In this study, we use datasets from the ChEMBL database. The six benchmark tasks cover different molecular properties and optimization objectives. The six tasks are specified as follows:
  • Task 1 (Fexofenadine): Tanimoto similarity (AP), TPSA, logP.
  • Task 2 (Pioglitazone): Tanimoto similarity (ECFP4), molecular weight, number of rotatable bonds.
  • Task 3 (Osimertinib): Tanimoto similarity (FCFP4), Tanimoto similarity (FCFP6), polar surface area (TPSA), logP.
  • Task 4 (Ranolazine): Tanimoto similarity (AP), polar surface area (TPSA), logP, number of fluorine atoms.
  • Task 5 (Cobimetinib): Tanimoto similarity (FCFP4), Tanimoto similarity (ECFP6), number of rotatable bonds, number of aromatic rings, CNS [34].
  • Task 6 (DAP kinases): DAPk1, DRP1, ZIPk, QED, logP.
The objectives in the six optimization tasks include the similarity between the molecule and the target drug, specific structural properties of the molecule, and physicochemical properties. As shown in Table 1, the similarity score is the Tanimoto similarity between the fingerprints of the target molecule and the generated candidate molecule. The fingerprints were computed with the RDKit software package (version 2022.09) and include ECFP, FCFP, and atom pair (AP) fingerprints. The similarity scores and the property scores are then mapped to the [0, 1] interval by the corresponding modifier functions in Table 1. The scoring functions can be calculated from the SMILES string or the molecular graph of the molecule, and the TPSA and logP scores are calculated by the RDKit software package [35].
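As an illustration of how a raw property value or similarity can be mapped to the [0, 1] interval, the following minimal sketch implements three modifier shapes commonly used in benchmark suites of this kind. The function names and the parameter values in the usage example are illustrative assumptions; the actual modifiers and parameters of each task are those listed in Table 1.

```python
import math


def thresholded_linear(x: float, threshold: float) -> float:
    """Rise linearly from 0 and saturate at 1 once x reaches the threshold."""
    return min(x, threshold) / threshold


def gaussian(x: float, mu: float, sigma: float) -> float:
    """Score 1 at x == mu, decaying symmetrically on both sides."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def min_gaussian(x: float, mu: float, sigma: float) -> float:
    """Score 1 for x <= mu, Gaussian decay above mu (rewards small values)."""
    return 1.0 if x <= mu else gaussian(x, mu, sigma)


# Illustrative parameters only (assumed, not taken from Table 1).
similarity_score = thresholded_linear(0.4, threshold=0.8)
tpsa_score = gaussian(95.0, mu=90.0, sigma=10.0)
logp_score = min_gaussian(3.0, mu=4.0, sigma=1.0)
```

Every modifier returns a value in [0, 1], so objectives measured on different scales become directly comparable.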

2.1.2. Evaluation Metrics

To evaluate each optimization task comprehensively, we used four evaluation metrics: success rate, hypervolume, geometric mean, and internal similarity. Specifically, the success rate (SR) is the percentage of generated molecules that satisfy all the target conditions. The geometric mean measures the overall performance of the generated molecules across multiple target attributes. We use an extended similarity index to calculate and track the internal similarity of evolving populations. In an optimization task, a molecule satisfies an objective when its value meets the threshold given by the corresponding modifier function. For example, for the optimization task Fexofenadine, molecules that satisfy the Tanimoto (AP) condition are those with similarity less than 0.8; molecules that satisfy the condition for the target TPSA are those with TPSA scores in the range [80, 100]; and molecules that satisfy the condition for the target logP are those with logP values in the range [2, 6]. The hypervolume metric is a common measure of the convergence and diversity of algorithms in multi-objective optimization. It calculates the hypervolume of the objective space dominated by the Pareto solutions and is mathematically defined as shown in Equation (1):
$$HV = \delta\left(\bigcup_{i=1}^{|S|} v_i\right) \qquad (1)$$
where $HV$ represents the hypervolume [36], $\delta$ is the Lebesgue measure used to evaluate the hypervolume, $|S|$ represents the number of solutions in the non-dominated solution set (Pareto solutions), and $v_i$ represents the hypervolume between the reference point and the $i$-th solution in the solution set, which is a hyperrectangle formed by the reference point (origin) and the $i$-th solution in the non-dominated solution set.
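For two maximised objectives with the origin as the reference point, the union of hyperrectangles in Equation (1) reduces to an area that can be accumulated with a simple sweep. The following minimal sketch assumes non-negative scores, as produced by modifiers that map into [0, 1]; the function name is an illustrative choice, and tasks with more objectives require a general hypervolume algorithm.

```python
from typing import Iterable, Tuple


def hypervolume_2d(points: Iterable[Tuple[float, float]]) -> float:
    """Area of the union of rectangles [0, x] x [0, y], one per solution.

    Both objectives are maximised and the reference point is the origin.
    """
    area = 0.0
    best_y = 0.0
    # Sweep from the largest first objective to the smallest; each point only
    # contributes the strip that lies above everything already covered.
    for x, y in sorted(points, key=lambda p: p[0], reverse=True):
        if y > best_y:
            area += x * (y - best_y)
            best_y = y
    return area


front = [(1.0, 0.5), (0.5, 1.0), (0.4, 0.4)]  # the last point is dominated
hv = hypervolume_2d(front)
```

Dominated points add nothing to the result, which is why the metric rewards both convergence (points far from the origin) and spread (points covering different trade-offs).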

2.2. Comparisons on Test Tasks

In our experiments, we compared MoGA-TA with NSGA-II and GB-EPI [32,37]. To enhance the precision and impartiality of the experimental outcomes, identical benchmarking and experimental datasets were utilized for MoGA-TA, NSGA-II, and GB-EPI. Furthermore, the experimental configuration of the NSGA-II methodology was adopted, which entailed conducting 20 trials per optimization task, with a population size of 100 and 150 iterations. All experiments were conducted on a computing platform equipped with an RTX 3090 graphics card and an Intel(R) Xeon(R) Gold 6226R processor. The Python version used was 3.7.
First, Table 2 reports the average values of the three methods on each evaluation metric [32]. In the six test tasks, the proposed MoGA-TA method significantly outperforms GB-EPI across all evaluation metrics and demonstrates superior overall performance compared to NSGA-II. Figure 1 illustrates the success rates of the three methods in the six optimization tasks. In tasks involving three optimization objectives, MoGA-TA achieves a success rate of 73%, which is 17% higher than GB-EPI and approximately 9% higher than NSGA-II. For tasks with four optimization objectives, such as Osimertinib optimization, MoGA-TA’s success rate exceeds that of NSGA-II by 19%. In the tasks involving five optimization objectives, the success rate of MoGA-TA for the optimization of multi-kinase inhibitors is still higher than that of the baseline methods and 9% higher than that of NSGA-II. MoGA-TA also improves the hypervolume index, which reflects gains in the diversity and convergence of the generated molecules. In short, MoGA-TA achieves better average values overall, with notable improvements in the hypervolume and success rate indicators, which supports its effectiveness and feasibility in molecular optimization.
Second, to visualize the distribution of the molecules generated by the three methods in chemical space, Figure 2 shows the molecular spatial distribution for two representative benchmarks: Fexofenadine and Pioglitazone. As can be seen from Figure 2a, in the Fexofenadine optimization task, the molecules generated by MoGA-TA are more uniformly distributed and cover a wider range of chemical space, suggesting that MoGA-TA explores chemical space more broadly. In contrast, the molecular distribution of NSGA-II is more concentrated: molecules are dense in some areas and sparse or nearly absent in others, which suggests that NSGA-II explored these regions insufficiently. GB-EPI generated a higher percentage of molecules with high TPSA scores but fewer molecules within the TPSA score interval [80, 100], in addition to some molecules with lower TPSA scores. The figure also shows that the overall similarity of the molecules generated by NSGA-II and GB-EPI is higher than that of the molecules generated by MoGA-TA, whereas the similarity distribution of the MoGA-TA molecules is more uniform, which is better suited to generating molecules with higher diversity and different similarity levels. Figure 2b shows that in the Pioglitazone optimization task, the distribution of the molecules generated by NSGA-II is more concentrated and fails to explore chemical space effectively, while GB-EPI generates a relatively high proportion of molecules with a large number of rotatable bonds. However, too many rotatable bonds may lead to instability of the molecular structure.
Comparatively, most of the molecules generated by the MoGA-TA method have 1 or 2 rotatable bonds, and the generated molecules have a wider coverage and more dispersed distribution in the chemical space, which enables the exploration of more effective molecules. This suggests that the MoGA-TA method is able to explore different chemical spaces while maintaining the stability of the molecular structure. In addition, the molecules generated by the MoGA-TA method are more homogeneous in similarity distribution compared to GB-EPI, which helps to maintain the diversity of the generated molecules. In summary, the MoGA-TA method is able to discover molecules with large structural differences and exhibits a more uniform distribution in similarity scores, a property that is particularly prominent in multi-objective optimization tasks.
The molecular optimization method MoGA-TA, as presented in this paper, demonstrates superior performance as an optimization model. It excels particularly in tasks related to molecular structure optimization and multi-objective molecular property optimization. This performance can be attributed to its crowding distance calculation method and dynamic acceptance probability strategy. These strategies enhance the ability to discern structural differences between molecules and effectively manage the balance between exploration and exploitation. Through dynamic adjustment of the acceptance probability, the algorithm introduces necessary randomness, preventing premature convergence to local optima and fostering the generation of diverse molecular structures.
Finally, this paper uses indicators such as the success rate and geometric mean to show that the method improves the efficiency of generating molecules that meet the optimization requirements in multi-objective optimization tasks. Visualizing the distribution of the generated molecules in chemical space confirms that the method explores a wider range of chemical space during molecular optimization. The experimental results show that MoGA-TA achieves relatively good results on the optimization tasks considered in this paper.

2.3. Ablation Experiment

The ablation study evaluates two key components of the population updating strategy: the acceptance probability that is dynamically adjusted according to the number of iterations, and the crowding distance computation based on Tanimoto similarity. First, we compared MoGA-T, the variant without acceptance probabilities, with NSGA-II. The data in Table 2 and Table 3 show that the hypervolume and success rate improve with the Tanimoto similarity-based crowding distance computation. Notably, an 11% increase in success rate was observed on the Osimertinib task, which may be attributed to the method’s ability to better highlight structural differences between molecules and supports its effectiveness. Subsequently, MoGA-T was compared with MoGA-TA. Table 2 and Table 3 show a significant improvement in all optimization tasks after implementing the probabilistic acceptance strategy. This improvement is attributed to the fact that acceptance probabilities allow underperforming individuals to remain in the population, thus maintaining diversity, preventing premature convergence to local optima, and providing breeding opportunities for less fit individuals.

3. Materials and Methods

3.1. Related Work

Evolutionary algorithms play an important role in multi-objective drug molecule optimization, especially when dealing with multiple conflicting objectives and complex chemical spaces. In order to better introduce the method proposed in this paper, this section will first review the basic concepts and applications of drug molecule optimization and genetic algorithms. Through the discussion of existing methods, we will further elucidate the current status of practical applications of multi-objective optimization in drug molecule design, as well as the development trends and challenges in this field.

3.1.1. Non-Dominated Sorting Genetic Algorithm II (NSGA-II)

The non-dominated sorting genetic algorithm II (NSGA-II) is an evolutionary algorithm for solving multi-objective optimization problems proposed by Deb et al. in 2002 [28]. The algorithm efficiently solves optimization problems with multiple conflicting objectives through non-dominated sorting and crowding distance comparison. The core idea of NSGA-II is to simulate natural selection through selection, crossover, and mutation operations, and thus to generate progressively better solutions in the population. The individuals in the population are first ranked by non-dominated sorting and grouped into successive fronts based on their dominance relationships. Subsequently, the crowding distance is used to measure the distribution of individuals within each front, which maintains the diversity of the population and prevents all solutions from clustering in one region. The pseudo-code is shown in Algorithm 1 [28,33]. In the field of drug molecule optimization, NSGA-II has been widely used in drug design, molecule optimization, and property prediction [38]. For example, methods such as the graph-based elite patch illumination algorithm (GB-EPI) [35] and graph-based molecular Pareto optimization [33] have shown good performance in drug molecule optimization.
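The grouping of a population into successive fronts can be sketched as follows. This is a deliberately simple peeling procedure for maximised objectives, not the faster bookkeeping variant of the original NSGA-II paper; the function names are illustrative assumptions.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def sort_into_fronts(scores: List[Sequence[float]]) -> List[List[int]]:
    """Return index lists: front 0 is non-dominated, front 1 is
    non-dominated once front 0 is removed, and so on."""
    remaining = set(range(len(scores)))
    fronts: List[List[int]] = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


scores = [(1.0, 1.0), (0.5, 0.5), (1.0, 0.2), (0.2, 1.0), (0.1, 0.1)]
fronts = sort_into_fronts(scores)
```

In this example the first molecule dominates all others and forms the first front alone, the next three are mutually non-dominated, and the last one is dominated by every other molecule.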

3.1.2. Multi-Objective Molecular Optimization

The multi-objective molecular optimization is of great significance in the field of drug design and molecular engineering, involving the optimization of multiple, often conflicting, objectives, such as the biological activity, toxicity, solubility, stability, and pharmacokinetic properties of drug molecules [1,2,39]. Due to the intrinsic competition between these objectives, single-objective optimization methods make it difficult to take into account the needs of multiple objectives, and therefore, multi-objective optimization methods have become an inevitable choice [20,21]. The Pareto-based methodology is pivotal in multi-objective molecular optimization. It delineates the balance among objectives within the solution set through the “non-domination” principle. Utilizing Pareto optimal solution sets, it seeks to generate a collection of solutions that attain a relative equilibrium across objectives. These sets form Pareto frontiers, symbolizing optimal trade-offs between various objectives. The primary objective is to identify solution sets proximate to the Pareto frontier, leveraging optimization algorithms to achieve equilibrium among multiple objectives.
Algorithm 1 The framework of the algorithm for NSGA-II.
Require: P (initial population), N (number of generations)
Ensure: P_final (Pareto front)
fronts ← fronts_sorting(P)
crowding_distance ← crowding_distance(fronts)
for i = 0 to N do
    P'_i ← mutation(P_i) + crossover(P_i)
    fronts ← fronts_sorting(P_i + P'_i)
    crowding_distance ← crowding_distance(fronts)
    P_{i+1} ← Select_satisfy_conditions()
end for
Currently, the commonly used evolutionary algorithms for multi-objective molecular optimization include the non-dominated sorting genetic algorithm II (NSGA-II), the conformational space annealing algorithm (MolFinder), etc. [23,28]. These algorithms ensure diversity and comprehensive exploration within the target space, thereby preventing the risk of converging to suboptimal solutions. This capability facilitates efficient navigation through chemical space. Given the escalating demands for precision and efficiency in drug design, enhancing the optimization strategy is crucial. By refining evolutionary algorithms and exploring superior methodologies, the efficacy and performance of multi-objective molecular optimization can be significantly improved. These advanced strategies enable more effective exploration of chemical space, effectively manage the trade-offs between various objectives, and augment both the diversity and quality of the solution set, ultimately aligning with the requirements of drug design for molecular generation and optimization [31].
In conclusion, multi-objective molecular optimization holds significant value in drug discovery and material design. As optimization algorithms advance and computational power improves, this method will increasingly influence molecular design and optimization, thereby accelerating the discovery of new drugs and enhancing the efficiency of molecular synthesis.

3.1.3. Crossover and Mutation Operations

The genetic algorithm (GA) represents a traditional optimization technique grounded in principles of natural selection and genetic processes. In GA, crossover and mutation operations are pivotal for generating new individuals, preserving population diversity, and enhancing the quality of offspring. This study employed a decoupled crossover and mutation strategy to achieve more efficient molecular optimization. The crossover strategy significantly expands the exploration of the chemical space by decomposing the parent molecule into core structures and side-chain components, which are then randomly recombined. During the mutation process, different offspring molecules are obtained by adding, replacing, or deleting the side chains of the molecules [33]. The above crossover mutation strategy not only significantly improves the algorithm’s ability to explore the solution space but also enhances its adaptability and the quality of the generated molecules, thereby more effectively discovering high-quality solutions during the optimization process [19,32,37].
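The decoupled operators described above can be illustrated on a toy representation in which a molecule is stored as a core plus a tuple of side chains. This is a sketch under strong assumptions: the `Fragmented` class, the fragment strings, and the side-chain library are hypothetical, and a real implementation would operate on molecular graphs with attachment points and validity checks.

```python
import random
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Fragmented:
    """Toy molecule: one core fragment and its side chains (hypothetical)."""
    core: str
    side_chains: Tuple[str, ...]


def crossover(a: Fragmented, b: Fragmented, rng: random.Random) -> Fragmented:
    """Keep one parent's core and recombine side chains drawn from both parents."""
    core = rng.choice([a.core, b.core])
    pool = a.side_chains + b.side_chains
    n = rng.choice([len(a.side_chains), len(b.side_chains)])
    return Fragmented(core, tuple(rng.sample(pool, n)))


def mutate(m: Fragmented, library: Sequence[str], rng: random.Random) -> Fragmented:
    """Add, replace, or delete one side chain; the core is left unchanged."""
    chains = list(m.side_chains)
    op = rng.choice(["add", "replace", "delete"])
    if op == "add" or not chains:
        chains.append(rng.choice(library))
    elif op == "replace":
        chains[rng.randrange(len(chains))] = rng.choice(library)
    else:
        del chains[rng.randrange(len(chains))]
    return Fragmented(m.core, tuple(chains))


rng = random.Random(0)
parent_a = Fragmented("c1ccccc1", ("C", "O"))
parent_b = Fragmented("c1ccncc1", ("F", "N", "CC"))
child = crossover(parent_a, parent_b, rng)
mutant = mutate(child, library=["Cl", "OC", "C(=O)O"], rng=rng)
```

Keeping crossover and mutation as separate functions mirrors the decoupled strategy: recombination changes which fragments are combined, while mutation changes a single side chain.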

3.2. Method

This section introduces a multi-objective optimization approach utilizing Tanimoto crowding distance and acceptance probability (MoGA-TA) to generate and evaluate compounds meeting multi-objective optimization criteria. The framework integrates essential operations, including Tanimoto-based crowding degree, crossover, and mutation, to sustain population diversity and convergence while enhancing the success rate of molecular optimization in multi-objective tasks. Additionally, a dynamic acceptance probability mechanism, adjusted according to the number of generations, is implemented to maintain a balance between diversity and convergence throughout the evolutionary process. Subsequently, the multi-objective molecular optimization problem is outlined, followed by a detailed description of the methodological framework’s primary components and processes.

3.2.1. Problem Definition

The optimization of lead compounds within the drug discovery and development process constitutes a multi-objective optimization challenge focused on enhancing various critical molecular properties. This process can be structured as an optimization task aimed at identifying the optimal molecular configuration, with each molecular characteristic serving as an objective for optimization. Starting from an initial lead compound, the objective is to navigate the chemical landscape to discover novel molecular structures that enhance multiple target properties. These properties may encompass elevated biological activity, improved selectivity, superior pharmacokinetic characteristics, and diminished toxicity. Mathematically, this multi-objective optimization problem can be defined as in Equation (2):
$$\max_{x \in \Omega} \left( f_1(x), f_2(x), \ldots, f_n(x) \right) \qquad (2)$$
where $x$ denotes a molecule, $\Omega$ denotes the chemical space consisting of all molecules, and $f_1(x), f_2(x), \ldots, f_n(x)$ denote the $n$ objectives to be optimized, each representing a particular molecular property. The solution to this problem is usually a collection of multiple molecular structures, known as a Pareto front. Each molecular structure provides a different trade-off between multiple objectives. The molecular properties to be considered for a typical situation include the following:
  • Chemical rationality assessment: including synthetic accessibility (SA score), drug-likeness (QED), logP (lipophilicity), etc.
  • Topological polar surface area (TPSA): used to predict the ability of a molecule to cross a cell membrane [39].
  • Molecular structure characteristics: including the number of aromatic rings (Number of Aromatic Rings), the number of rotatable bonds (Number of Rotatable Bonds), etc.
  • Bioactivity: refers to the ability of a molecule to interact with a biological target (such as an enzyme, receptor, or ion channel).
  • Similarity: The similarity score with the target molecule, usually calculated using Tanimoto similarity (based on different fingerprint types, such as FCFP4, ECFP6, etc.) [40]. Tanimoto similarity was calculated using Equation (3) below:
$$Sim(x, x_0) = \frac{|fp(x) \cap fp(x_0)|}{|fp(x)| + |fp(x_0)| - |fp(x) \cap fp(x_0)|} \qquad (3)$$
where $fp(x)$ and $fp(x_0)$ represent the Morgan fingerprints of molecules $x$ and $x_0$, respectively [41], and $|\cdot|$ denotes the size of a set, i.e., the number of elements in the set. The formula is the ratio of the intersection to the union of the two feature sets. It reflects the degree of overlap between the structural features of two molecules and takes a value between 0 and 1. The closer the value is to 1, the higher the similarity. In this study, we used the RDKit toolkit to generate Morgan fingerprints and calculate Tanimoto similarity [35].
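The Tanimoto similarity is the ratio of shared features to the total number of distinct features in the two fingerprints. The following minimal sketch computes it on plain Python sets, where each integer stands in for an on-bit of a fingerprint. The function name and the convention for two empty fingerprints are assumptions made for this illustration; in practice the fingerprints and the similarity would come from RDKit.

```python
from typing import Iterable


def tanimoto(fp_a: Iterable[int], fp_b: Iterable[int]) -> float:
    """Intersection over union of two sets of fingerprint features."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


lead = {1, 2, 3}
candidate = {2, 3, 4}
similarity = tanimoto(lead, candidate)  # 2 shared features out of 4 distinct
```

Identical feature sets give 1, disjoint sets give 0, and the quantity 1 − similarity can serve as a structural distance between two molecules.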

3.2.2. Framework of MoGA-TA

Algorithm 2 delineates the primary framework of the proposed multi-objective optimization algorithm, MoGA-TA. The algorithm consists of the following steps: first, a population of size N is randomly selected from the specified dataset, and its fitness is calculated. Subsequently, offspring are generated through crossover and mutation operations, and their fitness is calculated. Next, the parent and child molecules are merged, Pareto sorting is used to obtain the Pareto front, and the crowding distance of the molecules in the Pareto front is calculated based on the Tanimoto distance. Finally, for frontiers requiring selection operations, they are sorted in descending order by crowding distances, and molecules entering the next generation are determined based on acceptance probability. This evolutionary process continues until the termination condition is met. The selection procedure integrates crowding distance sorting and acceptance probability to ensure the next generation of molecules is selected based on fitness and diversity metrics, thereby preserving population diversity. Consequently, the algorithm achieves a balance between exploration and exploitation, enhancing the likelihood of discovering a globally optimal solution. Evaluation metrics, including success rate, dominant hypervolume, geometric mean, and internal similarity, are employed to assess the overall optimization quality and population diversity. The fundamental process of the algorithm resembles that of the classical NSGA-II method, with enhancements in crowding distance calculations and dynamic acceptance probabilities based on Tanimoto similarity. Further details on other key components of MoGA-TA will be provided subsequently.
Algorithm 2 The framework of the MoGA-TA algorithm
Require: P (initial population), N (number of generations)
Ensure: Optimized molecules
for i = 0 to N do
    P'_i ← mutation(P_i) + crossover(P_i)
    fronts ← fronts_sorting(P_i + P'_i)
    crowding_dis ← crowding_distance(fronts)
    accept_prob ← acceptance_probability(i)
    P_{i+1} ← [ ]
    for each front in fronts do
        if satisfy_splitting_condition(front) then
            P_{i+1} ← sort_selection(front, accept_prob)
        else
            P_{i+1} ← P_{i+1} + front
        end if
    end for
end for
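The inner loop of Algorithm 2 can be sketched in Python as follows. The text does not spell out `satisfy_splitting_condition` or `sort_selection`, so this sketch makes three labeled assumptions: a front is split when it no longer fits into the remaining slots, candidates in a split front are visited in descending order of crowding distance and each is accepted with probability `accept_prob`, and any slots still empty afterwards are filled with the skipped molecules in the same order. The helper names and the `capacity` parameter are hypothetical.

```python
import random
from typing import Dict, List


def sort_selection(front: List[str], crowding: Dict[str, float], slots: int,
                   accept_prob: float, rng: random.Random) -> List[str]:
    """Choose `slots` molecules from one front, favouring large crowding distance."""
    ranked = sorted(front, key=lambda mol: crowding[mol], reverse=True)
    chosen: List[str] = []
    skipped: List[str] = []
    for mol in ranked:
        if len(chosen) < slots and rng.random() < accept_prob:
            chosen.append(mol)
        else:
            skipped.append(mol)
    # Assumed fallback: top up with skipped molecules, best-ranked first.
    chosen.extend(skipped[: slots - len(chosen)])
    return chosen


def next_generation(fronts: List[List[str]], crowding: Dict[str, float],
                    capacity: int, accept_prob: float,
                    rng: random.Random) -> List[str]:
    """Fill the next population front by front, splitting the last front if needed."""
    population: List[str] = []
    for front in fronts:
        remaining = capacity - len(population)
        if remaining <= 0:
            break
        if len(front) > remaining:  # assumed splitting condition
            population.extend(
                sort_selection(front, crowding, remaining, accept_prob, rng))
        else:
            population.extend(front)
    return population


# Toy example: molecule labels and crowding values are made up.
fronts = [["a", "b"], ["c", "d", "e"]]
crowding = {"a": 0.3, "b": 0.6, "c": 0.2, "d": 0.9, "e": 0.5}
print(next_generation(fronts, crowding, 4, 1.0, random.Random(0)))
# ['a', 'b', 'd', 'e']
```

With `accept_prob` equal to 1.0 the split front is truncated deterministically by crowding distance; lower values let less crowded-distance-favoured molecules through, which is the source of the extra exploration in early generations.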

3.2.3. Crowding Distance Based on Tanimoto Similarity

In NSGA-II (non-dominated sorting genetic algorithm II), the crowding distance is used to distinguish individuals within the same Pareto front. Traditional NSGA-II usually considers only the objective function values when calculating it: the solutions in a non-dominated front are sorted along each objective dimension, the normalized difference between adjacent individuals in that dimension is used to estimate the local density (sparseness), and the differences over all dimensions are summed linearly to give the crowding degree. However, this calculation takes no account of the chemical structure of the molecules and therefore ignores the diversity of molecular structures. In drug design, the chemical structure of a molecule is pivotal, and when the objective functions are insensitive to structural variations, selection on objective values alone can result in structural homogenization. The Tanimoto similarity-based crowding distance reflects molecular structural similarity regardless of the objective function values. This method improves the identification of structural differences among molecules, thereby fostering the generation of diverse molecular structures. In drug discovery, diverse molecular libraries significantly boost the likelihood of identifying effective drugs. In molecular optimization tasks, molecular structure (represented by SMILES) and chemical properties (such as similarity and functional group distribution) are paramount. Tanimoto similarity directly quantifies structural differences, which helps retain structurally diverse molecules within the population.
In particular, during the initial molecular screening phase, maintaining diversity in the molecular library increases the probability of discovering novel and effective compounds. During optimization, directly accounting for structural differences allows the approach to generate a richer set of candidate molecules. The distance between pairs of molecules is calculated using the Tanimoto distance [33], which is defined in Equation (4):
$$d_{ij} = d(s_i, s_j) = 1 - \frac{|s_i \cap s_j|}{|s_i \cup s_j|} \qquad (4)$$
where $|s_i \cap s_j|$ denotes the number of sites common to the fingerprints of molecules $s_i$ and $s_j$, and $|s_i \cup s_j|$ denotes the number of sites present in at least one of the two fingerprints (the fingerprint used in this paper is ECFP4). Based on the distances between pairs of molecules, we calculated the average distance between each molecule and all other molecules as a measure of crowding. Combining the crowding distance with Tanimoto similarity in this way allows the diversity and structural features of molecules to be evaluated more effectively in multi-objective optimization.
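The crowding measure described above, the average Tanimoto distance from each molecule to all others in the front, can be sketched as follows. Fingerprints are again represented as plain sets of on-bit indices rather than RDKit ECFP4 objects, and the function names, the handling of empty fingerprints, and the value returned for a front with fewer than two molecules are assumptions made for illustration.

```python
from typing import AbstractSet, List, Sequence


def tanimoto_distance(fp_a: AbstractSet[int], fp_b: AbstractSet[int]) -> float:
    """Tanimoto distance: 1 minus intersection size over union size."""
    union_size = len(fp_a | fp_b)
    if union_size == 0:
        # Assumed convention for two empty fingerprints.
        return 0.0
    return 1.0 - len(fp_a & fp_b) / union_size


def tanimoto_crowding(fingerprints: Sequence[AbstractSet[int]]) -> List[float]:
    """Mean Tanimoto distance from each molecule to every other molecule."""
    n = len(fingerprints)
    if n < 2:
        # Assumed convention: crowding is 0.0 when there is nothing to compare with.
        return [0.0] * n
    crowding = []
    for i in range(n):
        total = sum(
            tanimoto_distance(fingerprints[i], fingerprints[j])
            for j in range(n) if j != i
        )
        crowding.append(total / (n - 1))
    return crowding


# Toy front: the first two fingerprints are close, the third is unrelated.
front = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}]
print([round(c, 2) for c in tanimoto_crowding(front)])  # [0.7, 0.7, 1.0]
```

The structurally isolated third molecule receives the largest value, so a descending sort on this measure favours it during selection. The double loop is quadratic in the front size; a vectorized or bulk-similarity routine would be the natural replacement for large fronts.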

3.2.4. Population Update

To steer the population towards higher-quality candidate molecules, MoGA-TA employs a selection mechanism that integrates Tanimoto similarity-based crowding distances, Pareto ranking, and acceptance probabilities. After the parent and offspring generations are merged, the merged molecules are first sorted by non-domination, and the crowding distance is calculated. Then, the acceptance probability for the current generation is calculated according to Equation (5). Molecules are ranked by dominance rank and crowding distance and are admitted to the parent population of the next iteration according to the acceptance probability. During crossover and mutation, parent individuals are selected via random sampling based on fitness values. For mutation, individuals with higher fitness values have a higher likelihood of selection. For crossover, a comprehensive evaluation based on fitness values ensures that higher-fitness individuals are chosen more frequently. This selection process considers individual fitness while maintaining diversity through randomness, thereby mitigating the risk of premature convergence. The acceptance probability strategy balances exploration and exploitation by introducing randomness into the selection process, which helps prevent premature convergence to local optima. Early in the run, a higher degree of exploration is permitted, allowing diverse individuals to survive. As evolution progresses, the focus shifts to selecting superior individuals, and the population gradually converges towards the optimal solution. Each molecule has a probability of being selected for the next generation, though elimination is also possible. The acceptance probability is given by Equation (5):
$$p_a = e^{-(1/g)^{\beta}} \qquad (5)$$
where $g$ denotes the current generation number, $p_a$ denotes the probability that the current molecule is accepted, and $\beta$ controls the rate of decay and is set to 0.45 in this paper. A smaller value of $\beta$ makes the acceptance probability increase more slowly, while a larger value of $\beta$ makes it approach 1 more quickly. Finally, this population update strategy maintains the diversity of the population during generation and selection while exploiting the good individuals that are already available. This balance helps to achieve better convergence and diversity.
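To make the behaviour of the acceptance probability concrete, the sketch below evaluates it over a few generations. It assumes that Equation (5) reads $p_a = e^{-(1/g)^{\beta}}$, the form consistent with the description above (the probability rises towards 1 as $g$ grows, and faster for larger $\beta$), and that generations are counted from 1. The function name is hypothetical.

```python
import math


def acceptance_probability(generation: int, beta: float = 0.45) -> float:
    """Assumed form of Equation (5): p_a = exp(-(1/g)**beta), with g >= 1."""
    if generation < 1:
        # Assumption: generations are counted from 1 so that 1/g is defined.
        raise ValueError("generation must be >= 1")
    return math.exp(-((1.0 / generation) ** beta))


# The probability starts at exp(-1) and increases monotonically towards 1.
for g in (1, 10, 100, 1000):
    print(g, round(acceptance_probability(g), 3))

# A larger beta gives a higher probability at the same generation (for g > 1).
print(acceptance_probability(10, beta=0.9) > acceptance_probability(10, beta=0.45))  # True
```

Under this reading, roughly one in three candidates is accepted on the first pass in generation 1, while late generations behave almost like the deterministic truncation used in NSGA-II.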

4. Conclusions

This study formulates the optimization of lead drug molecules as a multi-objective optimization problem. A novel optimization method, MoGA-TA, is introduced, whose population update strategy combines a Tanimoto similarity-based crowding distance with a dynamically adjusted acceptance probability. The crowding calculation strategy in MoGA-TA helps capture the structural differences between molecules, thereby generating diverse molecular structures that are critical to improving the probability of drug discovery. The method improves selection accuracy by standardizing the crowding values, which allows fair comparison across different scales. In addition, robust handling of invalid fingerprints and efficient computation on large datasets enhance the adaptability and computational efficiency of the algorithm. Together, these features provide an effective strategy for solving complex molecular optimization problems. The dynamic adjustment of the acceptance probability in the population update strategy balances exploration and exploitation within the multi-objective optimization framework. By regulating the probability that individuals with poor fitness enter the next generation, the method helps maintain diversity while achieving convergence and explores a wider solution space.
This study demonstrates the effectiveness of MoGA-TA in solving the multi-objective drug molecule optimization problem. The method explored a wider region of chemical space, generated diverse molecules with different similarities, and optimized multiple objectives. Nevertheless, for tasks with more optimization targets or more complex optimization problems, generating candidate molecules with better diversity and more desirable properties still requires further improvement. Several directions therefore remain for future research, such as an in-depth study of the evolutionary process to generate candidate molecules that are more diverse, more widely distributed in chemical space, and closer to the desired properties. In addition, improved crossover and mutation operators are expected to yield better-suited strategies for drug molecule optimization.

Author Contributions

Methodology, Y.W.; writing—first draft preparation, analysis of results, Y.W.; writing review and editing, C.D. and X.L.; visualization, Y.W.; supervision, guided writing, C.D. and X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (no. 12271326, no. 62102304, no. 61806120, no. 61502290, no. 61672334, no. 61673251), the China Postdoctoral Science Foundation (no. 2015M582606), the Industrial Research Project of Science and Technology in Shaanxi Province (no. 2015GY016, no. 2017JQ6063), the Fundamental Research Fund for the Central Universities (no. GK202003071), and the Natural Science Basic Research Plan in Shaanxi Province of China (no. 2022JM-354).

Data Availability Statement

Data is contained in the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hsu, H.-H.; Hsu, Y.-C.; Chang, L.-J.; Yang, J.-M. An integrated approach with new strategies for QSAR models and lead optimization. BMC Genom. 2017, 18, 104.
  2. Zhavoronkov, A. Artificial intelligence for drug discovery, biomarker development, and generation of novel chemistry. Mol. Pharm. 2018, 15, 4311–4313.
  3. DiMasi, J.A.; Grabowski, H.G.; Hansen, R.W. Innovation in the pharmaceutical industry: New estimates of R&D costs. J. Health Econ. 2016, 47, 20–33.
  4. Polishchuk, P.G.; Madzhidov, T.I.; Varnek, A. Estimation of the size of drug-like chemical space based on GDB-17 data. J. Comput.-Aided Mol. Des. 2013, 27, 675–679.
  5. Grantham, K.; Mukaidaisi, M.; Ooi, H.K.; Ghaemi, M.S.; Tchagang, A.; Li, Y. Deep evolutionary learning for molecular design. IEEE Comput. Intell. Mag. 2022, 17, 14–28.
  6. Liu, X.; Ye, K.; van Vlijmen, H.W.T.; Emmerich, M.T.M.; IJzerman, A.P.; van Westen, G.J.P. DrugEx v2: De novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology. J. Cheminform. 2021, 13, 85.
  7. Blaschke, T.; Olivecrona, M.; Engkvist, O.; Bajorath, J. Application of generative autoencoder in de novo molecular design. Mol. Inform. 2018, 37, 1700123.
  8. Gómez-Bombarelli, R.; Wei, J.N.; Duvenaud, D.; Hernández-Lobato, J.M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T.D.; Adams, R.P. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 2018, 4, 268–276.
  9. Kotsias, P.-C.; Arús-Pous, J.; Chen, H.; Engkvist, O.; Tyrchan, C. Direct steering of de novo molecular generation using descriptor conditional recurrent neural networks (cRNNs). J. Cheminform. 2019, 16, 64.
  10. Prykhodko, O.; Johansson, S.V.; Kotsias, P.-C.; Arús-Pous, J.; Bjerrum, E.J.; Engkvist, O.; Chen, H. A de novo molecular generation method using latent vector based generative adversarial network. J. Cheminform. 2019, 11, 74.
  11. Jin, W.; Yang, K.; Barzilay, R.; Jaakkola, T. Learning multimodal graph-to-graph translation for molecular optimization. arXiv 2018, arXiv:1812.01070.
  12. Fu, T.; Xiao, C.; Li, X.; Glass, L.M.; Sun, J. Mimosa: Multi-constraint molecule sampling for molecule optimization. Proc. AAAI Conf. Artif. Intell. 2021, 35, 125–133.
  13. Lee, M.; Min, K. MCVAE: Multi-objective inverse design via molecular graph conditional variational autoencoder. J. Chem. Inf. Model. 2022, 62, 2943–2950.
  14. Chen, Z.; Min, M.R.; Parthasarathy, S.; Ning, X. A deep generative model for molecule optimization via one fragment modification. Nat. Mach. Intell. 2021, 3, 1040–1049.
  15. Winter, R.; Montanari, F.; Steffen, A.; Briem, H.; Noé, F.; Clevert, D.A. Efficient multi-objective molecular optimization in a continuous latent space. Chem. Sci. 2019, 10, 8016–8024.
  16. Gao, W.; Fu, T.; Sun, J.; Coley, C. Sample efficiency matters: A benchmark for practical molecular optimization. Adv. Neural Inf. Process. Syst. 2022, 35, 21342–21357.
  17. Xie, Y.; Shi, C.; Zhou, H.; Yang, Y.; Zhang, W.; Yu, Y.; Li, L. Mars: Markov molecular sampling for multi-objective drug discovery. arXiv 2021, arXiv:2103.10432.
  18. Hoffman, S.C.; Chentham, V.; Wadhawan, K.; Chen, P.-Y.; Das, P. Optimizing molecules using efficient queries from property evaluations. Nat. Mach. Intell. 2022, 4, 21–31.
  19. Jensen, J.H. A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space. Chem. Sci. 2019, 10, 3567–3572.
  20. Nigam, A.K.; Friederich, P.; Krenn, M.; Aspuru-Guzik, A. Augmenting genetic algorithms with deep neural networks for exploring the chemical space. arXiv 2019, arXiv:1909.11655.
  21. Kwon, Y.; Lee, J. MolFinder: An evolutionary algorithm for the global optimization of molecular properties and the extensive exploration of chemical space using SMILES. J. Cheminform. 2021, 13, 24.
  22. Leguy, J.; Cauchy, T.; Glavatskikh, M.; Du Mota, B. EvoMol: A flexible and interpretable evolutionary algorithm for unbiased de novo molecular generation. J. Cheminform. 2020, 12, 55.
  23. Thomas, M.; O’Boyle, N.M.; Bender, A.; De Graaf, C. MolScore: A scoring, evaluation and benchmarking framework for generative models in de novo drug design. J. Cheminform. 2024, 16, 64.
  24. Barshatski, G.; Radinsky, K. Unpaired generative molecule-to-molecule translation for lead optimization. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 2554–2564.
  25. Tian, Y.; Cheng, R.; Jin, Y.; Zhang, X. PlatEMO: A MATLAB platform for evolutionary multi-objective optimization [educational forum]. IEEE Comput. Intell. Mag. 2017, 12, 73–87.
  26. van der Horst, E.; Marqués-Gallego, P.; Mulder-Krieger, T. Multi-objective evolutionary design of adenosine receptor ligands. J. Chem. Inf. Model. 2012, 52, 1713–1721.
  27. Ekins, S.; Honeycutt, J.D.; Metz, J.T. Evolving molecules using multi-objective optimization: Applying to ADME/Tox. Drug Discov. Today 2010, 15, 451–460.
  28. Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197.
  29. López-Pérez, K.; Avellaneda-Tamayo, J.F.; Chen, L.; López-López, E. Molecular similarity: Theory, applications, and perspectives. Artif. Intell. Chem. 2024, 2, 100077.
  30. Bajusz, D.; Rácz, A.; Héberger, K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J. Cheminform. 2015, 7, 20.
  31. Slowik, A.; Kwasnicka, H. Evolutionary algorithms and their applications to engineering problems. Neural Comput. Appl. 2020, 32, 12363–12379.
  32. Brown, N.; Fiscato, M.; Segler, M.H.S.; Vaucher, A.C. GuacaMol: Benchmarking models for de novo molecular design. J. Chem. Inf. Model. 2019, 59, 1096–1108.
  33. Verhellen, J. Graph-based molecular Pareto optimisation. Chem. Sci. 2022, 13, 7526–7535.
  34. Wager, T.T.; Hou, X.; Verhoest, P.R.; Villalobos, A.; Will, Y. Central Nervous System Multiparameter Optimization Desirability: Application in Drug Discovery. ACS Chem. Neurosci. 2016, 7, 767–775.
  35. Verhellen, J.; Van den Abeele, J. Illuminating elite patches of chemical space. Chem. Sci. 2020, 11, 11485–11491.
  36. Guerreiro, A.P.; Fonseca, C.M.; Paquete, L. The hypervolume indicator: Problems and algorithms. arXiv 2020, arXiv:2005.00515.
  37. Lipkus, A.H. A proof of the triangle inequality for the Tanimoto distance. J. Math. Chem. 1999, 26, 263–265.
  38. Xia, X.; Liu, Y.; Zheng, C.; Zhang, X.; Wu, Q.; Gao, X.; Zeng, X. Evolutionary Multiobjective Molecule Optimization in an Implicit Chemical Space. J. Chem. Inf. Model. 2024, 64, 5161–5174.
  39. Ertl, P.; Rohde, B.; Selzer, P. Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. J. Med. Chem. 2000, 43, 3714–3717.
  40. Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754.
  41. Bento, A.P.; Hersey, A.; Félix, E.; Landrum, G.; Gaulton, A. An open source chemical structure curation pipeline using RDKit. J. Cheminform. 2020, 12, 51.
Figure 1. SR values for all comparison models across the six tasks.
Figure 2. The distribution of molecules generated by MoGA-TA, GB-EPI, and NSGA-II in the chemical space when optimizing task 1 and task 2. The coordinate axes represent the corresponding attribute values. ((a) Task 1; (b) Task 2).
Table 1. Multi-objective optimization benchmarks and corresponding scoring and modification functions.

| Benchmark Name | Scoring Functions | Modifier |
|---|---|---|
| Fexofenadine | Tanimoto (AP) | Thresholded (0.8) |
| | TPSA | MaxGaussian (90, 10) |
| | logP | MinGaussian (4, 2) |
| Pioglitazone | Tanimoto (ECFP4) | Gaussian (0, 0.1) |
| | Molecular weight | Gaussian (356, 10) |
| | Number of rotatable bonds | Gaussian (2, 0.5) |
| Osimertinib | Tanimoto (FCFP4) | Thresholded (0.8) |
| | Tanimoto (ECFP6) | MinGaussian (0.85, 2) |
| | TPSA | MaxGaussian (95, 20) |
| | logP | MinGaussian (1, 2) |
| Ranolazine | Tanimoto (AP) | Thresholded (0.7) |
| | TPSA | MaxGaussian (95, 20) |
| | logP | MaxGaussian (7, 1) |
| | Number of fluorine count | Gaussian (1, 1) |
| Cobimetinib | Tanimoto (FCFP4) | Thresholded (0.7) |
| | Tanimoto (ECFP6) | MinGaussian (0.75, 0.1) |
| | Number of rotatable bonds | MinGaussian (3, 1) |
| | Number of aromatic rings | MaxGaussian (3, 1) |
| | CNS (0.5) | |
| DAP kinases | DAPk1 | Thresholded (0.8) |
| | DRP1 | Thresholded (0.8) |
| | ZIPk | Thresholded (0.8) |
| | QED | Gaussian (0.8, 0.1) |
| | logP | MaxGaussian (3, 1) |
Table 2. Hypervolume (HV), success rate, maximum geometric mean, and internal similarity results for NSGA-II, GB-EPI, and MoGA-TA on the six multi-objective optimization tasks.

| Algorithm | Task | HV | Success Rate | Geometric Mean | Internal Similarity |
|---|---|---|---|---|---|
| GB-EPI | Fexofenadine | 0.67 | 0.33 | 0.87 | 0.50 |
| | Pioglitazone | 0.98 | 0.55 | 0.99 | 0.50 |
| | Osimertinib | 0.54 | 0.28 | 0.85 | 0.50 |
| | Ranolazine | 0.46 | 0.31 | 0.81 | 0.50 |
| | Cobimetinib | 0.77 | 0.60 | 0.93 | 0.50 |
| | DAP kinases | 0.04 | 0.17 | 0.50 | 0.51 |
| NSGA-II | Fexofenadine | 0.78 | 0.42 | 0.92 | 0.52 |
| | Pioglitazone | 1.00 | 0.64 | 1.00 | 0.51 |
| | Osimertinib | 0.66 | 0.33 | 0.89 | 0.52 |
| | Ranolazine | 0.68 | 0.36 | 0.87 | 0.51 |
| | Cobimetinib | 0.94 | 0.70 | 0.94 | 0.51 |
| | DAP kinases | 0.04 | 0.23 | 0.48 | 0.51 |
| MoGA-TA | Fexofenadine | 0.85 | 0.51 | 0.94 | 0.51 |
| | Pioglitazone | 1.00 | 0.73 | 1.00 | 0.50 |
| | Osimertinib | 0.70 | 0.52 | 0.90 | 0.52 |
| | Ranolazine | 0.75 | 0.42 | 0.89 | 0.51 |
| | Cobimetinib | 0.96 | 0.73 | 0.94 | 0.50 |
| | DAP kinases | 0.06 | 0.32 | 0.51 | 0.50 |
Table 3. Hypervolume (HV), success rate, maximum geometric mean, and internal similarity results for MoGA-TA without the acceptance probability (MoGA-T).

| Algorithm | Task | HV | Success Rate | Geometric Mean | Internal Similarity |
|---|---|---|---|---|---|
| MoGA-T | Fexofenadine | 0.81 | 0.46 | 0.93 | 0.51 |
| | Pioglitazone | 1.00 | 0.67 | 1.00 | 0.50 |
| | Osimertinib | 0.68 | 0.44 | 0.89 | 0.51 |
| | Ranolazine | 0.71 | 0.35 | 0.88 | 0.51 |
| | Cobimetinib | 0.94 | 0.71 | 0.95 | 0.51 |
| | DAP kinases | 0.05 | 0.26 | 0.49 | 0.50 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, Y.; Dai, C.; Lei, X. Multi-Objective Drug Molecule Optimization Based on Tanimoto Crowding Distance and Acceptance Probability. Pharmaceuticals 2025, 18, 1227. https://doi.org/10.3390/ph18081227
