Mechanistic Analysis of Chemically Diverse Bromodomain-4 Inhibitors Using Balanced QSAR Analysis and Supported by X-ray Resolved Crystal Structures

Bromodomain-4 (BRD-4) is a key enzyme in post-translational modifications, transcriptional activation, and many other cellular processes. Its inhibitors find their therapeutic usage in cancer, acute heart failure, and inflammation to name a few. In the present study, a dataset of 980 molecules with a significant diversity of structural scaffolds and composition was selected to develop a balanced QSAR model possessing high predictive capability and mechanistic interpretation. The model was built as per the OECD (Organisation for Economic Co-operation and Development) guidelines and fulfills the endorsed threshold values for different validation parameters (R2tr = 0.76, Q2LMO = 0.76, and R2ex = 0.76). The present QSAR analysis identified that anti-BRD-4 activity is associated with structural characters such as the presence of saturated carbocyclic rings, the occurrence of carbon atoms near the center of mass of a molecule, and a specific combination of planer or aromatic nitrogen with ring carbon, donor, and acceptor atoms. The outcomes of the present analysis are also supported by X-ray-resolved crystal structures of compounds with BRD-4. Thus, the QSAR model effectively captured salient as well as unreported hidden pharmacophoric features. Therefore, the present study successfully identified valuable novel pharmacophoric features, which could be beneficial for the future optimization of lead/hit compounds for anti-BRD-4 activity.


Introduction
Cancer and heart failure are major causes of mortality [1], health complications, and social and economic problems for millions of people around the globe. Researchers have identified different chemotherapeutic methods to minimise heart failure as well as the onset, growth, and survival of cancer cells [1]. However, different serious health issues initiated or echoed by different anti-cancer and cardiac drugs are of great concerns. Therefore, the quest for a harmless and effective anti-cancer and cardiac drug is an important goal for the research and development laboratories of pharmaceutical companies and academic institutions. For this, researchers generally prefer to inhibit any irregularity occurring during a vital cellular process. A good number of recent studies have confirmed that reversible lysine acetylation (RAL) is a dynamic process responsible for protein post-translational modifications, transcriptional activation, and other cellular processes [2][3][4][5][6][7][8][9][10][11]. Therefore, any anomaly with RAL could lead to the initiation of malignancy or its survival [7,12]. RAL is regulated by three types of epigenetic regulatory proteins [12,13]: (1) histone acetyltransferases (HATs) acetylate lysine, (2) Histone deacetylases (HDACs), and (3) bromodomain [7,12]. RAL is regulated by three types of epigenetic regulatory proteins [12,13]: (1) histone acetyltransferases (HATs) acetylate lysine, (2) Histone deacetylases (HDACs), and (3) bromodomain (BRD) family of proteins. HATs are responsible for acetylation of lysine residues on histone tails and thereby behave as "writers", whereas the reverse is true for HDACs and sirtuins, which work as "erasers" that are accountable for the elimination of the acetyl group from acetylated lysine (KAc) [14]. The Bromodomain and Extra-terminal (BET) family selectively recognises and links with acetylated lysine residues in histones H3 and H4 [15][16][17]; thus, they function as "readers". The BET proteins, viz., BRD2, BRD3, and BRD-4, and bromodomain testis-specific protein (BRDT) are widely recognised as druggable target proteins for regulating cellular epigenetics [15]. Therefore, intruding interactions between BET proteins and acetylated lysine have attracted many researchers to develop better therapeutics for various human diseases including cancer, acute heart failure, and inflammation [2][3][4][5][6][7][8][9][10][11].
BRD-4, also called mitotic chromosomal-associated protein (MCAP), Fshrg4, or Hunk1, is ubiquitously expressed and plays a crucial role in a number of DNA-centered processes [15]. It is generally localised in the nucleus and regulates transcription by RNA polymerase II through a positive transcriptional elongation factor complex [15]. Structurally, it comprises two highly conserved N-terminal bromodomains (BD1 and BD2), an ET domain, and a C-terminal domain (CTD) [7]. Furthermore, BRD-4 contains a set of four helices: αZ, αA, αB, and αC. αZ and αA helices are connected through the ZA loop, whereas the BC loop connects the αB and αC helices [11,18]. Together, the four helices and the two loops create an active acetyl-lysine binding pocket (see Figure 1) [11,18]. The active site also consists of a hydrophobic WPF shelf (Trp81, Pro82, and Phe83), ZA loop, Tyr97, Asn140, and Met149 [11,18]. The majority of BRD-4 inhibitors compete with histone H4 to imitate the interactions with Tyr97 and Asn140 [3]. The WPF shelf is believed to play an important role in deciding the selectivity for BET bromodomains [6]. Recent studies indicate that the inhibition of BRD-4 is a good strategy, and a good number of BRD-4 inhibitors are in clinical or pre-clinical trials (see Figure 2) [2,19,20]. Recent studies indicate that the inhibition of BRD-4 is a good strategy, and a good number of BRD-4 inhibitors are in clinical or pre-clinical trials (see Figure 2) [2,19,20].
However, the quest for a safer and effective BRD-4 inhibitor with an optimum ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) profile with a retention of potency is still in progress. For this, it is essential to know the prominent and concealed pharmacophoric features associated with BRD-4 inhibitors. To achieve this goal, a good number of researchers have reported SAR (Structure Activity Relationships) and QSAR (Quantitative SAR) analyses of BRD-4 inhibitors. Tahir et al. [21] developed a CoMSIA (3D-QSAR) model with an R 2 tr (coefficient of determination) = 0.982 and R 2 cv (or Q 2 loo ) (cross-validated coefficient of determination for leave-one-out) = 0.500 for a dataset of 60 quinolinone and quinazolinone derivatives as BRD-4 inhibitors. Tong et al. [22] reported four 3D-QSAR models possessing R 2 tr = 0.912 to 0.963 and R 2 cv = 0.574 to 0.759 for the BRD-4 inhibitory activity of 4,5-dihydro- [1,2,4]triazolo [4,3-f]pteridine derivatives. Obadawo and co-workers [23] performed QSAR modelling (R 2 tr = 0.93 and R 2 cv = 0.70) for 40 different substituted 4-Phenylisoquinolinones as potent BET bromodomain (BRD-4-BD1) inhibitors. Speck-Planche and Scotti [6] performed multi-target QSAR for bromodomain inhibitors using linear discriminant analysis and artificial neural networks. Their binary classification (active/inactive), which is based on a fragment-based topological approach, and analysis led to the identification of a good number of pharmacophoric features. However, the fragment-based topological approach involved the use of SMILES of molecules and thereby lacks the inclusion of 3D information. Thus, even though these studies are successful in identifying easily visible pharmacophoric features, they are based on small datasets with limited variations in structures, binary classification, lack thorough validation and general applicability, and provide partial mechanistic interpretations. However, the quest for a safer and effective BRD-4 inhibitor with an optimum AD-MET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) profile with a retention of potency is still in progress. For this, it is essential to know the prominent and concealed pharmacophoric features associated with BRD-4 inhibitors. To achieve this goal, a good number of researchers have reported SAR (Structure Activity Relationships) and QSAR (Quantitative SAR) analyses of BRD-4 inhibitors. Tahir et al. [21] developed a CoMSIA (3D-QSAR) model with an R 2 tr (coefficient of determination) = 0.982 and R 2 cv (or Q 2 loo) (cross-validated coefficient of determination for leave-one-out) = 0.500 for a dataset of 60 quinolinone and quinazolinone derivatives as BRD-4 inhibitors. Tong et al. [22] reported four 3D-QSAR models possessing R 2 tr = 0.912 to 0.963 and R 2 cv = 0.574 to 0.759 for the BRD-4 inhibitory activity of 4,5-dihydro- [1,2,4]triazolo [4,3-f]pteridine derivatives. Obadawo and co-workers [23] performed QSAR modelling (R 2 tr = 0.93 and R 2 cv = 0.70) for 40 different substituted 4-Phenylisoquinolinones as potent BET bromodomain (BRD-4-BD1) inhibitors. Speck-Planche and Scotti [6] performed multi-target QSAR for bromodomain inhibitors using linear discriminant analysis and artificial neural networks. Their binary classification (active/inactive), which is based on a fragment-based topological approach, and analysis led to the identification of a good number of pharmacophoric features. However, the fragment-based topological approach involved the use of SMILES of molecules and thereby lacks the inclusion of 3D information. Thus, even though these studies are successful in identifying easily visible pharmacophoric features, they are based on small datasets with limited variations in structures, binary classification, lack thorough validation and general applicability, and provide partial mechanistic interpretations.
A literature survey reveals that BRD-4 inhibitors possess structural isomerism (positional, chain, etc.), variations in central scaffolds, and their chemical space is very broad [6,7]; therefore, many concealed or hidden correlations of pharmacophoric features cannot be identified by visual inspection [24]. In such a situation, there is a need to accomplish thorough QSAR analysis using a larger dataset of BRD-4 inhibitors. In the present work, we have performed QSAR analysis of 980 structurally diverse BRD-4 inhibitors. The de- A literature survey reveals that BRD-4 inhibitors possess structural isomerism (positional, chain, etc.), variations in central scaffolds, and their chemical space is very broad [6,7]; therefore, many concealed or hidden correlations of pharmacophoric features cannot be identified by visual inspection [24]. In such a situation, there is a need to accomplish thorough QSAR analysis using a larger dataset of BRD-4 inhibitors. In the present work, we have performed QSAR analysis of 980 structurally diverse BRD-4 inhibitors. The developed QSAR model possesses a balance of excellent predictive ability with in-depth mechanistic interpretations, which are reinforced by reported X-ray-resolved structures of BRD-4 inhibitors with the target enzyme.

Results
The present QSAR analysis is based on a dataset covering a broad chemical space and data range owing to the inclusion of structurally diverse compounds with experimentally measured IC 50 in the range of 1 nM to 15 µM. Consequently, this helped us in developing an appropriately validated genetic algorithm multi-linear regression (GA-MLR) model for gathering or extending exhaustive information about the pharmacophoric traits that govern the desired bio-activity (Descriptive QSAR) and also possessing acceptable external predictive capability (Predictive QSAR) [25][26][27]  A multitude of statistical validation parameters and analysis of associated graphs has been recommended by different researchers to confirm the statistical robustness and external prediction ability of a QSAR model [28][29][30][31][32][33][34][35][36][37][38][39]. The same approach has been followed in the present work. A high value of R 2 tr , R 2 adj. , R 2 cv (Q 2 loo), R 2 ex , Q 2 -F n , CCC ex , etc., and a small value of LOF (lack-of-fit), RMSE tr , MAE tr , R 2 Yscr (R 2 for Y-scrambling), etc., along with different graphs (Figure 3a-d) related to model-A support the external predictive ability, statistical robustness, and point outs the lack of chancy correlation for model-A [28][29][30][31][32][33][34][35][36][37][38]. Moreover, the Williams plot [40][41][42][43][44] point outs that the majority of molecules (929 molecules) are within the applicability domain; thus, the model is statistically acceptable (see Figure 3b). The outliers with high leverage have been labeled in Figure 3b. Therefore, it fulfills all the Organisation for Economic Co-operation and Development (OECD) endorsed guidelines for generating a thriving QSAR model. that govern the desired bio-activity (Descriptive QSAR) and also possessing acceptab external predictive capability (Predictive QSAR) [25][26][27]. The seven variable-based G MLR QSAR model (see model-A), along with selected internal and external validati parameters (see supplementary material for additional parameters), is as follows. A multitude of statistical validation parameters and analysis of associated graphs h been recommended by different researchers to confirm the statistical robustness and e ternal prediction ability of a QSAR model [28][29][30][31][32][33][34][35][36][37][38][39]. The same approach has been follow in the present work. A high value of R 2 tr, R 2 adj., R 2 cv (Q 2 loo), R 2 ex, Q 2 -F n , CCCex, etc., and small value of LOF (lack-of-fit), RMSEtr, MAEtr, R 2 Yscr (R 2 for Y-scrambling), etc., along w different graphs (Figure 3a-d) related to model-A support the external predictive abili statistical robustness, and point outs the lack of chancy correlation for model-A [28-3 Moreover, the Williams plot [40][41][42][43][44] point outs that the majority of molecules (929 mo cules) are within the applicability domain; thus, the model is statistically acceptable (s Figure 3b). The outliers with high leverage have been labeled in Figure 3b. Therefore fulfills all the Organisation for Economic Co-operation and Development (OECD) e dorsed guidelines for generating a thriving QSAR model.  The descriptions of seven molecular descriptors constituting model-A have been tabulated in Table 1. Interestingly, five molecular descriptors, viz., com_C_4A, fsp3CringC2B, flipoacc3B, fsulfonSaroC8B, and Saturated_Carbo_Rings, comprise the presence of different types of carbon atoms, which indicates the importance of carbon atoms in deciding BRD-4 inhibitory activity. The same is true for nitrogen, which is a part of three molecular descriptors, viz., flipoacc3B, fplaNN4B, and fsp3OaroN6B. Since, in general, the presence of carbon increases lipophilicity whereas nitrogen is attributed to significantly influence the pharmacological and hydrophilic profile, therefore, a balance of an appropriate number of carbons for lipophilicity and nitrogen is necessary to obtain adequate BRD-4 inhibitory activity. Of the seven descriptors in model-A, six have positive coefficients and only one has a negative coefficient. The effects of descriptors and their role in deciding the BRD-4 inhibitory profile have been discussed in more detail with relevant examples in the Discussion section.

Mechanistic Interpretation of QSAR Model
An appropriately validated relationship between prominent structural features or molecular descriptors of the molecules with the bioactivity enlarges knowledge about mechanism of action of molecules, reasons for their specificity, and pharmacophoric atoms/groups accountable for the desired bioactivity [20,26,39]. In the present analysis, although we have equated the IC 50 values of different molecules in a relationship with a specific molecular descriptor (or feature), a synergistic or reverse effect of other molecular descriptors or unknown factors having a superseding influence in deciding the overall IC 50 value of a molecule cannot be ignored. That is, a single molecular descriptor or feature neither decides nor completely explains the experimental IC 50 value for such a large and structurally diverse set of molecules. In other words, the effective use of a validated QSAR model depends on the synergetic consideration of constituent molecular descriptors. The newly developed QSAR model-A comprises seven descriptors.
The molecular descriptor fsp3CringC2B represents the frequency of occurrence of ring carbon atoms exactly at two bonds from sp 3 -hybridised carbon atoms. If the same ring carbon atom was also present at less than two bonds from any other sp 3 -hybridised carbon atoms, then it was excluded while calculating fsp3CringC2B. Its positive coefficient in model-A and also a correlation of 0.30 with pIC 50 indicate that increasing such a combination of ring and sp 3 -hybridised carbon atoms could lead to better inhibitory activities for BRD-4. For example, a comparison of molecule 736 with 737 indicates the significant influence of ring carbon atoms (shown using green dots in Figure 4a,b) at exactly two bonds from sp 3 -hybridised carbon atoms. This is further supported by their reported X-ray resolved structures with BRD-4. Molecule 736 (pdb: 5z1s [47]) has an additional water-mediated interaction with receptors with a distance of 3.37 Å (see Figure 4c) due to the -OCH 3 group present in the benzoxazinone ring. The same -OCH 3 group is responsible for increasing the value of fsp3CringC2B for 736, but it is absent in 737 (pdb: 5z1r [47]). The difference in IC 50 for the following pairs of molecules further support the influence of fsp3CringC2B on the activity profile: 255 with 499, 725 with 716, 231 with 240, and 89 with 105, to mention a few.
From this discussion, it appears that ring carbon atoms alone are important. However, replacing fsp3CringC2B by number of ring carbon atoms as a descriptor in model-A reduced the statistical performance from R 2 tr = 0.76 to 0.73. In addition, the number of ring carbon atoms has a correlation of 0.27 with pIC 50 . Therefore, fsp3CringC2B is a better descriptor than the number of ring carbon atoms. combination of ring and sp 3 -hybridised carbon atoms could lead to better inhibitory activities for BRD-4. For example, a comparison of molecule 736 with 737 indicates the significant influence of ring carbon atoms (shown using green dots in Figure 4a,b) at exactly two bonds from sp 3 -hybridised carbon atoms. This is further supported by their reported X-ray resolved structures with BRD-4. Molecule 736 (pdb: 5z1s [47]) has an additional water-mediated interaction with receptors with a distance of 3.37 Å (see Figure 4c) due to the -OCH3 group present in the benzoxazinone ring. The same -OCH3 group is responsible for increasing the value of fsp3CringC2B for 736, but it is absent in 737 (pdb: 5z1r [47]). The difference in IC50 for the following pairs of molecules further support the influence of fsp3CringC2B on the activity profile: 255 with 499, 725 with 716, 231 with 240, and 89 with 105, to mention a few. From this discussion, it appears that ring carbon atoms alone are important. However, replacing fsp3CringC2B by number of ring carbon atoms as a descriptor in model-A reduced the statistical performance from R 2 tr = 0.76 to 0.73. In addition, the number of ring carbon atoms has a correlation of 0.27 with pIC50. Therefore, fsp3CringC2B is a better descriptor than the number of ring carbon atoms.  com_C_4A, which stands for total number of carbon atoms within 4 Å from centre of mass (com) of molecule, has a positive coefficient in model-A. Therefore, increasing the value of com_C_4A leads to better inhibitory activities. This observation is supported by the fact that it has a correlation of 0.404 with pIC 50 , and molecules with IC 50 < 10 nM (29 molecules) possess a high value of com_C_4A. In addition, a simple comparison of the following pairs of the molecules strengthens this observation: 620 with 621, 720 with 710, 724 with 717, 526 with 518, and 691 with 692, and 595 with 596. At first glance, it looks as if com_C_4A is pointing out the importance of the number of carbon atoms. However, nC (number of carbon atoms) has a correlation of 0.29 with pIC 50 and substituting com_C_4A by nC led to a decrease in statistical performance of model-A from R 2 tr = 0.76 to 0.69. Therefore, com_C_4A is a better choice as a variable for model-A.
As the presence of carbon is generally associated with the increased lipophilicity of a molecule, therefore, com_C_4A indicates that the lipophilic part must be concentrated near the com of the molecule for better activities. This in turn provides a crucial hint about the active site of BRD-4. It appears that a significant portion of the active site of BRD-4 is reasonably lipophilic in nature. This is supported by the fact that the active site of BRD-4 consists of a hydrophobic WPF shelf (see Figure 1) [11,18]. Thus, the findings of the present QSAR analysis are supported by the reported X-ray-resolved structure of BRD-4 enzyme. In addition, the pharmacophore model, depicted in Figure 5, generated using most active molecule 297, again points out the presence of a lipophilic region near the com of the molecule. com_C_4A, which stands for total number of carbon atoms within 4 Å from centre mass (com) of molecule, has a positive coefficient in model-A. Therefore, increasing value of com_C_4A leads to better inhibitory activities. This observation is supported the fact that it has a correlation of 0.404 with pIC50, and molecules with IC50 < 10 nM molecules) possess a high value of com_C_4A. In addition, a simple comparison of following pairs of the molecules strengthens this observation: 620 with 621, 720 with 7 724 with 717, 526 with 518, and 691 with 692, and 595 with 596. At first glance, it looks if com_C_4A is pointing out the importance of the number of carbon atoms. However, (number of carbon atoms) has a correlation of 0.29 with pIC50 and substituting com_C_ by nC led to a decrease in statistical performance of model-A from R 2 tr = 0.76 to 0.69. The fore, com_C_4A is a better choice as a variable for model-A.
As the presence of carbon is generally associated with the increased lipophilicity o molecule, therefore, com_C_4A indicates that the lipophilic part must be concentra near the com of the molecule for better activities. This in turn provides a crucial hint abo the active site of BRD-4. It appears that a significant portion of the active site of BRD-4 reasonably lipophilic in nature. This is supported by the fact that the active site of BRD consists of a hydrophobic WPF shelf (see Figure 1) [11,18]. Thus, the findings of the p sent QSAR analysis are supported by the reported X-ray-resolved structure of BRD-4 zyme. In addition, the pharmacophore model, depicted in Figure 5, generated using m active molecule 297, again points out the presence of a lipophilic region near the com the molecule. Another descriptor that also point outs the importance of lipophilicity of a molec is flipoacc3B, which represents the frequency of occurrence of H-bond acceptor atoms actly at three bonds from lipophilic atoms. However, an acceptor atom was exclud while calculating flipoacc3B if it is also present within two or less bonds from same or a other lipophilic atom. Evidently, the lipophilic part of a molecule close to H-bond accep moiety (O or N atoms) plays a crucial role in deciding the inhibitory effect for BRD-4. T is once again visible in the pharmacophore model depicted in Figure 6. A simple comp ison of the following pairs of molecules supports this observation: 812 with 823, 7 with and 4 with 10.
An easily interpretable and influential molecular descriptor is Saturated_Carbo_Rin which corresponds to total number of saturated carbocyclic rings. It has positive coe cient in model-A; therefore, increasing such rings is beneficial. A comparison of IC50 411 with 384 (see Figure 6), 137 with 127, 73 with 67, 131 with 124, 60 with 61, 570 w 573, and 572, 230 with 247 is in favour of this observation. Another descriptor that also point outs the importance of lipophilicity of a molecule is flipoacc3B, which represents the frequency of occurrence of H-bond acceptor atoms exactly at three bonds from lipophilic atoms. However, an acceptor atom was excluded while calculating flipoacc3B if it is also present within two or less bonds from same or any other lipophilic atom. Evidently, the lipophilic part of a molecule close to H-bond acceptor moiety (O or N atoms) plays a crucial role in deciding the inhibitory effect for BRD-4. This is once again visible in the pharmacophore model depicted in Figure 6. A simple comparison of the following pairs of molecules supports this observation: 812 with 823, 7 with 8, and 4 with 10.
An easily interpretable and influential molecular descriptor is Saturated_Carbo_Rings, which corresponds to total number of saturated carbocyclic rings. It has positive coefficient in model-A; therefore, increasing such rings is beneficial. A comparison of IC 50 for 411 with 384 (see Figure 6), 137 with 127, 73 with 67, 131 with 124, 60 with 61, 570 with 573, and 572, 230 with 247 is in favour of this observation.
The importance of Saturated_Carbo_Rings in model-A indicates that the lipophilicity and flexibility of a molecule are the actual factors governing an activity profile. It is noteworthy that clogP, which represents molecular lipophilicity, has a correlation of 0.193 with pIC 50 , whereas Saturated_Carbo_Rings has 0.240. Thus, Saturated_Carbo_Rings is a better choice, as it pinpoints the specific feature or part of the molecule (saturated carbocyclic rings), which is correlated with the activity due to its lipophilic nature, whereas clogP is a molecular property. A plausible reason could be the crucial role played by hydrophobic zones such as saturated carbocyclic rings at the periphery or outer part of molecule as a proxy for BRD4 selectivity through their interaction with the WPF shelf [6]. Therefore, saturated carbocyclic rings should be retained in future optimizations for better activity profiles. Thus, the present work is successful in identifying the significance of saturated carbocyclic rings as a novel unreported pharmacophore feature associated with BRD-4 inhibitory activity. The importance of Saturated_Carbo_Rings in model-A indicates that the lipophilic and flexibility of a molecule are the actual factors governing an activity profile. It is no worthy that clogP, which represents molecular lipophilicity, has a correlation of 0.193 w pIC50, whereas Saturated_Carbo_Rings has 0.240. Thus, Saturated_Carbo_Rings is a bet choice, as it pinpoints the specific feature or part of the molecule (saturated carbocyc rings), which is correlated with the activity due to its lipophilic nature, whereas clogP i molecular property. A plausible reason could be the crucial role played by hydropho zones such as saturated carbocyclic rings at the periphery or outer part of molecule a proxy for BRD4 selectivity through their interaction with the WPF shelf [6]. Therefo saturated carbocyclic rings should be retained in future optimizations for better activ profiles. Thus, the present work is successful in identifying the significance of saturat carbocyclic rings as a novel unreported pharmacophore feature associated with BRD inhibitory activity.
The molecular descriptor fsulfonSaroC8B (frequency of occurrence of aromatic c bon atoms exactly at eight bonds from sulphur atoms of Sulfone (-SO2-) group) has a p itive coefficient in model-A. Consequently, increasing the value of fsulfonSaroC8B, vours binding with BRD-4. It is to be noted that if the same aromatic carbon atom is a present at ≤7 bonds from a sulphur atom of same or different Sulfone group through a path, then it was excluded while calculating fsulfonSaroC8B. Obviously, this descrip signifies the importance of the Sulfone group (a highly polar group) and its correlati with aromatic rings (a lipophilic moiety) in deciding the binding with BRD-4. This clearly reflected in the difference in the activity of the following pairs of molecules: 7 with 718, 723 with 724, 714 with 715, 707 with 716, 936 with 941, and 942 with 943. Rece studies indicate that Sulfone moiety is present in a cleft near the ZA-loop and establish an H-bond with the -CONH-(amide) of backbone [19].
fsp3OaroN6B stands for the frequency of occurrence of aromatic nitrogen atoms actly at six bonds from sp 3 -hybridised oxygen atom. If the same aromatic nitrogen ato is also present at ≤5 bonds from the same or any other sp 3 -hybridised oxygen ato through any path, then it was excluded while calculating fsp3OaroN6B. For example, with 609, 81 with 620, and 614 with 615, to mention a few. In our previous study [20], identified a similar descriptor notringO_acc_6B (total number of all non-ring Oxygen oms present within a distance of six bonds from H-bond acceptor atoms) as an importa The molecular descriptor fsulfonSaroC8B (frequency of occurrence of aromatic carbon atoms exactly at eight bonds from sulphur atoms of Sulfone (-SO 2 -) group) has a positive coefficient in model-A. Consequently, increasing the value of fsulfonSaroC8B, favours binding with BRD-4. It is to be noted that if the same aromatic carbon atom is also present at ≤7 bonds from a sulphur atom of same or different Sulfone group through any path, then it was excluded while calculating fsulfonSaroC8B. Obviously, this descriptor signifies the importance of the Sulfone group (a highly polar group) and its correlation with aromatic rings (a lipophilic moiety) in deciding the binding with BRD-4. This is clearly reflected in the difference in the activity of the following pairs of molecules: 715 with 718, 723 with 724, 714 with 715, 707 with 716, 936 with 941, and 942 with 943. Recent studies indicate that Sulfone moiety is present in a cleft near the ZA-loop and establishes an H-bond with the -CONH-(amide) of backbone [19].
fsp3OaroN6B stands for the frequency of occurrence of aromatic nitrogen atoms exactly at six bonds from sp 3 -hybridised oxygen atom. If the same aromatic nitrogen atom is also present at ≤5 bonds from the same or any other sp 3 -hybridised oxygen atom through any path, then it was excluded while calculating fsp3OaroN6B. For example, 79 with 609, 81 with 620, and 614 with 615, to mention a few. In our previous study [20], we identified a similar descriptor notringO_acc_6B (total number of all non-ring Oxygen atoms present within a distance of six bonds from H-bond acceptor atoms) as an important pharmacophoric feature that governs the binding affinity (Ki) of a molecule for BRD-4. Thus, a consensus between the previous and the present study indicates that a molecule must have an H-bond acceptor (preferably aromatic nitrogen) at a distance of six bonds from a sp 3 -hybridised oxygen atom (non-ring oxygen favoured). This observation is supported by the difference in the activity of molecule S1, S2, and S3 [48] (see Figure 7). To add further, the sp 3 -hybridised oxygen atom is present as a linker between two aromatic rings or as an -OR (alkoxy) group in a good number of molecules [20]. must have an H-bond acceptor (preferably aromatic nitrogen) at a distance of six bonds from a sp 3 -hybridised oxygen atom (non-ring oxygen favoured). This observation is supported by the difference in the activity of molecule S1, S2, and S3 [48] (see Figure 7). To add further, the sp 3 -hybridised oxygen atom is present as a linker between two aromatic rings or as an -OR (alkoxy) group in a good number of molecules [20]. The only descriptor with a negative coefficient in model-A is fplaNN4B, which corresponds to frequency of occurrence of nitrogen atoms exactly at 4 bonds from planer nitrogen atoms. If the same nitrogen atom is also present at ≤ 3 bonds from the same or any other planer nitrogen atom through any path, then it was excluded while calculating fplaNN4B. The following pairs of molecules have a significant difference in their activities, which could be attributed to the presence of fplaNN4B as a structural feature: 945 with 954, 936 with 944, 411 with 385, 170 with 171, 722 with 714, 563 with 577, 737 with 734, 239 with 240. Therefore, such a combination of nitrogen atoms should be avoided to have better inhibition of BRD-4.

Materials and Methods
The present work follows the standard procedure recommended by OECD and different researchers to perform QSAR analysis [49][50][51]. All the software were used with default settings to obtain a QSAR model possessing a balance of predictive ability and mechanistic interpretation; however, some settings were changed, which have been reported at appropriate places. The different steps are as follows.
Step-1: Collection of data and curation: The present work commenced with the collection of a large dataset of 2026 experimentally tested BRD-4 inhibitors from a free and publicly available database BindingDB (https://www.bindingdb.org/bind/index.jsp, accessed on 16 March 2022). A QSAR analysis is significantly influenced by the quality of data, its composition, and its appropriate curation before further processing [50,[52][53][54]. Therefore, in the next step, data curation was performed [55], which involved the removal of duplicate entries, organometallic compounds, salts, molecules with ambiguous IC50 values, etc. This reduced the dataset to 980 molecules only. The reduced dataset still consists of molecules with experimental IC50 (nM) in the range 1 nM to 15 µM and the presence of diverse scaffolds such as heterocyclic rings, positional isomers, stereoisomers, etc., enhancing the chemical space and consequently widening the applicability of the newly de-  The only descriptor with a negative coefficient in model-A is fplaNN4B, which corresponds to frequency of occurrence of nitrogen atoms exactly at 4 bonds from planer nitrogen atoms. If the same nitrogen atom is also present at ≤3 bonds from the same or any other planer nitrogen atom through any path, then it was excluded while calculating fplaNN4B. The following pairs of molecules have a significant difference in their activities, which could be attributed to the presence of fplaNN4B as a structural feature: 945 with 954, 936 with 944, 411 with 385, 170 with 171, 722 with 714, 563 with 577, 737 with 734, 239 with 240. Therefore, such a combination of nitrogen atoms should be avoided to have better inhibition of BRD-4.

Materials and Methods
The present work follows the standard procedure recommended by OECD and different researchers to perform QSAR analysis [49][50][51]. All the software were used with default settings to obtain a QSAR model possessing a balance of predictive ability and mechanistic interpretation; however, some settings were changed, which have been reported at appropriate places. The different steps are as follows.
Step-1: Collection of data and curation: The present work commenced with the collection of a large dataset of 2026 experimentally tested BRD-4 inhibitors from a free and publicly available database BindingDB (https://www.bindingdb.org/bind/index.jsp, accessed on 16 March 2022). A QSAR analysis is significantly influenced by the quality of data, its composition, and its appropriate curation before further processing [50,[52][53][54]. Therefore, in the next step, data curation was performed [55], which involved the removal of duplicate entries, organometallic compounds, salts, molecules with ambiguous IC 50 values, etc. This reduced the dataset to 980 molecules only. The reduced dataset still consists of molecules with experimental IC 50 Figure 8 to depict the structural diversity of the current dataset.
In Table 2, five most and least active molecules have been included as examples only along with their SMILES notation: IC 50 (nM) and pIC 50 (M).
Step-2: In the next step, SMILES notations were used to develop the optimised 3D structures (semi-empirical PM3 method) of the molecules, accomplished using OpenBabel 2.4 [56] and MOPAC 2012 (openmopac.net) using default settings.
Step-3: A QSAR model achieves a balance of mechanistic interpretation and predictive ability if a good number of diverse molecular descriptors are calculated, followed by adequate pruning to reduce the chances of overfitting from noisy redundant descriptors [57]. The next step involved the calculation of myriad number of 1D-to 3D-molecular descriptors for all molecules. For this, PyDescriptor [45] and DataWarrior [46] were used, which generated more than 40,000 molecular descriptors for a single molecule. Obviously, the descriptor pool contained a good number of redundant molecular descriptors; therefore, highly correlated (|R| > 0.95) and nearly constant (>98%) variables were removed using QSARINS 2.2.4 [58]. This considerably decreased the size of set of molecular descriptor pool from 30,000 to 4326, which still contained a variety of molecular descriptors. In Table 2, five most and least active molecules have been included as examples only along with their SMILES notation: IC50 (nM) and pIC50 (M). Step-2: In the next step, SMILES notations were used to develop the optimised 3D structures (semi-empirical PM3 method) of the molecules, accomplished using OpenBabel 2.4 [56] and MOPAC 2012 (openmopac.net) using default settings.
Step-3: A QSAR model achieves a balance of mechanistic interpretation and predictive ability if a good number of diverse molecular descriptors are calculated, followed by adequate pruning to reduce the chances of overfitting from noisy redundant descriptors  For developing a useful QSAR model and its appropriate validation, it is essential to divide the dataset into training and external (also called as prediction or test set) sets [50,[52][53][54]59]. Consequently, in the present work, the dataset was randomly divided into training (80% = 785 molecules) and external (20% = 195 molecules) sets to minimise any bias. The only purpose of the training set was to choose suitable number of variables (molecular descriptors), whereas the external set was employed only for validation purpose, i.e., external validation of the model (Predictive QSAR). Genetic Algorithm (GA) and multi-linear regression (MLR) available in QSARINS 2.2.4 were used for model building. For this, Q 2 LOO was used as a fitness function, and the number of generations was set to 10,000. A decisive step in QSAR modelling is to select the optimum number of molecular descriptors for model building to avoid over-fitting and to obtain acceptable interpretability. Consequently, the heuristic search involved building multiple models from univariate to multivariate with the successive addition of molecular descriptors until there was an increase in the value of Q 2 LOO , which is called the breaking point [39,60]. A 2D graph between the number of molecular descriptors involved in the models and Q 2 LOO values has been depicted in Figure 9. The number of variables matching with the breaking point was considered optimal for model building as there was no improvement in the statistical performance of model upon the inclusion of additional molecular descriptors. The analysis led to the matching of breaking points with seven variables. Therefore, QSAR models with more than seven descriptors were excluded.

Splitting the Data Set into Training and External Sets and Subjective Feature Selection (SFS)
For developing a useful QSAR model and its appropriate validation, it is essential to divide the dataset into training and external (also called as prediction or test set) sets [50,[52][53][54]59]. Consequently, in the present work, the dataset was randomly divided into training (80% = 785 molecules) and external (20% = 195 molecules) sets to minimise any bias. The only purpose of the training set was to choose suitable number of variables (molecular descriptors), whereas the external set was employed only for validation purpose, i.e., external validation of the model (Predictive QSAR). Genetic Algorithm (GA) and multi-linear regression (MLR) available in QSARINS 2.2.4 were used for model building. For this, Q 2 LOO was used as a fitness function, and the number of generations was set to 10,000. A decisive step in QSAR modelling is to select the optimum number of molecular descriptors for model building to avoid over-fitting and to obtain acceptable interpretability. Consequently, the heuristic search involved building multiple models from univariate to multivariate with the successive addition of molecular descriptors until there was an increase in the value of Q 2 LOO, which is called the breaking point [39,60]. A 2D graph between the number of molecular descriptors involved in the models and Q 2 LOO values has been depicted in Figure 9. The number of variables matching with the breaking point was considered optimal for model building as there was no improvement in the statistical performance of model upon the inclusion of additional molecular descriptors. The analysis led to the matching of breaking points with seven variables. Therefore, QSAR models with more than seven descriptors were excluded.

Building Regression Model and Its Validation
Appropriate validations involving cross/inter validation, external validation, Y-randomization analysis, and applicability domain (Williams plot) are necessary to estimate the reliability and general applicability of a QSAR model [25,31,33,50,61]. A properly validated QSAR model finds its usage for QSAR-based virtual screening, lead/hit optimization, decision making, etc. The following validation parameters and their recommended Figure 9. Graph between number of descriptors against leave-one-out coefficient of determination Q 2 LOO to identify the optimum number of descriptors.

Building Regression Model and Its Validation
Appropriate validations involving cross/inter validation, external validation, Y-randomization analysis, and applicability domain (Williams plot) are necessary to estimate the reliability and general applicability of a QSAR model [25,31,33,50,61]. A properly validated QSAR model finds its usage for QSAR-based virtual screening, lead/hit optimization, decision making, etc. The following validation parameters and their recommended threshold values are usually used to assess a model [39,60]: R 2 tr (coefficient of determination) ≥ 0.6; Q 2 loo (cross-validated coefficient of determination for leave-one-out) ≥ 0.5; Q 2 LMO (cross-validated coefficient of determination for leave-many-out) ≥ 0.6, R 2 > Q 2 ; R 2 ex (external coefficient of determination) ≥ 0.6; CCC (Concordance Correlation Coefficient) ≥ 0.80; Q 2 -F n ≥ 0.60; high values of external validation parameters R 2 ex , Q 2 F1 , Q 2 F2 , and Q 2 F3 , with low values of R 2 Yscr (coefficient of determination for Y-randomization); RMSE (Root mean square error); MAE (Mean absolute error); RMSE tr < RMSE cv . The formulae for calculating these statistical parameters are available in Supplementary Materials. In the present analysis, a Williams plot was used to assess the applicability domain of the newly developed QSAR model.

Pharmacophore Model
For pharmacophore modelling, the 3D-optimised structure of the most active molecule, 207, was selected. The model was generated using LIQUID [62,63], a free and easy to use PyMOL plugin, using default settings, except that the contour region was set to 3 for H-bond donor/acceptor and hydrophobic regions.

Other Experimental Details
The reported X-ray resolved structures (pdb 5z1r and 5z1s) were downloaded from Protein Data Bank (www.rcsb.org accessed on 13 April 2022). PyMOL version 2.4 has been used for the depiction of molecular interactions between the compounds and the protein.

Conclusions
In the present study, a seven-descriptor-based and rigorously validated GA-MLR QSAR model with R 2 tr = 0.79, Q 2 LMO = 0.79, and R 2 ex = 0.78 was derived to identify the significant pharmacophoric features that influence BRD-4 inhibitory activity. As mentioned earlier, it is essential to perceive salient and visually unrecognizable pharmacophoric features linked with BRD-4 inhibitory activity for different chemical scaffolds. The analysis indicates that the presence of ring carbon and nitrogen atoms, occurrence of carbon atoms near the center of mass of a molecule, specific combination of planer nitrogen with ring carbon, donor and acceptor atoms, etc., are prominent features to be retained in future optimizations. On the other hand, a combination of nitrogen atoms with planer nitrogen atoms exactly at four bonds should be avoided for better BRD-4 inhibitory activity. The reported crystal structures of BRD-4 inhibitors strengthen these observations. The present study efficaciously captured and reported novel pharmacophoric features and has a good balance of predictive ability and mechanistic interpretations.

Conflicts of Interest:
The authors declare no conflict of interest.

SMILES
Simplified molecular-input line-entry system GA Genetic algorithm MLR Multiple linear regression QSAR Quantitative structure−activity relationship WHO World Health Organization OLS Ordinary least square QSARINS QSAR Insubria OECD Organisation for Economic Co-operation and Development