Development of Prediction Capabilities for High-Throughput Screening of Physiochemical Properties by Biomimetic Chromatography

Damian Tuz; Damian Smuga; Tomasz Pawiński

doi:10.3390/molecules30234528

¹ Laboratory of Physicochemical Analysis, Department of Medicinal Chemistry, Celon Pharma S.A., Marymoncka 15, 05-152 Kazuń Nowy, Poland

² Department of Drug Chemistry, Pharmaceutical and Biomedical Analysis, Faculty of Pharmacy, Medical University of Warsaw, Banacha 1, 02-097 Warsaw, Poland

* Author to whom correspondence should be addressed.

Molecules 2025, 30(23), 4528; https://doi.org/10.3390/molecules30234528

Abstract

The ever-increasing costs of in vitro and in vivo testing are compelling scientists to rely increasingly on computational models for predictive characterisation at early stages of drug discovery and development. The complexity of this stage requires high-throughput screening methods that can rapidly generate comprehensive information about new chemical compounds. This review explores innovative approaches to assessing the pharmacokinetic and pharmacodynamic properties of new chemical entities, with a focus on integrating machine learning as a transformative analytical tool. Machine learning algorithms are highlighted for their capability to train predictors that combine biomimetic chromatography data (a high-throughput alternative to several physicochemical assays) with molecular features and/or molecular fingerprints obtained in silico and with in vivo data for known compounds, allowing efficient prediction of in vivo data for new chemical entities. By synthesising recent methodological advancements and presenting practical approaches, the review provides insights into computational strategies that can significantly accelerate compound library screening and drug development processes.

Keywords:

high-throughput; screening; biomimetic chromatography; machine learning

1. Introduction

A significant focus in drug discovery involves establishing pharmacokinetic and pharmacodynamic properties through computational models. Pharmacokinetics describes the disposition of a drug in the body, encompassing the processes of Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET), often summarised as “what the body does to the drug”. Meanwhile, pharmacodynamics pertains to the drug’s mechanism of action and its biochemical and physiological effects on the body, which can be summarised as “what the drug does to the body”.

The relationship between a molecule’s structure and its biological activity is governed by its fundamental physicochemical properties. Among these, lipophilicity—the affinity of a molecule for a lipid-like environment—is of paramount importance in Medicinal Chemistry. This property influences a compound’s entire ADMET profile, affecting its absorption, its distribution across biological membranes into various compartments in the body, its tendency to bind to plasma proteins, and even its potential for toxicity (Table 1).

Table 1. Connection between lipophilicity and ADMET [].

To quantify lipophilicity, the logarithm of the n-octanol–water partition coefficient (LogP) is used, as it is the most common parameter for predicting physicochemical properties [,]. This value is determined using a reference system of two immiscible liquids, n-octanol and water, where n-octanol acts as a surrogate for biological lipid membranes. A compound’s LogP value measures its distribution equilibrium between these two phases, with higher values indicating greater lipid solubility (lipophilicity) and lower values signifying greater water solubility (hydrophilicity). Since LogP refers to neutral species, it is easy to compare values among neutral compounds, but it may be misleading for ionisable compounds. Because of this, there is another distribution coefficient, logD, which accounts for pH [].

The shake-flask method is widely regarded as the gold standard for experimentally determining LogP/logD. This method involves partitioning a compound between immiscible solvents, octanol and water, and measuring the equilibrium concentrations in each phase. While highly accurate, the shake-flask method is time-consuming, requires high-purity compounds, and cannot be used for unstable compounds or those with extreme lipophilicity, due to analytical detection limits [,,].

To overcome these limitations, chromatographic techniques were developed. Reversed-phase (RP)-HPLC is one of the most common alternatives. This method relies on calibration plots based on compounds with a known Chromatographic Hydrophobicity Index (CHI) []. The CHI value itself estimates the percentage of organic solvent (e.g., acetonitrile) needed to elute the compound. This CHI value (0–100) can be mapped onto the traditional octanol–water logD scale using a linear equation to produce ChromlogD. ChromlogD is now widely used as a high-throughput alternative to the shake-flask method for assessing lipophilicity.
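The two-step conversion described above (retention time to CHI via a calibration plot, then CHI to ChromlogD via a linear equation) can be sketched as follows. This is a minimal illustration rather than a validated method: the calibration standards are invented, and the slope and intercept of the CHI-to-ChromlogD mapping (`a`, `b`) are illustrative defaults that are not taken from this review and should be replaced by the coefficients of the method actually in use.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical calibration standards: gradient retention time (min) and known CHI.
calib_rt = [1.0, 2.0, 3.0, 4.0, 5.0]
calib_chi = [10.0, 30.0, 50.0, 70.0, 90.0]
slope, intercept = fit_line(calib_rt, calib_chi)


def chi_from_rt(rt):
    """Convert a retention time to CHI using the fitted calibration line."""
    return slope * rt + intercept


def chromlogd_from_chi(chi, a=0.0857, b=-2.0):
    """Linear CHI -> ChromlogD mapping; a and b are assumed, illustrative values."""
    return a * chi + b


chi = chi_from_rt(3.5)  # retention time of a hypothetical test compound
print(f"CHI = {chi:.1f}, ChromlogD = {chromlogd_from_chi(chi):.2f}")
```

In practice the calibration line is refitted whenever the column, gradient, or mobile phase changes, because CHI values are only transferable through the calibration set.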

While ChromlogD addresses the throughput limitation, n-octanol and C18 stationary phases do not perfectly replicate the complexity of biological systems. To experimentally assess critical physicochemical properties (lipophilicity, permeability, protein binding) in a more biologically relevant manner, biomimetic chromatography (BC) has emerged as a reliable high-throughput technique, which has been proven to be superior in lipophilicity assessment within certain groups of chemical compounds [,,,].

Predictive models in pharmacokinetics link experimentally obtained data (A), such as retention factors from BC, with in silico derived molecular descriptors (B) or chemical fingerprints (C). The aim is to reliably forecast data that are usually resource-intensive and challenging to acquire (D), like in vivo efficacy [], plasma protein binding (PPB) [,], or blood–brain barrier permeability (log BB) [,,,]. This approach is known as the Quantitative Structure–Retention Relationship (QSRR) [] (Figure 1).

Figure 1. Flowchart of a prediction model trained with compounds that have publicly accessible in vivo data used to forecast the relevant in vivo data for unknown compounds. A—experimentally derived data, e.g., retention time; B—molecular descriptors generated in silico; C—molecular fingerprints generated in silico; D—in vivo data.

The translation of raw chromatographic and in silico data into predictions of complex biological phenomena is greatly improved by using ML. This review will therefore examine these methods as follows:

Section 2 will explore the quantitative relationship between retention behaviour in these BC systems and their corresponding “gold standard” biological assays.
Section 3 will introduce modern ML algorithms as tools to decode these complex, non-linear relationships and their use in QSRR.
Section 4 will present selected applications from the recent literature, followed by our conclusions on the current state of the art and upcoming challenges.

2. Physiochemistry in Pharmacokinetics: From Gold Standards to Biomimetic Alternatives

The main physicochemical parameters are lipophilicity, permeability, and protein binding. However, they should not be viewed as independent variables. Collectively, these parameters offer essential information for understanding and predicting the behaviour of drug candidates in biological systems. While these parameters can be measured individually in vitro, their real value lies in forecasting outcomes in complex in vivo studies or cell-based assays (“gold standards”), which are resource-intensive and low-throughput.

This section examines how BC offers high-throughput alternatives for predicting these in vivo parameters. BC is an analytical method within (ultra) high-performance liquid chromatography ((U)HPLC). It may use artificial stationary phases designed to mimic the molecular interactions between pharmaceutical compounds and their biological targets, including proteins, cellular membranes, and enzymatic systems []. Retention times from BC can be used to model parameters such as lipophilicity, protein binding affinity, and membrane permeability, which can in turn be used to model more complex parameters like human oral absorption (%HOA) or log BB.

Implementing BC provides benefits in drug development, as it offers a cost-effective alternative to traditional in vivo studies and aligns with High-Throughput Screening (HTS) methods (Table 2).

Table 2. Comparative table for BC techniques.

2.1. Plasma Protein Binding (PPB) and Volume of Distribution (V_D)

According to the free drug hypothesis, only the unbound fraction of a drug is biologically active, as it can diffuse across cell membranes to interact with its target site []. Extensive PPB can lower the concentration of unbound drug, reduce its V_D, decrease clearance (Cl), and necessitate higher doses to attain the desired effect, which presents a major risk of toxicity. In some cases, however, it can be beneficial, prolonging the active effect of a drug through slow release from plasma proteins [,,]. The affinity of a drug for common plasma proteins, such as α1-acid glycoprotein (AGP) and human serum albumin (HSA), provides valuable insight into its pharmacokinetic behaviour, influencing its V_D, half-life (t_1/2) and Cl []. The V_D, t_1/2, and Cl are also dependent on membrane permeability, which is heavily influenced by lipophilicity.

The gold standard for measuring PPB is Equilibrium Dialysis (ED). This method physically separates a protein-containing solution from a buffer using a semipermeable membrane, allowing the free drug to reach equilibrium across it []. It is accurate but slow and not suitable for HTS. Regarding V_D, t_1/2, and Cl, these parameters are not obtained through a single assay. Their gold standard determination is an in vivo pharmacokinetic (PK) study. In such studies, drug concentration in plasma is measured over time, and V_D, t_1/2, and Cl are calculated from this curve using pharmacokinetic models.
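As an illustration of how V_D, t_1/2, and Cl are derived from a plasma concentration–time curve, the sketch below assumes the simplest possible pharmacokinetic model: a one-compartment model after an IV bolus, C(t) = C0·exp(−k·t). The dose, sampling times, and concentrations are synthetic, and real studies typically use more elaborate compartmental or non-compartmental analyses.

```python
import math

dose_mg = 100.0                       # hypothetical IV bolus dose
times_h = [0.5, 1.0, 2.0, 4.0, 8.0]   # sampling times (h)
# Synthetic plasma concentrations (mg/L) generated with C0 = 10 mg/L, k = 0.1 1/h.
conc_mg_l = [10.0 * math.exp(-0.1 * t) for t in times_h]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# ln C(t) = ln C0 - k * t, so a straight-line fit gives k and C0.
slope, intercept = fit_line(times_h, [math.log(c) for c in conc_mg_l])
k_el = -slope                   # elimination rate constant (1/h)
c0 = math.exp(intercept)        # back-extrapolated concentration at t = 0 (mg/L)

half_life_h = math.log(2) / k_el   # t_1/2
v_d_l = dose_mg / c0               # volume of distribution (L)
cl_l_h = k_el * v_d_l              # clearance (L/h)

print(f"t1/2 = {half_life_h:.2f} h, V_D = {v_d_l:.1f} L, Cl = {cl_l_h:.2f} L/h")
```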

Using BC with AGP and HSA columns has become a reliable HTS method for studying PPB []. In this technique, the retention time of a drug on protein-coated columns is measured to determine retention factors, log k_w(HSA) and log k_w(AGP), which are correlated with the drug’s binding affinity to plasma proteins. Several studies have demonstrated a strong correlation between these retention factors and in vivo data obtained in PPB studies [,,,,,,,,]. It is important to note that in some of this work, researchers combine multiple BC columns for HTS of various aspects of drug disposition in the Central Nervous System (CNS) (log BB, fraction unbound in brain, unbound brain volume of distribution) [].
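For readers less familiar with how a retention time becomes a retention factor, the sketch below uses the standard definition k = (t_R − t_0)/t_0 and, as an assumption about how log k_w is obtained, a linear extrapolation of isocratic log k values to 0% organic modifier (log k = log k_w − S·φ). The dead time, modifier fractions, and retention times are hypothetical, and gradient-based workflows derive the retention index differently.

```python
import math


def retention_factor(t_r, t_0):
    """Retention factor k = (t_R - t_0) / t_0, where t_0 is the column dead time."""
    if t_0 <= 0 or t_r <= t_0:
        raise ValueError("expected t_r > t_0 > 0")
    return (t_r - t_0) / t_0


def extrapolate_log_kw(phis, log_ks):
    """Fit log k = log k_w - S * phi by least squares and return (log k_w, S)."""
    n = len(phis)
    mean_x = sum(phis) / n
    mean_y = sum(log_ks) / n
    sxx = sum((x - mean_x) ** 2 for x in phis)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(phis, log_ks))
    slope = sxy / sxx
    return mean_y - slope * mean_x, -slope


# Hypothetical isocratic runs of one compound on a protein-based column.
t_0 = 1.0                      # dead time (min)
phis = [0.10, 0.20, 0.30]      # volume fraction of organic modifier
t_rs = [32.6, 11.0, 4.2]       # retention times (min)

log_ks = [math.log10(retention_factor(t, t_0)) for t in t_rs]
log_kw, s = extrapolate_log_kw(phis, log_ks)
print(f"log k_w = {log_kw:.2f}, S = {s:.2f}")
```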

Technical Details: AGP and HSA are both protein-based columns used in affinity chromatography, initially designed for chiral separation by exploiting the stereospecific binding pockets of the immobilised proteins. However, these columns have found additional applications in ADMET profiling and can be employed as tools for studying drug distribution and drug–drug interactions. AGP columns contain α1-acid glycoprotein, a major plasma protein that binds basic and neutral drugs. HSA columns contain immobilised human serum albumin, a key plasma protein that binds numerous drugs in the bloodstream []. Daicel Corporation offers a wide range of protein-based chiral selectors. HSA and AGP are available under the trade names CHIRALPAK HSA and CHIRALPAK AGP. Daicel also supplies columns with immobilised Cellobiohydrolase (CBH) and serum albumins from various animal species.

An alternative BC approach to predict PPB utilises Micellar Liquid Chromatography (MLC) [,,]. MLC has also been used to predict V_D, t_1/2, and Cl [,,]. This technique operates in reversed-phase liquid chromatography (RPLC) mode with a mobile phase containing surfactant at a concentration above its critical micellar concentration (CMC). The most common stationary phases are C8, C18, and cyanopropyl, while the typical surfactants include the anionic sodium dodecyl sulphate (SDS), cationic cetyltrimethylammonium bromide (CTAB), and non-ionic Brij-35 [].

Technical Details: The retention and separation process in MLC relies on a double equilibrium that determines how the analyte distributes itself among three microenvironments (the bulk aqueous mobile phase, the micelles, and the surfactant-coated stationary phase): (i) between the bulk aqueous mobile phase and the surfactant-coated stationary phase; and (ii) between the bulk aqueous mobile phase and the micellar aggregates in the mobile phase. Analytes that bind strongly to the micelles are carried along with the mobile phase and elute earlier, whereas analytes that bind strongly to the surfactant-coated stationary phase are retained longer []. Because of this mechanism, the retention factor log k_w(MLC) reflects the compound’s partitioning between the surfactant layer on the stationary phase and the micelles, providing a surrogate measure of membrane affinity.

2.2. Oral Bioavailability (F), Human Oral Absorption (%HOA), Membrane Permeability

Oral bioavailability (F) is a crucial in vivo parameter that reflects the proportion of an administered dose reaching systemic circulation unchanged. It is a composite parameter influenced by %HOA (which includes aqueous solubility, intestinal permeability, and chemical degradation in the gastrointestinal (GI) tract) and first-pass metabolism [,]. It is important to distinguish between different types of permeability: (i) membrane permeability affecting drug absorption, which can be further divided into intestinal permeability and epithelial permeability; and (ii) blood–brain barrier (BBB) permeability, which governs drug distribution across the BBB. A compound’s ability to cross the BBB is a critical and distinct parameter, essential for designing CNS-active drugs or, conversely, for excluding drugs from the brain to prevent neurotoxicity.

The gold standard for oral bioavailability is an in vivo PK study comparing the area under the curve (AUC) of plasma concentration following an oral dose to that of an intravenous (IV) dose. Aqueous solubility and chemical degradation are relatively easier to assess than membrane permeability. For membrane permeability, the gold standards are the Caco-2 and Madin-Darby canine kidney (MDCK) cell-based in vitro assays. They differ slightly: the Caco-2 assay uses human colon carcinoma cells and requires long differentiation, whereas the MDCK assay provides a faster alternative and is frequently used in transfected MDR1-MDCK variants to investigate P-glycoprotein efflux transport specifically [,,]. An alternative, well-established approach to measuring passive membrane permeability is the Parallel Artificial Membrane Permeability Assay (PAMPA) []. The membrane used in PAMPA is artificial, so the assay can only measure passive diffusion, but it can be “customised” depending on the membrane the assay mimics [,,]. The gold standard for BBB permeability is derived from in vivo studies in animals. It is expressed as the logarithm of the ratio of the drug’s concentration in the brain to its concentration in plasma, resulting in the log BB value. A less commonly used descriptor of BBB permeability is log PS, which represents the permeability surface area product. The difference between them is that log BB measures concentrations at equilibrium, whereas log PS measures the initial permeability rate. These in vivo measurements are low-throughput and resource-intensive [].
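The two in vivo quantities defined above reduce to short calculations once the concentration data are available. The sketch below assumes dose-normalised AUCs computed by the linear trapezoidal rule over the sampled interval only (no extrapolation to infinity); all doses and concentrations are invented for illustration.

```python
import math


def auc_trapezoid(times, concs):
    """Area under the concentration-time curve by the linear trapezoidal rule."""
    area = 0.0
    for i in range(1, len(times)):
        area += (times[i] - times[i - 1]) * (concs[i] + concs[i - 1]) / 2.0
    return area


def oral_bioavailability(auc_oral, dose_oral, auc_iv, dose_iv):
    """F = (AUC_oral / Dose_oral) / (AUC_iv / Dose_iv), as a fraction."""
    return (auc_oral / dose_oral) / (auc_iv / dose_iv)


def log_bb(c_brain, c_plasma):
    """log BB = log10(C_brain / C_plasma)."""
    return math.log10(c_brain / c_plasma)


# Hypothetical plasma profiles (h, mg/L) after a 10 mg IV dose and a 20 mg oral dose.
times = [0.0, 1.0, 2.0, 4.0, 8.0]
conc_iv = [10.0, 8.0, 6.0, 3.0, 1.0]
conc_oral = [0.0, 4.0, 5.0, 3.0, 1.0]

auc_iv = auc_trapezoid(times, conc_iv)
auc_oral = auc_trapezoid(times, conc_oral)
f = oral_bioavailability(auc_oral, 20.0, auc_iv, 10.0)

print(f"AUC_iv = {auc_iv}, AUC_oral = {auc_oral}, F = {f:.1%}")
print(f"log BB = {log_bb(2.0, 20.0):.1f}")  # brain 2 ng/g vs plasma 20 ng/mL (hypothetical)
```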

The primary BC alternative for predicting membrane permeability is Immobilised Artificial Membrane (IAM) chromatography [,,,,]. The retention factor log k_w(IAM) is proportional to the compound’s partitioning into the phospholipid phase, providing a direct measure of membrane affinity. This method offers a high-throughput prediction of passive diffusion. IAM permeability has a medium correlation with Caco-2, but only after the inclusion of the molecular mass variable and the exclusion of compounds that undergo active transport []. However, in a different dataset, a weak correlation between IAM permeability and Caco-2 has been reported []. A medium correlation can also be observed with MDCK, after the inclusion of a parameter representing electrostatic interactions []. Both correlations are only valid for drug compounds that undergo passive transport. Like PAMPA, IAM cannot predict the effect of active transporters, a key limitation compared with cell-based assays []. Because oral bioavailability (F) is a composite property, log k_w(IAM) can potentially be part of a prediction model (e.g., predicting %HOA, due to the heavy influence of lipophilicity on solubility and permeability) [,]. IAM chromatography has also been successfully applied in studies on epithelial permeability [] and for predicting log BB [,,,]. The phospholipid stationary phase serves as a model for the membranes of brain endothelial cells. Various IAM column variants share similar capabilities in predicting BBB permeability []. Epithelial, intestinal membrane, and BBB permeability can all be measured using a single type of column, the commercially available IAM.PC.DD2. The differences lie in (i) the pH of the buffer (usually pH 5.5–7.4 for skin permeability, pH 6.5–7.4 for intestinal permeability, and pH 7.4 for BBB permeability); and (ii) the molecular descriptors used.

Technical Details: IAM columns are phospholipid-based columns. The first commercially available column was IAM.PC (phosphatidylcholine). IAM.PC.DD2 is the latest version []. A switch from “Type A” silica to “Type B” silica after 2018 caused significant differences in retention for acidic and basic compounds. Valko et al. [] emphasise using new CHI(IAM) values for the calibration of new columns to obtain better in vivo correlations. IAM columns can be further specialised by using other phospholipids as head groups; IAM.PE (phosphatidylethanolamine) shows differences in abundance in vivo [] and IAM.SPH (sphingomyelin) can give unique insights into drug–neuron activity due to its rich presence in animal nerve tissue compared to phosphatidylcholine [].

MLC can provide additional information by predicting epithelial permeability [,,]. MLC can also be used to predict passive membrane permeability and, consequently, %HOA [,,,]. MLC permeability shows a moderate to strong correlation with the PAMPA assay [,] and a medium to good correlation with Caco-2, but only for selected compounds that permeate passively [,]. MLC is also employed to forecast BBB permeability [,,]. Most of the presented studies utilise C18 as the stationary phase and Brij-35 as the surfactant. As with IAM, the key factors that distinguish the different permeability measurements are pH and molecular descriptors.

While the methods mentioned earlier (IAM, MLC) are HTS tools for modelling general membrane partitioning and passive permeability, Cell Membrane Chromatography (CMC) represents a different, more biologically complex approach that may be employed at a later stage of drug discovery. CMC is designed to examine specific drug–receptor interactions using biologically active membranes. This method excels in studying drug–membrane interactions [,,].

Technical Details: The core of the CMC stationary phase consists of cell membranes adsorbed on activated silica gel. These membranes were historically sourced from tissue cells (e.g., rabbit red and white cells, rabbit cardiomyocytes and rat vascular endothelial cells [,,,]), and are now obtained from high-expression recombinant cell lines with specific receptors (e.g., Vascular Endothelial Growth Factor Receptor 2 (VEGFR-2), Fibroblast Growth Factor Receptor-1 (FGFR1) [,]). This approach preserves the biological structure and activity of receptors, allowing for accurate simulation of in vivo interactions [,]. The adsorption of membranes from high-expression cell lines significantly enhances the sensitivity and accuracy of the method. The mechanism of retention is based on the specific recognition between the analyte and the membrane receptor. Ligands, such as drugs, selectively interact with membrane receptors adsorbed on silica gel, achieving chromatographic separation. A key parameter that can be measured is the equilibrium dissociation constant (K_D), which reflects the affinity strength between a drug and its receptor. Methods such as frontal analysis and zonal elution are used within the CMC framework to calculate these K_D values. Although CMC is widely used, its usage in HTS is heavily limited by a lack of commercial availability. Column life is relatively short because membrane receptors detach from the silica gel, which reduces stability and reproducibility. Moreover, the amount of attached membrane receptors in CMC should be controlled to improve accuracy [].

2.3. Toxicity (DIPL and hERG)

Toxicity is often linked to extreme values of physicochemical properties. One specific form, drug-induced phospholipidosis (DIPL), is a condition in which drugs bind excessively to phospholipids, leading to their accumulation within cell lysosomes [,]. Another specific form of toxicity is inhibition of the human ether-à-go-go-related gene (hERG) cardiac potassium channel, which is an important antitarget in early drug discovery [].

For DIPL and hERG, the gold standard involves an in vitro cell-based assay (e.g., using fluorescent dyes) or in vivo histopathology. All these methods are extremely slow and resource-intensive.

IAM retention factor log k_w(IAM) data may be used to predict toxicity. While moderate phospholipid affinity is required for permeability, an extreme affinity is a well-established indicator of DIPL risk [,,,]. Other toxicities, like hERG channel binding, have been correlated with high general lipophilicity (ChromlogD). However, more precise approaches are shown in the literature that leverage both retention factors of k_w(IAM) and k_w(AGP) [].

As toxicity usually relates to specific drug–receptor interactions, CMC could potentially excel in this field; a wide variety of papers address the screening of complex samples and the identification of harmful components [,,,,,,,,,,].

2.4. General Technical Considerations

Excluding CMC and MLC, which have their own unique requirements for mobile phases, mobile phases in BC, especially for IAM, AGP, and HSA, are designed to resemble physiological conditions closely, and they are typically composed of a buffered solution and organic modifiers. Due to the narrow range of optimal pH for these columns (5.0–7.4), the most popular buffers are phosphate and acetate [,,]. While it is possible to conduct experiments using purely aqueous mobile phases, this approach leads to significantly prolonged retention times for analytes. Such time-consuming analyses are fundamentally incompatible with the HTS approach that dominates modern drug discovery, which demands rapid and efficient methods. The introduction of fast gradient methods by Valko et al. was a pivotal development that made it possible to replace slow isocratic runs and enabled the rapid analysis required for HTS [,]. For IAM chromatography, acetonitrile is used almost exclusively as an organic modifier. However, technical optimisation for IAM columns shows that predictive capabilities are similar using methanol []. The literature indicates the possibility of using mass spectrometry detection to increase the throughput, enabling the “pooling” of multiple compounds in a single injection. Russo et al. indicate that this methodology is faster and more environmentally friendly, and that results obtained under MS-friendly chromatographic conditions correlate well with results from standard phosphate buffers [].

3. Machine Learning (ML): Translating Chromatographic Data into Predictions

In the previous sections, we established BC as a high-throughput method for generating experimental data (e.g., log k_w(IAM), log k_w(HSA)) that mimic specific biological interactions. However, the raw retention factor is not, by itself, a prediction. ML, especially supervised learning, provides a set of computational tools that can develop a model to connect these experimental data (A) with in silico descriptors (B/C) to predict the complex in vivo parameters (D) we ultimately focus on, such as log BB or PPB.

ML can be broadly divided into supervised learning (where labelled data means the output is known) and unsupervised learning (where unlabelled data means there is no output). These methods are often combined to build a final model or make inferences (Figure 2). Even simple regression can be implemented manually, but ML frameworks benefit from scalability and efficiency through the ability to chain different algorithms in pipelines (excluding deployment in this context) []. Moreover, in the presented applications, no component automatically acquires new data, preprocesses it, builds new iterations of predictions, and immediately deploys them into production, simply because of the limited amount of available data.

Figure 2. Simple graphical representation of difference between supervised learning (using labelled data) and unsupervised learning (using unlabelled data).

While complex ML techniques, such as deep neural networks, can capture intricate patterns, they often require substantial computational resources, extensive knowledge, and large datasets, and suffer from reduced interpretability. A reasonable approach, and one commonly seen in the literature, is to begin with simple, interpretable models like linear regression and progress to advanced techniques only when necessary [].

The development and training of ML models follow a standard workflow:

Data acquisition—represents curation of comprehensive, representative datasets. The quality and integrity of training data directly influence model performance and can significantly compromise the model’s predictive capabilities and generalisation ability.
Data preprocessing—transforms data, including feature scaling, normalisation and handling missing values. Many ML algorithms exhibit sensitivity to feature scale disparities, where features differing by orders of magnitude can disproportionately influence model training. Standard techniques include standardisation, min-max scaling, and log transforming for heavily skewed distributions.
Data partitioning—involves division of the dataset into training (typically 60–80%), validation (10–20%) and test (10–20%) sets. Data partitioning can be performed randomly, which is most suitable for large, diverse, and evenly distributed datasets, or through rational splitting, such as scaffold splitting that divides by groups of molecules (with similar chemical scaffolds), thereby ensuring better model generalisation and reducing overfitting []. The exact proportion may vary based on dataset size and specific application requirements [].
Model training—an optimisation process where a loss function (e.g., Mean Squared Error (MSE), Cross-Entropy) quantifies the error between the model’s prediction and the actual values. This function acts as a performance indicator that models seek to minimise iteratively throughout the training process. The choice of loss function influences model behaviour, particularly in handling outliers.
Model evaluation—assessment of the final model’s performance on the unseen test set using evaluation metrics (e.g., R², Q²). Evaluation ensures an unbiased assessment of the model’s ability to generalise to new, unseen data points. If the data partitioning step is omitted, this step should include validation methods (e.g., Leave-one-out Cross-Validation, LOOCV). In specific frameworks, such as Quantitative Structure–Activity Relationship (QSAR)/QSRR, and in general scientific papers, statistical tests should also be evaluated (e.g., Fisher test, t-test) to ensure model significance.
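The workflow above can be sketched end to end with a deliberately simple model. The example below is an assumption-laden illustration, not a recommended protocol: the data are synthetic, the model is a univariate linear regression fitted by least squares (which minimises MSE), the split is a random 80/20 partition, and LOOCV on the training set stands in for a separate validation set. Scaling statistics are taken from the training set only, so that no information from the test set leaks into the model.

```python
import random
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (minimises MSE)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# 1. Data acquisition: synthetic retention factors (x) and in vivo values (y).
rng = random.Random(0)
x_all = [4.0 * i / 29 for i in range(30)]
y_all = [0.8 * x - 0.5 + rng.uniform(-0.03, 0.03) for x in x_all]

# 2. Data partitioning: random 80/20 split into training and test sets.
idx = list(range(len(x_all)))
rng.shuffle(idx)
train_idx, test_idx = idx[:24], idx[24:]
x_train = [x_all[i] for i in train_idx]
y_train = [y_all[i] for i in train_idx]
x_test = [x_all[i] for i in test_idx]
y_test = [y_all[i] for i in test_idx]

# 3. Preprocessing: standardise with training-set statistics only.
mu, sd = mean(x_train), pstdev(x_train)
z_train = [(x - mu) / sd for x in x_train]
z_test = [(x - mu) / sd for x in x_test]

# 4. Model training.
slope, intercept = fit_line(z_train, y_train)

# 5. Model evaluation: R2 on the unseen test set, Q2 from LOOCV on the training set.
r2_test = r_squared(y_test, [slope * z + intercept for z in z_test])
press = 0.0
for i in range(len(z_train)):
    s, b = fit_line(z_train[:i] + z_train[i + 1:], y_train[:i] + y_train[i + 1:])
    press += (y_train[i] - (s * z_train[i] + b)) ** 2
y_bar_train = mean(y_train)
q2_loocv = 1.0 - press / sum((y - y_bar_train) ** 2 for y in y_train)

print(f"R2 (test) = {r2_test:.3f}, Q2 (LOOCV) = {q2_loocv:.3f}")
```

With real compound sets, a scaffold-based split would replace the random shuffle, since structurally similar molecules in both training and test sets make the test score overly optimistic.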

The selection of both loss functions and evaluation metrics is dependent on the specific type of ML task being addressed. While the selection of loss functions depends on the model’s requirements, evaluation metrics can be chosen purely based on their interpretability and relevance to business or research objectives.

Although retention factors derived from BC provide valuable information, they often require additional variables to accurately model complex biological parameters, such as log BB. When chromatographic retention data are combined with in silico computed parameters characterising chemical structures, the resulting model is referred to as a QSRR.

3.1. Molecular Representations

Before a quantitative model can be built, a chemical structure must be converted into a computer-readable format, known as a molecular representation. This representation can capture the molecule’s identity at different levels, such as one-dimensional (1D) string (e.g., SMILES), 2D graphs (which define atomic connectivity), or 3D conformations (which define atomic coordinates). From these foundational representations, numerical features are generated for the ML model []. The numerical features derived from molecular representation generally fall into two broad categories.

The first category, molecular descriptors, comprises numerical values that capture diverse, interpretable properties (e.g., logP, pKa, Polar Surface Area (PSA)) and can be broadly categorised into topological, geometrical, electrostatic, quantum, physicochemical and pharmacophoric types. Molecular descriptors can greatly complement experimental data obtained from BC and help with predictive modelling and property calculations. However, model performance relies heavily on the quality of those molecular descriptors [,] (Table 3).

Table 3. Summary of differences between different molecular descriptors [].

Good practice in working with molecular descriptors involves []

Cleaning—handling missing values and descriptors with no variance.
Normalising/standardisation—some ML algorithms like Support Vector Machine (SVM) and Artificial Neural Network (ANN) require features to be on the same scale.
Feature selection/dimensionality reduction—removing low-variance descriptors, eliminating highly correlated descriptors, usually by unsupervised learning.
Handling categorical variables—presence/absence of functional group requires encoding to binary (0, 1).
Handling outliers—deciding if outliers should be included in the model.
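Several of these steps can be chained in a few lines. The sketch below is a minimal illustration on an invented descriptor table: it drops descriptors with missing values or zero variance, removes one descriptor from each highly correlated pair, and standardises what remains. The column names, the values, and the correlation cut-off of 0.95 are assumptions chosen for the example, and dropping a column is only one of several ways to handle missing values.

```python
from statistics import mean, pstdev


def pearson(a, b):
    """Pearson correlation coefficient between two equally long columns."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (len(a) * pstdev(a) * pstdev(b))


def clean_descriptors(table, corr_cutoff=0.95):
    """Drop incomplete, constant, and redundant descriptors, then standardise."""
    kept = {}
    for name, col in table.items():
        if any(v is None for v in col):
            continue  # cleaning: missing values
        if pstdev(col) == 0:
            continue  # cleaning: no variance
        if any(abs(pearson(col, other)) > corr_cutoff for other in kept.values()):
            continue  # feature selection: highly correlated with a kept descriptor
        kept[name] = col
    # Standardisation: zero mean and unit variance for every kept descriptor.
    return {
        name: [(v - mean(col)) / pstdev(col) for v in col]
        for name, col in kept.items()
    }


# Hypothetical descriptor table for four compounds.
descriptors = {
    "logP": [1.0, 2.0, 3.0, 4.0],
    "logP_scaled": [3.0, 5.0, 7.0, 9.0],   # perfectly correlated with logP
    "PSA": [90.0, 20.0, 60.0, 40.0],
    "n_halogens": [1, 1, 1, 1],            # constant
    "n_rings": [1, None, 2, 1],            # missing value
}

cleaned = clean_descriptors(descriptors)
print(list(cleaned))
```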

The second category, molecular fingerprints, encodes structural patterns as binary vectors (bitstrings). Unlike molecular descriptors, molecular fingerprints are used without normalisation or standardisation. They encode narrow structural patterns in the form of bits that are not directly interpretable and can be categorised into substructural, topological, crystallographic, and hybrid types. They are best suited for similarity searching and in silico screening.
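Similarity searching with fingerprints can be illustrated with the Tanimoto coefficient, a widely used similarity measure for bitstrings (the choice of measure is an assumption here, as the text does not name one). In the sketch below each fingerprint is represented as the set of its "on" bit positions; the compound names and bits are invented, whereas real fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits divided by all on-bits in either fingerprint."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fingerprints stored as sets of on-bit indices.
query = {1, 4, 7, 9}
library = {
    "cmpd_A": {1, 4, 7, 8},
    "cmpd_B": {2, 3},
    "cmpd_C": {1, 4, 7, 9},
}

# Rank library compounds by similarity to the query, most similar first.
ranked = sorted(library, key=lambda name: tanimoto(query, library[name]), reverse=True)
for name in ranked:
    print(name, round(tanimoto(query, library[name]), 2))
```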

There are several tools and libraries, both commercial and open-source, for descriptor calculation and modelling, such as RDKit, Open Babel, Scikit-learn, and the KNIME 5.3 analytics platform [,,,]. Open-source cheminformatics tools such as RDKit and Open Babel are essential for calculating the molecular descriptors and fingerprints used in QSRR modelling. Scikit-learn is a comprehensive library for Python (version 3.14.0) that provides tools for feature selection, data scaling, and model validation. KNIME is a suitable option for researchers who prefer not to learn a programming language.

3.2. Unsupervised Learning

Unsupervised learning aims to discover patterns, relationships, and exceptions within variables through algorithms based on clustering (structure discovery and/or anomaly detection) and dimensionality reduction, which are often employed synergistically to extract maximum information. However, while anomaly detection algorithms are a powerful tool, simpler statistical outlier detection is more appropriate and robust for the small datasets discussed in this work.

Clustering methodologies, particularly k-means clustering and hierarchical clustering, have demonstrated significant utility in cheminformatics [,]. In ML, each variable corresponds to a dimension, and dimensionality reduction techniques aim to decrease the number of variables. Principal Component Analysis (PCA), first introduced by Pearson and later developed by Hotelling, identifies orthogonal components that maximise variance in multivariate data [,]. In QSRR, PCA can be used to reduce the number of molecular features generated in silico, thereby preprocessing data before supervised learning techniques are applied. However, it is essential to note that PCA does not select variables but creates new ones (principal components) that retain information from potentially correlated variables. The identification and interpretation of outliers can indicate experimental artefacts or reveal compounds with unique binding mechanisms. Two prominent approaches for anomaly detection are Isolation Forest and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [,]. Isolation Forest identifies anomalies through recursive partitioning of the feature space, while DBSCAN designates points in low-density regions as outliers based on spatial density distributions.
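To make concrete the point that PCA creates new variables rather than selecting existing ones, the sketch below extracts the first principal component of two strongly correlated, invented descriptors. It uses power iteration on the covariance matrix purely to stay dependency-free; this is an illustrative substitute for the eigendecomposition or SVD routines that numerical libraries provide, and the descriptors are left unscaled for simplicity although they would normally be standardised first.

```python
def pca_first_component(rows, n_iter=200):
    """Return (loading vector, explained variance ratio, scores) of the first PC."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
    )
    explained = eigenvalue / sum(cov[a][a] for a in range(d))
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, explained, scores


# Two hypothetical descriptors where the second is roughly twice the first.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
pc1, explained, scores = pca_first_component(list(zip(x1, x2)))

print("loadings:", [round(v, 3) for v in pc1])
print(f"explained variance ratio: {explained:.4f}")
```

The single score per compound replaces both original descriptors while keeping almost all of their variance, which is why the new variable is a mixture of the inputs and is less directly interpretable than either of them.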

Evaluation metrics and loss functions serve different purposes depending on the specific unsupervised learning task. For instance, in clustering, the focus is often on measuring both the cohesion within clusters and the separation between clusters. In contrast, dimensionality reduction metrics typically focus on information preservation and the quality of reconstruction (Table 4).
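
One widely used metric that combines cohesion and separation is the silhouette coefficient. The sketch below is a plain-Python illustration on hypothetical two-dimensional points; it is offered as an example of the idea and is not necessarily one of the metrics listed in Table 4.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) averaged over points,
    where a is the mean distance to the own cluster (cohesion) and b is the
    mean distance to the nearest other cluster (separation)."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == other) / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical compounds described by two standardised descriptors.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
good = silhouette(points, [0, 0, 0, 1, 1, 1])   # sensible clustering
bad = silhouette(points, [0, 1, 0, 1, 0, 1])    # arbitrary clustering
print(round(good, 3), round(bad, 3))
```

Values close to 1 indicate compact, well-separated clusters, whereas values near or below 0 suggest that the assignment does not reflect the structure of the data.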

Table 4. Description of the most popular loss function and evaluation metric for different unsupervised ML algorithms.

Pastewska et al. [,] used hierarchical cluster analysis (CA) and PCA to explore relationships among different lipophilicity measures (both in silico and experimental, including IAM retention). They then applied the Sum of Ranking Differences (SRD), an evaluation metric used to systematically compare and rank those lipophilicity measures, to determine which methods most reliably capture lipophilicity. Jeličić et al. [] used hierarchical CA and PCA to evaluate similarities between calculated and experimentally observed values (including PPB measures from HSA and AGP, and lipophilicity from IAM) and their mutual correlations.

3.3. Supervised Learning

In supervised learning, the model learns a mapping function from labelled data. This is the most common approach for building predictive QSRR and pharmacokinetic models [].

3.3.1. Regression Models

Regression models predict continuous outputs (e.g., log BB, log k_w(IAM)). They can be univariate, with a single predictor variable, or multivariate, with several. The primary advantage of linear regression is its simplicity and high interpretability, because the resulting equation clearly shows which features are most important.

Linear regression assumes additive and proportional effects of any number of variables (X) on the output (Y). When the relationship between variables and output is non-linear, non-linear regression can capture the curvilinear, saturating, or sigmoidal relationships commonly seen in biological systems. In both cases, loss functions assess the difference between predicted values (ŷ) and actual values (y). Linear regression models are the most commonly used in QSRR because of their simplicity and ease of interpretation (Table 5 and Table 6).
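
The interpretability of a linear model can be seen in a minimal sketch: a closed-form least-squares fit of a single predictor, followed by the coefficient of determination R². The retention and log BB values below are hypothetical and serve only to show the mechanics.

```python
def fit_ols(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(y, y_hat):
    """Proportion of the variance in y explained by the predictions."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: IAM retention (log k) versus log BB.
log_k_iam = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
log_bb = [-1.0, -0.6, -0.3, 0.1, 0.3, 0.8]

intercept, slope = fit_ols(log_k_iam, log_bb)
predicted = [intercept + slope * x for x in log_k_iam]
r2 = r_squared(log_bb, predicted)
print(f"log BB = {intercept:.3f} + {slope:.3f} * log k(IAM), R2 = {r2:.3f}")
```

The fitted slope can be read directly as the change in the predicted output per unit change in the retention parameter, which is the kind of transparency that non-linear models generally lack.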

For example, De Vrieze et al. [] built a multivariate linear regression (PLS model, LOOCV) to predict log BB from log k_IAM and log k_MLC. They initially obtained models that fitted the training data well but fitted the validation set poorly, indicating overfitting (average difference between R(PLS) and R(LOOCV) = 0.2407). After the number of variables was manually reduced from 15 to 7, the difference dropped to 0.0928, indicating that models with a reduced number of variables give more robust predictions, a common observation among prediction models. Later, De Vrieze et al. [] evaluated how well log BB can be predicted from retention on IAM.PC.DD2, IAM-sphingo, and IAM-cholesterol columns and, with similarly performing models, showed that each of these columns predicts log BB well. Janicka et al. [] developed a Multiple Linear Regression (MLR) model with the identified key descriptors: log K_w(IAM) and molecular weight (MW). Statistical validation (R² = 0.934, cross-validated Q² = 0.934) confirmed the model’s robustness, with PCA ensuring descriptor independence. In a broader study, Vallianatou et al. [] used unsupervised learning (PCA and hierarchical CA) to obtain a clear overview of the dataset and then applied two models, MLR and Partial Least Squares (PLS), to create an HTS approach for the early evaluation of CNS drugs, modelling not only log BB but also the unbound fraction in brain and the unbound brain volume of distribution.
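
The gap between the fitted and the cross-validated statistic, used above as a sign of overfitting, can be sketched for a single-predictor model as follows. This is only an illustration with hypothetical data and ordinary least squares; the cited studies used PLS and MLR with several variables, and the function names are our own.

```python
def fit_ols(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

def total_sum_of_squares(y):
    mean_y = sum(y) / len(y)
    return sum((v - mean_y) ** 2 for v in y)

def fitted_r2(x, y):
    """R2 computed on the same data that were used for fitting."""
    b0, b1 = fit_ols(x, y)
    ss_res = sum((v - (b0 + b1 * u)) ** 2 for u, v in zip(x, y))
    return 1.0 - ss_res / total_sum_of_squares(y)

def loocv_q2(x, y):
    """Q2 from leave-one-out CV: each compound is predicted by a model
    trained on the remaining n - 1 compounds."""
    press = 0.0  # predictive residual sum of squares
    for i in range(len(x)):
        b0, b1 = fit_ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    return 1.0 - press / total_sum_of_squares(y)

# Hypothetical data: IAM retention (log k) versus log BB.
log_k_iam = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
log_bb = [-1.0, -0.6, -0.3, 0.1, 0.3, 0.8]

r2 = fitted_r2(log_k_iam, log_bb)
q2 = loocv_q2(log_k_iam, log_bb)
print(f"R2 = {r2:.3f}, Q2(LOOCV) = {q2:.3f}, gap = {r2 - q2:.3f}")
```

A small gap suggests that the model generalises to left-out compounds; a large gap, as in the 15-variable models described above, points to overfitting.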

Due to the high complexity of biological prediction, a linear relationship can be insufficient. More complex non-linear algorithms like Support Vector Regression (SVR), Random Forest Regression (RFR), and ANN can capture curvilinear, saturating, or sigmoidal relationships [].

A clear example of non-linear regression is the work of Tsopelas et al. [], who modelled %HOA and captured its sigmoidal relationship with retention factors obtained from IAM chromatography and several molecular descriptors. Ciura et al. [] compared MLR with non-linear ANNs, using both multilayer perceptron (MLP) and radial basis function (RBF) architectures. The models were constructed using 261 experimental CHI_IAM values combined with four molecular descriptors: PSA, MV, HDo, and distribution coefficient at pH 7.4 (logD 7.4) to predict CHI_IAM. While the linear MLR model showed moderate predictive ability (R² = 0.550, Q² = 0.513), the best MLP network (ANN4) demonstrated superior performance with higher R² values for both the training (0.746) and external validation (EV) (0.677) sets. Global sensitivity analysis identified PSA and MV as the most influential descriptors, emphasising the importance of polar interactions and molecular size in drug–membrane interactions.

Table 5. Summary of differences between linear and non-linear regression.

Aspect	Linear Regression	Non-Linear Regression
Interpretability	Directly interpretable coefficients	Parameters are often context-dependent
Flexibility	Limited to linear trends	Captures saturating, sigmoidal or exponential relationships
Models	Ordinary Least Squares (OLS), PLS, MLR, SVR []	Polynomial Regression, SVR [], RFR [], Extreme Gradient Boosting (XGBoost) []

Table 6. Set of three loss functions, validation methods, evaluation metrics, and statistical tests usually used in supervised learning (QSRR).

Loss Functions	Validation Method	Evaluation Metrics	Statistical Test
MSE—calculates the average of the squared differences between ŷ and y. Heavily penalises large errors and is sensitive to outliers. Assumes that errors follow a Gaussian distribution.	LOOCV trains on n − 1 samples, tests on 1, and repeats n times. It is ideal for small datasets. Offers a nearly unbiased estimate but comes with a high computational cost.	Sum of Squared Errors (SSE)—measures the total squared error between predictions and actual values.	F-test—overall model significance. Checks that the prediction is not due to chance alone. Standard threshold: p < 0.05.
Mean Absolute Error (MAE)—calculates the average absolute difference between ŷ and y. It is robust to outliers. Suitable for data containing outliers or errors that follow the Laplace distribution.	k-fold Cross-Validation (k-Fold CV). Splits data into k equal parts. Each fold is used only once as a test set.	R² or R (e.g., “ML Model”)—measures the proportion of variance in the dependent variable that is explained by the independent variables. It ranges from 0 to 1, where R² = 1 means that the model explains all the variance, and R² = 0 implies that the model explains none of the variance.	t-test—individual variable significance. Provides a p-value for each variable. Standard threshold: p < 0.05.
Huber Loss—combines MSE for small errors (smooth and differentiable) and MAE for large errors (robust to outliers) [].	EV (train–test split). Splits data into a training set and an independent test set. The test set should never be used to train models. Gold standard for a prediction model. Provides R²_ext.	Q² or R (e.g., “ML Model” with “Validation method”)—used to evaluate the predictive performance of a model, particularly in cross-validation or EV scenarios. It measures how well the model predicts new, unseen data. It ranges from −∞ to 1, where Q² close to 1 means strong predictive power, and Q² < 0 implies that the model performs poorly on unseen data [,].	Y-randomisation—permutation test. Checks robustness by refitting the model after the output values have been randomly permuted. A marked drop in R² and Q² is the desired result, because it indicates that the original model reflects a real relationship between the variables and the output rather than a chance correlation [].
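
The three loss functions in Table 6 can be written in a few lines, which also makes their different sensitivity to outliers easy to check. The sketch below uses hypothetical values and a Huber threshold `delta` of 1.0 chosen only for illustration.

```python
def mse(y, y_hat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def huber(y, y_hat, delta=1.0):
    """Huber loss: quadratic for residuals up to delta, linear beyond it."""
    total = 0.0
    for a, b in zip(y, y_hat):
        r = abs(a - b)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(y)

# Hypothetical observed values and two sets of predictions,
# the second containing one grossly wrong prediction.
y_true = [1.0, 2.0, 3.0, 4.0]
y_clean = [1.1, 1.9, 3.2, 3.8]
y_outlier = [1.1, 1.9, 3.2, 9.0]

for name, loss in (("MSE", mse), ("MAE", mae), ("Huber", huber)):
    print(name, round(loss(y_true, y_clean), 4), round(loss(y_true, y_outlier), 4))
```

With a single outlying prediction, MSE grows far more than MAE, while the Huber loss behaves like MSE for the small residuals and like MAE for the large one.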

3.3.2. Classification Models

Classification models are supervised ML algorithms designed to predict discrete class labels (e.g., 1/0, active/inactive, toxic/non-toxic) based on input features. Unlike regression, which predicts continuous outcomes, classification assigns data points to predefined categories, making it helpful for decision-making in drug discovery [] (Table 7 and Table 8). The choice of classification model depends on the number of classes to be predicted. For binary labels, there are two popular models: logistic regression, which estimates class probabilities using the logistic function, and SVM, which finds a hyperplane that maximises the margin between classes []. For non-binary classification, such as classification into different drug families, there are two other classic models: multinomial logistic regression, which extends logistic regression to multiple classes, and Random Forest classification. In some cases, hard classification (binary or categorical labelling) may not be the most accurate or informative approach. Instead, probabilistic outputs that quantify the likelihood of class membership often provide a superior strategy for capturing prediction confidence and variability []. Two probabilistic classification models are Naïve Bayes, which applies Bayes’ theorem under the assumption of feature independence, and ANN, which uses softmax activation to produce a probability distribution over classes. Loss functions in classification models quantify the accuracy of a model’s predictions by measuring the discrepancy between the predicted probabilities and the actual labels. These types of ML algorithms are not usually used to correlate with chromatographic data, but rather as classifiers combining different kinds of information, for example, in decision-making for drug discovery. A clever use of this approach was presented by Tsopelas et al. [], who developed a simple classification approach based on a trained non-linear regression model.
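
The building blocks mentioned above (the logistic function, softmax, and a probability-based loss) can be sketched as follows. Binary cross-entropy is used here as a representative classification loss; the scores and labels are hypothetical and do not come from any of the cited studies.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turns a list of class scores into a probability distribution."""
    m = max(scores)  # subtracting the maximum improves numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(labels)

# Hypothetical binary labels (1 = active, 0 = inactive) and model scores.
labels = [1, 0, 1, 0]
confident = [sigmoid(z) for z in (3.0, -3.0, 2.0, -2.0)]
hesitant = [sigmoid(z) for z in (0.2, -0.2, 0.1, -0.1)]

print("Loss, confident model:", round(binary_cross_entropy(labels, confident), 3))
print("Loss, hesitant model:", round(binary_cross_entropy(labels, hesitant), 3))

# Hypothetical scores for three classes, converted to probabilities.
class_probs = softmax([2.0, 1.0, 0.1])
print("Class probabilities:", [round(p, 3) for p in class_probs])
```

Both models assign every compound to the correct class at a 0.5 threshold, yet the loss distinguishes them, which illustrates why probabilistic outputs carry more information than hard labels.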

Table 7. Summary of popular classification models.

Table 8. Description of three popular loss functions and five evaluation metrics for classification models.

4. Discussion and Future Perspectives

The application of biomimetic principles to pharmaceutical sciences has evolved significantly since the early works of Valko et al. [,]. Over the last two decades, the field has advanced from simple isocratic measurements to automated HTS platforms. However, a critical review of the recent literature reveals that while instrumental capabilities have expanded, the field often tends toward incremental improvements rather than transformative innovation.

4.1. Throughput vs. Mechanistic Understanding

A recurring challenge in BC is the trade-off between analytical speed and the depth of mechanistic understanding.

Bunally et al. [] marked a significant shift by introducing a 96/384-well plate format, integrating multiple parameters (ChromlogD, HSA binding, membrane interaction) into a single automated workflow. This addressed the speed bottleneck but potentially simplified the biological interpretation.
Russo et al. [] demonstrated the viability of 2D-LC systems combining HSA and IAM columns. Their work on a visual clustering approach for permeability characterisation offered an alternative to traditional statistical modelling.
Vallianatou et al. [] proposed a complex HTS approach for early-stage CNS drug candidates.
Conversely, Iwakuma et al. [] examined in detail the mechanism of drug–membrane interactions in chromatographic separation on an IAM stationary phase, investigating acetonitrile concentrations and salt effects.
Alternative approaches, such as that of Ciura et al. [] using micellar electrokinetic chromatography (MEKC), raise fundamental questions about whether complex biomimetic surfaces are even necessary, as high correlations (R² = 0.904) were achieved with simplified surfactant systems.

4.2. Analytical Bottleneck

Despite the development of fast gradient methods, a significant number of studies still rely on long gradients or isocratic methods to preserve column life or peak resolution. Furthermore, the field exhibits an over-reliance on ultraviolet (UV) detection, as shown by Russo et al. []. In response, they proposed a mass spectrometry-based approach, demonstrating that it is faster, more environmentally friendly, and yields results that correlate well with those obtained with standard phosphate buffers. However, this involves another trade-off, because switching from phosphate-buffered saline (PBS) to ammonium acetate in different datasets may result in reduced biomimicry [].

4.3. From Regression to Black Boxes

It is important to remember that the performance of various ML models should be assessed using the same datasets. The number and distribution of samples have a significant influence on model performance. For instance, a model trained on a small number of structurally similar samples may produce good predictions during cross-validation (high R², high Q²). However, when it is tested against external datasets, its effectiveness often diminishes (lower R²_ext) because of structural differences across the wider chemical space. This is a typical sign of a model with a limited applicability domain.
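
One of the simplest ways to make the applicability domain explicit is a descriptor-range ("bounding box") check: a new compound is considered inside the domain only if each of its descriptors lies within the range covered by the training set. The sketch below is a hypothetical illustration of that idea, not a method taken from the reviewed studies; more refined approaches (for example, leverage- or distance-based) exist.

```python
def descriptor_ranges(training_rows):
    """Minimum and maximum of each descriptor over the training set."""
    return [(min(col), max(col)) for col in zip(*training_rows)]

def in_domain(compound, ranges):
    """True if every descriptor of the compound lies within the training range."""
    return all(low <= value <= high
               for value, (low, high) in zip(compound, ranges))

# Hypothetical training set: [log D, molecular weight] per compound.
training = [
    [1.2, 250.0],
    [2.5, 310.0],
    [3.1, 280.0],
    [1.8, 330.0],
]
ranges = descriptor_ranges(training)

inside = in_domain([2.0, 300.0], ranges)    # similar to the training compounds
outside = in_domain([5.5, 620.0], ranges)   # far outside the training space
print(inside, outside)
```

Predictions for compounds that fail such a check should be treated with caution, because the model is extrapolating beyond the chemical space it was trained on.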

The field is clearly progressing from linear regression to complex pipelines. Unsupervised learning is much more common in recent works, where it benefits researchers by revealing graphical and statistical relationships between variables [,]. Ciura et al. [] advanced this further by deploying an ANN, which showed superior performance to simple regression. While innovative, the use of increasingly complex ML algorithms threatens to obscure mechanistic understanding through the development of black-box models. There is a risk of prioritising predictive efficiency over understanding the underlying biological issues. However, the answer may lie in the middle, with “gray box” ML algorithms [].

This problem may be exacerbated by the rise and rapid improvement of large language models (LLMs). While they offer immense coding and analytical assistance, they may pose a trap for inexperienced analysts by luring them into black-box models that predict well but are incomprehensible to humans.

4.4. Will in Silico Replace Experimental?

With the increasing availability of curated chemical databases, data-intensive approaches like deep convolutional neural networks pose a risk of making experimental BC obsolete in the future. However, the current literature remains contradictory regarding the predictability of physicochemical parameters through descriptor-based models alone. As shown by the comparative studies of Orzel et al. [] and Iwakuma et al. [], experimental validation remains necessary to capture the dynamic nuances of biological environments.

5. Conclusions

This review highlights that BC has successfully matured from a niche analytical technique into a robust partner for ML in early drug discovery. By combining the high-throughput generation of biologically relevant data (IAM, HSA, AGP, MLC) with advanced QSRR modelling, researchers can now estimate in vivo parameters, such as log BB, %HOA, and specific toxicity, with increasing accuracy.

Author Contributions

D.T. wrote the main manuscript text. D.S. and T.P. supervised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Polish Ministry of Science and Higher Education under the 8th edition of the Implementation Doctorate Program, registration number: DWD/8/0390/2024.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

The authors gratefully acknowledge Celon Pharma S.A. for covering the Article Processing Charge.

Conflicts of Interest

D.T. and D.S. are employees of Celon Pharma S.A. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

(U)HPLC	(Ultra) High-Performance Liquid Chromatography
ADMET	Absorption, Distribution, Metabolism, Excretion, Toxicity
AGP	α-1-acid Glycoprotein
ANN	Artificial Neural Network
AUC	Area Under the Curve
BBB	Blood–Brain Barrier
BC	Biomimetic Chromatography
CA	Cluster Analysis
CHI	Chromatographic Hydrophobicity Index
Cl	Clearance
CMC	Cell Membrane Chromatography
CMK	Critical Micellar Concentration
CNS	Central Nervous System
CTAB	Cationic Cetyltrimethylammonium Bromide
CV	Cross-Validation
DIPL	Drug-Induced PhosphoLipidosis
ED	Equilibrium Dialysis
EV	External Validation
FGFR1	Fibroblast Growth Factor Receptor-1
GI	Gastrointestinal
HAc	Hydrogen bond Acceptors
HDo	Hydrogen bond Donors
hERG	human ether-a-go-go-related gene
HOA	Human Oral Absorption
HSA	Human Serum Albumin
HTS	High-Throughput Screening
IAM	Immobilised Artificial Membrane
LLM	Large Language Model
LOF	Local Outlier Factor
LOOCV	Leave-One-Out Cross-Validation
MAE	Mean Absolute Error
MDCK	Madin-Darby canine kidney
MEKC	Micellar Electrokinetic Chromatography
ML	Machine Learning
MLC	Micellar Liquid Chromatography
MLP	Multilayer Perceptron
MLR	Multiple Linear Regression
MSE	Mean Squared Error
OLS	Ordinary Least Squares
PAMPA	Parallel Artificial Membrane Permeability Assay
PCA	Principal Component Analysis
PLS	Partial Least Squares
PPB	Plasma Protein Binding
PSA	Polar Surface Area
QSAR	Quantitative Structure–Activity Relationship
QSRR	Quantitative Structure–Retention Relationship
RBF	Radial Basis Function
RFR	Random Forest Regression
SDS	Sodium Dodecyl Sulphate
SRD	Sum of Ranking Differences
SVM	Support Vector Machine
SVR	Support Vector Regression
t_1/2	Half-life
UV	Ultraviolet
Vd	Volume of distribution
VEGFR2	Vascular Endothelial Growth Factor Receptor 2
XGBoost	Extreme Gradient Boosting

References

Lipinski, C.A.; Lombardo, F.; Dominy, B.W.; Feeney, P.J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 2012, 64, 4–17.
Leo, A.; Hansch, C.; Elkins, D. Partition coefficients and their uses. Chem. Rev. 1971, 71, 525–616.
Hansch, C.; Fujita, T. ρ-σ-π Analysis. A Method for the Correlation of Biological Activity and Chemical Structure. J. Am. Chem. Soc. 1964, 86, 1616–1626.
Testa, B.; van de Waterbeemd, H.; Folkers, G.; Guy, R. Pharmacokinetic Optimization in Drug Research; Wiley: Hoboken, NJ, USA, 2001.
Avdeef, A. Physicochemical Profiling (Solubility, Permeability and Charge State). Curr. Top. Med. Chem. 2001, 1, 277–351.
Sangster, J.M. Octanol-Water Partition Coefficients: Fundamentals and Physical Chemistry. Eur. J. Med. Chem. 1997, 11, 842.
Valkó, K.; Bevan, C.; Reynolds, D. Chromatographic Hydrophobicity Index by Fast-Gradient RP-HPLC: A High-Throughput Alternative to log P/log D. Anal. Chem. 1997, 69, 2022–2029.
Jeličić, M.-L.; Klarić, D.A.; Kovačić, J.; Verbanac, D.; Mornar, A. Accessing Lipophilicity and Biomimetic Chromatography Profile of Biologically Active Ingredients of Botanicals Used in the Treatment of Inflammatory Bowel Disease. Pharmaceuticals 2022, 15, 965.
Pastewska, M.; Żołnowska, B.; Kovačević, S.; Kapica, H.; Gromelski, M.; Stoliński, F.; Sławiński, J.; Sawicki, W.; Ciura, K. Modeling of Anticancer Sulfonamide Derivatives Lipophilicity by Chemometric and Quantitative Structure-Retention Relationships Approaches. Molecules 2022, 27, 3965.
Pastewska, M.; Bednarczyk-Cwynar, B.; Kovačević, S.; Buławska, N.; Ulenberg, S.; Georgiev, P.; Kapica, H.; Kawczak, P.; Bączek, T.; Sawicki, W.; et al. Multivariate assessment of anticancer oleanane triterpenoids lipophilicity. J. Chromatogr. A 2021, 1656, 462552.
Valko, K.; Du, C.M.; Bevan, C.D.; Reynolds, D.P.; Abraham, M.H. Rapid-Gradient HPLC Method for Measuring Drug Interactions with Immobilized Artificial Membrane: Comparison with Other Lipophilicity Measures. J. Pharm. Sci. 2000, 89, 1085–1096.
Bunally, S.; Young, R.J. The role and impact of high throughput biomimetic measurements in drug discovery. ADMET DMPK 2018, 6, 74–84.
Chrysanthakopoulos, M.; Vallianatou, T.; Giaginis, C.; Tsantili-Kakoulidou, A. Investigation of the retention behavior of structurally diverse drugs on alpha1 acid glycoprotein column: Insight on the molecular factors involved and correlation with protein binding data. Eur. J. Pharm. Sci. 2014, 60, 24–31.
Chrysanthakopoulos, M.; Giaginis, C.; Tsantili-Kakoulidou, A. Retention of structurally diverse drugs in human serum albumin chromatography and its potential to simulate plasma protein binding. J. Chromatogr. A 2010, 1217, 5761–5768.
De Vrieze, M.; Lynen, F.; Chen, K.; Szucs, R.; Sandra, P. Predicting drug penetration across the blood–brain barrier: Comparison of micellar liquid chromatography and immobilized artificial membrane liquid chromatography. Anal. Bioanal. Chem. 2013, 405, 6029–6041.
De Vrieze, M.; Verzele, D.; Szucs, R.; Sandra, P.; Lynen, F. Evaluation of sphingomyelin, cholester, and phosphatidylcholine-based immobilized artificial membrane liquid chromatography to predict drug penetration across the blood-brain barrier. Anal. Bioanal. Chem. 2014, 406, 6179–6188.
Janicka, M.; Sztanke, M.; Sztanke, K. Modeling the Blood-Brain Barrier Permeability of Potential Heterocyclic Drugs via Biomimetic IAM Chromatography Technique Combined with QSAR Methodology. Molecules 2024, 29, 287.
Vallianatou, T.; Tsopelas, F.; Tsantili-Kakoulidou, A. Prediction Models for Brain Distribution of Drugs Based on Biomimetic Chromatographic Data. Molecules 2022, 27, 3668.
Kaliszan, R. QSRR: Quantitative Structure-(Chromatographic) Retention Relationships. Chem. Rev. 2007, 107, 3212–3246.
Tsopelas, F.; Stergiopoulos, C.; Danias, P.; Tsantili-Kakoulidou, A. Biomimetic separations in chemistry and life sciences. Microchim. Acta 2025, 192, 133.
Smith, D.A.; Di, L.; Kerns, E.H. The effect of plasma protein binding on in vivo efficacy: Misconceptions in drug discovery. Nat. Rev. Drug Discov. 2010, 9, 929–939.
Rolan, P. Plasma protein binding displacement interactions—Why are they still regarded as clinically important? Br. J. Clin. Pharmacol. 1994, 37, 125–128.
Rowley, M.; Kulagowski, J.J.; Watt, A.P.; Rathbone, D.; Stevensons, G.I.; Carling, R.W.; Baker, R.; Marshall, G.R.; Kemp, J.A.; Foster, A.C.; et al. Effect of Plasma Protein Binding on in Vivo Activity and Brain Penetration of Glycine/NMDA Receptor Antagonists. J. Med. Chem. 1997, 40, 4053–4068.
Ito, K.; Iwatsubo, T.; Kanamitsu, S.; Nakajima, Y.; Sugiyama, Y. Quantitative prediction of in vivo drug clearance and drug interactions from in vitro data on metabolism, together with binding and transport. Annu. Rev. Pharmacol. Toxicol. 1998, 38, 461–499.
Szliszka, E.; Czuba, Z.P.; Domino, M.; Mazur, B.; Zydowicz, G.; Krol, W. Ethanolic Extract of Propolis (EEP) Enhances the Apoptosis-Inducing Potential of TRAIL in Cancer Cells. Molecules 2009, 14, 738–754.
Eriksson, M.A.L.; Gabrielsson, J.; Nilsson, L.B. Studies of drug binding to plasma proteins using a variant of equilibrium dialysis. J. Pharm. Biomed. Anal. 2005, 38, 381–389.
Valko, K.; Nunhuck, S.; Bevan, C.; Abraham, M.H.; Reynolds, D.P. Fast Gradient HPLC Method to Determine Compounds Binding to Human Serum Albumin. Relationships with Octanol/Water and Immobilized Artificial Membrane Lipophilicity. J. Pharm. Sci. 2003, 92, 2236–2248.
Grumetto, L.; Barbato, F.; Russo, G. Scrutinizing the interactions between bisphenol analogues and plasma proteins: Insights from biomimetic liquid chromatography, molecular docking simulations and in silico predictions. Environ. Toxicol. Pharmacol. 2019, 68, 148–154.
Katopodi, A.; Tsotsou, E.; Iliou, T.; Deligiannidou, G.; Pontiki, E.; Kontogiorgis, C.; Tsopelas, F.; Detsi, A. Synthesis, Bioactivity, Pharmacokinetic and Biomimetic Properties of Multi-Substituted Coumarin Derivatives. Molecules 2021, 26, 5999.
Studziński, M.; Kozyra, P.; Pitucha, M.; Senczyna, B.; Matysiak, J. Retention Behavior of Anticancer Thiosemicarbazides in Biomimetic Chromatographic Systems and In Silico Calculations. Molecules 2023, 28, 7107.
Nisterenko, W.; Kułaga, D.; Woziński, M.; Singh, Y.R.; Judzińska, B.; Jagiello, K.; Greber, K.E.; Sawicki, W.; Ciura, K. Evaluation of Physicochemical Properties of Ipsapirone Derivatives Based on Chromatographic and Chemometric Approaches. Molecules 2024, 29, 1862.
Brusač, E.; Jeličić, M.-L.; Klarić, D.A.; Nigović, B.; Turk, N.; Klarić, I.; Mornar, A. Pharmacokinetic Profiling and Simultaneous Determination of Thiopurine Immunosuppressants and Folic Acid by Chromatographic Methods. Molecules 2019, 24, 3469.
Trainor, G.L. The importance of plasma protein binding in drug discovery. Expert Opin. Drug Discov. 2007, 2, 51–64.
Martínez-Pla, J.J.; Sagrado, S.; Villanueva-Camañas, R.M.; Medina-Hernández, M.J. Retention–property relationships of anticonvulsant drugs by biopartitioning micellar chromatography. J. Chromatogr. B Biomed. Sci. Appl. 2001, 757, 89–99.
Wu, L.; Chen, Y.; Wang, S.; Chen, C.; Ye, L. Quantitative retention–activity relationship models for quinolones using biopartitioning micellar chromatography. Biomed. Chromatogr. 2008, 22, 106–114.
Tsopelas, F.; Danias, P.; Pappa, A.; Tsantili-Kakoulidou, A. Biopartitioning micellar chromatography under different conditions: Insight into the retention mechanism and the potential to model biological processes. J. Chromatogr. A 2020, 1621, 461027.
Quiñones-Torrelo, C.; Sagrado, S.; Villanueva-Camañas, R.M.; Medina-Hernández, M.J. Development of Predictive Retention−Activity Relationship Models of Tricyclic Antidepressants by Micellar Liquid Chromatography. J. Med. Chem. 1999, 42, 3154–3162.
Ruiz-Ángel, M.J.; Carda-Broch, S.; Torres-Lapasió, J.R.; García-Álvarez-Coque, M.C. Retention mechanisms in micellar liquid chromatography. J. Chromatogr. A 2009, 1216, 1798–1814.
Kalyankar, T.M.; Kulkarni, P.D.; Wadher, S.J.; Pekamwar, S.S. Applications of Micellar Liquid Chromatography in Bioanalysis: A Review. J. Appl. Pharm. Sci. 2014, 4, 128–134.
Stielow, M.; Witczyńska, A.; Kubryń, N.; Fijałkowski, Ł.; Nowaczyk, J.; Nowaczyk, A. The Bioavailability of Drugs—The Current State of Knowledge. Molecules 2023, 28, 8038.
Wu, K.; Kwon, S.H.; Zhou, X.; Fuller, C.; Wang, X.; Vadgama, J.; Wu, Y. Overcoming Challenges in Small-Molecule Drug Bioavailability: A Review of Key Factors and Approaches. Int. J. Mol. Sci. 2024, 25, 13121.
Artursson, P.; Karlsson, J. Correlation between oral drug absorption in humans and apparent drug permeability coefficients in human intestinal epithelial (Caco-2) cells. Biochem. Biophys. Res. Commun. 1991, 175, 880–885.
Jorgensen, C.; Linville, R.M.; Galea, I.; Lambden, E.; Vögele, M.; Chen, C.; Troendle, E.P.; Ruggiu, F.; Ulmschneider, M.B.; Schiøtt, B.; et al. Permeability Benchmarking: Guidelines for Comparing in Silico, in Vitro, and in Vivo Measurements. J. Chem. Inf. Model. 2025, 65, 1067–1084.
Irvine, J.D.; Takahashi, L.; Lockhart, K.; Cheong, J.; Tolan, J.; Selick, H.E.; Grove, J.R. MDCK (Madin-Darby Canine Kidney) Cells: A Tool for Membrane Permeability Screening. J. Pharm. Sci. 1999, 88, 28–33.
Kansy, M.; Senner, F.; Gubernator, K. Physicochemical High Throughput Screening: Parallel Artificial Membrane Permeation Assay in the Description of Passive Absorption Processes. J. Med. Chem. 1998, 41, 1007–1010.
Ottaviani, G.; Martel, S.; Carrupt, P.-A. Parallel Artificial Membrane Permeability Assay: A New Membrane for the Fast Prediction of Passive Human Skin Permeability. J. Med. Chem. 2006, 49, 3948–3954.
Di, L.; Kerns, E.H.; Fan, K.; McConnell, O.J.; Carter, G.T. High throughput artificial membrane permeability assay for blood–brain barrier. Eur. J. Med. Chem. 2003, 38, 223–232.
Avdeef, A. Absorption and Drug Development; Wiley: Hoboken, NJ, USA, 2003.
Carpenter, T.S.; Kirshner, D.A.; Lau, E.Y.; Wond, S.E.; Nilmeier, J.P.; Lightstone, F.C. A Method to Predict Blood-Brain Barrier Permeability of Drug-Like Compounds Using Molecular Dynamics Simulations. Biophys. J. 2014, 107, 630–641.
Russo, G.; Grumetto, L.; Baert, M.; Lynen, F. Comprehensive two-dimensional liquid chromatography as a biomimetic screening platform for pharmacokinetic profiling of compound libraries in early drug development. Anal. Chim. Acta 2021, 1142, 157–168.
Morimoto, J.; Miyamoto, K.; Ichikawa, Y.; Uchiyama, M.; Makishima, M.; Hashimoto, Y.; Ishikawa, M. Improvement in aqueous solubility of achiral symmetric cyclofenil by modification to a chiral asymmetric analog. Sci. Rep. 2021, 11, 12697.
Sobańska, A.W.; Brzezińska, E. Immobilized Keratin HPLC Stationary Phase—A Forgotten Model of Transdermal Absorption: To What Molecular and Biological Properties Is It Relevant? Pharmaceutics 2023, 15, 1172.
Orzel, D.; Ravald, H.; Dillon, A.; Rantala, J.; Wiedmer, S.K.; Russo, G. Immobilised artificial membrane liquid chromatography vs liposome electrokinetic capillary chromatography: Suitability in drug/bio membrane partitioning studies and effectiveness in the assessment of the passage of drugs through the respiratory mucosa. J. Chromatogr. A 2024, 1734, 465286.
Neri, I.; MacCallum, J.; Di Lorenzo, R.; Russo, G.; Lynen, F.; Grumetto, L. Into the toxicity potential of an array of parabens by biomimetic liquid chromatography, cell viability assessments and in silico predictions. Sci. Total Environ. 2024, 917, 170461.
Chan, E.C.Y.; Tan, W.L.; Ho, P.C.; Fang, L.J. Modeling Caco-2 permeability of drugs using immobilized artificial membrane chromatography and physicochemical descriptors. J. Chromatogr. A 2005, 1072, 159–168.
Tsopelas, F.; Vallianatou, T.; Tsantili-Kakoulidou, A. The potential of immobilized artificial membrane chromatography to predict human oral absorption. Eur. J. Pharm. Sci. 2016, 81, 82–93.
Pidgeon, C.; Venkataram, U.V. Immobilized artificial membrane chromatography: Supports composed of membrane lipids. Anal. Biochem. 1989, 176, 36–47.
Valko, K.; Rava, S.; Bunally, S.; Anderson, S. Revisiting the application of immobilized artificial membrane (IAM) chromatography to estimate in vivo distribution properties of drug discovery compounds based on the model of marketed drugs. ADMET DMPK 2020, 8, 78–97.
Patel, D.; Witt, S.N. Ethanolamine and Phosphatidylethanolamine: Partners in Health and Disease. Oxidative Med. Cell. Longev. 2017, 2017, 4829180.
Siakotos, A.N.; Rouser, G.; Fleischer, S. Isolation of highly purified human and bovine brain endothelial cells and nuclei and their phospholipid composition. Lipids 1969, 4, 234–239.
Martínez-Pla, J.J.; Martín-Biosca, Y.; Sagrado, S.; Villanueva-Camañas, R.M.; Medina-Hernández, M.J. Evaluation of the pH effect of formulations on the skin permeability of drugs by biopartitioning micellar chromatography. J. Chromatogr. A 2004, 1047, 255–262.
Martínez-Pla, J.J.; Martín-Biosca, Y.; Sagrado, S.; Villanueva-Camañas, R.M.; Medina-Hernández, M.J. Biopartitioning micellar chromatography to predict skin permeability. Biomed. Chromatogr. 2003, 17, 530–537.
Waters, L.J.; Shahzad, Y.; Stephenson, J. Modelling skin permeability with micellar liquid chromatography. Eur. J. Pharm. Sci. 2013, 50, 335–340.
Molero-Monfort, M.; Escuder-Gilabert, L.; Villanueva-Camañas, R.M.; Sagrado, S.; Medina-Hernández, M.J. Biopartitioning micellar chromatography: An in vitro technique for predicting human drug absorption. J. Chromatogr. B Biomed. Sci. Appl. 2001, 753, 225–236.
Molero-Monfort, M.; Martín-Biosca, Y.; Sagrado, S.; Villanueva-Camañas, R.M.; Medina-Hernández, M.J. Micellar liquid chromatography for prediction of drug transport. J. Chromatogr. A 2000, 870, 1–11.
Čudina, O.; Marković, B.; Karljiković-Rajić, K.; Vladimirov, S. Biopartitioning Micellar Chromatography-Partition Coefficient Micelle/Water as a Potential Descriptor for Hydrophobicity in Prediction of Oral Drug Absorption. Anal. Lett. 2012, 45, 677–688.
De Vrieze, M.; Janssens, P.; Szucs, R.; Van der Eycken, J.; Lynen, F. In vitro prediction of human intestinal absorption and blood–brain barrier partitioning: Development of a lipid analog for micellar liquid chromatography. Anal. Bioanal. Chem. 2015, 407, 7453–7466.
Russo, G.; Grumetto, L.; Szucs, R.; Barbato, F.; Lynen, F. Determination of in Vitro and in Silico Indexes for the Modeling of Blood–Brain Barrier Partitioning of Drugs via Micellar and Immobilized Artificial Membrane Liquid Chromatography. J. Med. Chem. 2017, 60, 3739–3754. [Google Scholar] [CrossRef]
Ma, W.; Yang, L.; Lv, Y.; Fu, J.; Zhang, Y.; He, L. Determine equilibrium dissociation constant of drug-membrane receptor affinity using the cell membrane chromatography relative standard method. J. Chromatogr. A 2017, 1503, 12–20. [Google Scholar] [CrossRef]
Ma, W.; Zhang, Y.; Li, J.; Liu, R.; Che, D.; He, L. Analysis of Drug Interactions with Dopamine Receptor by Frontal Analysis and Cell Membrane Chromatography. Chromatographia 2015, 78, 649–654. [Google Scholar] [CrossRef]
Ma, W.; Zhang, D.; Li, J.; Che, D.; Liu, R.; Zhang, J.; Zhang, Y. Interactions between histamine H1 receptor and its antagonists by using cell membrane chromatography method. J. Pharm. Pharmacol. 2015, 67, 1567–1574. [Google Scholar] [CrossRef]
He, L.; Yang, G.; Geng, X. Enzymatic activity and chromatographic characteristics of the cell membrane immobilized on silica surface. Chin. Sci. Bull. 1999, 44, 826–831. [Google Scholar] [CrossRef]
He, L.; Wang, S.; Geng, X. Coating and fusing cell membranes onto a silica surface and their chromatographic characteristics. Chromatographia 2001, 54, 71–76. [Google Scholar] [CrossRef]
Li, C.; He, L. Establishment of the model of white blood cell membrane chromatography and screening of antagonizing TLR4 receptor component from Atractylodes macrocephala Koidz. Sci. China Life Sci. 2006, 49, 11. [Google Scholar] [CrossRef]
Yang, X.; Zhang, Y.; Zhang, X.; Chang, R.; Li, X. Development of a Stationary Phase of Vascular Smooth Muscle Cell Membrane Chromatography and Its Chromatographic Affinity Characteristics. Chromatographia 2011, 73, 1065–1071. [Google Scholar] [CrossRef]
Zhou, Y.; Luo, W.; Zheng, L.; Li, M.; Zhang, Y. Construction of recombinant FGFR1 containing full-length gene and its potential application. Plasmid 2010, 64, 60–67. [Google Scholar] [CrossRef]
Li, M.; Wang, S.; Zhang, Y.; He, L. An online coupled cell membrane chromatography with LC/MS method for screening compounds from Aconitum carmichaeli Debx. acting on VEGFR-2. J. Pharm. Biomed. Anal. 2010, 53, 1063–1069. [Google Scholar] [CrossRef] [PubMed]
Slon-Usakiewicz, J.J.; Ng, W.; Dai, J.-R.; Pasternak, A.; Redden, P.R. Frontal affinity chromatography with MS detection (FAC-MS) in drug discovery. Drug Discov. Today 2005, 10, 409–416. [Google Scholar] [CrossRef] [PubMed]
Hou, X.; Wang, S.; Zhang, T.; Ma, J.; Zhang, J.; Zhang, Y.; Lu, W.; He, H.; He, L. Recent advances in cell membrane chromatography for traditional Chinese medicines analysis. J. Pharm. Biomed. Anal. 2014, 101, 141–150. [Google Scholar] [CrossRef] [PubMed]
Ma, W.; Wang, C.; Liu, R.; Wang, N.; Lv, Y.; Dai, B.; He, L. Advances in cell membrane chromatography. J. Chromatogr. A 2021, 1639, 461916. [Google Scholar] [CrossRef]
Reasor, M.J.; Kacew, S. Drug-Induced Phospholipidosis: Are There Functional Consequences? Exp. Biol. Med. 2001, 226, 825–830. [Google Scholar] [CrossRef]
Halliwell, W.H. Cationic Amphiphilic Drug-Induced Phospholipidosis. Toxicol. Pathol. 1997, 25, 53–60. [Google Scholar] [CrossRef]
Garrido, A.; Lepailleur, A.; Mignani, S.M.; Dallemagne, P.; Rochais, C. hERG toxicity assessment: Useful guidelines for drug design. Eur. J. Med. Chem. 2020, 195, 112290. [Google Scholar] [CrossRef] [PubMed]
Iwakuma, Y.; Okamoto, H.; Hamaguchi, R.; Kuroda, Y. The Limited Contribution of the Analyte Partition to the Water-Rich Layer in Immobilized Artificial Membrane Chromatography with an Acetonitrile-Rich Binary Mobile Phase. Chromatographia 2019, 82, 1311–1320. [Google Scholar] [CrossRef]
Fedorowicz, J.; Bazar, D.; Brankiewicz, W.; Kapica, H.; Ciura, K.; Zalewska-Piątek, B.; Piątek, R.; Cal, K.; Mojsiewicz-Pieńkowska, K.; Sączewski, J. Development of Safirinium dyes for new applications: Fluorescent staining of bacteria, human kidney cells, and the horny layer of the epidermis. Sci. Rep. 2022, 12, 15098. [Google Scholar] [CrossRef]
Iwakuma, Y.; Okamoto, H.; Hamaguchi, R.; Kuroda, Y. Immobilized Artificial Membrane Chromatography Using Acetonitrile-Rich Mobile Phase for Comparison of Retention Properties Between Phospholipidosis-Inducing and Non-inducing Basic Drugs. Chromatographia 2023, 86, 43–54. [Google Scholar] [CrossRef]
Stergiopoulos, C.; Tsopelas, F.; Valko, K. Prediction of hERG inhibition of drug discovery compounds using biomimetic HPLC measurements. ADMET DMPK 2021, 9, 191–207. [Google Scholar] [CrossRef]
Ma, W.; Zhu, M.; Zhang, D.; Yang, L.; Yang, T.; Li, X.; Zhang, Y. Berberine inhibits the proliferation and migration of breast cancer ZR-75-30 cells by targeting Ephrin-B2. Phytomedicine 2017, 25, 45–51. [Google Scholar] [CrossRef]
Jia, D.; Chen, X.; Cao, Y.; Wu, X.; Ding, X.; Zhang, H.; Zhang, C.; Chai, Y.; Zhu, Z. On-line comprehensive two-dimensional HepG2 cell membrane chromatographic analysis system for charactering anti-hepatoma components from rat serum after oral administration of Radix scutellariae: A strategy for rapid screening active compounds in vivo. J. Pharm. Biomed. Anal. 2016, 118, 27–33. [Google Scholar] [CrossRef]
Chen, Y.; Zhang, N.; Ma, J.; Zhu, Y.; Wang, M.; Wang, X.; Zhang, P. A Platelet/CMC coupled with offline UPLC-QTOF-MS/MS for screening antiplatelet activity components from aqueous extract of Danshen. J. Pharm. Biomed. Anal. 2016, 117, 178–183. [Google Scholar] [CrossRef]
Wu, X.; Chen, X.; Jia, D.; Cao, Y.; Gao, S.; Guo, S.; Zerbe, P.; Chai, Y.; Diao, Y.; Zhang, Y. Characterization of anti-leukemia components from Indigo naturalis using comprehensive two-dimensional K562/cell membrane chromatography and in silico target identification. Sci. Rep. 2016, 6, 25491. [Google Scholar] [CrossRef]
Wei, F.; Hu, Q.; Huang, J.; Han, S.; Wang, S. Screening active compounds from Corydalis yanhusuo by combining high expression VEGF receptor HEK293 cell membrane chromatography with HPLC-ESI-IT-TOF-MSn method. J. Pharm. Biomed. Anal. 2017, 136, 134–139. [Google Scholar] [CrossRef]
Lin, Y.; Wang, C.; Hou, Y.; He, H.; Huang, H.; Yang, L.; Sun, M. The human mast cell line-1 cell membrane chromatography coupled with HPLC-ESI-MS/MS method for screening potentical anaphylactic components from chuanxinlian injection. Biomed. Chromatogr. 2017, 31, e4015. [Google Scholar] [CrossRef] [PubMed]
Lv, Y.; Fu, J.; Shi, X.; Yang, Z.; Han, S. Screening allergic components of Yejuhua injection using LAD2 cell membrane chromatography model online with high performance liquid chromatography-ion trap-time of flight-mass spectrum system. J. Chromatogr. B 2017, 1055–1056, 119–124. [Google Scholar] [CrossRef] [PubMed]
Lin, Y.; Lv, Y.; Fu, J.; Jia, Q.; Han, S. A high expression Mas-related G protein coupled receptor X2 cell membrane chromatography coupled with liquid chromatography and mass spectrometry method for screening potential anaphylactoid components in kudiezi injection. J. Pharm. Biomed. Anal. 2018, 159, 483–489. [Google Scholar] [CrossRef] [PubMed]
Jia, Q.; Sun, W.; Zhang, L.; Fu, J.; Lv, Y.; Lin, Y.; Han, S. Screening the anti-allergic components in Saposhnikoviae Radix using high-expression Mas-related G protein-coupled receptor X2 cell membrane chromatography online coupled with liquid chromatography and mass spectrometry. J. Sep. Sci. 2019, 42, 2351–2359. [Google Scholar] [CrossRef]
Xie, Y.; Wei, D.; Hu, T.; Hou, Y.; Lin, Y.; He, H.; Wang, C. Anti-pseudo-allergic capacity of alkaloids screened from Uncaria rhynchophylla. New J. Chem. 2020, 44, 38–45. [Google Scholar] [CrossRef]
Taillardat-Bertschinger, A.; Galland, A.; Carrupt, P.-A.; Testa, B. Immobilized artificial membrane liquid chromatography: Proposed guidelines for technical optimization of retention measurements. J. Chromatogr. A 2002, 953, 39–53. [Google Scholar] [CrossRef]
Russo, G.; Grumetto, L.; Szucs, R.; Barbato, F.; Lynen, F. Screening therapeutics according to their uptake across the blood-brain barrier: A high throughput method based on immobilized artificial membrane liquid chromatography-diode-array-detection coupled to electrospray-time-of-flight mass spectrometry. Eur. J. Pharm. Biopharm. 2018, 127, 72–84. [Google Scholar] [CrossRef]
Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006. [Google Scholar]
Breiman, L. Statistical Modeling: The Two Cultures. Stat. Sci. 2001, 16, 199–215. [Google Scholar] [CrossRef]
Bemis, G.W.; Murcko, M.A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887–2893. [Google Scholar] [CrossRef]
Liu, L.; Ozsu, M.T. Encyclopedia of Database Systems, 1st ed.; Springer: New York, NY, USA, 2016. [Google Scholar]
Wang, S.; Zhang, R.; Li, X.; Cai, F.; Ma, X.; Tang, Y.; Xu, C.; Wang, L.; Ren, P.; Liu, L.; et al. Recent advances in molecular representation methods and their applications in scaffold hopping. npj Drug Discov. 2025, 2, 14. [Google Scholar] [CrossRef]
Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; Wiley: Hoboken, NJ, USA, 2000. [Google Scholar]
Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics; Wiley: Hoboken, NJ, USA, 2009. [Google Scholar]
Guyon, I.M.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
RDKit: Open-Source Cheminformatics. Available online: https://www.rdkit.org (accessed on 19 October 2025).
O’Boyle, N.M.; Banck, M.; James, C.A.; Morley, C.; Vandermeersch, T.; Hutchison, G.R. Open Babel: An open chemical toolbox. J. Cheminform. 2011, 3, 33. [Google Scholar] [CrossRef] [PubMed]
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
Preisach, C.; Burkhardt, H.; Schmidt-Thieme, L.; Decker, R. (Eds.) Data Analysis, Machine Learning and Applications: Proceedings of the 31st Annual Conference of the Gesellschaft Für Klassifikation EV, Albert-Ludwigs-Universität Freiburg, March 7–9, 2007; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
MacQueen, J.B. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; Statistical Laboratory of the University of California: Berkeley, CA, USA, 1967; Volume 5.1, pp. 281–297. [Google Scholar]
Ward, J.H. Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 1963, 58, 236–244. [Google Scholar] [CrossRef]
Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572. [Google Scholar] [CrossRef]
Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 417–441. [Google Scholar] [CrossRef]
Liu, F.T.; Ting, K.M.; Zhou, Z.-H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; IEEE: New York, NY, USA, 2008; pp. 413–422. [Google Scholar]
Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the KDD’96, Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231. [Google Scholar]
Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
Héberger, K. Sum of ranking differences compares methods or models fairly. TrAC Trends Anal. Chem. 2010, 29, 101–109. [Google Scholar] [CrossRef]
Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef]
Breunig, M.M.; Kriegel, H.-P.; Ng, R.T.; Sander, J. LOF: Identifying density-based local outliers. ACM SIGMOD Rec. 2000, 29, 93–104. [Google Scholar] [CrossRef]
Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning, 2nd ed.; Springer: New York, NY, USA, 2009. [Google Scholar]
Bates, D.M.; Watts, D.G. Nonlinear Regression Analysis and Its Applications; Wiley: Hoboken, NJ, USA, 1988. [Google Scholar]
Ciura, K.; Kovačević, S.; Pastewska, M.; Kapica, H.; Kornela, M.; Sawicki, W. Prediction of the chromatographic hydrophobicity index with immobilized artificial membrane chromatography using simple molecular descriptors and artificial neural networks. J. Chromatogr. A 2021, 1660, 462666. [Google Scholar] [CrossRef] [PubMed]
Lindley, D.V.; Smith, A.F.M. Bayes Estimates for the Linear Model. J. R. Stat. Soc. Ser. B Methodol. 1972, 34, 1–41. [Google Scholar] [CrossRef]
Vapnik, V.; Golowich, S.E.; Smola, A. Support vector method for function approximation, regression estimation and signal processing. In Proceedings of the 10th International Conference on Neural Information Processing Systems, Denver, CO, USA, 3–5 December 1996; MIT Press: Cambridge, MA, USA, 1996; pp. 281–287. [Google Scholar]
Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
Chen, T.; Guestrin, C. XGBoost. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794. [Google Scholar]
Huber, P.J. Robust Estimation of a Location Parameter. Ann. Math. Stat. 1964, 35, 73–101. [Google Scholar] [CrossRef]
Stone, M. Cross-Validatory Choice and Assessment of Statistical Predictions. J. R. Stat. Soc. Ser. B Stat. Methodol. 1974, 36, 111–133. [Google Scholar] [CrossRef]
Wold, S. Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models. Technometrics 1978, 20, 397. [Google Scholar] [CrossRef]
Rücker, C.; Rücker, G.; Meringer, M. y-Randomization and Its Variants in QSPR/QSAR. J. Chem. Inf. Model. 2007, 47, 2345–2357. [Google Scholar] [CrossRef]
Cox, D.R. The Regression Analysis of Binary Sequences. J. R. Stat. Soc. Ser. B Stat. Methodol. 1958, 20, 215–232. [Google Scholar] [CrossRef]
Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees, 1st ed.; Routledge: Abingdon, UK, 2017. [Google Scholar]
Podgorelec, V.; Zorman, M. Decision Tree Learning. In Encyclopedia of Complexity and Systems Science; Springer: Berlin/Heidelberg, Germany, 2015; pp. 1–28. [Google Scholar]
Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
McCulloch, W.S.; Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 1943, 5, 115–133. [Google Scholar] [CrossRef]
Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958, 65, 386–408. [Google Scholar] [CrossRef]
Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
Vapnik, V.N. The Nature of Statistical Learning Theory, 2nd ed.; Springer: New York, NY, USA, 2000. [Google Scholar]
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2999–3007. [Google Scholar]
Ciura, K.; Ulenberg, S.; Kapica, H.; Kawczak, P.; Belka, M.; Bączek, T. Drug affinity to human serum albumin prediction by retention of cetyltrimethylammonium bromide pseudostationary phase in micellar electrokinetic chromatography and chemically advanced template search descriptors. J. Pharm. Biomed. Anal. 2020, 188, 113423. [Google Scholar] [CrossRef]
Ciura, K. Modeling of small molecule’s affinity to phospholipids using IAM-HPLC and QSRR approach enhanced by similarity-based machine algorithms. J. Chromatogr. A 2024, 1714, 464549. [Google Scholar] [CrossRef]

Figure 1. Flowchart of a prediction model trained on compounds with publicly accessible in vivo data and then used to forecast the corresponding in vivo data for unknown compounds. A: experimentally derived data, e.g., retention time; B: molecular descriptors generated in silico; C: molecular fingerprints generated in silico; D: in vivo data.
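
The workflow in Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a toy k-nearest-neighbour regressor stands in for whichever model is actually trained, the helper names `build_features` and `knn_predict` are hypothetical, and every number is an invented placeholder rather than measured data. In practice the feature blocks would normally be scaled before distances are computed.

```python
from math import dist


def build_features(retention, descriptors, fingerprint):
    """Concatenate blocks A, B and C of Figure 1 into one feature vector."""
    return [retention, *descriptors, *fingerprint]


def knn_predict(train_X, train_y, query, k=2):
    """Predict block D as the mean in vivo value of the k nearest training compounds."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical training compounds with known in vivo data (placeholders only).
train_X = [
    build_features(1.0, [0.5, 2.0], [1, 0, 1]),
    build_features(1.2, [0.6, 2.1], [1, 0, 1]),
    build_features(3.0, [2.5, 0.5], [0, 1, 0]),
]
train_y = [10.0, 12.0, 40.0]

# A new compound: A is measured, B and C are computed in silico, D is unknown.
new_compound = build_features(1.1, [0.55, 2.05], [1, 0, 1])
prediction = knn_predict(train_X, train_y, new_compound, k=2)
print(prediction)
```

The point of the sketch is the data flow: measured and computed features are merged into one vector per compound, and the in vivo value is only needed for the training set.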

Figure 2. Simple graphical representation of the difference between supervised learning (using labelled data) and unsupervised learning (using unlabelled data).

Table 1. Connection between lipophilicity and ADMET [].

ADMET Process	Parameters Influenced by Lipophilicity
Absorption	Solubility; membrane permeability
Distribution	Blood–brain barrier (BBB) permeability; volume of distribution (V_D)
Metabolism and Excretion	Susceptibility to oxidative metabolism; half-life (t_1/2); clearance (Cl)
Toxicity	Drug-induced phospholipidosis (DIPL); human ether-a-go-go-related gene (hERG) toxicity

Table 2. Comparative table for BC techniques.

Method	Stationary Phase (Characteristics)	Application Area	Advantages	Disadvantages
IAM	Phospholipids (e.g., phosphatidylcholine) covalently bonded to silica	Membrane permeability, BBB permeability, lipophilicity (phospholipid affinity), phospholipidosis risk, human oral absorption, protein binding	Commercially available (Regis), robust, HTS-compatible; a suitable model for passive diffusion	Does not model active transport. The switch to “Type B” silica caused retention shifts.
Affinity (HSA/AGP)	Immobilised plasma proteins on a support	Plasma protein binding, chiral separations	Commercially available (Daicel), HTS-compatible, directly measures binding to key plasma proteins	Specific protein binding. Does not measure membrane permeability or lipophilicity.
CMC	Immobilised, intact cell membranes or whole cells (e.g., HEK 293)	Drug–membrane interactions, specificity testing	Provides the most biologically relevant model	Not commercially available, complex to prepare, lower stability and robustness, HTS-incompatible.
MLC	Standard RPLC phase (e.g., C18) with a micelle-containing mobile phase (e.g., SDS, CTAB)	Membrane permeability, BBB permeability, lipophilicity, human oral absorption, protein binding	Uses standard columns, cost-effective, versatile (surfactant choice alters properties)	Complex separation mechanism (dual equilibrium). Micelles may not perfectly mimic biological membranes.

Table 3. Summary of differences between different molecular descriptors [].

Type of Descriptor	Definition	Examples	Application
Topological	Derived from molecular graphs; encode information about the connectivity and branching of atoms in a molecule	Degree of branching, molecular connectivity indices, Wiener index	Solubility; boiling point; biological activity
Geometrical	Encode information about the 3D shape and size of the molecule	Molecular surface area (MSA), molecular volume (MV), principal moments of inertia	Molecular interactions
Electrostatic	Quantify the distribution of electric charge within a molecule	Partial atomic charges, dipole moment, electrostatic potential maps	Hydrogen bonding; ionic interactions
Quantum	Derived from quantum mechanical calculations	HOMO-LUMO gap, ionisation potential, electron affinity	Chemical reactivity; stability; spectroscopic properties
Physicochemical	Represent physical and chemical properties of molecules	logP, pKa, PSA	ADME properties
Pharmacophoric	Represent the spatial arrangement of features in a molecule that are essential for biological activity	Hydrogen bond donors (HDo), hydrogen bond acceptors (HAc), aromatic rings	ADME properties
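
As a concrete instance of a topological descriptor from Table 3, the Wiener index can be computed directly from the molecular graph as the sum of shortest-path distances (counted in bonds) over all pairs of heavy atoms. The sketch below assumes a hydrogen-suppressed graph supplied as a plain adjacency dictionary; the function name `wiener_index` and the hand-written graphs are illustrative, and a cheminformatics toolkit would normally derive the graph from a structure file.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path lengths (in bonds) over all unordered atom pairs."""
    total = 0
    for start in adjacency:
        # Breadth-first search yields bond-count distances from `start`.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in distance:
                    distance[neighbour] = distance[atom] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
    return total // 2  # every pair was counted once from each end


# Carbon skeletons of two C4H10 isomers (hydrogens omitted).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane), wiener_index(isobutane))
```

The branched isomer gives the smaller value, which is how this descriptor encodes the degree of branching with a single number.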

Table 4. Description of the most popular loss functions and evaluation metrics for different unsupervised ML algorithms.

Method Category	Clustering	Dimensionality Reduction	Anomaly Detection
Loss function	Silhouette coefficient: evaluates cluster cohesion and separation by measuring how similar points are to their own cluster compared to other clusters. Ranges from −1 to 1, where higher values indicate better-defined clusters [].	Kullback–Leibler divergence: measures the difference between the probability distributions of the original and the reduced-dimensional data. Widely used in variational autoencoders. Lower values indicate better preservation of data structure [].	Isolation score: measures how easily a point can be isolated from the rest of the data through random partitioning. Lower values indicate a higher likelihood of being an outlier [].
Evaluation metric	Sum of ranking differences (SRD): evaluates how different clustering algorithms rank or group similar objects []. Inertia: the sum of the squared distances between each data point and its closest centroid, commonly used in k-means. Lower values indicate more compact clusters [].	Reconstruction error: quantifies the difference between the original data and its reconstruction after dimensionality reduction; particularly relevant for autoencoders. Lower values indicate better preservation of information [].	Local outlier factor (LOF) score: compares the local density of a point with the densities of its neighbours. Higher values indicate a higher likelihood of being an outlier [].
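
Two of the clustering measures in Table 4 are simple enough to compute by hand. The following is a minimal pure-Python sketch, not a substitute for a library implementation: the function names and the four toy points are assumptions made for illustration, inertia is taken relative to the centroid of each point's assigned cluster (which coincides with the closest centroid once k-means has converged), and singleton clusters are given a silhouette of 0 by convention.

```python
from math import dist


def inertia(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(dist(point, centroid) ** 2 for point in members)
    return total


def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [dist(p, q) for j, (q, lab_q) in enumerate(zip(points, labels))
                if lab_q == lab and j != i]
        if not same:
            scores.append(0.0)  # convention for a cluster with one member
            continue
        a = sum(same) / len(same)  # mean distance within the own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(p, q) for q, lab_q in zip(points, labels) if lab_q == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = [0, 0, 1, 1]  # groups neighbouring points together
bad = [0, 1, 0, 1]   # pairs each point with a distant one

print(inertia(points, good), silhouette(points, good))
print(inertia(points, bad), silhouette(points, bad))
```

The sensible labelling gives a low inertia and a silhouette close to 1, while the poor labelling gives a much higher inertia and a negative silhouette, matching the interpretation given in the table.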

Table 7. Summary of popular classification models.

Model	Logistic Regression []	Decision Trees [,]	SVM []	ANN [,,]
Strengths	Interpretable, efficient with small data	Handles non-linear data, interpretable	Effective in high-dimensional spaces	Captures complex patterns, scalable
Limitations	Limited to linear decision boundaries	Prone to overfitting	Computationally intensive with large data	Requires large datasets, poor interpretability
Use Case	Binary toxicity	Rule-based ADMET screening	Drug–target interaction prediction	Multi-task toxicity

Table 8. Description of three popular loss functions and five evaluation metrics for classification models.

Loss Functions for Classification Models	Evaluation Metrics for Classification Models
Cross-entropy loss (log loss): measures the difference between predicted class probabilities and true labels [].	Accuracy: ratio of correct predictions (both positive and negative) to all predictions. Best suited to balanced datasets.
Hinge loss: used for margin maximisation in SVM. Penalises predictions that fall on the wrong side of the decision boundary [].	Precision: ratio of correctly predicted positive instances to all instances predicted as positive. Measures how reliable a positive prediction is.
Focal loss: addresses class imbalance by focusing on hard-to-classify examples and giving a small weight to easy examples [].	Specificity: ratio of actual negative instances that the model correctly identifies; the counterpart of recall for the negative class.
	Recall (sensitivity): ratio of actual positive instances that the model correctly identifies.
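
The quantities in Table 8 follow directly from the confusion matrix and the predicted probabilities. The sketch below is illustrative only and rests on a few assumptions: labels are coded as 0 (negative) and 1 (positive), the hinge loss maps them to −1/+1 and takes raw decision scores, and the focal loss omits the optional class-weighting factor so that a focusing parameter `gamma` of 0 reduces it to cross-entropy. The example labels are made up.

```python
from math import log


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and specificity from binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Note: the ratios are undefined if a denominator is zero
    # (e.g. no positive predictions at all); not handled in this sketch.
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def cross_entropy(y_true, p_pred):
    """Mean binary cross-entropy; p_pred are predicted probabilities of class 1."""
    return -sum(t * log(p) + (1 - t) * log(1 - p)
                for t, p in zip(y_true, p_pred)) / len(y_true)


def hinge_loss(y_true, scores):
    """Mean hinge loss; 0/1 labels are mapped to -1/+1, scores are raw margins."""
    return sum(max(0.0, 1.0 - (2 * t - 1) * s)
               for t, s in zip(y_true, scores)) / len(y_true)


def focal_loss(y_true, p_pred, gamma=2.0):
    """Mean focal loss without class weighting; easy examples are down-weighted."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p_t = p if t == 1 else 1 - p  # probability assigned to the true class
        total += -((1 - p_t) ** gamma) * log(p_t)
    return total / len(y_true)


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
p_pred = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.1, 0.05]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
print(cross_entropy(y_true, p_pred), focal_loss(y_true, p_pred))
```

With a focusing parameter above 0, confidently correct predictions contribute almost nothing to the focal loss, which is why it is attractive for imbalanced toxicity datasets where easy negatives dominate.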

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
