AI and Big Data in Chemistry

A special issue of Chemistry (ISSN 2624-8549).

Deadline for manuscript submissions: 31 December 2026 | Viewed by 4902

Editors


Guest Editor
School of Chemistry and Chemical Engineering, Chongqing University, Chongqing 401331, China
Interests: physical chemistry and chemical physics; theoretical chemistry; reaction dynamics; reaction kinetics; potential energy surface; artificial intelligence in chemistry; reaction mechanisms; quantum chemistry; computational chemistry; reaction network; gas-phase chemistry; gas-liquid scattering; atmospheric chemistry

Guest Editor
Shanghai Engineering Research Center of Molecular Therapeutics & New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200241, China
Interests: AI-driven chemical reaction dynamics; multi-scale simulation for bio-molecules; theoretical study of metalloproteins; computer aided drug design; quantum chemical calculation for macromolecules

Guest Editor
N.D. Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Moscow, Russia
Interests: molecular complexity and transformations; metal complexes and nanoparticles; development of new generation of highly active nanosized and molecular catalysts; mechanistic studies of chemical reactions by experimental and theoretical methods; AI in chemistry

Special Issue Information

Dear Colleagues,

The transformative integration of Artificial Intelligence (AI) and Big Data into chemistry marks a paradigm shift, accelerating discovery, enhancing analysis, and expanding application in ways once deemed unimaginable. This Special Issue seeks to examine how machine learning, deep learning, and data-driven approaches are fundamentally reshaping chemical research—whether through rapid molecular design, accurate reaction prediction, optimized synthesis of drugs and functional materials, or reliable and efficient computations for spectroscopy, kinetics, and dynamics. By leveraging vast and intricate datasets, researchers can now uncover latent patterns, model complex processes, and achieve scientific breakthroughs at an unprecedented pace.

However, alongside these remarkable opportunities come significant challenges. Issues such as data quality and inherent biases, the often opaque “black-box” character of AI models, the crucial need for experimental validation, and the evolving demands on chemical education must be thoughtfully and openly addressed.

It is in this context of both promise and scrutiny that we present this Special Issue. We aim not only to highlight cutting-edge research at the intersection of AI, data science, and chemistry but also to foster meaningful interdisciplinary dialogue. By confronting key challenges and showcasing innovative solutions, this collection aspires to contribute actively to the responsible and impactful advancement of smart, sustainable chemistry for the future.

We warmly invite your valuable contributions.

Prof. Dr. Jun Li
Prof. Dr. Tong Zhu
Prof. Dr. Valentine P. Ananikov
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 250 words) can be sent to the Editorial Office for assessment.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-anonymized peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Chemistry is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • chemoinformatics & data mining
  • AI for chemical theories and computations
  • AI agent for chemical research and education
  • machine learning-driven molecular design
  • reaction prediction & retrosynthesis
  • materials genomics & high-performance computing
  • AI-accelerated catalyst discovery
  • high-throughput experimentation & self-driving laboratories
  • integration of quantum chemistry & AI
  • intelligent analysis of spectroscopic data
  • multiscale modeling & digital twins
  • opportunities and challenges of large chemical models
  • explainable AI (XAI) in chemistry
  • data standardization & sharing ethics in chemistry
  • educating the next generation of chemists

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (6 papers)

Research

16 pages, 2400 KB  
Article
Molecular Dynamics Study on the Mechanism of Coal High-Temperature Pyrolysis Based on Machine Learning Potential
by Menghao Ren, Rongheng Gou, Hanyu Chen, Tian-Min Wu, Shansong Gao, Dao Li, Haisheng Li, Qing Zheng and Yanjun Zhang
Chemistry 2026, 8(6), 75; https://doi.org/10.3390/chemistry8060075 - 1 Jun 2026
Viewed by 230
Abstract
Understanding the atomic-scale mechanisms of coal pyrolysis is essential for efficient coal utilization and carbon-neutral energy strategies, yet conventional computational approaches often struggle to balance the high accuracy of quantum-chemical calculations against the efficiency of reactive force fields. To overcome this limitation, we proposed a multiscale computational framework integrating high-throughput density functional theory (DFT) calculations, ReaxFF-based configuration sampling, YARP reaction enumeration, and DPA3-based machine learning potentials (MLPs). Two coal-specific MLPs, DPA3-coal and DPA3-coal@dftb, were constructed and systematically benchmarked on both small molecular systems and larger C20–30 coal fragments extracted from MD simulations. The DPA3-coal@dftb model demonstrated significantly improved accuracy over ReaxFF in predicting energies and atomic forces while maintaining good transferability. To balance computational efficiency and accuracy in large-scale simulations, the DPA3-coal model was employed to perform accelerated reactive molecular dynamics simulations of a Solomon-type bituminous coal molecule from 1600 to 2600 K. The simulations revealed temperature-dependent evolution of coke, tar, and gas products, including secondary condensation and deep-cracking processes at elevated temperatures. Higher-level DFT calculations further confirmed the thermodynamic consistency of key reaction pathways involving radical formation, H-transfer, recombination, and CO generation, indicating that coal-specific MLPs provide an effective atomistic tool for investigating mechanistic trends in coal pyrolysis.
(This article belongs to the Special Issue AI and Big Data in Chemistry)

16 pages, 3681 KB  
Article
Application of Machine Learning Models for Predicting pIC50 Values of Plasticizers Against Cytochrome P450 Aromatase
by Itumeleng Lucky Mongadi, Nomasonto Rapulenyane, Walter Bonke Mahlangu and Jean-Nazaire Oyourou
Chemistry 2026, 8(5), 68; https://doi.org/10.3390/chemistry8050068 - 20 May 2026
Viewed by 677
Abstract
This study investigated the application of six machine learning regression algorithms (Random Forest, CatBoost, K-Nearest Neighbours, XGBoost, LightGBM, and Gradient Boosting), paired with Molecular ACCess System (MACCS) key fingerprints, for the quantitative prediction of aromatase (CYP19A1) inhibitory potency, expressed as pIC50. A dataset of 187 compounds was assembled from the ChEMBL database (version 33, Target ID: CHEMBL1978), followed by a systematic data curation workflow encompassing duplicate removal, pIC50 transformation, and activity-based filtering. Model performance was rigorously evaluated using an 80/20 stratified train/test split, 5-fold cross-validation, and Y-randomisation testing to ensure unbiased assessment of predictive generalisation. Feature selection via CatBoost permutation importance on the held-out test set identified the top 20 predictive MACCS keys from an initial 166-bit space, substantially reducing dimensionality and improving generalisation across all models. Among the algorithms evaluated, CatBoost trained on the top 20 features achieved the strongest test-set performance (R² = 0.693, RMSE = 0.794, MAE = 0.659) with the most stable cross-validation R² (0.062 ± 0.304), outperforming all other algorithms. Y-randomisation testing returned an empirical p-value of <0.01, confirming that model performance reflects genuine structure–activity relationships rather than statistical chance. Permutation importance and SHAP analysis identified nitrogen-containing heterocyclic fragments (MACCS_41, MACCS_145) and halide-bearing substructures (MACCS_109) as the primary structural determinants of aromatase inhibitory potency, consistent with established CYP19A1 pharmacophoric requirements. Application of the model to ten representative plasticizers demonstrated that the refined applicability domain (h* = 0.423) accommodated eight of the ten compounds, enabling reliable potency predictions across phthalate esters and bisphenol analogues.
These findings establish a transparent and reproducible QSAR framework for first-tier endocrine disruption risk screening of plasticizers and highlight the importance of permutation-based feature selection and applicability domain assessment in QSAR model development.
(This article belongs to the Special Issue AI and Big Data in Chemistry)
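The pIC50 transformation named in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: it assumes IC50 values are reported in nanomolar, and the function name `pic50_from_ic50_nm` is hypothetical. It uses the standard definition pIC50 = -log10(IC50 in mol/L).

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50.

    pIC50 = -log10(IC50 [mol/L]); since 1 nM = 1e-9 mol/L,
    this equals 9 - log10(IC50 [nM]).
    The nanomolar input unit is an assumption of this sketch.
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return 9.0 - math.log10(ic50_nm)


# Hypothetical example value: an IC50 of 1000 nM (1 µM) gives a pIC50 of 6.
example_pic50 = pic50_from_ic50_nm(1000.0)
```

Working on the logarithmic scale means a one-unit increase in pIC50 corresponds to a tenfold increase in potency.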

15 pages, 1462 KB  
Article
Mechanistic Insights into Iron–Sulfur Clusters for Direct Coal Liquefaction: A Combined First-Principles and Machine Learning Study
by Jing Xie, Caoran Li, Shansong Gao, Zhening Chen, Rongheng Gou, Lei Gong, Xiangfeng Yu and Dao Li
Chemistry 2026, 8(5), 66; https://doi.org/10.3390/chemistry8050066 - 18 May 2026
Viewed by 362
Abstract
Direct Coal Liquefaction (DCL) is a promising route for converting abundant coal resources into liquid fuels, yet its efficiency remains strongly dependent on catalyst performance. In this work, we present an integrated computational framework combining density functional theory (DFT) calculations with machine learning (ML) to investigate iron–sulfur (FeS) cluster catalysts for DCL. DFT calculations were employed to examine hydrogen-donor dissociation and coal-derived radical hydrogenation on representative FeS clusters. The results indicate that the most favorable catalytic pathways arise from the cooperation between metallic Fe sites (Fe_2) and interfacial Fe sites adjacent to sulfur (Fe_1), while sulfur atoms mainly play an indirect structural and electronic modulation role. Based on these mechanistic insights, a database containing thermodynamic and kinetic data for 636 reactions across 50 FeS cluster models was constructed. This dataset was then used to train three ML classifiers, among which the Random Forest model showed the best performance, reaching accuracies of 80% for H-donor cleavage and 93% for radical hydrogenation on the held-out test sets. SHapley Additive exPlanations (SHAP) analysis further showed that descriptors associated with Fe active-site identity were among the most influential variables in both tasks. Overall, this work provides a mechanistically informed and interpretable computational framework for understanding FeS-catalyzed DCL chemistry and for the preliminary screening of catalyst motifs within the chemical space covered by the present FeS cluster library.
(This article belongs to the Special Issue AI and Big Data in Chemistry)

27 pages, 1917 KB  
Article
Machine Learning and Approximated Estimation Approaches for Process Design in Drug Synthesis
by Andrea Repetto, Gianguido Ramis and Ilenia Rossetti
Chemistry 2026, 8(3), 32; https://doi.org/10.3390/chemistry8030032 - 3 Mar 2026
Viewed by 1102
Abstract
Continuous-flow technologies are increasingly applied in organic synthesis for the production of active pharmaceutical ingredients (APIs). In-silico process design is a powerful tool able to support organic synthesis in the field of scale-up and process development. Process design feasibility and reliability depend on the availability of a well-defined chemical reaction kinetic scheme, information which is usually derived from purpose-collected experimental datasets. The latter approach is time-consuming and demanding in terms of resources. Different possibilities are here proposed to valorize widely available experimental data from explorative works with different approaches, depending on the nature, richness, and structure of the datasets. The kinetic parameters (i.e., reaction order, kinetic constant, and activation energy) of some interesting organic reactions have been approximately estimated by applying different computational methodologies, thanks to built-in experimental databases. The numerical algebra approach dealing with linear and non-linear regression analysis for the kinetic parameters has been initially considered and related to the database information for oseltamivir synthesis. Bayesian statistics were applied to the ibuprofen case through the application of the Markov Chain Monte Carlo (MCMC) method for reaction order estimation. Finally, a Machine Learning (ML) approach has been applied to the Rolipram and Pregabalin case study. The in-house developed T-ReX experimental kinetic constant database was exploited, with application of the k-nearest neighbor algorithm for classification and regular expression pattern recognition. Advantages and limitations of the three approaches are discussed.
(This article belongs to the Special Issue AI and Big Data in Chemistry)
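As an illustration of the linear-regression route to kinetic parameters mentioned in the abstract above, the sketch below fits the linearized Arrhenius equation, ln k = ln A - Ea/(R·T), by ordinary least squares. It is a generic textbook example, not the authors' workflow: the function name `fit_arrhenius` and the synthetic temperatures, pre-exponential factor, and activation energy are all hypothetical.

```python
import math

R_GAS = 8.314462618  # molar gas constant, J/(mol*K)


def fit_arrhenius(temps_k, rate_constants):
    """Least-squares fit of ln k = ln A - Ea / (R * T).

    Returns (A, Ea): A in the units of k, Ea in J/mol.
    """
    if len(temps_k) != len(rate_constants) or len(temps_k) < 2:
        raise ValueError("need at least two (T, k) pairs of equal length")
    xs = [1.0 / t for t in temps_k]            # regressor: 1/T
    ys = [math.log(k) for k in rate_constants]  # response: ln k
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx                           # equals -Ea / R
    intercept = y_mean - slope * x_mean         # equals ln A
    return math.exp(intercept), -slope * R_GAS


# Synthetic, noise-free data generated from assumed parameters (hypothetical values).
true_a, true_ea = 1.0e7, 50_000.0
temps = [300.0, 320.0, 340.0, 360.0]
ks = [true_a * math.exp(-true_ea / (R_GAS * t)) for t in temps]

a_fit, ea_fit = fit_arrhenius(temps, ks)
```

With real, noisy data the same fit gives only approximate parameters, and a non-linear fit of k(T) directly may weight the points differently from this log-linear form.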

12 pages, 2260 KB  
Article
PDCG: A Diffusion Model Guided by Pre-Training for Molecular Conformation Generation
by Yanchen Liu, Yameng Zheng, Amina Tariq, Xiaofei Nan, Lingbo Qu and Jinshuai Song
Chemistry 2026, 8(2), 29; https://doi.org/10.3390/chemistry8020029 - 18 Feb 2026
Viewed by 1305
Abstract
Background: While machine learning has advanced molecular conformation generation, existing models often suffer from limited generalization and inaccuracies, especially for complex molecular structures. These limitations hinder their reliability in downstream applications. Methods: We propose a molecular conformation generation model (PDCG) that combines a molecular graph pre-training module with a diffusion model. Feature embeddings are obtained from a pre-trained model and concatenated with the molecular graph information. The fused features are used to generate conformations in the model. The model was trained and evaluated on the GEOM-QM9 and GEOM-Drugs datasets. Results: PDCG significantly outperforms existing baselines. Furthermore, in downstream molecular property prediction tasks, conformations generated by PDCG yield results comparable to those derived from DFT-optimized geometries. Conclusions: Our work provides a robust and generalizable model for accurate conformation generation. PDCG offers a reliable tool for downstream computational tasks, such as the virtual screening of functional materials and drug-like molecules.
(This article belongs to the Special Issue AI and Big Data in Chemistry)

Review

50 pages, 529 KB  
Review
Machine Learning and Deep Learning Application in Cholinesterase Research Area
by Nikola Maraković
Chemistry 2026, 8(5), 67; https://doi.org/10.3390/chemistry8050067 - 19 May 2026
Viewed by 544
Abstract
As key therapeutic targets for symptomatic treatment of Alzheimer’s disease (AD) according to the cholinergic hypothesis, acetylcholinesterase (AChE; EC 3.1.1.7) and butyrylcholinesterase (BChE; EC 3.1.1.8) have been the subject of numerous studies over decades, leading to large collections of different ligands with corresponding AChE and BChE activity. This vast amount of data provides an ideal basis for the implementation of different machine learning (ML) and deep learning (DL) tools in different steps of the drug discovery process. Mainly applied to identify potential strong inhibitors of AChE and to a lesser extent BChE, many quantitative structure–activity relationship (QSAR) models and other predictive tools have been constructed utilizing different ML algorithms and DL techniques with varying success depending on the input data and specific context. Here, we provide an extensive overview of such cases reported in the literature in recent years.
(This article belongs to the Special Issue AI and Big Data in Chemistry)
