A Multi-View Fusion Data-Augmented Method for Predicting BODIPY Dye Spectra
Abstract
1. Introduction
- A multi-view feature fusion strategy combines molecular fingerprints and descriptors to extract features from different views, and a pre-trained regression model predicts the molecular energy gap, which is incorporated into the feature engineering module as an electronic-structure feature;
- Data augmentation is applied exclusively to the training set, which enlarges the training data and improves the model's generalization capability;
- Experiments validate the effectiveness of the proposed strategies and offer practical insights for molecular property prediction.
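The fusion described in the first contribution can be pictured with a minimal sketch. It assumes plain concatenation of the views; the fusion operator actually used in the paper may differ, and the function name `fuse_views` and all numeric values are hypothetical.

```python
from typing import List, Sequence


def fuse_views(fingerprint_bits: Sequence[int],
               descriptors: Sequence[float],
               energy_gap: float) -> List[float]:
    """Concatenate three views of one molecule into a flat feature vector.

    Assumption: simple concatenation stands in for the fusion step.
    """
    fused = [float(b) for b in fingerprint_bits]   # structural view
    fused += [float(d) for d in descriptors]       # physicochemical view
    fused.append(float(energy_gap))                # electronic-structure view
    return fused


# Toy inputs for illustration only; they are not taken from the dataset.
fp = [1, 0, 0, 1]      # stand-in for a much longer fingerprint bit vector
desc = [312.4, 2.7]    # stand-in for a set of descriptor values
gap = 2.35             # stand-in for an energy gap from a pre-trained regressor

features = fuse_views(fp, desc, gap)
print(len(features))
```

In a sketch like this the energy gap enters as one extra column, so a downstream regressor sees it alongside the fingerprint and descriptor columns.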
2. Methods
2.1. Problem Definition
- S is the SMILES sequence of a BODIPY molecule;
- λ is the corresponding spectral property, either the absorption peak wavelength or the emission peak wavelength.
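One hedged way to write this task as a regression problem is given below. Only S and λ come from the definitions above; the feature map φ, the model f with parameters θ, the loss ℓ, and the sample count N are notation assumed here for illustration.

```latex
\hat{\lambda} = f_{\theta}\bigl(\phi(S)\bigr),
\qquad
\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N}
  \ell\bigl(f_{\theta}(\phi(S_i)),\, \lambda_i\bigr)
```

Under this reading, separate models would be fitted for the absorption and emission targets.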
2.2. Framework of the Method
2.3. Molecular Feature Engineering Module
2.3.1. Acquisition and Complementarity of Molecular Descriptors and Fingerprints
2.3.2. Incorporation of HOMO/LUMO
2.3.3. Multi-View Data Fusion
2.4. Data Augmentation Module
Algorithm 1. Data Augmentation Module
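The following sketch illustrates the train-only augmentation principle stated in the contributions: the data are split first, and only the training portion is expanded. It is not a reconstruction of Algorithm 1. The variant generator `placeholder_variant` is a hypothetical stand-in that merely tags the string, where a real pipeline would produce an equivalent alternative representation of the same molecule.

```python
import random
from typing import Callable, List, Tuple

Sample = Tuple[str, float]  # (molecule string, target wavelength)


def augment_training_set(train: List[Sample],
                         make_variant: Callable[[str, int], str],
                         n_variants: int) -> List[Sample]:
    """Return the originals plus n_variants copies of each, keeping labels."""
    augmented = list(train)
    for mol, target in train:
        for k in range(n_variants):
            augmented.append((make_variant(mol, k), target))
    return augmented


def placeholder_variant(mol: str, k: int) -> str:
    # Hypothetical stand-in: tags the string instead of rewriting it.
    return f"{mol}#v{k + 1}"


# Toy data; identifiers and wavelengths are invented for illustration.
data = [(f"MOL{i}", 500.0 + i) for i in range(10)]
rng = random.Random(0)
rng.shuffle(data)

# Split BEFORE augmenting so that no variant of a test molecule is trained on.
train, test = data[:9], data[9:]
train_aug = augment_training_set(train, placeholder_variant, n_variants=2)
print(len(train_aug), len(test))
```

Splitting before augmenting is the design choice that matters: augmenting first would let variants of a test molecule leak into training and inflate the reported scores.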
2.5. Spectral Prediction Module
Algorithm 2. Spectral Prediction Module
2.5.1. Multilayer Perceptron
2.5.2. Convolutional Neural Network
2.5.3. Random Forests
2.5.4. Gradient Boosting Regression Tree
2.5.5. Extreme Gradient Boosting
3. Experimental Analysis
3.1. Experimental Settings
3.2. Model Evaluation
3.3. Ablation Study
3.4. Discussion
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Augmented Variants | Training Samples | Test Samples |
---|---|---|
variant = 0 | 1141 | 127 |
variant = 2 | 3423 | 127 |
variant = 3 | 4564 | 127 |
variant = 4 | 5705 | 127 |
variant = 5 | 6846 | 127 |
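The training-set sizes in the table above follow a simple pattern: each row equals 1141 × (variants + 1), while the test set stays at 127. The short check below verifies that arithmetic, reading the first row as the un-augmented case with zero variants (an assumption); the function name is hypothetical.

```python
BASE_TRAIN = 1141   # training samples without augmentation (from the table)
TEST = 127          # test samples, identical in every row (from the table)


def expected_train_size(variants: int) -> int:
    """Originals plus `variants` augmented copies of each training sample."""
    return BASE_TRAIN * (variants + 1)


# Row values as reported in the table.
reported = {0: 1141, 2: 3423, 3: 4564, 4: 5705, 5: 6846}
consistent = all(expected_train_size(v) == n for v, n in reported.items())
print(consistent)
```

The constant test size is consistent with augmentation being applied to the training split only.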
Model | Absorption MAE | Absorption RMSE | Absorption R² | Emission MAE | Emission RMSE | Emission R² |
---|---|---|---|---|---|---|
MLP | 15.18 ± 2.63 | 23.28 ± 4.02 | 0.898 ± 0.032 | 19.34 ± 1.70 | 29.42 ± 3.01 | 0.837 ± 0.034 |
Ours-MLP | 13.11 ± 2.02 | 22.11 ± 3.91 | 0.908 ± 0.030 | 18.09 ± 1.73 | 28.95 ± 2.62 | 0.842 ± 0.030 |
CNN | 16.47 ± 4.35 | 24.57 ± 5.23 | 0.884 ± 0.049 | 20.54 ± 3.28 | 29.82 ± 2.27 | 0.833 ± 0.025 |
Ours-CNN | 14.56 ± 2.31 | 23.46 ± 3.96 | 0.896 ± 0.031 | 18.64 ± 1.46 | 29.48 ± 3.10 | 0.836 ± 0.035 |
RF | 17.42 ± 1.74 | 26.27 ± 2.41 | 0.872 ± 0.024 | 21.21 ± 1.86 | 30.50 ± 3.18 | 0.824 ± 0.038 |
Ours-RF | 16.23 ± 1.89 | 26.43 ± 3.07 | 0.870 ± 0.031 | 19.84 ± 1.68 | 30.44 ± 3.08 | 0.825 ± 0.037 |
GBRT | 17.96 ± 1.59 | 26.17 ± 2.53 | 0.873 ± 0.025 | 21.76 ± 1.57 | 30.53 ± 2.57 | 0.824 ± 0.031 |
Ours-GBRT | 17.44 ± 1.84 | 25.70 ± 2.86 | 0.877 ± 0.027 | 21.40 ± 1.48 | 30.34 ± 2.48 | 0.827 ± 0.029 |
XGBoost | 16.93 ± 2.41 | 26.11 ± 3.20 | 0.874 ± 0.030 | 20.53 ± 1.76 | 29.82 ± 2.53 | 0.830 ± 0.028 |
Ours-XGBoost | 16.34 ± 2.05 | 25.81 ± 3.56 | 0.876 ± 0.031 | 19.77 ± 1.45 | 29.56 ± 2.64 | 0.836 ± 0.031 |
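For reference, the three metrics reported above (MAE, RMSE, and R²) can be computed as in the sketch below, using their standard definitions. The wavelengths are toy values invented for illustration, and the averaging across repeated runs behind the ± figures is not reproduced here.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy peak wavelengths, not taken from the paper.
y_true = [500.0, 520.0, 540.0, 560.0]
y_pred = [505.0, 515.0, 545.0, 550.0]

print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

Because RMSE squares the residuals, a few large misses raise it more than they raise MAE, which is one reason the two columns can rank models differently.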
Model | ESOL RMSE | FreeSolv RMSE | Lipo RMSE |
---|---|---|---|
GCN | 1.43 ± 0.05 | 2.87 ± 0.14 | 0.85 ± 0.08 |
GIN | 1.45 ± 0.02 | 2.76 ± 0.18 | 0.85 ± 0.07 |
N-Gram | 1.10 ± 0.03 | 2.51 ± 0.19 | 0.88 ± 0.12 |
MolCLR | 1.11 ± 0.01 | 2.20 ± 0.20 | 0.65 ± 0.08 |
DGCL | 1.01 ± 0.02 | 1.91 ± 0.25 | |
MDFCL | 1.05 ± 0.05 | 2.44 ± 0.23 | 0.68 ± 0.03 |
Ours-MLP | 0.68 ± 0.03 |
Model | w/o Energy Gap (Abs) | w/o Energy Gap (Emi) | w/o Augment (Abs) | w/o Augment (Emi) | Baseline Model (Abs) | Baseline Model (Emi) | Full Model (Abs) | Full Model (Emi) |
---|---|---|---|---|---|---|---|---|
MLP | 13.86 ± 2.32 | 18.24 ± 1.73 | 14.71 ± 2.43 | 19.32 ± 1.40 | 15.18 ± 2.63 | 19.34 ± 1.70 | 13.11 ± 2.02 | 18.09 ± 1.73 |
CNN | 14.89 ± 1.74 | 19.20 ± 1.32 | 15.04 ± 2.00 | 20.11 ± 1.73 | 16.47 ± 4.35 | 20.54 ± 3.28 | 14.56 ± 2.31 | 18.64 ± 1.46 |
RF | 16.03 ± 1.84 | 19.87 ± 1.71 | 17.51 ± 1.79 | 21.30 ± 1.78 | 17.42 ± 1.74 | 21.21 ± 1.86 | 16.23 ± 1.89 | 19.84 ± 1.68 |
GBRT | 17.82 ± 1.93 | 21.30 ± 1.78 | 17.81 ± 1.79 | 21.60 ± 1.61 | 17.96 ± 1.59 | 21.76 ± 1.57 | 17.44 ± 1.84 | 21.40 ± 1.48 |
XGBoost | 16.31 ± 2.06 | 19.54 ± 1.26 | 15.91 ± 2.09 | 19.31 ± 1.63 | 16.34 ± 2.05 | 19.77 ± 1.45 | 15.82 ± 1.85 | 19.26 ± 1.50 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Yang, X.; Li, X.; Zhao, Q. A Multi-View Fusion Data-Augmented Method for Predicting BODIPY Dye Spectra. Mathematics 2025, 13, 2947. https://doi.org/10.3390/math13182947