Deep Learning-Driven Integration of Multimodal Data for Material Property Predictions
Abstract
1. Introduction
- Symmetry-Aware Multimodal Framework: We develop a tailored methodology that integrates text, image, and tabular modalities, using symmetry-resolved crystallographic data to enhance predictions of properties governed by spatial invariants.
- Comprehensive Multimodal Dataset: Utilizing the Alexandria database, we construct a dataset of 10,000 materials with aligned textual, tabular, and image-based representations, enabling systematic evaluation of modality interactions.
- Enhanced Predictive Performance: Through hybrid fusion and modality-specific encoders, the framework achieves superior accuracy for symmetry-dependent properties, validated by scaled error metrics (MAE Scaled, RMSE Scaled).
2. Related Work
2.1. Unimodal Models
2.2. Multimodal Models
3. Methodology
3.1. Alexandria Database
3.2. Dataset Creation and Multimodal Representation
3.2.1. Generating the Image Modality for Multimodal Learning
3.2.2. Standardizing Structural Data for Tabular Representation
3.2.3. Generating the Text Modality for Multimodal Learning
3.2.4. Target Feature Selection
- Gap (eV): The band gap, measured in electron volts, serves as a key indicator of a material’s electronic behavior, distinguishing conductors, insulators, and semiconductors. These values were directly retrieved from band_gap_ind.
- Eform/atom (eV/atom): The formation energy per atom, in electron volts, reflects the thermodynamic stability of the material and was sourced directly from e_form.
- Ehull/atom (eV/atom): Energy above the convex hull per atom, measured in electron volts, indicates the material’s stability relative to potential phase separation. These values were obtained from e_above_hull.
- Etot/atom (eV/atom): The total energy per atom, in electron volts, reflects the overall stability and binding energy of the atomic configuration. It was calculated by dividing energy_total by the number of atomic sites (nsites).
- Mag/vol (μB/Å³): The magnetic moment per unit volume, measured in Bohr magnetons per cubic angstrom, provides a normalized measure of material magnetism. This value was computed by dividing total_mag by volume.
- Vol/atom (Å³/atom): The volume per atom, in cubic angstroms, offers insight into atomic packing density and was derived by dividing volume by nsites.
- DOS/atom (states/(eV atom)): The density of electronic states per atom at the Fermi level, a key measure of electronic and conductive properties, was computed by dividing dos_ef by nsites.
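
The derivations in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the field names (band_gap_ind, e_form, e_above_hull, energy_total, total_mag, volume, dos_ef, nsites) are taken from the text, while the function name, the output keys, and the example record are hypothetical.

```python
from typing import Dict


def derive_targets(entry: Dict[str, float]) -> Dict[str, float]:
    """Map one raw database record to the seven target properties.

    Output key names are illustrative; only the input field names
    come from the text.
    """
    nsites = entry["nsites"]
    volume = entry["volume"]
    if nsites <= 0 or volume <= 0:
        raise ValueError("nsites and volume must be positive")
    return {
        # Taken directly from the record.
        "gap": entry["band_gap_ind"],
        "eform_per_atom": entry["e_form"],
        "ehull_per_atom": entry["e_above_hull"],
        # Normalized by the number of atomic sites or by the cell volume.
        "etot_per_atom": entry["energy_total"] / nsites,
        "mag_per_vol": entry["total_mag"] / volume,
        "vol_per_atom": volume / nsites,
        "dos_per_atom": entry["dos_ef"] / nsites,
    }


# Hypothetical record, for illustration only (not a real database entry).
example = {
    "band_gap_ind": 1.5,
    "e_form": -0.5,
    "e_above_hull": 0.0,
    "energy_total": -20.0,
    "total_mag": 2.0,
    "volume": 80.0,
    "dos_ef": 1.0,
    "nsites": 4,
}
targets = derive_targets(example)
```

Normalizing by nsites or volume makes the targets comparable across unit cells of different size, which is why only the three directly retrieved fields are left unchanged.
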
3.3. Multimodal Training Pipeline
3.3.1. Tabular Modality
3.3.2. Text Modality
3.3.3. Image Modality
3.3.4. Fusion Model
3.3.5. Training
4. Results and Analysis
4.1. Error Metrics
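
As a rough illustration of the metrics named in this section, the sketch below computes MAE and RMSE and one possible scaled variant. The normalization is an assumption: each error is divided by the error of a constant predictor that always returns the mean of the targets. The exact scaling used for MAE Scaled and RMSE Scaled is not stated in this text, and all function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def scaled_errors(y_true: Sequence[float], y_pred: Sequence[float]) -> Tuple[float, float]:
    """Return (scaled MAE, scaled RMSE).

    ASSUMPTION: each metric is divided by the same metric for a constant
    predictor that always outputs the mean of y_true.
    """
    mean = sum(y_true) / len(y_true)
    baseline = [mean] * len(y_true)
    base_mae = mae(y_true, baseline)
    base_rmse = rmse(y_true, baseline)
    if base_mae == 0.0 or base_rmse == 0.0:
        raise ValueError("targets are constant; scaled errors are undefined")
    return mae(y_true, y_pred) / base_mae, rmse(y_true, y_pred) / base_rmse


# Toy values, for illustration only.
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.5, 1.0, 2.0, 2.5]
mae_scaled, rmse_scaled = scaled_errors(y_true, y_pred)
```

Under this assumption a scaled value below 1 means the model beats the mean predictor, which makes targets with different units and ranges comparable in one table.
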
4.2. MAE and RMSE Results
4.3. MAE Scaled and RMSE Scaled Results
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A

[Figure: image grid, 3 × 5 panels; recoverable panel labels: CaLaIrBr, GaNiHgC, MgFeRe, PtSeCl]

| Hyperparameter | Range | Parameter Type |
|---|---|---|
| learning rate | | Continuous (log) |
| optimizer | {AdamW, SGD} | Discrete |
| maximum epochs | | Discrete |
| batch size | | Discrete |

| Parameter | Value |
|---|---|
| weight decay | 0.001 |
| learning rate decay strategy | layerwise_decay |
| learning rate decay | 0.9 |
| learning rate scheduler | cosine |
| warmup steps | 0.1 |
| validation patience | 10 |
| validation check interval | 0.5 |
| loss function | MSE |
| precision | 16-mixed |
| feature pooling mode | concat |
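
To make three of the settings in the table above concrete, here is a minimal sketch of a cosine learning-rate schedule with warmup and of layerwise learning-rate decay. It rests on assumptions not confirmed by the text: the warmup value 0.1 is read as a fraction of total training steps, and the decay value 0.9 is read as a per-layer multiplier applied from the output layer back towards the input. Function names are hypothetical and do not correspond to any specific library.

```python
import math
from typing import List


def cosine_with_warmup(step: int, total_steps: int, base_lr: float,
                       warmup_fraction: float = 0.1) -> float:
    """Linear warmup followed by cosine decay to zero.

    ASSUMPTION: warmup_fraction is the share of total_steps spent warming up.
    """
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp up linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def layerwise_lr(base_lr: float, num_layers: int, decay: float = 0.9) -> List[float]:
    """Per-layer learning rates, index 0 being the layer closest to the input.

    ASSUMPTION: the last layer keeps base_lr and each earlier layer is
    multiplied by `decay` once more.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


peak = cosine_with_warmup(step=9, total_steps=100, base_lr=1e-3)
final = cosine_with_warmup(step=100, total_steps=100, base_lr=1e-3)
rates = layerwise_lr(base_lr=1e-3, num_layers=3)
```

The intent of layerwise decay is that pretrained lower layers are updated more gently than the newly added layers near the output, while the cosine schedule lowers all rates smoothly over training.
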

| Modalities | Gap | Eform/Atom | Ehull/Atom | Etot/Atom | Mag/Vol | Vol/Atom | DOS/Atom |
|---|---|---|---|---|---|---|---|
| Tabular | | | | | | | |
| Images | | | | | | | |
| Text | | | | | | | |
| Tabular + Images | | | | | | | |
| Tabular + Text | | | | | | | |
| Images + Text | | | | | | | |
| Tabular + Images + Text | | | | | | | |

| Modalities | Gap | Eform/Atom | Ehull/Atom | Etot/Atom | Mag/Vol | Vol/Atom | DOS/Atom |
|---|---|---|---|---|---|---|---|
| Tabular | | | | | | | |
| Images | | | | | | | |
| Text | | | | | | | |
| Tabular + Images | | | | | | | |
| Tabular + Text | | | | | | | |
| Images + Text | | | | | | | |
| Tabular + Images + Text | | | | | | | |

| Modalities | Gap | Eform/Atom | Ehull/Atom | Etot/Atom | Mag/Vol | Vol/Atom | DOS/Atom |
|---|---|---|---|---|---|---|---|
| Tabular | | | | | | | |
| Images | | | | | | | |
| Text | | | | | | | |
| Tabular + Images | | | | | | | |
| Tabular + Text | | | | | | | |
| Images + Text | | | | | | | |
| Tabular + Images + Text | | | | | | | |

| Modalities | Gap | Eform/Atom | Ehull/Atom | Etot/Atom | Mag/Vol | Vol/Atom | DOS/Atom |
|---|---|---|---|---|---|---|---|
| Tabular | | | | | | | |
| Images | | | | | | | |
| Text | | | | | | | |
| Tabular + Images | | | | | | | |
| Tabular + Text | | | | | | | |
| Images + Text | | | | | | | |
| Tabular + Images + Text | | | | | | | |

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Costa, V.; Oliveira, J.M.; Ramos, P. Deep Learning-Driven Integration of Multimodal Data for Material Property Predictions. Computation 2025, 13, 282. https://doi.org/10.3390/computation13120282