Deep Learning-Driven Integration of Multimodal Data for Material Property Predictions

Costa, Vítor; Oliveira, José Manuel; Ramos, Patrícia

doi:10.3390/computation13120282


by Vítor Costa ^1,†, José Manuel Oliveira ^2,3,† and Patrícia Ramos ^2,4,*,†

¹ ISCAP, Polytechnic of Porto, Rua Jaime Lopes Amorim s/n, 4465-004 São Mamede de Infesta, Portugal

² Institute for Systems and Computer Engineering, Technology and Science, Campus da FEUP, Rua Dr. Roberto Frias, 4200-465 Porto, Portugal

³ Faculty of Economics, University of Porto, Rua Dr. Roberto Frias, 4200-464 Porto, Portugal

⁴ CEOS.PP, ISCAP, Polytechnic of Porto, Rua Jaime Lopes Amorim s/n, 4465-004 São Mamede de Infesta, Portugal

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Computation 2025, 13(12), 282; https://doi.org/10.3390/computation13120282

Submission received: 17 September 2025 / Revised: 21 October 2025 / Accepted: 17 November 2025 / Published: 1 December 2025


Abstract

Advancements in deep learning have revolutionized materials discovery by enabling predictive modeling of complex material properties. However, single-modal approaches often fail to capture the intricate interplay of compositional, structural, and morphological characteristics. This study introduces a novel multimodal deep learning framework for enhanced material property prediction, integrating textual (chemical compositions), tabular (structural descriptors), and image-based (2D crystal structure visualizations) modalities. Utilizing the Alexandria database, we construct a comprehensive multimodal dataset of 10,000 materials with symmetry-resolved crystallographic data. Specialized neural architectures (an FT-Transformer for tabular data, a Hugging Face Electra-based model for text, and a TIMM-based MetaFormer for images) generate modality-specific embeddings, which are fused through a hybrid strategy into a unified latent space. The framework predicts seven critical material properties spanning electronic (band gap, density of states), thermodynamic (formation energy, energy above hull, total energy), magnetic (magnetic moment per volume), and volumetric (volume per atom) features, many governed by crystallographic symmetry. Experimental results demonstrate that multimodal fusion significantly outperforms unimodal baselines. Notably, the bimodal integration of image and text data showed significant gains, reducing the Mean Absolute Error for band gap by approximately 22.7% and for volume per atom by 22.4% compared to the average of the unimodal models. This combination also achieved a 28.4% reduction in Root Mean Squared Error for formation energy. The full trimodal model (tabular + images + text) yielded competitive, and in several cases the lowest, error metrics, particularly for band gap, magnetic moment per volume, and density of states per atom, confirming the value of integrating all three modalities. This scalable, modular framework advances materials informatics, offering a powerful tool for data-driven materials discovery and design.

Keywords:

multimodal deep learning; foundation models; property prediction; materials science

1. Introduction

Materials science seeks to unravel the intricate relationships between composition, structure, and properties to design materials with tailored functionalities for applications ranging from energy storage to advanced electronics [1]. The complexity of these relationships, often governed by crystallographic symmetry, poses significant challenges for predictive modeling [2]. Symmetry principles dictate critical material behaviors, such as electronic band structures, mechanical anisotropy, and phase stability, by imposing physical invariants on atomic arrangements and interactions [3]. Traditional materials discovery methods, reliant on experimental synthesis or computationally intensive simulations, are resource-heavy and struggle to scale with the growing volume of material data. Machine learning has emerged as a transformative tool [4,5,6,7,8,9,10], leveraging vast datasets to predict material properties efficiently [11]. However, single-modality models, whether based on chemical compositions, structural descriptors, or imaging, often fail to capture the multifaceted interplay of factors influencing symmetry-dependent properties [12].

Multimodal deep learning offers a promising solution by integrating diverse data types to form holistic representations of materials [13,14,15]. This study introduces a novel multimodal framework that fuses textual (chemical compositions), tabular (structural descriptors), and image-based (2D visualizations of 3D crystal structures) modalities to enhance the prediction of material properties governed by crystallographic symmetry. By leveraging symmetry-resolved crystallographic data from the Alexandria database [16], the framework explicitly encodes spatial invariants through image-based structural representations and tabular descriptors, such as lattice parameters, atomic coordinates and symmetry information. These modalities are processed using specialized architectures: an Electra-based transformer for text, an FT-Transformer for tabular data, and a TIMM-based MetaFormer for images. A hybrid fusion strategy, utilizing linear transformations, concatenation, and a multi-layer perceptron with layer normalization, integrates these representations into a shared latent space, enabling the model to capture complementary insights into symmetry-driven phenomena.

The significance of this approach lies in its ability to exploit symmetry invariants to improve predictive accuracy for properties like band gap, formation energy, and magnetic moment, which are inherently tied to crystallographic constraints. For instance, image-based modalities encode visual patterns of atomic arrangements, reflecting symmetry elements that influence electronic and magnetic properties. Textual data provides precise compositional information, while tabular data quantifies structural metrics, such as unit cell volume and lattice parameters, that embed symmetry-related features. By systematically evaluating combinations of these modalities, this study demonstrates that multimodal fusion, particularly the integration of image and text data, and the integration of image, text and tabular data, outperforms unimodal baselines, achieving lower error metrics (MAE, RMSE) across diverse properties. This addresses a critical gap in materials informatics, where the role of modality interactions in capturing symmetry-dependent behaviors remains underexplored [17].

The novelty of this work stems from its integration of crystallographic symmetry into multimodal learning. Unlike prior studies that treat modalities independently or focus on generic data fusion [2,18], this framework explicitly leverages symmetry-aware representations to model physical constraints. It employs advanced techniques, including graph convolutional networks for structural embeddings and a hybrid fusion architecture with concatenation and multi-layer perceptron processing for modality fusion, to ensure that symmetry-driven relationships are preserved in the latent space. The resulting framework is scalable, modular, and adaptable to large-scale datasets, offering a robust tool for data-driven materials discovery.

This study’s key contributions are threefold:

Symmetry-Aware Multimodal Framework: We develop a tailored methodology that integrates text, image, and tabular modalities, using symmetry-resolved crystallographic data to enhance predictions of properties governed by spatial invariants.
Comprehensive Multimodal Dataset: Utilizing the Alexandria database, we construct a dataset of 10,000 materials with aligned textual, tabular, and image-based representations, enabling systematic evaluation of modality interactions.
Enhanced Predictive Performance: Through hybrid fusion and modality-specific encoders, the framework achieves superior accuracy for symmetry-dependent properties, validated by scaled error metrics (MAE Scaled, RMSE Scaled).

Although the present study focuses on inorganic crystalline materials, the proposed multimodal deep learning framework is inherently generalizable and readily extensible to other complex systems characterized by heterogeneous data sources. The core challenge addressed, learning robust representations from the joint integration of textual, tabular, and image-based modalities, appears in numerous computational science domains where composition, structure, and visual or spatial features jointly determine emergent properties. Examples include the predictive modeling of polymers and amorphous materials, molecular crystals and pharmaceutical compounds, nanomaterials with explicit morphological descriptors, and even non-materials problems involving multimodal inputs.

The paper is organized as follows: Section 2 reviews related work, comparing unimodal and multimodal approaches in materials informatics to contextualize our contributions. Section 3 describes the methodology, detailing the construction of the multimodal dataset from the Alexandria database and the proposed training pipeline. Section 4 presents experimental results, evaluating the performance of various modality combinations for predicting symmetry-dependent material properties. Finally, Section 5 discusses the implications of our findings, highlighting opportunities for future research in symmetry-driven materials discovery.

2. Related Work

The evolution from unimodal to multimodal deep learning models in materials science reflects a shift toward capturing the multifaceted nature of material properties. Unimodal models, leveraging single data types like composition or structure, have laid the groundwork for predictive modeling, while multimodal approaches integrate diverse data sources to enhance accuracy and generalizability. This section reviews key contributions in both paradigms, distilling insights into model architectures, representation strategies, and fusion techniques that inform the development of integrated, data-driven methodologies for materials property prediction.

2.1. Unimodal Models

Single-modality machine learning models in materials science have historically provided the foundation for predictive modeling in the field. These models are trained on a single type of data, such as chemical composition, crystal structure, or imaging, and have driven important scientific advances by enabling accurate, scalable, and interpretable predictions of material properties. Their utility lies in their simplicity, computational efficiency, and the direct link they establish between input representations and output targets. By focusing on a single modality, these models can capitalize on large volumes of modality-specific data, refine tailored architectures, and explore theoretical principles deeply rooted in the physics and chemistry of materials. Over the last decade, each class of unimodal models has evolved to solve increasingly complex tasks, enabling novel material discovery pipelines and expanding the understanding of structure–property relationships.

Composition-based models utilize chemical formulae as their sole input to predict properties such as thermal conductivity, formation enthalpy, band gap, hardness, and elasticity. These models are especially valuable in high-throughput computational screening, where full structural data is not always available, and experimental measurements are limited [17]. Early models often relied on handcrafted descriptors or statistical features extracted from periodic table properties, such as electronegativity, ionic radii, or oxidation states. While useful, such features required expert knowledge and were limited in their ability to model complex interactions. The introduction of deep learning, and more specifically transformer-based models [19,20,21], has radically improved the capacity to learn rich representations directly from raw compositional data. One of the most impactful advances was the adaptation of NLP architectures like BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. [22], to the chemical domain. BERT’s success in language tasks stems from its ability to learn context-aware representations by jointly processing sequences in both directions. This property was leveraged in MatBERT [23], a domain-specific variant that treats chemical formulas as sequences, enabling the model to infer latent relationships between elements and capture compositional motifs associated with specific material behaviors. In parallel, CrabNet has emerged as a composition-only model that bridges the gap between traditional feature engineering and modern representation learning [12]. Inspired by self-attention mechanisms [24,25], CrabNet treats compositions as sequences of atoms, where each element is encoded individually and allowed to exchange contextual information through attention layers. 
This approach captures the subtle dependencies among elements and their roles in compound stability and functionality, even in complex or imbalanced compositions. Its featurization pipeline avoids handcrafted descriptors, using elemental embeddings and stoichiometric coefficients to preserve both atomic identity and proportion. Notably, CrabNet has demonstrated competitive performance on benchmark datasets, proving particularly robust when applied to materials without known structures, making it a powerful tool in exploratory discovery and screening. Despite their strengths, composition-based models inherently lack spatial awareness. They are unable to account for atomic arrangements, coordination environments, or long-range interactions, all of which are crucial for understanding phenomena such as mechanical stress distribution, phonon propagation, or electronic band dispersion. In such cases, structure-aware models are necessary.
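To make the composition-as-sequence idea concrete, the following minimal sketch converts a flat chemical formula into ordered (element, fractional amount) pairs, the kind of input a composition-only model consumes. The function name `parse_formula` and the regular expression are illustrative assumptions and are not taken from CrabNet or MatBERT; the parser handles only flat formulas without parentheses.

```python
import re
from collections import OrderedDict

# Element symbol followed by an optional integer or decimal amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(formula):
    """Map each element in a flat formula to its fractional amount.

    Illustrative helper (not part of any cited model). Parentheses are
    not supported, so "Ba(SrPd)2" must be written as "BaSr2Pd2".
    """
    counts = OrderedDict()
    for symbol, amount in _TOKEN.findall(formula):
        # A missing amount means a stoichiometric coefficient of 1.
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no elements found in formula: %r" % formula)
    # Normalize so that the fractions sum to 1, preserving element order.
    return OrderedDict((el, n / total) for el, n in counts.items())


if __name__ == "__main__":
    print(parse_formula("BaSr2Pd2"))
```

Each element keeps its identity (the key) and its proportion (the value), which mirrors the pairing of elemental embeddings with stoichiometric coefficients described above.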

To overcome the limitations of purely compositional approaches, structure-based models have been developed to incorporate spatial information. Graph Neural Networks (GNNs) are especially well-suited to this task, as they naturally represent crystalline materials as graphs: atoms are treated as nodes, and bonds or interatomic distances are treated as edges [2,18]. This representation allows GNNs to encode local atomic environments, symmetry, and connectivity, enabling the model to learn how structure influences properties such as thermal stability, dielectric constants, or magnetism. The Crystal Graph Convolutional Neural Network (CGCNN) was one of the earliest models to operationalize this idea in materials science [1]. CGCNN constructs a crystal graph based on atom positions and bonding topologies, and then applies graph convolutions to propagate information through the lattice. This allows the model to capture both local coordination and global symmetry patterns. CGCNN’s architecture has proven effective in capturing crystal-specific descriptors and learning rich structural features without the need for handcrafted inputs. Building on this foundation, PotNet introduced a more physically grounded framework by integrating interatomic potentials directly into the graph representation [3]. Instead of constructing edges based on predefined bond distances, PotNet forms a fully connected graph, where edge weights are computed from calculated pairwise potentials. This representation captures long-range forces and subtle interactions that are often missed in traditional models, making it particularly powerful for predicting properties like phonon dispersion, mechanical stress response, or conductivity under strain. The message-passing scheme employed in PotNet dynamically updates node features based on both energetic and spatial cues, effectively learning embeddings that reflect the underlying physics of the material. 
Overall, GNNs have established themselves as the state-of-the-art in structure-aware learning. Their modularity and interpretability, along with their compatibility with diverse crystallographic formats (e.g., the Crystallographic Information File (CIF) format), make them indispensable in both predictive modeling and material representation learning. However, they depend heavily on the availability and quality of structural data, limiting their applicability in early-stage discovery where such data is often unavailable.

Image-based models introduce a third class of unimodal approaches, focusing on visual data to capture features that are difficult to encode symbolically. These models are particularly valuable when analyzing materials characterized through high-resolution microscopy or simulated visualizations. Examples include scanning electron microscopy (SEM), transmission electron microscopy (TEM), and visual outputs from crystal structure simulation tools. Images allow for direct observation of microstructures, defect patterns, grain boundaries, and symmetry-breaking phenomena, key aspects that influence mechanical performance, conductivity, and other macroscopic properties. Deep learning models for image analysis in materials science initially adopted Convolutional Neural Networks (CNNs), which excel at extracting hierarchical features through local receptive fields [26]. CNNs have been employed for phase classification, defect detection, and segmentation of microstructural domains. Early layers capture edges and textures, while deeper layers learn compound features such as lattice arrangements, void distributions, and anisotropy. This makes CNNs particularly adept at distinguishing between similar-looking phases or detecting minute structural anomalies. More recently, Vision Transformers (ViTs) [27] have offered a new approach by modeling images as sequences of patches and applying self-attention mechanisms across the entire field of view [24]. Unlike CNNs, ViTs are not constrained by locality and can model long-range dependencies more effectively. This enables the model to recognize global symmetry patterns, alignments across different crystal orientations, or recurring spatial motifs. ViTs have demonstrated superior performance in image classification and segmentation tasks, especially when pretrained on large-scale datasets and fine-tuned to materials-specific tasks. 
One key resource for implementing image-based models in materials science is TIMM (PyTorch Image Models), a versatile library for PyTorch (v2.4.1) offering an extensive collection of pre-trained models and utilities for training and fine-tuning on custom image datasets. TIMM includes over 600 pre-trained models, such as ResNet [28], EfficientNet [29], Vision Transformer (ViT) [27], and Swin Transformer [30], trained on large-scale datasets like ImageNet [31]. These models are particularly effective for transfer learning, where pre-trained weights are adapted to domain-specific materials science tasks. TIMM allows flexible customization, including architecture modifications, layer freezing, and fine-tuning, making it well-suited for high-resolution images of 3D crystal structures. While image-based models offer high expressivity, they are highly dependent on the resolution, quality, and annotation of the input images. Furthermore, images may not capture compositional or electronic nuances that are critical for certain applications, requiring their fusion with other data modalities for a more complete understanding.

2.2. Multimodal Models

While unimodal models have advanced the field by extracting meaningful patterns from specific data types, they are inherently limited by their reliance on isolated representations of materials. Materials science is intrinsically multimodal, with properties emerging from the complex interplay of composition, structure, morphology, and experimental conditions [32]. Multimodal learning addresses this by integrating heterogeneous data sources within a single predictive framework, leveraging complementary strengths to enhance predictive performance and generalizability [15,33]. For instance, compositional models capture elemental interactions, structural models provide geometric context, and image-based models offer morphological insights, enabling a holistic representation of materials.

A crucial step in multimodal modeling is the fusion of representations [34]. Early fusion combines raw inputs before processing, allowing joint learning but requiring strict alignment. Late fusion merges outputs after modality-specific processing, offering flexibility but limiting interaction. Hybrid fusion, combining intermediate features, balances shared learning and modality-specific nuances, making it ideal for materials science’s complex chemical-physical interplay [35].
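The three fusion strategies can be contrasted schematically as follows. This is an illustrative formalization rather than the exact architecture used in this work: the symbols $f_m$, $g$, $g_m$, $W_m$, and $\alpha_m$ are notation introduced here, the weighted sum is only one common form of late fusion, and the placement of layer normalization inside the MLP is left unspecified.

```latex
% Notation (assumed for illustration): x_m is the raw input and
% h_m = f_m(x_m) the encoder output for modality
% m \in \{\mathrm{text}, \mathrm{tab}, \mathrm{img}\};
% \oplus denotes concatenation.
\begin{align}
\text{Early fusion:}  \quad & \hat{y} = g\left(x_{\mathrm{text}} \oplus x_{\mathrm{tab}} \oplus x_{\mathrm{img}}\right) \\
\text{Late fusion:}   \quad & \hat{y} = \sum_{m} \alpha_m \, g_m\left(f_m(x_m)\right), \qquad \sum_{m} \alpha_m = 1 \\
\text{Hybrid fusion:} \quad & \hat{y} = \mathrm{MLP}\left(W_{\mathrm{text}} h_{\mathrm{text}} \oplus W_{\mathrm{tab}} h_{\mathrm{tab}} \oplus W_{\mathrm{img}} h_{\mathrm{img}}\right)
\end{align}
```

In the hybrid form, each linear map $W_m$ projects a modality-specific embedding before concatenation, so the shared MLP operates on intermediate features rather than on raw inputs or final predictions.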

Recent studies demonstrate that integrating graph-based structure representations with text-derived composition embeddings [36] or vision-based image features [37] improves prediction accuracy and interpretability. However, unresolved challenges hinder the widespread adoption of multimodal learning in materials science, directly informing this study’s contributions. First, data scarcity remains a significant barrier. Unlike in general machine learning, comprehensive multimodal datasets combining text, images, and structured data are scarce in materials science: databases like Materials Project [38] and Alexandria [16] provide structured data but lack fully integrated multimodal representations. This limits model training, particularly for symmetry-dependent properties requiring aligned crystallographic data. Second, alignment issues complicate integration. Ensuring that text, image, and tabular data correspond to the same material entity with semantic and structural consistency is challenging, especially with disparate data sources [39]. Misalignment introduces noise, degrading performance. Third, hybrid fusion strategies involve trade-offs. While combining intermediate features enables shared learning, it risks losing modality-specific nuances, and designing efficient fusion architectures remains an open problem [40,41]. Additionally, current multimodal models often rely on synthetic data due to limited experimental datasets, and the lack of standardized benchmarks hinders fair comparisons and reproducibility [11,42].

These limitations underscore the need for novel methodologies, such as symmetry-aware frameworks, and interdisciplinary collaboration to curate high-quality, domain-specific datasets [43,44]. This study addresses these challenges through a symmetry-aware multimodal framework. To tackle data scarcity, we curate a dataset of 10,000 materials from the Alexandria database, integrating aligned text, image, and tabular representations with symmetry-resolved crystallographic data. Alignment is ensured through standardized preprocessing and metadata linking, leveraging crystallographic formats like CIF files. A tailored hybrid fusion architecture, using linear transformations, concatenation, and a multi-layer perceptron, balances inter-modal interactions and modality-specific features to preserve symmetry-driven relationships. By addressing dependence on synthetic data and the lack of benchmarks, our framework and dataset provide a reproducible foundation for multimodal learning, fostering interdisciplinary efforts to advance materials discovery.

3. Methodology

At the outset of this work, several key data sources were considered to construct a dataset suitable for multimodal deep learning in materials science. These sources included established repositories such as the Materials Project [38] and the Alexandria [16] databases. These databases provide comprehensive material data, covering chemical composition, structural properties, and various material characteristics. However, a common limitation of these sources is that the data is primarily stored in JSON format, which yields tabular representations that are rich in numerical, structural, and categorical information but lack diversity in data modalities. As a result, a key objective of this work became the transformation of the JSON-based tabular data from the Alexandria dataset into a fully multimodal dataset incorporating text, tabular, and visual modalities. Given the scale of the Alexandria database, which includes data on millions of materials, a stratified random sample of 10,000 materials from the complete 3D database was selected to ensure representativeness across key crystallographic categories. This sampling approach was proportional to the distribution of space groups in the full database, thereby preserving the diversity of crystal systems and symmetry types while mitigating the potential biases of a purely random selection. The decision to subsample was motivated by the limited computational power and processing capacity available for handling the entire database, keeping the transformation process manageable while still providing a representative sample for multimodal learning. To verify the suitability of this subset, we analyzed its diversity in terms of space groups, structural prototypes, and chemical families. 
Figure 1 illustrates the distribution of crystal systems and the most prevalent space groups within the sampled dataset, showing a balanced representation that mirrors the broader database (e.g., tetragonal: 29%, cubic: 17%, monoclinic: 17%, orthorhombic: 16%, trigonal: 14%, hexagonal: 4%, triclinic: 2%, with detailed breakdowns for dominant space groups such as 123 at 14% in tetragonal and 216 at 6% in cubic). Figure 2 provides further statistics on structural prototypes and chemical families, confirming broad coverage across various compound types and elemental families, including metals, semiconductors, and insulators. This proportional sampling and statistical validation affirm the dataset’s adequacy for capturing symmetry-driven variations in material properties.
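The proportional sampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: the function name `stratified_sample`, the largest-remainder rounding rule, and the toy records are all hypothetical, and only the stratification key (the space group number, `spg`) comes from the text.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, n_total, seed=0):
    """Sample n_total records, allocating to each stratum in proportion to its size.

    Hypothetical helper: quotas are rounded with the largest-remainder
    rule so that they sum exactly to n_total.
    """
    if n_total > len(records):
        raise ValueError("n_total exceeds the number of records")
    rng = random.Random(seed)

    # Group records by stratum (e.g., space group number).
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    # Exact proportional share of each stratum, then its integer part.
    exact = {k: n_total * len(v) / len(records) for k, v in strata.items()}
    quota = {k: int(x) for k, x in exact.items()}

    # Hand the remaining slots to the strata with the largest remainders.
    leftover = n_total - sum(quota.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for k in by_remainder[:leftover]:
        quota[k] += 1

    sample = []
    for k, members in strata.items():
        sample.extend(rng.sample(members, quota[k]))
    return sample


if __name__ == "__main__":
    # Toy population: 60 entries in space group 123, 30 in 216, 10 in 2.
    toy = [{"id": i, "spg": 123 if i < 60 else (216 if i < 90 else 2)} for i in range(100)]
    picked = stratified_sample(toy, key=lambda r: r["spg"], n_total=10)
    print(sorted(r["spg"] for r in picked))
```

Because quotas follow the stratum sizes, rare space groups remain represented at roughly their population frequency instead of depending on chance, as they would in a purely random draw.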

3.1. Alexandria Database

In the Alexandria database, each record is structured to provide comprehensive information about the material’s composition, properties, and atomic structure [16]. The record starts with metadata fields such as @module and @class, which indicate that the data is derived from pymatgen, a Python (v3.10) library widely used in computational materials science. The composition field specifies the material’s elemental composition, with elements as keys and their corresponding quantities as values, defining the material’s stoichiometry. To illustrate the structure of the Alexandria database, we provide an example JSON snippet in Figure A1 in Appendix A.

Detailed material-specific data is stored under the data field. This includes a unique material identifier (mat_id) and the formula field, which provides the chemical formula in a human-readable format. The elements field lists the constituent elements of the material, while the spg field specifies the material’s space group number. The nsites field indicates the number of atomic sites in the unit cell, and the stress tensor is represented as a $3 \times 3$ matrix in the stress field. The data field also contains computed properties such as total energy (energy_total), total magnetic moment (total_mag), indirect and direct band gaps (band_gap_ind and band_gap_dir), density of states at the Fermi level (dos_ef), corrected energy (energy_corrected), energy above the convex hull (e_above_hull), formation energy (e_form), and energy related to phase separation (e_phase_separation). Decomposition information, which specifies potential breakdown products of the material, is provided in the decomposition field. For example, the material in the JSON snippet (Figure A1, Appendix A), Ba(SrPd)₂, has a stress tensor of [[1.4069707, 0.0, 0.01502944], [0.0, 1.5771061, 0.0], [0.01502944, 0.0, 1.4671515]] GPa and a total magnetic moment of $5.04 \times 10^{-5}$ $\mu_{B}$ for a specific atomic site, illustrating the mechanical and magnetic properties encoded in the dataset.

The structure field contains detailed information about the material’s atomic structure. It begins with the total charge (charge), followed by lattice information under the lattice field. This includes a $3 \times 3$ matrix of lattice vectors (matrix), periodic boundary conditions (pbc), lattice constants (a, b, c), lattice angles (alpha, beta, gamma), and the unit cell volume (volume).
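The lattice constants, angles, and volume are all determined by the lattice matrix. The sketch below recomputes them, assuming the rows of the matrix are the lattice vectors (the convention used by pymatgen) and the usual crystallographic definition in which alpha is the angle between b and c, beta between a and c, and gamma between a and b. The function name `lattice_parameters` is illustrative.

```python
import math


def lattice_parameters(matrix):
    """Return (a, b, c, alpha, beta, gamma, volume) from a 3x3 lattice matrix.

    Assumes each row is one lattice vector; angles are in degrees.
    """
    va, vb, vc = matrix

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    def norm(u):
        return math.sqrt(dot(u, u))

    def angle(u, v):
        cos = dot(u, v) / (norm(u) * norm(v))
        # Clamp to [-1, 1] to guard against floating-point overshoot.
        return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

    # Volume is the absolute scalar triple product |a . (b x c)|.
    cross_bc = (
        vb[1] * vc[2] - vb[2] * vc[1],
        vb[2] * vc[0] - vb[0] * vc[2],
        vb[0] * vc[1] - vb[1] * vc[0],
    )
    volume = abs(dot(va, cross_bc))

    return (norm(va), norm(vb), norm(vc),
            angle(vb, vc), angle(va, vc), angle(va, vb), volume)


if __name__ == "__main__":
    cubic = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
    print(lattice_parameters(cubic))
```

A check of this kind is a cheap way to confirm that the stored scalar fields agree with the matrix before they are used as tabular features.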

The atomic sites within the unit cell are listed under the sites field, where each site is represented as a dictionary containing the species present, defined by the species field. This field specifies the element (element) and its occupancy (occu), as well as fractional (abc) and Cartesian (xyz) coordinates. The label field provides the element symbol for reference, and each site also includes associated properties, such as the magnetic moment (magmom), charge, and atomic forces (forces), represented as a list of force components.

The structure of the Alexandria database makes it a highly detailed and versatile resource, enabling users to explore material properties comprehensively and facilitating advanced research in materials science. Further details about the Alexandria database, including information on the 1D, 2D, and 3D compounds it contains and the calculations performed using the exchange-correlation functional of density functional theory, can be found in Schmidt et al. [16].
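As a rough sketch of how such a record might be flattened into a tabular row, the example below reads a handful of the fields named above from a nested dictionary. The record is a hand-written placeholder, not real Alexandria data: the `mat_id` and all numeric values are invented for illustration, the nesting of per-site properties under a `properties` key is an assumption, and deriving volume per atom as `volume / nsites` is likewise assumed rather than stated in this section.

```python
def flatten_record(entry):
    """Collect selected scalar fields from one nested record into a flat dict."""
    data = entry["data"]
    lattice = entry["structure"]["lattice"]
    sites = entry["structure"]["sites"]
    return {
        "mat_id": data["mat_id"],
        "formula": data["formula"],
        "spg": data["spg"],
        "nsites": data["nsites"],
        "band_gap_ind": data["band_gap_ind"],
        "e_form": data["e_form"],
        "volume": lattice["volume"],
        # Assumed derivation: unit cell volume divided by the number of sites.
        "volume_per_atom": lattice["volume"] / data["nsites"],
        "site_labels": [site["label"] for site in sites],
    }


# Placeholder record: field names follow the text, values are invented.
EXAMPLE = {
    "composition": {"Ba": 1.0, "Sr": 2.0, "Pd": 2.0},
    "data": {
        "mat_id": "example-0001",  # hypothetical identifier
        "formula": "Ba(SrPd)2",
        "elements": ["Ba", "Sr", "Pd"],
        "spg": 123,
        "nsites": 5,
        "band_gap_ind": 0.0,
        "e_form": -0.1,
    },
    "structure": {
        "charge": 0,
        "lattice": {"a": 5.0, "b": 5.0, "c": 6.0,
                    "alpha": 90.0, "beta": 90.0, "gamma": 90.0,
                    "volume": 150.0},
        "sites": [
            {"species": [{"element": el, "occu": 1}], "label": el,
             "abc": [0.0, 0.0, 0.0], "xyz": [0.0, 0.0, 0.0],
             "properties": {"magmom": 0.0}}
            for el in ["Ba", "Sr", "Sr", "Pd", "Pd"]
        ],
    },
}

if __name__ == "__main__":
    print(flatten_record(EXAMPLE))
```

Keeping `mat_id` in every flattened row preserves the link to the images and text generated for the same material.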

3.2. Dataset Creation and Multimodal Representation

The process of constructing the dataset and generating multimodal representations is crucial for enabling robust multimodal deep learning applications in materials science. The dataset construction involved the creation of complementary modalities, including image, tabular, and text representations, along with the selection and preparation of standardized target features. Each modality provided unique insights into the properties and structure of materials, facilitating a comprehensive understanding and enhancing predictive performance across a range of deep learning models.

3.2.1. Generating the Image Modality for Multimodal Learning

To generate the image modality, 2D visualizations of each material’s 3D crystal structure were created. These images provide a spatial representation of atomic arrangements, capturing critical structural details that complement the text and tabular data modalities. The images were produced using the same framework employed to render 3D objects in the Materials Project Web Interface, ensuring consistent and high-quality visualizations.

The image generation process involved designing a web application to render and capture 3D visualizations of each material. The implementation utilized the Dash framework for the web interface and Crystal Toolkit [45], integrated with pymatgen, to display the 3D structures interactively. For each material, a 3D model was rendered in the web interface and automatically screenshotted, saving the result as a 2D PNG file. A layout was defined to display the structures, with an update interval of five seconds to progress through the dataset automatically. At each interval, a callback function was triggered when the structure data was updated. This function requested a PNG image of the rendered scene after a one-second delay to ensure the structure was fully rendered before capturing the screenshot. The captured image was saved to a specified directory and named using the material’s mat_id, maintaining alignment with the dataset.

A consistent, fixed viewpoint was used for every material to ensure uniformity across the dataset. This standardization eliminates variability in orientation as a confounding factor for the model. Furthermore, the generated PNG images are intentionally minimalist to prevent model bias from non-structural artifacts. Each $460 \times 460$ pixel image contains only the visual representation of the material’s crystal structure: the atoms, the bonds, and the unit cell outline. No extraneous graphical user interface (GUI) elements, such as navigation compasses, toolbars, or informational labels (e.g., atom or element identifiers), are present in the final images. This controlled approach ensures that the deep learning model’s predictions are based solely on the intrinsic structural features of the material rather than any artifacts from the visualization software.

The entire image generation process required approximately 2.5 h to produce 10,000 images using parallel processing. This process was executed on a system with an Intel 12th Gen Core i7-1270P (12 cores, 2.20 GHz) and 32 GB RAM. The Dash framework and Crystal Toolkit were run using Python 3.10 with pymatgen 2024.10.3. The resulting dataset included visualizations of various materials, examples of which are presented in Table 1.

3.2.2. Standardizing Structural Data for Tabular Representation

For the tabular modality, the dataset initially had a structured format that could be easily transformed into tabular data. However, incorporating the structure object, one of the most critical sources of information, posed a significant challenge. The structure object varied substantially across materials due to differences in the number of atomic sites and coordination environments. This variability rendered it impractical to create a consistent tabular format, as the number of columns and entries differed for each material. Without a fixed column structure, standard tabular models could not effectively process the data, necessitating the development of alternative methods to represent structural information consistently across all materials.

To address this challenge, the PotNet architecture was employed. PotNet is designed to generate fixed-size embeddings of a material’s structure, capturing essential features such as atomic positions and interatomic relationships. Although PotNet is primarily used for property predictions, its workflow was adapted in this context by eliminating the prediction layer, allowing for the extraction of fixed-length vectors representing each material’s structure. These fixed-length embeddings were then integrated into the tabular dataset, providing a consistent and comprehensive representation of the structural information. The PotNet model was utilized in a frozen state, employing pre-trained weights without any fine-tuning on this dataset.

PotNet relies on interatomic potentials as input features to capture spatial and bonding characteristics. These potentials describe atomic interactions based on relative positions, offering valuable insights into the physical and chemical properties of the material. To compute these potentials, the atomic structure data for each material was first converted into the Crystallographic Information File (CIF) format using the pymatgen library. CIF files are a standard format for storing crystal structure data, including lattice parameters, atomic coordinates, and symmetry information. Subsequently, these CIF files were processed using jarvis-tools, converting them into the JARVIS Atoms format, ensuring compatibility with various computational workflows. Node attributes, including atomic numbers, atomic masses, and elemental properties like electronegativity and valence electron count (derived from pymatgen and jarvis-tools), were calculated to represent individual atoms. Edge indices, defining bonds or potential interactions based on interatomic distances, were also computed to transform each material’s structure into a graph-based format suitable for PotNet.

PotNet was initialized and configured with three convolution layers, specifying dimensions for both atom and edge features, along with an output vector size. An output size of 128 dimensions was chosen to adequately capture the structural complexity of the materials. This choice aligns with common practices in graph-based models like CGCNN and PotNet, where embedding sizes of 64–256 dimensions balance expressivity and computational efficiency for datasets with thousands of materials [11,16,42]. Given the 10,000-material dataset’s diversity in crystal structures, 128 dimensions were deemed sufficient to encode atomic positions, bonding, and symmetry features. PotNet was chosen for its ability to capture long-range interatomic potentials and symmetry-aware features, outperforming traditional handcrafted descriptors (e.g., radial distribution functions) and other graph-based models like CGCNN in tasks requiring physical grounding. Unlike CGCNN, which relies on predefined bond distances, PotNet’s fully connected graph with pairwise potential weights better encodes subtle structural interactions. A detailed comparison of PotNet embeddings to other featurization methods is planned for future work. Attribute tensors, representing individual atom properties such as atomic numbers, were constructed to represent atoms as feature vectors. These tensors were then processed through convolutional and transformer layers using PotNetConv and TransformerConv to extract embeddings from node and edge features. Following this, global mean pooling was applied to aggregate node features into a single vector representation for the entire structure. This vector was further refined through a fully connected layer, producing the final embedding. The embeddings for all materials were stored as a tabular representation of structural features, each with 128 dimensions.
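The pooling step is what turns a variable number of atomic sites into a fixed-length vector. The sketch below illustrates only that idea in plain Python; it is not PotNet's implementation, and the function name and the toy 4-dimensional node features are hypothetical.

```python
def global_mean_pool(node_features):
    """Average per-atom feature vectors into one fixed-length structure vector."""
    if not node_features:
        raise ValueError("a structure needs at least one atom")
    dim = len(node_features[0])
    if any(len(row) != dim for row in node_features):
        raise ValueError("all node feature vectors must share one length")
    n_atoms = len(node_features)
    # Column-wise mean: the output length depends on dim, not on n_atoms
    return [sum(row[j] for row in node_features) / n_atoms for j in range(dim)]


# Two hypothetical structures with different atom counts (2 and 5 sites)
small_cell = [[1.0, 0.0, 2.0, 4.0], [3.0, 2.0, 0.0, 0.0]]
large_cell = [[1.0, 1.0, 1.0, 1.0] for _ in range(5)]
```

Both structures map to a vector of the same length, which is the property that allows the embeddings to be stored as ordinary table columns.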

3.2.3. Generating the Text Modality for Multimodal Learning

Inspired by CrabNet’s architecture, which efficiently handles compositional data, a strategy was developed to represent compositions in a standardized text format. In this approach, each element in the material, along with its atom count, was included in the text description, with elements and their corresponding counts separated by spaces. This method was specifically designed to facilitate tokenization for the Hugging Face text foundation model used in this work (URL: https://huggingface.co/google/electra-base-discriminator, accessed on 22 July 2025), ensuring that the model could effectively process the compositional data.

This simplified representation also aimed to enhance generalization by promoting uniformity across composition texts. By standardizing the format, where elements are followed by their counts and separated by spaces, the model was better equipped to recognize common elemental compositions and identify useful patterns. For example, a material such as Ba(SrPd)₂, composed of ten atoms with the composition {Ba: 2.0, Sr: 4.0, Pd: 4.0}, was converted to the text representation “Ba 2 Sr 4 Pd 4”. This streamlined format supported the integration of text data into the multimodal framework, balancing simplicity with informativeness to optimize model performance.

To address potential limitations with fractional occupancies, which are common in materials with partial site occupations or defects, we note that the Alexandria dataset primarily features materials with integer stoichiometries in their reduced compositions (e.g., as floats representing whole numbers like 2.0 or 4.0). No fractional occupancies are present in the sampled 10,000 materials, so no special handling was required for this study. However, the text representation is designed to accommodate fractions by expressing counts as decimal numbers (e.g., “Ba 1.5 Sr 4.5 Pd 4” for a hypothetical fractional case), and the Electra tokenizer handles such numeric strings effectively by splitting them into subword tokens if needed, without requiring adjustments. This was not empirically tested here due to the dataset’s composition, but future work will evaluate robustness on datasets including fractional occupancies, quantifying impacts on tokenization and prediction accuracy.
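The conversion described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name is hypothetical, and writing whole-number counts without a decimal point is inferred from the “Ba 2 Sr 4 Pd 4” example.

```python
def composition_to_text(composition):
    """Render {"Ba": 2.0, ...} as "Ba 2 ...", keeping the given element order."""
    tokens = []
    for element, count in composition.items():
        value = float(count)
        # Whole-number counts drop the decimal point; fractional counts keep it
        text = str(int(value)) if value.is_integer() else format(value, "g")
        tokens.extend([element, text])
    return " ".join(tokens)


integer_case = composition_to_text({"Ba": 2.0, "Sr": 4.0, "Pd": 4.0})
fractional_case = composition_to_text({"Ba": 1.5, "Sr": 4.5, "Pd": 4})
```

Here `integer_case` reproduces the string from the Ba(SrPd)₂ example, and `fractional_case` reproduces the hypothetical fractional string mentioned above.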

3.2.4. Target Feature Selection

Identifying and selecting specific target features from the original dataset was essential to establishing benchmarks for prediction, ensuring consistent comparison across different modalities and architectures. By focusing on a standardized set of material properties, the study defined clear objectives for each model, whether unimodal or multimodal, allowing for a robust assessment of predictive performance across all selected features.

The target features were derived from relevant dataset columns, with transformations applied where necessary to standardize the data and enhance its suitability for deep learning models. The selected features included:

Gap (eV): The band gap, measured in electron volts, serves as a key indicator of a material’s electronic behavior, distinguishing conductors, insulators, and semiconductors. These values were directly retrieved from band_gap_ind.
Eform/atom (eV/atom): The formation energy per atom, in electron volts, reflects the thermodynamic stability of the material and was sourced directly from e_form.
Ehull/atom (eV/atom): Energy above the convex hull per atom, measured in electron volts, indicates the material’s stability relative to potential phase separation. These values were obtained from e_above_hull.
Etot/atom (eV/atom): The total energy per atom, in electron volts, representing the cumulative stability and binding energy of the atomic configuration, was calculated by dividing energy_total by the number of atomic sites (nsites).
Mag/vol ($μ_{B}$/Å³): The magnetic moment per unit volume, measured in Bohr magnetons per cubic angstrom, provides a normalized measure of material magnetism. This value was computed by dividing total_mag by volume.
Vol/atom (Å³/atom): The atomic volume per atom, in cubic angstroms, offers insights into atomic packing density and was derived by dividing volume by nsites.
DOS/atom (states/(eV atom)): The density of electronic states per atom at the Fermi level, a key measure of electronic and conductive properties, was computed by dividing dos_ef by nsites.
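The derivations in the list above amount to a handful of divisions. The sketch below assumes one material is available as a dictionary keyed by the column names quoted in the list; the function name and the numeric values in the example row are hypothetical.

```python
def derive_targets(row):
    """Compute the seven targets from one record of raw dataset columns."""
    nsites = row["nsites"]
    volume = row["volume"]
    if nsites <= 0 or volume <= 0:
        raise ValueError("nsites and volume must be positive")
    return {
        "Gap": row["band_gap_ind"],                # taken directly
        "Eform/atom": row["e_form"],               # taken directly
        "Ehull/atom": row["e_above_hull"],         # taken directly
        "Etot/atom": row["energy_total"] / nsites,
        "Mag/vol": row["total_mag"] / volume,
        "Vol/atom": volume / nsites,
        "DOS/atom": row["dos_ef"] / nsites,
    }


# Hypothetical record, for illustration only
example_row = {
    "band_gap_ind": 0.0, "e_form": -0.5, "e_above_hull": 0.02,
    "energy_total": -50.0, "total_mag": 4.0, "volume": 200.0,
    "dos_ef": 5.0, "nsites": 10,
}
example_targets = derive_targets(example_row)
```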

Figure 3 displays the normalized distributions (scaled to a $[0, 1]$ range for comparability) of the seven selected target material properties: Gap, Eform/atom, Ehull/atom, Etot/atom, Mag/vol, Vol/atom, and DOS/atom. The forms of the distributions vary notably: several exhibit strong right-skewness with sharp peaks near the lower end (around 0), likely including the band gap, energy above hull, and density of states per atom, which aligns with physical expectations such as the many metallic (zero-gap) or non-magnetic materials in the dataset; others appear more bell-shaped or centered, with broader spreads and peaks toward the middle or higher values, reflecting diverse material behaviors across the sampled properties.

Together, these features provide a comprehensive representation of material properties, with each feature normalized or structured to ensure compatibility across different modalities. This systematic approach to feature selection and preparation not only facilitated robust model training but also enabled meaningful comparisons of predictive performance across various learning frameworks.

3.3. Multimodal Training Pipeline

During the model-building phase, an automated pipeline was designed to streamline the processes of training, prediction, and evaluation. This pipeline provided significant flexibility, allowing for various combinations of selected modalities and target features. The dataset was split, with 90% allocated for training (9000 materials, including 1800 for validation, representing 20% of the training set) and 10% for testing (1000 materials). To address potential data leakage arising from structural similarities, such as polymorphs or materials sharing the same crystallographic prototype, the dataset was partitioned into training and testing sets using a grouped split based on space groups. The grouping was performed in alignment with the stratified sampling procedure employed during dataset creation from the Alexandria database, where the initial selection of 10,000 materials was proportional to the distribution of space groups to maintain crystallographic diversity. By specifying the target feature and selecting the input modalities, such as tabular, image, text, or combinations thereof, the pipeline dynamically adapted to diverse data types.
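A grouped split keeps every member of a group on one side of the partition. The sketch below shows one simple way to do this with space groups as the grouping key; the greedy assignment, the function name, and the default seed are assumptions, since the exact procedure is not specified here.

```python
import random
from collections import defaultdict


def grouped_split(space_groups, test_fraction=0.10, seed=0):
    """Return (train_idx, test_idx) with every space group on exactly one side."""
    by_group = defaultdict(list)
    for index, group in enumerate(space_groups):
        by_group[group].append(index)

    groups = sorted(by_group)
    random.Random(seed).shuffle(groups)  # reproducible group order

    target = test_fraction * len(space_groups)
    train_idx, test_idx = [], []
    for group in groups:
        members = by_group[group]
        # A whole group goes to the test side only if it does not overshoot the target
        if len(test_idx) + len(members) <= target:
            test_idx.extend(members)
        else:
            train_idx.extend(members)
    return sorted(train_idx), sorted(test_idx)
```

Because whole groups are assigned, the test share can fall somewhat below the requested fraction when group sizes are uneven.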

Figure 4 provides a visual overview of the multimodal architecture, illustrating how each data modality is processed through dedicated encoders before fusion. Tabular data is handled by an FT-Transformer, textual data is encoded using a Hugging Face Transformer model, and images undergo feature extraction via a TIMM-based MetaFormer network. These representations are then aligned in a shared latent space through a hybrid fusion strategy, which integrates independent feature extraction with cross-modal interactions to enhance predictive performance.

3.3.1. Tabular Modality

The FT-Transformer (Feature Tokenizer + Transformer) is a transformer-based architecture designed for processing tabular data, specifically numerical features, leveraging an attention mechanism to capture complex dependencies [46]. The model comprises approximately 867,000 trainable parameters. Numerical features, initially represented in a 128-dimensional space, are first embedded into a 192-dimensional learnable representation through a numerical feature tokenizer [47]. This embedding maps each feature into a high-dimensional space, facilitating subsequent processing. A linear projection layer preserves the 192-dimensional feature representation. The core of the FT-Transformer consists of three stacked transformer blocks, each incorporating a multi-head self-attention mechanism with eight attention heads. These mechanisms compute inter-feature relationships by applying learnable linear projections to generate query, key, and value representations, followed by an output projection. Dropout (p = 0.2) is applied within the attention mechanism to enhance regularization. Each transformer block also includes a feed-forward network (FFN) that expands the feature dimension from 192 to 384 and then reduces it back to 192, employing a gated linear unit (GEGLU) activation function to improve expressiveness. Dropout (p = 0.1) is applied within the FFN to further promote generalization. Layer normalization (eps = $1 \times 10^{-5}$) is performed after both the attention and FFN operations to stabilize training dynamics. For unimodal operation, the transformer output undergoes a final layer normalization, followed by a ReLU activation and a linear transformation that reduces the 192-dimensional representation to a single output (1-dimensional), suitable for regression tasks. In multimodal settings, the model employs an identity head, preserving the 192-dimensional output for integration with other modalities via the fusion MLP, enabling flexible downstream processing.
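The GEGLU gate can be illustrated without any deep learning library. The sketch below assumes one common formulation, in which the 384-dimensional expansion is split into two 192-dimensional halves and one half gates the other through GELU; whether the model used here follows exactly this layout is an assumption.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def geglu(vector):
    """Split an even-length vector into halves (a, b) and return a * GELU(b)."""
    if len(vector) % 2:
        raise ValueError("GEGLU needs an even number of features")
    half = len(vector) // 2
    values, gates = vector[:half], vector[half:]
    return [v * gelu(g) for v, g in zip(values, gates)]


# A 384-d expanded vector is gated down to 192 dimensions
gated = geglu([0.5] * 384)
```

Under this reading, the gate halves the width, so a 192 to 384 expansion followed by GEGLU already yields a 192-dimensional vector before the final linear layer.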

3.3.2. Text Modality

The text processing component employs the base variant of the Electra architecture [48], a transformer-based model with approximately 108 million trainable parameters, fine-tuned for regression tasks (URL: https://huggingface.co/google/electra-base-discriminator, accessed on 22 July 2025). Input tokens are first transformed into 768-dimensional dense vector representations via an embedding layer, integrating three components: word embeddings (vocabulary size: 30,522) to capture semantic information, positional embeddings (maximum sequence length: 512) to encode token order, and token-type embeddings (size: 2) to differentiate input segments. Layer normalization (eps = $1 \times 10^{-12}$) stabilizes the embeddings, and dropout (p = 0.1) mitigates overfitting. The core processing occurs in the Electra encoder, comprising twelve transformer layers designed to model complex contextual dependencies among tokens. Each layer features a multi-head self-attention mechanism, where input features are projected into query, key, and value representations (each 768-dimensional) via linear transformations. This mechanism computes inter-token relationships, followed by an output projection, dropout (p = 0.1) for regularization, and layer normalization (eps = $1 \times 10^{-12}$) for training stability. Subsequently, a feed-forward network (FFN) expands the feature dimension from 768 to 3072 and reduces it back to 768, applying a GELU activation function to introduce non-linearity and enhance expressiveness. Dropout (p = 0.1) and layer normalization are applied post-FFN to ensure robust training. In unimodal operation, the encoder’s output is processed through a linear layer, reducing the 768-dimensional representation to a single output (1-dimensional), suitable for regression tasks requiring a continuous numerical prediction. In multimodal configurations, an identity head is employed, preserving the 768-dimensional output for integration with other modalities via the fusion MLP, enabling seamless downstream processing in a multimodal framework.

3.3.3. Image Modality

The image processing pipeline uses a MetaFormer-based model from the TIMM (PyTorch Image Models) library, namely ‘caformer_b36.sail_in22k_ft_in1k’ [49], which contains approximately 95.7 million parameters (URL: https://huggingface.co/timm/caformer_b36.sail_in22k_ft_in1k, accessed on 22 July 2025). This architecture processes input images through a hierarchical sequence of stages to produce numerical predictions, tailored for either unimodal or multimodal applications. The pipeline begins with a stem module that applies a convolutional layer to reduce the spatial dimensions of the input image while expanding its feature depth. This initial transformation extracts low-level features, such as edges and textures, preparing the data for subsequent processing. The core of the model consists of four sequential stages, each containing multiple MetaFormer blocks. These blocks integrate local feature extraction and global context modeling to progressively refine the feature representation. In the first two stages, separable convolution-based token mixers capture spatial relationships, preserving local details. Downsampling layers between stages reduce spatial resolution while increasing feature dimensionality, enabling the model to construct a feature hierarchy. In the later stages, attention-based token mixers are employed to model long-range dependencies, enhancing the model’s ability to capture complex patterns across the image. The final stage aggregates the extracted features through a head module. For unimodal applications, the head module employs global average pooling, followed by layer normalization and flattening, to produce a compact feature representation. This representation is processed by a multi-layer perceptron (MLP) head, which includes an expansion layer with non-linear activation (Squared ReLU), layer normalization, and a final linear layer that outputs a single scalar value (1-dimensional) suitable for regression tasks. In multimodal settings, the head module is replaced with an identity layer, preserving the feature representation for fusion with other modalities, enabling integration with downstream processing components. This architecture leverages a combination of convolutional and attention-based mechanisms to effectively capture both local and global image features. By adapting its output head based on the application context, unimodal regression or multimodal feature extraction, the model demonstrates versatility and robustness for diverse predictive tasks.

3.3.4. Fusion Model

The fusion model implements a hybrid-fusion architecture designed to integrate feature representations from multiple modalities, such as images, text, and tabular data, to produce a unified prediction for multimodal tasks. Each modality is processed independently by its respective backbone model, generating modality-specific feature representations that are subsequently aligned and fused. The fusion process begins with transformations to standardize feature dimensions across modalities. Tabular data, represented in a 192-dimensional space, is projected to a 768-dimensional space using a linear transformation with bias. Image and text representations, already in the 768-dimensional space, undergo additional linear transformations to ensure consistency. These transformations align the feature representations for effective integration. In multimodal configurations, aligned representations are concatenated to form a feature vector of 2304 dimensions for three modalities or 1536 dimensions for two modalities, ensuring seamless integration of modality-specific embeddings for downstream processing. This vector is stabilized through layer normalization with an epsilon of $10^{-5}$, mitigating internal covariate shift during training. The normalized vector is processed by a multi-layer perceptron (MLP) comprising a linear layer that reduces the dimensionality to 128, preserving critical inter-modal information. A LeakyReLU activation function with a negative slope of 0.01 introduces non-linearity to model complex relationships between modalities, followed by a dropout layer with a probability of 0.1 to enhance generalization by reducing overfitting. The fused representation is then mapped to a single scalar output (1-dimensional) via a linear transformation with bias, suitable for regression tasks. In unimodal settings, the fusion model is not employed. Instead, the prediction is generated directly by the head of the respective backbone model, as detailed in prior sections. This ensures that unimodal tasks leverage the specialized processing of individual backbones, while the fusion model provides a robust mechanism for integrating multiple modalities in multimodal contexts, capturing both intra-modal and inter-modal patterns for accurate regression predictions. Hybrid fusion was selected over early and late fusion due to its ability to balance modality-specific feature extraction with cross-modal interactions, as it integrates intermediate representations via linear transformations, concatenation, and a multi-layer perceptron. Early fusion risks losing modality-specific details due to premature integration, while late fusion limits inter-modal learning by merging only final outputs. The hybrid approach’s efficiency stems from its modular design, allowing flexible modality combinations without requiring retraining of individual encoders.
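The data flow of the MLP fusion head can be traced with a dependency-free sketch. It follows the dimensions given above (192 and 768 inputs, a 768-dimensional shared space, a 128-unit hidden layer, one output), but the class name, the random initialization, and the omission of the learnable scale and shift in layer normalization are simplifying assumptions; it is an inference-time illustration, so dropout acts as the identity.

```python
import math
import random


def linear(x, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def layer_norm(x, eps=1e-5):
    """Normalize to zero mean and unit variance (no learnable affine here)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def leaky_relu(x, slope=0.01):
    return [v if v > 0 else slope * v for v in x]


def init_linear(rng, n_in, n_out):
    bound = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-bound, bound) for _ in range(n_out)]
    return weight, bias


class FusionMLP:
    """Align each modality to a shared width, concatenate, then regress."""

    def __init__(self, input_dims, shared_dim=768, hidden_dim=128, seed=0):
        rng = random.Random(seed)
        self.adapters = [init_linear(rng, d, shared_dim) for d in input_dims]
        self.hidden = init_linear(rng, shared_dim * len(input_dims), hidden_dim)
        self.out = init_linear(rng, hidden_dim, 1)

    def forward(self, features):
        # 1) per-modality linear alignment to the shared dimension
        aligned = [linear(x, w, b) for x, (w, b) in zip(features, self.adapters)]
        # 2) concatenation: 3 x 768 = 2304 or 2 x 768 = 1536 with the default width
        fused = [v for part in aligned for v in part]
        # 3) layer norm -> linear to hidden_dim -> LeakyReLU (dropout skipped at inference)
        hidden = leaky_relu(linear(layer_norm(fused), *self.hidden))
        # 4) final linear layer to one scalar
        return fused, linear(hidden, *self.out)[0]
```

Calling `FusionMLP([192, 768, 768]).forward(...)` with one tabular, one text, and one image vector returns the 2304-dimensional concatenation together with a single scalar prediction.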

In addition to the MLP-based fusion, a transformer-based fusion mechanism was explored as an ablation study to integrate feature representations from multiple modalities, such as images, text, and tabular data, to produce a unified prediction for multimodal tasks. Similar to the MLP approach, each modality is processed independently by its respective backbone model, generating modality-specific feature representations that are subsequently aligned and fused. The fusion process begins with transformations to standardize feature dimensions across modalities. Tabular data, represented in a 192-dimensional space, is projected to a 768-dimensional space using a linear transformation with bias. Image and text representations, already in the 768-dimensional space, undergo additional linear transformations to ensure consistency, employing the maximum dimension adaptation strategy. These transformations align the feature representations for effective integration. In multimodal configurations, aligned representations are concatenated along the sequence dimension to form a tensor of shape [batch_size, number_of_modalities, 768], ensuring seamless integration of modality-specific embeddings for downstream processing. A learnable CLS token is then prepended to this sequence, resulting in a shape of [batch_size, number_of_modalities + 1, 768]. This tensor is processed through a stack of three transformer blocks, each incorporating a multi-head self-attention mechanism with eight attention heads (without additive attention or shared query-value weights). These mechanisms compute inter-modality relationships by applying learnable linear projections to generate query, key, and value representations, followed by an output projection. Dropout (p = 0.2) is applied within the attention mechanism to enhance regularization, and layer normalization (epsilon of $1 \times 10^{-5}$) is performed to stabilize training dynamics. Each transformer block also includes a feed-forward network (FFN) with 192 hidden nodes that expands and reduces the feature dimension, employing a GEGLU activation function to improve expressiveness. Dropout (p = 0.1) is applied within the FFN to further promote generalization, with residual dropout and pre-normalization (without first pre-normalization) for overall stability. The output embedding corresponding to the CLS token from the final transformer block is then mapped to a single scalar output (1-dimensional) via a linear head with ReLU activation and layer normalization, suitable for regression tasks. In unimodal settings, this fusion model is not employed, following the same protocol as the MLP-based fusion. Experimental results using this transformer-based fusion yielded performance metrics (MAE and RMSE across the seven target properties) that were comparable to those of the MLP-based fusion but did not demonstrate improvements. Consequently, we retained the simpler MLP fusion strategy for its computational efficiency and equivalent predictive accuracy, as detailed in the main results.

3.3.5. Training

To optimize model performance, hyperparameter tuning was conducted using the Optuna framework, which employs the Tree-structured Parzen Estimator (TPE) algorithm for efficient Bayesian optimization. TPE models parameter distributions to adaptively select promising hyperparameter combinations, which typically makes it more sample-efficient than traditional grid or random search methods. A total of 30 trials were conducted, balancing exploration of the parameter space with computational efficiency.

The hyperparameter search spaces are detailed in Table 2. The learning rate was sampled logarithmically from $[10^{-5}, 10^{-2}]$ to explore diverse optimization dynamics. The optimizer was chosen from {AdamW, SGD}, balancing adaptive and momentum-based strategies. Maximum epochs were selected from {10, 20, 30}, and batch sizes from {16, 32, 64, 128}, optimizing the trade-off between memory usage and training stability.
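The search space can be written down as a small configuration. The sketch below draws 30 configurations with plain random sampling from the standard library; it is a stand-in that only illustrates the ranges and the log-scale draw, not the TPE algorithm and not Optuna's API, and all names in it are hypothetical.

```python
import math
import random

SEARCH_SPACE = {
    "learning_rate": (1e-5, 1e-2),       # sampled on a log scale
    "optimizer": ["AdamW", "SGD"],
    "max_epochs": [10, 20, 30],
    "batch_size": [16, 32, 64, 128],
}


def sample_config(rng):
    """Draw one configuration; the learning rate is uniform in log10 space."""
    low, high = SEARCH_SPACE["learning_rate"]
    log_lr = rng.uniform(math.log10(low), math.log10(high))
    return {
        "learning_rate": 10.0 ** log_lr,
        "optimizer": rng.choice(SEARCH_SPACE["optimizer"]),
        "max_epochs": rng.choice(SEARCH_SPACE["max_epochs"]),
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
    }


rng = random.Random(0)
trials = [sample_config(rng) for _ in range(30)]  # 30 trials, as in the study
```

Sampling in log space gives each decade between $10^{-5}$ and $10^{-2}$ the same chance of being explored, which a uniform draw on the raw interval would not.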

The training setup is summarized in Table 3. Training utilized a weight decay of 0.001 to enhance generalization and prevent overfitting. A learning rate decay strategy of layerwise_decay was paired with a decay rate of 0.9, ensuring stable convergence across model layers. A cosine learning rate scheduler was applied to smoothly adjust the learning rate over epochs. Warmup steps spanning 0.1 (10% of total training steps) stabilized early training. Early stopping with a validation patience of 10 epochs mitigated overfitting, with validation check interval of 0.5 epochs for consistent performance monitoring. The Mean Squared Error (MSE) loss function was used, aligning with the regression tasks for predicting material properties. Training was conducted with 16-mixed precision, and feature pooling mode was set to concat.
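The interaction of warmup, cosine scheduling, and layer-wise decay can be made concrete with two small functions. The sketch assumes a linear warmup from zero, a cosine decay to zero, and a per-layer multiplier of 0.9 counted from the top layer downward; these details and the function names are assumptions rather than the exact behavior of the training framework.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_fraction=0.1):
    """Linear warmup over the first fraction of steps, then cosine decay."""
    warmup_steps = int(warmup_fraction * total_steps)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def layerwise_lr(base_lr, depth_from_top, decay=0.9):
    """Scale the learning rate down by `decay` for each layer below the top."""
    return base_lr * decay ** depth_from_top


# Hypothetical run: 1000 optimizer steps with a base learning rate of 1e-3
schedule = [lr_at_step(s, 1000, 1e-3) for s in range(1001)]
```

With these assumptions, the learning rate peaks at the end of warmup (step 100 of 1000) and layers closer to the input receive progressively smaller updates than the top layers.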

For multimodal inputs, feature representations from text, image, and tabular modalities were fused via concatenation, enabling robust integration in the hybrid fusion model. In unimodal settings, predictions were generated directly by the respective backbone model (e.g., FT-Transformer for tabular, Electra-based for text, MetaFormer for images), bypassing the fusion layer. Mixed-precision training (16-bit floating-point) was employed to optimize computational efficiency while maintaining numerical stability.

Following the identification of the single best set of hyperparameters for each model, the model’s performance and stability were assessed. To account for stochastic elements in the training process, the model with this optimal configuration was trained and evaluated three separate times, each initiated with a different random seed. The final reported metrics represent the average and standard deviation of the results from these three independent runs. This methodology ensures that the results are representative and not an artifact of a specific random initialization, providing a more robust measure of the model’s expected performance and its variance.
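Aggregating the three runs is a one-line computation. The sketch below uses the sample standard deviation (n − 1 in the denominator); whether the reported values use the sample or the population form is not stated, so this choice is an assumption, and the three error values are hypothetical.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over independent runs."""
    if len(scores) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    return mean(scores), stdev(scores)


# Hypothetical MAE values from three seeds
run_mean, run_std = summarize_runs([0.10, 0.12, 0.11])
```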

Training was conducted on an NVIDIA T4 GPU with 16 GB VRAM, with peak memory usage of approximately 12 GB, accommodating model parameters, optimizer states, and intermediate activations. No gradient accumulation was required, as the batch size fit within memory constraints. Training durations for each seed ranged from several minutes to approximately two hours, depending on modality complexity and dataset size. This setup enabled rapid iteration and evaluation of unimodal and multimodal configurations, streamlining development and minimizing manual tuning efforts.

4. Results and Analysis

The evaluation of the multimodal learning models focused on assessing their predictive performance on the test set across various combinations of modalities and target features. The results, shown in the tables and figures of this section, provide critical insights into the efficacy of integrating diverse modalities for predicting material properties and the influence of modality interactions on overall model performance.

4.1. Error Metrics

To quantify predictive performance, two primary error metrics, mean absolute error (MAE) and root mean squared error (RMSE), were utilized, as defined in Equations (1) and (2):

MAE = \sum_{i = 1}^{N} \frac{1}{N} | y_{i} - {\hat{y}}_{i} |,

(1)

RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} (y_{i} - {\hat{y}}_{i})^{2}} .

(2)

In these equations, $y_{i}$ represents the true value, $\hat{y}_{i}$ is the predicted value, and N is the total number of samples. The MAE measures the average absolute difference between predicted and actual values, providing a direct assessment of prediction accuracy. In contrast, the RMSE, due to its quadratic nature, emphasizes larger deviations, penalizing significant errors more heavily. Naturally, lower values of both MAE and RMSE indicate better model performance, as they reflect smaller prediction errors. Together, these metrics balance typical errors (captured by MAE) and outliers (captured by RMSE).
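Equations (1) and (2) translate directly into code. The sketch below is a plain-Python illustration with hypothetical toy values, chosen so that a single large error shows how RMSE reacts more strongly than MAE.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, Equation (1)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, Equation (2)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Three errors of size 1 and one outlier of size 3
y_true = [0.0, 0.0, 0.0, 0.0]
y_pred = [1.0, 1.0, 1.0, 3.0]
toy_mae = mae(y_true, y_pred)    # (1 + 1 + 1 + 3) / 4
toy_rmse = rmse(y_true, y_pred)  # sqrt((1 + 1 + 1 + 9) / 4)
```

In this toy case the MAE is 1.5 while the RMSE is the square root of 3, so the single outlier pulls the RMSE above the MAE.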

To enable normalized comparisons across target features with diverse scales and units, such as band gap (Gap, in eV) and atomic volume (Vol/atom, in Å³), scaled metrics, MAE Scaled and RMSE Scaled, were employed. These metrics normalize prediction errors relative to the inherent variability of each feature, ensuring robust comparisons across properties with differing ranges or distributions. The choice of mean absolute deviation (MAD) and root mean squared deviation (RMSD) for scaling was driven by their ability to capture the natural variability of each feature by measuring deviations relative to the mean, avoiding biases from extreme values or skewed distributions common in materials properties. Unlike simpler methods like min-max or standard deviation scaling, MAD/RMSD preserves the physical significance of errors, facilitating fair and interpretable comparisons across heterogeneous properties and highlighting the multimodal framework’s effectiveness. The MAE Scaled is defined in Equation (3):

MAE Scaled = \frac{MAE}{MAD} = \frac{\sum_{i = 1}^{N} |y_{i} - {\hat{y}}_{i}|}{\sum_{i = 1}^{N} |y_{i} - \bar{y}|} .

(3)

Similarly, the RMSE Scaled is defined in Equation (4):

RMSE Scaled = \frac{RMSE}{RMSD} = \frac{\sqrt{\frac{1}{N} \sum_{i = 1}^{N} (y_{i} - {\hat{y}}_{i})^{2}}}{\sqrt{\frac{1}{N} \sum_{i = 1}^{N} (y_{i} - \bar{y})^{2}}} .

(4)

The MAE Scaled quantifies the mean absolute error relative to the MAD, where the numerator represents the average prediction error and the denominator captures the average deviation of true values from their mean. Similarly, the RMSE Scaled compares the RMSE to the RMSD, with the numerator reflecting the quadratic mean of prediction errors and the denominator representing the quadratic mean deviation of true values from their mean. In both cases, a value close to 1 suggests prediction errors align with the data’s natural variability, while a value below 1 indicates superior performance relative to the baseline variability.
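The scaled metrics can be sketched in the same style, with the mean of the true values as the reference in the denominator. The function names and the toy data are hypothetical; the example checks the interpretation given above, namely that a model which always predicts the mean scores exactly 1.

```python
import math


def mae_scaled(y_true, y_pred):
    """MAE divided by the mean absolute deviation of the true values."""
    y_mean = sum(y_true) / len(y_true)
    error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    spread = sum(abs(t - y_mean) for t in y_true)
    if spread == 0:
        raise ValueError("true values are constant; the scaled metric is undefined")
    return error / spread


def rmse_scaled(y_true, y_pred):
    """RMSE divided by the root mean squared deviation of the true values."""
    y_mean = sum(y_true) / len(y_true)
    error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    spread = sum((t - y_mean) ** 2 for t in y_true)
    if spread == 0:
        raise ValueError("true values are constant; the scaled metric is undefined")
    return math.sqrt(error / spread)


y_true = [1.0, 2.0, 3.0, 6.0]          # mean is 3.0
mean_baseline = [3.0, 3.0, 3.0, 3.0]   # always predict the mean
```

Both functions return 1 for `mean_baseline` and 0 for perfect predictions, so values between 0 and 1 measure how much of the natural spread the model explains.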

4.2. MAE and RMSE Results

The evaluation of predictive performance across various modality combinations reveals the strengths of multimodal learning in addressing the limitations of single-modality approaches in materials science, particularly for symmetry-dependent properties. Single-modality models, such as those relying solely on tabular structural descriptors, textual chemical compositions, or image-based visualizations, often fail to capture the complex interplay of compositional, structural, and symmetry-related factors that govern material properties. For instance, tabular data may encode lattice parameters and atomic coordinates but lack the visual context of atomic arrangements critical for symmetry-driven phenomena like electronic band gap or magnetic behavior. Similarly, text-based models excel at capturing compositional details but miss spatial and symmetry-related structural nuances, while image-based models may overlook quantitative structural metrics. Multimodal learning overcomes these limitations by integrating complementary data types, textual compositions, tabular structural descriptors, and 2D visualizations of 3D crystal structures, into a unified framework. This approach explicitly leverages symmetry-resolved crystallographic data from the Alexandria dataset, enabling the model to capture spatial invariants (e.g., lattice symmetry, atomic coordination) through image-based representations and quantitative symmetry metrics in tabular data. By fusing these modalities, the framework achieves a holistic representation of materials, significantly enhancing predictive accuracy for properties governed by crystallographic symmetry, such as band gap (Gap), energy above the hull per atom (Ehull/atom), and magnetic moment per volume (Mag/vol).

The MAE results in Table 4 highlight the superior performance of multimodal approaches over unimodal ones, underscoring the framework’s ability to leverage complementary information from textual compositions, tabular structural descriptors, and image-based 2D visualizations of 3D crystal structures to better capture symmetry-dependent material properties. Among the unimodal baselines, the tabular model exhibits the highest errors for all seven properties, such as a MAE of 0.219 eV for band gap (Gap) and 7.139 Å³ atom⁻¹ for volume per atom (Vol/atom), reflecting its limitations in encoding visual patterns of atomic arrangements and symmetry elements that influence electronic and volumetric behaviors. In contrast, image and text modalities perform comparably well on their own, with images achieving lower errors for properties like formation energy per atom (Eform/atom) at 0.133 eV atom⁻¹ and magnetic moment per volume (Mag/vol) at 0.006 $\mu_B$ Å⁻³, likely due to their capacity to implicitly represent crystallographic symmetry through visual lattice depictions and compositional sequences, respectively. Bimodal combinations further enhance accuracy, with the image + text fusion standing out by reducing MAE for band gap by approximately 22.7% relative to the average of the image and text unimodal baselines (from 0.135 eV to 0.104 eV), and for volume per atom by 22.4% (from 1.655 Å³ atom⁻¹ to 1.284 Å³ atom⁻¹), demonstrating how integrating visual symmetry cues with precise chemical information captures the intricate interplay governing thermodynamic stability and spatial invariants. The trimodal approach (tabular + images + text) yields competitive results, such as the lowest MAE for band gap at 0.103 eV and for density of states per atom (DOS/atom) at 0.160 states (eV atom)⁻¹, though it is occasionally slightly worse than the best bimodal model, for example for energy above hull per atom (Ehull/atom) at 0.133 eV atom⁻¹, possibly due to the increased complexity of fusing high-dimensional tabular symmetry metrics (e.g., space group and lattice parameters) without overfitting. Overall, these findings validate the hybrid fusion strategy’s effectiveness in creating a unified latent space that preserves crystallographic constraints, advancing data-driven predictions for materials discovery in applications reliant on electronic, thermodynamic, and magnetic properties.
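The quoted percentage reductions can be checked directly from Table 4. The sketch below is an illustrative check and not the authors' code; it assumes that the baseline is the mean of the image and text unimodal MAE values, which is an inference from the reported starting points (0.135 eV and 1.655 Å³ atom⁻¹). The variable and function names are hypothetical.

```python
# MAE values copied from Table 4 (means only; the reported ± spread is ignored).
mae = {
    "Gap": {"Images": 0.135, "Text": 0.134, "Images + Text": 0.104},
    "Vol/atom": {"Images": 1.350, "Text": 1.959, "Images + Text": 1.284},
}

def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline

reductions = {}
for prop, row in mae.items():
    # Assumed baseline: average of the image and text unimodal models.
    baseline = (row["Images"] + row["Text"]) / 2
    reductions[prop] = relative_reduction(baseline, row["Images + Text"])

for prop, pct in reductions.items():
    print(f"{prop}: {pct:.1f}% lower MAE than the image/text unimodal mean")
```

With the rounded table entries this gives about 22.7% for Gap and 22.4% for Vol/atom, matching the figures in the text. Including the tabular model in the baseline would give a much larger average for Vol/atom, so the two-model average is the reading consistent with the table.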

The RMSE results in Table 5 further affirm the advantages of multimodal integration. Because RMSE penalizes errors quadratically, it is more sensitive to large prediction deviations than MAE, which is particularly insightful for symmetry-governed material properties where outliers can significantly affect the modeling of electronic, thermodynamic, and magnetic behaviors. Consistent with the MAE trends, the tabular unimodal model shows the highest RMSE values across most targets, such as 0.580 eV for band gap (Gap) and 9.232 Å³ atom⁻¹ for volume per atom (Vol/atom), highlighting its challenges in capturing the nuanced visual and compositional cues essential for modeling crystallographic symmetry and spatial invariants. Image and text unimodal approaches again demonstrate stronger individual performance, with images yielding lower RMSE for formation energy per atom (Eform/atom) at 0.239 eV atom⁻¹ and magnetic moment per volume (Mag/vol) at 0.013 $\mu_B$ Å⁻³, attributable to their ability to encode symmetry patterns via 2D crystal visualizations and elemental sequences, respectively. Bimodal fusions amplify these gains, notably the image + text combination, which reduces RMSE for formation energy by approximately 28.4% relative to the average of the image and text unimodal baselines (from 0.261 eV atom⁻¹ to 0.187 eV atom⁻¹), and for band gap by 16.6% (from 0.401 eV to 0.334 eV), illustrating the synergistic capture of symmetry-driven interactions that influence phase stability and electronic structures. The trimodal model (tabular + images + text) often achieves the overall lowest errors, including for band gap at 0.313 eV and density of states per atom (DOS/atom) at 0.231 states (eV atom)⁻¹, although it shows minor increases in some cases, such as energy above hull per atom (Ehull/atom) at 0.210 eV atom⁻¹, potentially reflecting the added complexity of incorporating tabular symmetry descriptors into the hybrid fusion without excessive noise. Collectively, these RMSE outcomes reinforce the modular framework’s capacity to forge a cohesive latent representation that respects crystallographic constraints, thereby enhancing predictive reliability for materials informatics and accelerating symmetry-aware design in fields such as energy materials and advanced electronics.
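The reason RMSE is read here as a measure of robustness to large deviations can be illustrated with a short sketch. The two error vectors below are hypothetical toy values, chosen so that both have the same MAE while only one contains an outlier.

```python
import math

def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical residuals: same total absolute error, distributed differently.
uniform = [0.1, 0.1, 0.1, 0.1]       # every prediction is off by 0.1
one_outlier = [0.0, 0.0, 0.0, 0.4]   # three exact predictions, one large miss

print(mae(uniform), rmse(uniform))          # MAE and RMSE coincide
print(mae(one_outlier), rmse(one_outlier))  # same MAE, RMSE is doubled
```

Both vectors have an MAE of 0.1, but the RMSE rises from 0.1 to 0.2 when the error is concentrated in a single prediction. A model whose RMSE falls more than its MAE is therefore mainly reducing its largest errors.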

Figure 5 provides a visual representation of Mean Absolute Error (MAE) across various modality combinations for the predicted material properties, offering an intuitive overview of how multimodal integration enhances accuracy in capturing symmetry-dependent behaviors such as electronic band gaps, thermodynamic stabilities, and magnetic moments. The bar charts clearly depict the tabular unimodal approach yielding the highest errors for most properties, exemplified by the tall bars for band gap (Gap) and volume per atom (Vol/atom), which highlights its limitations in encoding crystallographic symmetry without visual or compositional cues. In contrast, unimodal image and text models show shorter bars, indicating better standalone performance, with images particularly effective for properties like formation energy per atom (Eform/atom) and volume per atom (Vol/atom) due to their ability to represent symmetry elements of crystal structures in 2D visualizations. Bimodal fusions, especially image + text, show markedly shorter bars, with substantial decreases for Gap and Vol/atom, underscoring the synergistic benefits of combining visual symmetry patterns with precise chemical compositions to model intricate physical invariants. The trimodal combination (tabular + images + text) often achieves the lowest bars overall, though with occasional slight increases for certain properties like energy above hull per atom (Ehull/atom), reflecting the hybrid fusion’s balance in incorporating additional symmetry descriptors without undue complexity.

Figure 6 visualizes the Root Mean Squared Error (RMSE) for the same modality combinations and target properties, emphasizing the framework’s robustness against larger deviations that could arise from symmetry-governed outliers in electronic, thermodynamic, and volumetric features. Mirroring the MAE patterns, the tabular unimodal bars are the tallest across most categories, such as for Gap and Vol/atom, illustrating its challenges in handling variance tied to unrepresented symmetry invariants like lattice parameters. Unimodal image and text approaches exhibit lower bars, with images showing superior performance for Eform/atom and Vol/atom through their encoding of visual crystal symmetries, while text provides compositional context that reduces errors in properties like total energy per atom (Etot/atom). Bimodal integrations, notably image + text, result in significantly shorter bars, highlighting reductions in RMSE for formation energy and band gap, which demonstrate the complementary capture of symmetry-driven phenomena affecting phase stability and electronic structures. The trimodal model frequently displays the shortest bars, including for Gap and density of states per atom (DOS/atom), although minor elevations occur in some instances like Ehull/atom, potentially due to the added fusion of tabular metrics introducing subtle noise.

4.3. MAE Scaled and RMSE Scaled Results

The MAE Scaled and RMSE Scaled metrics, defined in Equations (3) and (4), normalize prediction errors relative to the inherent variability of each target feature, enabling fair comparisons across properties with diverse scales and units (e.g., Gap in eV, Vol/atom in Å³ atom⁻¹). These metrics highlight how multimodal learning addresses the limitations of single-modality approaches by integrating symmetry-aware representations that capture the multifaceted nature of material properties. Single-modality models often struggle with symmetry-dependent properties due to incomplete representations: tabular models miss visual symmetry patterns, text models lack structural context, and image models may overlook quantitative compositional details. By fusing textual, tabular, and image-based modalities, the proposed framework leverages symmetry-resolved crystallographic data to model spatial invariants (e.g., lattice symmetry, atomic coordination) and compositional interactions, resulting in superior predictive performance for properties like Eform/atom, Ehull/atom, and Vol/atom.

The MAE Scaled results in Table 6, which normalize errors relative to each property’s inherent variability, reinforce the multimodal framework’s effectiveness in capturing symmetry-dependent material behaviors across diverse scales and units, enabling equitable comparisons for properties like band gap (Gap) and volume per atom (Vol/atom). Consistent with unscaled trends, the tabular unimodal model exhibits the highest scaled errors for most targets, such as 0.828 for Gap and 0.990 for Vol/atom, underscoring its shortcomings in representing visual symmetry elements and compositional nuances critical for modeling crystallographic invariants. Image and text unimodal models perform notably better individually, with images achieving lower scaled MAE for Eform/atom at 0.199 and Mag/vol at 0.502, likely owing to their encoding of 2D symmetry visualizations, while text excels in composition-driven properties like Mag/vol at 0.487. Bimodal integrations yield further improvements, particularly the image + text fusion, which reduces scaled MAE for Gap by approximately 22.6% relative to the average of the image and text unimodal baselines (from 0.508 to 0.393) and for Vol/atom by 22.4% (from 0.230 to 0.178), highlighting the complementary capture of symmetry-aware structural patterns and chemical interactions that govern electronic and thermodynamic properties. The trimodal approach (tabular + images + text) delivers competitive or superior results in several cases, including the lowest scaled MAE for Gap at 0.387 and DOS/atom at 0.344, though it shows minor elevations for properties like Eform/atom at 0.191, potentially due to the challenges of fusing additional tabular symmetry descriptors without introducing variability.
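The scaled and unscaled tables report nearly the same relative reductions, and this follows from the definition of MAE Scaled. As a short worked step, assume that every model is scored against the same true values, so that the MAD of a given property is one common constant. The subscripts "base" and "fused" are notation introduced here for the unimodal baseline and the fused model.

```latex
\frac{\mathrm{MAE\,Scaled}_{\text{base}} - \mathrm{MAE\,Scaled}_{\text{fused}}}
     {\mathrm{MAE\,Scaled}_{\text{base}}}
= \frac{\dfrac{\mathrm{MAE}_{\text{base}}}{\mathrm{MAD}} - \dfrac{\mathrm{MAE}_{\text{fused}}}{\mathrm{MAD}}}
       {\dfrac{\mathrm{MAE}_{\text{base}}}{\mathrm{MAD}}}
= \frac{\mathrm{MAE}_{\text{base}} - \mathrm{MAE}_{\text{fused}}}{\mathrm{MAE}_{\text{base}}}
```

For Gap, the scaled values give (0.508 − 0.393)/0.508 ≈ 22.6%, against 22.7% from the unscaled MAE; under the stated assumption the small gap is attributable to rounding of the tabulated values. Scaling therefore changes how errors compare across properties, not how models rank within a property.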

The RMSE Scaled results in Table 7, which emphasize larger errors through the quadratic form of RMSE while remaining normalized for cross-property fairness, further validate the framework’s robustness in mitigating prediction variances tied to crystallographic symmetry across electronic, thermodynamic, and magnetic features. Mirroring prior patterns, the tabular unimodal baseline records the highest scaled RMSE values, including 0.933 for Gap and 0.994 for Vol/atom, reflecting its inability to integrate visual and textual cues essential for symmetry-resolved phenomena. Unimodal image and text models again show stronger standalone performance, with images posting lower scaled RMSE for Eform/atom at 0.248 and Ehull/atom at 0.418, attributed to their representation of lattice symmetry via visualizations, and text providing compositional context for properties like Etot/atom at 0.152. Bimodal combinations enhance these benefits, with image + text fusion notably reducing scaled RMSE for Eform/atom by approximately 28.2% relative to the average of the image and text unimodal baselines (from 0.272 to 0.195) and for Gap by 16.7% (from 0.645 to 0.537), demonstrating synergistic modeling of symmetry-influenced interactions affecting stability and band structures. The trimodal model often secures the lowest overall errors, such as for Gap at 0.503, DOS/atom at 0.391, and Etot/atom at 0.107, although it occasionally incurs slight increases, for example for Ehull/atom at 0.393, possibly stemming from the complexity of incorporating tabular metrics into the fusion process.

Figure 7 provides a visual comparison of the Mean Absolute Error Scaled (MAE Scaled) and Root Mean Squared Error Scaled (RMSE Scaled) across various modality combinations for the predicted material properties, offering an intuitive, normalized perspective on the framework’s performance in handling symmetry-dependent behaviors while accounting for each property’s inherent variability. The left panel’s bar charts for MAE Scaled and right panel for RMSE Scaled clearly illustrate the tabular unimodal approach yielding the highest scaled errors for most properties, such as elevated bars for magnetic moment per volume (Mag/vol) and volume per atom (Vol/atom), highlighting its limitations in encoding crystallographic symmetry through visual or compositional cues alone. In contrast, unimodal image and text models display shorter bars, indicating superior standalone performance, with images particularly effective for properties like volume per atom (Vol/atom) and total energy per atom (Etot/atom) due to their representation of 2D symmetry visualizations, while text excels in composition-driven metrics like magnetic moment per volume (Mag/vol). Bimodal fusions, especially image + text, show substantially reduced bar heights, such as notable decreases for Gap and Vol/atom, emphasizing the synergistic benefits of integrating symmetry patterns with chemical information to model physical invariants across scales. The trimodal combination (tabular + images + text) frequently achieves the lowest bars, though with minor increases for certain properties like Ehull/atom, reflecting the hybrid fusion’s ability to incorporate additional symmetry descriptors without excessive complexity.

Figure 8 presents heat maps of MAE Scaled (left) and RMSE Scaled (right) for different modality combinations, providing a compact, color-coded overview of predictive performance across all properties and fusions, which facilitates quick identification of optimal integrations for symmetry-governed electronic, thermodynamic, and magnetic features. The heat maps likely use color gradients, darker shades indicating higher errors, to highlight patterns, with the tabular unimodal row or column showing predominantly higher error intensities for targets like Gap and Vol/atom, underscoring its challenges in capturing variance tied to unrepresented symmetry invariants such as space groups and lattice parameters. Unimodal image and text entries exhibit cooler (lower error) shades, with images demonstrating stronger performance for Eform/atom and Mag/vol through encoding of crystal symmetries, and text providing compositional context that mitigates errors in Etot/atom. Bimodal combinations, particularly image + text, display notably cooler regions, illustrating reductions in scaled errors for formation energy and band gap, which demonstrate the complementary modeling of symmetry-influenced interactions affecting stability and electronic structures. The trimodal entries often feature the coolest shades overall, including for Gap and density of states per atom (DOS/atom), although slightly warmer tones may appear in isolated cases like Ehull/atom, potentially due to the fusion of tabular metrics introducing subtle noise.

5. Conclusions

This paper presented a novel multimodal deep learning framework designed to enhance material property prediction by integrating textual (chemical compositions), tabular (structural descriptors), and image-based (2D crystal structure visualizations) data modalities. The framework was developed and validated using a comprehensive dataset of 10,000 materials from the Alexandria database, featuring aligned multimodal data with symmetry-resolved crystallographic information.

The experimental results demonstrated that multimodal fusion, particularly through the developed hybrid fusion strategy, significantly outperforms unimodal baselines in predicting seven critical material properties. Notably, the bimodal integration of image and text data showed significant gains, reducing the Mean Absolute Error (MAE) for band gap by approximately 22.7% and for volume per atom by 22.4% compared to the average unimodal models. This combination also achieved a 28.4% reduction in Root Mean Squared Error (RMSE) for formation energy. The full trimodal model (tabular + images + text) yielded competitive, and in several cases the lowest, error metrics, particularly for band gap, magnetic moment per volume and density of states per atom, confirming the value of integrating all three modalities.

For future work, a detailed comparison of the PotNet-based embeddings used for the tabular modality against other structural featurization methods is planned. Further research will also focus on alternatives to the fixed-embedding tabular model for representing crystal structures. This includes exploring the direct integration of Graph Neural Networks (GNNs), which are state-of-the-art for structure-aware learning, into the multimodal framework. Such an approach could potentially capture complex atomic relationships and symmetry constraints more natively than the current tabular representation, offering a path to further improve predictive accuracy.

Author Contributions

Conceptualization, V.C., J.M.O. and P.R.; methodology, V.C., J.M.O. and P.R.; software, V.C., J.M.O. and P.R.; validation, V.C., J.M.O. and P.R.; formal analysis, V.C., J.M.O. and P.R.; investigation, V.C., J.M.O. and P.R.; resources, V.C., J.M.O. and P.R.; data curation, V.C., J.M.O. and P.R.; writing—original draft preparation, V.C., J.M.O. and P.R.; writing—review and editing, V.C., J.M.O. and P.R.; visualization, V.C., J.M.O. and P.R. All authors have read and agreed to the published version of the manuscript.

Funding

The authors gratefully acknowledge The European High Performance Computing Joint Undertaking (EuroHPC JU) for supporting this work under Project ID: EHPC-DEV-2024D10-076. Access to the Deucalion petascale Supercomputer, co-funded by Fundação para a Ciência e a Tecnologia (FCT) and EuroHPC, and hosted at the Azurém campus of the University of Minho, was instrumental in completing this research.

Data Availability Statement

A publicly available database was used in this study. The data can be found here: https://alexandria.icams.rub.de/ (accessed on 25 September 2024).

Acknowledgments

This article is based upon work from COST Action Data-driven Applications towards the Engineering of Functional Materials: an Open Network (DAEMON), CA22154, supported by COST (European Cooperation in Science and Technology).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Figure A1. Example JSON snippet from the Alexandria database for a material record.

Figure 1. Distribution of crystal systems and space groups in the selected dataset of 10,000 materials.

Figure 2. Elemental and chemical family distribution within the 10,000-material dataset.

Figure 3. Normalized distributions (scaled to a [0,1] range for comparability) of the seven selected target material properties across the dataset.

Figure 4. Multimodal architecture employed for material property prediction.

Figure 5. Visualization of MAE across various modality combinations and target features, where blue bars represent single-modality models (1 modality), orange bars represent bimodal fusions (2 modalities), and green bars represent the full trimodal model (3 modalities).

Figure 6. Visualization of RMSE across various modality combinations and target features, where blue bars represent single-modality models (1 modality), orange bars represent bimodal fusions (2 modalities), and green bars represent the full trimodal model (3 modalities).

Figure 7. MAE Scaled (left) and RMSE Scaled (right) results for different combinations of modalities across target features.

Figure 8. Heat maps of MAE Scaled (left) and RMSE Scaled (right) results for different modality combinations across target features.

Table 1. Example materials and their corresponding 2D visualizations generated from 3D crystal structures.

$\mathrm{AsPS_2}$	$\mathrm{Ba_2Na_4B_{16}(H_2O_5)_7}$	CaLaIrBr	$\mathrm{CuAsP_2}$	$\mathrm{Fe_2TcPtC}$
GaNiHgC	$\mathrm{KBaTa_3}$	$\mathrm{La_2Sm(Dy_2Y)_3}$	$\mathrm{LiSmH_2}$	MgFeRe
$\mathrm{PmSiNi_2}$	PtSeCl	$\mathrm{TbTm(AsO_4)_2}$	$\mathrm{Ti_2Cu_2Te}$	$\mathrm{Zr_2IrPt}$

Table 2. Hyperparameter search spaces for optimization.

Hyperparameter	Range	Parameter Type
learning rate	$[10^{- 5}, 10^{- 2}]$	Continuous (log)
optimizer	{AdamW, SGD}	Discrete
maximum epochs	$\{10, 20, 30\}$	Discrete
batch size	$\{16, 32, 64, 128\}$	Discrete
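
As a rough illustration of how the search space in Table 2 could be sampled, the sketch below draws configurations with plain random search. The search strategy, the dictionary keys, and the function name `sample_config` are assumptions made for this example; the table itself only specifies the ranges and parameter types.

```python
import math
import random

# Search space transcribed from Table 2.
SEARCH_SPACE = {
    "learning_rate": (1e-5, 1e-2),        # continuous, sampled on a log scale
    "optimizer": ["AdamW", "SGD"],        # discrete
    "max_epochs": [10, 20, 30],           # discrete
    "batch_size": [16, 32, 64, 128],      # discrete
}


def sample_config(rng):
    """Draw one configuration (random search is an assumption, not the paper's stated method)."""
    low, high = SEARCH_SPACE["learning_rate"]
    # Sample the exponent uniformly so every decade is equally likely.
    log_lr = rng.uniform(math.log10(low), math.log10(high))
    return {
        "learning_rate": 10 ** log_lr,
        "optimizer": rng.choice(SEARCH_SPACE["optimizer"]),
        "max_epochs": rng.choice(SEARCH_SPACE["max_epochs"]),
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
    }


rng = random.Random(0)  # fixed seed for repeatable draws
for _ in range(3):
    print(sample_config(rng))
```

Sampling the learning rate in log space reflects the "Continuous (log)" parameter type: a uniform draw on the raw interval would place almost all samples above $10^{-3}$.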

Table 3. Training setup.

Parameter	Value
weight decay	0.001
learning rate decay strategy	layerwise_decay
learning rate decay	0.9
learning rate scheduler	cosine
warmup steps	0.1
validation patience	10
validation check interval	0.5
loss function	MSE
precision	16-mixed
feature pooling mode	concat
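
The learning-rate entries in Table 3 can be read as follows; this is a minimal sketch under stated assumptions, not the authors' training code. It assumes that the warmup value 0.1 is a fraction of the total number of steps, that the cosine schedule decays to zero, and that layer-wise decay multiplies the learning rate by 0.9 for each layer further from the output. The function names `lr_multiplier` and `layerwise_lrs` are hypothetical.

```python
import math
from typing import List


def lr_multiplier(step: int, total_steps: int, warmup_fraction: float = 0.1) -> float:
    """Linear warmup followed by cosine decay to zero (assumed schedule shape)."""
    warmup_steps = int(warmup_fraction * total_steps)
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches the full rate.
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def layerwise_lrs(base_lr: float, num_layers: int, decay: float = 0.9) -> List[float]:
    """Per-layer learning rates, index 0 being the layer closest to the input.

    The top layer keeps base_lr; each layer below it is scaled by `decay` once more.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# Example: multiplier at a few points of a 100-step run, and rates for 4 layers.
print([round(lr_multiplier(s, 100), 3) for s in (0, 9, 10, 55, 100)])
print(layerwise_lrs(1e-3, 4))
```

Under these assumptions, the pretrained layers nearest the input receive the smallest updates, which is the usual motivation for layer-wise decay when fine-tuning pretrained encoders.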

Table 4. MAE results for different combinations of modalities across target features.

Modalities	Gap	Eform/Atom	Ehull/Atom	Etot/Atom	Mag/Vol	Vol/Atom	DOS/Atom
Tabular	$0.219 \pm 0.021$	$0.629 \pm 0.006$	$0.327 \pm 0.014$	$15.136 \pm 0.255$	$0.013 \pm 0.000$	$7.139 \pm 0.061$	$0.448 \pm 0.014$
Images	$0.135 \pm 0.045$	$0.133 \pm 0.006$	$0.139 \pm 0.032$	$5.327 \pm 2.003$	$0.006 \pm 0.002$	$1.350 \pm 0.072$	$0.214 \pm 0.020$
Text	$0.134 \pm 0.020$	$0.199 \pm 0.010$	$0.159 \pm 0.009$	$5.636 \pm 0.413$	$0.006 \pm 0.000$	$1.959 \pm 0.299$	$0.223 \pm 0.026$
Tabular + Images	$0.156 \pm 0.073$	$0.164 \pm 0.021$	$0.115 \pm 0.005$	$4.848 \pm 0.839$	$0.006 \pm 0.001$	$1.492 \pm 0.074$	$0.258 \pm 0.031$
Tabular + Text	$0.177 \pm 0.064$	$0.190 \pm 0.021$	$0.163 \pm 0.005$	$3.986 \pm 0.199$	$0.007 \pm 0.000$	$1.698 \pm 0.147$	$0.198 \pm 0.014$
Images + Text	$0.104 \pm 0.031$	$0.125 \pm 0.006$	$0.108 \pm 0.006$	$3.535 \pm 0.143$	$0.006 \pm 0.001$	$1.284 \pm 0.053$	$0.216 \pm 0.023$
Tabular + Images + Text	$0.103 \pm 0.010$	$0.128 \pm 0.006$	$0.133 \pm 0.013$	$3.676 \pm 0.418$	$0.005 \pm 0.000$	$1.393 \pm 0.234$	$0.160 \pm 0.025$
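
The relative improvements quoted in the abstract for the Images + Text fusion can be recomputed from the tabulated means. The sketch below assumes that the "average unimodal" baseline is the mean of the two constituent unimodal models (Images and Text); this reading is an inference, chosen because it reproduces the reported 22.7%, 22.4%, and 28.4% figures. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(fused, unimodal):
    """Percentage reduction of the fused error relative to the mean unimodal error."""
    baseline = sum(unimodal) / len(unimodal)
    return 100.0 * (1.0 - fused / baseline)


# MAE means from Table 4: Images + Text versus Images and Text alone.
mae_gap = relative_reduction(0.104, [0.135, 0.134])
mae_vol = relative_reduction(1.284, [1.350, 1.959])

# RMSE means for formation energy, taken from Table 5.
rmse_eform = relative_reduction(0.187, [0.239, 0.283])

print(f"Band gap MAE reduction:         {mae_gap:.1f}%")     # 22.7%
print(f"Volume per atom MAE reduction:  {mae_vol:.1f}%")     # 22.4%
print(f"Formation energy RMSE reduction: {rmse_eform:.1f}%")  # 28.4%
```

These figures use the reported means only and ignore the standard deviations, so they indicate the size of the effect rather than its statistical significance.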

Table 5. RMSE results for different combinations of modalities across target features.

Modalities	Gap	Eform/Atom	Ehull/Atom	Etot/Atom	Mag/Vol	Vol/Atom	DOS/Atom
Tabular	$0.580 \pm 0.013$	$0.856 \pm 0.012$	$0.489 \pm 0.002$	$24.625 \pm 1.518$	$0.020 \pm 0.000$	$9.232 \pm 0.030$	$0.572 \pm 0.006$
Images	$0.397 \pm 0.102$	$0.239 \pm 0.008$	$0.224 \pm 0.031$	$8.541 \pm 3.660$	$0.013 \pm 0.002$	$2.257 \pm 0.170$	$0.303 \pm 0.019$
Text	$0.404 \pm 0.027$	$0.283 \pm 0.012$	$0.247 \pm 0.017$	$8.325 \pm 1.378$	$0.014 \pm 0.001$	$2.731 \pm 0.422$	$0.308 \pm 0.040$
Tabular + Images	$0.434 \pm 0.113$	$0.265 \pm 0.023$	$0.195 \pm 0.011$	$8.014 \pm 1.840$	$0.013 \pm 0.001$	$2.396 \pm 0.125$	$0.346 \pm 0.030$
Tabular + Text	$0.465 \pm 0.088$	$0.268 \pm 0.027$	$0.245 \pm 0.007$	$6.783 \pm 0.683$	$0.014 \pm 0.001$	$2.454 \pm 0.163$	$0.275 \pm 0.016$
Images + Text	$0.334 \pm 0.074$	$0.187 \pm 0.010$	$0.175 \pm 0.009$	$6.662 \pm 2.022$	$0.012 \pm 0.002$	$1.948 \pm 0.052$	$0.302 \pm 0.030$
Tabular + Images + Text	$0.313 \pm 0.032$	$0.195 \pm 0.005$	$0.210 \pm 0.022$	$5.832 \pm 0.365$	$0.011 \pm 0.001$	$2.103 \pm 0.327$	$0.231 \pm 0.026$

Table 6. MAE Scaled results for different combinations of modalities across target features.

Modalities	Gap	Eform/Atom	Ehull/Atom	Etot/Atom	Mag/Vol	Vol/Atom	DOS/Atom
Tabular	$0.828 \pm 0.081$	$0.939 \pm 0.009$	$0.855 \pm 0.037$	$0.486 \pm 0.008$	$1.021 \pm 0.020$	$0.990 \pm 0.009$	$0.963 \pm 0.030$
Images	$0.510 \pm 0.168$	$0.199 \pm 0.009$	$0.363 \pm 0.082$	$0.171 \pm 0.064$	$0.502 \pm 0.134$	$0.187 \pm 0.010$	$0.459 \pm 0.042$
Text	$0.505 \pm 0.077$	$0.297 \pm 0.014$	$0.415 \pm 0.023$	$0.181 \pm 0.013$	$0.487 \pm 0.021$	$0.272 \pm 0.041$	$0.480 \pm 0.056$
Tabular + Images	$0.590 \pm 0.274$	$0.244 \pm 0.032$	$0.299 \pm 0.012$	$0.156 \pm 0.027$	$0.501 \pm 0.068$	$0.207 \pm 0.010$	$0.554 \pm 0.067$
Tabular + Text	$0.666 \pm 0.243$	$0.284 \pm 0.032$	$0.425 \pm 0.013$	$0.128 \pm 0.006$	$0.542 \pm 0.033$	$0.236 \pm 0.020$	$0.425 \pm 0.031$
Images + Text	$0.393 \pm 0.116$	$0.187 \pm 0.009$	$0.283 \pm 0.016$	$0.113 \pm 0.005$	$0.445 \pm 0.084$	$0.178 \pm 0.007$	$0.465 \pm 0.051$
Tabular + Images + Text	$0.387 \pm 0.040$	$0.191 \pm 0.009$	$0.348 \pm 0.034$	$0.118 \pm 0.013$	$0.405 \pm 0.022$	$0.193 \pm 0.032$	$0.344 \pm 0.054$
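
Comparing Table 6 with Table 4 suggests that each scaled value is the raw MAE divided by a constant that depends only on the target feature. The exact normalization is defined in Section 4.1 and is not reproduced here, so the sketch below is only a consistency check on the reported means: it divides raw by scaled values for the three unimodal rows and confirms that the implied constant is nearly the same within each target. The function name `implied_scales` is hypothetical.

```python
# Means for the Tabular, Images, and Text rows of two target features.
MAE = {
    "Gap": [0.219, 0.135, 0.134],          # Table 4
    "Eform/Atom": [0.629, 0.133, 0.199],   # Table 4
}
MAE_SCALED = {
    "Gap": [0.828, 0.510, 0.505],          # Table 6
    "Eform/Atom": [0.939, 0.199, 0.297],   # Table 6
}


def implied_scales(raw, scaled):
    """Per-row ratio raw / scaled; roughly constant if one divisor was used."""
    return [r / s for r, s in zip(raw, scaled)]


for target in MAE:
    scales = implied_scales(MAE[target], MAE_SCALED[target])
    spread = (max(scales) - min(scales)) / (sum(scales) / len(scales))
    print(target, [round(s, 4) for s in scales], f"relative spread {spread:.2%}")
```

The small residual spread is compatible with rounding of the tabulated values to three decimals. A per-target divisor of this kind is what makes the scaled errors comparable across features with very different units and ranges.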

Table 7. RMSE Scaled results for different combinations of modalities across target features.

Modalities	Gap	Eform/Atom	Ehull/Atom	Etot/Atom	Mag/Vol	Vol/Atom	DOS/Atom
Tabular	$0.933 \pm 0.022$	$0.892 \pm 0.012$	$0.914 \pm 0.004$	$0.450 \pm 0.028$	$0.981 \pm 0.005$	$0.994 \pm 0.003$	$0.967 \pm 0.010$
Images	$0.639 \pm 0.164$	$0.248 \pm 0.008$	$0.418 \pm 0.057$	$0.156 \pm 0.067$	$0.625 \pm 0.087$	$0.243 \pm 0.018$	$0.512 \pm 0.032$
Text	$0.650 \pm 0.043$	$0.295 \pm 0.012$	$0.462 \pm 0.032$	$0.152 \pm 0.025$	$0.660 \pm 0.067$	$0.294 \pm 0.045$	$0.521 \pm 0.068$
Tabular + Images	$0.699 \pm 0.182$	$0.276 \pm 0.023$	$0.364 \pm 0.020$	$0.146 \pm 0.034$	$0.649 \pm 0.048$	$0.258 \pm 0.013$	$0.585 \pm 0.051$
Tabular + Text	$0.748 \pm 0.141$	$0.279 \pm 0.028$	$0.457 \pm 0.012$	$0.124 \pm 0.012$	$0.672 \pm 0.048$	$0.264 \pm 0.018$	$0.465 \pm 0.027$
Images + Text	$0.537 \pm 0.118$	$0.195 \pm 0.010$	$0.327 \pm 0.016$	$0.122 \pm 0.037$	$0.588 \pm 0.109$	$0.210 \pm 0.006$	$0.511 \pm 0.051$
Tabular + Images + Text	$0.503 \pm 0.052$	$0.203 \pm 0.005$	$0.393 \pm 0.040$	$0.107 \pm 0.007$	$0.544 \pm 0.059$	$0.226 \pm 0.035$	$0.391 \pm 0.044$

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
