Article

Materials Properties Prediction (MAPP): Empowering the Prediction of Material Properties Solely Based on Chemical Formulas

by Si-Da Xue and Qi-Jun Hong *
School for Engineering of Matter, Transport and Energy, Arizona State University, Tempe, AZ 85287, USA
* Author to whom correspondence should be addressed.
Materials 2024, 17(17), 4176; https://doi.org/10.3390/ma17174176
Submission received: 8 July 2024 / Revised: 17 August 2024 / Accepted: 21 August 2024 / Published: 23 August 2024
(This article belongs to the Section Materials Simulation and Design)

Abstract

Predicting material properties has always been a challenging task in materials science. With the emergence of machine learning methodologies, new avenues have opened up. In this study, we build upon our recently developed graph neural network (GNN) approach to construct models that predict four distinct material properties. Our graph model represents materials as element graphs, with chemical formulas serving as the only input. This approach ensures permutation invariance, offering a robust solution to prior limitations. By employing bootstrap methods to train ensembles of this GNN, we further enhance the reliability and accuracy of our predictions. With multi-task learning, we harness the power of extensive datasets to boost the performance of smaller ones. We introduce the inaugural version of the Materials Properties Prediction (MAPP) framework, empowering the prediction of material properties solely based on chemical formulas.

1. Introduction

The accurate prediction of material properties is a demanding and time-consuming endeavor that presents persistent challenges, despite the extensive scientific efforts invested. Traditionally, experimental measurements and computational simulations have played primary roles in this pursuit. However, their reach is constrained by throughput speed and by the scope of chemical systems under investigation. Recent breakthroughs, such as the continuous expansion of material databases [1,2,3,4,5,6] and the rapid advancement of machine learning algorithms [7,8,9,10], have revolutionized the landscape of materials research. These developments have facilitated the widespread adoption of machine learning models, introducing fresh perspectives to the field and substantially accelerating the process of materials discovery. As a powerful alternative and complement to physics-based simulation, machine learning offers significant advances in predicting material properties. Yet, current models in materials science face limitations, often requiring specific descriptors and detailed crystal structures as inputs. The selection of new descriptors can be either a trial-and-error process or a challenging task demanding an understanding of physical mechanisms, and the crystal structure is usually unknown for an arbitrary chemical formula. Consequently, despite numerous works [11,12,13,14,15,16,17] that use machine learning for individual materials science problems, the broad applicability and adaptability of machine learning remain underdeveloped.
To overcome these limitations, we have developed a comprehensive framework that leverages the fundamental principle of using elements as building blocks and chemical composition as the input parameter, which enables the rapid and accurate computation of material properties solely based on chemical formulas. Our framework requires only the chemical formula as an input variable, allowing the prompt calculation of properties associated with any given chemical formula. We harness the capabilities of databases to build models and offer public access, transcending the conventional constraints of limited database size and entry count.
This approach offers several advantages. First, it eliminates the need for additional input beyond the chemical formula, making it applicable to any chemical formula, which is typically the only a priori known input for a new material. Second, since the models can handle any chemical formula, they can answer queries without being constrained by a database with a finite number of entries. Third, the output value of the model is determined collectively by the entire dataset, rather than by a single data point, which potentially minimizes errors. Lastly, the MAPP framework and its models are publicly accessible via the internet, thus empowering individuals without expertise in density functional theory (DFT) or machine learning (ML) to compute material properties rapidly and precisely. This framework has the potential to transform the approaches adopted by modelers for designing products and by experimentalists for utilizing their predictions, thereby revolutionizing material design and discovery for the future.

2. Materials and Methods

In the MAPP framework, we achieve the overall goal in three steps. First, we design a unique deep learning architecture that operates on any chemical formula with only the formula itself as the input. With this advantage, material design can rapidly survey the complete chemical space, without requiring any additional information about the material’s properties, as the chemical formula is typically the only a priori known input for a new material. Second, we build a system to integrate data from various sources [18], including experiments, DFT, and ML. Using the deep learning model in Step 1 as a foundation, we employ bootstrap aggregation (or cross-validation) and construct ensemble models to not only provide uncertainty quantification but also detect outliers that can be reviewed and rectified manually. This approach is particularly useful as we assimilate data from various sources such as experiments, DFT, and ML, among others. Third, we build the MAPP framework, characterized by a diverse array of material properties, the potential for iterative enhancement, and the prospect of model integration for systematic improvement. Our initial success on the melting temperature [19] demonstrates that the approach is viable and effective, thus prompting us to further extend our efforts to construct models for predicting four additional material properties: the bulk modulus, volume, heat of fusion, and critical temperature of a superconductor. Furthermore, the availability of a well-performing model for a given property enables us to surpass the limitation of using only the chemical formula as input. By integrating multiple models and leveraging the well-established model, we can enhance the performance of other models, thus developing an interconnected network of models where a model contributes to the improvement of the others.

2.1. Data

We have selected five distinct material properties to showcase the predictive power of our GNN model. These properties span a diverse range of material characteristics, encompassing mechanical, structural, electrical, and thermal properties. Specifically, we focus on the melting temperature, heat of fusion, bulk modulus, volume, and superconducting critical temperature. The performance of the melting temperature model was established in our prior research [19] and is not revisited here; the remaining four material properties are discussed below. Certain datasets, such as the melting temperature, heat of fusion, and superconducting critical temperature, are obtained from experiments. Conversely, datasets for the bulk modulus and volume are sourced from DFT computations. The acquisition of these datasets is facilitated either through standard materials science packages [20] or via web-based data crawling techniques.

2.1.1. Melting Temperature and Heat of Fusion

The datasets for both the melting temperature and heat of fusion were collected from a ten-volume compilation of thermodynamic constants of substances, the book “Thermodynamic properties of individual substances” [21], which is openly available in database format [22].
The melting temperature dataset contains 9375 materials, among which 982 compounds exhibit high melting temperatures with melting points exceeding 2000 K. While the majority of the dataset is derived from experimental findings, a small portion originates from DFT-based data generated by our first-principles calculation tool, SLUSCHI [23,24,25,26]. The dataset underlying this model predominantly consists of compounds with congruent melting temperatures. Hence, the melting temperature generated by this model is interpreted as the higher end of the solidus–liquidus temperature range. This interpretation similarly applies to compounds that decompose prior to melting. It is crucial to understand that this model does not predict solidus or liquidus temperatures.
Regarding the heat of fusion, the dataset comprises 774 data points. Upon inspecting the dataset, we identified an anomaly in the material GdPd3, leading to its manual removal from the dataset.

2.1.2. Bulk Modulus and Volume

The bulk modulus and volume data were queried from the Materials Project database [2] utilizing the Pymatgen package [20]. To ensure data consistency, we then filtered out less stable structures, applying a threshold of 10 meV energy above the convex hull. Consequently, the bulk modulus dataset contains 4236 entries, while the volume dataset consists of 49,213 entries. It is noteworthy that, although the initially queried materials span all dimensionalities (0D, 1D, 2D, and 3D), our finalized dataset is restricted to materials with 3D crystal structures, given that only these materials provide well-defined volume values. We note that the bulk modulus and volume data are based on DFT calculations at absolute zero.
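For illustration, a query and stability filter along these lines could look as follows. This is a minimal sketch, not the exact script used in this work; it assumes the legacy Materials Project API exposed through pymatgen's MPRester.query interface, and the field names (e_above_hull in eV/atom, the elasticity document with its K_VRH entry) are assumptions based on that legacy schema.

from pymatgen.ext.matproj import MPRester

with MPRester("YOUR_API_KEY") as mpr:
    # Keep only structures within 10 meV/atom of the convex hull.
    entries = mpr.query(
        criteria={"e_above_hull": {"$lte": 0.010}},
        properties=["pretty_formula", "nsites", "volume", "elasticity"],
    )

# Volume dataset: every stable 3D entry, using the cell volume per atom as the label.
volume_data = [(e["pretty_formula"], e["volume"] / e["nsites"]) for e in entries]

# Bulk modulus dataset: the subset of entries that carry an elasticity record.
bulk_modulus_data = [
    (e["pretty_formula"], e["elasticity"]["K_VRH"])
    for e in entries
    if e.get("elasticity")
]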

2.1.3. Superconducting Critical Temperature

In order to develop a universal model for predicting the superconducting critical temperature T_c over a wide range of T_c values, we utilized a comprehensive dataset from a previous study [27]. This dataset encompasses a wide range of superconductors, including cuprate-based, iron-based, and Bardeen–Cooper–Schrieffer (BCS) theory superconductors, which originate from the SuperCon Material Database, maintained by the National Institute for Materials Science (NIMS) in Japan [28]. This database is the most extensive and widely employed resource for data-driven research on superconductors, and it has been extensively utilized in previous studies [16,27,29,30].

2.2. Model Architecture

2.2.1. Element Embedding

In our approach, the material's chemical formula is depicted as a fully connected element graph G_e. Each element graph G_e comprises nodes and edges, denoted V and E, respectively. Figure 1a shows the process of converting the chemical formula into an element graph. To exemplify this, the material Li7Mn4CoO12 is visualized as an element graph with nodes corresponding to the elements lithium, manganese, cobalt, and oxygen. Each node represents a specific element and is connected to each neighboring node through a single edge. Each edge symbolizes the path for information exchange between paired elements. Specifically, the edges have no features and are treated equally, serving only to indicate the information-exchange routes. In constructing the element graph, each element within the chemical formula is associated with a node feature vector. Each node feature vector has 14 dimensions composed of elemental properties, such as the atomic mass, atomic number, melting temperature, boiling temperature, and electronegativity, as well as the composition of each element. The composition serves as an indicator of the relative importance of each element within the chemical formula of the material.
The node feature vector is denoted by x_v, with v symbolizing a specific node within the node set V. To ensure permutation invariance for all elements in the subsequent graph neural network section, each x_v undergoes a nonlinear transformation through an identical fully connected neural network with the same activation function. This transformation is expressed as

h_v^0 = ReLU(W_0 x_v + b_0), ∀ v ∈ V.

In this equation, ReLU denotes the rectified linear unit activation function. The term h_v^0 signifies the initial element embedding prior to the graph neural network layers, while the superscript indicates the specific layer within the graph neural network. Moreover, W_0 and b_0 denote the weight and bias parameters of the neural network.
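To make the construction concrete, the following is a minimal sketch (PyTorch plus pymatgen, not the authors' code) of building the node feature matrix of the fully connected element graph and applying the shared initial embedding. The three elemental properties used here stand in for the 14-dimensional feature vector described above, and the hidden width of 64 is a placeholder for the tuned hyperparameter.

import torch
import torch.nn as nn
from pymatgen.core import Composition

def formula_to_node_features(formula: str) -> torch.Tensor:
    """Build the node feature matrix of the element graph from a chemical formula.
    Illustrative features only: atomic number, atomic mass, electronegativity,
    plus the fractional composition of each element."""
    comp = Composition(formula).fractional_composition
    feats = []
    for el, frac in comp.items():
        feats.append([el.Z, float(el.atomic_mass), el.X, frac])
    return torch.tensor(feats, dtype=torch.float)   # shape: (num_elements, num_features)

class InitialEmbedding(nn.Module):
    """h_v^0 = ReLU(W_0 x_v + b_0), shared across all nodes for permutation invariance."""
    def __init__(self, in_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.linear = nn.Linear(in_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (num_elements, in_dim)
        return torch.relu(self.linear(x))                  # h^0: (num_elements, hidden_dim)

# Usage:
# x = formula_to_node_features("Li7Mn4CoO12")
# h0 = InitialEmbedding(in_dim=x.shape[1])(x)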

2.2.2. Graph Neural Network Section

After creating the element graph based on the chemical formula, the graph neural network model is used to transform element embeddings and, subsequently, the embedding of the whole element graph in order to capture more physical insights from the material and to perform the material property prediction better. Our task is to predict the material property using the element graph, as shown in Figure 1b, framing this as a graph-level regression task. Within the graph-level prediction task, the objective of the GNN is to learn a representation of the entire graph, the so-called material embedding A in Figure 1b. This is achieved by iteratively updating the node embeddings based on the neighboring information, also termed the message in the literature [31], and then aggregating the individual node representations to form the material representation, which is used to perform the regression. In the following paragraphs, the element embedding update process and the material embedding update process will be elaborated.
As shown in Figure 1b, the element embedding of the target node is updated by incorporating messages from all of the neighboring elements. This inductive bias arises from the understanding that the information from the neighboring nodes not only is relevant to the target node but also enriches the target node’s information.
There are generally two phases in the element embedding update process: the message-forming phase and the message-combining phase, as shown in Figure 1c. The message is created based on the embeddings of both the neighboring node and the current node and has the general form

m_v^(t-1) = Σ_{u ∈ N(v)} M_t(h_v^(t-1), h_u^(t-1)),

where m_v^(t-1) is the message for target node v at iteration t-1, N(v) is the set of the |V|-1 neighboring nodes of v, and |V| is the number of nodes in the entire graph. h_v^(t-1) and h_u^(t-1) denote the element embeddings of target node v and neighboring node u at iteration t-1. The message function M_t can take multiple forms, for example, adding, averaging, concatenating, etc., which all serve to preserve the information of the neighboring node. The target node's overall message is formed by aggregating the messages from all of the neighboring nodes, which preserves the permutation invariance of the chemical formula. In this study, the specific form of the message function is

m_v^(t-1) = Σ_{u ∈ N(v)} ReLU(W^(t-1) Add(h_v^(t-1), h_u^(t-1)) + b^(t-1)),

where W^(t-1) and b^(t-1) are the weight and bias of the neural network at iteration t-1. Figure 1c illustrates the message-forming procedure. The messages from the elements Co, Mn, and Li are first created by adding the element embedding of the target node, h_O, to that of each neighboring node, h_u. Then, all of the individual messages are passed through an identical neural network layer. The overall message m_O is formed by aggregating all of the individual messages.
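The message-forming step can be written compactly in code. The following is an illustrative PyTorch sketch, not the authors' implementation: it adds the target and neighbor embeddings pairwise, passes the sums through one shared dense layer with ReLU, and sums the per-neighbor messages, which keeps the operation permutation invariant.

import torch
import torch.nn as nn

class MessageLayer(nn.Module):
    """Form m_v by summing ReLU(W (h_v + h_u) + b) over the |V|-1 neighbors u of v."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)   # shared W^(t-1), b^(t-1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:    # h: (n_nodes, dim)
        n = h.shape[0]
        pair_sum = h.unsqueeze(1) + h.unsqueeze(0)          # (n, n, dim), entry [v, u] = h_v + h_u
        msg = torch.relu(self.linear(pair_sum))             # per-neighbor messages
        mask = 1.0 - torch.eye(n, device=h.device)          # exclude the self-pair (u = v)
        return (msg * mask.unsqueeze(-1)).sum(dim=1)        # m_v: sum over all neighbors u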
After generating the message m_v^(t-1), h_v is updated by summing the message vector m_v^(t-1) with the node embedding from the previous layer, h_v^(t-1):

h_v^t = ReLU(W^(t-1) Add(h_v^(t-1), m_v^(t-1)) + b^(t-1)).

The material embedding A^t for layer t is constructed by aggregating all element embeddings, a process illustrated in Figure 1b:

A^t = Σ_{v ∈ V} h_v^t.
The element and material embedding update process will last for T iterations, which is a hyperparameter to tune.
In our architecture, we record the material embedding A from every iteration, accumulating them in part to construct the final material embedding, A_final. In general, A_final may be constituted in two ways: it can either be the material embedding A^T derived from the final iteration of the graph neural network, or an aggregation of the material embedding from the T-th iteration along with those from earlier iterations. This method helps conserve information from preceding GNN layers, potentially mitigating the over-smoothing issue often encountered in GNN applications.
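Putting the update and readout together, one message-passing iteration can be sketched as follows (again an illustrative PyTorch sketch under the same assumptions as the previous block; MessageLayer is the class defined there, and the hidden width of 64 is an arbitrary choice).

import torch
import torch.nn as nn

class GNNBlock(nn.Module):
    """One iteration: h_v^t = ReLU(W (h_v^(t-1) + m_v^(t-1)) + b), then A^t = sum_v h_v^t."""
    def __init__(self, dim: int):
        super().__init__()
        self.message = MessageLayer(dim)   # from the previous sketch
        self.update = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor):
        m = self.message(h)                       # (n_nodes, dim)
        h_new = torch.relu(self.update(h + m))    # element embedding update
        A = h_new.sum(dim=0)                      # material embedding A^t, permutation invariant
        return h_new, A

# Stacking T blocks and accumulating the per-iteration material embeddings yields A_final,
# which retains information from earlier layers and helps mitigate over-smoothing:
# h = initial_embedding; A_final = 0
# for block in [GNNBlock(64) for _ in range(T)]:
#     h, A = block(h); A_final = A_final + A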

2.2.3. Property Prediction Section with ResNet Architecture

After a material embedding is generated by the GNN section, it is fed into fully connected layers with a ResNet architecture [32] for further nonlinear transformation. The ResNet layer has an advantage over a plain dense layer in that it mitigates exploding and vanishing gradients, thereby helping the model converge to a good solution more easily. After N ResNet transformations and a final regression layer, the model outputs the predicted value of the material property.
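A minimal sketch of this prediction head is shown below (illustrative PyTorch code, not the authors' implementation; the number of residual blocks and the layer widths are placeholders for the tuned hyperparameters).

import torch
import torch.nn as nn

class ResNetBlock(nn.Module):
    """A fully connected residual block; the skip connection eases gradient flow."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + self.fc2(torch.relu(self.fc1(x))))

class PropertyHead(nn.Module):
    """N residual blocks followed by a single regression neuron that outputs the property."""
    def __init__(self, dim: int, n_blocks: int = 3):
        super().__init__()
        self.blocks = nn.Sequential(*[ResNetBlock(dim) for _ in range(n_blocks)])
        self.out = nn.Linear(dim, 1)

    def forward(self, a_final: torch.Tensor) -> torch.Tensor:   # a_final: (batch, dim)
        return self.out(self.blocks(a_final))                   # predicted material property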
To summarize the aforementioned model architecture and material prediction process, our GNN model’s standard workflow for predicting material properties is depicted in Figure 1d, demonstrating the comprehensive, end-to-end capability of the GNN. In this model, the chemical formula, as the sole input, is encoded as a series of elemental embeddings. Within the GNN layers, these elemental embeddings undergo T iterations of updates, aggregating to form an ultimate material embedding. This embedding is subsequently employed for regression within the ResNet block.

2.2.4. Ensemble Model and Uncertainty Estimation

To further increase the model accuracy, quantify prediction uncertainty, detect outliers, and perform comprehensive data analysis, we propose an ensemble model based on the bootstrap method. The ensemble consists of 30 independent GNN models trained on different samples from the original dataset, with the final model performance derived from aggregating the performance metrics of the 30 individual models, namely the coefficient of determination (R² score), root-mean-square error (RMSE), and mean absolute error (MAE). The training data for each individual model are randomly sampled with the bootstrap method from the original dataset. The testing set is composed of the data not included in each sampling process. The choice of 30 models in our ensemble is strategically made to ensure that every data point is included in both the training and testing datasets at least once, allowing us to comprehensively evaluate in-bag and out-of-bag performance across the dataset.
The ensemble model increases the robustness of the individual deep learning models. Model diversity is ensured because each model is trained on a different subset of the original dataset. This diversity can improve accuracy, since the errors made by different models may cancel each other out when aggregated. Moreover, overfitting tends to be averaged out, as each individual model in the ensemble sees a different subset of the data, which helps the ensemble generalize better to unseen data.
Additionally, the ensemble model allows for the quantification of uncertainty and facilitates a comprehensive analysis of outliers within the dataset. By ensuring that every data point appears in both the training and testing sets at least once, we establish a systematic basis for assessing the model's predictive performance for each individual material. This approach also guarantees that the training set closely represents the overall distribution of the original dataset, thereby providing a robust evaluation of the model's performance under various scenarios. By analyzing the statistics of the prediction errors, specifically the mean, median, and maximum differences between predicted and actual values, we can identify materials that exhibit significant deviations in both the training and testing phases. Such anomalies may warrant a closer examination to determine their causes. These inconsistencies could stem from incorrect data entries, which should be removed from the original dataset to enhance accuracy. Alternatively, they might represent unusual statistical distributions that our current model fails to capture, indicating a need for more sophisticated data handling or model adjustments to accommodate these exceptions.
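The bootstrap splitting and ensemble aggregation can be sketched as follows. This is an illustrative NumPy sketch: the choice of 30 resamples follows the text, while aggregating predictions by their mean and standard deviation is one common way to obtain a point estimate and an uncertainty.

import numpy as np

def bootstrap_splits(n_samples: int, n_models: int = 30, seed: int = 0):
    """For each of the 30 models, draw an in-bag training set with replacement;
    the never-drawn samples form that model's out-of-bag testing set."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_models):
        in_bag = rng.integers(0, n_samples, size=n_samples)
        out_of_bag = np.setdiff1d(np.arange(n_samples), in_bag)
        splits.append((in_bag, out_of_bag))
    return splits

def ensemble_predict(models, x):
    """Aggregate the individual models: the mean is the ensemble prediction,
    and the spread across models quantifies the prediction uncertainty."""
    preds = np.stack([m(x) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)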

3. Results and Discussion

3.1. Bulk Modulus

The bulk modulus, denoted by k_VRH, quantifies a material’s resistance to uniform compression. Understanding the bulk modulus is essential for researchers investigating the elastic properties of materials. Developing a deep learning model to predict the bulk modulus could facilitate the discovery of ultra-compressible materials [33,34].
The data distribution of the bulk modulus dataset used in our work is shown in Figure 2a. Using the aforementioned element feature generation method and modeling each material as an element graph, a bulk modulus model was trained on 4236 bulk modulus entries derived from DFT calculations. Various hyperparameters were tuned, including the number of neurons per hidden layer, the number of ResNet layers, the number of graph neural network layers, the dropout rate, and the batch size.
According to Figure 3a, the model’s loss function converges after 2000 epochs. The coefficient of determination (R² score) for a single GNN model is 0.95 for the testing set. The parity plot for the testing bulk modulus dataset is shown in Figure 4a, which showcases the difference between the labeled and predicted bulk modulus values. The RMSE and MAE for the test set are 17.04 and 9.96 GPa, respectively. Moreover, we trained an ensemble model of 30 individual models with the same hyperparameters but with different training and testing sets generated by the bootstrap method. The ensemble achieved a testing R², RMSE, and MAE of 0.93, 19.41 GPa, and 11.2 GPa, respectively. A summary of the R² scores, RMSEs, and MAEs of the single and ensemble bulk modulus models is given in Table 1.

3.2. Volume

The volume data display significant variability, with a minimum value of 11 Å³ and a maximum of 7132 Å³. This diversity poses difficulties in model fitting. To address this challenge, we employ the volume per atom as the target label, leading to a more constrained predictive range. The data distribution after calculating the volume-per-atom values is shown in Figure 2b. After training for 2000 epochs, the loss function of the single GNN model converges to an optimal level, as shown in Figure 3b, achieving an R² score of 0.97 for the testing set. The RMSE and MAE of the single model are 1.56 Å³ and 0.65 Å³, respectively. The ensemble model achieves an average R² score, RMSE, and MAE of 0.97, 1.36 Å³, and 0.84 Å³ for the testing set across 30 distinct models. More detailed results of the model’s performance can be found in Table 1.

3.3. Superconducting Critical Temperature

The critical temperature (T_c) dataset can be divided into three groups: 2339 iron-based, 10,838 copper-based, and 8535 other types of superconductors. The category labeled as “other superconductors” primarily consists of materials explained by the BCS theory, which is effective for low-temperature superconductors. However, this theory does not adequately explain the superconducting behavior in high-temperature superconductors such as the copper-based and iron-based ones, as highlighted in previous studies [35,36]. In the pursuit of high-temperature superconductors, deep learning models play a crucial role: they expand the chemical space that can be analyzed through high-throughput computational screening, thus speeding up the discovery of new materials. Once materials with a high potential for high T_c are identified, experimental validation can be conducted. The T_c model thus directs material synthesis efforts toward a more promising subset of candidate high-temperature superconductors. The T_c data distribution is shown in Figure 2c. The threshold for a high T_c is 30 K [37]; inspecting the dataset reveals that 8848 superconductors have T_c values above this threshold. The model achieved convergence after approximately 6000 epochs, as depicted in Figure 3c. Following hyperparameter tuning, the best single T_c model exhibited an R² score of 0.91 for the testing set. Additionally, the RMSE and MAE are 10.16 and 6.91 K for the testing set. The parity plot of the T_c model is shown in Figure 4c.
After creating a single GNN model, we trained an ensemble model based on 30 distinct models using data generated via the bootstrap method. The average performance metrics are as follows: the testing R² score is 0.88, while the RMSE and MAE are 12.64 and 7.54 K, respectively. The single-model and ensemble-model performance metrics are shown in Table 1.

3.4. Heat of Fusion

The heat-of-fusion dataset currently contains a significantly smaller number of data points: 742. Within this dataset, the label employed is the heat of fusion per number of atoms in the chemical formula, rather than the absolute heat-of-fusion value. This normalization shifts the model’s focus toward discerning the average contribution of each atom to the heat of fusion, mitigating the influence of the compound’s size. The distribution of these data is shown in Figure 2d. The R² score, RMSE, and MAE of the heat-of-fusion model are 0.70, 1.15 kcal/mol, and 0.74 kcal/mol, respectively, for the testing set.
Enhancing the accuracy of this model can be achieved by either collecting more data [23,25] or harnessing the multi-task learning technique [38] to facilitate the training of the target material property, in this case, the heat of fusion. The multi-task learning process leverages an auxiliary material property characterized by a larger dataset size, better data quality, and an inherent correlation with the target property to aid the training process of the target task.
We utilized the melting temperature dataset, discussed previously [19], as an auxiliary task to facilitate the training of the heat-of-fusion task. This study adopted the hard parameter-sharing multi-task learning architecture, as illustrated in Figure 5. This architecture employs shared GNN layers and weights across both material properties, effectively expanding the material representation space through simultaneous training on both the melting temperature and heat-of-fusion datasets. Following these shared layers, the model utilizes two separate ResNet architectures tailored to the regression tasks corresponding to each property.
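A minimal sketch of such a hard parameter-sharing model is given below. This is illustrative PyTorch code under the same assumptions as the earlier sketches: shared_gnn stands for the stacked GNN blocks that produce the material embedding, PropertyHead is the ResNet head from the earlier sketch, and the equal loss weighting in the comment is a placeholder choice.

import torch
import torch.nn as nn

class MultiTaskMAPP(nn.Module):
    """Hard parameter sharing: one set of GNN layers feeds two property-specific ResNet heads."""
    def __init__(self, shared_gnn: nn.Module, dim: int):
        super().__init__()
        self.shared_gnn = shared_gnn            # maps node features to the material embedding A_final
        self.head_tm = PropertyHead(dim)        # melting-temperature head (large auxiliary dataset)
        self.head_hf = PropertyHead(dim)        # heat-of-fusion head (small target dataset)

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        a_final = self.shared_gnn(x)
        return self.head_tm(a_final) if task == "tm" else self.head_hf(a_final)

# Training mixes batches from both datasets; both losses update the shared layers, e.g.
# loss = mse(model(x_tm, "tm"), y_tm) + mse(model(x_hf, "hf"), y_hf)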
The loss curve for the heat-of-fusion model after applying multi-task learning is shown in Figure 3d; the model starts to converge at around 4000 epochs. After employing multi-task learning, the testing R² score of the heat-of-fusion model improved from 0.70 to 0.74, and the testing RMSE and MAE are 1.01 and 0.67 kcal/mol. The parity plot of the heat-of-fusion model after multi-task learning is shown in Figure 4d. This improvement highlights the ability of multi-task learning to enhance the training of a machine learning model. Furthermore, it illustrates an alternative way to expand the model’s representational capacity without necessarily increasing the number of data points.
Due to the small size of the dataset, only the performance metrics of the ensemble model are presented, as the performance of a single model exhibits large fluctuations.

3.5. Discussion

Our MAPP framework has been applied to five material properties, four of which are illustrated in this work, while the melting temperature is presented thoroughly in our previous articles [18,19,39]. All of the above-mentioned models have been deployed and are publicly accessible via our group’s website. Notably, our model is distinguished by its input simplicity, requiring only the chemical formula, and by its robust performance, both merits deriving largely from the graph neural network’s inherent ability to consolidate local information. Computational costs for training our models vary, but all models, including the ensembles, complete training within 60 h on an NVIDIA A100 GPU.
Our ensemble model is capable of uncertainty quantification, which serves to identify materials that yield high prediction errors, potentially flagging outliers. We employ the bootstrap method to segregate data into training and testing subsets, also called in-bag and out-of-bag sets. This process creates 30 independent models, and the final performance of the ensemble model is determined by the aggregated outputs of these individual models. The employment of 30 separate bootstrap iterations ensures comprehensive coverage, with each data point in the original dataset featuring in the testing set at least once. Consequently, the in-bag and out-of-bag errors serve as metrics for assessing our model’s accuracy in predicting the properties of specific materials.
We showcase the potential of multi-task learning in our study by utilizing the sizable melting temperature dataset to augment the training of the significantly smaller heat-of-fusion dataset, especially beneficial in scenarios of limited data availability. This approach has resulted in considerable enhancements in model accuracy. Despite these advancements, the model’s performance has not yet reached an ideal level, which we attribute primarily to the data’s limited quantity and suboptimal quality, rather than to any deficiencies in the model’s design.
Alternatively, active learning could be employed to expand the dataset. This method would involve leveraging a first-principles calculation pipeline, such as the SLUSCHI package [25], tailored for computing high-temperature material properties using DFT. Through this pipeline, we generate a more comprehensive array of data points for properties such as the melting temperature and heat of fusion. Such enrichment of the dataset holds promise for significantly enhancing the deep learning model’s accuracy. Nevertheless, the implementation of this approach falls outside the scope of the current study.
Nonetheless, it is important to acknowledge the limitations of our existing model framework. Our current model relies solely on the chemical formula as input. While this simplifies the input process and maximizes general applicability, it also brings certain limitations: it is unable to differentiate between polymorphs—materials sharing the same formula but exhibiting distinct crystal structures. For example, it cannot distinguish between diamond and graphite. To overcome this limitation, we will consider incorporating crystal structure information into the inputs, which we plan to undertake once we have developed a robust model capable of crystal structure prediction.
Contrary to existing methodologies, such as the crystal graph convolutional neural network (CGCNN) approach [40], that necessitate predefined crystal structures, our model adopts a new approach that requires only the chemical formula as input, allowing for broader applicability to materials whose structures are not yet determined. The developed models are integrated into the publicly accessible MAPP framework, promoting wider use and enabling users without deep technical knowledge to make predictions based solely on chemical formulas, often the only a priori known input for new materials.

4. Conclusions

We introduce the MAPP framework, an extensive platform for material property prediction capable of delivering comprehensive material data based solely on chemical formula input. Utilizing our generic graph neural network approach, we have successfully developed five robust models for predicting solid-state material properties. These models, solely based on chemical formulas, bypass the need for manual feature engineering. With minimal input of physical information, our models enable us to explore the entire high-dimensional chemical space, exhibiting good performance and remarkable adaptability across various material property prediction tasks and demonstrating their potential in combinatorial material screenings. We further enhanced the platform by incorporating ensemble models, which allow for the systematic identification of uncertainties and outliers, improving the overall performance. This paper also sheds light on our strategy of harnessing inter-property correlations to enrich individual model learning. We demonstrate how the multi-task learning approach substantially improves the model’s performance in predicting the heat of fusion.
We have designed a user-friendly web application [41] for rapid predictions of material properties. This platform allows public access to our models, enabling users to obtain both material property predictions and associated uncertainties. In addition, we built Application Programming Interfaces (APIs) [42] equipped with batch-processing capabilities. Through these tools, users can perform the following tasks:
  • Evaluate material properties across large datasets;
  • Run interactive simulations for the design and discovery of materials with extreme properties;
  • Include material properties as new features for their models.
Based on traffic analysis, our websites and APIs have so far performed over 7000 and 300,000 calculations for our users, respectively. The melting temperature model is featured by the Materials Project team on their webpage [2].
In future work, we aim to extend the capabilities of our model to differentiate between polymorphs by incorporating structural information alongside chemical formulas. This development will enhance the model’s accuracy in predicting properties of materials with identical chemical compositions but different crystal structures. Additionally, we plan to explore the integration of active learning methodologies to dynamically refine the model based on new data as they become available (for example, from our DFT SLUSCHI package [23,24,25,26]), thereby continually improving its predictive performance. Further efforts will also focus on scaling the MAPP framework to handle larger datasets and more complex material properties, making it even more robust and versatile for users across different scientific disciplines.

Author Contributions

Conceptualization, Q.-J.H.; methodology, Q.-J.H.; software, Q.-J.H.; formal analysis, S.-D.X.; investigation, S.-D.X. and Q.-J.H.; resources, Q.-J.H.; writing—original draft preparation, S.-D.X.; writing—review and editing, Q.-J.H.; visualization, S.-D.X.; supervision, Q.-J.H.; project administration, Q.-J.H.; funding acquisition, Q.-J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Army Research Office (ARO) of the U.S. Department of Defense under the Multidisciplinary University Research Initiative W911NF-23-2-0145.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All of the models have been deployed and are publicly accessible via our group’s website [41], as well as our API tool [42].

Acknowledgments

We appreciate the start-up funding provided by the School of Engineering for Matter, Transport, and Energy (SEMTE) and the use of Research Computing at Arizona State University.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Choudhary, K.; Garrity, K.F.; Reid, A.C.; DeCost, B.; Biacchi, A.J.; Hight Walker, A.R.; Trautt, Z.; Hattrick-Simpers, J.; Kusne, A.G.; Centrone, A.; et al. The joint automated repository for various integrated simulations (JARVIS) for data-driven materials design. NPJ Comput. Mater. 2020, 6, 173. [Google Scholar] [CrossRef]
  2. Jain, A.; Ong, S.P.; Hautier, G.; Chen, W.; Richards, W.D.; Dacek, S.; Cholia, S.; Gunter, D.; Skinner, D.; Ceder, G.; et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 2013, 1, 011002. [Google Scholar] [CrossRef]
  3. Kirklin, S.; Saal, J.E.; Meredig, B.; Thompson, A.; Doak, J.W.; Aykol, M.; Rühl, S.; Wolverton, C. The Open Quantum Materials Database (OQMD): Assessing the accuracy of DFT formation energies. NPJ Comput. Mater. 2015, 1, 15010. [Google Scholar] [CrossRef]
  4. Curtarolo, S.; Setyawan, W.; Wang, S.; Xue, J.; Yang, K.; Taylor, R.H.; Nelson, L.J.; Hart, G.L.; Sanvito, S.; Buongiorno-Nardelli, M.; et al. AFLOWLIB.ORG: A distributed materials properties repository from high-throughput ab initio calculations. Comput. Mater. Sci. 2012, 58, 227–235. [Google Scholar] [CrossRef]
  5. Hellenbrandt, M. The inorganic crystal structure database (ICSD)—Present and future. Crystallogr. Rev. 2004, 10, 17–22. [Google Scholar] [CrossRef]
  6. Zakutayev, A.; Wunder, N.; Schwarting, M.; Perkins, J.D.; White, R.; Munch, K.; Tumas, W.; Phillips, C. An open experimental database for exploring inorganic materials. Sci. Data 2018, 5, 1–12. [Google Scholar]
  7. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  8. Choudhary, K.; DeCost, B.; Chen, C.; Jain, A.; Tavazza, F.; Cohn, R.; Park, C.W.; Choudhary, A.; Agrawal, A.; Billinge, S.J.; et al. Recent advances and applications of deep learning methods in materials science. NPJ Comput. Mater. 2022, 8, 59. [Google Scholar] [CrossRef]
  9. Hong, Y.; Hou, B.; Jiang, H.; Zhang, J. Machine learning and artificial neural network accelerated computational discoveries in materials science. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2020, 10, e1450. [Google Scholar] [CrossRef]
  10. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81. [Google Scholar] [CrossRef]
  11. Jha, D.; Ward, L.; Paul, A.; Liao, W.k.; Choudhary, A.; Wolverton, C.; Agrawal, A. Elemnet: Deep learning the chemistry of materials from only elemental composition. Sci. Rep. 2018, 8, 17593. [Google Scholar] [CrossRef] [PubMed]
  12. Zheng, X.; Zheng, P.; Zhang, R.Z. Machine learning material properties from the periodic table using convolutional neural networks. Chem. Sci. 2018, 9, 8426–8432. [Google Scholar] [CrossRef]
  13. Le, T.D.; Noumeir, R.; Quach, H.L.; Kim, J.H.; Kim, J.H.; Kim, H.M. Critical temperature prediction for a superconductor: A variational bayesian neural network approach. IEEE Trans. Appl. Supercond. 2020, 30, 1–5. [Google Scholar] [CrossRef]
  14. Schmidt, J.; Pettersson, L.; Verdozzi, C.; Botti, S.; Marques, M.A. Crystal graph attention networks for the prediction of stable materials. Sci. Adv. 2021, 7, eabi7948. [Google Scholar] [CrossRef] [PubMed]
  15. Allotey, J.; Butler, K.T.; Thiyagalingam, J. Entropy-based active learning of graph neural network surrogate models for materials properties. J. Chem. Phys. 2021, 155, 174116. [Google Scholar] [CrossRef]
  16. Stanev, V.; Oses, C.; Kusne, A.G.; Rodriguez, E.; Paglione, J.; Curtarolo, S.; Takeuchi, I. Machine learning modeling of superconducting critical temperature. NPJ Comput. Mater. 2018, 4, 29. [Google Scholar] [CrossRef]
  17. Zhang, J.; Liu, X.; Bi, S.; Yin, J.; Zhang, G.; Eisenbach, M. Robust data-driven approach for predicting the configurational energy of high entropy alloys. Mater. Des. 2020, 185, 108247. [Google Scholar] [CrossRef]
  18. Hong, Q.J.; van de Walle, A.; Ushakov, S.V.; Navrotsky, A. Integrating computational and experimental thermodynamics of refractory materials at high temperature. Calphad 2022, 79, 102500. [Google Scholar] [CrossRef]
  19. Hong, Q.J.; Ushakov, S.V.; van de Walle, A.; Navrotsky, A. Melting temperature prediction using a graph neural network model: From ancient minerals to new materials. Proc. Natl. Acad. Sci. USA 2022, 119, e2209630119. [Google Scholar] [CrossRef] [PubMed]
  20. Ong, S.P.; Richards, W.D.; Jain, A.; Hautier, G.; Kocher, M.; Cholia, S.; Gunter, D.; Chevrier, V.L.; Persson, K.A.; Ceder, G. Python Materials Genomics (pymatgen): A robust, open-source python library for materials analysis. Comput. Mater. Sci. 2013, 68, 314–319. [Google Scholar] [CrossRef]
  21. Glushko, V.P.; Gurvich, L. Thermodynamic Properties of Individual Substances: Volume 1, Parts 1 and 2. 1988. Available online: https://www.osti.gov/biblio/6862010 (accessed on 20 August 2024).
  22. Database of Thermodynamic Properties of Individual Substances. Available online: http://www.chem.msu.su/cgi-bin/tkv.pl?show=welcome.html (accessed on 7 July 2024).
  23. Hong, Q.J.; van de Walle, A. Direct first-principles chemical potential calculations of liquids. J. Chem. Phys. 2012, 137, 094114. [Google Scholar] [CrossRef]
  24. Hong, Q.J.; van de Walle, A. Solid-liquid coexistence in small systems: A statistical method to calculate melting temperatures. J. Chem. Phys. 2013, 139, 094114. [Google Scholar] [CrossRef] [PubMed]
  25. Hong, Q.J.; van de Walle, A. A user guide for SLUSCHI: Solid and Liquid in Ultra Small Coexistence with Hovering Interfaces. Calphad 2016, 52, 88–97. [Google Scholar] [CrossRef]
  26. Hong, Q.J.; Liu, Z.K. A generalized approach for rapid entropy calculation of liquids and solids. arXiv 2024, arXiv:2403.19872. [Google Scholar]
  27. Hamidieh, K. A data-driven statistical model for predicting the critical temperature of a superconductor. Comput. Mater. Sci. 2018, 154, 346–354. [Google Scholar] [CrossRef]
  28. Materials Data Repository SuperCon Datasheet. Available online: https://mdr.nims.go.jp/collections/5712mb227 (accessed on 7 July 2024).
  29. Zeng, S.; Zhao, Y.; Li, G.; Wang, R.; Wang, X.; Ni, J. Atom table convolutional neural networks for an accurate prediction of compounds properties. NPJ Comput. Mater. 2019, 5, 84. [Google Scholar] [CrossRef]
  30. Konno, T.; Kurokawa, H.; Nabeshima, F.; Sakishita, Y.; Ogawa, R.; Hosako, I.; Maeda, A. Deep learning model for finding new superconductors. Phys. Rev. B 2021, 103, 014509. [Google Scholar] [CrossRef]
  31. Gilmer, J.; Schoenholz, S.S.; Riley, P.F.; Vinyals, O.; Dahl, G.E. Neural message passing for quantum chemistry. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1263–1272. [Google Scholar]
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  33. Mansouri Tehrani, A.; Oliynyk, A.O.; Parry, M.; Rizvi, Z.; Couper, S.; Lin, F.; Miyagi, L.; Sparks, T.D.; Brgoch, J. Machine Learning Directed Search for Ultraincompressible, Superhard Materials. J. Am. Chem. Soc. 2018, 140, 9844–9853. [Google Scholar] [CrossRef] [PubMed]
  34. Kaner, R.B.; Gilman, J.J.; Tolbert, S.H. Designing Superhard Materials. Science 2005, 308, 1268–1269. [Google Scholar] [CrossRef]
  35. Bardeen, J.; Cooper, L.N.; Schrieffer, J.R. Theory of Superconductivity. Phys. Rev. 1957, 108, 1175–1204. [Google Scholar] [CrossRef]
  36. Mann, A. High-temperature superconductivity at 25: Still in suspense. Nature 2011, 475, 280–282. [Google Scholar] [CrossRef] [PubMed]
  37. Bednorz, J.G.; Müller, K.A. Possible high T_c superconductivity in the Ba-La-Cu-O system. Z. Phys. B Condens. Matter 1986, 64, 189–193. [Google Scholar] [CrossRef]
  38. Crawshaw, M. Multi-task learning with deep neural networks: A survey. arXiv 2020, arXiv:2009.09796. [Google Scholar]
  39. Hong, Q.J. Melting temperature prediction via first principles and deep learning. Comput. Mater. Sci. 2022, 214, 111684. [Google Scholar] [CrossRef]
  40. Xie, T.; Grossman, J.C. Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties. Phys. Rev. Lett. 2018, 120, 145301. [Google Scholar] [CrossRef] [PubMed]
  41. Hong, Q.J. MAPP. 2023. Available online: https://faculty.engineering.asu.edu/hong/materials-properties-prediction-mapp/ (accessed on 7 July 2024).
  42. MAPP-API. Available online: https://github.com/qjhong/mapp_api (accessed on 7 July 2024).
Figure 1. (a) Converting a chemical formula to an element graph, (b) the overall illustration of the graph neural network section and ResNet section, (c) the detailed process of the element embedding update, including the message-forming phase and the message-combining phase, (d) the overall workflow of an end-to-end graph neural network deep learning framework. The illustration shows a direct mapping from chemical formulas to material properties.
Figure 2. The histograms of (a) bulk modulus (k_VRH) [GPa], (b) volume/atom [Å³], (c) superconducting critical temperature (T_c) [K], (d) heat of fusion [kcal/mol].
Figure 3. The log-scaled training and validation loss functions of the (a) bulk modulus (k_VRH) [GPa], (b) volume/atom [Å³], (c) superconducting critical temperature (T_c) [K], and (d) heat of fusion [kcal/mol] models. The loss function used in this work is the mean squared error.
Figure 4. The parity plots of (a) bulk modulus (k_VRH) [GPa], (b) volume/atom [Å³], (c) superconducting critical temperature (T_c) [K], (d) heat of fusion/atom [kcal/mol]. The labeled values and the predicted values for the specific material properties are shown in the parity plots.
Figure 5. A diagram of multi-task learning utilizing the large melting temperature dataset to augment the training of the significantly smaller heat-of-fusion dataset. Up to four elements are connected in this graph (denoted as A, B, C, and D).
Table 1. Performance metrics for the superconducting critical temperature, volume, bulk modulus, and heat-of-fusion models. Both the single model and the ensemble model are included. The ensemble model, while generally enhancing robustness and reliability through bootstrap aggregation, may display slightly inferior performance metrics compared to a single model due to the averaging of results from multiple bootstrap samples.

Material Property                                      | Single Model            | Ensemble Model
                                                       | R²     RMSE    MAE      | R²     RMSE    MAE
Bulk modulus (k_VRH) [GPa]                             | 0.95   17.04   9.96     | 0.93   19.41   11.2
Unit cell volume/atom [Å³]                             | 0.97   1.56    0.65     | 0.97   1.36    0.84
Superconducting critical temperature (T_c) [K]         | 0.91   10.16   6.91     | 0.88   12.64   7.54
Heat of fusion/atom [kcal/mol]                         | -      -       -        | 0.70   1.15    0.74
Heat of fusion/atom (multi-task learning) [kcal/mol]   | -      -       -        | 0.74   1.01    0.67