Wenzhou TE: A First-Principle-Calculated Thermoelectric Materials Database

Since the launch of the Materials Genome Initiative by the Obama administration in the United States, the development of computational materials databases has fundamentally expanded the options available to industries such as materials and energy. In the field of thermoelectric materials, the thermoelectric figure of merit (ZT) quantifies the performance of a material. For large sets of materials, however, ZT values are not easily obtained because of the computational complexity involved. Here, we show how to build a database of thermoelectric materials based on first-principles calculations of electronic and heat transport. First, the initial structures are classified according to bandgap and other basic properties using the K-means clustering algorithm, and high-throughput first-principles calculations are carried out for the narrow-bandgap semiconductors, which are potential thermoelectric materials. The present calculation framework mainly includes a deformation potential module, an electrical transport module, and a mechanical and thermodynamic properties module. We have also set up a search webpage for the calculated database, which allows users to search for materials and view their physical properties. Our work may inspire the construction of more first-principles computational databases of thermoelectric materials and accelerate research progress in the field of thermoelectrics.


Introduction
In 2011, the Obama administration of the United States officially proposed the "Materials Genome Initiative", which uses high-throughput computing and experiments to obtain massive amounts of materials data, combined with AI-based data analysis, for the development of new materials. The goal is to shorten the cycle of new materials development and application and to reduce the cost of materials research and development, so that the United States can maintain a leading position in manufacturing technology. In 2016, the US government released the report "First Five Years of the Materials Genome Initiative: Accomplishments and Technical Highlights", which pointed out that during the first five years of the program, federal agencies such as the Department of Energy, the Department of Defense, the National Science Foundation, the National Institute of Standards and Technology, and the National Aeronautics and Space Administration invested over 500 million US dollars. This investment established computational materials research and development centers, including the National Network for Virtual High-throughput Preparation (NIST & NREL) and the Center for Cross-scale Material Design and Multi-scale Materials Research (NIST, ANL, ARL), and produced three major computational materials databases, the Materials Project (MP) [1], AFLOW [2], and OQMD [3,4], along with several auxiliary databases such as the Materials Data Repository (MDR), the Materials Resource Registry, and the Energy Materials Network, as well as related analysis tools.
Shortly after the United States proposed the Materials Genome Initiative, the European Science Foundation launched the Accelerated Metallurgy (ACCMET) program, at a cost of over 2 billion euros, with the aim of keeping pace with the United States. In 2015, the European Commission funded the Horizon 2020 project NOMAD, led by the Max Planck Institute in Germany, for a period of three years. The project uses a "centralized data warehouse" approach to bring together various research groups and provide data related to computational materials science, with the aim of building an "Encyclopedia of Materials" and tools for big-data analysis of materials. In the UK, the government has funded the e-Science program to carry out high-throughput materials simulations and build basic computational materials databases, such as eMinerals and the "Material Grid" project. EPFL in Switzerland has led the development of the European materials database AiiDA [5].
Nowadays, with the vigorous development of big data and artificial intelligence (AI), materials genome research characterized by high-throughput experiments, high-throughput computing, and AI-based big-data analysis is in full swing and has shown clear advantages in many materials fields. The paper "Machine-learning-assisted materials discovery using failed experiments", published in Nature in May 2016 [6], showed that new materials can be discovered by applying AI to years of accumulated experimental data. This work indicates that AI will profoundly transform research methods in the field of materials. The centuries-long history of science has produced three research paradigms: experimental, theoretical, and computational. However, in complex systems such as those in biology, astronomy, and materials science, the interactions are very complex and the number of variables is large, which greatly limits the effectiveness of theoretical and computational approaches and calls for the combination of big data and AI as a "fourth paradigm". In 2017, AlphaGo defeated the world's top-ranked human Go player, after which DeepMind retired the program from competitive play and redirected its effort toward scientific applications of AI, including the search for new materials. At present, American high-tech companies including Apple, Google, IBM, and Tesla are all investing in the use of AI for the research and development of new materials based on materials genomics methods. The fourth paradigm of materials science requires the ability to generate and process massive amounts of data, so obtaining such data has become a key part of the Materials Genome Initiative. With the improvement of computing power, the accumulation of materials data through high-throughput computing is receiving more and more attention, and its use in the research and development of new thermoelectric materials is expected to greatly accelerate their application.
The performance of thermoelectric materials is described by the figure of merit ZT, which can be expressed as follows:
$$ZT = \frac{S^2 \sigma T}{\kappa_e + \kappa_l}$$
where $S$ is the Seebeck coefficient, $\sigma$ is the electrical conductivity, $T$ is the temperature, and $\kappa_e$ and $\kappa_l$ are the thermal conductivities contributed by carriers and phonons, respectively. The parameters $S$, $\sigma$, and $\kappa_e$ are coupled with each other, and it is difficult to tune them independently. For example, in semiconductors, increasing the doping concentration raises the conductivity but at the same time reduces the Seebeck coefficient and increases the carrier thermal conductivity. At present, the three major materials databases, Materials Project, AFLOW, and OQMD, hold data on several common physical quantities, including atomic and band structures, and other physical properties are being added. However, the thermoelectric performance of materials generally requires a large amount of computation because of the complexity of calculating electrical and thermal transport properties.
Here we select the Materials Project as the structural source for constructing the thermoelectric materials database. Specifically, we retrieved the atomic structure files (POSCAR and CIF) of MP materials with ID numbers below 100000 (currently 19952 materials) through the Materials Project API and used them as the initial materials for building the present database, Wenzhou TE. We have built deformation potential, elastic properties, and BoltzTraP electronic transport modules.
We then collect the data with Python scripts and display it on a website, https://hezhu2024.github.io, for others to use.

Clustering (K-means)
At present, the best thermoelectric materials obtained in experiments are mainly narrow-bandgap semiconductors, so we choose the bandgap as the major feature for material screening.
In addition, we selected free energy, volume, density, and average atomic energy as further features from the descriptors available in the MP database. Together they form the five feature variables for the K-means clustering algorithm.
Here is a brief introduction to the K-means principle [7]. K-means is a clustering algorithm that divides data into K classes. First, K class centers are generated at random, denoted as $c_1, c_2, \cdots, c_l, \cdots, c_K$. If the j-th feature of the i-th data sample is written as $x_{ij}$, the distance from the i-th sample to the l-th class center is:
$$d_{il} = \sqrt{\sum_{j=1}^{J} (x_{ij} - c_{lj})^2}$$
where J is the total number of features in the data. Each sample is assigned to the class whose center is nearest. After the first iteration, every data sample therefore belongs to a class. We then take the mean of the samples in each class as the new class center.
The new class center can be written as:
$$c_{lj} = \frac{1}{n_l} \sum_{i \in l} x_{ij}$$
where $n_l$ is the number of samples currently assigned to class l. The distances are then recalculated and the samples reclassified, and this process is repeated until convergence, at which point the data are divided into K classes. In the present work, we also standardize the data before classification. To decide how many classes are most reasonable, we define the total loss as:
$$\mathrm{Loss} = \sum_{i=1}^{n} d_{i,l(i)}$$
where n is the number of samples and $l(i)$ is the class to which sample i is assigned. This formula is the sum of the distances from all sample points to their class centers. When the curve of Loss versus K shows a clear inflection point, the value of K at that point is taken as a reasonable number of classes.
Through the K-means method, we divided the initial materials from MP into 5 categories. Their quantities are 6602, 5425, 3770, 2800, and 1355, respectively.
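The clustering procedure described above can be illustrated with a minimal pure-Python sketch. It is not the authors' script: the function names `standardize` and `kmeans` are hypothetical, the initial centers are drawn from the data points themselves (one common way to generate random starting centers), and the loss is the plain sum of sample-to-center distances as stated in the text.

```python
import math
import random


def standardize(rows):
    """Scale every feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; divide by 1.0 instead of 0.0.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(rows, k, n_iter=100, seed=0):
    """Plain K-means; returns (labels, centers, loss)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct samples chosen at random.
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(n_iter):
        # Assign each sample to the nearest center (Euclidean distance).
        new_labels = [min(range(k), key=lambda l: math.dist(row, centers[l]))
                      for row in rows]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        for l in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == l]
            if members:  # keep the old center if a class becomes empty
                centers[l] = [sum(c) / len(members) for c in zip(*members)]
    # Loss: sum of distances from every sample to its own class center.
    loss = sum(math.dist(row, centers[lab]) for row, lab in zip(rows, labels))
    return labels, centers, loss
```

Calling `kmeans(standardize(features), k)` for k = 1, 2, 3, ... and plotting the returned loss against k gives the kind of curve on which an inflection point can be sought.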

Deformation Potential Theory (DPT)
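The deformation potential theory was proposed by Bardeen and Shockley [8] in the 1950s to describe charge transport in non-polar semiconductors. The carrier mobility can be expressed as $\mu = e\tau/m^*$, where the relaxation time for bulk materials can be written as follows [8,9]:
$$\tau = \frac{2\sqrt{2\pi}\,\hbar^4 C}{3\,(k_B T m^*)^{3/2} E_1^2}$$
where $C = \frac{1}{V_0}\,\partial^2 E/\partial(\Delta l/l_0)^2$ is the elastic constant, with $V_0$ the volume of the relaxed cell and $\Delta l/l_0$ the applied strain, $E_1 = \Delta E_i/(\Delta l/l_0)$ is the deformation potential constant, $\Delta E_i$ is the deformation potential energy, that is, the difference between the energy level of the i-th band and that of a deep core state, and $m^* = \hbar^2/(\partial^2 E/\partial k^2)$ is the effective mass.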
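As a rough numerical illustration, the sketch below evaluates the relaxation time and mobility from an elastic constant, a deformation potential constant, and an effective mass. It assumes the standard three-dimensional Bardeen-Shockley expression, $\tau = 2\sqrt{2\pi}\hbar^4 C / [3 (k_B T m^*)^{3/2} E_1^2]$; the function names and the choice of input units (GPa, eV, free-electron masses) are illustrative assumptions, not the database code.

```python
import math

HBAR = 1.054571817e-34      # reduced Planck constant, J s
KB = 1.380649e-23           # Boltzmann constant, J/K
E_CHARGE = 1.602176634e-19  # elementary charge, C (also joules per eV)
M_E = 9.1093837015e-31      # free-electron mass, kg


def dpt_relaxation_time(c_gpa, e1_ev, m_eff, temperature=300.0):
    """Relaxation time in seconds for a 3D crystal (acoustic-phonon limited).

    c_gpa: elastic constant in GPa; e1_ev: deformation potential constant in eV;
    m_eff: effective mass in units of the free-electron mass.
    """
    c = c_gpa * 1e9          # Pa
    e1 = e1_ev * E_CHARGE    # J
    m = m_eff * M_E          # kg
    return (2.0 * math.sqrt(2.0 * math.pi) * HBAR ** 4 * c
            / (3.0 * (KB * temperature * m) ** 1.5 * e1 ** 2))


def dpt_mobility(c_gpa, e1_ev, m_eff, temperature=300.0):
    """Mobility in m^2 V^-1 s^-1 from mu = e * tau / m*."""
    tau = dpt_relaxation_time(c_gpa, e1_ev, m_eff, temperature)
    return E_CHARGE * tau / (m_eff * M_E)
```

The expression makes the sensitivities explicit: the relaxation time falls as $1/E_1^2$ and as $(m^*)^{-3/2}$, which is why the choice of effective mass matters for the final transport values.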

Elastic and thermal properties
We can obtain the elastic properties, elastic wave velocities, Poisson's ratio, Debye temperature, Grüneisen coefficient, and lattice thermal conductivity after calculating the elastic constants of a material [10], which is easily done in high-throughput calculations.
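For a crystal under uniform deformation, the generalized Hooke's law relating stress and strain [11] is:
$$\sigma_i = \sum_{j=1}^{6} C_{ij}\,\varepsilon_j$$
where $\sigma_i$ and $\varepsilon_j$ are the components, in Voigt notation, of the homogeneous second-order stress tensor and strain tensor, respectively, and $C_{ij}$ are the elastic constants [12].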
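The Reuss [14] bulk and shear moduli can be calculated by
$$B_R = \frac{1}{(S_{11}+S_{22}+S_{33}) + 2(S_{12}+S_{13}+S_{23})}, \qquad G_R = \frac{15}{4(S_{11}+S_{22}+S_{33}) - 4(S_{12}+S_{13}+S_{23}) + 3(S_{44}+S_{55}+S_{66})}$$
where $S_{ij}$ are the elastic compliances, that is, the elements of the inverse of the matrix $C_{ij}$. In the present work, we take the arithmetic mean of the Voigt and Reuss bounds, known as the Voigt-Reuss-Hill (VRH) average [15]:
$$B = \frac{B_V + B_R}{2}, \qquad G = \frac{G_V + G_R}{2}$$
where $B_V$ and $G_V$ are the Voigt bounds. The longitudinal ($v_l$), transverse ($v_t$), and average ($v_a$) elastic wave velocities can be calculated by
$$v_l = \sqrt{\frac{B + 4G/3}{\rho}}, \qquad v_t = \sqrt{\frac{G}{\rho}}, \qquad v_a = \left[\frac{1}{3}\left(\frac{1}{v_l^3} + \frac{2}{v_t^3}\right)\right]^{-1/3}$$
where $\rho$ is the mass density. The Debye temperature ($\theta_D$) is obtained by:
$$\theta_D = \frac{h}{k_B}\left[\frac{3n}{4\pi V}\right]^{1/3} v_a$$
where n is the number of atoms in the primitive cell of volume V. The Grüneisen coefficient is calculated by:
$$\gamma = \frac{3}{2}\,\frac{1+\nu}{2-3\nu}$$
where $\nu = (3B - 2G)/(2(3B + G))$ is the Poisson's ratio.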
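The chain from elastic constants to velocities, Debye temperature, and Grüneisen coefficient can be sketched as follows. To stay short, the sketch is restricted to cubic crystals, for which the Voigt and Reuss bounds reduce to closed forms in $C_{11}$, $C_{12}$, and $C_{44}$; a general implementation would invert the full 6×6 stiffness matrix. The formulas are the standard textbook ones and are assumed, not confirmed, to match those used for the database; all function names are hypothetical.

```python
import math

H = 6.62607015e-34   # Planck constant, J s
KB = 1.380649e-23    # Boltzmann constant, J/K


def cubic_vrh(c11, c12, c44):
    """Voigt-Reuss-Hill bulk and shear moduli of a cubic crystal (input units)."""
    bulk = (c11 + 2.0 * c12) / 3.0  # Voigt and Reuss bounds coincide for cubic
    g_voigt = (c11 - c12 + 3.0 * c44) / 5.0
    g_reuss = 5.0 * (c11 - c12) * c44 / (4.0 * c44 + 3.0 * (c11 - c12))
    return bulk, 0.5 * (g_voigt + g_reuss)


def sound_velocities(bulk_gpa, shear_gpa, density):
    """Longitudinal, transverse, and average velocities in m/s (density in kg/m^3)."""
    b, g = bulk_gpa * 1e9, shear_gpa * 1e9
    v_l = math.sqrt((b + 4.0 * g / 3.0) / density)
    v_t = math.sqrt(g / density)
    v_a = ((1.0 / v_l ** 3 + 2.0 / v_t ** 3) / 3.0) ** (-1.0 / 3.0)
    return v_l, v_t, v_a


def debye_temperature(v_a, n_atoms, cell_volume_a3):
    """Debye temperature in K; cell volume in cubic angstroms."""
    number_density = n_atoms / (cell_volume_a3 * 1e-30)  # atoms per m^3
    return H / KB * (3.0 * number_density / (4.0 * math.pi)) ** (1.0 / 3.0) * v_a


def poisson_ratio(bulk, shear):
    return (3.0 * bulk - 2.0 * shear) / (2.0 * (3.0 * bulk + shear))


def gruneisen(nu):
    return 1.5 * (1.0 + nu) / (2.0 - 3.0 * nu)
```

For example, `cubic_vrh(166.0, 64.0, 80.0)` takes illustrative Si-like elastic constants in GPa (chosen for this sketch, not taken from the database) and returns the bulk and shear moduli that feed the other functions.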
According to the Slack formula [16,17], the lattice thermal conductivity can be expressed as:
$$\kappa_l = \frac{A\,\bar{M}\,\theta_D^3\,\delta}{\gamma^2\, n^{2/3}\, T}$$
where $\bar{M}$ is the average atomic mass, $\theta_D$ is the Debye temperature, $\delta^3$ is the volume per atom, $n$ is the number of atoms in the primitive cell, $\gamma$ is the Grüneisen coefficient, $A$ is a constant of $3.1 \times 10^{-6}$, and $T$ is the temperature.
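A small sketch of the Slack estimate is given below. It assumes the usual unit convention for $A = 3.1 \times 10^{-6}$, namely average atomic mass in amu, $\delta$ in angstroms, and temperatures in kelvin, which yields the conductivity in W m^-1 K^-1; this convention is an assumption here, since the text does not state the units. The function name is hypothetical.

```python
def slack_kappa(mean_mass_amu, theta_d, volume_per_atom_a3, n_atoms, gamma,
                temperature=300.0, a_const=3.1e-6):
    """Lattice thermal conductivity (W m^-1 K^-1, under the assumed units).

    volume_per_atom_a3 is delta^3 in cubic angstroms; n_atoms is the number
    of atoms in the primitive cell.
    """
    delta = volume_per_atom_a3 ** (1.0 / 3.0)  # angstrom
    return (a_const * mean_mass_amu * theta_d ** 3 * delta
            / (gamma ** 2 * n_atoms ** (2.0 / 3.0) * temperature))
```

Because the Debye temperature enters to the third power, errors in the elastic constants propagate strongly into the estimated lattice thermal conductivity.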

Methods for the first-principles calculations and transport properties
In building the thermoelectric materials database, first-principles calculations are performed with the Vienna Ab initio Simulation Package (VASP) [18,19]. The electrical transport calculations use the BoltzTraP program package [20]. To minimize computational cost while ensuring data reliability, during structural optimization we set the plane-wave energy cutoff to 1.4 times the largest ENMAX in the POTCAR files of the constituent elements, the electronic energy convergence criterion to $10^{-4}$ eV, the ionic force convergence criterion to $10^{-2}$ eV/Å, and the k-mesh density to 0.04 × 2π Å$^{-1}$.
All processes are controlled through shell scripts. Data collection and calculation are implemented in Python scripts. These codes were written in-house.
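The relaxation settings listed above could be generated per material along the following lines. This is a sketch, not the in-house script: the helper names are hypothetical, and mapping the quoted k-mesh density onto VASP's `KSPACING` tag (given in Å$^{-1}$, hence the factor 2π) is an assumption about how the mesh was specified. Only the four settings named in the text are written; a real INCAR needs further tags.

```python
import math
import re


def max_enmax(potcar_text):
    """Largest ENMAX (eV) among all species in a concatenated POTCAR."""
    values = [float(v) for v in re.findall(r"ENMAX\s*=\s*([0-9.]+)", potcar_text)]
    if not values:
        raise ValueError("no ENMAX entry found in POTCAR text")
    return max(values)


def relaxation_incar(potcar_text):
    """INCAR fragment with the convergence settings quoted in the text."""
    tags = {
        "ENCUT": f"{1.4 * max_enmax(potcar_text):.1f}",   # 1.4 x largest ENMAX
        "EDIFF": "1E-4",                                  # electronic step, eV
        "EDIFFG": "-1E-2",                                # negative: force criterion, eV/A
        "KSPACING": f"{0.04 * 2.0 * math.pi:.4f}",        # assumed tag for the k-mesh
    }
    return "\n".join(f"{key} = {value}" for key, value in tags.items()) + "\n"
```

Reading ENMAX from the POTCAR itself avoids hard-coding a cutoff per element, which matters when thousands of different compositions are processed in one run.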

The application of K-means on datasets from MP
From Figure 1a, it can be seen that the obvious inflection point occurs at K = 6, which means that the initial structures can be divided into 6 categories. Considering the reasonable distribution of the

Computational framework and relaxation process
After obtaining the structure file, we first perform a structural relaxation and a static calculation.
Structural relaxation refers to the optimization of atomic positions and lattice constants. We employed VASP for the first-principles calculations. Several mainstream databases, such as AFLOW, MP, and OQMD, are also computed with VASP.
The first and second classes of materials obtained through the initial K-means screening contain more than 12000 materials, many of which have too many element types or too many atoms in the primitive cell. In the present work, we first calculate the material systems with relatively simple structures.
Therefore, a computational control process is employed during structural relaxation to screen them further, resulting in a total of more than 3000 materials with relatively simple structures in the first and second classes. Nevertheless, structural relaxation of so many materials is computationally demanding. To accelerate the calculation, we wrote several shell scripts to control the relaxation process. The flowchart is shown in Figure 2. in our present setup calculations. In the second category, there are also 2451 materials with more than 10 atoms in the primitive cell or more than 4 element types, and 1318 materials that are difficult to relax.
After the relaxation calculation, the converged structures are saved for further calculations.
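The structural part of this screening can be expressed as a simple filter. The sketch below assumes that "relatively simple" means at most 10 atoms in the primitive cell and at most 4 element types, which is inferred from the counts quoted above rather than stated as the rule; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    material_id: str   # identifier of the source structure
    n_atoms: int       # atoms in the primitive cell
    n_elements: int    # distinct element types


def is_simple(candidate, max_atoms=10, max_elements=4):
    """True if the structure is small enough for the first calculation pass."""
    return candidate.n_atoms <= max_atoms and candidate.n_elements <= max_elements


def split_by_complexity(candidates):
    """Return (simple, deferred) lists; deferred structures are left for later."""
    simple = [c for c in candidates if is_simple(c)]
    deferred = [c for c in candidates if not is_simple(c)]
    return simple, deferred
```

Structures that fail to converge during relaxation would be removed in a second step, after the calculations have actually been attempted.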
We then calculate the parameters of the deformation potential theory. First, we determine whether the material is anisotropic, and then we perform static calculations on the structures deformed along the relevant directions.

Analysis of results of deformation potential theory (using Si as an example)
The deformation potential method treats acoustic phonons as the main scattering source for electrons. Because the contributions of optical phonon branches and other scattering mechanisms are ignored, the resulting relaxation time can be larger than the real one, but the deformation potential calculation is relatively simple and is easily employed in high-throughput calculations. The lattice vectors are deformed by scaling factors of {0.98, 0.99, 1.00, 1.01, 1.02} relative to the relaxed structure. These five points ensure a reliable fit of a second-order function for the elastic constant and of a first-order function for the deformation potential. Taking Si as an example, as shown in Figure
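The two fits can be sketched in plain Python as follows. The sketch assumes total energies in eV, the relaxed cell volume in Å$^3$, and band-edge energies already referenced to a deep core level; `polyfit` is a small least-squares helper written for this example (not a library call), and the other names are likewise hypothetical.

```python
EV_PER_A3_TO_GPA = 160.2176634  # 1 eV/A^3 expressed in GPa


def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [a0, a1, ...] via normal equations."""
    n = degree + 1
    # Augmented matrix [A^T A | A^T y] for the monomial basis 1, x, x^2, ...
    m = [[sum(x ** (r + c) for x in xs) for c in range(n)]
         + [sum(y * x ** r for x, y in zip(xs, ys))] for r in range(n)]
    for col in range(n):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def elastic_constant(strains, energies, v0):
    """C in GPa from E = E0 + (1/2) C V0 strain^2 (second-order fit)."""
    _, _, c2 = polyfit(strains, energies, 2)
    return 2.0 * c2 / v0 * EV_PER_A3_TO_GPA


def deformation_potential(strains, band_edges):
    """E1 in eV: slope of the band-edge energy versus strain (first-order fit)."""
    _, slope = polyfit(strains, band_edges, 1)
    return slope
```

With the scaling factors quoted above, the strains passed to these functions are simply the factors minus one, that is, -0.02, -0.01, 0.00, 0.01, and 0.02.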

Energy band and effective mass calculation
There are many methods to obtain the band structure of a material. Here we compare three feasible schemes. The first is a VASP band calculation along high-symmetry points, the second uses BoltzTraP2 [20] to interpolate the band structure, and the third uses maximally localized Wannier functions to interpolate the VASP results [21]. Considering accuracy and efficiency, the second scheme is chosen for our high-throughput calculations. The results of the three schemes for Si are presented in Table 2. The bandgap of Si in the MP database is 0.61 eV, which is consistent with the VASP calculation. The bandgap error of the BoltzTraP calculation is within 5%. Meanwhile, the effective mass of Si calculated by BoltzTraP is smaller than that of the VASP scheme, which means that the calculated relaxation time will be larger, as shown in Table 1, where the relaxation time of electrons is 1141.9. The band structures of Si from the three schemes are shown in Figure 4. From Table 2, it can be seen that the BoltzTraP band-structure calculation is the most efficient, so it helps to accelerate the high-throughput calculations. To facilitate high-throughput calculation, we use the formula $m^* = \hbar^2/(\partial^2 E/\partial k^2)$ to calculate the effective mass. The effective masses of Si from the BoltzTraP scheme are shown in Figure 5. A series of effective masses of the conduction and valence bands were obtained near the high-symmetry points Γ and X. We selected the maximum values, 0.46 $m_0$ and 2.48 $m_0$, where $m_0$ is the free-electron mass, as the effective masses of the conduction band and valence band, respectively. In addition, our program automatically determines whether a band is degenerate and calculates the effective mass for each degenerate band. We note that the maximum effective mass is selected because the deformation potential theory overestimates the relaxation time. Selecting the maximum effective mass reduces the calculated relaxation time and thus partly compensates for this shortcoming. In high-throughput calculations, the program selects representative effective masses for other materials in the same way as for Si.
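The effective-mass formula can be evaluated numerically with a central second difference on a uniform k grid. The sketch below assumes band energies in eV and k in Å$^{-1}$, and returns masses in units of the free-electron mass; the function names are hypothetical and the degeneracy handling mentioned in the text is not reproduced.

```python
HBAR = 1.054571817e-34   # reduced Planck constant, J s
M_E = 9.1093837015e-31   # free-electron mass, kg
EV = 1.602176634e-19     # joules per eV
ANGSTROM = 1e-10         # metres per angstrom
HBAR2_OVER_ME = HBAR ** 2 / M_E / EV / ANGSTROM ** 2  # about 7.62 eV A^2


def effective_mass(k_points, energies, i):
    """Signed m*/m_0 at grid index i (negative for a downward-curving band)."""
    dk = k_points[i + 1] - k_points[i]  # uniform spacing assumed
    curvature = (energies[i + 1] - 2.0 * energies[i] + energies[i - 1]) / dk ** 2
    return HBAR2_OVER_ME / curvature


def largest_effective_mass(k_points, energies, indices):
    """Maximum |m*|/m_0 over the chosen grid points, as selected in the text."""
    return max(abs(effective_mass(k_points, energies, i)) for i in indices)
```

For a valence band the curvature is negative, so the magnitude is taken before comparing values at different k points.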

High-throughput electrical transport properties (BoltzTraP)
BoltzTraP is a program package for calculating semiclassical transport coefficients, based on a smoothed Fourier interpolation of the bands. Electrical transport properties such as the Seebeck coefficient, electrical conductivity, and electronic thermal conductivity can be obtained at different temperatures and doping concentrations. The BoltzTraP program has an input interface for VASP files, which meets the needs of the present high-throughput process. The BoltzTraP module is run after the static calculations are complete. Moreover, the Python-based BoltzTraP can be embedded directly into our high-throughput Python data-processing scripts, which are written to extract the calculated quantities quickly.
Combined with the lattice thermal conductivities estimated from the elastic property calculations, we can obtain the ZT values of the materials. The ten semiconductor materials with the highest ZT values are listed in Table 3.
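How the pieces combine into ZT can be sketched in a few lines. The sketch assumes that the transport code reports the conductivity and electronic thermal conductivity per unit relaxation time, as constant-relaxation-time Boltzmann codes commonly do, and that both are scaled by the deformation potential relaxation time; whether the database scripts follow exactly this route is an assumption. The function name and argument names are hypothetical.

```python
def figure_of_merit(seebeck, sigma_over_tau, kappa_e_over_tau, tau, kappa_l,
                    temperature):
    """ZT = S^2 sigma T / (kappa_e + kappa_l), all quantities in SI units.

    seebeck in V/K, sigma_over_tau in S m^-1 s^-1, kappa_e_over_tau in
    W m^-1 K^-1 s^-1, tau in s, kappa_l in W m^-1 K^-1, temperature in K.
    """
    sigma = sigma_over_tau * tau
    kappa_e = kappa_e_over_tau * tau
    return seebeck ** 2 * sigma * temperature / (kappa_e + kappa_l)
```

The relaxation time multiplies both the numerator and the electronic part of the denominator, so an overestimated relaxation time raises ZT most strongly when the lattice thermal conductivity dominates.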

ZT value and BE value
As an example of an application of our database, we relate the thermoelectric ZT values to the electronic quality factor. From $S$ and $\sigma$, the electronic quality factor $B_E$ can be defined by [22]:
$$B_E = S^2\sigma \left[ \frac{S_r^2 \exp(2 - S_r)}{1 + \exp[-5(S_r - 1)]} + \frac{S_r \pi^2/3}{1 + \exp[5(S_r - 1)]} \right]^{-1}$$
where $S_r = |S|/(k_B/e)$. As shown in Figure 6, the ZT values of most materials are positively correlated with the ratio $B_E/\kappa_l$, so $B_E/\kappa_l$ can also serve as another criterion for identifying excellent thermoelectric materials.
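A short sketch of the electronic quality factor is given below. It assumes the commonly quoted interpolation form, in which a non-degenerate term $S_r^2 \exp(2 - S_r)$ and a degenerate term $S_r \pi^2/3$ are blended by logistic weights centred on $S_r = 1$; readers should check this form against reference [22] before relying on it. The function name is hypothetical.

```python
import math

KB_OVER_E = 8.617333262e-5  # Boltzmann constant over elementary charge, V/K


def electronic_quality_factor(seebeck, sigma):
    """B_E in the units of S^2 * sigma (W m^-1 K^-2 for SI inputs)."""
    if seebeck == 0.0:
        raise ValueError("B_E is undefined for a zero Seebeck coefficient")
    s_r = abs(seebeck) / KB_OVER_E  # reduced Seebeck coefficient
    nondegenerate = s_r ** 2 * math.exp(2.0 - s_r) / (1.0 + math.exp(-5.0 * (s_r - 1.0)))
    degenerate = s_r * math.pi ** 2 / 3.0 / (1.0 + math.exp(5.0 * (s_r - 1.0)))
    return seebeck ** 2 * sigma / (nondegenerate + degenerate)
```

Dividing the result by the lattice thermal conductivity gives the ratio that the text compares with the maximum ZT.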

Conclusions
In this work, we built a thermoelectric materials database, Wenzhou TE. We designed several modules to obtain the electronic and heat transport parameters of materials, including structural screening, deformation potential, elastic constant, and BoltzTraP electrical transport modules. We also wrote several Python scripts to collect data and process results. Furthermore, we built a webpage for the first-principles calculated thermoelectric materials database (https://hezhu2024.github.io), which can be used to search for materials and view their physical properties. We will continue to extend the database to include more materials, so that these data can be readily used for data mining and thermoelectric materials development.
average-bandgap values, we ultimately divided them into 5 categories. The feature distribution maps and other K-means information are shown in Figures 1c-g. The average bandgap of the first class is merely 0.025 eV, so this class contains many metals. The second class, with an average bandgap of 0.14 eV, is mainly composed of narrow-bandgap semiconductors. The third, fourth, and fifth classes are mainly composed of wide-bandgap semiconductors and insulators. As a starting point, we focused on calculating the physical properties of the candidate materials in the first and second classes.

Figure 1. Application of K-means to the MP dataset: (a) line chart of Loss versus the number of classes K; (b) classification data and relaxation screening results of the initial structures under K-means; (c-g) average distribution of the 5 features for each K-means class.

Figure 3. Schematic diagram of the second-order fit of the elastic constant and the first-order fit of the deformation potential for Si.

Figure 4. The band structure of Si from three schemes: VASP, BoltzTraP, and Wannier interpolation.

Figure 5. The effective mass of Si from the BoltzTraP scheme.

Figure 6. Thermoelectric quality factor $B_E/\kappa_l$ and maximum $ZT$ at 300 K.

Table 1. Calculated deformation potential parameters, effective masses, and relaxation times of carriers for Si.

Table 2. Band results of Si under three schemes.

Table 3. Top 10 semiconductor materials sorted by ZT value.