Computational Screening of New Perovskite Materials Using Transfer Learning and Deep Learning

Abstract: As one of the most studied classes of materials, perovskites exhibit a wealth of superior properties that lead to diverse applications. Computational prediction of novel stable perovskite structures has great potential for the discovery of new materials for solar panels, superconductors, thermoelectrics, and catalysts. To address one of the key obstacles of machine learning based materials discovery, the lack of sufficient training data, this paper proposes a transfer learning based approach that exploits the high accuracy of a machine learning model trained with physics-informed structural and elemental descriptors. This gradient boosting regressor model (the transfer learning model) allows us to predict, with sufficient precision, the formation energy of a large number of materials for which only structural information is available. The enlarged training set is then used to train a convolutional neural network model (the screening model) with high prediction power using only generic Magpie elemental features. Extensive experiments demonstrate the superior performance of our transfer learning model and screening model compared to the baseline models. We then applied the screening model to filter promising new perovskite materials out of 21,316 hypothetical perovskite structures, with a large portion of the selected candidates confirmed by existing literature.


Introduction
The perovskite structure is one of the most common and widely studied structures in materials science. The general chemical formula of the perovskite compound is ABX3, wherein A and B are two different sized cations, and X is an anion bonded to both. Its ideal structure is a cubic structure, and B atoms are at the center of a typical anionic octahedron. This seemingly simple atomic arrangement is deceptive because it hides its diversity of special physical and chemical properties.
As a result of their rich and remarkable properties, perovskites are used in many technical fields. Examples include piezoelectric perovskites such as lead zirconate titanate, used in sensors and actuators [1], and high temperature perovskite superconductors such as beryllium copper oxide [2,3]. Some perovskite groups (mainly manganese-based perovskite oxides) exhibit colossal magnetoresistance, in which the electrical resistance changes significantly in the presence of a magnetic field [4]. In addition, perovskites have been studied and applied in other fields such as thermoelectric materials [5], catalysts [6,7], light emitting diodes (LEDs) [8,9], and lasers [10].
In recent years, perovskites have become a research hotspot because they can be used as absorber materials for solar cells. In 2009, Miyasaka et al. [11] used CH3NH3PbBr3 and CH3NH3PbI3 as photosensitizers in dye-sensitized solar cells for the first time, with an efficiency of 3.8%, which laid the foundation for the development of perovskite solar cells. In 2012, Grätzel and Park [12] first used solid Spiro-OMeTAD as a hole transport layer to make a solid-state perovskite solar cell with an efficiency of 9.7%. In 2014, Hongwei [13] and others from Huazhong University of Science and Technology used a full-printing method to prepare a mesoporous perovskite solar cell without a hole transport layer and with carbon as the electrode, which achieved an efficiency of 13.4%. In 2016, the Korean Institute of Chemistry (KRICT) and the Ulsan University of Science and Technology (UNIST) jointly developed a perovskite solar cell with an efficiency of 22.1% [14], the most efficient perovskite solar cell at that time. In 2019, Lin et al. [15] at Nanjing University developed a strategy to reduce tin vacancies in Pb-Sn narrow-bandgap perovskites by a neutralization reaction, thereby improving the performance and stability of perovskite tandem solar cells. The small-area and large-area perovskite tandem solar cells achieved certified efficiencies of 24.8% and 22.1%, respectively. Perovskite solar cell efficiency records are constantly being refreshed, and research results on perovskite solar cells continue to emerge [16]. Since perovskite materials were first used in solar cells in 2009, their energy conversion efficiency has reached 22.1% in just ten years, far exceeding that of other thin film solar cells, and they have broad commercial prospects.
In recent years, with the accumulation of materials database resources, data mining [17] and machine learning (ML) [18] have been used more and more frequently in materials research, platform design and analysis, and prediction of material properties [19]. In terms of new material discovery, machine learning algorithms have been used to discover new energy materials [20], soft materials [21], polymer dielectrics [22], etc., and have achieved remarkable results. While machine learning has become a promising tool for scientists, it also has a distinct disadvantage: it usually requires a large amount of training data (e.g., $10^4$ to $10^6$ samples). This is usually not feasible in materials science because the datasets (mainly property characterization data) for most interesting material properties, such as ion conductivity, contain only approximately $10^1$ to $10^3$ samples. Standard machine learning methods that use generic descriptors do not work well with such small datasets unless highly informative physical and structural descriptors are included. For example, Saad et al. [23] trained an ML model with 44 materials to predict the melting temperature of octagonal compounds; Seko et al. [24] used 101 materials to establish a lattice thermal conductivity model; Ghiringhelli et al. [25] established an energy difference model for sphalerite and rock salt phases using 82 materials; Reed et al. [26] proposed to transfer structural information descriptors to general descriptors, so that billions of unknown lithium-ion conductive components can be screened. All these studies emphasized the importance of descriptors for their ML models.
This paper proposes a transfer learning approach that develops a transfer learning model for annotating a large number of materials samples with structural information but without formation energy information. The enlarged annotated training set then trains a high-performance convolution neural network model with generic Magpie elemental features to predict the formation energy of hypothetical perovskite materials for which only the composition and stoichiometry are available without crystal structure information. The main contributions of this paper are as follows: (1) We proposed a transfer learning strategy to convert formation energy related structural features/insights into training data for a perovskite screening model using only elemental Magpie features. This enables us to address the small dataset issue in typical ML based materials discovery. (2) We developed a gradient boosting regressor (GBR) ML model trained with structural and elemental features for perovskite formation energy prediction, which outperforms the state-of-the-art artificial neural network (ANN) based model trained with two elemental descriptors. This highly accurate model allows us to annotate the large number of material samples with structural information but no formation energy.

Materials and Methods
All 21,316 compounds are generated by filling the A and B positions ($73^2 \times 4 = 21{,}316$) in the ABX3 (X = O, Br, Cl, I) perovskite crystal structure with 73 metals or semi-metals in the periodic table (see Figure 1a) [27]. The ideal ABX3 perovskite crystal structure [28] is composed of an A cation in 12-fold coordination, located in a cavity formed by octahedra, and a B cation in octahedral coordination with six X anions (oxygen ions in the oxide case; see Figure 1b). In addition to the ideal cubic structure, many perovskites undergo local distortions. These distorted perovskites may adopt a variety of lower-symmetry structures, including rhombohedral, tetragonal, and orthorhombic distortions. In this work, all 21,316 compounds were generated in the ideal cubic structure.
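The size of the candidate space follows directly from the combinatorics in the paragraph above. The sketch below only illustrates that count: the element labels `E01` to `E73` are hypothetical placeholders (the actual 73 elements are given in Figure 1a, which is not reproduced here), and `enumerate_abx3` is a name introduced for this example.

```python
from itertools import product

# Hypothetical placeholder labels standing in for the 73 metals/semi-metals
# of Figure 1a; the real element symbols are not listed in this section.
ELEMENTS = [f"E{i:02d}" for i in range(1, 74)]
ANIONS = ["O", "Br", "Cl", "I"]  # X-site choices named in the text


def enumerate_abx3(elements, anions):
    """Return every (A, B, X) combination; A and B may be the same element."""
    return [(a, b, x) for a, b, x in product(elements, elements, anions)]


candidates = enumerate_abx3(ELEMENTS, ANIONS)
print(len(candidates))  # 73 * 73 * 4 combinations
```

Counting ordered (A, B) pairs, including A = B, is what reproduces the total quoted in the text.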

In order to improve the accuracy of the ML model and to screen for perovskites that have never been reported in the literature and cannot be characterized, we propose a transfer learning method [29]. First, we used descriptors with structural features to train an accurate ML model; then we used the trained ML model to predict formation energies for the 21,316 perovskites, which serve as labels, and trained a new model on the elemental feature descriptors. Since this dataset is much larger, we can train a good prediction model using only generic features. As elemental descriptors depend only on composition or stoichiometry information, no structure or other information is needed. Once the accurate general model is trained, it can be used to effectively screen new perovskite materials.

Magpie Features
Magpie is an extensible set of attributes, created by Chris et al. [32], that can be used for materials with any number of constituent elements. This set of attributes is broad enough to capture a wide variety of physical/chemical properties and can be used to create accurate models for many material prediction problems. The attributes include stoichiometric attributes (depending on the proportions of the elements only), elemental property statistics (atomic number, atomic radius, melting temperature, etc.), electronic structure attributes (valence electrons in the s, p, d, and f shells), and ionic compound attributes.
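To make the idea of elemental property statistics concrete, the following is a minimal sketch in the spirit of Magpie, not the Magpie implementation itself. The property table is a tiny illustrative subset (atomic numbers and Pauling electronegativities for three elements), and the function name `composition_features` and the feature names are assumptions introduced for this example.

```python
# Tiny illustrative property table; the real Magpie set uses many more
# elemental properties than the two shown here.
ELEMENT_PROPERTIES = {
    "Sr": {"atomic_number": 38, "electronegativity": 0.95},
    "Ti": {"atomic_number": 22, "electronegativity": 1.54},
    "O": {"atomic_number": 8, "electronegativity": 3.44},
}


def composition_features(composition):
    """Compute composition-only statistics for each elemental property.

    composition maps element symbol -> number of atoms per formula unit.
    """
    total = sum(composition.values())
    fractions = {el: n / total for el, n in composition.items()}
    features = {}
    property_names = sorted(next(iter(ELEMENT_PROPERTIES.values())))
    for prop in property_names:
        values = {el: ELEMENT_PROPERTIES[el][prop] for el in composition}
        # Fraction-weighted mean over the constituent elements.
        mean = sum(fractions[el] * v for el, v in values.items())
        features[f"mean_{prop}"] = mean
        features[f"min_{prop}"] = min(values.values())
        features[f"max_{prop}"] = max(values.values())
        features[f"range_{prop}"] = max(values.values()) - min(values.values())
        # Fraction-weighted mean absolute deviation around the mean.
        features[f"avg_dev_{prop}"] = sum(
            fractions[el] * abs(v - mean) for el, v in values.items()
        )
    return features


srtio3 = composition_features({"Sr": 1, "Ti": 1, "O": 3})
```

Because every feature is derived from the formula alone, the same function applies to hypothetical compounds for which no crystal structure is known.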

Overview of Our Data-Driven Framework for Computational Screening
The overall framework of our methodology for screening new perovskites is shown in Figure 2. First, a GBR machine learning model (M1) is trained using the hybrid structural and elemental features and the training dataset D1. The M1 model is then used to predict the formation energies of the materials dataset D2 of which all samples come with structural information. D1 and D2 datasets are then combined to train a convolution neural network model M2 using the elemental Magpie features which do not need structural information. This M2 model can then be used to do large-scale screening of candidate dataset D3 to identify potential new perovskite materials for further DFT or experimental verification.

Structural descriptor entries (name, type, unit, description):
beta, number, °: β angle of the relaxed structure; β = 90° for the cubic, tetragonal, and orthorhombic distortions.
gamma, number, °: γ angle of the relaxed structure; γ = 90° for the cubic, tetragonal, and orthorhombic distortions.
°: degree, the unit of angle.

Figure 2. Framework for the computational screening of perovskite materials. Abbreviations: RFR, random forest regression; SVR, support vector regression; GBR, gradient boosting regressor; CIF, Crystallographic Information File.
In the following sections, we will describe each step of our screening framework.

GBR Machine Learning Model for Formation Energy Prediction
It has been shown [33] that training a perovskite specific formation energy prediction model using structural information can achieve high accuracy comparable to that of a DFT calculation. Instead of using artificial neural networks and elemental features only, as done before, we propose to use GBR with both structural and elemental features.

Gradient Boosting Regressor
Boosting is a family of algorithms that combine weak learners into a strong learner, whose performance is typically much better than that of any single base learner. The main idea is to add new models to the ensemble sequentially. In each iteration, a new base learner is trained on the error of the whole ensemble learned so far. Gradient boosting, like other boosting methods, builds the model in a stage-wise fashion and allows arbitrary differentiable loss functions. It uses gradient descent to solve the minimization problem and produces a predictive model in the form of an ensemble of weak predictive models (usually decision trees). Boosting can be used for both regression and classification problems; this study uses it for regression.
The main architecture of GBR includes three elements: the loss function, the weak learner (for prediction), and the additive model. The main idea of the algorithm is to construct each new base learner so that it is maximally correlated with the negative gradient of the loss function of the whole ensemble. Any differentiable loss function can be used; in general, the choice of the loss function is left to the user, and various loss functions have been proposed [34]. The mathematical formula for GBR is as follows:

$$F(x) = \sum_{m=1}^{M} \gamma_m h_m(x) \tag{1}$$

where $h_m(x)$ is a basis function, often referred to as a weak learner, $\gamma_m$ is its step length, and $M$ is the number of boosting stages. GBR uses fixed-size decision trees as weak learners. Decision trees can process mixed-type data and model complex functions. Like other boosting algorithms, GBR builds the additive model in a stage-wise manner:

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x) \tag{2}$$

At each stage, the decision tree $h_m(x)$ is chosen to minimize the loss function $L$ over the $n$ training samples $(x_i, y_i)$, given the current model $F_{m-1}$ and its fit $F_{m-1}(x_i)$:

$$(\gamma_m, h_m) = \arg\min_{\gamma, h} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \gamma h(x_i)\big) \tag{3}$$

The initial model $F_0$ is problem specific; for least squares regression, the mean of the target values is typically chosen. Given any differentiable loss function $L$, the algorithm starts with the initial model $F_0$ and solves this minimization problem numerically by gradient descent. The descent direction is the negative gradient of the loss function evaluated at the current model $F_{m-1}$, which can be calculated for any differentiable loss function:

$$F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^{n} \nabla_F L\big(y_i, F_{m-1}(x_i)\big) \tag{4}$$

The gradient boosting regression tree has multiple hyperparameters.
The values of the hyperparameters used in this study are: max_depth = 6; n_estimators = 500; min_samples_split = 0.5; subsample = 0.7; alpha = 0.1; learning_rate = 0.01; loss = ls. To evaluate the performance of GBR, we also evaluated several mainstream machine learning algorithms as baselines, including random forest regression (RFR), support vector regression (SVR), and the least absolute shrinkage and selection operator (Lasso), on the same dataset using the same set of features.
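The stage-wise procedure described in this section can be illustrated with a minimal from-scratch sketch. It is not the Scikit-learn model used in the study: it assumes squared-error loss, one-dimensional inputs, and depth-1 trees (stumps) instead of max_depth = 6, and it omits subsampling. The names `fit_stump` and `fit_gbr` and the toy data are introduced only for this example.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree to 1-D inputs by least squares.

    Assumes xs contains at least two distinct values.
    """
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbr(xs, ys, n_estimators=200, learning_rate=0.1):
    """Stage-wise additive model: F_m = F_{m-1} + learning_rate * h_m."""
    f0 = sum(ys) / len(ys)  # F_0: mean of the targets for least squares
    stumps = []
    preds = [f0] * len(xs)
    for _ in range(n_estimators):
        # For the loss 0.5 * (y - F)^2 the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        stumps.append(h)
        preds = [p + learning_rate * h(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + learning_rate * sum(h(x) for h in stumps)


# Toy regression problem: y = x^2 on a regular grid.
xs = [i / 10 for i in range(50)]
ys = [x * x for x in xs]
model = fit_gbr(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
baseline_mse = sum((sum(ys) / len(ys) - y) ** 2 for y in ys) / len(ys)
```

When reproducing the listed settings with Scikit-learn's `GradientBoostingRegressor`, note that recent releases spell the least-squares loss `"squared_error"` rather than `"ls"`, and that `alpha` only takes effect for the Huber and quantile losses.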
Random forest regression is a commonly used ensemble algorithm that, unlike boosting, is based on bagging. It is composed of many decision trees, each trained on a sub-sample of the dataset, and averages their predictions to improve accuracy and control overfitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement if bootstrap = True (the default). It is widely used in statistics, data mining, and machine learning. The main hyperparameters of random forest (RF) are max_features and n_estimators, where max_features is the number of features to consider when looking for the best split, and n_estimators is the number of trees in the forest.
SVR is a commonly used regression algorithm that uses kernel functions to map data from a low-dimensional space to a high-dimensional space and then fits a hyperplane, determined by the support vectors, to make the final prediction. SVR performs well on prediction problems with high-dimensional features. However, this advantage is reduced when the number of features is much larger than the number of samples. The main hyperparameters of SVR include C, γ, and epsilon, where C is the penalty parameter of the error term, γ is the coefficient of the RBF kernel, and epsilon is the width of the tube within which errors are not penalized.
The last baseline algorithm, Lasso, is a data dimension reduction method that is applicable not only to linear cases but also to nonlinear cases. Lasso uses a penalty to select the variables of the sample data: by shrinking the original coefficients, small coefficients are set exactly to 0, so that the corresponding variables are regarded as non-significant and are discarded. Lasso uses the L1 norm of the coefficients as the penalty term.
In this study, all of the above machine learning models and the 10-fold cross-validation are implemented using the open-source library Scikit-learn [35].
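The way the L1 penalty sets small coefficients exactly to zero can be seen in a special case. Assuming an orthonormal design matrix and the objective (1/2)·||y − Xβ||² + λ·||β||₁, the Lasso solution is the soft-thresholded least-squares coefficient. The sketch below shows only this operator; the coefficient values and `lam` are made up for illustration.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam and clip to exactly zero inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical least-squares coefficients for four features.
ols_coefficients = [2.5, -0.3, 0.05, -1.2]
lam = 0.5  # penalty strength (illustrative value)

lasso_coefficients = [soft_threshold(c, lam) for c in ols_coefficients]
# Features whose coefficient survived the shrinkage are the selected ones.
selected = [i for i, c in enumerate(lasso_coefficients) if c != 0.0]
```

Here the two small coefficients are removed entirely, while the two large ones are kept but shrunk by `lam`, which is the variable selection behaviour described above.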

Transfer Learning
One of the key obstacles of applying machine learning to materials discovery is the limited training data for most figure of merit (FOM) properties such as ion conductivity, thermal conductivity, and formation energy [26]. Here we propose to use a structural information enabled transfer learning method to train a screening model. The basic idea is to first train a formation energy prediction model based on structural and elemental features (model M1 in Figure 2), which usually has high generalization performance due to the informative structural features. This model is then used to predict the formation energies of a large number of samples in the D2 dataset, which have structural information but no formation energy. This formation energy annotation step gives us an enlarged dataset (D1 + D2) with a large number of samples with formation energy values. It is then feasible to train a deep learning (CNN) based screening model (M2 in Figure 2) on the enlarged labeled dataset using the Magpie elemental features only. The independence of M2 from structural features is essential here, as most hypothetical materials only have composition and/or stoichiometry information without crystal structure information.
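The data flow of this transfer learning scheme (train M1 on D1, annotate D2, train M2 on D1 + D2 with elemental features only) can be sketched on synthetic data. Everything in the example is an assumption made for illustration: a k-nearest-neighbour regressor stands in for both the GBR model (M1) and the CNN model (M2), each material has a single "structural" and a single "elemental" feature, and the toy "formation energy" is a made-up function.

```python
import random


def knn_regressor(X, y, k=3):
    """Return predict(x) that averages the targets of the k nearest samples."""
    def predict(x):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(row, x)), target)
            for row, target in zip(X, y)
        )
        nearest = dists[:k]
        return sum(t for _, t in nearest) / len(nearest)
    return predict


random.seed(0)


def make_sample():
    """Synthetic material: (structural feature, elemental feature, energy)."""
    structural = random.uniform(0, 1)
    elemental = structural + random.gauss(0, 0.05)  # correlated by construction
    energy = -2.0 * structural  # toy "formation energy"
    return structural, elemental, energy


d1 = [make_sample() for _ in range(30)]   # small labelled set
d2 = [make_sample() for _ in range(300)]  # large set, labels treated as unknown

# M1: trained on D1 with structural + elemental features.
m1 = knn_regressor([(s, e) for s, e, _ in d1], [y for _, _, y in d1])

# Annotate D2 with M1 predictions; the true energies in d2 are never used.
d2_labels = [m1((s, e)) for s, e, _ in d2]

# M2: trained on D1 + D2 using the elemental feature only.
x_m2 = [(e,) for _, e, _ in d1] + [(e,) for _, e, _ in d2]
y_m2 = [y for _, _, y in d1] + d2_labels
m2 = knn_regressor(x_m2, y_m2, k=5)

# Screening step: a new candidate needs only its elemental feature.
prediction = m2((0.5,))
```

The point of the design is visible in the last line: once M2 is trained, no structural input is required to score a candidate.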

Convolutional Neural Network Model
As one of the most successful deep learning models, the convolutional neural network has a special structure compared to traditional neural network models. With multiple input channels, it can receive multi-dimensional input data, and its weight sharing structure greatly reduces the complexity of the network model and the amount of computation. Neurons that share weights apply layer-by-layer feature mappings to build a multi-level representation of the input data. A general convolutional neural network consists of an input layer, one or more convolutional layers, one or more sampling/pooling layers, a few fully connected layers, and an output layer.
As the core structure in the convolutional neural network, the convolutional layer slides convolution kernels of different scales over the input data with shared weights and extracts different levels of features from the same data sample through different parameter distributions. The extracted feature maps are stacked in different channels of the network to form a high-dimensional data matrix to be used in the next calculation. The calculation formula of the convolution kernel is as follows:

$$x_k^l = f\Big(\sum_{j \in M_k} x_j^{l-1} * \omega_{jk}^l + b_k^l\Big) \tag{5}$$

In Equation (5), $x_k^l$ represents the kth feature of the lth layer; $x_j^{l-1}$ represents the output of the jth feature of the previous layer; $\omega_{jk}^l$ represents the convolution kernel connecting the jth feature of layer $l-1$ to the kth feature of layer $l$; $b_k^l$ represents the offset of the kth feature of layer $l$; $M_k$ is the set of features of layer $l-1$ that enter the convolution operation between layer $l-1$ and layer $l$; $*$ denotes the convolution operation; and $f$ is the activation function.
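Equation (5) can be traced by hand on a small case. The sketch below computes one output feature from two input feature maps, assuming stride 1, no padding, and ReLU as the activation; like most deep learning libraries it does not flip the kernel (a cross-correlation). The input maps, kernels, and bias are made-up numbers, and the function names are introduced for this example.

```python
def conv2d_valid(feature_map, kernel):
    """Slide the kernel over the input without padding (stride 1).

    The kernel is not flipped, i.e. this is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    return [
        [
            sum(
                feature_map[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(v):
    """Activation function f: pass positive values, clip the rest to zero."""
    return v if v > 0 else 0.0


def conv_layer_output(inputs, kernels, bias):
    """One output feature: f(sum over input maps of (map * kernel) + bias)."""
    maps = [conv2d_valid(x, w) for x, w in zip(inputs, kernels)]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [relu(sum(m[r][c] for m in maps) + bias) for c in range(cols)]
        for r in range(rows)
    ]


x1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
x2 = [[1, 0, 0], [0, 0, 0], [0, 0, 2]]
w1 = [[-1, 0], [0, 1]]
w2 = [[1, 1], [1, 1]]
out = conv_layer_output([x1, x2], [w1, w2], bias=0.5)
print(out)  # [[5.5, 4.5], [4.5, 6.5]]
```

The same two kernels are reused at every position of the input, which is the weight sharing mentioned above.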
The pooling layer [36] is also called the downsampling layer. After features are extracted by a convolutional layer, a pooling layer is usually used to downsample them, which reduces the data dimension and the computational overhead under limited computing resources and time.
The feature map produced by the preceding convolutional layer is divided into non-overlapping rectangular regions. Taking the maximum value of each region is called maximum pooling, and taking the average is called average pooling. Figure 3 shows the matrices obtained by maximum pooling and average pooling of the convolutional layer output matrices.
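The two pooling operations described above can be checked on a small matrix. This is a minimal sketch with a made-up 4 × 4 feature map and 2 × 2 non-overlapping regions; it is not the matrix of Figure 3, and `pool2d` is a name introduced for this example.

```python
def pool2d(matrix, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size block."""
    rows, cols = len(matrix), len(matrix[0])
    return [
        [
            reduce_fn(
                [matrix[r + i][c + j] for i in range(size) for j in range(size)]
            )
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]
max_pooled = pool2d(feature_map, 2, max)   # [[7, 8], [9, 6]]
avg_pooled = pool2d(feature_map, 2, mean)  # [[4.0, 5.0], [4.5, 3.0]]
```

Both variants reduce the 4 × 4 input to 2 × 2, which is the dimension reduction the pooling layer is used for.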
Appl. Sci. 2020, 10, x FOR PEER REVIEW
The fully connected layer is located at the end of the convolutional neural network and is a common layer for connecting the output features extracted from the convolutional layers to the prediction output to implement tasks such as classification or regression. All neuron nodes in the fully connected layer are connected to the neuron nodes of the previous layer network, and the high-dimensional data obtained from the previous network layers is tiled as an input. Through the activation function which carries out nonlinear transformation, the fully connected layers learn the mapping of extracted abstract features to predict the desired output.

The Convolutional Neural Network Training Process
Training a convolutional neural network means fitting the network to known samples so that it learns the mapping between input and output. It is divided into two stages: forward propagation and backward propagation [37]. In forward propagation, a sample is fed into the network, and the output value is computed from the convolution kernel weights, the offsets, the activation functions, and the fully connected layer parameters. Backpropagation first calculates the error between the predicted value obtained from forward propagation and the true value of the sample, then propagates this error backward to obtain the error information of each layer, and uses the calculated gradients to adjust the network parameters until the network converges or reaches the specified iteration termination condition.
The forward propagation step feeds the training sample into the network and initializes the parameters of each layer of the network. Through the layer-by-layer calculation of the network, the output corresponding to the input sample under the current network parameters is obtained, that is, a forward propagation is completed. In classification, the output vector characterizes the probability distribution of the sample belonging to the corresponding category, which is calculated by the convolutional neural network. In regression, the output is just a real vector which can be compared with the desired values to calculate the regression loss.
The predicted value of the forward propagation output of the convolutional neural network is compared with the true value, and their difference is defined as the loss. For regression, the mean square error loss function is usually used. Then, according to the obtained loss value, the parameters of each layer of the network are adjusted and updated to minimize the loss function. The parameters that need to be adjusted in the convolutional network are the weights and offsets of the fully connected layers and the weights and offsets of the convolutional layers.
For the fully connected layers, the weights and offsets are updated with the back propagation algorithm of ordinary networks. The formulas are as follows:

$$w_{ij}^l \leftarrow w_{ij}^l - \eta \frac{\partial E}{\partial w_{ij}^l} \tag{6}$$

$$b_i^l \leftarrow b_i^l - \eta \frac{\partial E}{\partial b_i^l} \tag{7}$$

In Equations (6) and (7), $w_{ij}^l$ is the weight of the full connection of layer $l$, $b_i^l$ is the offset of the full connection of layer $l$, $\eta$ is the network learning rate, and $E$ denotes the loss function. According to the above formulas, adjusting the network parameters essentially requires the partial derivatives of the loss function with respect to the weights and the offsets, which follow from the chain rule:

$$\frac{\partial E}{\partial w_{ij}^l} = a_j^l \, \delta_i^{l+1} \tag{8}$$

$$\frac{\partial E}{\partial b_i^l} = \delta_i^{l+1} \tag{9}$$

In Equation (8), $a_j^l$ is the jth neuron input of the lth layer, and $\delta_i^{l+1}$ in Equation (9) is the ith neuron error of the $(l+1)$th fully connected layer. Through the above two formulas, the partial derivatives of the loss function with respect to the weights and the offsets can be obtained, thereby completing the update of the parameters of the fully connected layers.
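The fully connected update rule can be checked numerically on a single linear layer. The sketch below assumes the loss E = 0.5 · Σ (z_i − t_i)² and a linear output, so that the error term is δ_i = z_i − t_i, the weight gradient is a_j · δ_i, and the offset gradient is δ_i. The numbers, the learning rate, and the function names are illustrative assumptions, not values from the study.

```python
def forward(weights, biases, a):
    """z_i = sum_j w_ij * a_j + b_i (linear output, as used for regression)."""
    return [
        sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
        for row, b_i in zip(weights, biases)
    ]


def loss(weights, biases, a, target):
    """E = 0.5 * sum_i (z_i - t_i)^2."""
    z = forward(weights, biases, a)
    return 0.5 * sum((z_i - t_i) ** 2 for z_i, t_i in zip(z, target))


def sgd_step(weights, biases, a, target, eta):
    """One gradient descent update of the weights and offsets."""
    z = forward(weights, biases, a)
    delta = [z_i - t_i for z_i, t_i in zip(z, target)]  # dE/dz_i
    new_weights = [
        # dE/dw_ij = a_j * delta_i
        [w_ij - eta * a_j * d_i for w_ij, a_j in zip(row, a)]
        for row, d_i in zip(weights, delta)
    ]
    # dE/db_i = delta_i
    new_biases = [b_i - eta * d_i for b_i, d_i in zip(biases, delta)]
    return new_weights, new_biases


weights = [[0.2, -0.1, 0.4]]
biases = [0.0]
a = [1.0, 2.0, 3.0]   # layer input
target = [1.0]        # desired output
loss_before = loss(weights, biases, a, target)
for _ in range(20):
    weights, biases = sgd_step(weights, biases, a, target, eta=0.05)
loss_after = loss(weights, biases, a, target)
```

With these values each step shrinks the output error by a constant factor, so the loss falls monotonically toward zero.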
The weight update formula of the convolutional layer is similar to that of the fully connected layer. The difference is that the partial derivative involves a 180° rotation of the output of the previous layer. The partial derivative formula is as follows:

$$\frac{\partial E}{\partial w_{ij}^l} = \mathrm{rot180}\big(a_j^l\big) * \delta_i^{l+1} \tag{10}$$

where $a_j^l$ is the jth neuron input of the lth convolutional layer, $\delta_i^{l+1}$ is the ith neuron error of the $(l+1)$th convolutional layer, and $E$ denotes the loss function. For the offset parameters of the convolutional layer, the update method is slightly different. This is because in the convolutional layer the error $\delta_i^{l+1}$ is a three-dimensional tensor, while the offset $b_i^l$ is a vector, so the offset update method of the fully connected layer cannot be used. The approach is to sum the sub-matrix terms $(\delta_i^{l+1})_{u,v}$ of the error $\delta_i^{l+1}$ to obtain the error vector, which is the gradient of $b_i^l$; $u$ and $v$ index the height and width of the gradient of the output image when it is propagated backward. The partial derivative formula is as follows:

$$\frac{\partial E}{\partial b_i^l} = \sum_{u,v} \big(\delta_i^{l+1}\big)_{u,v} \tag{11}$$

After the parameters are updated, the training samples are fed into the updated convolutional network model again for forward and backward propagation until the network converges or reaches the specified iteration termination condition, completing the training process.
Figure 4 shows our convolutional neural network model for predicting material formation energy. The CNN input is a 12 × 11 fixed size two-dimensional matrix. The structure of the CNN model consists of three convolutional layers and four fully connected layers. The output of the last convolutional layer is flattened into a one-dimensional vector as input to the subsequent fully connected layer. Both the convolutional layers and the fully connected layers use ReLU as the activation function because it is fast, helps address the vanishing gradient problem, and adds sparsity to the network. The output of the network is a continuous value, which is the predicted formation energy.
We train the CNN with the Adam optimizer and the MAE (mean absolute error) loss function. The Adam optimizer combines the advantages of multiple optimizers and performs well in many applications. We used 10-fold cross-validation in the assessment, with root mean square error (RMSE), MAE, and R² as the measures for evaluating the CNN and the other machine learning algorithms.
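The three evaluation measures named above can be computed as in the following minimal sketch. It uses the standard textbook definitions in plain Python; the function names and the toy values are illustrative assumptions and are not taken from the paper's code or data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy formation energies in eV/atom (hypothetical values, for illustration only).
y_true = [-1.0, -0.5, 0.0, 0.5]
y_pred = [-0.9, -0.6, 0.1, 0.4]
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

In a 10-fold setting these functions would be applied to the held-out fold of each split and the results averaged.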
Appl. Sci. 2020, 10, x FOR PEER REVIEW 10 of 20
Figure 4. Convolutional neural network architecture for material formation energy prediction.

Verifying Whether a Screened ABX3 Material Is Perovskite or Non-Perovskite
An ABX3 material that passes the M2 screening is not necessarily a perovskite. To verify whether these ABX3 candidates are stable perovskites, we use the tolerance factor τ proposed by Ghiringhelli et al. [38], which classifies an ABX3 material as perovskite or non-perovskite. Because τ requires only the chemical composition, it can be applied to candidates whose structure is unknown. Beyond this binary decision, τ also provides a monotonic estimate of the probability that a material is stable in the perovskite structure. Its accuracy, its probabilistic interpretation, and its applicability to both single and double perovskites provide new physical insight into the stability of the perovskite structure. The formula for τ is as follows:
$$\tau = \frac{r_X}{r_B} - n_A\left(n_A - \frac{r_A/r_B}{\ln\left(r_A/r_B\right)}\right)$$
where $n_A$ is the oxidation state of the A cation, $r_A$ and $r_B$ are the ionic radii of the A and B cations with $r_A > r_B$, and $r_X$ is the ionic radius of the X anion. A key aspect of the performance of τ is the degree to which the sum of ionic radii estimates the interatomic bond distance for a given structure.
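As an illustration of how τ can be evaluated from composition data alone, the sketch below implements the commonly cited closed form of this tolerance factor, τ = r_X/r_B − n_A·(n_A − (r_A/r_B)/ln(r_A/r_B)), with the 4.18 threshold used later in this paper. The function names are hypothetical, and the SrTiO3 radii are approximate Shannon values chosen for illustration, not inputs taken from this work.

```python
import math


def tolerance_factor(r_a, r_b, r_x, n_a):
    """Tolerance factor tau from ionic radii (angstrom) and the A-site oxidation state.

    Raises ValueError when r_a <= r_b, because tau is only defined for r_a > r_b.
    """
    if r_a <= r_b:
        raise ValueError("tau requires r_a > r_b")
    ratio = r_a / r_b
    return r_x / r_b - n_a * (n_a - ratio / math.log(ratio))


def is_likely_perovskite(r_a, r_b, r_x, n_a, threshold=4.18):
    """Return True when tau falls below the stability threshold."""
    return tolerance_factor(r_a, r_b, r_x, n_a) < threshold


# Illustrative input: SrTiO3 with approximate Shannon radii (assumed values).
tau_srtio3 = tolerance_factor(r_a=1.44, r_b=0.605, r_x=1.40, n_a=2)
print(round(tau_srtio3, 2))  # about 3.80, below the 4.18 threshold
```

Raising an error for r_a <= r_b is one possible design choice; a screening pipeline could equally discard such candidates silently.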

Selection of the Best Material Features and Analysis of Feature Importance
Features, or descriptors, are an important part of machine learning models, and the choice of descriptors generally has a large impact on the prediction results. To show that our calculated material descriptors (referred to as Hybrid_descriptors from now on) are suitable for predicting the formation energy of perovskites, we compare them with the descriptors proposed by Ong et al. [33] (Ong_Descriptors) and the Magpie features. The three feature sets are used to predict the perovskite formation energy with the same algorithm and dataset, using RMSE, MAE, and R² as evaluation measures under 10-fold cross-validation. The results are shown in Table 2. Our feature set outperforms the other two on all three evaluation criteria.
In addition to this comparison, we analyzed the importance of the features using the random forest approach. A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two parts so that similar response values end up in the same part. When a tree is trained, one can compute how much each feature decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature is averaged over the trees, and the features are ranked according to this measure. Figure 5 shows the importance scores for all features. The figure shows that the Pauling electronegativity has a considerable influence on the formation energy of perovskites. Notably, the importance score of X is more than twice that of the second-ranked feature, which is consistent with the analyses in [27,39].
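The impurity-based ranking described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used in the paper: variance is assumed as the impurity measure for regression, each tree is represented only by a list of (feature, impurity decrease) pairs, and the feature names are hypothetical.

```python
from collections import defaultdict


def variance(values):
    """Population variance, used here as the regression impurity."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease of one split; weights are node size / n_total."""
    def weighted(part):
        return len(part) / n_total * variance(part)
    return weighted(parent) - weighted(left) - weighted(right)


def forest_importances(trees):
    """Average per-tree normalized importances and rank features in descending order.

    `trees` is a list with one entry per tree; each entry is a list of
    (feature_name, impurity_decrease) pairs, one pair per split node.
    """
    totals = defaultdict(float)
    for splits in trees:
        per_tree = defaultdict(float)
        for feature, decrease in splits:
            per_tree[feature] += decrease
        tree_sum = sum(per_tree.values())
        if tree_sum == 0:
            continue  # a tree without useful splits contributes nothing
        for feature, value in per_tree.items():
            totals[feature] += value / tree_sum
    averaged = {f: v / len(trees) for f, v in totals.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# One split that separates the responses perfectly removes all impurity.
print(impurity_decrease([0.0, 0.0, 2.0, 2.0], [0.0, 0.0], [2.0, 2.0], n_total=4))

# Two toy trees with hypothetical features "electronegativity" and "radius".
toy_trees = [
    [("electronegativity", 0.6), ("radius", 0.2), ("electronegativity", 0.2)],
    [("electronegativity", 0.5), ("radius", 0.5)],
]
ranking = forest_importances(toy_trees)
print(ranking)
```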

Figure 5. Importance scores of all features (axes: importance (%) and features ranked by importance).

Performance of the M1 Model with Hybrid Structural and Elemental Features
First, we compared the performance of our GBR model with other ML models (RFR, Lasso, and SVR), using the hybrid structural and elemental descriptors as raw features. To obtain stable results, each algorithm was evaluated with ten repetitions of 10-fold cross-validation. Figure 6 shows the fitting accuracy of all models using the same number of samples. GBR has the best prediction performance, followed by RFR, and SVR is the worst. The RMSE, MAE, and R² of the GBR model are 0.28 eV/atom, 0.20 eV/atom, and 0.91, respectively. As shown in Table 3, GBR achieves the best scores on all three measures among these ML models. We also tried other machine learning models (such as linear regression and K-neighbor regression) for comparison; their predictions are extremely poor and are therefore not listed here.
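The evaluation protocol above, 10-fold cross-validation repeated ten times, can be sketched as an index generator. This is a minimal plain-Python illustration with a hypothetical function name and an assumed fixed seed; it makes no claim about how the splits were produced in the original experiments.

```python
import random


def repeated_kfold_indices(n_samples, n_splits=10, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for each fold of each repetition.

    Every repetition reshuffles the samples, so the folds differ between repeats.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal the shuffled indices into n_splits folds of near-equal size.
        folds = [order[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test_idx = folds[k]
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train_idx, test_idx


# Small demonstration: 25 samples, 5 folds, 2 repetitions give 10 train/test splits.
splits = list(repeated_kfold_indices(25, n_splits=5, n_repeats=2))
print(len(splits))
```

A model would be fitted on each `train_idx` and scored on the matching `test_idx`, and the scores averaged over all splits.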

Table 3. RMSE (eV/atom), MAE (eV/atom), and R² values of 10-fold cross-validation results of all prediction models using the hybrid descriptor.
The high accuracy of our GBR prediction model can be attributed to two factors. One is the nonlinear nature of the GBR algorithm; the other may be the structural similarity of the materials in the data set, since all of them have the perovskite crystal structure. Given the same structure, materials with similar chemical compositions may have similar properties, which makes it more feasible for a predictive model to interpolate properties. A discussion of structural similarity is beyond the scope of this work, although it has become an important topic in ML research in materials science.

Performance of M2 Perovskite Screening Model
Before evaluating the M2 model, a convolutional neural network, we compared its prediction performance before and after data augmentation. First, we used the M1 model to label the D2 dataset, which contains structures but no formation energy labels, and then calculated the Magpie descriptors for datasets D1 and D2. We then trained M2 to predict the perovskite formation energy in two settings: on the pooled D1 and D2 datasets, and on the D1 dataset alone. Note that the labels of the D2 dataset are obtained from the transfer learning model M1, whereas the labels of the D1 dataset are calculated by density functional theory (DFT). The results are shown in Figure 7: with the support of the M1 model, the prediction performance is significantly improved. This also confirms that the hybrid material features we proposed are effective.
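The pooling step described above can be sketched as follows. This is a schematic illustration only: `pool_with_pseudo_labels`, the record layout, and the two toy stand-in functions are hypothetical, and the real M1 model and Magpie featurizer are replaced by trivial placeholders.

```python
def pool_with_pseudo_labels(d1, d2, m1_predict, featurize):
    """Build the enlarged training set for the screening model.

    d1: list of (composition, structural_features, dft_energy) records.
    d2: list of (composition, structural_features) records without labels.
    m1_predict: callable mapping structural features to a formation energy.
    featurize: callable mapping a composition to composition-only features.
    """
    pooled = []
    for composition, _structure_feats, dft_energy in d1:
        pooled.append({"x": featurize(composition), "y": dft_energy, "source": "DFT"})
    for composition, structure_feats in d2:
        # D2 labels come from the structure-based model, not from DFT.
        pooled.append({"x": featurize(composition), "y": m1_predict(structure_feats), "source": "M1"})
    return pooled


# Toy stand-ins (assumptions): neither function reflects the real models.
def toy_m1_predict(structure_feats):
    return -0.1 * sum(structure_feats)


def toy_featurize(composition):
    return [float(len(composition))]


d1 = [("AAA", [1.0, 2.0], -1.5)]
d2 = [("BBBB", [3.0, 1.0]), ("CC", [0.5, 0.5])]
training_set = pool_with_pseudo_labels(d1, d2, toy_m1_predict, toy_featurize)
print(len(training_set), [row["source"] for row in training_set])
```

Keeping the `source` field makes it easy to train on D1 alone or on the pooled set and compare the two settings.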
To further verify that our convolutional neural network is the best choice as a screening model for perovskites, we compared it with the ElemNet model proposed by Ward et al. [40], a 17-layer deep neural network that uses only the elemental composition to predict the formation energy of a material, which can then be used for screening. We also compared it with common machine learning models (RF, GBR, SVR, linear regression, K-neighbor regression, etc.), among which RF and GBR performed best. Table 4 shows the results of our convolutional neural network model, the ElemNet model, and the two best traditional machine learning models, RF and GBR. Our CNN model is the best in terms of RMSE, MAE, and R², and is therefore used as the ML model for screening hypothetical perovskites. All of the above models are based on the same data set and were trained and tested using 10-fold cross-validation.

Screening Results Analysis
Because our CNN model makes fast and robust predictions, we used it to screen 21,316 hypothetical ABX3 materials for new, stable perovskites. The M2 model predicted a formation energy below 0 for 4147 ABO3, 5106 ABBr3, 5236 ABCl3, and 4279 ABI3 candidates. The distribution of the predicted formation energies is shown in Figure 8.
A candidate with a negative formation energy is not necessarily a perovskite. We therefore applied the tolerance factor τ discussed above as a second filter. A candidate with τ less than 4.18 has a high probability of being a stable perovskite. Because τ requires the ionic radius of the A cation to be greater than that of the B cation, a portion of the candidates is filtered out at this step. After screening with τ, 625 ABO3, 52 ABBr3, 55 ABCl3, and 32 ABI3 candidates with τ less than 4.18 remain.
By reviewing the literature, we found that 98 of these ABO3 compounds were reported in [27], where they were shown to be stable by DFT calculations. These 98 ABO3 compounds are listed in Table 5. In addition, it was reported [41] that lanthanum-doped BaSnO3 can be used as the electron transport layer of a highly efficient and stable solar cell; BaSnO3 appears in our screening results. Beyond these 98 compounds, SrMoO3, another of the 625 ABO3 predictions, was shown to be paramagnetic [42]. Among the 32 ABI3 candidates, CsPbI3 was shown [43] to be stable under ambient conditions when surface-coated with surfactants, and it is expected to be used for light harvesting or LEDs. Among the 52 ABCl3 candidates, CsPbCl3 was studied by DFT [44], which attributed the reduction of the band gap to the confinement effect of carriers and the increasing number of perovskite layers. Among the 55 ABBr3 candidates, Anni et al. [45] reported for the first time the temperature dependence of the amplified spontaneous emission (ASE) characteristics of CsPbBr3 nanocrystalline thin films, and Swarnkar et al. [46] verified the luminescence of colloidal CsPbBr3 perovskite nanocrystals, which surpass traditional quantum dots.
Overall, extensive literature inspection shows that our model makes reasonable predictions and can be used for the discovery of new perovskite materials. The top 200 predicted new perovskites are listed in Table 6, and the complete set of predicted perovskites is provided in the supplementary file. The remaining, not yet reported perovskite materials are also promising.
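The two-stage screening described in this section (negative predicted formation energy, then r_A > r_B and τ < 4.18) can be sketched as a simple filter. The `Candidate` record, the `screen` function, and the toy entries are hypothetical and serve only to illustrate the order of the checks; the thresholds 0 and 4.18 are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    formula: str             # label of the hypothetical ABX3 composition
    predicted_energy: float  # formation energy from the screening model, eV/atom
    r_a: float               # ionic radius of the A cation, angstrom
    r_b: float               # ionic radius of the B cation, angstrom
    tau: float               # precomputed tolerance factor


def screen(candidates, energy_cutoff=0.0, tau_cutoff=4.18):
    """Keep candidates that pass the energy filter and then the tau filter."""
    selected = []
    for c in candidates:
        if c.predicted_energy >= energy_cutoff:
            continue  # stage 1: require negative predicted formation energy
        if c.r_a <= c.r_b:
            continue  # tau is only defined for r_a > r_b
        if c.tau >= tau_cutoff:
            continue  # stage 2: require tau below the stability threshold
        selected.append(c)
    return selected


# Toy candidates with made-up numbers (not results from this work).
toy_candidates = [
    Candidate("cand-1", -1.20, 1.44, 0.61, 3.80),  # passes both stages
    Candidate("cand-2", 0.30, 1.40, 0.60, 3.90),   # fails the energy filter
    Candidate("cand-3", -0.50, 0.60, 0.90, 3.50),  # fails r_a > r_b
    Candidate("cand-4", -0.80, 1.20, 0.70, 5.10),  # fails the tau filter
]
print([c.formula for c in screen(toy_candidates)])
```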

Conclusions
In this paper, we proposed deep neural network based transfer learning with a hybrid descriptor set for perovskite formation energy prediction. The hybrid descriptors are composed of structural and elemental features calculated with the pymatgen library. Using these 31 features, our transfer learning approach addresses the small data issue that is typical of machine learning based materials discovery. It works by first training an annotation model on perovskite data sets with known structures and then using it to predict the formation energy of unannotated perovskite materials with structures. The experimental results show that the proposed hybrid descriptors perform better than the Ong_Descriptors and the Magpie descriptors in predicting the perovskite formation energy. Moreover, the gradient boosting regression model that we used outperformed the random forest regression, Lasso, and support vector regression models.
Based on the transfer learning method, we established a convolutional neural network screening model for perovskites. We first labeled unannotated perovskite materials with the high-precision, structure-based model, and then built a CNN and trained it with the Magpie descriptors to obtain a generic screening model, the M2 model, which does not require structural information. Compared with the ElemNet model and several machine learning models, our CNN model is the best at predicting the formation energy of perovskites from composition information alone. The CNN model was used to filter ABX3 materials with predicted formation energy greater than 0 out of the 21,316 candidates, and the new tolerance factor τ was then used to verify whether each remaining material is a stable perovskite. Extensive literature inspection showed that many of the predicted perovskite materials have been reported in the literature; the rest are subject to further verification by experiment or DFT calculation.