Learning Interactions in Reaction Diffusion Equations by Neural Networks

Partial differential equations are common models in biology for predicting and explaining complex behaviors. Nevertheless, deriving the equations and estimating the corresponding parameters from data remains challenging. In particular, a fine description of the interactions between species requires care to account for various regimes, such as saturation effects. We apply a method based on neural networks to discover, from observed data, the underlying PDE systems, which involve fractional terms and may also contain integration terms. Our proposed framework, called Frac-PDE-Net, adapts PDE-Net 2.0 by adding layers designed to learn fractional and integration terms. The key technical challenge of this task is the identifiability issue. More precisely, one needs to identify the main terms and combine similar terms among a huge number of candidates in fractional form generated by the neural network scheme due to the division operation. In order to overcome this barrier, we set up certain assumptions in line with realistic biological behavior. Additionally, we use an L2-norm based term selection criterion and sparse regression to obtain a parsimonious model. It turns out that Frac-PDE-Net is capable of recovering the main terms with accurate coefficients, allowing for effective long-term prediction. We demonstrate the merits of the method on a biological PDE model proposed to study the pollen tube growth problem.


Introduction
Two-component reaction-diffusion systems often model the interaction of two chemicals, leading to the formation of non-uniform spatial patterns of chemical concentration or morphogenesis under certain conditions due to chemical reactions and spreading. Since Turing's groundbreaking work [1], reaction-diffusion systems have been extensively used in developmental biology modeling. For example, let u = u(x, y, t) and v = v(x, y, t) represent the concentrations of two chemical species, which may either enhance or suppress each other depending on the context. The system of u and v can be modeled as follows:

∂_t u = d_1 Δu + N_1(u, v), ∂_t v = d_2 Δv + N_2(u, v), (1)

where Δ = ∂_x^2 + ∂_y^2 denotes the Laplacian operator, d_1 and d_2 are positive diffusion coefficients, and N_1 and N_2 describe the interactions between u and v. The functions N_1 and N_2 are sums of various reaction terms that can be derived from physical or chemical principles such as mass-action laws, Michaelis-Menten kinetics, or products representing competition or cooperation effects. We refer the readers to ([2], Section 2.2) for more discussion. Hence, N_1 and N_2 are sums of meaningful functions that represent specific mechanisms: if we are able to identify these terms and discover the explicit formulas for N_1 and N_2, then we can learn more about the nature of the interactions and predict future behaviors well. This situation arises commonly in biological applications such as chemotaxis, pattern formation in developmental biology, and the cell polarity phenomenon [3,4].
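To make the setup concrete, a two-component reaction-diffusion system of this form can be advanced in time with a simple explicit finite-difference scheme. The sketch below is illustrative rather than the paper's code; the interaction terms N1 and N2 passed in are toy placeholders.

```python
import numpy as np

def laplacian(w, dx):
    # 5-point stencil; edge padding gives zero-flux (Neumann) boundaries
    wp = np.pad(w, 1, mode="edge")
    return (wp[:-2, 1:-1] + wp[2:, 1:-1]
            + wp[1:-1, :-2] + wp[1:-1, 2:] - 4.0 * w) / dx**2

def euler_step(u, v, N1, N2, d1, d2, dx, dt):
    # one forward-Euler step of u_t = d1*Lap(u) + N1(u,v), v_t = d2*Lap(v) + N2(u,v)
    u_new = u + dt * (d1 * laplacian(u, dx) + N1(u, v))
    v_new = v + dt * (d2 * laplacian(v, dx) + N2(u, v))
    return u_new, v_new
```

For instance, N1 = lambda u, v: u*v - u and N2 = lambda u, v: -u*v give a simple activation-depletion interaction; iterating euler_step generates the space-time data that the learning methods below take as input.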
Cell polarity plays a vital role in cell growth and function for many cell types, affecting cell migration, proliferation, and differentiation. A classic example of polar growth is pollen tube growth, which is controlled by the Rho GTPase (ROP1) molecular switch. Recent studies have revealed that the localization of active ROP1 is regulated by both positive and negative feedback loops, and that calcium ions play a role in ROP1's negative feedback mechanism. Initially, ROP1 resides inside the cell. Through positive feedback (rate k_pf), some of the ROP1 is recruited to the membrane. At the same time, negative feedback (rate k_nf) returns some of it to the cell interior, while the rest diffuses on the membrane (rate D_r). Calcium ions follow a similar process with positive rate k_ac, negative rate k_dc, and diffusion rate D_c. In [5,6], the following two-component reaction-diffusion system (2) is introduced: with suitable initial and boundary conditions, it is proposed to quantitatively describe the spatial and temporal coupling between ROP1 and calcium ions, which leads to rapid oscillations in their distributions on the cell membrane. Here, R = R(x, t) and C = C(x, t), and R_t, C_t, R_x, R_xx, C_x and C_xx are abbreviated notations for partial derivatives with respect to the time t or the spatial variable x. Moreover, the nonlinear function g(C) characterizes how calcium ions participate in ROP1's negative feedback loop. Specifically, active ROP1 causes an increase in Ca2+ levels, which in turn reduces ROP1 activity and lowers its levels; meanwhile, the flow of Ca2+ slows down as ROP1 drops. Ref. [6] proposed the function g(C) = C^2/(C^2 + k_c^2) to describe such spatial-temporal patterns of calcium, where k_c is a positive constant. Based on this model, ref. [6] developed a modified gradient matching procedure for parameter estimation, including k_nf and k_c. However, it requires that g(C) in (2) be a known function.
In this work, we propose to apply neural network methods to uncover the function g(C) or, more broadly, to learn the interaction terms N_1 and N_2 in general reaction-diffusion PDEs (1), which may contain fractional expressions (Figure 1). In the past decade, the artificial intelligence community has focused increasingly on neural networks, which have become crucial in many applications, especially those involving PDEs. Deep learning-based approaches to PDEs have made substantial progress and are well studied, both for forward and inverse problems. For forward problems with appropriate initial and boundary conditions in various domains, several methods have been developed to accurately predict dynamics (e.g., [7][8][9][10][11][12][13][14][15][16][17]). For inverse problems, there are two classes of approaches. The first class focuses on inferring coefficients from known data (e.g., [7,10,12,15,18,19]). An example is the widely known PINN (Physics-informed Neural Networks) method [10], which uses PDEs in the loss function of neural networks to incorporate scientific knowledge. Ref. [7] improved the efficiency of PINNs with the residual-based adaptive refinement (RAR) method and created a library of open-source codes for solving various PDEs, including those with complex geometry. However, this method is only capable of estimating coefficients for fixed known terms in PDEs, and may not work well for discovering hidden PDE models. Although [9] extended the PINN method to find unknown dynamic systems, the nonlinear learner function remains a black box, and no explicit expressions of the discovered terms in the predicted PDE are available, making it difficult to interpret their physical meaning. The second class of approaches not only estimates coefficients, but also discovers hidden terms (e.g., [16,17,[20][21][22][23][24][25][26]).
An example is the PDE-Net method [16], which combines numerical approximations of convolutional differential operators with symbolic neural networks. PDE-Net can learn differential operators through convolution kernels, a natural approach for solving PDEs that has been well studied in [27]. This approach is capable of recovering terms in PDE models with explicit expressions and relatively accurate coefficients, but often produces many noisy terms that lack interpretation. In order to produce parsimonious models, refs. [25,26] proposed to create a regression model with the response variable ∂_t u and a matrix Θ whose columns are a collection of spatial derivative and polynomial functions (e.g., u, ∂_x u, u ∂_x u): ∂_t u = Θξ. The estimation of differential equations by modeling the time variations of the solution is known to produce consistent estimates [28]. In addition, Ridge regression with hard thresholding can be used to approximate the coefficient vector ξ. This sparse regression-based method generally results in a PDE model with accurately predicted terms and high-accuracy coefficients. However, few existing studies have focused on effectively recovering interaction terms in fractional form (i.e., one polynomial divided by another) in hidden partial differential equations, which is the focus of this paper.
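The regression formulation ∂_t u = Θξ can be sketched in a few lines. The candidate terms below are illustrative, not the full library used in the paper:

```python
import numpy as np

def build_library(u, u_x, u_xx):
    # assemble Theta: each column is one candidate term evaluated on the data
    cols = [u, u_x, u_xx, u * u_x, u**2]
    names = ["u", "u_x", "u_xx", "u*u_x", "u^2"]
    return np.column_stack([c.ravel() for c in cols]), names

def fit_coefficients(Theta, u_t):
    # least-squares estimate of xi in u_t = Theta @ xi
    xi, *_ = np.linalg.lstsq(Theta, u_t.ravel(), rcond=None)
    return xi
```

Plain least squares keeps every candidate term; the thresholded ridge regression discussed later is what drives most coefficients to zero and produces a parsimonious model.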
Previous methods for identifying the hidden terms in reaction-diffusion partial differential equation models have mostly focused on polynomial forms. However, as indicated in Equation (2), the model for ROP1 and calcium ion distribution also involves fractional and integral forms, which can pose identifiability issues when combined with polynomial forms. Furthermore, we want to attain a parsimonious model, as the interpretability of the PDE model is important for biologists to comprehend biological behavior and phenomena revealed by the model.
In this paper, we utilize a combination of a modified PDE-Net method (which adds fractional and integration terms to the original PDE-Net approach), an L 2 norm term selection criterion, and an appropriate sparsity regression. This combination proves to produce meaningful and stable terms with accurate estimation of coefficients. For ease of reference to this combination, we call it Frac-PDE-Net.
The paper is organized as follows. In Section 2, we explain the main idea and the framework of our proposed method Frac-PDE-Net. In Section 3, we apply Frac-PDE-Net to discover some biological PDE models based on simulation data. Then, in Section 4, we make some predictions to test the effectiveness of the models learned in Section 3. Finally, we summarize our findings and present some possible future works in Section 5.

Methodology
The main idea of the PDE-Net method, as described in [16], is to use a deep convolutional neural network (CNN) to study generic nonlinear evolution partial differential equations (PDEs) of the form

∂_t u = F(u, ∇u, ∇^2 u, ...), (3)

where u = u(z, t) is a function (scalar- or vector-valued) of the space variable z and the temporal variable t. Its architecture is a feed-forward network that combines the forward Euler method in time with the second-order finite difference method in space, implemented through special filters in the CNN that imitate differential operators. The network is trained to approximate the solution to the above PDEs and is then used to make predictions for subsequent time steps. The authors of [16] show that this approach is effective for a range of PDEs and achieves satisfactory accuracy and computational efficiency compared to traditional numerical methods. In this paper, we follow a similar framework to PDE-Net, but modify the symbolic network framework (SymNet^k_m) to better align with biological models.

PDE-Net Review
The feed-forward network consists of several ∆t-blocks, all of which use the same parameters optimized through minimizing a loss function. For simplicity, we will only show one ∆t-block for two-dimensional PDEs, as repeating it generates multiple ∆t-blocks, and the concept can easily be extended to higher-dimensional PDEs.
Denote the space variable z in (3) by z = (x, y), since we are dealing with the two-dimensional case. Let t_0 = 0 and let ũ(·, t_0) be the given initial data. For i ≥ 0, ũ(·, t_{i+1}) denotes the predicted value of u at time t_{i+1}, calculated from the predicted (true) value ũ(·, t_i) at time t_i by a forward Euler step:

ũ(·, t_{i+1}) = ũ(·, t_i) + Δt · SymNet(D_00 ũ, D_10 ũ, D_01 ũ, ...)(·, t_i),

where SymNet is an approximation operator of F. Here, the operators D_ij are convolution operators with underlying filters q_ij, i.e.,

D_ij u := (1/((Δx)^i (Δy)^j)) q_ij ⊗ u. (4)

These operators approximate the differential operators ∂^{i+j}/(∂x^i ∂y^j). By Taylor expansion,

q ⊗ u(x, y) = Σ_{i,j} m_ij(q) (Δx)^i (Δy)^j ∂^{i+j}u/(∂x^i ∂y^j)(x, y) + higher-order terms,

where m_ij(q) := (1/(i! j!)) Σ_{k,l} k^i l^j q[k, l] are the moments of the filter q. In particular, if we choose Δx = Δy = δ, then

q ⊗ u(x, y) = Σ_{i,j} m_ij(q) δ^{i+j} ∂^{i+j}u/(∂x^i ∂y^j)(x, y) + higher-order terms. (5)

As a result, the training of q can be performed through the training of M := (m_ij), since the moment matrix satisfies M = M(q). It is important to note that the trainable filters M (or q) must be carefully constrained to match differential operators. For example, to approximate ∂u/∂x by D_10 u, or equivalently by (1/Δx) q_10 ⊗ u for a 3 × 3 filter q_10, we may constrain the moment matrix as

M_1 = ( (0, 0, *), (1, *, *), (*, *, *) ) or M_2 = ( (0, 0, 0), (1, 0, *), (0, *, *) ), (6)

where * means no constraint on the corresponding entry. Generally, the fewer instances of * present, the more restrictions are imposed, leading to increased accuracy. In the example of (6), the choice of M_1 ensures 1st-order accuracy and the choice of M_2 guarantees 2nd-order accuracy. More precisely, if we plug M_1 into (5) with Δx = Δy = δ, then D_10 u = ∂u/∂x + O(δ), while M_2 yields D_10 u = ∂u/∂x + O(δ^2). In PDE-Net 2.0, all moment matrices are trained subject to partial constraints so that the accuracy is at least 2nd order.
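PDE-Net's filters are constrained through their moment matrices, and the computation is easy to check numerically. The sketch below assumes a common convention (filter offsets centered at zero, with x along axis 0); the actual indexing in PDE-Net 2.0 may differ.

```python
import numpy as np
from math import factorial

def moment_matrix(q):
    # m_ij(q) = (1/(i! j!)) * sum_{k,l} k^i l^j q[k,l], offsets centered at 0
    n = q.shape[0]
    k = np.arange(n) - n // 2
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            M[i, j] = (np.outer(k**i, k**j) * q).sum() / (factorial(i) * factorial(j))
    return M

# central-difference filter for d/dx (x along axis 0)
q10 = np.zeros((3, 3))
q10[2, 1], q10[0, 1] = 0.5, -0.5
```

The moment matrix of this q10 has m_00 = 0 and m_10 = 1, while m_01, m_20, m_11 and m_02 all vanish, matching the second-order constraint pattern M_2.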
The SymNet^k_m network, modeled after CNNs, is employed to approximate the multivariate nonlinear response function F. It takes an m-dimensional vector as input and consists of k layers. As depicted in Figure 2, the SymNet^2_m network has two hidden layers, where each f_i unit performs a dyadic multiplication and the output is appended to the (i + 1)th hidden layer. The loss function for this method has three components and is defined as follows:

L = L_data + λ_M L_M + λ_S L_S. (7)

Here, L_data measures the difference between the true data and the prediction. Consider the data set {u_j(·, t_i) ∈ R^{N_s × N_s} : 1 ≤ i ≤ n, 1 ≤ j ≤ N}, where n is the number of Δt-blocks, N is the total number of samples, and N_s is the number of space grids. The index j indicates the jth solution path with a certain initial condition of the unknown dynamics, and the index i represents the solution at time t_i. Then, we define

L_data := (1/(nN)) Σ_{i,j} ℓ_ij, where ℓ_ij := ||u_j(t_i, ·) − ũ_j(t_i, ·)||_2^2,

u_j representing the real data and ũ_j the predicted data. For a given threshold s, recall Huber's loss function

ℓ^s(x) = x^2/2 if |x| ≤ s, and ℓ^s(x) = s(|x| − s/2) otherwise. (8)

We then define L_M by applying ℓ^s entrywise to the moment matrices M(q_ij) of the filters q_ij and summing, and likewise L_S by applying ℓ^s to the parameters w_ij of SymNet and summing. The coefficients λ_M and λ_S in Equation (7) serve as regularization weights that control the magnitude of the parameters, preventing them from becoming too large and overfitting the training data.
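The three loss pieces can be sketched directly; ℓ^s is Huber's function from (8), and the two regularizers simply sum it over parameter entries as in (7). The function and argument names are ours, not the paper's.

```python
import numpy as np

def huber(x, s):
    # Huber's loss: quadratic for |x| <= s, linear beyond; applied elementwise
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= s, 0.5 * x**2, s * (np.abs(x) - 0.5 * s))

def total_loss(l_data, moment_entries, symnet_weights, lam_M, lam_S, s=1.0):
    # L = L_data + lam_M * L_M + lam_S * L_S, cf. (7)
    L_M = huber(moment_entries, s).sum()
    L_S = huber(symnet_weights, s).sum()
    return l_data + lam_M * L_M + lam_S * L_S
```

The quadratic-then-linear shape of Huber's loss penalizes small parameters smoothly while growing only linearly for large ones, which keeps the regularization from dominating the data term.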

mPDE-Net (Modified PDE-Net)
In mPDE-Net, we do not include multiplications between derivatives of u and v, as such interactions are not commonly present in biological phenomena. Additionally, to handle interactions in fractional or integral form, such as those in Equation (2), mPDE-Net incorporates integral terms and division operations into SymNet^k_m. However, there was a challenge with identifiability in mPDE-Net. For instance, consider a two-component input vector (u, v). mPDE-Net may produce results such as u^2/(u + ε) or uv/(v + ε), where ε is a small number due to noise. Although both of these terms essentially represent the same term u, mPDE-Net is unable to automatically identify them as such. Keeping all similar terms, such as u^2/(u + ε), uv/(v + ε), and u, at the same time would result in a complex model, and the real fractional term would not be effectively trained.
To address the identifiability issue, restrictions are imposed on the nonlinear interaction term N(u, v) by assuming that N(u, v) = g(u)h(v), where either g or h is linear and the other one can contain a fractional term with the order of the denominator larger than that of the numerator. For instance, the terms u^2/(u + ε) and uv/(v + ε) are further decomposed as follows:

u^2/(u + ε) = u − εu/(u + ε), uv/(v + ε) = u − εu/(v + ε).

As seen, the main part of both terms is u, while the remainders εu/(u + ε) and εu/(v + ε) are considered perturbations since ε is very small. This allows mPDE-Net to identify and combine the main parts of terms, resulting in a compact model.

Figure 3 presents an example of a system involving the derivatives of u and v up to the second order. The symbolic neural network in this example has five hidden layers, referred to as SymNet^5_10. The operators f_i are multiplication functions, i.e., f_i(η_i, ξ_i) = η_i ξ_i, for i = 1, 4, 5, and f_j are division functions, i.e., f_j(η_j, ξ_j) = η_j/ξ_j, for j = 2, 3. Additionally, a term u^α is included to incorporate fractional powers, such as the term R^α in (2). The algorithm corresponding to this example is outlined in Algorithm 1 (Scheme of mPDE-Net).

To further demonstrate the mPDE-Net approach, we present a concrete example. To simplify the notation, we introduce the row vector e_i = (0, 0, ..., 0, 1, 0, ..., 0), with a 1 in the ith component and 0 in all other components. Following Algorithm 1 for 1 ≤ i ≤ 5, the hidden layers can then be written out explicitly in terms of the vectors e_i.

Let L denote the library for PDE-Net 2.0 and L_f denote the library for mPDE-Net. The two libraries are clearly distinct: typically, L only seeks to identify multiplication terms (products of u, v, and their derivatives), whereas L_f is engineered to learn both multiplication terms and fractional terms, subject to certain constraints. The library L_f chosen in our paper is much larger than L.
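The decomposition used to split off the main part can be verified numerically; a minimal check of the two identities (ε a small positive number):

```python
def decompose_u2(u, eps):
    # u^2/(u+eps) = u - eps*u/(u+eps): main part u plus an O(eps) perturbation
    return u - eps * u / (u + eps)

def decompose_uv(u, v, eps):
    # u*v/(v+eps) = u - eps*u/(v+eps): same main part u
    return u - eps * u / (v + eps)
```

Since both perturbations are bounded by a constant times ε, discarding them and keeping the shared main part u merges the near-duplicate candidates into a single library term.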
Therefore, our framework of neural networks, built upon L f , is more challenging to implement than the original framework, which is based on L.

Optimizing Hyperparameters
In this section, we explain the process of tuning the hyperparameters λ_M and λ_S in the loss function (7). First, the ranges of the spatial and temporal variables in the training set are defined as [−L, L] and [0, T], respectively. Then, using the finite difference method, we generate a dataset that acts as the "true data". Additionally, we consider M initial conditions. The sampling interval is determined by the ratio dt/d̃t, where d̃t is the time step size for computing the "true data" and dt represents the time step size for selecting the "observational data". Typically, d̃t is chosen to be much smaller than dt. The solution corresponding to the mth initial condition is denoted by u_m(·, ·), where the first "·" refers to the spatial variable and the second "·" to the temporal variable. If the solution is evaluated at the kth time step, it is written as u_m(·, t_k), with "·" representing the spatial variable.
The M initial values from the M initial conditions are divided into three separate groups, so that M = M_1 + M_2 + M_3, where M_1, M_2, and M_3 represent the sizes of the training set, validation set, and test set, respectively, and the solutions produced by these initial values are designated accordingly. We use the training set to train our models, the validation set to find the best hyperparameters, and the test set to evaluate the performance of the trained models.
Assume we divide the time range [0, T] into K blocks, with cutting points denoted by t_k for 1 ≤ k ≤ K. Then, for any 1 ≤ m ≤ M and any 1 ≤ k ≤ K, we define

ε_{m,k} := ||u_m(·, t_k) − ũ_m(·, t_k)||_2,

where ||·||_2 denotes the L^2 norm with respect to the space variable on [−L, L], u_m is the "true solution", and ũ_m is the "predicted solution" produced by a neural network. Based on this, the training, validation and testing losses are defined as the averages of ε_{m,k} over the corresponding sets of initial values and time blocks:
• Training loss: the average of ε_{m,k} over the M_1 training initial values and the K time blocks;
• Validation loss: the average of ε_{m,k} over the M_2 validation initial values and the K time blocks; (10)
• Testing loss: the average of ε_{m,k} over the M_3 test initial values and the K time blocks.
We choose the hyperparameters λ_M and λ_S in the loss function (7) using the validation sets.
We denote the training number by N_t. We then gradually increase the time points of the training and validation sets. For instance, if K = 15 and N_t = 5, the training and validation sets can be selected by successively enlarging the time window over K/N_t = 3 rounds. The performance metric is the same as the validation loss in (10).
Furthermore, we tune the hyperparameters using Hyperopt [29], which uses Bayesian optimization to explore the hyperparameter space more efficiently than a brute-force grid search. Specifically, mPDE-Net is nested inside the objective function of Hyperopt, which optimizes the average validation loss L_avl of the models.
The selection procedure is described in Algorithm 2.
Algorithm 2 Optimizing Hyperparameters using Hyperopt
1: Initialize the search spaces for λ_M and λ_S;
2: Define the objective function (to be optimized) as the average validation loss obtained from mPDE-Net, implemented using PyTorch;
3: Set the optimization algorithm, specify the number of trials, and initialize the results list;
4: for i = 1 to number of trials do
5:   Sample a set of hyperparameters from the search spaces, evaluate the objective function with the sampled hyperparameters, and initialize a list called ValidationLoss;
6:   for r = 1 to K/N_t do
7:     Train the mPDE-Net model on B_{m1}, ..., B_{mr}, test it on B_{j1}, ..., B_{jr} to get a validation loss, and append it to ValidationLoss;
8:   end for
9:   Compute the average validation loss from ValidationLoss, append the hyperparameters and the average validation loss to the results list, and update the search space based on the results so far;
10: end for
11: return the hyperparameters with the minimum objective function value.
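Algorithm 2 relies on Hyperopt's TPE sampler. A stripped-down stand-in using independent log-uniform sampling (instead of TPE's adaptive proposals) conveys the same outer loop; the bounds, trial count, and objective passed in are illustrative assumptions.

```python
import math, random

def tune_lambdas(objective, n_trials=25, lo=-6.0, hi=0.0, seed=0):
    # sample (lam_M, lam_S) log-uniformly and keep the pair with the
    # smallest average validation loss returned by `objective`
    rng = random.Random(seed)
    best_loss, best = math.inf, None
    for _ in range(n_trials):
        lam_M = 10.0 ** rng.uniform(lo, hi)
        lam_S = 10.0 ** rng.uniform(lo, hi)
        loss = objective(lam_M, lam_S)
        if loss < best_loss:
            best_loss, best = loss, (lam_M, lam_S)
    return best
```

In the real pipeline, `objective` would train mPDE-Net over the K/N_t folds and return the averaged validation loss, exactly as in steps 5-9 of Algorithm 2.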

Frac-PDE-Net
We have noted that mPDE-Net accurately fits data and recovers terms, but it may not always simplify the learned PDE, making it challenging to interpret. To address this, we implement sparsity-encouraging methods such as the Lasso approach. However, even with Lasso and hyperparameters chosen from the validation sets, the predicted equation still contained redundant terms. This is likely due to correlated data and linear dependencies in the data, which prevent Lasso from fully shrinking the extra coefficients to zero. To overcome this, we employ two approaches. The first, the L^2 norm-based term selection criterion, weakens or eliminates linear dependencies in the data. The second, sequential threshold ridge regression (STRidge), creates concise models through hard thresholding. We discuss these approaches in more detail below.

L^2 norm based term selection criterion. Consider the underlying PDE in the form

∂_t u = Θ(u) ξ,

where Θ(u) is the matrix whose columns Θ_k(u) are the candidate terms evaluated on the data and ξ is the coefficient vector. To address the issue of excessive terms in the learned PDE, we apply the L^2 norm based term selection criterion. This involves normalizing the columns of Θ(u) to obtain Φ_k(u) := Θ_k(u)/||Θ_k(u)||_2 and adjusting the coefficients ξ to ξ̃, with components η_k := ξ_k ||Θ_k(u)||_2. By removing the terms in Θ(u) whose adjusted coefficients η_k are significantly smaller than the largest one, we shorten the vector ξ̃ to ξ^(s). The corresponding columns in Θ(u) form a new matrix Θ^(s)(u) with reduced linear dependency between its columns. This results in a simplified approximation of the PDE:

∂_t u ≈ Θ^(s)(u) ξ^(s). (13)

Sparse regression: STRidge. After using the L^2 norm-based term selection criterion to select terms, as discussed previously, we apply sparse regression to further improve the compactness of the representation of the hidden PDE model (13). Here, a tolerance threshold "tol" is introduced to select coefficients for sparse results: coefficients smaller than "tol" are discarded, and the remaining ones are refitted until the number of terms stabilizes.
The sparsity regression process is outlined in Algorithm 3. For further information, see [25].
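A minimal STRidge sketch in the spirit of [25]; the column normalization and the exact stopping rule of Algorithm 3 may differ from this simplified version.

```python
import numpy as np

def stridge(Theta, ut, lam=1e-5, tol=0.1, iters=10):
    # sequential threshold ridge regression: ridge-solve, discard coefficients
    # below `tol`, refit on the surviving columns until the support is stable
    n = Theta.shape[1]
    keep = np.ones(n, dtype=bool)
    xi = np.zeros(n)
    for _ in range(iters):
        A = Theta[:, keep]
        w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ ut)
        xi[:] = 0.0
        xi[keep] = w
        new_keep = np.abs(xi) >= tol
        if not new_keep.any() or (new_keep == keep).all():
            keep = new_keep
            break
        keep = new_keep
    xi[~keep] = 0.0
    return xi
```

Unlike Lasso, the hard threshold sets small coefficients exactly to zero, so the surviving terms are refitted without shrinkage bias.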
To summarize, the mPDE-Net approach allows us to achieve relatively accurate predictions for the function and its derivatives. We then employ an L 2 norm-based term selection criterion and sparse regression to obtain a concise model, which we refer to as Frac-PDE-Net. Algorithm 4 summarizes this procedure.

Kolmogorov-Smirnov Test
After applying the Frac-PDE-Net procedure, a simplified, interpretable model has been created. Our next goal is to determine if this model can be further compressed. We designate Model 1 as the system learned by Frac-PDE-Net, and Model 2 as the system obtained by removing the interaction term with the smallest L 2 norm from Model 1. To determine if Model 1 and Model 2 come from the same distribution, we use the Kolmogorov-Smirnov test (K-S test).
Since our examples involve systems of two PDEs, a two-dimensional K-S test is appropriate. The time range is [0, T] with time step size dt, giving n := T/dt time grids {t_i}_{i=1}^{n}, where t_i = i·dt for 1 ≤ i ≤ n. At a fixed time t_i, we aim to test the proximity of two samples Y_{t_i} and Ỹ_{t_i}, which are associated with Model 1 and Model 2, respectively, at time t_i. For each t_i, we specify the null hypothesis that Y_{t_i} and Ỹ_{t_i} come from a common distribution.
Let H_{t_i,0} and p_{t_i} denote the null hypotheses and the corresponding p-values, respectively, for 1 ≤ i ≤ n. In this paper, we employ the Bonferroni [30], Holm [31] and Benjamini-Hochberg (B-H) [32] methods for multiple testing adjustment. Note that the Bonferroni method is the most conservative of the three. Under the complete null hypothesis of a common distribution across all time points, no more than 5% of the total time points should be rejected.
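A sketch of the per-time-point testing with a B-H adjustment. SciPy's ks_2samp is the univariate two-sample test; the two-dimensional variant used in the paper would replace it, and the 5% level is illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_pvalues(samples1, samples2):
    # one two-sample K-S test per time point t_i
    return np.array([ks_2samp(a, b).pvalue for a, b in zip(samples1, samples2)])

def bh_reject(pvals, alpha=0.05):
    # Benjamini-Hochberg step-up: reject the k smallest p-values, where
    # k = max{i : p_(i) <= alpha * i / n}
    p = np.asarray(pvals)
    n = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True
    return reject
```

Bonferroni would instead reject wherever p_{t_i} <= alpha/n, and Holm sits between the two, which is why B-H typically rejects the most hypotheses and Bonferroni the fewest.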

Numerical Studies: Reaction-Diffusion Equations with the Neumann Boundary Condition
In this section, we showcase numerical examples to demonstrate the efficacy of our proposed method, Frac-PDE-Net. The training, validation, and testing data are generated from the underlying governing equation. Our aim is to use Frac-PDE-Net on these data to obtain a concise and interpretable model for the PDE. The governing PDEs under consideration in this paper are of the form

∂_t u = F_1(u, v), ∂_t v = F_2(u, v), (14)

where

F_1(u, v) = d_1 Δu + R_1(u, v) + P_1(u, v), F_2(u, v) = d_2 Δv + R_2(u, v) + P_2(u, v). (15)

Here, d_1 and d_2 are positive diffusion coefficients, R_1 and R_2 represent fractional functions of (u, v), and P_1 and P_2 denote combinations of power functions and integration operators of (u, v) through addition and multiplication. For example, R_1(u, v) can be (u − 2)/(v^2 − v + 3), and P_1(u, v) can be 1 + u^{1.5} − v^2 + u^{1.5} ∫ u dx.

Example 1: A 2-Dimensional Model
Our first example is taken from Equation (2.8) in Section 2.2 of [2]. In this example, we consider (14) under the Neumann boundary condition on the two-dimensional domain [−5, 5] × [−5, 5]. Thus, Equation (14) is reduced to a system (16), with reaction terms (17) and boundary conditions

∂_x u(−5, y, t) = ∂_x u(5, y, t) = ∂_y u(x, −5, t) = ∂_y u(x, 5, t) = 0,

and similarly for v. The observations are generated with Equations (16) and (17) and then split into training data, validation data and testing data. The PDE is solved by applying a finite difference scheme on a 64 × 64 spatial mesh grid, with the central difference scheme for Δ := ∂_x^2 + ∂_y^2 and a second-order Runge-Kutta temporal discretization (see [16]), using a time step size of 1/1600. In addition, the observations are obtained from various initial values: this introduces extra variability into the datasets, which is necessary if we want to generalize well to arbitrary initial conditions. We assume that we have N_Init = 12 different solutions, coming from different initial values w_0. These functions are random, defined through random parameters a_{i,j}, b_{i,j}, c_{i,j}, d_{i,j}, a_{k,l}, b_{k,l}, c_{k,l} and d_{k,l}, drawn from the standard normal distribution N(0, 1), and c_1 and c_2, drawn from the uniform distributions c_1 ~ U(−0.5, 0.5) and c_2 ~ U(0.5, 1.5). We then generate the 12 initial values (u_0, v_0) accordingly. For any given initial data (u_0, v_0), we denote the corresponding solution by (u*, v*). When noise is allowed, the perturbed data are

u(x, y, t) = u*(x, y, t) + n_l Q_1, v(x, y, t) = v*(x, y, t) + n_l Q_2,

where n_l is the level of Gaussian noise added, and Q_1 and Q_2 are random variables following the normal distributions Q_i ~ N(0, σ_i^2) for i = 1, 2, where σ_1 (resp. σ_2) is the standard deviation of the true data u* (resp. v*).
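The exact basis used for the random initial values w_0 is not reproduced here. As one plausible construction (an assumption on our part), a truncated cosine series with N(0, 1) coefficients plus a uniform offset respects the zero-flux Neumann conditions on [−5, 5]^2; the noise model follows the formulas above.

```python
import numpy as np

def random_initial_value(n=64, L=5.0, K=2, seed=None):
    # random cosine series: N(0,1) coefficients plus a U(-0.5, 0.5) offset;
    # each cosine mode has zero normal derivative at x, y = +/-L (Neumann)
    rng = np.random.default_rng(seed)
    x = np.linspace(-L, L, n)
    X, Y = np.meshgrid(x, x, indexing="ij")
    w = rng.uniform(-0.5, 0.5) * np.ones((n, n))
    for i in range(1, K + 1):
        for j in range(1, K + 1):
            w += (rng.standard_normal()
                  * np.cos(i * np.pi * (X + L) / (2 * L))
                  * np.cos(j * np.pi * (Y + L) / (2 * L)))
    return w

def add_noise(w_true, n_l, rng):
    # Gaussian perturbation scaled by the field's own standard deviation
    return w_true + n_l * rng.normal(0.0, w_true.std(), size=w_true.shape)
```

Drawing each of the 12 initial pairs (u_0, v_0) this way, solving forward, and then applying add_noise with n_l = 0.05 reproduces the 5% noise setting used below.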
Since the time runs from 0 to 0.15, there are 15 time blocks and we denote N_Time = 15. For the spatial variables, we have N_Space = 64, where N_Space represents the number of space grids. Therefore, the dataset is

{(u_{t,k}, v_{t,k}) : 1 ≤ t ≤ N_Time, 1 ≤ k ≤ N_Init},

where both u_{t,k} and v_{t,k} are matrices in R^{N_Space × N_Space}. Tables 1 and 2 summarize the parameters for Frac-PDE-Net. Our goal is to discover the terms F_1(u, v) and F_2(u, v) on the right-hand side of (16), whose true expressions are given by (17). For convenience of notation, we denote by F̃_1 and F̃_2 our predicted operators for F_1 and F_2. Based on some existing models (see, e.g., Section 2.2 in [2]), we adopt some assumptions before discovering F̃_1 and F̃_2. More precisely, we assume that F̃_1 and F̃_2 take the form (15), where d_1 and d_2 are positive constants, P_1 and P_2 are polynomials of (u, v) up to order 2, and both fractional terms R_1 and R_2 are of the form l(u)r(v) or r(u)l(v), where l denotes a linear function and r a fractional function whose numerator is linear and whose denominator is quadratic. Based on these assumptions, we consider the library {u, u_xx, u_yy, v, v_xx, v_yy} for training our model.
The filters q (as defined in (4)) are selected to be of size 5 × 5. The total number of parameters in W^(i) (as defined in Algorithm 1) for approximating F̃_1 and F̃_2 is 56, and the number of trainable parameters in the moment matrices M (as defined in (6)) is 52. To optimize the parameters, we use the BFGS algorithm instead of the Adam or SGD optimizers, since BFGS is faster and more stable in our setting.
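As a minimal illustration of BFGS-based training on a flat parameter vector, SciPy's implementation can serve as a stand-in (the paper's networks are trained in PyTorch, where torch.optim.LBFGS plays the analogous role):

```python
import numpy as np
from scipy.optimize import minimize

def train_bfgs(loss, theta0):
    # quasi-Newton optimization of all trainable parameters at once;
    # BFGS builds a curvature estimate, often converging in far fewer
    # iterations than Adam/SGD on smooth, moderate-size problems
    res = minimize(loss, theta0, method="BFGS")
    return res.x
```

This design choice trades the per-step cheapness of first-order methods for fast, stable convergence, which pays off here because the number of trainable parameters (around a hundred) is small.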
In the following, we outline the notation used and summarize the key steps of our framework.

1. F̃_i^{mPDE-Net} denotes the result of applying the modified PDE-Net to our model.
2. Next, we apply the L^2 norm-based selection criterion and sparse regression to F̃_i^{mPDE-Net}, yielding F̃_i^{rsmPDE-Net}.
3. Finally, to verify that no further terms can be eliminated after Frac-PDE-Net, we compare two models: Model 1, generated by Frac-PDE-Net, and Model 2, which is identical to Model 1 but removes the term with the smallest L^2 norm from F̃_1 and F̃_2.

For this case, we added 5% noise to the generated data to form the observational data. The results are displayed in Table 3. Table 3 shows that F̃_i^{mPDE-Net} (the modified PDE-Net framework) accurately identifies the terms in Example 1 and estimates their corresponding coefficients. However, it also produces unnecessary terms with low weights after training. By applying the L^2 norm-based selection and sparse regression (L^2+SP), we successfully remove these extra terms in F̃_i^{rsmPDE-Net}. After the terms in F̃_1 and F̃_2 are identified, we retrain the model with these fixed terms to obtain the final coefficients in F̃_i^{rsmPDE-Net}.
Here, Y*_{t_i} represents the true solution, and Y_{t_i} and Ỹ_{t_i} denote the predicted solutions based on Model 1 and Model 2, respectively, at time t_i. We test whether the residuals Y_{t_i} − Y*_{t_i} and Ỹ_{t_i} − Y*_{t_i} come from a common distribution. Applying the Bonferroni method, the Holm method and the B-H procedure for multiple testing adjustment, discussed in Section 2.5, we obtain the results presented in Table 4. The results in Table 4 show that Model 1 (Frac-PDE-Net) is significantly different from Model 2, meaning that all terms in Model 1 should be kept. Hence, the final discovered terms for F̃_1 and F̃_2 are those represented by Model 1 (Frac-PDE-Net) in Table 4.
To assess the stability of the results shown above, we repeated the experiments 100 times; the results are presented in Figures 4 and 5. The process of merging similar terms is outlined in Appendix A.1. The plots show that there are some instances where the three methods fail to eliminate certain redundant terms. However, these instances are rare: the median of these terms is 0, indicating that they appear infrequently.

Example 2: A 1-Dimensional Model
Our second example is taken from [6]. In this example, we consider (14) under the Neumann boundary condition on the one-dimensional domain D_1 := [−5π/2, 5π/2] with d_1 = 0.1 and d_2 = 10. Thus, Equation (14) is reduced to a system (20) with (x, t) ∈ [−5, 5] × [0, 0.75] and suitable initial data. The training data, validation data and testing data are generated from (20) by applying a finite difference scheme on a 600-point spatial mesh grid, then restricting to a 200-point spatial mesh grid, with the central difference scheme for Δ := ∂_x^2 and an implicit Euler temporal discretization, using a time step size of 0.01. Furthermore, we evaluate 14 different initial values, 10 of which were selected from a set of solutions with periodic patterns. The remaining initial values were generated by combining elementary functions. The reason for producing initial values in different ways is to test whether the method still works for periodic solutions.
We also add noise to the generated data in the form

u(x, t) = |u*(x, t) + n_l Q_1|, v(x, t) = |v*(x, t) + n_l Q_2|,

where n_l is the level of Gaussian noise added, and Q_1 and Q_2 are random variables following the normal distributions Q_i ~ N(0, σ_i^2) for i = 1, 2, with σ_1 (resp. σ_2) the standard deviation of u* (resp. v*). The absolute value is imposed to avoid negative values, which may cause trouble when evaluating power functions with non-integer exponents, such as u^{1.5}.
We choose 15 time blocks on the interval [0, 0.75] and denote N_Time = 15. For the spatial variable, we set N_Space = 200, where N_Space represents the number of space grid points. Therefore, the dataset consists of the pairs (u_{t,k}, v_{t,k}), where both u_{t,k} and v_{t,k} are vectors in R^{N_Space}. Tables 5 and 6 summarize the parameters of Frac-PDE-Net.
In [6], some assumptions are made on the model based on existing experimental knowledge of the biological behavior. For example, it is assumed that the operator F_2(u, v) is linear in both u and v, while F_1(u, v) is nonlinear in both u and v. As in the form (15), the nonlinear dependence of P_1(u, v) on u in [6] is via the combination of the power function u^α and the integration operator ∫ u dx, where α is further restricted to the range [1, 2]. On the other hand, R_1(u, v) is assumed to be linear in u but nonlinear in v, and the nonlinear dependence on v is via a fractional function whose denominator is a quadratic polynomial. Thanks to these a priori constraints, we consider the library {u, u_x, u_xx, v, v_x, v_xx, I, u^α} for F_1(u, v), where I denotes the integral term, and the library {u, u_x, u_xx, v, v_x, v_xx} for F_2(u, v). Here α takes the form α = 1.5 + 0.5 sin(η) for η ∈ R to ensure that α ∈ [1, 2]. The filters q are of size 1 × 19. The total number of parameters for approximating F_1 and F_2 is 29, and the number of trainable parameters in the moment matrices M is 32. To optimize the parameters, we again use the BFGS algorithm.
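The reparameterization of the exponent is a small but useful trick: the trainable variable η is unconstrained, yet the resulting power α is always a valid value in [1, 2]. A minimal sketch:

```python
import numpy as np

def alpha_from_eta(eta):
    """Map an unconstrained trainable parameter eta to an exponent
    alpha in [1, 2], as alpha = 1.5 + 0.5 * sin(eta)."""
    return 1.5 + 0.5 * np.sin(eta)
```

During training, the optimizer updates η freely while the library term u^α stays well defined (recall that the data are kept non-negative precisely so that non-integer powers make sense).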
For this case, we added 1% noise to the generated data to form the observational data. The results are displayed in Table 7, in which the notations are consistent with those in Table 3.
Similar to the post hoc selection procedure performed in Example 1, we also need to compare Model 1 (Frac-PDE-Net) with Model 2. One of the coefficients learned in Model 2 is a negative number, −0.026, which leads to rapid concentration rather than a diffusion effect. With this being said, Model 2 is essentially different from Model 1, and the distributions of the residuals are totally different. To assess the stability of the results shown above, we repeated the experiments 100 times and the results are presented in Figures 6 and 7. The plots show that there are some instances where the three methods fail to eliminate certain redundant terms. However, these instances are rare, as the median of these terms is 0, indicating that they appear infrequently.

Example 1: The 2-Dimensional Model
In this section, we validate the robustness of the model discovered by Frac-PDE-Net in Example 1 by performing predictions with non-typical initial values u_0 and v_0, involving the expression (50y² − y⁴ + 4)² + cos((π/5)x) + 1.
We use the finite difference method to generate the "true data" in the forward direction using the known coefficients and terms in (16) and (17). The spatial step sizes (dx and dy) are both set to 10/64 and the time step size (dt) is 1/1600. We then simulate the data using the trained model from Table 3 up to t = 0.5.
In Figure 8, both the true solution (u, v) and the predicted solution (ũ, ṽ) of the trained model by Frac-PDE-Net are plotted at different time instances: t ∈ {0.4, 0.6, 0.8, 1}. One can see from Figure 8 that the predicted solution is very close to the true one. The model discovered by PDE-Net 2.0 is shown in Table 8, while the predicted solutions are displayed in Figure 9. Although PDE-Net 2.0 only utilizes polynomials, the predicted images still have a similar shape to the true ones. To further evaluate the performance, the predicted errors are analyzed quantitatively using the L∞ norm and L2 norm on the space domain [−5, 5] × [−5, 5], as seen in Table 9. The results show that Frac-PDE-Net has smaller errors compared to PDE-Net 2.0, highlighting its advantage. Figure 9. Images of the predicted dynamics using PDE-Net 2.0 with 5% noise level.
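The two error norms used in this comparison are easy to compute on the discrete grid; the helper below (our own naming) evaluates the L∞ error pointwise and approximates the L2 error by a Riemann sum over the spatial cells.

```python
import numpy as np

def prediction_errors(true, pred, dx, dy):
    """L-infinity and grid-approximated L2 errors between two 2D solution
    snapshots sampled on a uniform mesh with spacings dx, dy."""
    diff = np.abs(true - pred)
    err_inf = diff.max()
    err_l2 = np.sqrt((diff**2).sum() * dx * dy)
    return err_inf, err_l2
```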

Example 2: The One-Dimensional Model
In this section, we validate the robustness of the model discovered by Frac-PDE-Net in Example 2 in Section 3.2 by performing predictions with the following periodic initial values u 0 and v 0 , u 0 (x) = 0.0259 + 0.01 sin(3x), v 0 (x) = 0.06475 + 0.01 sin(3x).
We use the finite difference method to generate the "true" data in the forward direction using the known coefficients and terms in (20) and (21). The spatial step size (dx) is set to 5π/200 and the time step size (dt) is 5/100. The time interval considered is t ∈ [0, 10]. We then simulate the data using the trained model from Table 7 over the time period [0, 10]. In Figure 10, both the true solution and the predicted solution of the trained model by Frac-PDE-Net are plotted for t ∈ [0, 10]. One can see from Figure 10 that the predicted solution is very close to the true one.
The results of the comparison between Frac-PDE-Net and PDE-Net 2.0 are presented in both graphical and quantitative form. The model discovered by PDE-Net 2.0 is shown in Table 10, while the predicted solutions are displayed in Figure 11. We can clearly see that the predicted images by PDE-Net 2.0 are far from satisfactory compared to the true one in Figure 10. To further evaluate the performance, the predicted errors are analyzed quantitatively using the L∞ norm and L2 norm on the space-time region [−5π/2, 5π/2] × [0, 10] in Table 11. The results show that Frac-PDE-Net has much smaller errors compared to PDE-Net 2.0, highlighting its advantage. Figure 10. The first row shows the true dynamics of (u, v) for (x, t) ∈ [−5π/2, 5π/2] × [0, 10]. The second row presents the predicted dynamics of (u, v) with 1% noise level by Frac-PDE-Net.

Conclusions
Our approach, Frac-PDE-Net, builds on the symbolic approach developed in PDE-Net for addressing the discovery of realistic and interpretable PDEs from data. While the neural network remains very efficient for generating and learning dictionaries of functions, typically polynomials, we have shown that if we enrich the dictionaries with large families of functions (typically uncountable), extra care is needed to select the important terms, by penalization and by evaluating and testing the impact of each reaction term on the predicted solution. Quite remarkably, we can extract a sparse equation with readable terms and with good estimates of the associated parameters.
The introduction of rich families of functions, such as fractions (rational functions), is often necessary: not only are they widely used by modelers, but they also avoid the limitations of the approximation capacity of polynomials. Indeed, numerous terms might be needed in a polynomial expansion to approximate the unknown reaction terms correctly. As a matter of fact, we have introduced a very flexible family of fractions that avoids truncation based on powers u^p, v^q with p, q ∈ N. While we learn the numerator and denominator coefficients in R, our approach is incorporated seamlessly into the symbolic differentiable neural network framework of PDE-Net by the introduction of extra layers.
Our work is originally motivated by the discovery and estimation of reaction-diffusion PDEs, with possibly complex terms such as fractions, non-integer powers, or non-local terms (such as an integral), as introduced for the pollen tube growth problem [6]. Nevertheless, our selection approach could be used to handle other dictionaries, or in the presence of advection terms, as our methodology exploits the reaction-diffusion structure only for imposing some constraints on the dictionaries of interest, and because of the interpretability of each term in that case. As next steps, the Frac-PDE-Net methodology can be improved by considering more advanced numerical schemes in the time discretization, say implicit Euler or second-order Runge-Kutta. In that case, we expect better accuracy and stability for model recovery and prediction. Another possible improvement would be to enrich the dictionaries of fractional terms by replacing the current form N(u, v) = g(u)h(v) by more general rational functions with denominators that depend on both u and v, say N(u, v) = (u − v)/(u² − v² + 1). Finally, we put an emphasis on the fact that Frac-PDE-Net reaches a trade-off: it discovers the main terms of the PDE and accurately estimates each coefficient in order to gain interpretability, while also allowing effective long-term prediction, even for unseen initial conditions.
Acknowledgments: The authors would like to thank all anonymous reviewers for their constructive comments and suggestions.

Conflicts of Interest:
The authors declare no conflict of interest.
During the process of simulation, if only the addition and multiplication operators are involved, then combining terms is not an issue, as the program can easily identify identical terms and add their coefficients together. However, combining similar terms can be difficult when fractional terms are present. To address this issue, we classify the simulation results into various groups before combining them.
As an example, we consider the scenario where the nonlinear term takes the form g(u)h(v), and one of the following two structures is assumed.
(i) g is linear and h is a fractional function whose denominator is a second-order polynomial; (ii) h is linear and g is a fractional function whose denominator is a second-order polynomial. There are then 32 groups. In each of them, all members share the same main terms and the same signs in the denominator, while the coefficients are allowed to differ. For example, in the group with the form α₁uv/(v² + β₁v + β₂), all members share the same term uv in the numerator, the same terms v² and v in the denominator, and the same signs of β₁ and β₂, while the specific values of α₁, β₁ and β₂ may vary. Based on the above groups, we adopt the following general principle. If two terms live in distinct groups, then they are considered to be different and will not be combined. If two terms live in the same group, then we further quantify how close their coefficients in the denominator (say β₁ and β₂) are. If these coefficients are close enough, then we regard them as the "same" term and combine them by adding their coefficients in the numerator (say α₁) together. So, the next question is how to quantify the distance between two members in the same group with possibly different coefficients (say β₁ and β₂).
We illustrate the criterion by studying the specific form uv/(v² + β₁v + β₂). More precisely, suppose there are two terms T₁ and T₂ of the form T_j = α₁^(j) uv/(v² + β₁^(j) v + β₂^(j)) for j = 1, 2, and define their distance as D[T₁, T₂] = max_{i=1,2} |β_i^(2) − β_i^(1)| / max{|β_i^(2)|, |β_i^(1)|}. (A1) According to this concept, we combine T₁ and T₂ if and only if D[T₁, T₂] < 0.2, that is, when the relative difference between the coefficients is less than 0.2. In such a case, we add the coefficients α₁^(1) and α₁^(2) to obtain the numerator coefficient of the combined term.
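The merging rule can be sketched as follows. The code below assumes the terms have already been assigned to the same group (same main terms and same denominator signs); how the surviving denominator is chosen after a merge is not specified in the text, so this sketch simply keeps the denominator of the first member.

```python
def relative_distance(beta_a, beta_b):
    """Distance (A1): the largest relative difference among the
    denominator coefficients of two terms in the same group."""
    return max(
        abs(b2 - b1) / max(abs(b2), abs(b1), 1e-12)
        for b1, b2 in zip(beta_a, beta_b)
    )

def merge_terms(terms, tol=0.2):
    """Greedily merge terms, each given as (alpha, (beta1, beta2)),
    whenever the relative distance of their denominators is below tol;
    numerator coefficients are summed."""
    merged = []
    for alpha, betas in terms:
        for k, (a0, b0) in enumerate(merged):
            if relative_distance(b0, betas) < tol:
                merged[k] = (a0 + alpha, b0)
                break
        else:
            merged.append((alpha, betas))
    return merged
```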