Parallel Implementation of the Deterministic Ensemble Kalman Filter for Reservoir History Matching

Abstract: In this paper, the deterministic ensemble Kalman filter (DEnKF) is implemented with the message passing interface (MPI) parallel technique based on our in-house black oil simulator. The implementation distinguishes two cases: (1) the ensemble size is greater than the processor number and (2) the ensemble size is smaller than or equal to the processor number. Numerical experiments are presented for estimating three-phase relative permeabilities represented by power-law models with both known and unknown endpoints. With known endpoints, good estimations are obtained. With unknown endpoints, good estimations can still be obtained by using more observations and a larger ensemble size. Computational times are reported to show that the run time is greatly reduced with more CPU cores. The MPI parallel efficiency is over 70% for a small ensemble size and over 77% for a large ensemble size with up to 640 CPU cores.


Introduction
The ensemble Kalman filter (EnKF) is a data assimilation method that estimates poorly known model solutions and/or parameters by integrating given observation data. It was introduced by Evensen [1] and has attracted much attention in weather forecasting, oceanography and reservoir engineering. Regarded as a Monte Carlo formulation of the Kalman filter, it applies an analysis step to each ensemble forecast. The deterministic ensemble Kalman filter (DEnKF) is a variation of the EnKF that asymptotically matches the analysis error covariance of Kalman filter theory with a small correction and without any perturbed observations; its form is simpler while its robustness is retained [2].
The EnKF has been introduced to reservoir engineering in the past decade. As a data assimilation method for history matching, the EnKF can be used to assimilate different kinds of production data, estimate model parameters and adjust dynamical variables. It was first applied to history matching by Naevdal [3,4], and a detailed review was given by Aanonsen et al. [5]. It has been used to estimate reservoir porosity and permeability successfully [4,6,7]. It has also been adopted to estimate relative permeability and capillary pressure curves with good accuracy [8][9][10][11]. Nevertheless, for most applications, hundreds of ensemble members are required to guarantee the accuracy of an estimation. For example, the estimation of the endpoints of relative permeability curves is challenging; a good estimation can only be obtained in limited cases [8,9] and requires a large ensemble size. Therefore, a serial run of the simulations is very time-consuming. Furthermore, for large-scale numerical models, the memory of a single processor may not even be sufficient for a simulation with one ensemble member. Therefore, the parallel technique becomes a natural choice to reduce the computational time and make large-scale models feasible.
The parallel technique has been successfully applied in many fields, including the atmospheric and oceanographic sciences. A multivariate ensemble Kalman filter was implemented on a massively parallel computer architecture for the Poseidon ocean circulation model [12]. At an EnKF forecast step, each ensemble member is distributed to a different processor; to parallelize an analysis step, the ensemble is transposed over processors according to domain decomposition. A parallelized localization implementation was reported in [13] based on domain decomposition. As for reservoir history matching, the parallel technique has not been well developed. An automatic history matching module with distributed and parallel computing was developed by Liang et al. using a weighted EnKF [14], where two-level high-performance computing is implemented, distributing ensemble members simultaneously by submitting all the simulation jobs at the same time and simulating each ensemble member in parallel. A parallel framework was given by Tavakoli et al. [15] using the EnKF and ensemble smoother (ES) methods based on the simulator IPARS, in which the forecast step was parallelized while the analysis step was computed by one central processor. It discussed the case where the processor number is smaller than the ensemble size and each processor is in charge of one or several ensemble members according to its simulation speed. In their other work [16], the case where the processor number is greater than the ensemble size was also discussed. The analysis step was parallelized by partitioning a matrix in a row-wise manner. Compared to the EnKF, the ES requires less time to assimilate the data, as it does not need to synchronize all the ensemble members at the analysis step. However, for problems with many data points, the matrices become very large, resulting in long computational times and large memory requirements for the inversion.
Although the implementation of the parallel technique was explored in the above papers, good computational efficiency has not been obtained, even with high-performance simulators. One important reason is that after each analysis step, updated variable values were written to disk and read by different processors to restart the simulator. A large number of inputs/outputs not only wastes disk space, but also slows down accessing, reading and writing. On the other hand, an analysis step can only be performed after all the simulations in a forecast step have finished. This synchronization means that at every forecast step, the simulation time equals the maximum run time among all the processors. From the parallel aspect, MPI inputs/outputs perform poorly on common computer clusters: load and store operations are more time-consuming than multiplication operations. Moreover, parallelizing an analysis step requires message passing multiple times. According to the above published papers, the parallel efficiency was generally lower than 50% with an ensemble of no more than 200 members, compared to the ideal efficiency of 100%.
In contrast to the traditional EnKF method, several modifications have been developed, such as ensemble square root filters (ESRFs) [17,18] and the deterministic ensemble Kalman filter (DEnKF) [2], which give better performance in certain conditions. First proposed by Sakov et al. [17], the DEnKF can be regarded as a linear approximation to the ESRF under a small analysis correction, and it combines the simplicity of the ESRF with the versatility of the traditional EnKF. Its robustness has been shown by numerical experiments [17]. Despite the intensive research on and implementation of the EnKF, to our knowledge, the DEnKF has not yet been adopted in the reservoir history matching field.
In this paper, the DEnKF is implemented for reservoir history matching with a parallel technique and our in-house parallel black oil simulator. Both the forecast and analysis steps are parallelized according to the relationship between the ensemble size and the processor number. To improve the computational efficiency, several user-set switches are defined to control the input/output. The restart of the simulator is slightly modified to coordinate with the history matching implementation, and the simulator is called as a function from our platform library. This paper is organized as follows. First, our black oil simulator is briefly introduced. After that, the EnKF and DEnKF are given, followed by the parallel technique used in our computations. Based on this technique, numerical experiments on the estimation of relative permeability curves with known and unknown endpoints are presented. The parallel speedup is reported to show the computational efficiency. Conclusions are given at the end of this paper.

Black Oil Simulator
In this paper, a black oil model is simulated. Using the mass conservation law and Darcy's law for water, oil and gas, the governing equations, denoted as Equation (1), can be obtained [19,20]. In these equations, K is the absolute permeability tensor of the reservoir; K_rl and S_l are the relative permeability and saturation of phase l (l = w, o, g), respectively; ρ_w is the water density; ρ_o^o is the oil component density in the oil phase; ρ_g^o is the gas component density in the oil phase; and ρ_g^g is the gas component density in the gas phase. The potential of phase l has the form Φ_l = P_l − G ρ_l z, where P_l is the pressure of phase l (l = w, o, g), G is the magnitude of the gravitational constant and z is the depth of the position.

The saturations of water, oil and gas satisfy

S_w + S_o + S_g = 1.

For the pressures of the water, oil and gas phases, we have

P_cow = P_o − P_w,  P_cog = P_g − P_o,

where P_cow is the capillary pressure between oil and water and P_cog is the capillary pressure between oil and gas. We choose P = P_o, S_w and S_g as the primary unknowns, and they can be obtained from Equation (1).

There are production and injection wells in the reservoir. The rates of these wells take the standard well-model forms, denoted as Equation (3), in which q_g = q_g^o + q_g^g, q_o = q_o^o and WI is the well index, which is known [19,20]. In summary, four unknowns P, S_w, S_g and P_BH are to be solved from Equations (1)-(3).

An in-house parallel simulator is used to simulate the black oil model. Developed on our platform PRSI, it uses the finite volume (difference) method with upwinding schemes [20,21]. Its high computational efficiency and good parallel scalability have been shown by numerical experiments simulating very large scale models with millions of grid cells using thousands of CPU cores [22]. In our implementation, this simulator is called as a function from the library.

EnKF and DEnKF
The EnKF is an efficient data assimilation method which has been used in reservoir history matching to estimate uncertain model parameters such as porosity and permeability. It is a Monte Carlo method in which an ensemble of reservoir models is generated according to given statistical properties. The ensemble can be written as Y_k = [y_{k,1}, y_{k,2}, ..., y_{k,N_e}], k = 1, ..., N_t, where the subscript k is the time index, N_e is the ensemble size and N_t is the total number of assimilation steps. Each y_{k,j} denotes one sample vector. For reservoir history matching, y_{k,j} is called a state vector and consists of two parts: model variables m and observation data d. The model variables include static variables m_st (such as porosity, absolute permeability and relative permeability parameters) and dynamical variables m_dyn (such as pressure and saturation). The observation data includes well measurement data such as production rates, injection rates and bottom hole pressures. Thus, a state vector can be written as

y_{k,j} = [m_{k,j}^T, d_{k,j}^T]^T.   (4)

We denote the lengths of the vectors m_{k,j} and d_{k,j} by N_{m,k} and N_{d,k}, and define the matrix H_k as H_k = [0 I], where 0 is the N_{d,k} × N_{m,k} zero matrix and I is the N_{d,k} × N_{d,k} identity matrix. Then, it is obvious that

H_k y_{k,j} = d_{k,j}.

Note that the observation vector d_{k,j} is random and can be expressed as

d_{k,j} = d_k^{true} + ε_{k,j},

where d_k^{true} is the true value of the observation and ε_{k,j} is the error vector at the kth time step. According to Burgers et al. [23], the vector ε_{k,j} consists of measurement errors and noise, both of which are assumed Gaussian and uncorrelated. Thus, the covariance of ε_{k,j} can be simplified to a diagonal matrix C_{D,k}.
At the time step t_k when observation data is available, the reservoir simulator stops to start the data assimilation. The traditional EnKF analysis step is as follows [1]:

y_{k,j}^a = y_{k,j}^f + K_k (d_{k,j} − H_k y_{k,j}^f),   (5)

K_k = C_{y,k}^f H_k^T (H_k C_{y,k}^f H_k^T + C_{D,k})^{−1},   (6)

where the superscript f denotes a forecast solution computed by the simulator, the superscript a denotes an analysis ensemble updated by data assimilation, and T denotes the transpose of a vector or matrix. K_k, named the Kalman gain, can be regarded as a weighting matrix that controls which sample has more influence on the update. In Equation (6), C_{y,k}^f is the covariance matrix of the state vectors y_{k,j}^f from the simulation, i.e.,

C_{y,k}^f = (1/(N_e − 1)) Σ_{j=1}^{N_e} (y_{k,j}^f − ȳ_k^f)(y_{k,j}^f − ȳ_k^f)^T,

where ȳ_k^f is the forecast ensemble mean. Note that for the convenience of numerical implementation, Equations (5) and (6) can be expressed as

Y_k^a = Y_k^f X_{5,k},   (7)

where the N_e × N_e matrix X_{5,k} is calculated from Y_k^f. The form of X_{5,k} is given by Evensen [1,24].
Here, we drop the subscript k of the above notations for simplicity. For the DEnKF, denote A^f = Y^f − Ȳ^f and A^a = Y^a − Ȳ^a as the ensemble anomalies of the forecast ensemble Y^f and the analysis ensemble Y^a, respectively, where Ȳ^f and Ȳ^a hold the ensemble means in each column. Then they have the relationship [17]

A^a = A^f − (1/2) K H A^f.

Then, X_5 has the form [17,24]

X_5 = (1/N_e) 1 1^T + (I − (1/N_e) 1 1^T)(T_R + w 1^T),

where 1 is the vector of ones of length N_e, T_R = I − (1/2) S^T F^{−1} S with S = H A^f and F = S S^T + (N_e − 1) C_D, and w is a vector which satisfies ȳ^a = ȳ^f + A^f w and has the form

w = S^T F^{−1} (d − H ȳ^f).

Therefore, to obtain the analysis ensemble Y^a, X_5 and its multiplication with the forecast ensemble Y^f need to be calculated. This is similar to the traditional EnKF and the ESRF, while T_R has a simpler form for the DEnKF. It will be seen in our numerical experiments that the performance is quite good.
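A minimal NumPy sketch of one DEnKF analysis step can assemble X_5 from the quantities above (S = H A^f, F = S S^T + (N_e − 1) C_D, the halved anomaly correction T_R and the mean-update weights w). This is an illustration of the scheme under the stated formulas, not the simulator's parallel implementation.

```python
import numpy as np

def denkf_analysis(Yf, d, H, Cd):
    """One DEnKF analysis step, Y^a = Y^f X_5 (a sketch of the scheme above).

    Yf: (n, Ne) forecast ensemble; d: (m,) observations;
    H: (m, n) observation operator; Cd: (m, m) observation-error covariance.
    """
    n, Ne = Yf.shape
    ybar = Yf.mean(axis=1, keepdims=True)
    Af = Yf - ybar                                        # forecast anomalies A^f
    S = H @ Af                                            # observed anomalies
    F = S @ S.T + (Ne - 1) * Cd
    w = S.T @ np.linalg.solve(F, d - (H @ ybar).ravel())  # mean-update weights
    TR = np.eye(Ne) - 0.5 * S.T @ np.linalg.solve(F, S)   # halved anomaly correction
    J = np.ones((Ne, Ne)) / Ne                            # averaging matrix (1/Ne) 1 1^T
    X5 = J + (np.eye(Ne) - J) @ (TR + np.outer(w, np.ones(Ne)))
    return Yf @ X5

# toy example: 3 state variables, 40 members, one observed component
rng = np.random.default_rng(0)
Yf = rng.normal(size=(3, 40))
Ya = denkf_analysis(Yf, np.array([1.2]),
                    np.array([[1.0, 0.0, 0.0]]), np.array([[0.5]]))
print(Ya.shape)  # (3, 40)
```

One can verify that the analysis mean equals the Kalman mean update ȳ^f + K(d − H ȳ^f) and that the analysis anomalies equal A^f − (1/2) K H A^f, as required by the DEnKF.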

Parallel Technique
In most cases, the EnKF requires an ensemble of hundreds of members to assimilate history data. Between every two assimilation steps, the reservoir model is simulated as many times as the ensemble size, which is extremely computationally costly. In this work, the MPI parallel technique is adopted to improve the computing efficiency and reduce the computing time. Note that at the beginning of the EnKF process, many ensemble members of parameters are generated based on statistical properties, and then for each ensemble member, a reservoir simulation is conducted independently. Based on this consideration, the ensemble members can be distributed over all the processors in balance: each processor is in charge of an equal number of ensemble members, and all processors simulate at the same time. At a data assimilation step, where a processor cannot compute independently, the analysis values are calculated by communicating with the other processors using MPI. After all the data is assimilated, the prediction step follows. In this step, each processor independently simulates the reservoir model with its assigned ensemble members, which have already been adjusted. From this procedure, it can be seen that the MPI technique fits the EnKF method naturally.
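The balanced distribution of ensemble members over processors can be sketched as below. This is an illustrative block assignment (when the processor number does not divide the ensemble size, the first few ranks take one extra member); the exact assignment in the paper may differ.

```python
def distribute_members(Ne, p):
    """Assign ensemble member indices 0..Ne-1 to p processors in balance.

    Block distribution: each rank gets Ne // p members, and the first
    Ne % p ranks get one extra member.
    """
    base, extra = divmod(Ne, p)
    assignment, start = [], 0
    for rank in range(p):
        count = base + (1 if rank < extra else 0)
        assignment.append(list(range(start, start + count)))
        start += count
    return assignment

# e.g. 10 members over 4 processors -> group sizes [3, 3, 2, 2]
print([len(a) for a in distribute_members(10, 4)])  # [3, 3, 2, 2]
```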
The programming implementation consists of the following steps; the flow chart of the implementation is given in Figure 1.

In our implementation, for result visualization purposes, two switches writeens and writepars can also be set by the user. When writeens = TRUE, all the dynamic and static variables are written to files but are not necessarily read during the run. When writepars = TRUE, only the static variables are written. The ensembles of the dynamic and static variables are always output to files at the last analysis step.

For the switch readens, if it is set to TRUE by the user, all the dynamic and static variables are written to files on the hard disk at each data assimilation step, and the corresponding variable spaces allocated in memory are freed after writing. After one cycle is done and before the next reservoir simulation starts, these data are read back from the files to restart the reservoir simulator. However, when the ensemble size is large, the IO (input and output) performance is poor: the many small files written to the hard disk not only waste disk space but are also slow to access. The efficiency becomes even lower when the dynamic variable values are written to files using MPI-IO. Nevertheless, MPI-IO must be used in certain conditions. For example, when the ensemble size is smaller than the processor number, the reservoir simulation for each ensemble member is done by several processors in parallel. Therefore, after the simulation, the vector of each dynamic variable is distributed over several processors, and it is inevitable to write a dynamic vector to one file in the grid cell order using MPI-IO. To avoid reading and writing data to files, our simulator is slightly modified and the switch readens can be set to FALSE by the user, so that both the dynamic and static variables stay in memory during computing and their values are updated at every step.
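The effect of the readens switch on one assimilation cycle can be sketched schematically. This is illustrative Python, not the simulator's code; the function name run_cycle and the use of a NumPy file for the checkpoint are assumptions made only for the sketch.

```python
import numpy as np
import os
import tempfile

def run_cycle(simulate, analyze, state, readens=False, workdir=None):
    """One forecast + analysis cycle (schematic, not the simulator's code).

    readens=True : the updated ensemble is written to disk and read back
                   before the next forecast (file-based restart).
    readens=False: the ensemble stays in memory, avoiding the IO cost
                   discussed above.
    """
    state = simulate(state)      # forecast step
    state = analyze(state)       # analysis (data assimilation) step
    if readens:
        path = os.path.join(workdir or tempfile.gettempdir(), "ensemble.npy")
        np.save(path, state)     # checkpoint the ensemble to disk ...
        state = np.load(path)    # ... and restart from the file
    return state

# identical results either way; only the restart path differs
s0 = np.ones((4, 3))
f = lambda s: s + 1.0
a = lambda s: s * 0.5
print(np.allclose(run_cycle(f, a, s0, readens=True),
                  run_cycle(f, a, s0, readens=False)))  # True
```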
One might worry that these variables would occupy too much computer memory. In fact, the ensemble members are equally distributed over the processors from the beginning. The allocation of the ensemble members and the simulations in different processes are shown in Figure 2. In this figure, the solid box denotes one simulation, the dash-dot box denotes one processor and the dotted box denotes a processor group. If the ensemble size N_e is larger than the processor number p, each processor only holds N_e/p ensemble members. If the ensemble size is smaller than the processor number, each simulation with one ensemble member is carried out by p/N_e processors, and the vector of each dynamic variable is equally distributed over these p/N_e processors. Therefore, when multiple computer nodes and CPU cores are used on a supercomputer, each core only spends limited memory to store the dynamic variable vectors, and it is not costly to keep these variables in memory all the time. As the ensemble size is generally much smaller than the length of a state vector, the main computational cost of an analysis step is the matrix multiplication in Equation (7), which is computed in parallel after the matrix X_5 is computed sequentially by one processor. Supposing that N_e is divisible by p (or p by N_e), there are two different cases:

• N_e > p. If the ensemble size N_e is larger than the processor number p, each processor runs the simulation N_e/p times in every cycle independently and gets the dynamic variable values corresponding to its ensemble members (see Figure 2a). Therefore, each processor obtains one or several complete columns of the matrix Y^f. Processor j (j = 1, ..., p) multiplies these columns (with the remaining columns zero-filled) by X_5 and gets a matrix Y^{a,j} of the same size as Y^f. Note that Y^a = Σ_{j=1}^p Y^{a,j}, so the analysis results can be obtained by summing all the Y^{a,j} over the processors using the function MPI_Allreduce. The matrix multiplication is illustrated in Figure 3.
• N_e ≤ p. If the ensemble size N_e is smaller than or equal to the processor number p, each simulation is done by the processor group G^f_j = {P_{1+j·p/N_e}, P_{2+j·p/N_e}, ..., P_{p/N_e+j·p/N_e}} (j = 0, ..., N_e − 1) (see Figure 2b), which contains p/N_e processors (note that they appear together in one simulation box). The computing tasks of the processors in this group are assigned by the simulator based on domain decomposition. According to the load balancing principle, the vector entries of each dynamic variable are equally distributed, so the variable values are not complete on any single processor. For the processor group G^a_j = {P_j, P_{j+p/N_e}, ..., P_{j+(N_e−1)·p/N_e}} (j = 1, ..., p/N_e) (see Figure 2b), by multiplying their own vector entries with X_5, the processors in group G^a_j get the matrices Y^{a,j}, Y^{a,j+p/N_e}, ..., Y^{a,j+(N_e−1)·p/N_e}, respectively. By gathering these values within its own group G^a_j, every processor in this group gets the value Σ_{i=0}^{N_e−1} Y^{a,j+i·p/N_e}, which includes the corresponding updated vector entries. The distribution of the matrix entries is displayed in Figure 4. The processor groups are created by the function MPI_Comm_split.

Figure 3. The matrix multiplication of Y^f and X_5 in the case N_e > p. P_i (i = 1, ..., p) denotes the ith processor.
Figure 4. The entries of the matrix Y^f on different processors in the case N_e ≤ p. S_i (i = 1, ..., N_e) denotes the ith ensemble member. P_i (i = 1, ..., p) denotes the ith processor.
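The N_e > p case above can be emulated without MPI: each "processor" holds a few complete columns of Y^f, forms a zero-padded partial product Y^{a,j}, and summing the partial products (the role of MPI_Allreduce in the real code) recovers Y^a = Y^f X_5 exactly. A sketch:

```python
import numpy as np

def analysis_by_column_blocks(Yf_blocks, cols, X5):
    """Emulate the N_e > p analysis: 'processor' j holds full columns
    cols[j] of Y^f, computes a zero-padded partial product Y^{a,j}, and
    the sum over processors (MPI_Allreduce in the real code) gives Y^a.
    """
    n = Yf_blocks[0].shape[0]
    Ne = X5.shape[0]
    Ya = np.zeros((n, Ne))
    for block, c in zip(Yf_blocks, cols):
        Yf_j = np.zeros((n, Ne))
        Yf_j[:, c] = block          # this processor's columns of Y^f
        Ya += Yf_j @ X5             # partial product, then "Allreduce" sum
    return Ya

rng = np.random.default_rng(1)
Yf = rng.normal(size=(6, 4))
X5 = rng.normal(size=(4, 4))
cols = [[0, 1], [2, 3]]             # two "processors", two columns each
Ya = analysis_by_column_blocks([Yf[:, c] for c in cols], cols, X5)
print(np.allclose(Ya, Yf @ X5))     # True
```

The sum of the zero-padded column blocks equals Y^f, and matrix multiplication is linear, so the reduced result is exactly Y^f X_5.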
When N_e > p and N_e is not divisible by p, some of the processors get fewer ensemble members than the others and thus spend less time simulating; however, the algorithm is still the same. When N_e ≤ p and p is not divisible by N_e, special treatment and more message passing are required when gathering the vectors, since for some of the ensemble members the corresponding reservoir simulations are conducted by fewer processors than for the others.
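For the divisible N_e ≤ p case, the index sets G^f_j (consecutive blocks that run one simulation) and G^a_j (strided groups that gather the analysis results) can be computed as below. This is a sketch of the index arithmetic only; in the real code these groups would be created with MPI_Comm_split.

```python
def forecast_groups(p, Ne):
    """Consecutive blocks G^f_j: group j runs the simulation for member j."""
    r = p // Ne                     # processors per simulation, p/Ne
    return [list(range(1 + j * r, r + j * r + 1)) for j in range(Ne)]

def analysis_groups(p, Ne):
    """Strided groups G^a_j = {P_j, P_{j+p/Ne}, ...} used to gather results."""
    r = p // Ne
    return [list(range(j, p + 1, r)) for j in range(1, r + 1)]

# p = 8 processors, Ne = 4 members: forecast blocks of 2, analysis stride 2
print(forecast_groups(8, 4))  # [[1, 2], [3, 4], [5, 6], [7, 8]]
print(analysis_groups(8, 4))  # [[1, 3, 5, 7], [2, 4, 6, 8]]
```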
For a large-scale reservoir model that requires a large number of grid cells for discretization, the simulation with every ensemble member is time-consuming. In this case, the ensemble is not distributed to processors. Instead, the models with different ensemble members are simulated sequentially. Each model with one ensemble member is simulated by all the processors in parallel. Although the simulation run is very costly, it is obviously a trivial case of our discussion from the programming design aspect.

Numerical Experiments
In this section, the history matching of the SPE-9 benchmark model is presented. Estimations of relative permeability curves with known endpoints and unknown endpoints are plotted. With each estimation, the corresponding prediction results are obtained. Following that, the parallel efficiency of computations is reported.
There are 26 wells in the reservoir: one injector and 25 producers. The injector is perforated at layers 11 to 15, while each producer is perforated at layers 2, 3 and 4. The injector has a maximum standard water rate of 5000 bbl/day and a maximum bottom hole pressure of 4543.39 psi. The radius of each well is 0.5 ft. The schedule can be obtained from the keyword file shipped with commercial software such as CMG IMEX.
The oil relative permeability in three phases can be obtained from the water/oil two-phase relative permeabilities and the gas/oil two-phase relative permeabilities [25]. To represent a relative permeability curve by a power-law function (e.g., based on the empirical Corey model [26]), only the endpoints and the exponents need to be determined. Power-law models of the water and oil relative permeabilities can be written as

k_rw(S_w) = a_w [(S_w − S_wc)/(1 − S_wc − S_orw)]^{n_w} = a_w S_wD^{n_w},   (8)

k_row(S_w) = a_ow [(1 − S_w − S_orw)/(1 − S_wc − S_orw)]^{n_ow} = a_ow (1 − S_wD)^{n_ow},   (9)

where

S_wD = (S_w − S_wc)/(1 − S_wc − S_orw).

Similarly, power-law representations of the gas and oil relative permeabilities are

k_rg(S_g) = a_g [(S_g − S_gc)/(1 − S_gc − S_org − S_wc)]^{n_g} = a_g S_gD^{n_g},   (10)

k_rog(S_g) = a_og [(1 − S_g − S_wc − S_org)/(1 − S_gc − S_org − S_wc)]^{n_og} = a_og (1 − S_gD)^{n_og},   (11)

with

S_gD = (S_g − S_gc)/(1 − S_gc − S_org − S_wc).

Equations (8)-(11) will be used in our computing. We consider the case where the oil relative permeability is defined by normalizing the effective oil permeability by its value at the critical water saturation, i.e.,

k_row(S_w) = k_ow(S_w)/k_ow(S_wc),

where k_ow is the effective oil permeability. Then k_row(S_wc) = 1 holds, which leads to a_ow = 1.
Similarly, for the gas and oil relative permeabilities, if the oil relative permeability in the gas/oil system is defined by normalizing the effective oil permeability by its value at the critical gas saturation, i.e.,

k_rog(S_g) = k_og(S_g)/k_og(S_gc),

where k_og is the effective oil permeability, then k_rog(S_gc) = 1 and thus a_og = 1. Consequently, 10 parameters are left to define the relative permeability curves. Written as a parameter vector, they are

m_st = [a_w, n_w, n_ow, S_wc, S_orw, a_g, n_g, n_og, S_gc, S_org].   (12)

Note that the endpoints are fixed by 6 of these parameters: a_w, S_wc, S_orw, a_g, S_gc and S_org. If these parameters are known, the endpoints of the relative permeability curves are fixed and only 4 parameters are left. Thus, the parameter vector becomes

m_st = [n_w, n_ow, n_g, n_og].   (13)
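Equations (8)-(11) are straightforward to evaluate; a sketch with illustrative parameter values (the normalization a_ow = a_og = 1 is built in):

```python
import numpy as np

def water_oil_relperm(Sw, aw, nw, now, Swc, Sorw):
    """Power-law (Corey-type) water/oil relative permeabilities,
    Equations (8)-(9), with the oil curve normalized so that a_ow = 1."""
    SwD = (Sw - Swc) / (1.0 - Swc - Sorw)      # normalized water saturation
    krw = aw * SwD**nw
    krow = (1.0 - SwD)**now                    # a_ow = 1 after normalization
    return krw, krow

def gas_oil_relperm(Sg, ag, ng, nog, Sgc, Sorg, Swc):
    """Gas/oil counterparts, Equations (10)-(11), with a_og = 1."""
    SgD = (Sg - Sgc) / (1.0 - Sgc - Sorg - Swc)
    krg = ag * SgD**ng
    krog = (1.0 - SgD)**nog
    return krg, krog

# endpoint behaviour: k_rw(S_wc) = 0 and k_row(S_wc) = 1
krw0, krow0 = water_oil_relperm(0.2, aw=0.6, nw=2.0, now=2.5, Swc=0.2, Sorw=0.15)
print(krw0, krow0)  # 0.0 1.0
```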
For an estimation with known endpoints, the observations are the cumulative water production rate (CWPR), cumulative oil production rate (COPR) and cumulative gas production rate (CGPR) of all 25 production wells, as well as the cumulative water injection rate (CWIR) of the injection well. For an estimation with unknown endpoints, the observations are the oil, gas and water production rates of the 25 producers, the water injection rate of the injector and the bottom-hole pressures of all the wells. All the rates are obtained from our simulator PRSI. All the data are perturbed with Gaussian errors of 5% variance.

Estimation with Known Endpoints
As mentioned before, when the endpoints are known, only four unknowns are left, as shown in Equation (13). In this work, a total of 160 Gaussian realizations are generated for these parameters to assimilate the observations. The days on which the observations are assimilated are listed in Table 1. The first data assimilation step is on the 12th day and the last on the 552nd day. After that, each updated ensemble member, with the relative permeability parameter values from the last step, is used to simulate the model until the end day to predict the rates. The reference values, together with the means and variances of the unknowns used to generate the ensemble, are listed in Table 2. The adjusted mean values after data assimilation and their relative errors with respect to the reference values are also listed in this table. It turns out that the data assimilation is very effective. From Table 2, it is seen that the mean values of the initial ensemble are far from the reference values. However, after data assimilation, the values are quite close to the reference values, with relative errors ranging from 0.75% to 7.67%. The relative permeability curves are shown in Figure 5. It is seen that the initial curves are randomly scattered and the curves from the mean values are far from the reference ones. After assimilation, the adjusted curves almost overlap with the reference ones.

Table 2. This table shows the reference values, the adjusted values and the relative errors at the last assimilation step, and the initial ensemble means and variances of the four parameters in Equation (13).

To show the parameter values more clearly, the variations of all four parameters over time are shown in Figure 6a-d, which depicts that the variation of each parameter becomes smaller through data assimilation. At the same time, the mean values become closer to the reference values. A relative root mean square error is calculated to show the convergence of the parameters.
It is a normalized difference between the mean value of the updated parameter vector and its true value, i.e.,

RE_st = sqrt( (1/N_{m_st}) Σ_{i=1}^{N_{m_st}} [ (m̄_{st,i} − m_{st,i}^{true}) / m_{st,i}^{true} ]^2 ),   (14)

where m̄_{st,i} is the ith component of the mean of the updated vector m_st, m_{st,i}^{true} is the corresponding true value and N_{m_st} is the length of the vector [9]. Figure 6e depicts the relative root mean square error of the relative permeability model vector. It shows that at the last assimilation step, the relative root mean square error drops to 5%.

From the 552nd day, the adjusted ensemble parameters from the last analysis step are used to predict the injection and production data. The cumulative oil production rate (COPR), the cumulative water production rate (CWPR) and the cumulative gas production rate (CGPR) of the group of all 25 production wells, as well as the cumulative water injection rate (CWIR) of the injection well, are predicted and shown in Figure 7. In this figure, gray lines represent the rates from the initial ensemble. As the initial ensemble is randomly generated, these rates are quite different from each other. From day 12 to day 552, the observation data are assimilated 30 times by the EnKF method, the permeability curves are adjusted, and the newly calculated and updated rates match the reference rates very well. After the last assimilation step, the adjusted permeability curves at day 552 are used to simulate the reservoir model to predict the rates. The predicted rates and the reference rates almost overlap, which means that the adjusted permeability curves are accurate enough to be used to predict future performance.
For the last analysis step, the relative errors of the pressure P, the water saturation S_w, the gas saturation S_g and the well bottom hole pressure P_BH are calculated by normalizing the difference between the ensemble mean and the variable's true value. For each variable v, the relative error is computed as

RE_v = ||v_mean − v_true||_2 / ||v_true||_2,   (15)

where ||·||_2 denotes the Euclidean norm of a discretized variable. The relative errors of these variables are listed in Table 3. The table shows that the relative errors of the dynamic variables are no more than 2.5%. In particular, those of the pressure P, the water saturation S_w and the well bottom hole pressure P_BH are lower than 1.0%.

Figure 7. The production rates with known endpoints: (a) cumulative oil production rate; (b) cumulative water production rate (the upper curve is the cumulative water injection rate; the lower curve is the cumulative water production rate); (c) cumulative gas production rate.
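Both error measures used in this section are simple to compute; a sketch consistent with the definitions above (the exact averaging conventions are as stated in the text):

```python
import numpy as np

def relative_rmse(m_mean, m_true):
    """Relative root mean square error of the parameter vector, the
    componentwise measure used for the convergence plots (Figure 6e)."""
    m_mean = np.asarray(m_mean, float)
    m_true = np.asarray(m_true, float)
    return np.sqrt(np.mean(((m_mean - m_true) / m_true) ** 2))

def relative_l2_error(v_mean, v_true):
    """Relative error of a dynamic variable,
    ||v_mean - v_true||_2 / ||v_true||_2 (Equation (15))."""
    v_mean = np.asarray(v_mean, float)
    v_true = np.asarray(v_true, float)
    return np.linalg.norm(v_mean - v_true) / np.linalg.norm(v_true)

# a parameter vector off by +/-10% in each component has a 10% relative RMSE
print(round(relative_rmse([1.1, 0.9], [1.0, 1.0]), 6))  # 0.1
```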

Estimation with Unknown Endpoints
For permeabilities with unknown endpoints, good estimations are not obtained by assimilating the observations of COPR, CWPR, CWIR and CGPR. This is probably because when the endpoints are unknown, the solution of this inverse problem is not unique. Therefore, we assimilate more detailed data, including the oil production rates (WOPR), the water production rates (WWPR) and the gas production rates (WGPR) of the 25 producers, the water injection rate of the injector (WWIR) and the bottom-hole pressures (WBHP) of all the wells. It is interesting to see that with more data assimilated and an ensemble of 640 samples, all the parameters are well estimated. Table 4 shows the reference values, the adjusted values, the relative errors, and the initial ensemble means and variances. From this table, we can see that most of the relative errors are no more than 10%, except those of a_w, n_ow and S_org, which are about 12%. Figure 8 shows the relative permeability curves of oil-water and oil-gas. Although the initial curves are quite random, the updated curves are close to the reference curves.
Similar to the previous example, the variations of all ten parameters over time are plotted in Figure 9a-j. This figure depicts that the variations of the parameters become smaller through data assimilation, while the mean values become closer to the reference values. Figure 9k shows that from the beginning to the last data assimilation step, the relative root mean square error of all ten parameters drops from 65.5% to 8.4%.

Table 4. This table shows the reference values, the adjusted values and the relative errors RE_st at the last assimilation step, and the initial ensemble means and variances of the ten parameters in Equation (12).

For the prediction step, the injection/production rates obtained from the adjusted parameters match the observations very well. The results of WOPR, WWPR, WGPR and WBHP of some selected wells, namely Producer 1, Producer 4, Producer 8 and Producer 20, are depicted in Figure 10. For each well, the relative error of a prediction variable is calculated from the root mean square of the normalized difference between the predicted ensemble mean d_mean and the observation on the N_D days in the set D = {565, 600, 601, 604, 610, 625, 660, 661, 664, 670, 685, 720, 721, 724, 730, 745, 783, 873, 900}. The resulting errors are shown in Figure 11, which shows that with the estimated parameters, the relative errors of the predicted rates and the well bottom-hole pressures are no more than 3.5%. The relative errors of the dynamic variables at the last analysis step, calculated by Equation (15), are listed in Table 5. This table shows that the relative errors of the dynamic variables are no more than 4.28%.

Figure 10. The oil production rates, the water production rates, the gas production rates and the bottom-hole pressure of the selected production wells with unknown endpoints.

Parallel Performance and Discussion
The numerical experiments are carried out on the Niagara SciNet cluster of 1548 Lenovo SD530 servers, each with 40 Intel 'Skylake' cores at 2.4 GHz. To test the MPI parallel performance of our computing, the examples with both known and unknown endpoints from the previous section are computed with 40, 80, 160, 320 and 640 cores. In these computations, the switch readens is set to FALSE; therefore, only the ensembles at the last analysis step are output to files. For a more precise comparison, the initial ensemble is generated in advance and read from the disk at the beginning of each run. Both experiments are computed with one process per CPU core, and the 40-core case is used as the base case. Computational times are given for both examples, and the parallel efficiency and the speedup are presented. Generally, the parallel efficiency is defined as

E_p = T_1 / (p T_p),

where T_1 and T_p denote the run times of a task using one processor and p processors, respectively, and T_1 is the base case time. If the execution time of p_0 processors is used as the base case, the parallel efficiency can be calculated by

E_p = (p_0 T_{p_0}) / (p T_p).

The speedup is defined as

S_p = T_1 / T_p.

Similarly, if the execution time of p_0 processors is used as the base case, the speedup can be calculated by

S_p = (p_0 T_{p_0}) / T_p.

For the known endpoint experiment with 160 ensemble members, the computational time and parallel efficiency are listed in Table 6. In this table, 'forecast' denotes the total forecast time including the synchronization time; 'analysis' is the total time spent on the analysis steps; 'predict' denotes the prediction time after the last analysis step; and 'other' mainly includes the run time of input/output. The last row, 'run time', is the total execution time. From this table, it can be seen that most of the time is spent on the forecast and prediction steps. As the number of CPU cores increases, the run time decreases rapidly.
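The efficiency and speedup definitions above, including the generalization to a multi-core base case, can be written directly as small helpers (an illustrative sketch; timing values below are made up):

```python
def speedup(T_base, T_p, p_base=1):
    """Speedup relative to a base run: S_p = p0 * T_{p0} / T_p.

    With p_base=1 this reduces to the classical S_p = T_1 / T_p; scaling by
    p_base makes the ideal value equal p when the base case itself is parallel.
    """
    return p_base * T_base / T_p

def efficiency(T_base, T_p, p, p_base=1):
    """Parallel efficiency E_p = (p0 * T_{p0}) / (p * T_p), ideal value 1.0."""
    return p_base * T_base / (p * T_p)

# hypothetical numbers: a 40-core base case taking 1000 s,
# perfect scaling to 80 cores (500 s)
print(speedup(1000.0, 500.0, p_base=40))      # 80.0
print(efficiency(1000.0, 500.0, 80, 40))      # 1.0
```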
When 40, 80 and 160 CPU cores are used, the forecast time is reduced by half whenever the number of CPU cores is doubled, and the same holds for the prediction steps; the parallel efficiencies of the forecast and prediction steps are therefore over 100%. When 320 and 640 CPU cores are used, the ensemble size of 160 is smaller than the processor number, so each simulation is conducted by multiple CPU cores and requires message passing; the parallel efficiencies of the forecast and prediction steps then drop below 100%. The last row shows that the parallel efficiency of the whole execution is over 74%, compared to the ideal efficiency of 100%. Note that the parallel efficiency of the analysis step is not as good as that of the forecast and prediction steps, because the time for the multiple rounds of message passing becomes dominant in the analysis step when more processors are used. The speedup is shown in Figure 12. When 40, 80 and 160 CPU cores are used, the speedup is almost linear, which benefits from the good parallel efficiencies of the forecast and prediction steps. When 320 and 640 CPU cores are used, it drops and becomes sub-linear. For the unknown endpoint experiment with 640 ensemble members, the time and the parallel efficiencies are shown in Table 7. All steps take more time than in the known endpoint example since the ensemble size is much larger. Similar to the previous example, good parallel efficiencies of the forecast and prediction steps are obtained, because here the ensemble size is always less than or equal to the processor number, so each simulation is conducted by one process independently with no message passing required. The run time reduces significantly when more CPU cores are used, and the parallel efficiency is over 77%. The time for the analysis step also decreases when more CPU cores are used.
However, the parallel efficiencies for this step are much lower than the ideal one. One reason is that the observation reading time does not decrease as the number of CPU cores increases. Another reason is that several rounds of message passing are required to obtain the updated ensemble in the analysis steps.
To show the time allocation of the analysis step in detail, the time spent on every part of the last analysis step is listed in Table 8. In this table, 'read obs' means the time used for reading observations from the files; 'compute X_5' means the time for computing the matrix X_5; 'A_f * X_5' denotes the time used for the multiplication of the local forecast ensemble and the matrix X_5; 'MPI_Allreduce' denotes the message passing time used in the analysis step; 'output' denotes the time used for writing the ensemble; and 'total' denotes the total time of this analysis step. The rows 'read obs' and 'compute X_5' show that the time for computing the matrix X_5 is very short compared to the others. The row 'A_f * X_5' shows that the time used for the matrix multiplication reduces significantly when more CPU cores are used; compared to the message passing time shown in the row 'MPI_Allreduce', it becomes less dominant as the number of CPU cores increases. The times for reading observations and for message passing do not decrease when more CPU cores are used, which leads to the low parallel efficiencies of the analysis steps. The 'output' row shows that the time used for writing the ensemble is also reduced to some extent when more CPU cores are used. The speedup curve of the whole execution is plotted in Figure 13, which shows better performance than the known endpoint example.
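The 'A_f * X_5' and 'MPI_Allreduce' rows reflect the usual column-partitioned product: each process holds a block of ensemble members (columns of the forecast ensemble) and the matching block of rows of X_5, multiplies them locally, and the partial products are summed across processes so that every process ends up with the full updated ensemble. A minimal NumPy sketch of this idea, simulating the reduction with a plain sum instead of MPI_Allreduce (sizes and variable names are illustrative, not taken from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_state, n_ens, n_proc = 12, 8, 4              # illustrative sizes
A_f = rng.standard_normal((n_state, n_ens))    # forecast ensemble (state x members)
X_5 = rng.standard_normal((n_ens, n_ens))      # analysis transform matrix

# Each "process" owns a block of ensemble members (columns of A_f)
# and the matching block of rows of X_5.
blocks = np.array_split(np.arange(n_ens), n_proc)
partials = [A_f[:, idx] @ X_5[idx, :] for idx in blocks]  # local partial products

# Summing the partial products plays the role of MPI_Allreduce:
# afterwards every process would hold the full product A_f @ X_5.
A_a = sum(partials)
print(np.allclose(A_a, A_f @ X_5))  # True
```

Because only the (n_ens x n_ens) partial products are reduced, the message size is independent of the state dimension, which is why the local multiplication scales with core count while the reduction time stays roughly constant.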

Conclusions
A parallel technique is used to implement the DEnKF for reservoir history matching, and the algorithm and implementation are introduced in detail. Two numerical experiments of estimating relative permeabilities are computed on a massively parallel computer. Reservoir history matching using the DEnKF can be efficiently implemented with the MPI parallel technique: the parallel efficiency is over 74% for a small ensemble with 160 members and over 77% for a large ensemble with 640 members using up to 640 CPU cores.
The numerical experiments demonstrate that the DEnKF can be used to estimate relative permeability curves given by a power-law model in reservoir history matching. For relative permeability curves with known endpoints, all the parameters of the power-law model can be estimated well. For the unknown endpoint model, good estimations can still be obtained, provided that more observations and a larger ensemble size are used than in the known endpoint case.

Informed Consent Statement: Not applicable.
Data Availability Statement: SPE-9 model data are from the CMG reservoir simulation software under the installation directory \IMEX\2019.10\TPL\spe.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: