MCMCINLA Estimation of Missing Data and Its Application to Public Health Development in China in the Post-Epidemic Era

Medical data are often missing during epidemiological surveys and clinical trials. In this paper, we propose the MCMCINLA estimation method to account for missing data. We introduce a new latent class into the spatial lag model (SLM) and use a conditional autoregressive specification (CAR) spatial model-based approach to impute missing values, making the model fit into the integrated nested Laplace approximation (INLA) framework. Combining the advantages of both the Markov chain Monte Carlo (MCMC) and INLA frameworks, the MCMCINLA algorithm is used to implement imputation of the missing data and fit the model to derive estimates of the parameters from the posterior margins. Finally, the economic data and the hemorrhagic fever with renal syndrome (HFRS) disease data of mainland China from 2016–2018 are used as examples to explore the development of public health in China in the post-epidemic era. The results show that compared with expectation maximization (EM) and full information maximum likelihood estimation (FIML), the predicted values of the missing data obtained using our method are closer to the true values, and the spatial distribution of HFRS in China can be inferred from the imputation results with a southern-heavy and northern-light distribution. It can provide some references for the development of public health in China in the post-epidemic era.


Introduction
Missing data are common and unavoidable in daily life. For example, in engineering design, equipment failure may cause some data to fail to be collected normally, and during market research, there may also be situations where respondents refuse to answer relevant questions. Besides, missing data often occurs in the medical context. For instance, in epidemiological surveys, equipment limitations prevent access to complete information, and in medical databases, not all patients' clinical test results are available at a given time, leaving a portion of the attribute values vacant. In addition, data can be lost due to the failure of storage media and transmission media.
Information theory is an important discipline based on the methods of probability theory and mathematical statistics for the study of information entropy, data processing, and data transmission [1]. It has a wide range of applications in the effective processing and reliable transmission of information. In the era of big data, the data processing of complex information becomes an important part of research and analysis. Dealing with missing data has always been the first problem that researchers need to solve before conducting statistical analysis, as improper handling can lead to deviations in statistical inferences and even affect the final decision. Little and Rubin (2002) [2] gave a detailed and systematic introduction to the different missing mechanisms and imputation models of missing data. Buuren (2011) [3] supplemented and improved it on this basis and provided a demonstration of R code to make the imputation process more intuitive. At present, scholars have proposed a variety of imputation methods for the missing data problem, which can basically be divided into two categories: statistical methods and machine learning methods. Statistical methods mostly make assumptions based on the dataset itself, and then use the original dataset to impute the missing data accordingly. Common methods include expectation maximization (EM) imputation [4], regression imputation [5], and multiple imputation [6]. Machine learning methods generally impute the missing dataset by clustering with K-nearest neighbor imputation [7] and K-means imputation [8], and Bayesian networks [9] are represented. In recent years, with the rise of the machine learning boom, Bayesian networks have become a frequently used method to deal with missing data. Mason (2009) [10] proposed the use of a Bayesian method to model nonrandom missing data through the Bayesian missing imputation framework to adjust the missing covariates in longitudinal studies. Erler et al. (2016) [11] suggested that missing values can be imputed in a joint estimation framework using Bayesian methods. Zhang et al. (2017) [12] proposed a missing data processing method based on plain Bayesian and EM algorithms for the software workload problem. Ding (2020) [13] conducted a comparative study of the missing data imputation problem in normal models using Bayesian and jackknife multiple imputation methods, respectively, and concluded that Bayesian imputation results are more accurate.
The Markov chain Monte Carlo (MCMC) algorithm has now become a standard method for parameter estimation in many models. Dogan and Taspinar (2018) [14] (hereinafter DT) performed parameter estimation for the spatial error sample selection model with nonrandom missing data using MCMC; Hajime Seya et al. (2020) [15] improved on the work done by DT, and in parallel, they proposed that MCMC can handle the parameter estimation problem of the spatial lag sample selection model with nonrandom missing data. Although the MCMC can solve Bayesian inference excellently, MCMC may be limited by the speed of convergence and numerical stability when faced with larger models or more data. To address this problem, Rue et al. (2009) [16] proposed an algorithm that combines Laplace approximation with modernized numerical integration under a Bayesian framework-the integrated nested Laplace approximation (INLA)-which can significantly reduce computation time while guaranteeing the accuracy of an MCMC estimation. G'omez-Rubio et al. (2017) [17] described the realization of a new class of latent model in INLA which can be used directly for fitting spatial econometric models, and G omez-Rubio and Rue (2018) [18] created a new approach combining INLA and MCMC, namely, MCMCINLA, and used it to fit spatial econometric models, linear regression models with missing data in covariates, Bayesian Lasso models, and mixed models. Gomez-Rubio et al. (2019) [19] also redefined the problem of missing values in regression models covariates by latent Gaussian Markov random field (GMRF) for analysis and imputation of missing data, and they applied it to the spatial model and the multiple linear regression model to overcome the problem wherein INLA cannot handle a model with missing values in covariates.
This paper proposes a new MCMCINLA imputation method for missing data and uses the hemorrhagic fever with renal syndrome (HFRS) disease data with random missing in covariates to establish a spatial lag (SLM) latent model to explore the developmental inputs to public health in China before the COVID-19 outbreak, and to provide reference suggestions for the national financial inputs to public health in the post-epidemic era. In addition, the use of the imputation effects of EM, Full Information Maximum Likelihood estimation (FIML), and MCMCINLA method on the missing data are compared to illustrate the effectiveness of the imputation method proposed in this paper.
The paper is structured as follows: Section 2 reviews the different missing mechanisms for the missing data and introduces the SLM latent model with random missing data in covariates; Section 3 gives the proof procedure of the model GMRF structure and describes the process of implementing the MCMCINLA algorithm for the SLM latent model with random missing data in covariates; Section 4 conducts numerical simulations for the model and algorithm proposed in this paper to verify the correctness of the method; Section 5 presents an empirical analysis of the reform adjustment problem of public health development in China in the post-epidemic era as an example; and finally, the conclusions and discussions are given in Section 6.

Missing Mechanism
The relationship between missing data and complete data is known as the missing mechanism, which is broadly classified into three types: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). MCAR occurs when the missing data is random and unrelated to both observed and unobserved data, MAR occurs when the missing data is only related to observed data, and NMAR occurs when the missing data is related to both observed and unobserved data [20].
Furthermore, missing data can be classified into four categories:(Y obs , X obs ), (Y mis , X obs ), (Y obs , X mis ), and (Y mis , X mis ), depending on where the missing values are located. For the first case, when neither X nor Y contains missing data, the fit can be performed directly using INLA or MCMCINLA (see, for example, Gómez-Rubio, V. et al. (2017,2018) [17,18]); for the second case, when the missing data are in the response variables, INLA can use its own properties to predict the missing values by directly calculating the predicted distribution of all the missing data in the response variable (see, for example, Gómez-Rubio, V. et al. (2017) [17]); for the third and fourth case, when the missing data are in the covariates, it is necessary to first define the imputation submodel as latent effects with a GMRF structure to make it suitable for the INLA framework and include the imputation in the main model, and then further perform the model fitting with the help of the INLA, which is also the focus of this paper.

The SLM Latent Model
In spatial statistics, the SLM has received increasingly wide attention from many scholars. It is mainly used to study the impact of the behavior of adjacent regions on the behavior of other regions of the whole system, expressed formally as: where Y represents the observation vectors of n different regions, ρ Lag is the spatial autocorrelation parameter, W is the adjacency matrix, β is the coefficient vector of covariates, and the error term e obeys a multivariate normal distribution with the mean 0 and diagonal variance-covariance matrix σ 2 I n . We can also shift the term for Y and rewrite the model as: The key to enabling the SLM to be implemented under the MCMCINLA is whether the model can be implemented within INLA. Normally, the SLM cannot be fitted directly with INLA, and we need to construct latent classes and redefine the original SLM as a model with GMRF so as to conform to the INLA framework. We can construct a latent class for the SLM as follows: where x denotes the vector of n random effects, ρ is the spatial autocorrelation parameter, W is the weight matrix, X = (X mis , X obs ) are the covariates with random missing data, β is the coefficient vector of covariates, and the error term e obeys an independent Gaussian distribution with the mean 0 and precision matrix τ I n .
Using the constructed latent class of Equation (2) to rewrite the SLM model, we can obtain the SLM latent model: where ε is a small error term that is used to fit the model.

Algorithm Description
The INLA algorithm mainly targets models with structured additive regression models with a latent random field of GMRF. The premise of using the MCMCINLA fitting model is that the model needs to conform to the INLA framework, i.e., it has a GMRF structure in order to be solved. Therefore, this section first explains the GMRF structure of the main model and the imputation model, and then describes the implementation process of the MCMCINLA algorithm. If we assign a Gaussian prior with a zero mean and the precision matrix Q to β in Equation (2), the e obeys a Gaussian distribution with a zero mean and the precision matrix τ I n , with τ as a precision parameter. Then, INLA will want to obtain the joint distribution π(x, β) of x and β. By Bayes' theorem, we have: and, by definition, where P is the precision matrix of (x, β) with the structure: This shows that for the given hyperparameters τ and ρ, the mean and precision matrix of (x, β) are 0 and P; that is, the constructed latent class x has a GMRF structure with the mean 0 and precision matrix P, which is consistent with the INLA framework, and thus can be implemented with the help of MCMCINLA.

GMRF Structure of the Imputation Model
For the covariates X = (X mis , X obs ) containing missing data, X mis denotes the part with the missing values and X obs denotes the part that is observable. We define the imputation submodel of the covariates as the latent effect x = (x mis , x obs ), and set different priors for x mis and x obs , respectively, where the observed term x obs in the latent effect is set to the mean equal to X obs and a high precision matrix (e.g., here, we take it as 5 10 I) so that its variance is very small, i.e., it makes the observed term x obs in the latent effect as infinitely close to the observed covariate data X obs as possible [19]. A spatial model with the mean µ c and precision matrix Q c is built for the missing term x mis in the latent effect to impute the missing covariate data, and the procedure is as shown below.
The imputation model, that is, given the observed data X obs and the hyperparameter θ, provides the distribution of the missing values X mis ; hence, we have: Since π( θ|X obs ) is only related to the observed data X obs , it can be considered as a priori of θ, which can further rewrite π( θ|X obs ) as: In the general case, we assume that the covariates X = (X mis , X obs ) follow a multivariate normal distribution of: Q mis,mis Q mis,obs Q obs,mis Q obs,obs , and then it defines that its imputation model will obey: where µ c = µ mis − Q −1 mis,mis Q mis,obs (X obs − µ obs ) and Q c = Q mis,mis . Considering that the covariates in the SLM are spatially correlated, the missing data in the covariates are imputed using a conditional autoregressive specification (CAR) spatial model-based approach. Under CAR, the mean of the model is set as µ = α T and the precision matrix as Q = τ(I − ρW), where α is the intercept of the linear predictor, τ is the precision parameter, ρ is the spatial autocorrelation parameter, W denotes the adjacency matrix, and the hyperparameter θ at this time consists of τ, ρ, and α.
Thereby, the latent effect x = (x mis , x obs ) will obey the following multivariate normal distribution: where µ c and Q c are taken as shown in Equation (9). This shows that the latent effect x that we defined for the imputation model has a GMRF structure that can be applied to the INLA framework, and thus it can be implemented with the help of MCMCINLA.

Implementation of the MCMCINLA Algorithm
When using MCMCINLA to deal with the SLM latent model with random missing data in covariates, it is first necessary to define an imputation submodel to impute the covariates with missing values so as to substitute the complete covariate data back into the SLM, and then use MCMCINLA to fit the model for estimation. The core of the MCMCINLA algorithm parameter estimation lies in dividing the estimated parameters into two groups: the first group is estimated using the Metropolis-Hastings (MH) algorithm in MCMC and the second group is estimated using the Bayesian model averaging (BMA) algorithm [21] in INLA. Therefore, the whole algorithmic process is carried out in three main steps:

1.
Imputation of the missing covariates X = (X mis , X obs ) using INLA In this paper, we chose to impute the missing covariates X = (X mis , X obs ) using a CAR space model-based approach, which first requires the definition of the mean, precision, hyperparameters, and prior of each hyperparameter in the latent effect x . The key codes are as follows: inla.rgeneric.micar.model = function(cmd = c("graph", "Q", "mu", "initial","log.norm.const", "log.prior", "quit"), theta = NULL), which define the spatial weight matrix ("graph"), precision matrix ("Q"), mean ("mu"), hyperparameter prior ("log.prior"), etc. in the latent effect x by the inla.rgeneric.micar.model, respectively, in preparation for defining the imputation model; which defines the imputation model via the inla.rgeneric.define(), where n denotes the total number of indices, x denotes the covariates containing the missing data, idx.mis denotes the index of each missing datum, and W is the spatial weight matrix; and where, finally, INLA is used to complete the fit of the imputation model, and where f(idx, model =model) represents the spatial effect of the imputation model. The covariate X after imputation via INLA is incorporated into the SLM, at which time there are three parameters to be estimated in the model, namely, the spatial autocorrelation parameter ρ, the covariate coefficient β, and the error term precision τ. Here, we use MH estimation for the parameter ρ and BMA estimation for the remaining parameters.

2.
Estimation of the spatial autocorrelation parameter ρ using MH The estimation of ρ using MH is carried out in three main steps, as follows:
The key code for the process is as follows: in which we fit the SLM by defining fit.inla() to prepare for subsequent sampling, where formula is a formula with the response variable and the fixed effects, d represents the complete data set after imputation, W is the spatial weight matrix as above, and rho is the spatial autocorrelation parameter; and INLAMH (d.mis, fit.inla, d.init, rq, dq, prior, n.sim = 200, n.burnin = 100, n.thin = 1, . . . ), in which we fit the parameters ρ by INLAMH(), where fit.inla is the model fitted earlier, and here, namely, the SLM, and d.init, rq, dq, and prior are all basic settings in the MH algorithm which denote the initial value of sampling, the proposed distribution, the density function of the proposed distribution, and the prior distribution, respectively.

3.
Estimation of the coefficient β and precision τ using BMA For the parameter β and hyperparameter τ, the conditional margins π( β Y, ρ (j) ) and π( τ Y, ρ (j) ) generated by the MH algorithm in each iteration can be obtained using BMA, and further, the posterior margins of β and τ are derived by integrating over ρ, namely: The key code for the process is as follows: INLABMA:::fitmatrixBMA(l.models, ws, "summary.fixed") INLABMA:::fitmatrixBMA(l.models, ws, "summary.hyperpar") which calculate the posterior margins for fixed effects and hyperparameters by the fitma-trixBMA() function in the INLABMA package and where l.models are the INLA models to be averaged and ws is the weight vector.

Data Generation
Assuming that (I n − ρ Lag W) is invertible, we consider the numerical simulation process of the SLM latent model for covariates with random missing data as follows: where Y is the response variable, X 1 and X 2 are the covariates, and β 1 and β 2 are the coefficients corresponding to X 1 and X 2 , respectively, I n is an n × n unit matrix, W is an n × n spatial weight matrix, e is a random error term with e ∼ N(0, σ 2 ), and ε is a fitting error term.
The specific simulation data are taken as follows: • In the main model, X 1 ∼ U(0, 1) and X 2 ∼ U(0, 1), β 1 = 0.3 and β 2 = 0.5, ρ Lag = 0.9, and Y is generated by Equation (14) and we randomly remove 15% data as the missing values in X 1 using the MAR mechanism; • In the imputation model, α = 0.5 and ρ = 0.2; and • The spatial weight matrices W in both the main model and the imputation model are selected as Queen-type adjoints, created in the regular lattice, and all error terms in the model obey the normal distribution N(0, 0.5 2 ), taking the sample size n = 250, pre-burn simulation of 20 times, interval rejection after pre-burn to keep one of the 5 iterations, and, finally, a total of 80 iterations of simulation are completed.

Fitting Effect Evaluation Indicators
The mean square error (MSE) value is used to reflect the difference between estimates and estimates, the deviance information criterion (DIC) value is used to measure the fit of the Bayesian model, and the accuracy is used to measure the prediction of the missing data. The MSE can evaluate the degree of change of the data, and the smaller the value of the MSE, the higher the accuracy of the data; the DIC criterion can weigh the complexity of the estimated model and the goodness of the model fit, and the smaller the value, the better the model fit; the accuracy is the ratio of the predicted value to the true value, which is used to reflect the similarity between the true value and the predicted value of the missing data and capture the missing information, and the larger the value, the more accurate the prediction result.

Results Analysis
Using MCMCINLA to first impute the missing data in the covariates by the method based on the CAR spatial model, and then fit the estimates to the SLM latent model, the results of each simulation are shown in Tables 1 and 2, the fitted curves of ρ Lag , β 1 , and β 2 and τ and the prediction comparison plots of the missing data are shown in Figures 1-3. According to the results in Table 1, it can be found that the mean of each parameter β 1 , β 2 , ρ Lag , and τ estimated in the SLM latent model are very close to the true values and the MSE and the DIC are small, indicating that the parameter estimation and model fitting of this estimation method are good. In addition, Table 2 shows the imputation results for each missing datum in the simulation, where "Mean" represents the mean of the predicted value of the missing data, "95%CI" denotes the 95% credible interval of the predicted value of the missing data, "True value" represents the true value of the missing data, and "Accuracy" is the ratio of "Mean" to "True value". As shown in Table 2, except for the prediction accuracy of V7 and V34, which is around 85%, the prediction accuracy of each missing datum is basically above 90%, indicating that the imputation accuracy of our method is high and the information contained in the missing data can be effectively mined for research with the help of this method. the missing data and the y-axis represents the true values of these missing data that are artificially removed. As can be seen from Figures 1 and 2, except for the peak of , which deviates slightly from the true value, the peaks of all the other parameters are very close to the true value, and all points in Figure 3 basically fall on the line = . This indicates that the SLM latent model with the covariates containing missing data can be estimated and predicted relatively well using MCMCINLA.  the missing data and the y-axis represents the true values of these missing data that are artificially removed. As can be seen from Figures 1 and 2, except for the peak of , which deviates slightly from the true value, the peaks of all the other parameters are very close to the true value, and all points in Figure 3 basically fall on the line = . This indicates that the SLM latent model with the covariates containing missing data can be estimated and predicted relatively well using MCMCINLA.

Subject Presents
In the post-epidemic era, outbreaks of various infectious diseases, especially the outbreak of COVID-19, have raised thoughts about the restructuring of the public health system and the corresponding financial investment reforms [22]. How to effectively adjust to  Figures 1 and 2 show the fitted curves of the covariate coefficients β 1 and β 2 , the error term precision τ, and the density function curve of the spatial autocorrelation parameter ρ Lag , respectively, where the black solid line is the fitted value and the black dashed line perpendicular to the x-axis is the true value. Figure 3 shows the predicted versus true values of the missing data, where the x-axis represents the predicted values of the missing data and the y-axis represents the true values of these missing data that are artificially removed. As can be seen from Figures 1 and 2, except for the peak of ρ Lag , which deviates slightly from the true value, the peaks of all the other parameters are very close to the true value, and all points in Figure 3 basically fall on the line y = x. This indicates that the SLM latent model with the covariates containing missing data can be estimated and predicted relatively well using MCMCINLA.

Subject Presents
In the post-epidemic era, outbreaks of various infectious diseases, especially the outbreak of COVID-19, have raised thoughts about the restructuring of the public health system and the corresponding financial investment reforms [22]. How to effectively adjust to the various problems revealed under the epidemic to ensure the continued health of the country's public health has become a key and urgent issue for current research. In the case of medical data, data are often missing during collection and transmission due to clinical trials or equipment failures [23]. It is important to make full use of the large amount of data to mine the important information and make correct predictions of the missing data to make a comprehensive analysis of the research subject. This paper obtains the economic and disease data of 31 regions in mainland China from 2016 to 2018 through the China Statistical Yearbook (http://www.stats.gov.cn, accessed on 31 December 2020) and the Public Health Science Data Center (http://www.phsciencedata. cn, accessed on 31 December 2018), and uses the national financial investment in public health as the response variable Y and the number of infectious disease cases, economic development, and scientific and technological development as the covariates X to construct the SLM latent model to explore how public health in China can further be developed in the post-epidemic era. The national financial investment in public health is expressed by the indicator "health expenditure in general public budget expenditure by region". Since China has become the most seriously affected country in the world by HFRS [24], the infectious disease studied here is HFRS, and the number of infectious disease cases is expressed by the indicator "number of HFRS cases by region". The economic development is expressed by the indicator "regional gross product". The scientific and technological development is expressed by the indicator "number of research and experimental development R&D projects by region". Using the MAR mechanism to randomly remove 15% of the data from the number of infectious disease cases to construct the SLM latent model for covariates with random missing data, we have: If we set the prior distribution for the hyperparameters, the ρ Lag of the main model is assigned a uniform distribution that satisfies (−2, 1), and at this time, the minimum eigenvalue of the adjacency matrix W is −0.5. The prior of the coefficient vector β i and τ are set as the default values in R-INLA. Then, we assign a Gaussian prior with a zero mean and 0.00005 precision to the intercept α in the imputation model, set in the imputation model as above using the default settings in R-INLA, and assign logit(ρ) to a Gaussian prior with a zero mean and 0.001 precision.

Exploring the Development of Public Health in China in the Post-Epidemic Era
Using MCMCINLA, the missing data are imputed with the help of inla.rgeneric.define(), and the model is fitted using fit.inla( ) and the estimated values of the spatial autocorrelation parameters and the posterior estimates of the regression coefficients of the influencing factors are obtained as shown in Table 3. The regression coefficients of the influencing factors are plotted in Figure 4. Table 3. Posterior estimates of the regression coefficients of each influencing factor of financial investment in public health. As estimated by MCMCINLA, the spatial autocorrelation parameter is ρ Lag = 0.6933, which indicates that there is a significant spatial correlation between the national financial investment in public health between regions. To a certain extent, the public health development of a region also has some influence on the public health development of the surrounding regions. From the results of Table 3, it can be found that the posterior mean of economic development is 1.2719, which indicates that the increase of economic level will increase the national financial investment in public health, which can provide the power financial source for the development of public health. However, the posterior mean of the number of infectious disease cases and scientific and technological development are −0.0421 and −0.3777, respectively, indicating that the incidence of infectious diseases and the country's scientific and technological development during 2016-2018 have the opposite effect on the amount of financial investment in public health, which, to some extent, also reveals the shortcomings and inadequacies of the public health management system and the institutional mechanism for epidemic prevention and control. The increase in the number of incidences of general infectious diseases did not attract sufficient attention to the refinement of the CDC structural system, and the rising level of science and technology did not contribute to the problem of updating and replenishing the equipment of specialized public health institutions, leading to the sudden outbreak of the COVID-19 resulting in 2019 a shortage of medical resources and medical personnel and insufficient reserves of premises and materials [25]. Therefore, in the post-epidemic era, it is more important for the government to learn from previous experiences and lessons, establish a reserve mechanism of materials for public health emergencies, build a modern epidemic prevention material reserve system [26], increase the construction of professional public health institutions and the procurement of professional equipment updates, and improve the ability to face public health emergencies in order to quickly and effectively control sudden major infectious epidemics.

Variables
Mean

Imputation of Missing Predictor Values
Using the CAR spatial model-based approach to impute the SLM latent model with the missing data in X 1 , we obtain the prediction information of the missing data and comparing the imputed predicted value with the true value to determine the imputation accuracy.
Based on the prediction results of the missing data in Table 4, it can be seen that the predicted means of the missing values are very close to the true values, and the accuracy rates are basically above 90%, indicating that the imputation of the missing covariates using the CAR space model-based approach with the help of MCMCINLA works well. Meanwhile, observing the data information in the table, we can find that the number of HFRS cases in different regions varies significantly, with a certain spatial heterogeneity and a spatial distribution trend of south-heavy and north-light. The areas with more incidences are Guangdong and Fujian, mostly in the mountainous areas in the south, which have richer vegetation and more precipitation; the areas with fewer incidences are Xinjiang, Tibet, Ningxia, and Gansu, which are mostly in the drier plains. To some extent, increased precipitation promotes the growth of vegetation and crops and provides a more suitable environment for rodents, such as rats, in mountainous areas, creating an increased risk of HFRS transmission [27]. Therefore, for HFRS, the relevant CDC departments should focus on the prevention and control of the epidemic in the southern region in the future to effectively control the further spread of the epidemic.

Comparison of Different Imputation Methods
With the development of statistical techniques, a series of model-based missing data processing methods, such as the maximum likelihood estimation, have received increasing attention from academics [28]. They mainly include EM and FIML, which have the advantages of convenient operation and more applicable models.
EM is an iterative algorithm that processes the missing data by calculating the maximum likelihood, and its imputation of the missing values can be achieved by continuous iteration once the initial values of the estimated parameters are given [29]. The principle of FIML is to model the available data using a "one-step" operation and estimate the parameters using a likelihood function; thus, the missing data imputation and parameter estimation processes are implemented simultaneously [30]. The missing data imputation process of the EM and FIML methods is done with the help of the TestDataImputation package and GDINA packages, respectively. The EM and FIML are imputed separately for the missing infectious disease case data and the results are obtained as shown in Tables 5 and 6. The comparison shows that all three imputation methods can obtain more accurate estimates, and the results obtained by the different methods do not differ much. Relatively speaking, the imputation accuracy of MCMCINLA is slightly better than EM, and the FIML method ranks last; however, in terms of computational speed, both MCMCINLA and EM need to perform imputation before completing the estimation, and the estimation of ρ when MCMCINLA needs to be performed with the help of MH sampling, which takes a long time, takes about 0.5 h, while the FIML method can obtain both the imputed and estimated values at the same time, which is more efficient. In addition to using the accuracy to evaluate the imputation effect of each missing datum from the individual point of view, we can also select three evaluation indicators: MSE, mean absolute percentage error (MAPE) and Pearson correlation coefficient (r) to compare the imputation performance of the three methods from the global point of view [31]. The results are shown in Table 7. MSE can be used to measure the absolute deviation between the imputed value of the missing data and the true value, MAPE can be used to measure the relative error between the imputed value and the true value, both of which are as small as possible, and r can be used to illustrate the correlation between the imputed value and the true value, and the larger the r value, the better the fitting effect. From the results in Table 7, it can be found that the MSE and MAPE values of the three algorithms of MCMCINLA, EM, and FIML are all small, and the r values are all large, indicating that the three algorithms can obtain ideal imputation effects for missing data, and in comparison, the imputation performance of the MCMCINLA proposed in this paper is more prominent.

Conclusions and Discussion
Medical data are often missing during collection and transmission due to clinical trials or equipment failures. In this paper, we investigate the problem of parameter estimation for SLM when the covariates contain random missing data, and we propose a new imputation method that uses MCMCINLA to get not only accurate parameter estimates for the model, but also good imputation results for the missing data. Taking the economic and HFRS disease data of mainland China from 2016-2018 as an example for empirical analysis, the study found that the HFRS epidemic in China had obvious spatial heterogeneity and a south-heavy and north-light distribution trend. Before the outbreak of COVID-19 in 2019, China's public health management system had certain problems, and the state's financial investment in public health did not receive certain attention. Compared with EM and FIML, the predicted values of the missing data obtained using our method are closer to the true values. Therefore, in the future, the relevant CDC departments should focus their attention on the south or areas with a high incidence of epidemics in wetter climatic conditions and do a good job of the research and diagnosis of HFRS epidemics. In the post-epidemic era, the government should play a leading role, actively learn from previous experiences and lessons, establish a mechanism for stockpiling materials for public health emergencies, build a modern epidemic prevention material reserve system, and increase the construction of professional public health institutions and the procurement of professional equipment updates in order to quickly and effectively control sudden major infectious epidemics.
Since this paper mainly focuses on model imputation, fitting, and estimation with random missing data, the other two mechanisms are not explored in depth. It can continue to be extended in the future to study how to use the method to deal with different imputation models and missing mechanisms.
One of the fundamental aspects of imputation, in addition to the missing pattern (MAR, MCAR, and NMAR), is the percentage of missing data. It would be more intriguing to study how the new algorithm is affected by the percentage of missing data and compare its performance with other algorithms. In this paper, we take the 15% missing data percentage as an example to conduct intensive research. In the future, different levels of missing rates (such as 5%, 10%, 15%, 30%, and 50%) can be set to discuss the performance of the algorithm more comprehensively.
Finally, since in this paper we only design a set of simulation experiments to test the algorithm, objectively speaking, the effectiveness of the algorithm is questionable when it is extended to other situations (for example, if X takes different probability distributions, ρ Lag takes different spatial autocorrelation degrees). Therefore, the test effect of the imputation algorithm in other cases can be further explored in the future.