1. Introduction
Spatial analysis is an analytic approach that accounts for the defining feature of spatial data, namely the presence of spatial effects. Spatial data can be fitted with various modeling methods, but if the spatial effects are ignored, the estimated model will be imprecise. Data that exhibit neighborhood effects should therefore be analyzed spatially, so that the influence of one region on other regions can be shown. For example, observed disease or virus data should be treated as spatial data, because diseases and viruses spread easily, especially between areas that are close together. Approaches to spatial effects fall into two categories: the geostatistical approach and the lattice approach. In this paper, the lattice approach is used, which means that the spatial influence is determined by neighboring regions [1]. The spatial lattice effect enters the model as a random effects component that captures variation left unexplained by the model without random effects [2]. If spatial effects are not included in the modeling, the results still contain spatial autocorrelation, so the magnitude of the risk effect between neighboring regions is unknown [2]. On the basis of the research conducted by Rantini et al. [3], the survival model coupled with the normal conditional autoregressive (CAR) spatial effect is considered more representative for modeling correlated spatial data than the survival model without spatial random effects. The spatial dependence of random effects between regions that are close together is expressed through the CAR prior [2,4].
To date, many studies have analyzed spatial data. Spatial effects on survival data were studied by Banerjee et al. [1], who modeled infant mortality data in Minnesota using the Weibull distribution with the normal CAR spatial effect. Darmofal [2] conducted a similar study, applying the Weibull distribution with the normal CAR spatial effect to political cases. The same combination was used by Iriawan et al. [5] to analyze dengue hemorrhagic fever (DHF). Aswi et al. [6] also analyzed DHF data with a spatial CAR model; in their research, the Weibull survival model was given three different priors: the Leroux prior, the intrinsic conditionally autoregressive (ICAR) or normal CAR prior, and an independent prior. A study that used a non-normal CAR model was presented by Motarjem et al. [7]. They modeled asthma data in Tehran using the Weibull distribution coupled with the normal CAR spatial effect and compared the result with a non-normal CAR model, namely the double-exponential (DE) CAR model. The normal CAR and DE CAR models have also been compared by Rantini and colleagues [8,9], who modeled DHF data in the Pamekasan district, East Java, Indonesia, and in eastern Surabaya, East Java, Indonesia, respectively.
Several methods are available for estimating model parameters. From the Bayesian perspective, one of them is Hamiltonian Monte Carlo (HMC), a Markov chain Monte Carlo (MCMC) algorithm like Gibbs sampling and the Metropolis–Hastings algorithm [10,11,12]. HMC provides a powerful MCMC sampling algorithm [10,13] and has several advantages. Firstly, HMC can converge to the posterior probability density with fewer MCMC samples than the derivative-free Metropolis–Hastings algorithm requires [14]. This is because HMC uses numerical integration schemes that evolve the entire system in parallel through small steps in order to avoid non-locality problems [11]. Secondly, HMC provides an ergodic Markov chain with a high acceptance probability, even for large transitions [15,16,17]. Thirdly, HMC explores the state space more efficiently than standard random-walk proposals, owing to the high acceptance probability of its proposal mechanism within the Metropolis–Hastings framework [13]. Fourthly, HMC performs better than the Metropolis or Gibbs samplers, because it can propose transitions in the Markov chain that lie far apart in the sampling space [12].
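The mechanism described above (momentum resampling, small integration steps, and a Metropolis–Hastings accept step) can be illustrated with a minimal sketch. This is not the sampler used in this study; Stan applies an adaptive variant of HMC, whereas the code below uses a fixed step size and a fixed number of leapfrog steps. The function name `hmc_sample` and all of its parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def hmc_sample(log_prob, grad_log_prob, x0, n_samples,
               step_size=0.2, n_leapfrog=10, seed=0):
    """Basic one-dimensional HMC with a fixed step size (illustrative only)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    accepted = 0
    for _ in range(n_samples):
        p0 = rng.gauss(0.0, 1.0)  # resample the auxiliary momentum
        x_new, p = x, p0
        # Leapfrog integration: half momentum step, alternating full steps,
        # and a final half momentum step.
        p += 0.5 * step_size * grad_log_prob(x_new)
        for step in range(n_leapfrog):
            x_new += step_size * p
            if step < n_leapfrog - 1:
                p += step_size * grad_log_prob(x_new)
        p += 0.5 * step_size * grad_log_prob(x_new)
        # Hamiltonian = potential energy (-log density) + kinetic energy.
        h_old = -log_prob(x) + 0.5 * p0 * p0
        h_new = -log_prob(x_new) + 0.5 * p * p
        # Metropolis-Hastings correction for the integration error.
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
            accepted += 1
        samples.append(x)
    return samples, accepted / n_samples


if __name__ == "__main__":
    # Target: standard normal, log density -x^2/2 up to a constant.
    draws, rate = hmc_sample(lambda x: -0.5 * x * x, lambda x: -x, 0.0, 2000)
    print("acceptance rate:", rate)
```

Because the leapfrog integrator nearly conserves the Hamiltonian, distant proposals are still accepted with high probability, which is the behavior the advantages listed above refer to.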
Meanwhile, only spatial models with a symmetrical CAR prior, i.e., the normal CAR and double-exponential (DE) CAR, are provided in software such as Bayesian inference using Gibbs sampling (BUGS) [18]. In BUGS, “car.normal” and “car.l1” are the functions for the normal CAR and DE CAR models, respectively [18]. Relaxing the symmetry of the CAR model therefore requires researchers to write short subprograms as additional user-defined utilities. Stan is an open-source probabilistic programming language designed for Bayesian analysis; it uses HMC and can help to realize this relaxation of the symmetrical CAR model. Stan rivals BUGS and the just another Gibbs sampler (JAGS) software, which use other MCMC samplers [19], and HMC reaches inference more efficiently than the MCMC used in BUGS and JAGS [19]. Stan lets users create additional user-defined distributions written in Stan code [20]. This presents a great opportunity for researchers to be more creative in their data-driven analysis. Adding a user-defined distribution or model is easier in Stan than in BUGS or JAGS. Wetzels et al. [21] found that adding a new distribution to BUGS involves quite complex steps and requires another program, namely the BlackBox Component Builder. As with BUGS, adding a new distribution in JAGS is also complicated, as can be seen in the research conducted by Wabersich et al. [22]. Stan has its own programming language for defining and adding statistical models [20], and it can be used through interfaces in several software environments. These include RStan (R), PyStan (Python), CmdStan (shell, command-line terminal), CmdStanR (R, lightweight wrapper for CmdStan), CmdStanPy (Python, lightweight wrapper for CmdStan), MatlabStan (MATLAB), Stan.jl (Julia), StataStan (Stata), MathematicaStan (Mathematica), and ScalaStan (Scala) (see the interfaces at https://mc-stan.org/users/interfaces/). Among these interfaces, RStan and PyStan are the most widely used [20]. Because Stan has interfaces in several software tools, researchers can carry the proposed method over to other environments. In this study, we used RStan.
Considering that many studies have used the normal CAR model, Stan gives researchers a new opportunity to conduct data-driven analysis by modeling spatial effects with a non-normal CAR model. Starting from the DE CAR model of Motarjem et al. [7], researchers can develop other, more flexible non-normal CAR models. Rantini et al. [8] showed that the DE CAR model is no better than the normal CAR model. This provided evidence that the DE CAR model is not always more robust than the normal CAR model, in contrast to the robustness stated by Motarjem et al. [7]. This is one of the reasons for proposing the Fernandez–Steel skew normal (FSSN) CAR model. Skewness is one of the features of skew-symmetric distributions [23]. Whereas the normal and DE distributions can only capture data with a symmetrical pattern, the FSSN distribution is flexible enough to describe both symmetrical and asymmetrical data [24]. Castillo et al. [24] modeled volcanic data whose histograms were not quite symmetrical and found the FSSN distribution more favorable than the normal distribution. This supports our proposal, since the distribution of spatial effects is not always symmetrical. One advantage of the FSSN distribution over the skew-normal (SN) distribution of Azzalini [25] is that its Fisher information matrix does not have the singularity problem that occurs for the Fisher information matrix of the SN distribution [24].
In real-world problems, the available data are not always large enough, which becomes a challenge because it can affect the variation of the model parameters [26]. To address this problem, a sufficiently robust simulation is suggested, especially through the Bayesian approach [27], which is claimed to be suitable for data-driven applications [28]. In statistical theory, frequentist approaches are better only for modeling large amounts of data, whereas Bayesian approaches can be used for both large and limited data sizes [26]. To do so, Bayesian analysis emphasizes a careful choice of priors and simulation from the model, and Monte Carlo simulation is effective in dealing with this problem [28,29]. Regarding dependence modeling for small data, Zhang and Shields [30] demonstrated that applying the Gaussian dependence assumption yields biased estimates when the dependence structure deviates from this assumption; they employed a copula dependence model to solve this problem. To handle small data in this study, we propose a simulation with HMC applied to a non-normal distribution, namely the FSSN distribution, which is then applied to data with spatial dependence at the end of the study.
The aim of this study is to present new user-defined implementations of the FSSN distribution and the FSSN CAR model in Stan and to demonstrate their flexibility in explaining the distribution of spatial effects adaptively. The FSSN CAR model can adapt to both symmetrical and asymmetrical spatial data patterns. A more mathematical and in-depth explanation of the FSSN distribution can be found in Fernandez and Steel [31].
This paper is organized as follows. Section 2 introduces the CAR model in general and its mathematical explanation. Section 3 describes the intrinsic conditional autoregressive (ICAR) or normal CAR model. The Stan code for the normal CAR model can be seen in Morris [32]. Further explanation of the normal CAR model, derived from the mathematical calculation and applied to the Stan code, is given in Appendix A. Section 4 describes the FSSN distribution and the FSSN CAR model and demonstrates the Stan code according to its mathematical description. Section 5 provides several scenarios for simulation studies on univariate and multivariate distributions. Section 6 contains the application and comparison of the normal CAR and FSSN CAR models using the Scotland lip cancer dataset and the lung cancer dataset from the London Health Authority. The conclusions are given in Section 7.
2. Conditional Autoregressive (CAR) Model
Area data represent objects defined in terms of geometric features, such as points, lines, polygons, regions, and volumes. The regions are partitioned into a limited number of subregions with clear boundaries. Area data, consisting of a single aggregate measure per unit area, can have binary, count, or continuous values. These values can be modeled using the CAR model, which accounts for the proximity between neighboring areas. According to Besag [33], area data with a spatial structure show that neighboring regions have a higher correlation than regions that are far away from each other. Area data differ from point data, which consist of measurements from known geospatial points. Whereas the relationship between regions is expressed in terms of proximity, the relationship between two data points is expressed in units of distance.
Spatial interactions between a pair of areas $i$ and $j$ in the given set of observations taken at $n$ different areas of a region can be modeled conditionally as a spatial random variable $\boldsymbol{\phi}$, which is an $n$-length vector $\boldsymbol{\phi} = (\phi_1, \ldots, \phi_n)^{T}$. In CAR models, the spatial relationship between the $n$ areas is represented as an adjacency matrix $W$ with dimensions $n \times n$. The entries $w_{ij}$ and $w_{ji}$ are positive when the areas $i$ and $j$ are neighbors, and zero otherwise. The neighbor relationship, written as $i \sim j$, is defined in terms of this matrix; i.e., the neighbors of the area $i$ are those areas that have non-zero entries in row or column $i$. The conditional distribution for each $\phi_i$ is specified in terms of a mean and a precision parameter $\tau_i$, and can be written as in Equation (1) [33], where the parameter $\alpha$ controls the strength of the spatial association (when $\alpha = 0$, this corresponds to spatial independence) and $n$ is the number of areas in a region.
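A common way to write such a conditional specification is sketched below. It is a standard textbook form rather than a transcription of Equation (1); the weights $b_{ij}$ and the reading of $\tau_i$ as a per-area precision are assumptions introduced here for illustration.

```latex
\phi_i \mid \phi_j,\ j \neq i \;\sim\;
N\!\left( \alpha \sum_{j=1}^{n} b_{ij}\, \phi_j,\ \tau_i^{-1} \right),
\qquad b_{ii} = 0 .
```

With $\alpha = 0$ the conditional mean no longer depends on the neighboring values, which is the spatial independence case.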
The corresponding joint distribution can be uniquely determined from the set of full conditional distributions by introducing a fixed point from the support of $\boldsymbol{\phi}$. The random vector $\boldsymbol{\phi}$ has a multivariate normal distribution whose precision is formed from two matrices that describe the neighborhood of the $n$ areas, i.e., the diagonal matrix $d$ and the adjacency matrix $W$, as written in Equation (2) [34], where $N_n$ denotes the $n$-dimensional normal distribution; $\alpha$ is between 0 and 1; $d$ is an $n \times n$ diagonal matrix, where each diagonal entry $d_{ii}$ contains the number of neighbors of the area $i$ and all off-diagonal entries are zero; $W$ is the adjacency matrix, where the entry is $w_{ij} = 1$ if the areas $i$ and $j$ are neighbors and $w_{ij} = 0$ otherwise, and all diagonal entries $w_{ii}$ are zero; and $I$ is an $n \times n$ identity matrix.
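To make the roles of the neighbor counts and the adjacency matrix concrete, the following sketch builds a CAR precision matrix for a small graph. It assumes the widely used form $\tau(d - \alpha W)$ with a single scalar precision $\tau$, which may differ in notation from Equation (2). The helper names `car_precision` and `conditional_mean` are hypothetical. The second helper uses the general fact that, for a zero-mean Gaussian vector with precision matrix $Q$, the conditional mean of component $i$ is $-\sum_{j \neq i} Q_{ij}\phi_j / Q_{ii}$.

```python
def car_precision(W, alpha, tau=1.0):
    """Return neighbor counts d_ii and the assumed precision tau * (d - alpha * W)."""
    n = len(W)
    d = [sum(row) for row in W]  # number of neighbors of each area
    Q = [[tau * ((d[i] if i == j else 0.0) - alpha * W[i][j])
          for j in range(n)] for i in range(n)]
    return d, Q


def conditional_mean(Q, phi, i):
    """Conditional mean of phi[i] given the rest, for a zero-mean Gaussian."""
    off_diagonal = sum(Q[i][j] * phi[j] for j in range(len(phi)) if j != i)
    return -off_diagonal / Q[i][i]


if __name__ == "__main__":
    # Four areas on a line: 1-2, 2-3, 3-4 are neighbors.
    W = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
    d, Q = car_precision(W, alpha=0.9)
    print(d)
    print(conditional_mean(Q, [1.0, 2.0, 3.0, 4.0], 1))
```

Under this assumed form, the conditional mean of an area is $\alpha$ times the average of its neighbors, and it is zero for every area when $\alpha = 0$.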
4. Fernandez–Steel Skew Normal Conditionally Autoregressive (FSSN CAR) Model
Let $X \sim \mathrm{FSSN}(\mu, \sigma, \delta)$ with location, scale, and skewness parameters $\mu$, $\sigma$, and $\delta$, respectively. The probability density function (p.d.f.) is given in Equation (6), where $\phi(\cdot)$ and $\Phi(\cdot)$ denote the standard normal p.d.f. and cumulative distribution function (CDF), respectively [24]. The mean and variance of this distribution are given in [24], which yields the relationship between $\mu$ and the mean, as well as between $\sigma$ and the variance. Equation (6) can be written equivalently as Equation (7).
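For reference, the density implied by the log-density code in Listing 1 can be written in the piecewise form below. This is a reconstruction under the assumption that Listing 1 implements Equation (7) exactly; the symbol $f$ is introduced here for illustration, and the moment expressions are the standard results for the Fernandez–Steel construction applied to the normal density, not copies of the original equations.

```latex
\begin{aligned}
f(x \mid \mu, \sigma, \delta) &= \frac{2\delta}{1+\delta^{2}} \cdot \frac{1}{\sigma}
\begin{cases}
\phi\!\left( \dfrac{\delta\,(x-\mu)}{\sigma} \right), & x < \mu, \\[2ex]
\phi\!\left( \dfrac{x-\mu}{\delta\,\sigma} \right), & x \ge \mu,
\end{cases}
\qquad \sigma > 0,\ \delta > 0, \\[1ex]
E(X) &= \mu + \sigma \sqrt{\frac{2}{\pi}} \left( \delta - \frac{1}{\delta} \right), \\[1ex]
\mathrm{Var}(X) &= \sigma^{2} \left[ \delta^{2} - 1 + \frac{1}{\delta^{2}}
 - \frac{2}{\pi} \left( \delta - \frac{1}{\delta} \right)^{2} \right].
\end{aligned}
```

Setting $\delta = 1$ recovers the normal distribution with mean $\mu$ and variance $\sigma^{2}$, while $\delta > 1$ shifts mass to the right of $\mu$ and $\delta < 1$ to the left.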
The construction of multivariate skewed distributions is based on linear transformations of univariate skewed distributions [35]. Let $n$ be the dimension of the spatial random variable $\boldsymbol{\phi}$, with mean vector $\boldsymbol{\mu}$, $n \times n$ positive definite variance-covariance matrix $\Sigma$, and vector of skewness parameters $\boldsymbol{\delta}$; the multivariate FSSN p.d.f. can then be written as in Equation (8) [35], where each univariate factor is as in Equation (7). The p.d.f. of the multivariate FSSN can therefore be written as in Equation (9). The individual spatial random variable $\phi_i$, for $i = 1, \ldots, n$, has an FSSN distribution with a mean and variance equal to those of the normal distribution in Equation (3) and a skewness parameter $\delta_i$. The joint distribution can be simplified as in Equation (10), from which the p.d.f. in Equation (11) is obtained. Since Equation (11) contains a constant factor, it can be rewritten in proportional form as Equation (12), with explanations analogous to Appendix A. In logarithmic form, the pairwise difference formulation can be expressed as in Equation (13), where $i \sim j$ means that area $i$ neighbors area $j$.
4.1. Adding FSSN Distribution in Stan
As stated in the Introduction, adding a new distribution to BUGS is complicated and requires another specialized program, namely the BlackBox Component Builder [21]. Adding a new distribution in JAGS likewise involves difficult steps, as stated by Wabersich et al. [22]. In contrast, Stan makes it easy for users to add new distributions: once the mathematical form of the distribution is known, the Stan code can be written by following the steps of the mathematical calculation. This convenience helps researchers to add new distributions, such as the CAR model. The addition of custom distributions in Stan has already been explained by Annis et al. [20]. Stan was therefore chosen as the implementation tool for spatial modeling with the adaptive FSSN distribution. Based on Equation (7), the user-defined Stan code for the FSSN distribution is given in Listing 1.
Listing 1. A user-defined Stan code of the Fernandez–Steel skew normal (FSSN) distribution.
functions{
  // Log density of the FSSN distribution with location mu, scale sigma, and skewness delta
  real FSSN_lpdf(real x, real mu, real sigma, real delta){
    real logpdf;
    real z;
    real delta2;
    delta2=delta*delta;
    if(sigma<=0)
      reject("sigma <= 0; found sigma =", sigma);
    if(delta<=0)
      reject("delta<=0; found delta =", delta);
    z=(x-mu); // deviation from the location parameter
    if(x<mu){
      // left of mu: the deviation is multiplied by delta
      logpdf=normal_lpdf(delta*z|0,sigma)+log(2)+log(delta)-log1p(delta2);
    }
    else{
      // at or right of mu: the deviation is divided by delta
      logpdf=normal_lpdf(z/delta|0,sigma)+log(2)+log(delta)-log1p(delta2);
    }
    return logpdf;
  }
}
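A quick numerical check can help to confirm that a user-defined density is properly normalized before it is used in a model. The sketch below mirrors the arithmetic of Listing 1 in plain Python and verifies the total probability and the mean by midpoint-rule integration. The function names and the parameter values are illustrative assumptions, and the mean formula is the standard Fernandez–Steel result rather than an expression taken from this paper.

```python
import math


def normal_logpdf(y, mu, sigma):
    """Log density of N(mu, sigma^2), matching Stan's normal_lpdf."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)


def fssn_logpdf(x, mu, sigma, delta):
    """Python mirror of FSSN_lpdf in Listing 1."""
    if sigma <= 0 or delta <= 0:
        raise ValueError("sigma and delta must be positive")
    z = x - mu
    log_const = math.log(2.0) + math.log(delta) - math.log1p(delta * delta)
    if x < mu:
        return normal_logpdf(delta * z, 0.0, sigma) + log_const
    return normal_logpdf(z / delta, 0.0, sigma) + log_const


def fssn_pdf(x, mu, sigma, delta):
    return math.exp(fssn_logpdf(x, mu, sigma, delta))


def fssn_mean(mu, sigma, delta):
    """Standard Fernandez-Steel mean for a normal base density."""
    return mu + sigma * math.sqrt(2.0 / math.pi) * (delta - 1.0 / delta)


def integrate(f, lo, hi, n=20000):
    """Midpoint-rule integration of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (k + 0.5) * h) for k in range(n))


if __name__ == "__main__":
    mu, sigma, delta = 1.0, 2.0, 1.5  # illustrative values only
    print(integrate(lambda x: fssn_pdf(x, mu, sigma, delta), mu - 40, mu + 40))
    print(integrate(lambda x: x * fssn_pdf(x, mu, sigma, delta), mu - 40, mu + 40))
    print(fssn_mean(mu, sigma, delta))
```

With `delta` equal to 1, `fssn_logpdf` reduces to the ordinary normal log density, which gives a second simple sanity check.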
4.2. Adding the FSSN CAR Model in Stan
The implementation of the normal CAR or ICAR model in Stan can be seen in Morris et al. [32]. In this study, we created user-defined Stan code for the proposed FSSN CAR model. Based on Equation (13), the Stan code for the FSSN CAR model is given in Listing 2.
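Listing 2. A user-defined Stan code for the FSSN conditional autoregressive (CAR) model.
// Place inside the functions block, after FSSN_lpdf from Listing 1.
// Stan 2.33 and later require the newer array syntax: array[] int node1, array[] int node2.
real car_FSSN_lpdf(vector phi, int N, int[] node1, int[] node2, vector delta){
  vector[N] logpdf;
  vector[N] delta2;
  vector[N] phi2;
  for(i in 1:N){
    delta2[i]=delta[i]*delta[i];
    // squared difference across the i-th (node1, node2) pair;
    // the loop runs to N, so node1 and node2 need at least N entries
    phi2[i]=(phi[node1][i]-phi[node2][i])*(phi[node1][i]-phi[node2][i]);
    if(delta[i]<=0)
      reject("delta[i]<=0; found delta[i] =", delta[i]);
  }
  for(i in 1:N){
    // skewed pairwise-difference term plus an FSSN term on sum(phi),
    // which softly centers phi at zero (added in every term)
    if(phi[i]<0)
      logpdf[i]=log(2)+log(delta[i])-log1p(delta2[i])-0.5*delta2[i]*phi2[i]
        +FSSN_lpdf(sum(phi)|0,0.001*N,mean(delta));
    else
      logpdf[i]=log(2)+log(delta[i])-log1p(delta2[i])-0.5*(1/delta2[i])*phi2[i]
        +FSSN_lpdf(sum(phi)|0,0.001*N,mean(delta));
  }
  return sum(logpdf);
}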
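Listing 2 receives the neighborhood structure as two integer arrays, `node1` and `node2`, but their construction is not shown here. The sketch below illustrates one plausible way to derive them from an adjacency matrix, assuming each undirected neighbor pair is listed once with `node1 < node2` and with 1-based indices as used in Stan. The helper names `edge_list` and `pairwise_sq_diffs` are hypothetical.

```python
def edge_list(W):
    """Return 1-based (node1, node2) arrays, one entry per undirected edge."""
    node1, node2 = [], []
    n = len(W)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, so node1 < node2
            if W[i][j] != 0:
                node1.append(i + 1)
                node2.append(j + 1)
    return node1, node2


def pairwise_sq_diffs(phi, node1, node2):
    """Squared differences (phi[node1] - phi[node2])^2 for every edge."""
    return [(phi[a - 1] - phi[b - 1]) ** 2 for a, b in zip(node1, node2)]


if __name__ == "__main__":
    # Four areas on a line: 1-2, 2-3, 3-4 are neighbors.
    W = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
    node1, node2 = edge_list(W)
    print(node1, node2)
    print(pairwise_sq_diffs([0.5, -1.0, 2.0, 0.0], node1, node2))
```

Note that Listing 2 indexes the first `N` entries of these arrays, so a map with fewer neighbor pairs than areas, such as this four-area example, would need the loop bounds to be adapted.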
7. Conclusions
An FSSN CAR model for analyzing spatial data is proposed in this paper. This approach was developed on the basis of the normal CAR model, to which a skewness parameter was added to capture spatial data with an asymmetrical pattern. The FSSN CAR model has demonstrated its capability to detect symmetric and asymmetric data patterns. Moreover, this model allows for the use of light- or heavy-tailed data. In real life, truly symmetrical data are rarely found. Thus, the FSSN distribution is a strong candidate for analyzing data that have an almost symmetrical pattern. With its flexibility, the FSSN CAR model is more representative for modeling spatial random effects than the normal CAR model.
This paper provides a simulation of data with 24 scenarios. The first to eighth scenarios used simulated symmetrical data that were normally distributed with different variances and sample sizes. The ninth to sixteenth scenarios used simulated symmetrical data that were plausibly leptokurtic, namely double-exponentially distributed with different dispersions and sample sizes. The seventeenth to twenty-fourth scenarios used simulated symmetrical data that were Student-t distributed with different degrees of freedom. On the basis of the analysis, the FSSN distribution was able to detect the data pattern in all 24 scenarios. Each scenario was a simulation study carried out with 500 replications, and the estimation results for each replication formed a 95% HPD interval for each parameter. The HPD interval for the skewness parameter of the FSSN distribution was close to 1 in all 24 scenarios, indicating that the generated data were symmetrical. This was consistent with the data patterns of the generating distributions, namely the normal, double-exponential, and Student-t distributions. Thus, through this estimated skewness parameter, it can be concluded that the FSSN can estimate data generated from a normal, double-exponential, or Student-t distribution. Moreover, the HPD interval covered the targeted parameter values in all scenarios. At each replication, we recorded the posterior estimate of each parameter, so that across replications we obtained as many posterior values per parameter as there were replications. From these posterior values, we obtained the bias, RMSE, and CP values. The bias values were close to zero for the estimated parameters in the 24 scenarios, in particular for the fit of the FSSN distribution to the generated normal, double-exponential, and Student-t data.
For the 24 scenarios, the RMSE values for the estimated parameters of the FSSN distribution were close to zero. The CP values of the estimated parameters of the FSSN distribution were greater than or equal to those of the normal distribution. On the basis of the HPD intervals, biases, RMSE, and CP for the estimated parameters of the FSSN distribution, we conclude that the FSSN distribution is able to estimate and capture the characteristics of data that are normally, double-exponentially, or Student-t distributed.
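The three summary measures can be computed from the replication results as sketched below. The definitions are the conventional ones and are assumed here, since the formulas are not restated in this section: bias is the mean estimate minus the true value, RMSE is the root of the mean squared deviation from the true value, and CP is the proportion of replication intervals (for example, 95% HPD intervals) that contain the true value. The function name `summarize_replications` and the example numbers are hypothetical.

```python
import math


def summarize_replications(estimates, intervals, true_value):
    """Bias, RMSE, and coverage probability over simulation replications."""
    r = len(estimates)
    bias = sum(estimates) / r - true_value
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / r)
    covered = sum(1 for lo, hi in intervals if lo <= true_value <= hi)
    return bias, rmse, covered / r


if __name__ == "__main__":
    # Made-up estimates and intervals for a skewness parameter with true value 1.
    estimates = [0.9, 1.1, 1.2, 0.8]
    intervals = [(0.5, 1.3), (0.7, 1.5), (1.05, 1.4), (0.4, 1.2)]
    print(summarize_replications(estimates, intervals, 1.0))
```

In a study of the kind described above, `estimates` and `intervals` would each hold one entry per replication.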
In addition, we presented 8 scenarios for the multivariate case. To measure the quality of the estimation results in each scenario, the Euclidean distance was used. On the basis of the simulation results, the Euclidean distance for the multivariate FSSN distribution was smaller than for the multivariate normal distribution. The results obtained in the univariate simulation with 24 scenarios and the multivariate simulation with 8 scenarios show that the FSSN distribution is able to estimate the generated data in these scenarios. This result motivated modeling the spatial effect with the normal CAR model and, as an alternative, the FSSN CAR model.
The FSSN CAR model was also applied to the Scotland lip cancer dataset and the lung cancer dataset from the London Health Authority. In this application, the FSSN CAR model was compared against the normal CAR model. To compare the two models for these datasets, we first used a visual comparison, namely a plot of the original data against the estimated models. Visually, the FSSN CAR model was closer to the original data than the normal CAR model. Then, on the basis of the WAIC and LOO values, the Poisson regression model with the FSSN CAR model was also found to be better than that with the normal CAR model. Both datasets showed the ability of the FSSN CAR model to explain left- and right-skewed patterns. In contrast, the normal CAR model was only able to accommodate symmetrical patterns and short tails.
We believe that, when data have a spatial effect, the FSSN CAR model should be recommended over the normal CAR model. This is because it can capture various data patterns, which overcomes the limitation of the normal CAR model to symmetrical data with short tails. For the parameter estimation, however, the normal CAR and FSSN CAR models give almost the same results.