
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license.

This paper describes the basic functioning and implementation of a computer-aided Bayesian Network (BN) method that is able to incorporate experts’ knowledge for the benefit of remote sensing applications and other raster data analyses: Bayesian Networks for Raster Data (BayNeRD). Using a case study of soybean mapping in Mato Grosso State, Brazil, BayNeRD was tested to evaluate its capability to support the understanding of a complex phenomenon through plausible reasoning based on data observation. Observations of Crop Enhancement Index (CEI) values for the current and previous crop years, soil type, terrain slope, and distance to the nearest road and water body were used to calculate the probability of soybean presence for the entire Mato Grosso State, with results that agreed closely with the official data. CEI values were the most influential variables in the calculated probability of soybean presence, underscoring the potential of remote sensing as a source of data. Moreover, the thematic map derived from the calculated probability values reached an overall accuracy of over 91%. BayNeRD allows the expert to model the relationship among several observed variables, outputs variable importance information, handles incomplete and disparate forms of data, and offers a basis for plausible reasoning from observations. The BayNeRD algorithm has been implemented in R software and can be found on the internet.

Understanding complex phenomena in the field of Earth observation sciences represents a considerable challenge for scientific analysis [

Interactions of probabilities have been identified as the most promising way for a computer to effect plausible reasoning [

Neapolitan [

Although researchers have made substantial advances in developing the theory and application of BNs [

The aim of this paper is to describe, implement and test a computer-aided BN method that is able to incorporate experts’ knowledge for the benefit of remote sensing applications and other raster data analyses. The freely available algorithm is named Bayesian Networks for Raster Data (BayNeRD). Following development of the approach, BayNeRD was tested on a case study of soybean identification and mapping in Mato Grosso State, Brazil. The test enabled evaluation of the capability of BayNeRD to support the understanding of a complex phenomenon through plausible reasoning based on data observation.

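A BN for a set of $n$ variables $X = \{X_1, \ldots, X_n\}$ consists of a directed acyclic graph (DAG), whose nodes correspond to the variables, and a conditional probability distribution for each variable given its parents in the graph. We use upper-case letters (e.g., $X_1$, $X_n$) to denote both a variable and its corresponding node, and the same but lower-case letters (e.g., $x_1$, $x_n$) to denote the state or value (defining a particular instantiation) of the variable. Then, the joint probability distribution for any particular instantiation of all $n$ variables is given by

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid pa_i),$$

where $x_i$ represents the instantiation of variable $X_i$ and $pa_i$ represents the instantiation of the parents of $X_i$, with $Pa_i$ denoting the set of parents of $X_i$.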
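The joint probability of a BN is the product of one conditional term per node given its parents, which can be illustrated with a minimal Python sketch. The three-node network, the variable names (`soil`, `cei`, `soybean`) and every probability value below are hypothetical, chosen only for illustration; they are not taken from BayNeRD or from the case study.

```python
from itertools import product

# Hypothetical network: soil -> soybean <- cei (illustrative values only).
PARENTS = {"soil": (), "cei": (), "soybean": ("soil", "cei")}
STATES = {"soil": ("low", "high"), "cei": ("low", "high"), "soybean": ("no", "yes")}

# CPT[node][parent states][state] = P(state | parent states)
CPT = {
    "soil": {(): {"low": 0.6, "high": 0.4}},
    "cei": {(): {"low": 0.7, "high": 0.3}},
    "soybean": {
        ("low", "low"): {"yes": 0.01, "no": 0.99},
        ("low", "high"): {"yes": 0.40, "no": 0.60},
        ("high", "low"): {"yes": 0.05, "no": 0.95},
        ("high", "high"): {"yes": 0.80, "no": 0.20},
    },
}


def joint_probability(instantiation):
    """Return P(x_1, ..., x_n) as the product of P(x_i | pa_i) over all nodes."""
    prob = 1.0
    for node, node_parents in PARENTS.items():
        parent_states = tuple(instantiation[p] for p in node_parents)
        prob *= CPT[node][parent_states][instantiation[node]]
    return prob


def total_probability():
    """Sum the joint probability over every instantiation (should be 1)."""
    names = list(STATES)
    return sum(
        joint_probability(dict(zip(names, combo)))
        for combo in product(*(STATES[name] for name in names))
    )
```

For instance, `joint_probability({"soil": "high", "cei": "high", "soybean": "yes"})` multiplies the three terms 0.4, 0.3 and 0.8, and `total_probability()` checks that the factorized terms define a valid distribution.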

To illustrate the concept, suppose we are interested in inferring soybean occurrence based on observations of other variables. It is well known that soybean plantations have certain peculiarities [

Indeed,

The representation of conditional (in)dependencies is the essential function of BNs. For each node in a BN structure, there is a conditional-probability function that relates this node to its immediate parents. If a node has no parents (e.g.,

In practical terms, the definition of these probability functions is often the most complicated part of BN modeling. However, the empirical Bayesian approach suggests that the functions can be defined based on observations,

Aware of the great demand for implemented computer algorithms to help handle and understand phenomena in the field of Earth observation science, we implemented BayNeRD in R software [

R software was used to implement BayNeRD because it is a high-level language and environment for data analysis and graphics. It is growing in popularity and uptake, and is freely available for the research community [

The BayNeRD algorithm handles data in the GeoTIFF format, which has been widely used to represent raster data with geographical coordinates. For use in BayNeRD all raster data (

The variable that directly represents the phenomenon is called the target variable. A GeoTIFF with data representing the target variable as

The context variables are those that exhibit any kind of relationship with the target variable (such as

One of the main difficulties of using BNs for real problems is the definition of the probability functions of the model [

After the target variable has been entered as

To design the BN graphical model the user is asked about the (in)dependence relations among all variables read (

The discretization divides the range of the observed values for a variable into intervals and codes the values in the variable according to which interval they belong. In BayNeRD the discretization is based on choosing the number of intervals defined for each context variable and can be computed following three implemented criteria: (i) equidistant intervals, where each interval has the same width; (ii) quantiles, where each interval tends to have the same number of elements (

The discretization has an impact on the computed probability functions. These probabilities are computed through pixel counting according to both the (in)dependence relations defined in the BN graphical model and the intervals defined in the discretization process. Indeed, both the definition of the BN graphical model and the discretization process enable users to add their knowledge about the phenomenon to the model. The more accurate the data set, and the more skilled the user in defining both the BN graphical model and the interval limits during discretization, the more representative the computed data-based probability functions are of the real probability functions [
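As an illustration of the first two discretization criteria and of probability estimation by pixel counting, the following Python sketch may help. It is not the BayNeRD implementation (which is written in R); the function names are hypothetical, pixels are represented as flat lists, and only a single context variable with a binary target is considered.

```python
from bisect import bisect_right
from collections import Counter


def equidistant_limits(values, n_intervals):
    """Interior limits that split [min, max] into intervals of equal width."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_intervals
    return [lowest + k * width for k in range(1, n_intervals)]


def quantile_limits(values, n_intervals):
    """Interior limits so that each interval holds roughly the same number of values."""
    ordered = sorted(values)
    return [ordered[(k * len(ordered)) // n_intervals] for k in range(1, n_intervals)]


def discretize(values, limits):
    """Code each value by its interval index (intervals closed on the left, open on the right)."""
    return [bisect_right(limits, value) for value in values]


def conditional_probabilities(target, context_codes):
    """Estimate P(target = 1 | interval) by counting pixels in each interval."""
    totals = Counter(context_codes)
    hits = Counter(code for code, t in zip(context_codes, target) if t == 1)
    return {code: hits[code] / totals[code] for code in sorted(totals)}
```

With the interior limits `[0.05, 0.20, 0.26]`, a value of exactly 0.05 falls in the second interval, consistent with intervals that are closed on the left. An interval that contains no pixels simply has no entry in the result, which is one reason the choice of interval limits matters.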

Let us suppose that the

The user should be sufficiently expert to define suitable discrete intervals for each context variable so that all scenarios (

The PI consists of a raster data (

If any context variable presents missing data for any specific pixel in the study area, it is considered as “unobserved” in the model but

BayNeRD also allows the user to quantify the influence of each context variable on the probabilities computed for the target variable. This is done through the Kullback-Leibler (KL) divergence, which is a non-symmetric measure of the difference between two probability distributions [_{1}, _{2}, … and _{n}
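For reference, the KL divergence between two discrete distributions can be sketched in Python as follows. This only illustrates the measure itself, using the natural logarithm; which pair of distributions BayNeRD compares for each context variable, and the logarithm base it uses, are not assumed here, and the function name is hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of states")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # the term 0 * log(0 / q_i) is taken as 0
        if q_i == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total
```

The measure is zero when both distributions are identical, and swapping the arguments generally changes the value, which is what "non-symmetric" means here.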

The main result of BayNeRD is the PI and it can be used in several applications. For example, the PI can be used to generate a thematic map with classes target and non-target (e.g., soybean and non-soybean) just by slicing the PI using a limiting probability value named the Target Probability Value (TPV). Thus, by setting TPV at 50%, for instance, all pixels with values equal to or greater than 0.5 in the PI will be labeled as target and the remaining pixels (with values smaller than 0.5) will be labeled as non-target. However, what if the best TPV was 70% instead of 50%? What about even 80%?
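The slicing step can be sketched in a few lines of Python. The sketch assumes the PI is held as a list of rows of probabilities in [0, 1], with `None` marking masked-out pixels; this representation and the function name are hypothetical and chosen only for illustration.

```python
def slice_probability_image(pi, tpv):
    """Label each pixel 1 (target) if its probability is >= tpv, else 0.

    Masked pixels (None) are kept as None.
    """
    return [
        [None if p is None else int(p >= tpv) for p in row]
        for row in pi
    ]


# Hypothetical 2 x 2 probability image with one masked pixel.
example_pi = [[0.1, 0.5], [0.7, None]]
map_50 = slice_probability_image(example_pi, 0.5)
map_70 = slice_probability_image(example_pi, 0.7)
```

Raising the TPV from 0.5 to 0.7 moves the pixel with probability 0.5 from the target class to the non-target class, which is why the choice of TPV changes the resulting thematic map.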

Apart from a user-defined value, six criteria are implemented in BayNeRD to select the TPV which best meets a chosen criterion, making use of available reference information (

The case study involves soybean identification and mapping in Mato Grosso, which is a major Brazilian soybean producer (about 30% of the total domestic production) and an important global hub for tropical agricultural production [^{2} [

Although Brazil is the second largest producer of soybean worldwide [

In addition to remotely sensed spectral and temporal information, several other context variables are closely related with soybean occurrence in a given field (e.g., soil type and infrastructure facilities) [

In summary, six context variables and a reference thematic map were used as inputs in BayNeRD, where a BN model was defined based on experts’ knowledge. Probability functions were computed based on pixel counting of discretized variables, allowing BayNeRD to compute the PI, which was eventually used to produce a thematic map of soybean occurrence over the study area. This thematic map was then assessed using reference data. The following subsections describe the research materials and methods in detail.

All variables used in this case study, each represented by a raster GeoTIFF, were resampled to match the grid of the MODIS vegetation indices product (MOD13Q1), with a nominal spatial resolution of 250 × 250 m [

Next, two classes of variables were entered:

Target variable—

Context variables—the selected and available variables to compose the model are listed in

As a remote sensing input, the Crop Enhancement Index (CEI) [

In BayNeRD we used

The fourth context variable used was the

Another variable that influences soybean occurrence is the

The

Finally, areas that have no realistic role for commercial soybean production or are safeguarded by environmental protection laws in Mato Grosso were masked out. These include: (i) natural forest, identified from the Amazon Deforestation Monitoring Project (PRODES), carried out by INPE [

Given the (in)dependence relationships among the context variables and between each context variable and the target variable (

The first step after the definition of the BN graphical model is the discretization of continuous variables. The number of intervals must be appropriately chosen,

Regarding

CEI (

Indeed,

As with

BayNeRD uses the (in)dependence relationships among the variables (

Based on the designed BN model and the probability functions defined, BayNeRD computes, for each pixel in the study area, the probability of soybean presence given observations made on the context variables,

The resulting PI (

The PI shows the spatial distribution of (the probability of) soybean crops throughout Mato Grosso territory in crop year 2005/2006. Green colored pixels represent areas with higher probability of soybean presence based on observation of the context variables. Some of the main soybean production centers according to IBGE [

The higher probabilities shown in

Various other combinations of context variables can be found in the study area. The BN is adept at dealing with such occurrences. According to KL divergence [_{C} = 0.28 and KL_{L} = 0.16,

The relatively small influence of

The influence of

In general, where only one context variable is unfavorable and/or is not strongly related to the target variable (KL_{W} = 0.003), any decrease in the calculated probability of soybean presence is likely to be very small. However, if the context variable has a strong relationship with the target variable (KL_{C} = 0.28), any unfavorable condition of this variable is likely to decrease soybean probability values substantially. Additionally, the mixing within a pixel size of 250 × 250 m (defined as our nominal spatial resolution), especially over the boundaries of defined discretized intervals, could be noted in

The PI, as shown in

Additionally, the PI can also be used to produce a thematic map (e.g., for acreage estimates) by applying a threshold probability value where all pixels with values above the threshold are allocated to the target thematic class (e.g., soybean). This value, herein called TPV, can be defined as any real value between 0% and 100%. Apart from a manually defined TPV, six criteria were implemented in BayNeRD to select a TPV according to some criterion, as defined in Section 3.6.

The goal is to find the TPV that generates the most suitable thematic map showing two classes: target (soybean) and non-target (non-soybean). Several metrics are discussed in the literature to assess map accuracy [

By varying the TPV from 0% to 100% different thematic maps were produced. Obviously, TPV = 0% produced a thematic map where all pixels within the study area were labeled as soybean. When all pixels were labeled soybean, all true soybean areas were then labeled as soybean and consequently sensitivity was equal to 100%. On the other hand, all true non-soybean areas were also labeled as soybean, and, consequently, specificity was 0%. With TPV increasing from 0 to 100%, sensitivity decreases while specificity increases. A useful graph to represent accuracy assessment in terms of these two indices is known as a Receiver Operating Characteristic (ROC) curve [
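The behavior of sensitivity and specificity as the TPV varies can be reproduced with a short Python sketch. It assumes flat lists of pixel probabilities and of reference labels (1 for soybean, 0 for non-soybean) and that both classes are present in the reference; the function names and the sample values are hypothetical.

```python
def sensitivity_specificity(probabilities, reference, tpv):
    """Return (sensitivity, specificity) of the map obtained by slicing at tpv."""
    tp = fn = tn = fp = 0
    for p, ref in zip(probabilities, reference):
        predicted = p >= tpv
        if ref == 1:
            if predicted:
                tp += 1
            else:
                fn += 1
        else:
            if predicted:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference must contain both classes")
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(probabilities, reference, steps=100):
    """List of (tpv, sensitivity, specificity) for tpv = 0, 1/steps, ..., 1."""
    points = []
    for k in range(steps + 1):
        tpv = k / steps
        sens, spec = sensitivity_specificity(probabilities, reference, tpv)
        points.append((tpv, sens, spec))
    return points


# Hypothetical probabilities and reference labels for six pixels.
sample_probabilities = [0.9, 0.8, 0.3, 0.2, 0.6, 0.1]
sample_reference = [1, 1, 1, 0, 0, 0]
sample_curve = roc_points(sample_probabilities, sample_reference, steps=10)
```

At TPV = 0 every pixel is labeled soybean, so the first point has sensitivity 1 and specificity 0; along the list sensitivity never increases and specificity never decreases.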

In the ROC curve presented in

A TPV can be made more or less restrictive, that is, it can demand a higher or lower degree of belief (expressed as a probability value) before a pixel is assigned to the target thematic class, thereby prioritizing either specificity or sensitivity. If the aim is for the total soybean area of the final thematic map to closely match the official statistics, the TPV can also be selected accordingly. For example, the thematic map generated with a manually defined TPV of 84% is more restrictive in labeling a pixel as soybean but best matched the official soybean acreage for the 2005/2006 crop year in Mato Grosso. Indeed, this thematic map presented 6.1 Mha of soybean, only 0.8% higher than the official data published by IBGE [
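Selecting a TPV so that the mapped area matches a known acreage can be sketched as a simple search. The Python sketch below assumes a flat list of pixel probabilities and a nominal pixel area of 6.25 ha (250 × 250 m); the function names, the candidate grid of TPVs and the sample values are hypothetical, and the sketch ignores any variation of true pixel area with map projection.

```python
PIXEL_AREA_HA = 250 * 250 / 10_000  # 6.25 ha per nominal 250 m x 250 m pixel


def mapped_area_ha(probabilities, tpv):
    """Area (ha) labeled as target when slicing the probabilities at tpv."""
    return sum(1 for p in probabilities if p >= tpv) * PIXEL_AREA_HA


def tpv_matching_area(probabilities, target_area_ha, steps=100):
    """Candidate TPV (on a regular grid) whose mapped area is closest to the target area."""
    candidates = [k / steps for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda tpv: abs(mapped_area_ha(probabilities, tpv) - target_area_ha),
    )


# Hypothetical example: six pixels and a reference area equal to three pixels.
sample_pixels = [0.95, 0.9, 0.85, 0.6, 0.4, 0.2]
best_tpv = tpv_matching_area(sample_pixels, target_area_ha=18.75)
```

When several candidates give the same area difference, `min` returns the smallest of them; a different tie-breaking rule could be used if a more restrictive map is preferred.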

Similar to mapping soybean using remote sensing and environmental variables, Krug

This paper described the basic functioning and implementation of a computer-aided BN method for raster data analysis: Bayesian Networks for Raster Data (BayNeRD). BayNeRD provides a new computer-aided method to characterize phenomena through plausible reasoning inferences based on observations of several variables. The number of variables is not limited, and the sole conditions are an accurate match of raster cells and the availability of a suitable reference data set.

The case study of mapping soybean areas in Mato Grosso State, Brazil, showed BayNeRD’s capability to model environmental phenomena. Based on observations of Crop Enhancement Index (CEI) values for the current and previous crop years, soil type, terrain slope, and distance to the nearest road and water body, the resulting Probability Image (PI) from BayNeRD depicted a spatial distribution of soybean areas consistent with expert knowledge and official statistical data. Furthermore, the PI was used to produce soybean thematic maps by varying the Target Probability Value (TPV) according to different criteria, achieving an overall accuracy greater than 91% or a soybean acreage estimate in more than 99% agreement with the official data.

The advantages of BayNeRD are that it incorporates experts’ knowledge into the process; it models the (in)dependence relationships among several observed variables; it outputs variable importance information through the Kullback-Leibler divergence; it can accommodate different forms of data (numerical and categorical); it can handle incomplete data; it allows computation of probability functions from the data; and it is a user-friendly implementation in free software, ready to handle raster data sets.

The BayNeRD algorithm has been implemented in R software [

The authors thank Nikolay Balov, from the University of Rochester Medical Center, for assistance with the catnet R package; Ruy Dalla Valle Epiphanio for sharing part of the dataset used in this work; the Brazilian Research Councils CNPq (

The authors declare no conflict of interest.

Directed Acyclic Graph (DAG) representing a hypothetical BN graphical model where the target variable

Study area corresponding to Mato Grosso State, Brazil. The analysis was only performed in areas that were not masked out.

Summary of the procedures used in the case study of applying BayNeRD to identify soybean plantations in Mato Grosso State, Brazil.

Directed Acyclic Graph (DAG) encoding assertions of conditional (in)dependence among the variables and representing the designed Bayesian Network graphical model for the case study of

Discretization of context variable


Histogram of

Bayesian network structure and the defined probability function (shown in the related table) for each variable used in this case study. The six context variables are described in

Probability Image (PI) of soybean presence for the entire Mato Grosso State, Brazil. The main soybean production centers and the capital, Cuiabá, are highlighted. The color indicates the calculated probability of soybean presence in 2005/2006 given the observations made for the context variables, as expressed by

Probability Image (PI) of soybean presence and six context variables (described in

Receiver Operating Characteristic (ROC) curve, depicting sensitivity and specificity indices associated with thematic maps generated from the Probability Image (PI) by varying the Target Probability Value (TPV) from 0% to 100%. The circle points out the best TPV according to the chosen criterion.

Accuracy indices associated with thematic maps generated from the Probability Image (PI) by varying the Target Probability Value (TPV) from 0% to 100%. The vertical line identifies the best TPV, according to the chosen criterion, highlighting the accuracy achieved according to each index (described in the legend).

Summary of the six context variables used in the soybean mapping case study.

CEI^{*} | |

CEI^{*} | |

Soil | |

Distance to the nearest | |

Distance to the nearest |

^{*} Crop Enhancement Index [

Summary of the interval limits defined for each of the six context variables, described in

1 | [−∞; 0.05) | [−∞; 0.05) | low | [−∞; 0.06) | [−∞; 0.5) | [−∞; 3.0) |
2 | [0.05; 0.20) | [0.05; 0.20) | high | [0.06; 0.12) | [0.5; 1.0) | [3.0; 8.0) |
3 | [0.20; 0.26) | [0.20; 0.26) | | [0.12; +∞) | [1.0; 2.0) | [8.0; +∞) |
4 | [0.26; +∞) | [0.26; +∞) | | | [2.0; +∞) | |
# of intervals | 4 | 4 | 2 | 3 | 4 | 3 |

Intervals are closed on the left and open on the right, as denoted by [ and ), respectively.