Remote Sensing
A Bayesian Based Method to Generate a Synergetic Land-cover Map from Existing Land-cover Products

Global land cover is an important parameter of the land surface and has been derived by various researchers from remote sensing images. Each land cover product has its own disadvantages and limitations. Data fusion is becoming a notable way to fully integrate existing land cover information. In this paper, we developed a method based on Bayes' theorem to generate a synergetic global land cover map (synGLC). A state probability vector was defined to describe the land cover classification of every pixel precisely and quantitatively and to reduce the errors caused by legend harmonization and spatial resampling. Simple axiomatic approaches were used to generate the prior land cover map, in which pixels with high consistency were regarded as correct and then used as a benchmark to obtain the posterior land cover map. Validation results show that our hybrid land cover map (synGLC; the dataset is available on request) has the best overall performance compared with the existing global land cover products. Closed shrublands and permanent wetlands have the highest uncertainty in our fused land cover map. This novel method can be extensively applied to the fusion of land cover maps with different legends, spatial resolutions or geographic ranges.


Introduction
Land cover data describe the physical material at the surface of the earth. Land cover has great impacts on surface energy, the carbon cycle, the water balance and the consequences of land use and land cover change [1][2][3][4]. It is also a basic parameter for many land surface models, such as the Ecosystem-Atmosphere Simulation Scheme [5] and the Common Land Model [6]. Reliable and accurate land cover data provide key information for relevant environmental research [7].
As shown in Table 1, various global land cover datasets have been produced based on remote sensing data, including the Global Land Cover Classification [8,9] from the University of Maryland Department of Geography (UMDLC), the Global Land Cover Characterization (GLCC) Data Base [10,11], the Global Land Cover map for the year 2000 (GLC2000) from the European Commission Joint Research Centre [12,13], the Moderate-Resolution Imaging Spectroradiometer (MODIS) global land cover map products (MCD12Q1) developed by Boston University and coordinated by the MODIS Land Team from the National Aeronautics and Space Administration [14][15][16], and the Global Land Cover Map (GlobCover) from the European Space Agency (ESA) in cooperation with an international network of partners [17,18]. GLCC was developed through a continent-by-continent unsupervised classification of 1-km monthly Advanced Very High Resolution Radiometer (AVHRR) Normalized Difference Vegetation Index (NDVI) composites covering a 12-month period (April 1992 to March 1993). UMDLC was based on data from the AVHRR, using the classification tree approach [8]. GLC2000 was produced based on daily global data acquired by the Vegetation instrument on board the Systeme Probatoire d'Observation de la Terre (SPOT) 4 satellite [12]. MCD12Q1 was derived from a year of Terra- and Aqua-MODIS observations. GlobCover2009 was generated using an automated processing chain from the 300-m Medium Resolution Imaging Spectrometer Instrument (MERIS) time series.
Previous inter-comparisons of these datasets [19][20][21][22] revealed marked disagreements and uncertainties among them. Several researchers have tried to produce a hybrid global land cover map by fusing existing land cover products [7,23,24]. See and Fritz [24] first produced a hybrid land cover map by fusing the GLC2000 and MODIS products. Jung et al. [23] presented a method that merged existing products into a new joint 1-km global land cover product with improved characteristics for carbon cycle models. However, the individual strengths and weaknesses of the products were not considered, and no validation or data quality assessment was provided. Fritz et al. [25] then generated a synergy cropland map of sub-Saharan Africa from five global land cover products; their approach requires subjective ranking by experts and does not consider legend conversion. Perez-Hoyos et al. [7] developed a general framework for building a hybrid land cover map for Europe from four land cover datasets. This approach can be applied to any set of existing products; however, it requires sufficient training data, which limits its application at the global scale.
The objective of this study is to produce a hybrid global land cover map by making use of all existing global land cover datasets with different legends and different spatial resolutions. A novel technique based on Bayes' theorem was developed. The classification of each pixel in a land cover map was represented by a discrete probability distribution, which describes the state of land cover more precisely than a single class label. The hybrid global land cover dataset was produced as the posterior distribution of a prior global land cover map.

Land-Cover Datasets
Five global land cover datasets were used in this study: GLCC, UMDLC, GLC2000, MCD12Q1 version 051 for the year 2005, and GlobCover for the year 2009. GLCC was published with various classification legends; the version with the International Geosphere Biosphere Programme (IGBP) land cover classification was chosen in this study. UMDLC was developed with a 14-class labeling and shading scheme. GLC2000 uses the Food and Agriculture Organization (FAO) Land Cover Classification System (LCCS). MCD12Q1 is provided with five global land cover classification systems, among which the IGBP legend was selected in this study. GlobCover is associated with a legend defined and documented using the United Nations (UN) LCCS.
These five land cover datasets have different spatial resolutions and coverage years, as shown in Table 1. How the inconsistent classifications and spatial resolutions are harmonized is described in Section 3.1. Even though some outdated maps were included, their aberrant classifications were recognized and their valuable information was taken into account by our method. The different acquisition dates cannot account for the discrepancies among land cover maps, because land cover change cannot be detected given the insufficient accuracy of the individual land cover maps [23]. We therefore ignored the impacts of land cover changes when designing our fusion method.
GLC2000ref is the result of a consolidation of the original GLC2000 dataset and provides 1253 samples [12]. The GlobCover-2005ref dataset is the result of a consolidation of the original ESA-GlobCover 2005 dataset; it contains 4258 samples, globally distributed and selected by stratified random sampling [26,27]. STEP is maintained as a database of training polygons drawn on high spatial resolution imagery that can be extracted with GIS to produce a global land cover classification; it is the training site database of MCD12Q1 [14,15][28][29][30][31][32]. The VIIRS Surface Type validation database is based on a stratified random sample of 500 blocks (5 × 5 km) distributed globally [14,15][28][29][30][31][32]. The correct class of each sample according to the IGBP legend was identified by manual interpretation of very high spatial resolution (<2 m) imagery; MODIS time series data were used to improve the interpretations.

Method
In this section, we describe our methodology for the fusion of five land cover datasets. The main steps of the fusion procedure are shown in the flowchart in Figure 1: (1) resampling and reclassifying the existing land cover maps into a common legend and spatial resolution; (2) generating a prior estimate of the state probability vector of International Geosphere Biosphere Programme (IGBP) classes for each pixel; and (3) updating the state vector of each pixel according to the classes of pixels with high certainty.
We first resampled and reclassified each land cover dataset into the same legend and spatial resolution (1/120 degree); the state probability vectors of the $k$-th harmonized map are denoted by $\mathbf{p}_k$. We then combined them into a prior global land cover map ($\mathbf{p}_{\mathrm{prior}}$). The pixels with high consistency were then extracted as benchmark pixels. Finally, we updated the probability distribution of each pixel using Bayes' theorem and obtained the posterior global land cover map ($\mathbf{p}_{\mathrm{post}}$).

Reclassification and Resampling
To facilitate the fusion of different land cover maps, they need to be homogenized into a common legend; in this study we selected the IGBP classification system (Table 3). The 17 categories of the IGBP land cover legend embrace the climate independence and canopy component philosophy presented by Running et al. [35], and are compatible with classification systems for environmental modeling for providing landscape information [10]. The correspondence between the IGBP and other legends is rarely 100% [24] and some classes overlap partially [19], so a simple one-to-one conversion can produce errors. Thus, in this study every land cover type was translated to a state probability vector representing the probability that it belongs to each IGBP land cover type. The state probability vector makes it possible to convert one land cover type to more than one IGBP land cover type and reduces the error caused by legend conversion.
The different land cover legends (UMD (Table 4), FAO LCCS (Table 5) and UN LCCS (Table 6)) were converted to the IGBP legend according to a comparison of legend definitions, pixel-by-pixel statistical comparison and previous comparison studies [36][37][38]. Because of insufficient information, a land cover type that corresponds to several IGBP land cover types was assigned to them with equal probability. Considering possible classification mistakes, we assumed that each pixel of every land cover map was classified into a wrong class with 50% probability. That is to say, in the state probability vector of a given land cover class, the total probability of all specified IGBP classes is 50%. This assumption does not substantially change the classification of each pixel but allows the uncertainties of the classification of the land cover maps to be assessed.
All the global land cover maps need to be projected to the same projection (geographical projection in this study) and resampled to the same spatial resolution (1/120 geographical degree in this study). The state vector of each resampled pixel was the average of the original pixels' state vectors, weighted by the area of their overlap with the resampled pixel. For example, when resampling from 300 m to 1 km, two grids that do not align with each other (Figure 2), the land cover state probability vectors of the resampled pixels were combined based on the overlapped area with the original pixels. In this way, the classes of all contributing pixels are retained as proportions rather than being discarded when resampling. Finally, all the global land cover datasets were homogenized. The state probability vector of the pixel located in the $x$-th column of the $y$-th row of the $k$-th land cover map is represented by $\mathbf{p}_k(x,y)$, of which the element $p_k^i(x,y)$ stands for the probability that the pixel belongs to the $i$-th IGBP class.
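The legend translation and area-weighted resampling described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `state_vector` and `resample` are hypothetical, and the text does not state how the remaining 50% of probability is distributed, so the sketch assumes it is spread evenly over the IGBP classes that were not specified.

```python
M = 17  # number of IGBP classes in the common legend

def state_vector(igbp_classes, p_specified=0.5, m=M):
    """State probability vector for one source-legend class.

    igbp_classes: 1-based IGBP class numbers the source class translates to.
    The specified classes share p_specified equally. How the remaining
    probability is distributed is an ASSUMPTION of this sketch: it is
    spread evenly over the classes that were not specified.
    """
    targets = set(igbp_classes)
    inside = p_specified / len(targets)
    outside = (1.0 - p_specified) / (m - len(targets))
    return [inside if i in targets else outside for i in range(1, m + 1)]

def resample(vectors, overlap_areas):
    """Area-weighted average of the state vectors of overlapping source pixels."""
    total = sum(overlap_areas)
    return [sum(a * v[i] for v, a in zip(vectors, overlap_areas)) / total
            for i in range(len(vectors[0]))]

# GLC2000 type 13 translates to IGBP types 6 and 10 (Table 5)
v13 = state_vector([6, 10])

# Hypothetical target pixel: 75% of its area is covered by a type-13 pixel,
# 25% by a pixel whose class translates to IGBP type 7 only
mixed = resample([v13, state_vector([7])], [0.75, 0.25])
print(round(v13[5], 4), round(sum(mixed), 4))
```

Because every input vector sums to one and the weights are normalized by the total overlap area, the resampled vector is again a valid probability distribution.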

Generate Prior Global Land Cover Map
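A prior global land cover map, needed by the Bayes method, was generated by aggregating the information provided by the existing land cover products; the prior state probability vector of pixel (x, y) is denoted by $\mathbf{p}_{\mathrm{prior}}(x,y)$. We therefore need to combine the probability distributions of the existing land cover products into one. Without any other information available, simple axiomatic approaches [39] were used, namely the linear opinion pool
$$p_{\mathrm{prior}}^{i}(x,y) = \sum_{k=1}^{N} w_k \, p_k^{i}(x,y) \tag{1}$$
and the logarithmic opinion pool
$$p_{\mathrm{prior}}^{i}(x,y) = \alpha \prod_{k=1}^{N} \left[ p_k^{i}(x,y) \right]^{w_k} \tag{2}$$
where $p_k^{i}(x,y)$ is the probability that pixel (x, y) belongs to the $i$-th class in the $k$-th land cover map, $N$ is the number of land cover maps used ($N = 5$ in this study), $w_k$ is the weight of the $k$-th land cover map ($w_k = 1/N$ in this study), and $\alpha$ is a normalizing constant. The prior land cover class is denoted by $c_{\mathrm{prior}}(x,y)$, which was derived from
$$c_{\mathrm{prior}}(x,y) = \arg\max_{1 \le i \le M} \, p_{\mathrm{prior}}^{i}(x,y) \tag{3}$$
where $M$ is the number of classes in the common legend ($M = 17$ for the IGBP legend in this study). The parameter $u_{\mathrm{prior}}(x,y) = \max_{i} p_{\mathrm{prior}}^{i}(x,y)$ represents the classification certainty for pixel (x, y). Without further information about which method is more accurate, both the linear and the logarithmic opinion pools were used when generating the prior global land cover map in this study, and their differences were also compared.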
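To make the two pooling rules concrete, the following minimal Python sketch pools the state vectors of a single pixel. The helper names (`linear_pool`, `log_pool`, `prior_class_and_certainty`), the three-class toy legend and the equal weights are assumptions made for illustration only.

```python
import math

def linear_pool(vectors, weights):
    """Weighted arithmetic mean of the state vectors (linear opinion pool)."""
    m = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(m)]

def log_pool(vectors, weights):
    """Normalized weighted geometric mean (logarithmic opinion pool)."""
    m = len(vectors[0])
    raw = [math.prod(v[i] ** w for v, w in zip(vectors, weights))
           for i in range(m)]
    total = sum(raw)  # plays the role of the normalizing constant
    return [r / total for r in raw]

def prior_class_and_certainty(p):
    """Return the 1-based class with the largest probability and that probability."""
    certainty = max(p)
    return p.index(certainty) + 1, certainty

# Toy pixel: three maps, three classes (hypothetical values, not from the paper)
maps = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.7, 0.1]]
weights = [1.0 / len(maps)] * len(maps)  # equal weights, assumed for the sketch

lin = linear_pool(maps, weights)
geo = log_pool(maps, weights)
print(prior_class_and_certainty(lin))
print(prior_class_and_certainty(geo))
```

In this toy case both pools select the same class but report different certainties, which illustrates why certainties from the two priors should not be compared directly.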

Update State Vector of Each Pixel
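The state probability vector of each pixel in the prior global land cover map was updated based on Bayes' theorem. The updated probability for pixel (x, y) can be written as a conditional probability given the classifications of the existing land cover products:
$$p_{\mathrm{post}}^{t}(x,y) = P\big(T(x,y) = t \mid c_1(x,y), c_2(x,y), \ldots, c_N(x,y)\big) \tag{4}$$
where $T(x,y)$ is the true class of pixel (x, y), which is unknown, and $t \in \{1, 2, \ldots, M\}$. The comma denotes joint probability, and $c_k(x,y)$ denotes the maximum likelihood land cover class in the state probability vector of pixel (x, y) in the $k$-th land cover map, which means $c_k(x,y) = \arg\max_{i} p_k^{i}(x,y)$.
According to Bayes' formula, the above conditional probability can be written as:
$$p_{\mathrm{post}}^{t}(x,y) = \alpha \, P\big(c_1(x,y), \ldots, c_N(x,y) \mid T(x,y) = t\big) \, P\big(T(x,y) = t\big) \tag{5}$$
where $\alpha$ is a normalizing constant and $P(T(x,y) = t)$ is the prior probability that the true class of pixel (x, y) is $t$, which is identical to the prior probability $p_{\mathrm{prior}}^{t}(x,y)$. Given the assumption that the land cover maps are independent, Equation (5) can be rewritten as
$$p_{\mathrm{post}}^{t}(x,y) = \alpha \, p_{\mathrm{prior}}^{t}(x,y) \prod_{k=1}^{N} P\big(c_k(x,y) \mid T(x,y) = t\big) \tag{6}$$
Here, $\prod_{k=1}^{N} P(c_k(x,y) \mid T(x,y) = t)$ is the updating coefficient of the prior state vector $\mathbf{p}_{\mathrm{prior}}(x,y)$.
For any $P(c_k = c \mid T = t)$ in the updating coefficient, we have
$$P(c_k = c \mid T = t) = \frac{P(c_k = c, \; T = t)}{P(T = t)} \tag{7}$$
As we do not know the true class for any pixel (x, y), we assume that $T(x,y) = c_{\mathrm{prior}}(x,y)$ for any pixel (x, y) whose prior certainty $u_{\mathrm{prior}}(x,y)$ (the maximum of its prior state probability vector) is higher than a given threshold, where $c_{\mathrm{prior}}(x,y)$ is the prior class. This threshold varies for different classes and is defined as the upper quartile of certainties for each class, so we have:
$$P(c_k = c \mid T = t) \approx \frac{\#\{(x,y) : c_k(x,y) = c, \; c_{\mathrm{prior}}(x,y) = t, \; u_{\mathrm{prior}}(x,y) > \theta_t\}}{\#\{(x,y) : c_{\mathrm{prior}}(x,y) = t, \; u_{\mathrm{prior}}(x,y) > \theta_t\}} \tag{8}$$
where $\theta_t$ is the certainty threshold for class $t$ and $\#\{\cdot\}$ denotes the number of pixels. In other words, we figured out the probability in Equation (8) by summarizing the classifications $c_k(x,y)$ under the conditions $c_{\mathrm{prior}}(x,y) = t$ and $u_{\mathrm{prior}}(x,y) > \theta_t$.
After substituting Equation (8) into Equation (5) and normalizing, we obtained the updated state vector $\mathbf{p}_{\mathrm{post}}(x,y)$. Furthermore, the posterior global land cover map was derived from:
$$c_{\mathrm{post}}(x,y) = \arg\max_{1 \le t \le M} \, p_{\mathrm{post}}^{t}(x,y) \tag{9}$$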
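The update step can be sketched as follows: benchmark pixels are selected by the per-class upper quartile of prior certainty, the conditional probabilities are estimated by counting over those pixels, and each prior vector is multiplied by the resulting coefficients and renormalized. All names are hypothetical; the nearest-rank quartile and the small additive smoothing term `eps` are assumptions of this sketch, not details given in the text.

```python
from collections import defaultdict

def argmax1(vec):
    """Return (1-based class index, probability) of the largest entry."""
    best = max(vec)
    return vec.index(best) + 1, best

def upper_quartile(values):
    """Nearest-rank upper quartile (one of several common conventions)."""
    ordered = sorted(values)
    return ordered[int(0.75 * (len(ordered) - 1))]

def estimate_likelihoods(priors, labels, m, eps=1e-6):
    """Estimate P(c_k = c | T = t) by counting over benchmark pixels.

    priors: prior state vectors, one per pixel.
    labels: labels[k][j] is the 1-based class of pixel j in map k.
    Returns like[k][t-1][c-1]. eps is additive smoothing (an ASSUMPTION of
    this sketch) so unseen combinations do not zero out the posterior.
    """
    prior_cu = [argmax1(p) for p in priors]
    by_class = defaultdict(list)
    for c, u in prior_cu:
        by_class[c].append(u)
    theta = {t: upper_quartile(us) for t, us in by_class.items()}
    # Benchmark pixels: certainty strictly above the per-class threshold
    bench = [j for j, (c, u) in enumerate(prior_cu) if u > theta[c]]
    like = []
    for lab in labels:
        counts = [[eps] * m for _ in range(m)]
        for j in bench:
            t = prior_cu[j][0]  # prior class stands in for the true class
            counts[t - 1][lab[j] - 1] += 1.0
        like.append([[v / sum(row) for v in row] for row in counts])
    return like

def update(prior, pixel_labels, like):
    """Posterior vector: prior times the product of P(c_k | T = t), normalized."""
    post = list(prior)
    for k, c in enumerate(pixel_labels):
        for t in range(len(post)):
            post[t] *= like[k][t][c - 1]
    z = sum(post)
    return [v / z for v in post]

# Toy data: 8 pixels, 2 classes, 2 maps (hypothetical values)
priors = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4],
          [0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.45, 0.55]]
labels = [[1, 1, 1, 2, 2, 2, 2, 1],
          [1, 1, 2, 2, 2, 2, 1, 2]]
like = estimate_likelihoods(priors, labels, m=2)
post = update(priors[3], [labels[0][3], labels[1][3]], like)
print(argmax1(post)[0])
```

In the toy data, the fourth pixel has a weak prior for class 1, but both maps label it class 2 and both maps agree with the benchmark pixels, so the update moves it to class 2.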

Validation
The four validation datasets have different land cover classifications (Table 2). For validation, the correspondence between the land cover legends of the validation datasets and of the land cover map needs to be defined. We regarded two land cover types from different legends as identical if they can be translated to the same IGBP classes according to the conversion rules defined above (Tables 4-6). For example, type 13 of the GLC2000 legend was converted to types 6 and 10 of the IGBP legend (Table 5), and type 140 of the GlobCover legend was converted to types 7 and 10 of the IGBP legend (Table 6); because both can be translated to IGBP type 10, we took GLC2000 class 13 and GlobCover class 140 as identical when validating GlobCover with GLC2000ref.
Considering the spatial representativeness, geo-location errors and pixel-shift errors of validation points, every validation point was compared with the pixel in which it is located and that pixel's second-order neighboring pixels. The percentage of matched pixels was defined as the validation accuracy of that point. An example is shown in Figure 3. The total validation accuracy of a land cover map was defined as the average accuracy of all the validation points in a reference dataset.
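A minimal Python sketch of this neighborhood-based accuracy is given below. The function names are hypothetical, the reference label is represented as the set of IGBP classes it can be translated to, and clipping the window at the map border is an assumption of the sketch.

```python
def point_accuracy(class_map, row, col, reference_classes, order=2):
    """Fraction of pixels in the (2*order+1) x (2*order+1) window around
    (row, col) whose class is compatible with the reference label."""
    matched = 0
    total = 0
    for r in range(row - order, row + order + 1):
        for c in range(col - order, col + order + 1):
            # Clip the window at the map border (an assumption of this sketch)
            if 0 <= r < len(class_map) and 0 <= c < len(class_map[0]):
                total += 1
                if class_map[r][c] in reference_classes:
                    matched += 1
    return matched / total

def map_accuracy(class_map, points):
    """Average accuracy over all validation points (row, col, reference set)."""
    return sum(point_accuracy(class_map, r, c, ref)
               for r, c, ref in points) / len(points)

# Toy 5 x 5 map in which 16 pixels carry IGBP class 10 and 9 carry class 1
grid = [[10] * 5, [10] * 5, [10] * 5, [10, 1, 1, 1, 1], [1] * 5]
acc = point_accuracy(grid, 2, 2, {10})
print(acc)
```

The toy grid reproduces the situation in Figure 3: 16 of the 25 pixels in the window match the validation point, giving an accuracy of 16/25.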

Posterior Global Land Cover Map and its Uncertainty
Using the method proposed in this study, synergetic global land cover datasets (synGLC-linear and synGLC-log) were generated, each consisting of a posterior class map ($c_{\mathrm{post}}$) and additional information on its certainty ($u_{\mathrm{post}}$, the maximum of the posterior state probability vector) (Figures 4 and 5), based on the prior land cover maps from the linear (Equation (1)) and logarithmic (Equation (2)) opinion pools. The most conspicuous differences between these two posterior maps were found in the Antarctic, probably because of the uncertainty caused by the melting ice sheet during the past decades. The spatial patterns of classification certainty (Figure 5) were similar, and most highly uncertain pixels were distributed in land cover transition regions. The preferable synGLC map was decided after validation (see Section 4.2).
To understand the differences between these two posterior maps, the percentages of each land cover class (in terms of number of pixels) are shown in Table 7. Additionally, the average certainties for each land cover class are shown in Figure 6. Closed shrublands, open shrublands, cropland/natural vegetation mosaic and permanent wetlands differed most between the posterior linear and logarithmic land cover maps. Accordingly, these classes had low average certainties (Figure 6), indicating high uncertainty. The posterior logarithmic land cover map had higher certainty than the linear one for every class, but this resulted from the different calculations and did not imply that the logarithmic one was better. It was the different prior land cover maps that engendered the differences in posterior uncertainty. In this approach, certainty is only comparable within the same prior land cover map. Validation with other reference data was necessary to assess the performance of our method and decide which land cover map is better.

Validation
Table 8 shows the validation results obtained with the method described in Section 3.4. The synGLC-log map has higher accuracy than synGLC-linear, so hereafter we only discuss synGLC-log (referred to as synGLC). It is reasonable that every land cover map has the highest accuracy when validated with its own reference data. For example, the GlobCover land cover map ranks first for GlobCover2005ref, and GLC2000 ranks first for GLC2000ref (the GLC2000 reference data). MCD12Q1 ranks first for VIIRS and STEP, because STEP is its training data and VIIRS was interpreted with the help of MODIS images. The synGLC map ranks second or third for every reference dataset. Because the synthetic map introduces information from other datasets, its accuracy is inevitably lower than that of each original map validated with that map's own reference data. However, considering that each map is biased toward its own reference or training data, the integrated map is considered to be less biased. The synGLC map has the best average ranking (2.5), followed by GLC2000 (3.0) and GlobCover (3.0), indicating that it has the best overall performance when validated with the four reference datasets.
MCD12Q1 has the highest average accuracy, followed by synGLC, owing to its extraordinarily high accuracy for STEP and VIIRS. However, its accuracy is unfavorable when validated with GlobCover2005ref and GLC2000ref. In contrast, our synGLC map has good accuracy when validated with every reference dataset, and thus has the best overall performance and is much less biased than the other products.

Compare synGLC with the Existing Global Land Cover Maps
To unravel how much information from each land cover product contributes to synGLC, the differences between synGLC and the previous land cover products were compared (Figure 7). Classifications that could not be converted to the IGBP classification in synGLC according to the rules (Tables 4-6) are defined as inconsistent.
The fewest inconsistent pixels were found between MCD12Q1 and synGLC (7.73%), with the largest inconsistencies in grasslands, open shrublands, woody savannas and cropland/natural vegetation mosaic; this indicates that synGLC is closest to the dataset with the highest average accuracy (MCD12Q1). GLCC has the second fewest inconsistent pixels with synGLC (8.54%), most of which are mixed forests, cropland/natural vegetation mosaic, open shrubland, and snow and ice. About 9.45% of the pixels of UMDLC are inconsistent with synGLC, most of which are woodland, wooded grassland, grassland, and closed and open shrubland. The inconsistency percentages of GLC2000 and GlobCover2009 are higher than the others, at 26.58% and 22.05%, respectively, mainly because of their insufficient information within Antarctica. For GLC2000, most of the inconsistent pixels are herbaceous cover (closed-open), cultivated and managed areas, and tree cover. For GlobCover2009, most of the inconsistent pixels are sparse (<15%) vegetation, mosaic forest or shrubland (50%-70%)/grassland (20%-50%), and closed to open (>15%, broadleaved or needle-leaved, evergreen or deciduous) shrubland (<5 m).
For each pixel, the number of land cover maps whose classification is inconsistent with synGLC (based on Figure 7) is shown in Figure 8. For more than 90% of pixels, the inconsistency value is less than or equal to 2. Most of the consistent pixels (zero inconsistency) are distributed in the ocean, the desert regions of North Africa, the Amazon rainforests and barren regions. The highly inconsistent pixels are mainly distributed in transition zones, such as those between tropical forests and savannahs. Because GLC2000 and GlobCover2009 did not provide a land cover map within the Antarctic, the coastline of Antarctica in synGLC comes from the information in UMDLC, GLCC and MCD12Q1.
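The per-pixel inconsistency value can be illustrated with a short Python sketch. The function name and the three example pixels are hypothetical; each input map is represented by the set of IGBP classes that its original class can be translated to.

```python
def inconsistency(syn_class, translated_sets):
    """Number of input maps whose class cannot be translated to syn_class."""
    return sum(1 for s in translated_sets if syn_class not in s)

# Hypothetical pixels: (synGLC IGBP class, translated IGBP classes of five maps)
pixels = [
    (10, [{10}, {10}, {6, 10}, {10}, {7, 10}]),  # all five maps are consistent
    (10, [{10}, {9}, {6, 10}, {10}, {14}]),      # two maps are inconsistent
    (7, [{10}, {9}, {6, 10}, {7}, {16}]),        # four maps are inconsistent
]
values = [inconsistency(c, sets) for c, sets in pixels]

# Share of pixels whose inconsistency value is at most 2
share_le_2 = sum(1 for v in values if v <= 2) / len(values)
print(values, share_le_2)
```

A map counts as consistent whenever the synGLC class is among its translated classes, so a class that maps to several IGBP types is never penalized for that ambiguity.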
The percentage of pixels with each inconsistency value in each land cover class of synGLC is shown in Figure 9. Water and barren or sparsely vegetated regions have the highest consistency. Among the five forest classes, evergreen broadleaf forest has the highest consistency. The pattern of inconsistency among these six global land cover datasets (the five original ones and synGLC) is similar to that of uncertainty (Figure 6).

Assumptions and Limitations
Our fusion method is based on Bayes' theorem and on assumptions that make the technique practicable. The assumptions are as follows: (1) each land cover map can make a mistake with 50% probability; (2) the classifications of the land cover maps are independent; (3) a classification with high agreement is true. Assumption 1 does not change the information in each land cover map but reduces the error of misclassification and legend conversion. Assumption 2 is likely to hold considering that each land cover map was produced by different researchers with different data and techniques; it makes it possible to solve the probability equation without considering covariance. Assumption 3 helps to construct the benchmark pixels and update the prior probability.
Intuitively, a hybrid land cover map should make the most of the advantages of every land cover product and would be expected to have the highest accuracy under any circumstance. However, several limitations still exist and prevent it from achieving this ideal state. The most important limitation is that a wrong prior state probability vector cannot be corrected if all the land cover products have wrong classifications, because we need Assumption 3 to distinguish good from bad classifications. To overcome the bias introduced by this assumption, independent third-party reference data can be used as the benchmark to update the prior land cover map and generate a posterior one. The method would be more effective with fewer assumptions.

Legends Translation
One major problem of our method is the subjective definition of the land cover legend conversion. Classes in different legends are rarely identical and often have overlapping definitions, so legend homogenization always produces errors. A detailed comparison of different legends is complicated and beyond the scope of this research. Consequently, we defined our legend translation rules according to previous comparison studies [36][37][38] with some modifications.
First of all, to tackle the legend mismatch problem, we defined the state probability vector, which makes it possible to convert one class into multiple classes without losing information. Furthermore, we assumed that any land cover product may make mistakes (Assumption 1) to weaken the noise in the land cover information. Both techniques can reduce the error caused by legend conversion. However, the rules described by the state probability vector are far from precise. More information than the equi-probable distribution used in this study is required to make them more accurate, which calls for more research on the quantitative relationships between different land cover legends.

Effects of Land Cover Changes
Uncertainties in synGLC mostly come from two sources: land cover changes and inaccuracies of the land cover products. How these two factors affect the fusion method and synGLC is important for understanding the reliability of our method and the accuracy of synGLC. However, because of the lack of sufficiently long land cover time series, we cannot directly assess the effects of land cover changes.
By a simple comparison of GLCC, GLC2000 and MODIS, Jung et al. [23] concluded that land cover change between 1993 and 2000 cannot explain their inconsistencies. In our research, the inconsistency percentages among land cover products range from 23% to 30%, excluding the ocean area. Additionally, their accuracies range from 45% (GLCC) to 61% (MCD12Q1). In contrast, the uncertainties stemming from land cover changes are relatively small. For example, only 8.6% of the land in the United States experienced changes from 1973 to 2000 [40]. The interannual variation derived from the MODIS land cover time series is about 10%, which is higher than the actual global land cover change [14].
Given the facts above, we can surmise that the principal source of uncertainty in synGLC was inaccuracy in land cover classification and that the effects of land cover changes are negligible. Our method mainly focuses on handling inconsistencies among land cover maps and achieving an optimal estimate.

Strength of Our Method
Despite these limitations, a significant advantage of our method is its remarkable extensibility. It can fuse land cover maps with different spatial resolutions and different legends by adjusting the state vectors and resampling parameters accordingly. Even other land surface parameters (such as leaf attributes or LAI) can be integrated if they are related to land cover and can be translated into a state probability vector of land cover classes. The method can also merge regional land cover maps into the global map by defining the state vector of a no-data pixel as a uniform distribution. In that way, all the available regional land cover maps can be fused into a global one to make use of all available information.
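The effect of a uniform no-data vector can be checked numerically. The sketch below uses a hypothetical three-class legend and a hypothetical `log_pool` helper implementing a normalized weighted geometric mean; it only illustrates the pooling step, not the full update. Under the logarithmic pool, a uniform vector contributes the same factor to every class and therefore cancels in the normalization.

```python
import math

def log_pool(vectors, weights):
    """Normalized weighted geometric mean of state vectors."""
    m = len(vectors[0])
    raw = [math.prod(v[i] ** w for v, w in zip(vectors, weights))
           for i in range(m)]
    total = sum(raw)
    return [r / total for r in raw]

m = 3                        # toy legend with three classes (hypothetical)
uniform = [1.0 / m] * m      # state vector of a no-data pixel
global_maps = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
w = 1.0 / 3.0                # same per-map weight in all three cases

without_regional = log_pool(global_maps, [w, w])
with_no_data = log_pool(global_maps + [uniform], [w, w, w])
with_regional = log_pool(global_maps + [[0.1, 0.8, 0.1]], [w, w, w])

print(without_regional.index(max(without_regional)) + 1)
print(with_regional.index(max(with_regional)) + 1)
```

Where the regional map has no data, the pooled vector is unchanged; where it has data, it can shift the pooled class, here from class 1 to class 2.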
In addition, our method can integrate both old (such as UMDLC) and new (such as GlobCover2009) land cover products and generate the average state of global land cover over the whole time range of the input products. This is important for land surface models that are run with a constant land cover parameter. Besides, the weight coefficients in Equations (1) and (2) can be modified according to research interests. They directly affect the prior land cover map and the benchmark pixels selected from it, which are used to generate the posterior map. Therefore, different weight coefficients would bring different biases into the synergetic land cover map. Such biases may compensate for the inaccuracy of land cover products if the weights of land cover maps with high accuracy are increased. In addition, such biases can be used to estimate the land cover map over a certain time span, by increasing the weights of the land cover maps within that time range.

Conclusions
In this paper, we demonstrated a technique based on Bayes' theorem to generate a hybrid global land cover map by blending existing products with different legends and spatial resolutions. Our method is simple and viable, relying on three reasonable assumptions and the definition of the state probability vector. Based on this method, our synGLC map was validated to have the best overall performance, with an average accuracy of 56.89% and an average ranking of 2.5, and it was the least biased land cover map compared with existing global land cover maps.
The remarkable extensibility of this method makes it possible to take advantage of all available information. With more and more land cover datasets becoming available for different regions, the method is expected to become increasingly useful. Although the limitations arising from the legend conversion and the three assumptions about the true state are considerable, they can be reduced by further research on land cover legends and by the increasing accessibility of independent reference data.

Figure 1. The flow chart of our method, which includes three steps: (1) resampling and reclassifying existing land cover maps into a common legend and spatial resolution; (2) generating a prior estimate of the state probability vector of International Geosphere Biosphere Programme (IGBP) classes for each pixel; (3) updating the state vector of each pixel according to the classes of pixels with high certainty.

Figure 2. Example of resampling from 300 m to 1 km. Land cover state probability vectors of resampled pixels were combined based on the overlapped area with the original pixels. In this way, the classes of all contributing pixels are retained as proportions when resampling.

Figure 3. An example of a validating point. The validating point is compared with its neighboring 5 × 5 pixels. Sixteen pixels match the validating point, so the validation accuracy is 16/25 (64%) for this point.

Figure 6. Average certainties of each land cover type.

Figure 7. Inconsistent part between synGLC and (a) UMD; (b) GLCC; (c) GLC2000; (d) MCD12Q1 and (e) GlobCover2009, shown in the respective land cover classification, with the percentage of each class in the total inconsistent pixels. Consistent pixels are shown in white.

Figure 9. Percentages of pixels with different inconsistency values in each land cover class of synGLC.

Table 1. Land cover datasets used in this study.

Table 2. Validation reference data used in this study.

Table 3. Numbers and descriptions of the International Geosphere Biosphere Programme (IGBP) land cover classification.

Table 4. Conversion rules from the University of Maryland (UMD) land cover legend to the International Geosphere Biosphere Programme (IGBP) legend and corresponding state probability vectors. Please see Table 3 for the description of IGBP values.

Table 5. Conversion rules from the GLC2000 land cover legend to the International Geosphere Biosphere Programme (IGBP) legend and corresponding state probability vectors.

Table 6. Conversion rules from the GlobCover land cover legend to the International Geosphere Biosphere Programme (IGBP) legend and corresponding state probability vectors.

Table 7. Pixel percentages of each land cover type in the posterior land cover maps.

Table 8. Accuracy and corresponding ranking of each land cover map when validated with different reference data.