Random Forest Classification of Wetland Landcovers from Multi-Sensor Data in the Arid Region of Xinjiang, China

Wetland classification from remotely sensed data is usually difficult because of extensive seasonal vegetation dynamics and hydrological fluctuation. This study presents a random forest classification approach for retrieving wetland landcover in arid regions by fusing Pléiades-1B data with multi-date Landsat-8 data. The Pléiades-1B multispectral image was segmented with an object-oriented approach, and geometric and spectral features were extracted for the segmented image objects. A normalized difference vegetation index (NDVI) series was also calculated from the multi-date Landsat-8 data to reflect the phenological changes of vegetation over its growth cycle. The feature set extracted from the two sensors was optimized and used to build the random forest model for classifying the wetland landcovers of the Ertix River in northern Xinjiang, China. Comparison with other classification methods, namely support vector machine and artificial neural network classifiers, indicates that the random forest classifier achieves accurate classification, with an overall accuracy of 93% and a Kappa coefficient of 0.92. The classification accuracy of farming lands and water bodies, which have distinct boundaries with the surrounding land covers, was improved by 5%–10% by making use of geometric shape properties. To overcome the classification difficulty caused by the similar spectral features of the vegetation covers, phenological differences and textural information from the gray-level co-occurrence matrix were incorporated into the classification, and the main wetland vegetation covers in the study area were derived from the data of the two sensors. Including phenological information reduced the classification errors and improved the overall accuracy by approximately 10%. The results show that the proposed random forest classification with fused multi-sensor data retrieves better wetland landcover information than the other classifiers, which is significant for the monitoring and management of wetland ecological resources in arid areas.


Introduction
Wetland is the richest ecosystem in terms of biodiversity, and one of the most important living environments for human beings [1]. In recent years, the wetland landscapes in China have been disturbed to varying degrees by human activities, especially in the arid areas where most wetlands have been converted into farming lands, causing the deterioration of wetland services and values [2]. Thus, it is crucial for wetland management departments to develop an effective and robust monitoring method in order to understand the land use/cover changes of wetlands and their surrounding areas in a timely manner. Research has shown that remote sensing can offer valuable information for monitoring wetland changes [3][4][5]. However, land cover classification from remotely sensed data is relatively difficult in arid areas because of the strong soil background interference and the challenge and cost of in situ data collection [6].
Data from optical satellite sensors such as Landsat, the Moderate Resolution Imaging Spectroradiometer (MODIS), and SPOT have been used to map land covers and to assess ecological variables in West Africa, Turkey, and the Mediterranean area [7][8][9]. Synthetic Aperture Radar (SAR) satellite imagery such as RADARSAT and JERS-1 has proven reliable in water extraction and wetland classification [10][11][12]. There are also many studies on wetland mapping in China, where remote sensing has been used to monitor changes in the Poyang Lake, Honghe wetlands, and Zhalong River wetlands [13][14][15]. Classification algorithms such as the conventional decision tree, maximum likelihood, support vector machines (SVM), and artificial neural networks (ANN) have been employed in wetland classification [16][17][18][19][20][21]. With the improved spatial resolution of remotely sensed imagery, the object-oriented strategy has also been proposed [22][23][24]. However, the accuracy and robustness of the existing classification methods are not yet satisfactory for wetland management; they produce large omission and commission errors because of the sparse yet variable vegetation and the hydrological fluctuation in the wetlands of arid areas [25,26]. The random forest (RF) algorithm, an ensemble classifier, has been shown to achieve high classification accuracy even when applied to data with strong noise [27,28]. Currently, the random forest classifier is widely employed in the landcover classification of mesophyte environments, but it is rarely used in wetland classification for arid and semiarid areas [29][30][31][32][33].
The objective of this study is to develop a wetland classification approach for arid and semiarid areas by fusing multi-sensor data. Specifically, an optimized feature set is studied, and a random forest classifier is implemented and compared with other machine learning classification methods to demonstrate the robustness of the RF in arid wetland classification. Pléiades-1B and Landsat-8 satellite data are used to classify the wetland landcovers in the Ertix River watershed in northern Xinjiang, China.

Study Area
The study area is situated at the southern foot of the Altai Mountains and on the northern verge of the Junggar Basin (Figure 1). It contains the town of Beitun and the adjacent areas through which the Ertix River flows. It has a typical continental and temperate climate with an annual average temperature of 3.6–3.9 °C. Many natural and artificial oases are distributed along the Ertix River watershed, most of which have been converted into farming lands. The vegetation is mostly temperate broadleaved deciduous forest, and there are few native forestland covers in this region. The main types of vegetation are secondary temperate shrubs and grasses growing near the water bodies and emergent macrophytes such as reed in the shallow waters [34,35].
The difficulty of wetland classification in the study area lies in the following aspects. First, the wetland landcovers are adjacent to the town of Beitun and are thus strongly affected and changed by human activities. Because many of the lands close to the river and ponds have been converted to farming use, the presence of crops complicates the wetland classification. Second, the terrestrial vegetation in arid and semiarid areas is generally very sparse and short, so the soil background contributes more noise to the spectral information collected by spaceborne sensors. In fact, the same type of land cover may show different spectral characteristics, while different land covers may have similar spectral signatures because they share the same life form. This is especially true in arid areas because of the large spatial and temporal heterogeneity of the land covers. Third, the hydrology of the wetlands in arid areas fluctuates significantly from season to season, introducing uncertainty into the classification of water bodies and altering the spectral characteristics of the nearby vegetation. The water volume of the river usually increases sharply in the wet season, and phytoplankton then grows widely at the end of summer, which makes the emergent plants in the waters harder to classify [36]. Finally, the canopy of the tall emergent plants at their growth peak is distributed in clusters and has textural characteristics similar to those of deciduous trees. All these challenges have resulted in the relatively low accuracy of wetland landcover classification in arid and semiarid areas with conventional methods. Therefore, the random forest algorithm is employed to jointly exploit the spectral, geometric, and phenological information in order to improve the accuracy and robustness of arid wetland classification.
Remote Sens. 2016, 8, 954

Data Acquisition and Preprocessing
Pléiades-1B Multispectral Data
Pléiades-1B is a commercial satellite launched by the French National Space Research Center (CNES) on 30 November 2012. A multispectral imager with a spatial resolution of 2 m is onboard, acquiring data in blue, green, red, and near-infrared bands, and Pléiades-1B simultaneously acquires a 0.5 m panchromatic band. A Pléiades-1B image covering the town of Beitun and surrounding wetlands was acquired on 10 August 2014. A principal component analysis (PCA) was performed and the grey-level co-occurrence matrix (GLCM) (variation) was calculated from the second principal component [37]. The first principal component image was segmented to generate image objects so that spectral and geometric features were calculated for each image object. The edge detection module in ENVI-EX was used in the segmentation, and the parameter "segmentation scale level" was set to 50 through a series of experiments.

Calculation of Landsat-8 Normalized Difference Vegetation Index (NDVI)
The operational land imager (OLI) onboard the National Aeronautics and Space Administration (NASA) Landsat-8 satellite acquires multispectral imagery at a spatial resolution of 30 m and a temporal resolution of 16 days. In total, 21 images covering the Altai region were collected and preprocessed, including geometric and atmospheric correction and cloud and snow masking. The atmospheric correction was performed with the Mid-Latitude Summer atmospheric parameters and urban aerosol parameters, as implemented in the ENVI/FLAASH module (Research Systems Inc.). A time series of phenological images was generated from the OLI multispectral images acquired from 24 March to 19 November 2014 as a proxy for the phenologies of the different types of vegetation in the study area. After feature optimization, three OLI NDVI images acquired on 19 June, 22 August, and 23 September 2014 were finally selected for the classification (see Section 2.3 for details).
In addition, roads were delineated from the 1:50,000 topographic maps and used as ancillary data for more effective geometric correction of the images, because ground control points (GCPs) are relatively easy to identify on street features. Two field trips were made in August 2014 and August 2015 to investigate the wetland vegetation and to collect in situ data for image interpretation. In total, 24 sample plots (approximately 10 m × 10 m each) were investigated in the study area, and the wetland vegetation species, photos, and GPS coordinates were collected and processed. The plots were mainly located on the shallow shores of lakes, swamps, and ponds, where typical wetland vegetation such as Phragmites australis, Tamarix chinensis, and Echinochloa crusgalli developed well.

Random Forest Classifier
Random forests (RF) are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all the trees in the forest. The RF classification is an ensemble classification, that is, an approach that uses many classifiers rather than a single one. In fact, hundreds of classifiers are built in RF classification and their decisions are usually combined by plurality vote. The premise is that a combination of classifiers is often more accurate than any single member of the ensemble [27,38], avoiding the conflicts among the feature subsets. As a result, the RF classification is widely used in processing remotely sensed imagery. The common element in all of these procedures is that for the k-th tree, a random vector $\theta_k$ is generated independently of the prior random vectors $\theta_1, \ldots, \theta_{k-1}$ but with the same distribution; the tree is grown using the training set and $\theta_k$, resulting in a classifier $h(x, \theta_k)$, where $x$ is an input vector [27]. The classifier consists of multiple trees constructed systematically by pseudo-randomly selecting subsets of components of the feature vector; that is, trees are constructed in randomly chosen subspaces, maintaining the highest accuracy on the training data while improving generalization accuracy as the forest grows in complexity [39]. The training set of each tree is drawn through a bagging process, and when a tree is grown, every node is divided using the best split among a random subset of the input features (predictive variables) rather than the best split among all variables. Although this weakens the strength of every single tree, it reduces the correlation between the trees and therefore the generalization error [40]. Because the trees of an RF classifier are grown without pruning, the time for generating the model does not increase significantly. The predicted class of an observation is determined by the majority vote of the trees in the RF model, and the discrimination function is defined as Equation (1):
$$H(x) = \arg\max_{Y} \sum_{k=1}^{K} I\left(h(x, \theta_k) = Y\right) \qquad (1)$$
where $H(x)$ is the combined classification result, $h(x, \theta_k)$ is the k-th tree classifier, $K$ is the number of trees, $Y$ is a candidate class label, and $I(\cdot)$ is the indicator function.
For each new training set that is generated to grow a tree, about one-third of the samples are randomly excluded; these are called the out-of-bag (OOB) samples. The remaining (in-the-bag) samples are used for building the tree. The OOB samples can be used to evaluate the model performance, and it has been shown that the OOB estimates are unbiased [27].
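The majority-vote rule described above can be illustrated with a minimal Python sketch. The three one-rule "trees", the feature order (NDVI, NDWI), and the thresholds are hypothetical and only stand in for trained decision trees; they are not taken from the study.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label that receives the most votes."""
    counts = Counter(tree_predictions)
    # most_common(1) returns [(label, count)]; on a tie the label seen first wins
    return counts.most_common(1)[0][0]

def forest_predict(trees, x):
    """Apply every tree to the input vector x and combine the votes."""
    return majority_vote([tree(x) for tree in trees])

# Hypothetical stand-ins for trained trees; x = (ndvi, ndwi)
trees = [
    lambda x: "water" if x[1] > 0.0 else "vegetation",
    lambda x: "vegetation" if x[0] > 0.3 else "water",
    lambda x: "vegetation" if x[0] > 0.5 else "other",
]

print(forest_predict(trees, (0.6, -0.4)))
```

In a real forest each tree would be grown on its own bootstrap sample, so the votes differ because the trees differ, not because the rules are hand-written as they are here.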

Feature Extraction and Optimization
Although existing research has shown that the RF classifier is robust and can achieve accurate classification results, the features selected for the classification are important and sensitive for wetland classification in arid areas, for the reasons discussed in Section 2.1. It is important to note that different natural plants and crops have distinct growth periods, and therefore a time series of remotely sensed images can be used to capture the phenologies of various vegetation types. Analyzing them is expected to improve the classification accuracy of the wetland landcovers. The OLI sensor onboard the Landsat-8 satellite can be used to extract classification features that reflect vegetation phenological differences.

Spectral Feature Selection and Optimization
The Pléiades-1B sensor collects multispectral imagery in four bands with a 2 m spatial resolution: the blue band of 430-550 nm, the green band of 500-620 nm, the red band of 590-710 nm, and the near-infrared (NIR) band of 740-940 nm. The derived NDVI image is able to show the impact of the underlying soil background and the vegetation canopy structure to some degree. The nonlinear mapping from the red and NIR bands to the NDVI reveals well the differences among sparsely to moderately vegetated areas. The normalized difference water index (NDWI), on the other hand, is a widely used composite index for detecting water bodies, because they have relatively high reflectivity in the green band and strong absorption in the NIR band. Furthermore, terrestrial vegetation and soil have a much stronger reflectivity in the NIR band than water; as a result, the water bodies can be easily differentiated from the vegetation covers on the multispectral imagery. NDWI is defined in Equation (2) [41]:
$$\mathrm{NDWI} = \frac{\rho_{green} - \rho_{nir}}{\rho_{green} + \rho_{nir}} \qquad (2)$$
where NDWI is the water body index, and $\rho_{green}$ and $\rho_{nir}$ are the reflectance in the green and NIR bands, respectively. Human land uses such as building clusters, farming lands, and green belts, as well as the crowns of natural vegetation, tend to present recognizable patterns and regular configurations on satellite imagery [42]. Previous studies have shown that the GLCM texture metric (e.g., variation) is useful for improving the classification accuracy of various land cover types and reducing the classification errors for objects with similar spectral features [22]. Therefore, the NDVI and NDWI images were derived from the original Pléiades-1B images, and a PCA transformation was performed to extract the first component, comprising brightness, and the second component, comprising surface structures. Subsequently, a GLCM texture analysis was conducted on the second component, and only the variation texture was selected because of its effectiveness in the classification. The optimized feature data set contains eight classification features, namely, NDVI, NDWI, PC1, variation texture, and the four spectral bands.
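As a minimal sketch of the two band indices used here, the following Python functions compute NDVI and NDWI in the standard normalized-difference form for a single pixel. The reflectance values in the example are hypothetical and chosen only to show the opposite signs that water and vegetation typically produce.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when both inputs are zero."""
    total = a + b
    return 0.0 if total == 0 else (a - b) / total

def ndvi(red, nir):
    """Vegetation index: high when NIR reflectance exceeds red reflectance."""
    return normalized_difference(nir, red)

def ndwi(green, nir):
    """Water index: high when green reflectance exceeds NIR reflectance."""
    return normalized_difference(green, nir)

# Hypothetical surface reflectances (unitless, 0-1)
water_pixel = {"green": 0.08, "red": 0.05, "nir": 0.02}
vegetation_pixel = {"green": 0.10, "red": 0.05, "nir": 0.45}

for name, p in (("water", water_pixel), ("vegetation", vegetation_pixel)):
    print(name, round(ndvi(p["red"], p["nir"]), 3), round(ndwi(p["green"], p["nir"]), 3))
```

The zero-denominator guard is a design choice for masked or no-data pixels; a production workflow might instead propagate a no-data value.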

Object-Oriented Feature Extraction
The spatial and geometric attributes of the image objects were used as classification features as well. The first principal component band was segmented with the edge detection algorithm (segmentation scale level of 50) to generate image objects, and eight metrics were calculated for each image object: compactness, elongation, grey information entropy, form factor, rectangle fitting, roundness, object area, and primary axis length. These metrics were converted into eight feature layers, providing valuable geometric and shape information for the classification.
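To make the role of the shape metrics concrete, the sketch below computes four of them from an object's area, perimeter, and axis lengths. The formulas are commonly used definitions and are an assumption here; the text does not state which definitions the segmentation software applies, and the function and parameter names are hypothetical.

```python
import math

def shape_metrics(area, perimeter, major_axis, minor_axis):
    """Compute illustrative shape metrics for one image object.

    Assumed definitions (not confirmed by the source):
      form_factor   = 4 * pi * area / perimeter**2   (1.0 for a circle)
      elongation    = major_axis / minor_axis         (1.0 for a square or circle)
      roundness     = 4 * area / (pi * major_axis**2) (1.0 for a circle)
      rectangle_fit = area / (major_axis * minor_axis) (1.0 for a rectangle)
    """
    return {
        "form_factor": 4 * math.pi * area / perimeter ** 2,
        "elongation": major_axis / minor_axis,
        "roundness": 4 * area / (math.pi * major_axis ** 2),
        "rectangle_fit": area / (major_axis * minor_axis),
    }

# A 200 m x 50 m rectangular field versus a circular pond of radius 30 m
field = shape_metrics(area=200 * 50, perimeter=2 * (200 + 50), major_axis=200, minor_axis=50)
pond = shape_metrics(area=math.pi * 30 ** 2, perimeter=2 * math.pi * 30, major_axis=60, minor_axis=60)
print(field)
print(pond)
```

Under these definitions a regular field scores high on rectangle fit and elongation, while a pond scores high on form factor and roundness, which is the kind of contrast that helps separate objects with distinct boundaries.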

Calculation of the NDVI Series
The time series of Landsat-8 OLI images from the same year was used to derive an NDVI series, as research has suggested that NDVI can serve as an effective indicator of seasonal changes in vegetation greenness, intensity of photosynthesis, and metabolic intensity. This is very important for differentiating non-wetland vegetation and therefore for improving the classification accuracy of wetland vegetation [43].
The accuracy of the NDVI calculation may be affected by a variety of factors, such as solar elevation angle, cloud, and water vapor. As a result, the seasonal variations can be poorly represented by the raw NDVI curves, which limits their value as indicators of phenological characteristics and makes trends or patterns difficult to detect. In this study, a harmonic analysis was performed to remove the noise and reduce the sample errors in the NDVI time series in order to enhance the phenological characteristics. The harmonic analysis is based on the discrete fast Fourier transform, through which a complicated NDVI time series can be decomposed into a set of periodic functions with different frequencies. The information related to vegetation phenology is usually carried by the low-frequency harmonics, while non-periodic interference such as the noise and errors introduced in data acquisition or processing is carried by the higher-frequency harmonics [44]. The higher-frequency harmonics were removed, and the remaining low-frequency harmonics were summed to reconstruct the NDVI time series with far less noise and error.
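The low-pass idea behind the harmonic analysis can be sketched in plain Python. This is an illustration under stated assumptions, not the study's implementation: it uses a direct discrete Fourier transform rather than an FFT, assumes evenly spaced observations, and the number of retained harmonics (`n_harmonics`) and the synthetic series are hypothetical.

```python
import cmath
import math

def dft(values):
    """Direct discrete Fourier transform (O(n^2), adequate for short series)."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(values))
            for k in range(n)]

def harmonic_smooth(values, n_harmonics):
    """Rebuild a series from its mean and its first n_harmonics harmonics."""
    n = len(values)
    coeffs = dft(values)
    kept = [0j] * n
    for k in range(n):
        # Index k and its mirror n - k belong to the same harmonic
        if min(k, n - k) <= n_harmonics:
            kept[k] = coeffs[k]
    # Inverse transform; imaginary parts cancel for a real-valued input
    return [sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(kept)).real / n
            for t in range(n)]

# Synthetic example: one seasonal cycle plus a high-frequency disturbance
n = 21
seasonal = [0.4 + 0.3 * math.cos(2 * math.pi * (t - 10) / n) for t in range(n)]
noisy = [s + 0.05 * math.cos(2 * math.pi * 8 * t / n) for t, s in enumerate(seasonal)]
smoothed = harmonic_smooth(noisy, n_harmonics=2)

# Largest deviation of the smoothed series from the clean seasonal curve
print(max(abs(a - b) for a, b in zip(smoothed, seasonal)))
```

Real NDVI series have gaps from cloud and snow masking, so some form of interpolation or least-squares harmonic fitting would be needed before a transform like this could be applied.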
As shown in Figure 2, the NDVI time series curves of the three main types of wetland vegetation in the study area can be used to differentiate their growth cycles. The growth peak of Phragmites australis is the latest among the three. Snowmelt water and a small amount of precipitation usually begin to replenish the dry soil in the Altai area at the end of May, and the moist soil allows for the rapid growth of the wetland vegetation. As a result, the biomass of the shrub Tamarix chinensis and the wetland Echinochloa crusgalli starts to increase and the vegetation coverage becomes larger, both leading to a continuous increase in NDVI. With the increased soil moisture, surface runoff is also produced and flows into the water bodies. Meanwhile, the water level of the lakes and marshes in the Altai area rises gradually, which also contributes to the rapid growth of emergent aquatic plants. Phragmites australis, a typical emergent aquatic plant in the study area, enters its rapid growing season later than the terrestrial vegetation does. The three types of wetland vegetation mentioned above show distinct differences in their NDVI values from June to September. Therefore, the NDVI data from images acquired on 19 June, 22 August, and 23 September were selected to represent the phenological stages of rapid growth, growth peak, and end of growth, respectively.
Based on the classification system of wetland landcovers and the constructed feature set, a random forest classifier was created in our study. Because the selection of training samples has a strong influence on the final classification result, the iterative self-organized cluster algorithm was adopted to perform cluster analysis on the features [45]. Data from the in situ investigation were also used to select suitable training samples for the RF classification. The RF classifier was trained with 200 trees, each constrained to five features [27]. The flowchart for the RF classification is illustrated in Figure 3.
In addition to the RF classifier, two other machine learning algorithms, SVM and ANN, were implemented for a comparative study because they have been widely used in remote sensing image classification. The SVM searches for the hyperplane with the maximum margin and can achieve good classification for data with small sample sizes and high dimensionality [46], while the ANN is trained by repeatedly modifying its weights through the weighted accumulation of the inputs and the back propagation of errors [47]. The SVM classifier was trained with a radial basis function kernel and a penalty parameter of 100. The ANN classifier was built as a three-layer back-propagation (BP) neural network with one hidden layer and the logistic activation function. The ANN classifier was set to 1000 iterations for each training experiment to ensure stability and reliability. To evaluate the different classifiers, cross validation was conducted during training. In the training of the RF classifier, the out-of-bag samples were used as validation samples, and they differ in every RF training experiment. For the SVM and ANN classifications, the training samples were randomly divided into two parts at a proportion of 4:1: 80% of the samples were used to train the classifiers and 20% were used to validate the results. The training experiment was conducted five times for the RF, SVM, and ANN classifiers, and the averaged classification accuracy was adopted for comparison.
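The repeated 4:1 validation procedure can be sketched as follows. This is a minimal illustration, not the study's code: `fit_majority` is a hypothetical stand-in classifier that always predicts the most common training label, and the sample data, seed, and function names are assumptions.

```python
import random
from collections import Counter

def split_4_to_1(samples, rng):
    """Shuffle the samples and split them 80% / 20%."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]

def mean_validation_accuracy(samples, fit, runs=5, seed=0):
    """Repeat the random split `runs` times and average the validation accuracy."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        train, valid = split_4_to_1(samples, rng)
        predict = fit(train)
        correct = sum(predict(x) == y for x, y in valid)
        scores.append(correct / len(valid))
    return sum(scores) / runs

def fit_majority(train):
    """Hypothetical stand-in: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

# Hypothetical labelled samples: (feature vector, class label)
samples = [((i,), "water" if i % 4 == 0 else "vegetation") for i in range(100)]
print(mean_validation_accuracy(samples, fit_majority))
```

Any classifier that exposes a train-then-predict interface could be passed in place of `fit_majority`; averaging over several random splits reduces the dependence of the reported accuracy on one particular partition.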

Experiment and Results
Using the approach proposed in Section 2, we extracted and optimized the features from the Pléiades-1B and Landsat-8 OLI multispectral data and developed an optimal feature set of 19 variables to perform the random forest classification described in Section 2.2. With reference to the wetland classification system in the Convention on Wetlands of International Importance Especially as Waterfowl Habitat, and based on the characteristics of the study area, the classification scheme shown in Table 1 was developed and used in this study.
Wheat, corn, and forest are the main land covers introduced by humans in the classification system (Table 1). Artificial oases are distributed in areas with better water supply, which are often where the human settlements are located. Lakes, marshes, reservoirs, and ponds are the main wetland types in the study area. The Pléiades-1B image used in this study was acquired in August 2014, when the water supply was relatively adequate and the water level was relatively high for the year. Phragmites australis is the main aquatic vegetation, and Tamarix chinensis shrub and wetland Echinochloa crusgalli grow alternately in the areas immediately around the open waters. In addition, the bare soils (e.g., sand, wasteland) and the urban impervious surfaces (e.g., buildings, road pavement) were both categorized as others without differentiation. Based on the field investigation, topographic maps, and visual image interpretation, 182 reliable, high-confidence sub-regions with 21,503 pixels were selected as training samples. For the purpose of comparison, the image data were classified with the three classifiers using the same training data set. All three classifiers produce relatively good results: the mean overall classification accuracies of the random forest classifier (RFC), SVM, and ANN classifiers are 98.7%, 89.4%, and 90.2%, respectively.
Another comparative experiment was conducted to examine the accuracy of the RF classification using the Pléiades-1B data alone and the Landsat-8 NDVI data alone (Table 2). The results indicate that neither the RF classification of the Pléiades-1B data nor that of the multi-temporal Landsat-8 NDVI data achieves a better result than our approach. Compared with Table 3, the accuracy of the three wetland covers is improved by about 10%, which indicates that the three NDVI layers extracted from the Landsat-8 OLI data played an important role in the classification by supplying vegetation phenological information.
To assess the accuracy of the three classifications independently, 5694 testing pixels were selected across the extent of the image by uniform random sampling. The comparative results show that the RF classification achieved the highest classification accuracy (Table 2), indicating its effectiveness in handling the interference of soil background in the sparsely vegetated arid areas (Figure 4).
The overall accuracy of the RF classification is approximately 92.5%, which is about 10% higher than that of the other two classifications. The difference between the simulation accuracy and the actual validation accuracy of the RF classification is about 5% smaller than that of SVM and ANN. This means that the RF classification has a stronger generalization capacity in the training process, while the other two tend to overfit their models to the training data. This further supports the view that the RFC's decision-making strategy based on combined voting outperformed the SVM and ANN classifiers. The confusion matrix for the RF classification was produced from the testing data set. The commission error is 100% minus the User's accuracy, and similarly the omission error is 100% minus the Producer's accuracy (Table 3). In the RFC result, the accuracy of the two main crops and the water bodies is over 95%, while that of the forestland and the three wetland plants is between 85% and 90%. Specifically, forestland, Phragmites australis, and Tamarix chinensis were classified with a relatively lower accuracy. Although their overall accuracy is still satisfactory, considerable commission and omission errors were observed in the experiment.
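The relation between the confusion matrix and the accuracy measures reported here can be made concrete with a short sketch. The function below is illustrative only: its name and the matrix layout (rows are reference classes, columns are predicted classes) are assumptions, and the Kappa coefficient is computed with the standard Cohen's formula. It also assumes every class has at least one reference pixel and one predicted pixel.

```python
def accuracy_report(matrix):
    """Summarize a square confusion matrix.

    Assumed layout: matrix[i][j] counts test pixels whose reference class is i
    and whose predicted class is j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(matrix[i]) for i in range(n)]                        # reference totals
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]  # predicted totals
    diag = [matrix[i][i] for i in range(n)]                             # correctly classified

    producers = [100.0 * diag[i] / row_sums[i] for i in range(n)]
    users = [100.0 * diag[i] / col_sums[i] for i in range(n)]
    overall = sum(diag) / total
    # Chance agreement expected from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (overall - expected) / (1.0 - expected)
    return {
        "overall_accuracy": 100.0 * overall,
        "kappa": kappa,
        "producers_accuracy": producers,
        "users_accuracy": users,
        "omission_error": [100.0 - p for p in producers],
        "commission_error": [100.0 - u for u in users],
    }
```

Under this convention, a class with high Producer's accuracy but low User's accuracy is seldom missed but often assigned to pixels that belong elsewhere.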

Discussion
The RF classification used only a subset of features to build each individual decision tree. This is advantageous because each decision tree can make a more accurate classification decision based only on effective features, reducing the error associated with the structural risk of including the entire feature vector in the analysis. Simply put, each decision tree is a classification "specialist" in the feature domain it uses. For a feature that differentiates a given class type, the non-specialist decision trees vote in a nearly random way, so that once the forest contains a sufficient number of trees, the specialist decision trees are numerous enough to produce an accurate classification. It is therefore fair to state that the RFC offers better stability and robustness.
For the study area, although all three classifiers could map the water bodies with a relatively high accuracy of over 90%, the RF classification performed best with an accuracy of 95%. Detailed comparison further shows that the SVM and ANN classifications have problems in mapping water bodies. Within the marsh area, some marsh pixels were misclassified as other types in the SVM and ANN results, especially in the water-land boundary areas. This issue is more pronounced in mapping ponds. In contrast, the three types of wetland water bodies were more accurately classified in the RFC result. Except for a few misclassified water bodies enclosed by Phragmites australis, the lakes, marshes, and ponds were correctly classified, even those located in the water-land transition zones. Compared with the marshes and ponds, the lakes are relatively large and somewhat circular, with a smoother shoreline. In the segmented images, the pixels of each water body were grouped into a single object, and geometric metrics such as shape index, roundness, and area could be calculated for the object. In contrast, the water pixels in marshes and ponds retain their heterogeneous features, and the resulting objects often have an irregular shape. In building the RFC model, the specialist trees were grown with these features taken into account on the basis of the training data. As a result, although the spectral features of the lake waters are similar to those of the marshes or ponds, shape and area can significantly increase the confidence of the node splitting in the decision trees. This explains why the RFC achieved a better classification of the wetland water bodies than SVM and ANN.
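To illustrate how geometric metrics of this kind separate compact objects from irregular ones, the sketch below computes area, roundness, and a shape index for a polygon outline. The formulas are common textbook definitions adopted here as assumptions (roundness as 4πA/P², shape index as P/(4√A)); the exact definitions used by the segmentation software in this study are not stated in the text, and the function names are hypothetical.

```python
import math

def polygon_area_perimeter(vertices):
    """Area (shoelace formula) and perimeter of a simple polygon given as (x, y) vertices."""
    area2 = 0.0
    perimeter = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the outline
        area2 += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(area2) / 2.0, perimeter

def geometric_features(vertices):
    """Assumed definitions: roundness = 4*pi*A / P^2, shape index = P / (4*sqrt(A))."""
    area, perimeter = polygon_area_perimeter(vertices)
    return {
        "area": area,
        "roundness": 4.0 * math.pi * area / perimeter ** 2,
        "shape_index": perimeter / (4.0 * math.sqrt(area)),
    }
```

With these definitions, roundness approaches 1 for a circle and falls toward 0 for elongated or ragged outlines, while the shape index equals 1 for a square and grows as the boundary becomes more irregular.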
In the classification of the two crops, RFC achieved a higher accuracy than SVM and ANN. The wheat in northern Xinjiang is usually harvested at the end of August or the beginning of September, which is one month earlier than the corn harvest. Both crops can be recognized on the Pléiades-1B image. Although the amount of biomass differs between the two crops, it is difficult to differentiate them using a single-date remote sensing image because NDVI saturates when the land is densely vegetated. The addition of Landsat-8 NDVI images for three different dates allowed for a much more accurate classification of the two crops. Moreover, it can be seen in Figure 4 that the SVM and ANN classifications appear relatively fragmented, and their classification errors often occurred in land parcels with mixed crops. Such issues were rarely found in the RF classification result. The particular farming practice in the study area created typical geometric outlines of the cropland parcels (Figure 5b). Therefore, the rectangular fit is generally high for both wheat and corn land parcel objects, which helps differentiate the two croplands from other land covers.
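The multi-date NDVI features discussed above follow from the standard definition NDVI = (NIR − Red) / (NIR + Red). The sketch below is a minimal illustration with hypothetical function names; returning 0.0 when both bands are zero is an assumption made here to avoid division by zero, not a choice documented in the study.

```python
def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red); returns 0.0 where both bands are zero."""
    denom = nir + red
    return (nir - red) / denom if denom != 0 else 0.0

def ndvi_series(observations):
    """Map a list of (nir, red) reflectance pairs, one per acquisition date, to NDVI values."""
    return [ndvi(nir, red) for nir, red in observations]
```

Applying `ndvi_series` to the (NIR, red) pairs of one image object on the three Landsat-8 dates gives three NDVI values, so two crops that look alike on one date can still differ on another, for example after the earlier wheat harvest.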
For the classification of the three wetland plants and forest, RFC performed much better than SVM and ANN, with an overall accuracy (OA) improvement of 10% to 15%. In the RFC result, more Phragmites australis communities are recognized, with a smaller omission error. They are also differentiated from the croplands more accurately than in the SVM and ANN results. This is largely because Phragmites australis grows explosively in the moist summer season, while the two crops do not follow the same phenological pattern. Furthermore, soil moisture decreases rapidly with distance from open water bodies such as lakes, rivers, and ponds. The terrestrial vegetation is mainly Tamarix chinensis bush with sparse drought-tolerant grass underneath it. The crown of the bush often appears circular or triangular on the high-resolution remote sensing image, and the RF classification is capable of utilizing the textural features of the bush clusters to reduce the confusion between Tamarix chinensis and Echinochloa crusgalli.
In the study area, the water level changes dramatically from the dry season to the rainy season, and consequently peat accumulation mainly occurs near the dry-season water level. In the moist period, aquatic plants such as Phragmites australis grow explosively because the water level is higher and the climate is more temperate. A clear spatial pattern can be found in the distribution of Phragmites australis communities: they grow densely over a large central part of the peat zone, and their density gradually decreases towards the edges. At the end of the summer season, Phragmites australis forms clumped clusters with explosively grown floating plants in between, because the underlying soils are richer in nutrients. This spatial mix of Phragmites australis and floating plants makes it difficult to differentiate Phragmites australis from forest, even for the RFC.
The relative contributions of the different features used in the RF classification were calculated and are presented in Figure 6. Each feature contribution was calculated in an accuracy-degradation manner; that is, the contribution of a feature to the classification is the normalized difference in accuracy between the two classifications that include and exclude the feature in question. It can be seen that the original NIR band, NDVI, and NDWI are the three largest contributors to the classification. This is partly because the land cover types in the study area have distinctive values of these features (Figure 5c,d). In addition, the analysis of these features helps reduce the often severe commission and omission errors in the classification of heterogeneous land surfaces from high spatial resolution imagery. Features with a relatively low contribution, such as the blue band, are still useful in differentiating between some types (Figure 5f). For example, the reflectance of blue light is low for both vegetation and water, but considerably higher for barren land, sand land, and urban impervious surfaces.
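The include/exclude comparison behind these contributions can be sketched as below. This is a hedged illustration, not the authors' procedure: the text does not specify how the accuracy differences were normalized, so scaling them to sum to 1 and clipping negative differences to zero are assumptions, and `evaluate` is a hypothetical callback that trains a classifier on the listed features and returns its accuracy.

```python
def feature_contributions(feature_names, evaluate):
    """Drop-one-feature contributions, normalized here (by assumption) to sum to 1.

    evaluate(names) is a hypothetical callback that trains a classifier using
    only the listed features and returns its accuracy.
    """
    baseline = evaluate(list(feature_names))
    drops = {}
    for name in feature_names:
        reduced = [f for f in feature_names if f != name]
        # Assumption: if accuracy improves without the feature, count it as zero.
        drops[name] = max(0.0, baseline - evaluate(reduced))
    total = sum(drops.values())
    if total == 0:
        return {name: 0.0 for name in feature_names}
    return {name: d / total for name, d in drops.items()}
```

A design consequence of this scheme is that the classifier is retrained once per feature, so the cost grows linearly with the number of features (19 in this study).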

Conclusions
This research has demonstrated the application of random forest classification to mapping wetland landcovers in an arid and semiarid area using multiple sources of remotely sensed data. To address the difficulties of wetland classification with conventional methods, an effective set of features was identified, including the spectral and geometric features derived from the high spatial resolution imagery and the phenological pattern recognized from a time series of images. A case study was conducted to classify the wetlands along the Ertix River in northern Xinjiang, China, based on the analysis of Pléiades-1B multispectral data and multi-date Landsat-8 OLI data. The random forest classification model was successfully developed with 19 features (e.g., spectral, textural, geometric, NDVI curve), and a satisfactory classification with an accuracy of 92.5% was achieved. It was found that the overall accuracy of the RF classification was 10% higher than that of the SVM and ANN classifiers. The spatial and geometric characteristics from image segmentation and the temporal pattern of NDVI, which reflects vegetation phenology, have added important dimensions to the classification, allowing the water bodies and the wetland vegetation to be more accurately differentiated from forests and crops.
Future work will focus on the statistical analysis of the probability information of adjacent land covers as well as the spatial distribution patterns of the various wetland landcovers. More accurate classification of wetland landcovers in arid and semiarid areas has significant practical value in supporting the effective management and protection of wetland resources.

Figure 1. The geographic location of the study area.

Figure 2. The normalized difference vegetation index (NDVI) time series curves of three wetland vegetation types after harmonic analysis.

Figure 3. The flowchart for the random forest classification of the wetland. OLI: operational land imager; NDVI: normalized difference vegetation index; FFT: fast Fourier transform.

Figure 4. The classification results of the RFC (a); SVM (b); and ANN (c) models.

Figure 6. Relative contributions of the features to the classification.

Table 1. The wetland classification system for the study area.

Table 2. Comparison of the overall accuracies of the three classifiers.

Table 3. The User's and Producer's accuracy of the RF classification (%). Columns: Class, Commission Errors, Omission Errors, Producer's Accuracy, User's Accuracy.