#### 2.3.1. Preprocessing

The ALOS-2/PALSAR-2 SLC data were converted into backscattering coefficients (σ^{o}) using the following equation (Equation (1)) [36]:

σ^{o} = 10 × log_{10}(I^{2} + Q^{2}) + CF − A  (1)

where I and Q are the real and imaginary parts of the SLC images, CF corresponds to the radiometric calibration factor (−83 dB), and A is the conversion factor (32 dB) [36]. For speckle noise attenuation, we employed the Refined Lee polarimetric filter with an adaptive window size of 7 pixels × 7 pixels. This filter preserves the statistics and the linear features of the images [37].
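As a sketch, Equation (1) can be applied pixel-wise to the SLC bands; the function below is a hypothetical NumPy implementation (the function name and API are ours, not part of the processing chain described above):

```python
import numpy as np

# Hypothetical pixel-wise implementation of Equation (1):
# sigma0 [dB] = 10 * log10(I^2 + Q^2) + CF - A,
# with CF = -83 dB and A = 32 dB as stated in the text.
def slc_to_sigma0(i, q, cf_db=-83.0, a_db=32.0):
    """Convert SLC real (I) and imaginary (Q) parts to sigma-zero in dB."""
    power = np.asarray(i, dtype=float) ** 2 + np.asarray(q, dtype=float) ** 2
    return 10.0 * np.log10(power) + cf_db - a_db
```

For example, a pixel with I² + Q² = 10¹⁰ yields 100 − 83 − 32 = −15 dB.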

The incoherent polarimetric parameters are derived from the power measurements in σ^{o} [38]. In this research, these parameters were generated to compose the set of attributes used in the machine-learning phase. The following indices were generated: Radar Vegetation Index (RVI) [39]; Radar Forest Degradation Index (RFDI) [40]; Canopy Structure Index (CSI); Volume Scattering Index (VSI); and Biomass Index (BMI) [41]. Parallel (co-pol) and cross-polarization (cross-pol) ratios were also generated [38].
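For illustration, two of the indices and the two polarization ratios can be computed from linear-power backscatter as below. The exact forms follow definitions commonly used in the cited literature (e.g., the Kim and van Zyl form of the RVI) and should be treated as our assumptions; the function names are ours:

```python
def rvi(hh, hv, vv):
    """Radar Vegetation Index from linear-power backscatter (Kim & van Zyl form)."""
    return 8.0 * hv / (hh + vv + 2.0 * hv)

def rfdi(hh, hv):
    """Radar Forest Degradation Index: co- vs. cross-pol power contrast."""
    return (hh - hv) / (hh + hv)

def co_pol_ratio(hh, vv):
    """Parallel-polarization (co-pol) ratio."""
    return hh / vv

def cross_pol_ratio(hv, hh):
    """Cross-polarization ratio."""
    return hv / hh
```

The functions accept scalars or NumPy arrays, since they only use arithmetic operators.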

Target decomposition aims to represent scattering processes as a sum of independent elements related to the physical scattering mechanisms [42]. The methods of target decomposition are classified into coherent and incoherent types [42,43]. Coherent decompositions assume the existence of deterministic scatterers and that the backscattered wave is completely polarized. In general, this type of target decomposition uses the Jones scattering matrix to represent the polarization states of the electromagnetic wave. Incoherent decompositions assume that scattering is not deterministic, so the backscattered wave is partially polarized. In this case, the power reflection matrices (covariance and coherency matrices) are used to characterize the backscattered wave [37,44].

In remote sensing applications, the assumption of the occurrence of pure deterministic targets is invalid [44], so the power reflection matrices are often used. In this study, we used only incoherent methods. The following algorithms were considered: van Zyl (three components) [45]; Freeman–Durden (three components) [46]; Yamaguchi (four components) [47]; and Cloude–Pottier (three components: entropy (H), anisotropy (A), and α angle) [42]. The decompositions were generated directly from the SLC images using the SNAP 6.0 application and a 5 pixels × 5 pixels window. No filters were applied to the power matrices used in the polarimetric decompositions.
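A minimal sketch of the Cloude–Pottier parameters is given below, assuming the standard eigen-decomposition of a 3 × 3 Hermitian coherency matrix T3: pseudo-probabilities p_i from the eigenvalues, entropy with a base-3 logarithm, anisotropy from the two smaller eigenvalues, and the mean α angle from the first components of the eigenvectors. This is our illustration, not the SNAP implementation:

```python
import numpy as np

def cloude_pottier(t3):
    """Entropy (H), anisotropy (A), and mean alpha angle (degrees) from a
    3x3 Hermitian coherency matrix T3 (single pixel or window average)."""
    w, v = np.linalg.eigh(t3)              # eigenvalues in ascending order
    w = np.clip(w[::-1], 0.0, None)        # descending, clipped to non-negative
    v = v[:, ::-1]                         # eigenvectors in matching order
    p = w / w.sum()                        # pseudo-probabilities p1 >= p2 >= p3
    logs = np.where(p > 0, np.log(np.where(p > 0, p, 1.0)) / np.log(3.0), 0.0)
    h = -np.sum(p * logs)                  # entropy, base-3 log -> H in [0, 1]
    a = (p[1] - p[2]) / (p[1] + p[2]) if (p[1] + p[2]) > 0 else 0.0
    alphas = np.arccos(np.clip(np.abs(v[0, :]), 0.0, 1.0))  # per-mechanism alpha
    alpha = np.degrees(np.sum(p * alphas)) # probability-weighted mean alpha
    return h, a, alpha
```

For a fully depolarizing target (three equal eigenvalues), this yields H = 1 and A = 0, as expected.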

#### 2.3.2. Image Segmentation and Attribute Extraction

The calibrated polarized images in terms of backscattering coefficients, the incoherent polarimetric parameters, and the polarimetric decomposition parameters were orthorectified using the range Doppler model specific to SAR sensors. The 30-m spatial resolution digital elevation model (DEM) obtained by the Shuttle Radar Topography Mission (SRTM) was used in the orthorectification process.

For the segmentation of the SAR images (σ^{o}_{HH}, σ^{o}_{HV}, σ^{o}_{VH}, and σ^{o}_{VV}), a multiresolution algorithm based on region growing was used [48], considering the scale factor and the homogeneity composition variables; the latter is divided into color and shape. The shape, in turn, is subdivided into compactness and smoothness. The scale defines the size of the segments of an image, and the homogeneity composition tests the equality between segments [49]. Only one level of segmentation was generated, with parameters defined from several empirical tests. A scale parameter of 50 was selected, and weights were assigned to the homogeneity criteria (shape = 0.10; color = 0.90; smoothness = 0.50; and compactness = 0.50).

After segmentation, the segment attributes were extracted. Among the various existing categories of attribute metrics, we selected the layer values (mean, standard deviation, and asymmetry) and the pixel-based minimum and maximum values. Thus, for each of the 25 images available (eight polarimetric parameters, 13 decomposition components, and four polarizations), the attributes of the two categories mentioned previously were extracted. Therefore, a set of 125 layers of attributes (five attributes for each of the 25 images) was used in the machine-learning-based classifications.
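The five per-segment attributes listed above could be extracted as in the hypothetical NumPy sketch below, with `labels` standing for the segment map produced in the segmentation step (function name and data layout are our assumptions):

```python
import numpy as np

def segment_attributes(image, labels):
    """For each segment id in `labels`, extract the five attributes used in
    the text: mean, standard deviation, asymmetry (skewness), minimum, and
    maximum of the pixel values belonging to that segment."""
    attrs = {}
    for seg_id in np.unique(labels):
        vals = image[labels == seg_id].astype(float)
        mu, sd = vals.mean(), vals.std()
        # Asymmetry as the standardized third central moment (skewness)
        skew = ((vals - mu) ** 3).mean() / sd ** 3 if sd > 0 else 0.0
        attrs[seg_id] = (mu, sd, skew, vals.min(), vals.max())
    return attrs
```

Applied to each of the 25 layers, this yields the 125 attribute layers mentioned above.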

#### 2.3.3. Classification and Validation

The following machine-learning classification algorithms were analyzed: NB, DT J48, RF, MLP, and SVM. An NB classifier employs Bayesian theory dealing with conditional probability and predictions of events, with strong (naive) independence assumptions. NB assumes that the presence (or absence) of a given feature of a class is not related to the presence (or absence) of any other feature. Depending on the precise nature of the probability model, NB classifiers can be trained very efficiently in a supervised learning framework. In many practical applications, parameter estimation for NB models uses the method of maximum likelihood; in other words, one can work with the NB model without assuming Bayesian probability or using any Bayesian methods. Despite their naive setting and apparently over-simplified assumptions, NB classifiers have performed quite satisfactorily in many complex real-world situations. An advantage of the NB classifier is that it only requires a reduced amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because the variables are assumed to be independent, only the variances of the variables for each class need to be determined, and not the entire covariance matrix [50,51,52].
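As an illustration of the "variances only" point above, a minimal Gaussian NB can be written as follows; the class name and API are ours, and this sketch is not the specific implementation used in the study:

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian naive Bayes: per-class means and variances only
    (no covariance matrix), reflecting the independence assumption."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Small constant avoids division by zero for constant features
        self.var_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        self.prior_ = np.array([(y == c).mean() for c in self.classes_])
        return self

    def predict(self, X):
        # log P(c) + sum_j log N(x_j; mu_cj, var_cj), per the independence assumption
        ll = (np.log(self.prior_)
              - 0.5 * np.sum(np.log(2.0 * np.pi * self.var_), axis=1)
              - 0.5 * np.sum((X[:, None, :] - self.mu_) ** 2 / self.var_, axis=2))
        return self.classes_[np.argmax(ll, axis=1)]
```

Maximum-likelihood estimation here reduces to the per-class sample means and variances, which is why so little training data is needed.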

The DT classifier (DT J48) consists of a graph that employs the “divide-and-conquer” approach to test attributes and assign classes to independent instances [53]. Basically, DTs are a non-parametric supervised learning method used for classification and regression. DTs learn from data to approximate a target function with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the better fitted the model. A DT builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets, while at the same time an associated DT is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches. A leaf node represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. DTs can handle both categorical and numerical data. There are several steps involved in building a DT. The first one is the process of partitioning the dataset into subsets, named splitting. Splits are formed on a particular variable. The second one is pruning, which corresponds to shortening the branches of the tree. Pruning is the process of reducing the size of the tree by turning some branch nodes into leaf nodes and removing the leaf nodes under the original branch. Pruning is useful because classification trees may fit the training data well but may do a poor job of classifying new values. A simpler tree often avoids over-fitting. Finally, the last step is tree selection, which is responsible for finding the smallest tree that fits the data. Usually, this is the tree that yields the lowest cross-validation error [53].
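The splitting step can be illustrated with a single Gini-based decision stump. This is a generic sketch of threshold selection, not the J48 (C4.5) criterion, which uses the information gain ratio:

```python
import numpy as np

def best_stump(X, y):
    """One splitting step of a DT: pick the (feature, threshold) pair that
    minimizes the weighted Gini impurity of the two child nodes."""
    def gini(labels):
        if labels.size == 0:
            return 0.0
        _, counts = np.unique(labels, return_counts=True)
        p = counts / labels.size
        return 1.0 - np.sum(p ** 2)

    best = (None, None, np.inf)  # (feature index, threshold, impurity)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (left.size * gini(left) + right.size * gini(right)) / y.size
            if score < best[2]:
                best = (j, t, score)
    return best
```

Recursing this split on each child node, then pruning, yields the full tree-building process described above.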

The RF algorithm was conceived by combining a large number of random DTs. Each tree contributes with only one class vote for each instance, and the final classification is determined by the majority of the votes of all forest trees [54]. The trees in RF are created by drawing a subset of training data through a bagging approach. The bagging approach randomly selects about two-thirds of the samples from the training data to train these trees. This means that the same sample can be used in a training subset several times, while others may not be selected in a particular subset [55]. In the RF algorithm, there are two main parameters to be defined: the number of variables in the random subset at each node (mtry) and the number of trees (ntree). Rodriguez-Galiano et al. [56] conducted an empirical evaluation of the “number of trees” parameter and reported that differences in classification accuracy are not meaningful beyond approximately one hundred trees; hence, we opted to use 100 trees in this work. Concerning the mtry parameter, the default value was adopted, which corresponds to the square root of the total number of features used in each experiment [57]. Other authors, however, rely on optimization procedures to assess the values of ntree and mtry, as described in [58,59]. RF has shown several advantages over other classifiers, since it is not based on strict parametric assumptions, besides being able to handle high-dimensional data and to deal with nonlinearity. However, RF has its limitations, such as longer computing time and higher algorithmic complexity compared to an individual DT [60].
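The two parameter choices and the voting rule can be sketched as follows (function names are ours; with the 125 attribute layers used here, the default mtry is ⌊√125⌋ = 11):

```python
import math
import random
from collections import Counter

NTREE = 100  # number of trees adopted in this work

def mtry(n_features):
    """Default mtry: square root of the number of features, rounded down."""
    return int(math.sqrt(n_features))

def bootstrap_sample(n, rng):
    """Bagging: draw n indices with replacement; on average about two-thirds
    of the samples are unique, and the rest are repeats."""
    return [rng.randrange(n) for _ in range(n)]

def majority_vote(tree_votes):
    """Final RF class for one instance = most common vote among all trees."""
    return Counter(tree_votes).most_common(1)[0][0]
```

Each tree would be grown on its own bootstrap sample, choosing among `mtry(n_features)` randomly selected variables at every node.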

The MLP is a forward-structure artificial neural network (ANN) trained by the backpropagation method, designed to map a set of input vectors to a set of output vectors [61]. An ANN can be simply defined as a massively parallel distributed computational device consisting of processing units, also called neurons or nodes, which are organized in layers. The neurons are responsible for the storage of knowledge acquired within the system, which is then made available for further use [62]. MLPs learn fast with high generalization and have a strong self-learning ability [61,63]. They are composed of an input layer to receive the signal, an output layer that makes a decision or prediction about the input, and, in between those two, an arbitrary number of hidden layers (or eventually none) that are the true computational engine of the MLP. MLPs with one hidden layer are capable of approximating any continuous function. These successive layers of processing units present connections running from every unit (neuron) in one layer to every unit in the next layer. The connections are responsible for passing information throughout the network, and they are characterized by weights, which are initially set in a random way and can be positive or negative [64]. All the neurons, except those belonging to the input layer, perform two simple processing functions: receiving the signal (activation) of the neurons in the previous layer and transmitting a new signal as the input to the next layer. The weights in the network can be updated from the errors calculated for each training example; this is called online learning. Alternatively, the errors can be accumulated across all of the training examples and the network updated at the end; this is called batch learning and is often more stable. The learning should be stopped when the validation set error reaches its minimum. At this very point, the net is able to attain the best generalization [62,65]. If learning is not stopped, overtraining occurs, and the performance of the net is jeopardized. Once a neural network has been trained, it can be used to make predictions.

The SVM is based on the concept of decision planes that define decision boundaries. A decision plane is one that separates a set of objects having different class memberships. According to [66], the possibility to maximize the margin (either side of a hyperplane that separates two data classes) and to create the largest possible distance between the separating hyperplanes has been acknowledged to reduce the upper bound of the expected generalization error. SVM supports both regression and classification tasks and can handle multiple continuous and categorical variables. Thus, SVM is primarily a classifier method that performs classification tasks by constructing hyperplanes in a multidimensional space that separate samples of different class labels, whether linearly separable or not [67,68]. This classifier is meant to maximize the distance between these hyperplanes and the class samples, in which the bordering samples are called support vectors. Multiclass problems are solved by pairwise classification. There are different algorithms to train an SVM, such as quadratic programming and the more efficient sequential minimal optimization (SMO), which uses heuristics to partition the training problem into smaller problems (that can be solved analytically), replaces all missing values, and transforms nominal attributes into binary ones, besides normalizing all attributes by default, aiming at minimizing an error function.

According to [31], there are approximately 132,000 hectares of soybeans and 85,000 hectares of maize in the study area. Cultivated pastures, forestlands, and shrublands are the other major classes found in the study area [29,30]. Field surveys were carried out on 10–11 September 2017 along the BR-010 and GO-118 highways with the purpose of identifying the major LULC classes present in the study area (Figure 3). Thus, we considered the following representative LULC classes: forestlands; shrublands; grasslands; reforestations; croplands; pasturelands; bare soils/straws; urban areas; and water reservoirs.

Based on LANDSAT-8 and higher spatial resolution images available in the Google Earth and Bing platforms, 200 training samples of segments were selected for each LULC class (except for grasslands, reforestations, and water reservoirs—25 samples each because of their limited occurrence in the study area). Another set of 959 segments was selected for validation purposes, according to approaches reported by [69]. A set of 1000 random and nonstratified points designed previously for the field campaign was considered. Forty-one segments were disregarded since they were located in hilly regions affected by layover or foreshortening effects associated with the radar image acquisition processes.

Thus, seven shapefiles were generated: five for training the classifiers (5, 25, 50, 100, and 200 samples per class), one for validation, and one for a classification using all 39,254 segments generated in the segmentation step. Each classification algorithm was trained five times, and the respective validations were carried out with the same set of 959 segments. The validations were performed based on the error matrices generated with the segments of each classification. The following validation metrics were used: global accuracy, Kappa index, conditional producer's accuracy (PA), and user's accuracy (UA). Hypothesis tests based on the standard normal distribution were also analyzed to compare Kappa indices and to evaluate the performances of the different classifications.
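The validation metrics can be computed from an error (confusion) matrix; the sketch below assumes rows are reference labels and columns are classified labels, with the function name being ours:

```python
import numpy as np

def accuracy_metrics(cm):
    """Global accuracy, Kappa index, producer's accuracy (PA), and user's
    accuracy (UA) from an error matrix (rows = reference, cols = classified)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    oa = np.trace(cm) / n                                   # global accuracy
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n ** 2   # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    pa = np.diag(cm) / cm.sum(axis=1)   # per-class producer's accuracy
    ua = np.diag(cm) / cm.sum(axis=0)   # per-class user's accuracy
    return oa, kappa, pa, ua
```

The per-classification Kappa values obtained this way are the inputs to the standard-normal hypothesis tests mentioned above.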