Segmentation of Vegetation and Flood from Aerial Images Based on Decision Fusion of Neural Networks

The detection and evaluation of flood damage in rural zones are of great importance for farmers, local authorities, and insurance companies. To this end, the paper proposes an efficient system based on five neural networks to assess the degree of flooding and the remaining vegetation. After a preliminary analysis, the following neural networks were selected as primary classifiers: you only look once network (YOLO), generative adversarial network (GAN), AlexNet, LeNet, and residual network (ResNet). Their outputs were connected in a decision fusion scheme, as a new convolutional layer, considering two sets of components: (a) the weights, corresponding to the proven accuracy of the primary neural networks in the validation phase, and (b) the probabilities generated by the neural networks as primary classification results in the operational (testing) phase. Thus, a subjective behavior (the individual interpretation of single neural networks) was transformed into a more objective behavior (an interpretation based on the fusion of information). The images, which are difficult to segment, were obtained from an unmanned aerial vehicle (UAV) photogrammetry flight after a moderate flood in a rural region of Romania and make up our database. For the segmentation and evaluation of the flooded zones and vegetation, the images were first decomposed into patches and, after classification, the resulting marked patches were re-composed into segmented images. From the performance analysis point of view, better results were obtained with the proposed system than with the neural networks taken separately and with respect to some works from the references.


Introduction
Detection and segmentation of small regions of interest (RoIs) from images (e.g., natural vegetation areas, crops, floods, forests, roads, buildings, waters, etc.) is a difficult task in many remote image processing applications. Recently, considerable efforts have been made in this direction, with applications in different domains like agriculture [1,2], environment [3,4], and transport [5,6]. On the other hand, the utility of surveillance/monitoring systems over various areas has been proven in the management of natural disasters [7] and rescue activities. Different solutions based on image analysis have been proposed for the detection and analysis of RoIs in areas affected by different types of natural disasters (floods, hurricanes, tornadoes, volcanic eruptions, earthquakes, tsunamis, etc.). Among these, floods are the most expensive type of disaster in the world and represented 31% of the economic losses generated by natural disasters during 2010-2018 [8]. Determining and evaluating flooded areas during or immediately after flooding in agricultural zones is important for the timely assessment of economic damage and for taking measures to remedy the situation.
ResNet and GoogLeNet with transfer learning provided good results in classifying different ground RoIs from satellite images [9].

Materials and Methods
As mentioned above, the proposed system for flood and vegetation assessment was based on information fusion from a set of efficient neural networks, considered as individual classifiers, grouped through a new convolutional layer into a global system. By fusing the individual decisions of the neural networks, considered as subjective factors (due to their specific learning), an increase in the degree of objectivity of the global classification was obtained.
The images were extracted from an orthophotoplan created from images acquired by a UAV in the real case of a flood in a rural region of Romania. Then, each image was decomposed into non-overlapped patches, each labeled as one of the mentioned classes: F (flood), V (vegetation), and R (rest).
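Concretely, the decomposition and the later re-composition can be sketched as below (a minimal illustration assuming NumPy image arrays and the 64 × 64 pixel patch size used in this paper; the function names are ours, not the authors' implementation):

```python
import numpy as np

PATCH = 64  # patch size in pixels (64 x 64, as used in this paper)

def decompose(image: np.ndarray):
    """Split an H x W x 3 image into non-overlapped PATCH x PATCH patches.

    Returns the patches together with their (row, col) grid positions.
    Border pixels that do not fill a whole patch are ignored in this sketch.
    """
    h, w = image.shape[:2]
    patches = []
    for i in range(h // PATCH):
        for j in range(w // PATCH):
            patches.append(((i, j),
                            image[i*PATCH:(i+1)*PATCH, j*PATCH:(j+1)*PATCH]))
    return patches

def recompose(patches, h, w):
    """Re-assemble classified (marked) patches into a segmented image."""
    out = np.zeros((h, w, 3), dtype=np.uint8)
    for (i, j), p in patches:
        out[i*PATCH:(i+1)*PATCH, j*PATCH:(j+1)*PATCH] = p
    return out
```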
The proposed neural networks were trained (the first phase, the learning phase) with a set of patches (the training set). A weight (corresponding to the confidence level) was established in the validation phase for each neural network. Based on the results of the previous works [46] and [39], a fusion system with increased performance was proposed and implemented for flood and vegetation assessment. The following types of neural networks are considered as primary classifiers (PCs): YOLO, cGAN, LeNet, AlexNet, and ResNet. The fusion algorithm considers two elements: the confidence level (associated with a weight) given to each PC, obtained after a validation phase, and the detection probabilities provided by these networks at the time of the operation itself. Each PC receives an input patch and provides an output patch of the same dimension, indexed with the class label (F-blue, V-green, and R-unchanged) and the associated probability, calculated using the cost (loss) function.
The selection of the CNNs was based on our previous studies [24,29,39] and also on the consultation of other relevant works. We considered the individual networks as subjective classifiers on account of their structure and learning. By combining multiple subjective decisions, each with an associated confidence (weight), we sought to create a more objective (global) classifier. The most important aspect was that an error committed by one classifier could be corrected by the information (the probability of belonging to a class) provided by the other classifiers.

UAV System for Image Acquisition
To increase the flood assessment area, we used a fixed-wing UAV with greater autonomy, higher speed, and a more extended operating area than a multicopter. The fixed-wing UAV MUROS was implemented by the authors in [47]. The main characteristics, flight requirements, and performances are given in Figure 1 and Table 1.

A portion of the UAV flight for image acquisition (GPS points marked) is presented in Figure 2a. From the successive images, acquired as the result of area surveillance, an orthophotoplan was created with dedicated software (Figure 2b). To this end, the successive images overlapped in both length and width by up to 60%. Then, images of 6000 × 4000 pixels were cropped, and regions like flood and vegetation were segmented based on the following operations, described in the above section: image decomposition into non-overlapped patches of 64 × 64 pixels; patch classification and marking; and, finally, patch recombination. Some patches were difficult to analyze because of mixed zones. We created a database of 2000 images from flooded rural areas.

YOLO Network
The CNN named decision YOLO, proposed in [39], operates at a global level, and its architecture is presented in Figure 3. The network has only convolutional layers, grouped into two parts: down sampling and up sampling. The number of parameters in each dimension-ascending layer is equal to the number of parameters in the corresponding layer on the descending side, establishing a connection between them.
The proposed network was created starting from YOLO by applying five combinations of convolutional layers followed by max pooling (the down sampling stream) and then five combinations of convolutional layers followed by up sampling. The architecture contains concatenations between the obtained ascending layer and the descending layer of the same dimensions (a U-net structure). For every two layers, the number of parameters doubles. Finally, the classification probability is provided, and this is used in the convolutional layer of the global classifier. In the integrated scheme of the proposed system, the YOLO CNN is referred to as the primary classifier PC1. In the case of the YOLO network, it was observed that the size of the patch sometimes influences the decision of the network. In the case of the presented application, the chosen size was 64 × 64 pixels, taking into account the size of the UAV images. An essential quality of these networks is the short learning time.
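A minimal PyTorch sketch of this down/up sampling pattern with skip concatenations is shown below. It is our illustration, not the network of [39]; the starting channel count, and doubling at every step rather than every two layers, are only indicative:

```python
import torch
import torch.nn as nn

class MiniUNet(nn.Module):
    """Sketch of the YOLO-derived U-net: 5 conv + max pooling down steps,
    5 conv + up sampling steps, with skip concatenations."""
    def __init__(self, ch=16, classes=3):
        super().__init__()
        self.down, self.up = nn.ModuleList(), nn.ModuleList()
        c_in = 3
        chans = [ch * 2**k for k in range(5)]          # channel counts grow
        for c in chans:                                # descending stream
            self.down.append(nn.Sequential(
                nn.Conv2d(c_in, c, 3, padding=1), nn.ReLU()))
            c_in = c
        for c in reversed(chans):                      # ascending stream
            self.up.append(nn.Sequential(
                nn.Conv2d(c_in + c, c, 3, padding=1), nn.ReLU()))
            c_in = c
        self.head = nn.Conv2d(c_in, classes, 1)        # class probabilities

    def forward(self, x):
        skips = []
        for d in self.down:
            x = d(x)
            skips.append(x)                            # saved for concatenation
            x = nn.functional.max_pool2d(x, 2)
        for u, s in zip(self.up, reversed(skips)):
            x = nn.functional.interpolate(x, scale_factor=2)
            x = u(torch.cat([x, s], dim=1))            # U-net skip connection
        return torch.softmax(self.head(x), dim=1)
```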


GAN Network
In this work, a modified variant of the original GAN was used, namely the conditional GAN (cGAN) [48], considering as starting images a pair formed by an original image and the original mask of the segmented image [46]. The generator component (G) of cGAN is of the encoder-decoder, U-net type (Figure 4). This typical architecture successively down-samples the data to a point and then applies the inverse procedure. Multiple connections between the encoder and decoder at corresponding levels can be observed in the G structure. The role of the discriminator (D) is to provide the information, with an associated probability, that a mask generated by G is a real one. Both G and D are based on typical layers, as presented in Table 2.
ReLU (R) is a neuron activation function, applied element-wise, that implements the mathematical function (1): R(x) = max(0, x). In certain situations (Figure 4), it is preferable to also consider the negative values and then use the LeakyReLU (LR) variant, which lets a fraction of the negative input pass (2): LR(x) = max(0, x) + α·min(0, x), where α is a small positive slope.

As in the case of the YOLO net, it should be noted that the use of the direct links in the U-net [49] of the generator does not stop the normal flow of data. As seen in Figure 4, there are three dropouts (DOs).
The dropout layer is the simplest way to combat overfitting and involves the temporary elimination of some network units. DO is active only in the learning phase. The T unit that appears at the last level of G is the tanh function (3): T(x) = (e^x − e^(−x))/(e^x + e^(−x)).
Similarly, the S unit that appears at the last level of D (Figure 5) is the sigmoid function (4): S(x) = 1/(1 + e^(−x)).
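For reference, a minimal NumPy sketch of the four activations referenced by Equations (1)-(4); the leak factor α = 0.2 is an assumed value, not taken from the paper:

```python
import numpy as np

def relu(x):                    # Equation (1)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.2):   # Equation (2); alpha is an assumed leak factor
    return np.where(x > 0, x, alpha * x)

def tanh(x):                    # Equation (3), the T unit at the last level of G
    return np.tanh(x)

def sigmoid(x):                 # Equation (4), the S unit at the last level of D
    return 1.0 / (1.0 + np.exp(-x))
```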
Table 2 (continued). Typical layers of G and D:
- pooling (P): sub-samples the image data from C and reduces the dimension of the feature map; the maximum or the average value of a small region is used;
- fully connected (FC): each node is connected to a node of the previous layer; it performs a classification task;
- batch normalization (BN): normalizes the output of a previous layer by adjusting and scaling the activations, increasing the NN speed;
- dropout (DO): reduces overfitting and improves the generalization of deep NNs.
These two main components are connected in a complex architecture (Figure 6), inspired from [46], especially in the learning phase, where both the G and D internal weights are established. As stated above, the architecture is a conditional generative adversarial network with the objective function V(D,G) (5). In the learning phase, a set of patches is used to create the corresponding real masks (RM) of flood or vegetation. The same set is introduced in G to obtain fake masks (FM). Two image pairs, RP and FP (real and fake), are considered as D inputs [46]. There are four comparators (two for D and two for G) that are based on the binary cross entropy criterion for comparisons.
One goal is to minimize the error and gradient between the real segmented image and a unit matrix of ones (UM). Another goal is to minimize the error and gradient between the fake segmented image and a null matrix of zeros (NM). The results are then used, via a weight optimizer (GW for G and DW for D), to update the weights. The procedure is repeated until the desired number of epochs (iterations) is reached.
G is effectively used only in the learning phase to establish the weights of D. Further, the role of D is to decide whether an image is real or fake and, especially, to provide the decision probability. Due to the sigmoid function (4), D provides a value in [0,1] that is the probability that a mask is a real one, and this is the probability that the tested patch belongs to a class. cGAN is referred to as primary classifier PC2 in the global classifier structure. The learning images (IL) come from our dataset with three classes: flood, vegetation, and rest.
Figure 6. Block diagram of the cGAN-based system for flood and vegetation detection. The following notations were used: IL-image for learning; RM-real mask; G-generator; FM-fake mask; RP-real pair; FP-fake pair; D-discriminator; UM-unit matrix; UC-unit comparator; DW-weight optimizer for the discriminator; Σ-adder; NC-null comparator; NM-null matrix; GW-weight optimizer for the generator; DC-comparator for D with UM; GC-comparator for G between RM and FM.
The following technologies were used to test the GAN: Torch, a machine learning framework, to implement the neural network, and Python (namely the NumPy and PIL-Python Image Library-libraries) to evaluate the accuracy of the cGAN results.
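A minimal PyTorch-style sketch of one learning iteration under this scheme is given below. It follows our reading of Figure 6 (D pulled toward UM on real pairs and toward NM on fake pairs, G pulled toward UM through D and toward RM), with binary cross entropy as the comparator criterion; the G and D modules and the optimizers are assumed to exist, and the masks are assumed to be scaled to [0,1]:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # binary cross entropy criterion used by the four comparators

def cgan_step(G, D, img, real_mask, opt_g, opt_d):
    """One cGAN learning iteration (sketch). img: IL, real_mask: RM in [0,1]."""
    fake_mask = G(img)                            # FM
    real_pair = torch.cat([img, real_mask], 1)    # RP
    fake_pair = torch.cat([img, fake_mask], 1)    # FP

    # Discriminator update: pull RP toward UM (ones), FP toward NM (zeros)
    d_real = D(real_pair)
    d_fake = D(fake_pair.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update: fool D (comparison with UM) and match RM (GC comparator)
    d_fake = D(fake_pair)
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + bce(fake_mask, real_mask)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```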

LENET
The LeNet-inspired network, containing five pairs of one convolutional layer followed by one max pooling layer, was created as in Figure 7 [39]. For simplicity, the number of parameters from one convolution to another was always doubled. In addition, the network contained one flattening layer and seven fully connected (dense) layers. LeNet is considered effective in recognizing handwritten characters, and the modified alternative has been used successfully in segmenting regions of interest in aerial imagery [39]. The experiments of the proposed application proved that it could intervene in a complementary way, as a consensus agent of the global system.
The cost function used is categorical cross-entropy (6), E = −(1/N) Σi Σj yi,j·log(ŷi,j), where N is the number of patches, C is the number of classes, ŷi,j is the prediction, and yi,j is the correct element, considered as the probability that patch i belongs to class j. In the case of a decision, a patch is considered as belonging to the class with the highest probability. LeNet is considered as primary classifier PC3 in the proposed global system.
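A minimal NumPy sketch of this cost function (our naming; the clipping constant is an assumption for numerical safety):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Equation (6) (sketch): y_true and y_pred are N x C arrays.

    y_pred[i, j] is the predicted probability that patch i belongs to class j;
    y_true is one-hot. Rows of y_pred are assumed to sum to 1 (softmax output).
    """
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Decision for each patch: the class (F, V, R) with the highest probability
# predicted = np.argmax(y_pred, axis=1)
```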

ALEXNET
AlexNet is considered as primary classifier PC4 in the proposed global system. It was chosen because it sometimes reacted complementarily to the other networks in terms of false positive or false negative areas, thereby contributing to the improvement of the overall classifier performance. The proposed AlexNet classifier, inspired from [33], is presented in Figure 8. This deep CNN has the ability of fast network training and the capability of reducing overfitting due to dropout layers.

The activation function used at the output was Softmax, which provided the probability of the image (patch) being part of one of the three classes: F, V, and R. In order to increase the number of images for the training phase, we used data augmentation by rotation (90°, 180°, 270°).
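This augmentation can be sketched in one helper (our illustration):

```python
import numpy as np

def augment_by_rotation(patch: np.ndarray):
    """Return the original patch plus its 90, 180, and 270 degree rotations."""
    return [np.rot90(patch, k) for k in range(4)]
```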


RESNET
ResNet is an ultra-deep feedforward network with residual connections, designed for large-scale image processing. It can have different numbers of layers: 34, 50 (the most popular), 101 (our choice), etc. ResNet resolved the vanishing gradient problem and achieved a good position in terms of image classification top-5 error.
The ResNet (residual net) architecture, used as primary NN classifier PC5, is presented in Figure 9. This deeper network has one of the best performances in object recognition accuracy. ResNet is composed of building blocks (modules), marked by A and B in Figure 9, with the same scheme of short (skip) connections. The shortcuts are used to keep the previous module outputs from possible inappropriate transformations. The blocks are named residual units [34] and are based on the residual function F (7), (8):

y_n = x_n + F(x_n, wg_n)   (7)
x_{n+1} = f(y_n)   (8)

where x_n is the block input, x_{n+1} is the block output, wg_n is the set of weights associated with block n, and f is ReLU.
A pipeline of repetitive modules A and B is described in detail in Figure 10a and 10b, respectively.
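A minimal PyTorch sketch of such a residual unit is given below; the two-convolution body is indicative only, since the bottleneck modules of the 101-layer variant used here have three convolutions:

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Sketch of a residual unit, Equations (7)-(8)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(                 # F(x_n, wg_n)
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        y = x + self.body(x)                       # Equation (7): skip connection
        return torch.relu(y)                       # Equation (8): f is ReLU
```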

System Architecture
As previously mentioned, the proposed system contains five primary classifiers of the deep neural network type (PCi, i = 1, 2, …, 5) that have two contributions each: the weights (wi, i = 1, 2, …, 5), fixed after the validation phase, and the classification probabilities (pi, i = 1, 2, …, 5), provided in the operational phase.
A PC weight is expressed as its accuracy ACC (9), computed from the parameters of the confusion matrix (TP, TN, FP, and FN are the true positive, true negative, false positive, and false negative cases, respectively): ACC = (TP + TN)/(TP + TN + FP + FN). A score Sj (j = F, V, R) is calculated for each class (F, V, and R), as can be seen in Equations (10), (11), and (12), respectively. These Equations are convolutional laws. The decision is made with the aid of a decision score (DS), and the class corresponds to the index obtained by maximum DS selection (13).
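The bodies of Equations (10)-(13) survive here only by description, so the sketch below assumes the natural reading of the convolutional law: the per-class score is the weighted sum of the classifier probabilities, and the decision is the class of maximum score. Function and variable names are ours:

```python
import numpy as np

def fuse(probs: np.ndarray, weights: np.ndarray):
    """Decision fusion sketch for one patch.

    probs[k, j]  : probability from primary classifier PCk for class j (F, V, R).
    weights[k, j]: weight of PCk for class j, i.e. its validation accuracy (9).
    Assumed form of Equations (10)-(12): S_j = sum_k w[k, j] * p[k, j];
    Equation (13): class = argmax_j S_j.
    """
    scores = (weights * probs).sum(axis=0)   # S_F, S_V, S_R
    classes = ("F", "V", "R")
    return classes[int(np.argmax(scores))], scores
```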
The main operating steps of the system are as follows (a minimal end-to-end sketch is given after the list), and the flow chart is presented in Figure 11:
1. The image is decomposed into patches of fixed size (64 × 64 pixels).
2. A patch is passed in parallel through the neural networks to obtain the individual classification probabilities.
3. The probabilities are merged by the convolutional law that characterizes the system, and the final decision of belonging to one of the classes F, V, or R is taken.
4. The patch is marked according to the respective class.
5. The patch is reassembled into an image of the same size as the original image.
6. Return to step 2 until the patches of the original image are finished.
7. The segmented image results.
8. Additionally, the patches from each class are counted in order to evaluate the extent of the specific flood and vegetation areas.
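Putting the steps together, a compact sketch of the loop (re-using the hypothetical decompose, recompose, and fuse helpers sketched earlier; classifiers and mark are assumed callables, not the authors' implementation):

```python
import numpy as np

def segment_image(image, classifiers, weights, mark):
    """Steps 1-8 (sketch). `classifiers` return the three class probabilities
    for a patch; `mark` colors a patch according to its label."""
    h, w = image.shape[:2]
    counts = {"F": 0, "V": 0, "R": 0}
    marked = []
    for (i, j), patch in decompose(image):                  # step 1
        probs = np.stack([pc(patch) for pc in classifiers]) # step 2
        label, _ = fuse(probs, weights)                     # step 3
        counts[label] += 1                                  # step 8
        marked.append(((i, j), mark(patch, label)))         # step 4
    return recompose(marked, h, w), counts                  # steps 5-7
```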
The architecture of the proposed system, based on decision fusion as expressed by the previous Equations (10)-(13), is presented in Figure 12. The system contains five classifiers, experimentally chosen based on the individual accuracy evaluated for each classification task and detailed in the next section.

The meanings of the notations in Figure 12 are the following: I-image to be segmented (input); ID-image decomposition in patches; Pi,j-patch of I at position (i,j); PCk-primary classifier (PC1-YOLO, PC2-GAN, PC3-LeNet, PC4-AlexNet, and PC5-ResNet); pk-probability of the Pi,j classification by PCk; wk-weight according to primary classifier PCk; FBC-fusion based classifier, which indicates the patch class and provides the marked patch; Si,j-classified patch; SIC-segmented image re-composition from classified patches; and SI-segmented image (output).
As can be seen from the previous Equations and Figure 12, a new convolutional layer is made by the FBC module with the weights wi, i = 1, 2, …, 5 (fixed after the validation phase). The inputs are the probabilities pi, i = 1, 2, …, 5, provided by the primary classifiers, and the output is the decision based on Equations (10)-(13).
The execution time differs from network to network and depends on the system implementation.

System Tuning: Learning, Validation, and Weight Detection
For the application envisaged in the paper, the system used three phases: learning, validation, and testing (actual operation). First, the learning phase was separately performed for each primary classifier, directly or by transfer learning, to obtain the best performances in terms of time and accuracy. Next, in a similar manner, in the validation phase the attached weight was obtained for each classifier. Finally, the images were processed by the global system presented in Figure 12.
From the images cropped from the orthophotoplan, 4500 patches were selected for learning (1500 flood patches, 1500 vegetation, and 1500 from the rest). Similarly, 1500 patches were selected for validation (500 from each class). As mentioned above, the weights associated with the primary classifiers were established in the validation stage. Examples of such patches are presented in Figure 13. For simplicity, we considered that the patches selected for training and validation contained pixels from a single type of region (F, V, or R). Since the convolution operation is not invariant to rotation and mirroring, these transformations could be applied to simulate new cases for the network. Thus, a 90° rotation and then a mirroring were applied to the previously obtained patches, increasing the image number four times (18,000 for learning and 6000 for validation). Both learning and validation were performed separately for the three types of regions (F, V, or R). Examples of flood and vegetation segmentation based on YOLO (PC1), GAN (PC2), LeNet (PC3), AlexNet (PC4), and ResNet (PC5) are given in Figure 14 (our dataset). For comparison, the original image and the manual segmentation versus the predicted segmentation are presented. As can be seen, segmentation errors appeared at the edges because of the mixed regions.
For each of the five neural networks and each of the three classes, the confusion matrices were calculated (examples in Figure 15 are given for YOLO, F, and V classes). Based on the confusion matrices, the performance parameters TP, TN, FP, FN, and ACC from (9) were evaluated. ACC for flood (92.8%) was better than ACC for vegetation (87.5%) because, generally, a flood patch is more uniform than a vegetation patch. As a result, the weights (Table 3) were evaluated (with two-digit approximation) for the primary classifiers PC1, PC2, PC3, PC4, and PC5, and also for each RoI (F, V, and R). To this end, the table presents the intermediate parameters (TP, TN, FP, and FN) used to calculate ACC and, finally, the weights. All the parameters were indexed by the class label (F, V, and R). It can be seen that the accuracy depended on the PCs and the classes. Thus, the flood accuracy ACC-F was greater than the other accuracies (ACC-V, vegetation, and ACC-R, rest) for all classifiers due to the reason mentioned above.

Figure 14. Examples of segmentation: original image, manual segmentation, and predicted segmentation for each primary classifier.

Table 3. Associated weights for the primary classifiers (PC). The meaning of the notations in Table 3 is as follows: NN-the neural network used; TP, TN, FP, and FN-the true positive, true negative, false positive, and false negative cases, respectively; ACC-accuracy; w-associated weight. They are indexed by class: F-flood, V-vegetation, and R-rest. Thus, TP-F means the true positives in terms of flood detection, ACC-F the accuracy in flood detection, wF the weight for flood detection, etc.

Experimental Results
After the learning and validation of the individual classifiers, the proposed system was tested in a real environment. As in the previous phases, the images were obtained from a photogrammetry flight over the same rural area in Romania, after a moderate flood, in order to accurately evaluate the damages in agriculture. Thus, our own dataset was obtained.
In the operational phase, the RoI segmentation was performed by the global system proposed in Figure 12. First, the image extracted from the orthophotoplan was decomposed in patches according to the methodology described in Section 3. The patch classification and segmentation were performed based on Equations (10)-(13), with the weights obtained in the validation phase (Table 3). For each patch, a primary classifier (Figure 12) gives the probability of belonging to a predicted class (see pij in Table 4). Some examples of patch classification and segmentation are given in Table 4. The decision score S is calculated as in Equation (13) and is approximated to three digits. The resulting patches are colored blue (F), green (V), or kept unchanged (R). The real patches and the segmented patches are labeled by the class name. For a correct segmentation, in Table 4, on the same row, the original and segmented image pair has the same label (F-F, V-V, or R-R). In this operational phase, the primary classifiers provide only the probabilities that the evaluated patch belongs to a predicted class, and the fusion based classifier evaluates the decision score for the final classification. After image re-composition, the segmented images are shown as in Figure 16. The examples in Table 4 show that most patches were correctly classified by all PCs, except the last two. The penultimate patch was misclassified by four PCs, and the overall result was a misclassification (R-F, in the last row of Table 4). The last patch was misclassified by two classifiers (PC1 and PC4), but the global classifier result was correct.

Figure 16. Examples of segmented images: original image and manual segmentation versus automatic segmentation.

In order to evaluate the flood extension and the remaining vegetation, the percentages of flood area (FA) and vegetation area (VA) were calculated for each analyzed image in the manual (MS) and automatic (AS) segmentation cases (Table 5). Finally, the percentage occupancy of flood and vegetation over the total investigated area was evaluated as the average of the image occupancy (the last row, Total, in Table 5). It can be seen that, generally, the flood segmentation was more accurate than the vegetation segmentation. Compared with the manual segmentation, the flood evaluation differed by 0.53%, and the vegetation evaluation differed by 0.84%.
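The occupancy percentages follow directly from the patch counts of the segmentation step; a minimal sketch (our naming):

```python
def area_percentages(counts):
    """Percentages of flood (FA) and vegetation (VA) area from patch counts."""
    total = sum(counts.values())
    return 100.0 * counts["F"] / total, 100.0 * counts["V"] / total
```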

Discussion
The proposed system based on decision fusion combines two sets of information: the PC weights obtained in a distinct validation phase, and the probabilities obtained in the testing (operational) phase. The primary neural networks were first tested experimentally. The accuracy differs from one network to another and from one type of class to another. Thus, Table 6 presents the accuracies of the individual networks (YOLO, GAN, LeNet, AlexNet, and ResNet) and of the global system for each class (F, V, and R), together with the mean. It can be observed that, for the individual networks, the best results were obtained for ResNet and for flood. In the case of the experiments described in this paper, the accuracy was lower for the YOLO network at the patch size of 64 × 64 pixels (86.7%). The authors analyzed the performance of the YOLO network on different patch sizes for the same images and found the following: for a patch size of 128 × 128 pixels, the accuracy was 89% [39], and for 256 × 256 pixels, the accuracy was 91.2%. This is explained by the mechanism of network operation within the convolution to preserve the size of the patch. The proposed system is adaptable to different patch sizes. In cases of more extensive floods, where no small portions of water mixed with vegetation appear, the system can operate with larger patch sizes, and the YOLO network will have a better accuracy (and, respectively, a higher weight in the global decision).
On the other hand, the purpose of combining several networks was to reduce the number of individual false positive and false negative decisions. It was found that in some cases the decision of the YOLO network led to a correct global decision. This aspect is also a motivation for choosing several neural networks for the global classification system.
Although a fusion based on a majority vote would have been simpler, we chose a fusion decision that takes into account a more complex criterion based on two elements: (a) the "subjectivism" of each network, expressed by the probability of classification, and (b) the rigid weights of these networks, previously established in the validation phase. This is the new convolutional layer of the proposed system, which leads to a more objective classification criterion than a simple vote criterion.
Why five networks and not more or fewer remains an open question. It is a matter of compromise, assumed by our experimental choice. A large number would lead to greater complexity, and thus greater computational effort and time, while a small number would lead to lower accuracy.
The classification results were good (Table 6); however, due to images that were difficult to interpret, erroneous results were also obtained. The main difficulties consisted of factors such as: (a) parts of the ground that were wet or recently dried could be confused with flood (R-F error), (b) uneven vegetation could be confused with class R (V-R error), and (c) green areas (trees) could cover the flooded area (V-F error). One such example is presented in the last row of Table 4; the patch contains a surface of land, recently dried, similar to flood (R-F error).
The obtained results were better than those of the individual neural networks (Table 6) and better than or similar to those of other works (Table 7). The papers in Table 7 were mentioned in the Introduction or Related Work sections and refer to works similar to ours. Only paper [39] addressed both flood and vegetation classification. The results obtained with different deep CNNs were less accurate than those of our global system. The authors in [1] used UAV images for vegetation segmentation, particularly different crops, based on the hue color channel and the corresponding histogram with different thresholds. The methods of papers [29] and [39] in Table 7 were tested on our database. In [1] the images are very similar to ours. The images in [4] and [13] come from applications (urban) very different from ours.
The authors in [4] used a VGG-based fully convolutional network (FCN-16s) for flooded-area extraction from UAV images on a new dataset. However, our dataset is more complex and difficult to segment. Compared with traditional classifiers such as SVMs, the obtained results are more accurate. As in our study, the problem of floods hidden under trees remains unresolved.
Our previous work [25] combined an LBP histogram with a CNN for flood segmentation, but the results were less accurate and the operating time was longer, because the LBP histogram must be calculated for each patch.
The images that characterize the two main classes, flood and vegetation, may differ inside an orthophotoplan due to the characteristics of the soil (color and composition), which influence the flood color, or due to the texture and color of the vegetation. The features also differ between orthophotoplans, depending on the season and location. Obviously, for larger areas there is the possibility of decreasing accuracy. We recommend performing the learning for each application with as many different representative patches as possible.
The network learning was done from patches considered approximately uniform (containing only flood or vegetation). For this reason, a smaller segmented area was usually obtained (mixed areas being uncertain). Other characteristic areas, such as buildings and roads, are less common in agricultural regions; they were assigned to the "rest" class. Another study [46] found that roads (especially asphalted ones) cannot be confused with floods.
Compared to our previous works [29] and [39], this paper introduces several neural networks, selected after performance analysis, into an integrative system (the global system) based on the fusion of decisions (probabilities). This system can be considered as a network of convolutional networks, the unification layer being also based on a convolutional law (10)-(12).
One of the weak points of our approach is the empirical choice of the neural networks that make up the global system (based only on experimental results). We relied on our experience in recent years and on the literature. On the other hand, we modified the well-known networks in order to obtain the best possible performances on our own database. The images used were very difficult to interpret because they contain areas at the boundary between flood and non-flood (for example, wetland and flood), and the vegetation areas are not uniform. On clearly differentiable regions of interest, the results are much better, but we wanted to demonstrate the effectiveness on real, difficult cases.
Neural image processing networks are constantly evolving: the old ones are modernized, and new ones appear. In this context, the question arises: how many networks, and which ones, should be involved in such a fusion based system? It is desirable to develop an objective mathematical criterion, based on optimizing parameters such as time, processing cost, and accuracy, that takes into account the specific application.

Conclusions
In this paper, we proposed an efficient solution for the segmentation and evaluation of flood and vegetation RoIs from aerial images. The proposed system combines, in a new convolutional layer, the outputs of five classifiers based on neural networks. The convolution is based on weights and probabilities and improves the accuracy of classification. This fusion of neural networks into a global classifier has the advantage of increasing the efficiency of segmentation, as demonstrated by the examples presented. The images tested and compared were from our own database, acquired with a UAV in a rural zone. Compared to other methods presented in the references, the accuracy of the proposed method increased for both flood and vegetation zones.
As future work, we propose the segmentation of more RoIs from UAV images using multispectral cameras. For more flexibility and adaptability to illumination and weather conditions, we will also consider the radiometric calibration of the images. We also want to create a bank of pre-trained neural networks that can be accessed and interconnected, depending on the application, to obtain the most efficient fusion classification system.
We also want to expand the application to monitoring the evolution of vegetation, which means both the creation of vegetation patterns and the permanent adaptation to the color and texture changes that take place during the year.
Author Contributions: L.I. contributed to the conception and design of the CNN, performed the experiments, and edited the paper. D.P. conceived of the paper, contributed to processing the data, analyzed the results, and selected the references. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by University POLITEHNICA of Bucharest.