Self-Organizing Deep Learning (SO-UNet)—A Novel Framework to Classify Urban and Peri-Urban Forests

Abstract: Forest-type classification is a complex and difficult task. The complexity increases for urban and peri-urban forests because of the variety of features present in remote sensing images. The success of forest management, including forest preservation, depends strongly on the accuracy of forest-type classification. Several classification methods are used to map urban and peri-urban forests and to distinguish healthy from non-healthy ones. Some of these methods have succeeded in classifying forests where others have failed. The successful methods rely on specific remote sensing technologies, such as hyper-spectral and very high spatial resolution (VHR) imagery. However, both VHR and hyper-spectral sensors are very expensive, and hyper-spectral sensors, unlike multi-spectral sensors, are not widely available on satellite platforms. Moreover, aerial images are limited in use, very expensive, and hard to arrange and manage. To address these problems, an advanced method, self-organizing deep learning (SO-UNet), was created to classify forests in the urban and peri-urban environment using multi-spectral, multi-temporal, medium spatial resolution Sentinel-2 images. SO-UNet combines two different machine learning technologies: unsupervised self-organizing map (SOM) artificial neural networks and the deep learning network UNet. Many experiments encompassing different settings of the parameters that control the algorithms were conducted, and the results showed that SO-UNet significantly outperforms UNet.


Introduction
Urban and peri-urban forests comprise all the trees and associated vegetation found in and around cities. They occur in a range of settings, including in managed parks, natural areas (e.g., protected areas), residential areas, and informal green spaces; along streets; and around wetlands and water bodies [1].
Urban and peri-urban forests provide fundamental ecosystem services. Particularly, they control and mitigate the environment, which includes cooling climates, reducing pollution via carbon sequestration processes, and watershed protection.
Urban and peri-urban forests also provide cultural services, which include natural heritage, recreation, aesthetics, knowledge transfer, and "sense of place" [2].
Monitoring urban and peri-urban forests using remote sensing data and classification algorithms is a complex task due to the differences among the features of forest cover in urban systems. These feature differences produce distinct spectral responses in remote sensing images, which reflect phenological variations among forest species as well as natural and anthropogenic stressors [3,4].
There are several remote sensing methods for monitoring forests in the urban and peri-urban environment. These methods are based on close-range remote sensing, such as spectroradiometers and unmanned aerial vehicles (UAVs) carrying different types of sensors, such as hyper-spectral and light detection and ranging (LiDAR) sensors.

Study Areas
Two different areas were selected in Lebanon, a country located in the east of the Mediterranean basin (Figure 1). The size of the first (urban) area is 5.67 km², and the size of the peri-urban area is 3.28 km².
The study areas have different geographic characteristics. The urban area is located in Beirut, the capital of Lebanon, at sea level, and it is densely populated. The peri-urban area, in Broumana, lies at a higher elevation of about 600 m and is characterized by steep to very steep slopes. Nature in the latter area suffers from overexploitation by tourism activities. Moreover, the Broumana study area suffers from the expansion of urban infrastructure and buildings and a decrease in forest cover, which adds further stress to the existing cover.
The Mediterranean climate in the study areas can be described in winter as moderate in Beirut and cold in Broumana. In summer, the average temperature in Beirut and Broumana is about 17 °C, and the average yearly precipitation is about 400 mm [25].

Data
In this research, we used European Space Agency (ESA) Sentinel-2 (A and B) products. Sentinel-2A is an optical remote sensing satellite with a spatial resolution that varies between 10 m and 60 m depending on the wavelength. It has a temporal resolution of 10 days, which improves to 5 days when combined with the Sentinel-2B optical satellite. Sentinel-2A and 2B cover wavelengths from 443 nm to 2190 nm, with spectral bandwidths that vary between 35 nm (red-edge bands) and 580 nm (short-wave bands).
The selected Sentinel-2 images are level-2 products, meaning that the images are atmospherically corrected and the pixel values represent surface reflectance. Moreover, the selected images have cloud cover of less than or equal to 5%. The acquisition dates of the selected Sentinel-2 images are listed in Table 1. These images were retrieved using Google Earth Engine (GEE). Scripts were written in JavaScript to retrieve these images, for example:

var collection = ee.ImageCollection('COPERNICUS/S2_SR') // Source of data
    .filterDate('2019-01-01', '2019-12-31')             // Dates of acquisition
    .filterBounds(AreaofStudyBoundaries)                // Study-area boundaries
    // Pre-filter to keep granules with less than 5% cloud cover
    .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 5));

Methods
After selecting the needed remote sensing data for the classification of forests in the urban and peri-urban environments, the next step was to create an efficient and robust classification method.
The suggested new method is a cooperative one that combines two different artificial neural network machine learning algorithms, one being an unsupervised and the other a supervised algorithm.
The first algorithm is UNet [18], a supervised deep learning algorithm, and the second algorithm is self-organizing maps (SOMs) [22,23].
The role of SOMs is to create the training and validation data that help UNet to classify the input image correctly and efficiently. To explain in detail the new cooperative method, the following subsection describes UNet's structure and functionality.

UNet Network
The deep learning architecture UNet [18] was created for use in biomedical image classification. It consists of an encoder network followed by a decoder network. In addition to the requirement of discrimination at pixel level, UNet semantic classification also requires a mechanism to project the discriminative features learned at different stages of the encoder onto the pixel space.
The encoder is the first half of the architecture (Figure 2). Normally, it is a pre-trained classification network such as ResNet [20], in which convolution blocks followed by max-pooling layers are applied as a downscaling process to encode the input image into feature representations at multiple levels.
The decoder is the second half of the architecture (Figure 2). The goal is to semantically project the discriminative features (lower resolution) learned by the encoder onto the pixel space (higher resolution) to obtain a compact and dense classification. The decoder consists of upscaling concatenation followed by regular convolution operations. UNet takes its name from the architecture which, when visualized, appears similar to the letter U, as shown in Figure 2. As one can notice from the schema, the final number of classes is 5, and one of these classes should be forests.
UNet transforms multi-dimensional input images into classified images. The network does not have a fully connected layer; only convolutional layers are used. The UNet architecture is built from several layer types: 2D or 3D image input layers and 2D or 3D convolutional layers used to downsample or to upsample (transposed convolution).
Each standard convolution process is activated by a ReLU activation function. A ReLU layer performs a threshold operation to each element of the input, where any value less than zero is set to zero. Max pooling performs downsampling by dividing the input into rectangular pooling regions and computing the maximum of each region.
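The ReLU thresholding and 2 × 2 max pooling described above can be sketched in a few lines of NumPy. This is a minimal illustration of the two operations, not the paper's implementation:

```python
import numpy as np

def relu(x):
    # Threshold operation: every element below zero is set to zero
    return np.maximum(x, 0)

def max_pool_2x2(x):
    # Downsample by taking the maximum of each non-overlapping 2x2 region
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([[1., -2., 3., 0.],
                        [-1., 5., -3., 2.],
                        [0., 1., 2., -4.],
                        [3., -1., 0., 6.]])
activated = relu(feature_map)       # negatives zeroed out
pooled = max_pool_2x2(activated)    # 4x4 map reduced to 2x2
```

Each 2 × 2 block contributes its maximum, halving the spatial resolution, which is exactly the downscaling step of the encoder.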
In the upsampling process, a transposed 2D convolution layer is used, followed by a depth concatenation layer that takes inputs that have the same height and width and concatenates them along the third dimension (channel dimension).
Finally, a Softmax layer (normalized exponential function) [9] is used as the last activation function of UNet. Its job is to normalize the output of the network to a probability distribution over the predicted output classes. The standard Softmax function Sm: ℝ^K → [0, 1]^K is defined by Equation (1):

Sm(z)_i = exp(z_i) / Σ_{j=1}^{K} exp(z_j), for i = 1, …, K and z = (z_1, …, z_K) ∈ ℝ^K, (1)

which applies the standard exponential function to each element z_i of the input vector z and normalizes these values by dividing by the sum of all the exponentials.
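Equation (1) can be illustrated with a short Python sketch. The subtraction of the maximum is a standard numerical-stability trick and is not part of Equation (1) itself; it leaves the result mathematically unchanged:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0, 0.5])  # one raw score per class (5 classes)
probs = softmax(scores)                         # probability distribution over classes
```

The outputs are all in (0, 1) and sum to 1, and the class with the largest raw score receives the largest probability.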
According to [18], the UNet architecture consists of encoder and decoder subnetworks connected by a bridge section, each made of multiple stages. The encoder depth, which specifies the depth of the encoder and decoder subnetworks, sets the number of stages and determines the number of max-pooling operations. Each stage of the encoder subnetwork consists of two sets of convolutional and ReLU layers, followed by a 2-by-2 max-pooling layer. Each stage of the decoder subnetwork comprises a transposed convolution layer for upsampling, followed by two sets of convolutional and ReLU layers. The bridge section consists of two sets of convolutional and ReLU layers. The bias terms of all convolutional layers are initialized to zero, and the convolutional layer weights in the encoder and decoder subnetworks are initialized using the 'He' weight initialization method [20]. Loss functions are a key part of any machine learning model: they define an objective against which the performance of the UNet model is measured, and the weight parameters learned by the model are determined by minimizing the chosen loss function. UNet uses the cross-entropy (log loss) function [26], a straightforward modification of the likelihood function using logarithms.
Cross-entropy compares the predicted class probability to the actual desired output and calculates a loss that penalizes the probability according to how far it is from the expected value. The penalty is logarithmic, yielding a large loss for predictions far from the true value and a small loss for predictions close to it.
Cross-entropy is used when adjusting model weights during training. The aim is to minimize the loss: the smaller the loss, the better the model, and a perfect model has a cross-entropy loss of 0. Cross-entropy is defined as in Equation (2):

CE = −Σ_{i=1}^{K} t_i log(p_i), (2)

where t_i is the truth label and p_i is the Softmax probability for the i-th class. Normally, UNet requires a large training dataset to function properly [27]; this was also evident in our experiments. Therefore, extracting enough training data may not be feasible, especially if the image is of medium spatial resolution and the features are hard to discern.
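Equation (2) can be checked with a small sketch: a confident, correct prediction yields a loss near zero, while a confident, wrong prediction is penalized heavily. The epsilon term is an assumption added to avoid log(0):

```python
import numpy as np

def cross_entropy(t, p, eps=1e-12):
    # t: one-hot truth labels, p: predicted (Softmax) probabilities
    # eps guards against log(0) for zero-probability predictions
    return -np.sum(t * np.log(p + eps))

t = np.array([0.0, 1.0, 0.0])      # true class is class 2
good = np.array([0.1, 0.8, 0.1])   # confident, correct prediction
bad = np.array([0.7, 0.2, 0.1])    # confident, wrong prediction

loss_good = cross_entropy(t, good)
loss_bad = cross_entropy(t, bad)
```

A perfect prediction (probability 1 on the true class) gives a loss of essentially 0, matching the statement above.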

Self-Organizing Maps (SOMs)
Self-organizing maps (SOMs) [22,23] are an unsupervised neural network method. SOMs convert patterns of arbitrary dimensionality into responses of two-dimensional arrays of neurons. One of the most important characteristics of SOMs is that the feature map preserves the neighborhood relations of the input pattern.
A typical SOM structure is given in Figure 3. It consists of two layers: an input layer and an output layer. The number of input neurons is equal to the dimensionality of the input data, and the output neurons are arranged in a two-dimensional array. The network is composed of an orthogonal grid of cluster units (neurons), each of which is associated with N internal weights for the N layers of the satellite image. At each step in the training phase, the cluster unit whose weights best match the input pattern is elected as the winner, usually using the minimum Euclidean distance, as in Equation (3):

‖x − W_l^k‖ = min_i ‖x − W_i^k‖, (3)

where x is the input vector, W_l^k is the weight of the winning unit l at iteration k, and W_i^k is the weight of neuron i at iteration k. The winning unit and a neighborhood around it are then updated in such a way that their internal weights move closer to the presented input. All the neurons within a certain neighborhood around the winner participate in the weight-update process. This learning process can be described by the iterative procedure in Equation (4):

W_i^{k+1} = W_i^k + H_li^k (x − W_i^k), (4)

where H_li^k is a smoothing kernel defined over the winning neuron. This kernel can be written in terms of the Gaussian function, as in Equation (5):

H_li^k = α_k exp(−‖r_l − r_i‖² / (2σ_k²)), with α_k = α_0 (1 − k/T_0), (5)

where r_l and r_i are the positions of units l and i on the grid, T_0 is the total number of iterations, α_0 is the initial learning rate (equal to 0.1), with the learning rate updated in every iteration, and σ_k is the search distance at iteration k, which can initially be half the length of the network. After the SOM converges to a balanced state, the original image is mapped from a high-dimensional to a low-dimensional space.
The number of clusters in this space is equal to the number of neurons in the SOMs network.
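Equations (3)–(5) can be put together in a minimal one-dimensional SOM sketch over toy pixel vectors. This is a simplified illustration with an assumed linear decay for both α_k and σ_k; the paper's modified SOM [23] differs in detail:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(pixels, n_units, n_iter=1000, alpha0=0.1):
    """Minimal 1-D SOM over pixel vectors with N bands (sketch of Eqs. (3)-(5))."""
    n_bands = pixels.shape[1]
    weights = rng.random((n_units, n_bands))
    grid = np.arange(n_units, dtype=float)   # unit positions on the map
    sigma0 = n_units / 2.0                   # initial search distance: half the network
    for k in range(n_iter):
        x = pixels[rng.integers(len(pixels))]
        # Eq. (3): the winner is the unit whose weights are closest to the input
        l = np.argmin(np.linalg.norm(weights - x, axis=1))
        alpha = alpha0 * (1 - k / n_iter)                # decaying learning rate
        sigma = max(sigma0 * (1 - k / n_iter), 1e-3)     # decaying search distance
        # Eq. (5): Gaussian neighborhood kernel centered on the winner
        h = alpha * np.exp(-((grid - grid[l]) ** 2) / (2 * sigma ** 2))
        # Eq. (4): pull every unit toward the input, weighted by the kernel
        weights += h[:, None] * (x - weights)
    return weights

# Toy "image": two spectrally distinct pixel groups with 4 bands
pixels = np.vstack([rng.normal(0.2, 0.02, (200, 4)),
                    rng.normal(0.8, 0.02, (200, 4))])
w = train_som(pixels, n_units=2)
# Map each pixel to its nearest unit: the cluster (low-dimensional) space
clusters = np.argmin(np.linalg.norm(pixels[:, None] - w, axis=2), axis=1)
```

After convergence, every pixel is assigned to one of the units, so the number of clusters equals the number of neurons, as stated above.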

SO-UNet
Although UNet's performance improves with the provision of large training samples, obtaining enough training samples is not feasible for medium- or low-spatial-resolution satellite remote sensing images. Transfer learning can effectively alleviate this problem by transferring the knowledge from the results obtained by the unsupervised SOMs to the target, in our case the supervised deep learning network. This transfer of knowledge from the self-organizing algorithm to UNet can improve the accuracy of UNet [28].
There are four main transfer learning categories: instance-based, mapping-based, network-based, and adversarial-based [29]. These transfer schemes work on partial information, such as partial training sets and parameters, using different structures. However, our main objective is to transfer a complete training set to the target deep learning algorithm, UNet. SOMs are capable of providing complete training datasets to UNet with no guidance (unsupervised). We call this scheme self-organizing transfer learning (Figure 4). SO-UNet consists of SOMs as a source for transferring training datasets to UNet. The SOMs are modified according to [23] to remove over-classification. The results obtained by the SOMs are then labeled according to the collected training datasets and fed to UNet. Furthermore, SO-UNet implements different phases to produce a final classified image.
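The labeling step, in which SOM clusters become a complete training set for UNet, can be sketched as follows. This is a hypothetical majority-vote helper: the paper states only that SOM results are labeled "according to the collected training datasets", so the function name and voting rule here are assumptions:

```python
import numpy as np

def label_som_clusters(cluster_map, reference_points):
    # Assign a class label to each SOM cluster by majority vote over a small
    # set of reference samples given as (row, col, class_id) tuples.
    votes = {}
    for r, c, cls in reference_points:
        votes.setdefault(cluster_map[r, c], []).append(cls)
    return {cid: max(set(v), key=v.count) for cid, v in votes.items()}

# Toy 4x4 cluster map, as if produced by a 2-cluster SOM run
cluster_map = np.array([[0, 0, 1, 1],
                        [0, 0, 1, 1],
                        [0, 1, 1, 1],
                        [0, 0, 1, 1]])
# Reference samples: (row, col, class) with class 1 = "pine", 2 = "buildings"
refs = [(0, 0, 1), (3, 0, 1), (0, 3, 2), (2, 2, 2)]
labels = label_som_clusters(cluster_map, refs)
# Full-coverage training labels for UNet: every pixel receives a class
dense_training = np.vectorize(labels.get)(cluster_map)
```

The key point is coverage: a handful of reference samples is propagated to every pixel of the image, which is what lets UNet train on a complete dataset.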

Verification Methods
The accuracy of the results is computed based on the confusion matrix method [30]. The matrix is of size m × m associated with a classifier that shows the predicted and observed classification ( Table 2).

T1 T2
Actual T1 True False T2 False True Using Table 2, the computation of the overall accuracy (OA) for each experiment is based on Equation (6): T1 and T2 can be actual or predicted samples, and these samples are either positive when the actual sample matches the predicted or negative otherwise.
Kappa coefficient κ [31] is another metric method for accuracy assessment. The Kappa coefficient is frequently used to test interrater agreement (Equation (7)). The coefficient value ranges from the worst agreement for value -1 to the perfect one for value +1. SO-UNet consists of SOMs as a source for transferring training datasets to UNet. SOMs are modified according to [23] to remove over-classification. Then, the results obtained by SOMs are labeled according to the collected training datasets, and then they are fed to UNet. Furthermore, SO-UNet implements different phases to produce a final classified image (  UNet with no guidance (unsupervised). We call this scheme self-organizing transfer learning ( Figure 4). SO-UNet consists of SOMs as a source for transferring training datasets to UNet. SOMs are modified according to [23] to remove over-classification. Then, the results obtained by SOMs are labeled according to the collected training datasets, and then they are fed to UNet. Furthermore, SO-UNet implements different phases to produce a final classified image (

Verification Methods
The accuracy of the results is computed based on the confusion matrix method [30]. The matrix is of size m × m associated with a classifier that shows the predicted and observed classification ( Table 2).

T1 T2
Actual T1 True False T2 False True Using Table 2, the computation of the overall accuracy (OA) for each experiment is based on Equation (6): T1 and T2 can be actual or predicted samples, and these samples are either positive when the actual sample matches the predicted or negative otherwise.
Kappa coefficient κ [31] is another metric method for accuracy assessment. The Kappa coefficient is frequently used to test interrater agreement (Equation (7)). The coefficient value ranges from the worst agreement for value -1 to the perfect one for value +1.

Verification Methods
The accuracy of the results is computed based on the confusion matrix method [30]. The matrix is of size m × m associated with a classifier that shows the predicted and observed classification ( Table 2). Using Table 2, the computation of the overall accuracy (OA) for each experiment is based on Equation (6): T1 and T2 can be actual or predicted samples, and these samples are either positive when the actual sample matches the predicted or negative otherwise.
Kappa coefficient κ [31] is another metric method for accuracy assessment. The Kappa coefficient is frequently used to test interrater agreement (Equation (7)). The coefficient value ranges from the worst agreement for value -1 to the perfect one for value +1. where pr 0 is the observed proportionate agreement (overall accuracy Equation (6)), and pr e is the overall random agreement probability (Equation (8)). According to [32], Kappa value = 0 indicates no agreement, while > 0-0.20 slight, 0.20-0.40 fair, 0.40-0.60 moderate, 0.60-0.80 substantial, and 0.80-1 almost perfect agreement.
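Equations (6)–(8) can be computed directly from a confusion matrix. A small sketch for a toy two-class matrix (rows: actual, columns: predicted):

```python
import numpy as np

def overall_accuracy(cm):
    # Eq. (6): correctly classified samples (diagonal) over all samples
    return np.trace(cm) / cm.sum()

def kappa(cm):
    # Eq. (7): kappa = (pr_0 - pr_e) / (1 - pr_e)
    n = cm.sum()
    pr0 = np.trace(cm) / n
    # Eq. (8): chance agreement from the row and column marginals
    pre = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n ** 2
    return (pr0 - pre) / (1 - pre)

cm = np.array([[40, 10],
               [5, 45]])   # toy confusion matrix for classes T1 and T2

oa = overall_accuracy(cm)   # 0.85
k = kappa(cm)               # 0.70, "substantial" agreement per [32]
```

For this matrix, pr_0 = 0.85 and pr_e = 0.5, giving κ = 0.70, which falls in the "substantial agreement" band of the scale above.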

Data Preparation
For each study area, four images with different acquisition dates were selected: one in the hot dry season (summer), one in spring, one in fall, and one in the cold wet winter. The image size is 222 × 165 pixels for the peri-urban area and 207 × 287 pixels for the urban area. In total, there are eight images, with each set of four representing one study area across the different seasons of 2019.
Two reasons justify our selection of the study areas and the temporal images: (1) to check the effect of change in the environment on the efficiency and accuracy of the new method; and (2) to verify the robustness of the new method concerning changes in image quality due to seasonal variations. The following images (Figure 6a,b) show some of the training sets for each study area over a Sentinel-2 image.

Different Algorithm Setup
Self-organizing maps (SOMs) are initialized with a specific number of neurons and iterations. The network size is based on the rule N + [1…L], where N is the number of classes and L ≥ 1 is the balance scale between under- and over-segmentation. L is odd if N is odd and even otherwise. In this research, N = 5, and L is equal to 1 or 3. The number of iterations is set to 1000, but the SOMs can halt before completing all the iterations upon reaching a balanced state (when the SOM weights change by no more than a defined threshold).
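Under one reading of the N + [1…L] rule, the candidate network sizes used in this research (N = 5 with L = 1 or L = 3) work out as follows. The helper name and the interpretation of the rule are our assumptions:

```python
def som_sizes(n_classes, l_values):
    # Network-size rule from the text: size = N + L, where L balances
    # under- and over-segmentation (our interpretation of N + [1...L])
    return [n_classes + l for l in l_values]

sizes = som_sizes(5, [1, 3])  # N = 5 with L = 1 or L = 3
```

This yields networks of 6 or 8 units, i.e., slightly more clusters than the 5 target classes, leaving room for the labeling step to merge over-segmented clusters.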
In our case, UNet initially consists of 58 layers (encoder depth = 4), meaning that four max-pooling operations are used. From each image, UNet extracts 160 randomly positioned patches of 64 × 64 pixels. The number of classes is 5 (buildings, pine trees, streets, bare land, and other trees), and the input size is patch size × patch size × number of bands, where the number of bands corresponds to four different wavelengths selected from the Sentinel-2 images. The number of epochs is initially set to 100, and the mini-batch size is 16. A mini-batch is a subset of the training set that is used to evaluate the gradient of the loss function and update the weights; mini-batch training is a combination of batch and stochastic training.
This means that the total number of iterations is 1000 (Equation (9)):

N_iter = N_ep × N_pat / M_pat, (9)

where N_iter is the number of iterations, N_ep is the number of epochs, N_pat is the number of patches, and M_pat is the mini-batch size. Being the major forest cover in the study areas, pine is the forest type singled out in the classification.
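The patch extraction and the iteration count of Equation (9) can be sketched together. The band values here are a placeholder random array standing in for the four selected Sentinel-2 bands; the extraction routine is an assumed implementation, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_patches(image, n_patches=160, size=64):
    # Extract randomly positioned size x size training patches
    # from a bands-last (H, W, bands) image
    h, w, _ = image.shape
    rows = rng.integers(0, h - size + 1, n_patches)
    cols = rng.integers(0, w - size + 1, n_patches)
    return np.stack([image[r:r + size, c:c + size] for r, c in zip(rows, cols)])

# Eq. (9): total iterations = epochs x (patches per image / mini-batch size)
n_epochs, n_patches, mini_batch = 100, 160, 16
n_iter = n_epochs * n_patches // mini_batch

image = rng.random((222, 165, 4))   # peri-urban image size with 4 selected bands
patches = random_patches(image)
```

With 160 patches and a mini-batch of 16, each epoch takes 10 iterations, so 100 epochs give exactly 1000 iterations.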

Results and Discussion
In the experiments, UNet is run first alone and trained using the collected samples. Then, SO-UNet is run next with training samples supplied by SOMs' labeled results. Changing SOM and UNet parameters is another way to check the efficiency and robustness of the created method compared to UNet.
The following graphs show the accuracies and mini-batch losses for four different runs of UNet and SO-UNet. Each method was run twice: once for the urban image and once for the peri-urban image. The two methods were run with different setups (Table 3), where only the image (patch) size, mini-batch size, and patches per image were varied such that the total number of iterations was always 1000, according to Equation (9); this also means that the number of epochs remained fixed. The classified urban and peri-urban images are for Sentinel-2 data acquired on 5 February 2019. The graphs in Figure 7a,b show the accuracy and speed of UNet and SO-UNet in the classification of urban and peri-urban forests.
The results show that as the patch size (image size) increases, the time required to complete 1000 iterations grows. Moreover, increasing the patch size increases the demand for computer resources, such as memory and central processing unit (CPU) use. The accuracy of UNet increases slightly with patch size, but that is not the case for SO-UNet, where a patch size of 64 × 64 remains the best setting. SO-UNet also remains faster than UNet. Furthermore, analyzing the graphs in Figure 8a-d, one can conclude that SO-UNet is more stable than UNet, whose accuracy fluctuates during the run. Finally, analyzing the loss function, one can clearly see that the values produced by UNet are lower than those of SO-UNet because UNet was trained with insufficient training samples, whereas SO-UNet was trained with the complete samples provided by the SOMs. Based on these experiments, SO-UNet adopted the settings that provided the highest accuracy (patch size = 64 × 64 and mini-batch size = 16). These settings were used to classify the other Sentinel-2 images, and the results of SO-UNet were compared with those of UNet.
Table 4 shows the accuracies of the classification results for both SO-UNet and UNet.
It can be seen in Table 4 that the urban and peri-urban images classified by SO-UNet maintained a stable range of accuracies between 80.5% and 85.6%, whereas the classification results of UNet had fluctuating accuracies ranging between 18.2% and 45.2%.
The classification results of the peri-urban temporal images by UNet and SO-UNet are shown in Figure 9a-h. The first row shows the peri-urban images classified by UNet, and the second row shows those classified by SO-UNet. UNet's classification of the forests differs from image to image, whereas the results of SO-UNet are almost identical across dates. These results demonstrate the instability of UNet in the classification of complex images, such as remote sensing images. The UNet and SO-UNet classifications of urban forests using the different temporal images are displayed in Figure 10a-h. Visually inspecting the images classified by SO-UNet, one can easily see the resemblance of many features between images collected in different seasons, which is not the case for UNet's images. This again demonstrates the efficiency and stability of SO-UNet compared with UNet. Urban forests are among the most complex features to discern from remote sensing images using any known advanced classification technique; however, cooperative techniques can help to improve feature discrimination tasks.

Conclusions
In this research, a new deep learning classification method, SO-UNet, was created. It combines unsupervised self-organizing maps with the supervised deep learning network UNet: the SOMs supply complete training datasets that allow UNet to classify multi-spectral, multi-temporal, medium spatial resolution Sentinel-2 images of urban and peri-urban forests. The experiments showed that SO-UNet is significantly more accurate and more stable than UNet across study areas, seasons, and parameter settings.