Integrating Multitemporal Sentinel-1/2 Data for Coastal Land Cover Classification Using a Multibranch Convolutional Neural Network: A Case of the Yellow River Delta

Coastal land cover classification is a significant yet challenging task in remote sensing because of the complex and fragmented nature of coastal landscapes. However, the availability of multitemporal and multisensor remote sensing data provides opportunities to improve classification accuracy. Meanwhile, deep learning has developed rapidly, achieving astonishing results in computer vision tasks and becoming a popular topic in the field of remote sensing. Nevertheless, designing an effective and concise deep learning model for coastal land cover classification remains problematic. To tackle this issue, we propose a multibranch convolutional neural network (MBCNN) for the fusion of multitemporal and multisensor Sentinel data to improve coastal land cover classification accuracy. The proposed model leverages a series of deformable convolutional neural networks to extract representative features from each single-source dataset. The extracted features are aggregated through an adaptive feature fusion module to predict the final land cover categories. Experimental results indicate that the proposed MBCNN performs well, with an overall accuracy of 93.78% and a Kappa coefficient of 0.9297. Inclusion of multitemporal data improves accuracy by an average of 6.85%, while multisensor data contributes an accuracy increase of 3.24%. Additionally, the feature fusion module in this study increases accuracy by about 2% when compared with the feature-stacking method. The results demonstrate that the proposed method can effectively mine and fuse multitemporal and multisource Sentinel data, which improves coastal land cover classification accuracy.


Introduction
Coastal regions play an important role in social and economic development around the globe [1–5]. According to previous studies [2], about 24% of the world's population lives in coastal areas. Meanwhile, coastal regions are home to many valuable wetland ecosystems, which perform various functions beneficial to the sustainability of human society, including flood control, water quality improvement, biodiversity conservation, and maintaining the supply of fisheries and other resources [2–5]. However, due to anthropogenic activities and global climate change, coastal regions have experienced rapid land cover changes over the last few decades [4,5]. Therefore, accurate and timely monitoring of coastal regions by means of remote sensing is of great significance to regional sustainable development.
In fact, accurate land cover classification of complex coastal regions is a challenging task [2,6–8]. The challenges are mainly two-fold. On the one hand, the highly fragmented landscape of coastal regions leads to large variations in the shape and scale of land objects, which increases intraclass variability and reduces interclass separability. On the other hand, some vegetation classes (e.g., grassland and cropland) may have overlapping spectral reflectance at peak biomass, which also makes accurate classification difficult.
Many studies have been conducted on accurate coastal land cover classification [2,6–8]. In coastal areas, although some crops and natural vegetation share similar spectral features during the peak growing season, they may have different seasonal variations and temporal characteristics. Therefore, the inclusion of multitemporal remote sensing data could improve classification accuracy when compared with monotemporal data alone. Davranche et al. [6] used multiseasonal SPOT-5 imagery and decision trees for coastal wetland classification in southern France. Yang et al. [7] adopted seasonal optical imagery for coastal land cover classification and demonstrated that combining multiseasonal images considerably improves classification accuracy over any single-date classification. In our previous work [8], we also utilized multitemporal Landsat data to monitor cropland dynamics of the Yellow River Delta and justified the role of multitemporal data in classification.
Meanwhile, because of the availability of diverse remote sensors, researchers have started to integrate multisensor data for better classification of coastal areas [9–13]. Specifically, fusion of optical and radar data has been widely studied [9–13]. Optical images mainly contain information regarding reflectance and emissivity characteristics of land surfaces [9], while radar data are associated with the structural, textural, and dielectric properties of land objects [10]. Therefore, optical and radar data can complement each other, resulting in improved coastal land cover classification. Rodrigues et al. [9] used multisensor data from Landsat-7 and RADARSAT-1 to identify and map tropical coastal wetlands in the Amazon of northern Brazil. Beijma et al. [10] investigated the use of multisource airborne radar and optical data to map natural coastal salt marsh vegetation habitats.
Since the successful implementation of the European Copernicus program created by the European Space Agency (ESA), Sentinel-1 radar data and Sentinel-2 optical data have been available via open access, providing new insights for remote sensing applications, especially for large-scale environmental monitoring [14–19]. For instance, Hird et al. developed a workflow for large-area probabilistic wetland mapping based on Google Earth Engine (GEE) and Sentinel-1 and Sentinel-2 data [14]. Mahdianpari et al. also adopted GEE and multisource Sentinel data to generate the first detailed (category-based) provincial-level wetland inventory map [15]. Therefore, we are highly interested in integrating multitemporal and multisensor Sentinel data for accurate coastal land cover classification.
In addition, all the above studies are based on handcrafted features and conventional machine-learning classifiers, which may fail to capture high-level features of complex, heterogeneous coastal landscapes. Deep learning [20], on the other hand, can discover informative features with multiple levels of representation and has achieved astonishing performance in computer vision applications [21–26], such as image classification [21], object detection [23], and semantic segmentation [24]. Recently, deep learning, especially deep convolutional neural networks (CNNs), has also been successfully applied in many remote sensing applications [27–37]. Rezaee et al. [34] applied a pre-trained AlexNet [21] for wetland mapping using monotemporal optical imagery. Rußwurm et al. [35] utilized sequential recurrent encoders and multitemporal Sentinel-2 optical data for land cover classification, which achieved state-of-the-art classification accuracies. Ji et al. [36] proposed a three-dimensional (3D) CNN for crop classification with multitemporal remote sensing images and concluded that a 3D CNN was suitable for characterizing the dynamics of crop growth. Mahdianpari et al. [38] investigated state-of-the-art deep learning models for classification of complex wetland classes and identified InceptionResNetV2, ResNet50, and Xception as the top three models.
Despite the improvements made by deep learning in the remote sensing field, the two challenges of coastal land cover classification mentioned above remain to be solved. In the context of deep learning, the two issues can be restated as follows: (1) how to build a concise and effective deep learning model that accounts for variations in shapes and scales in fragmented coastal regions, and (2) how to design a fusion mechanism that adaptively fuses multitemporal and multisensor remote sensing data.
To address these issues, this study proposes a multibranch convolutional neural network (MBCNN) for coastal land cover classification using multitemporal and multisensor Sentinel data. First, a single-branch CNN is proposed to extract representative features from each monotemporal, single-sensor Sentinel dataset. A deformable multiscale residual block is utilized in the single-branch CNN to account for shape and scale variations. Afterwards, multiple single-branch CNNs are integrated through an adaptive fusion module, which is inspired by squeeze-and-excitation networks [25], to predict the final land cover category. The selected study region is the Yellow River Delta, which is the largest natural delta of China and home to abundant coastal wetlands [39–42].
The rest of the paper is organized as follows. Section 2 introduces the study area and the dataset used. Section 3 presents the architecture and training details of the proposed multibranch neural network. Section 4 shows the experimental results and discussion, while Section 5 provides the main conclusions and suggestions for future work.
The contributions of this study are mainly two-fold. (1) We have designed a concise yet effective deep learning model for coastal land cover classification, which adopts deformable convolutional layers to account for variations in the scales and shapes of coastal landscapes. (2) We have proposed a feature-level fusion module based on squeeze-and-excitation networks for multitemporal and multisensor Sentinel data fusion to boost coastal land cover classification accuracy.

Study Area
The Yellow River Delta is the largest natural delta of China and is home to many coastal wetlands (Figure 1). In this study, the Yellow River Delta refers to the Yellow River Delta National Nature Reserve [39], which is located northeast of Dongying City in Shandong Province in China. Due to deposition of abundant sediments transported by the Yellow River, the newly created wetland has increased by 30 km² per year, making the Yellow River Delta one of the fastest growing sedimentation areas around the globe [40,41].
The Yellow River Delta has a temperate, continental monsoon climate, with a hot, humid summer and a cold, dry winter. It has an annual temperature of about 11.9 °C and an annual precipitation of about 640 mm [41]. The natural vegetation includes reed, tamarisk, Suaeda, and Robinia, while the main crops include rice, lotus, corn, winter wheat, and cotton [42].
A field survey was conducted in July 2018. A total of 163 sampling sites were visited. Land cover types, photographs, and global positioning system (GPS) locations were recorded for each sampling site. According to the field survey and previous studies [8,39–42], there were 11 land cover categories in this study: forest, grassland, salt marsh, shrubs, tidal flat, bare soil, clear water, turbid water, irrigated farmland, dry farmland, and built up. Landscape descriptions for each land cover category are shown in Table 1. Training and testing samples were derived from remote sensing images through visual inspection based on sampling site GPS locations and recorded land cover types. The spatial distributions of training and testing samples are depicted in Figure 2a,b, respectively. The numbers of training and testing samples (in pixels) are also shown in Table 1. To make the accuracy assessment more objective and convincing, we used twice as many testing samples as training samples.

Dataset Used
Because of the availability of Sentinel-1 and Sentinel-2 data, these data have been integrated for various applications such as vegetation mapping [16], soil moisture monitoring [17], and crop classification [18]. In this study, both multitemporal and multisensor Sentinel data over an entire growing season were utilized for coastal land cover classification (Table 2). Specifically, radar datasets were obtained from Sentinel-1 Level-1 ground range detected (GRD) images with a spatial resolution of 10 m × 10 m [16]. The synthetic aperture radar (SAR) onboard Sentinel-1 operates at the C-band with a revisit time of six days [17]. Preprocessing of Sentinel-1 SAR data was implemented using the Sentinel-1 Toolbox provided by the ESA [18], including radiometric calibration, speckle noise reduction, and terrain correction, which outputs geo-coded backscattering coefficients of VV (vertical transmit and vertical receive) and VH (vertical transmit and horizontal receive) polarizations.
Optical datasets were obtained from Sentinel-2 MSI Level-1C products under cloud-free conditions. Sentinel-2 MSI has 13 bands ranging from 443 to 2190 nm and a spatial resolution from 10 to 60 m with a revisit time of five days [18]. In this study, only bands at 10 m (Bands 2, 3, 4, and 8) and 20 m (Bands 5, 6, 7, 8A, 11, and 12) resolutions were selected. Sen2Cor [19] was used to perform atmospheric correction to obtain the bottom-of-atmosphere (BOA) Level-2A product. In order to co-register with Sentinel-1 SAR data, all Sentinel-2 bands at 20 m resolution were resampled to 10 m using bilinear interpolation.

Remote Sens. 2019, 11 FOR PEER REVIEW
Note. S1: Sentinel-1; S2: Sentinel-2; GRD: ground range detected.

Overview of a Multibranch Convolutional Neural Network (CNN)
Figure 3 shows the overview of the proposed multibranch CNN model for coastal land cover classification. As shown in Figure 3, the multibranch CNN model had two major components: (1) a feature extraction module based on a single-branch CNN, and (2) a feature fusion module to aggregate the extracted features for final land cover classification. Each single-branch CNN had the same network structure. Deformable convolutions [23,24] and multiscale residual blocks [22] were introduced to model the land surface with various shapes and scales. The extracted features from each branch were fed into an adaptive feature fusion module, through which the multitemporal and multisensor data were effectively synthesized for final classification.

Brief Introduction of CNNs
To better understand our proposal, a brief introduction to CNNs is provided in this section. Generally, a typical CNN architecture is composed of alternately stacked convolutional layers, pooling layers, and fully connected layers [29].

Convolutional Layers
Convolutional layers are of great significance in a CNN. High-level representative features can be extracted through the stacking of multiple convolutional layers. The input into a convolutional layer is a feature map x with a size of m × n × c, where m × n denotes the spatial size of the feature map, while c is the number of input channels. Supposing the convolutional layer consists of k filters, the output would be an m' × n' × k feature map with k channels and a spatial size of m' × n'. The ith output feature map of the convolutional layer, $y_i$, can be expressed as follows:

$$ y_i = w_i \ast x + b_i $$

where $w_i$ and $b_i$ denote the weights and bias of the ith filter, and $\ast$ is the convolution operator. Afterwards, a nonlinear activation function (e.g., the rectified linear unit [43]) is usually applied to the output feature map to increase the nonlinear learning ability of the network.
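The convolutional layer described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: stride 1, no padding, and the cross-correlation form that most deep learning libraries use for "convolution". The function name `conv_layer` and the array layout are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def conv_layer(x, w, b):
    """Valid (no padding), stride-1 convolutional layer followed by ReLU.

    x: input feature map, shape (m, n, c)
    w: k filters, shape (k, fh, fw, c)   (illustrative layout)
    b: k biases, shape (k,)
    Returns an array of shape (m - fh + 1, n - fw + 1, k).
    """
    m, n, _ = x.shape
    k, fh, fw, _ = w.shape
    out = np.zeros((m - fh + 1, n - fw + 1, k))
    for i in range(k):                          # one output channel per filter
        for r in range(out.shape[0]):
            for s in range(out.shape[1]):
                window = x[r:r + fh, s:s + fw, :]
                out[r, s, i] = np.sum(window * w[i]) + b[i]   # y_i = w_i * x + b_i
    return np.maximum(out, 0.0)                 # rectified linear unit
```

With a 5 × 5 × 2 input and three 3 × 3 filters, the output has a spatial size of 3 × 3 and three channels, which corresponds to m' × n' × k in the text.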

Pooling Layers
Pooling layers are used to generalize the convolved features through down-sampling. The spatial size of the input feature map is reduced after a pooling operation, which decreases the number of parameters and computational complexity. Commonly used pooling layers include max pooling and average pooling, which use the maximum or average operator to extract values for local spatial regions, respectively.
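A minimal sketch of the two pooling operators is given below, assuming non-overlapping square windows (window size equal to stride). The helper `pool2d` is a hypothetical name introduced only for illustration.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling over the two spatial axes of an (m, n, c) map."""
    m, n, c = x.shape
    m2, n2 = m // size, n // size
    # Drop incomplete border rows/columns, then expose each window as its own axes.
    blocks = x[:m2 * size, :n2 * size, :].reshape(m2, size, n2, size, c)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))
```

A 4 × 4 map pooled with a 2 × 2 window becomes a 2 × 2 map, i.e., the number of spatial positions passed to later layers is reduced by a factor of four.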

Fully Connected Layers
The role of fully connected layers is to combine all input features by reshaping them into an N-dimensional vector. Simple logistic regression is used by the fully connected layers. Finally, the extracted feature vector is fed into the softmax classifier [43] to generate the probability distribution.
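The flatten, fully connected, and softmax steps can be illustrated as follows. This is a sketch, not the authors' code: the function name `dense_softmax` and the weight shapes are assumptions, and the maximum logit is subtracted before exponentiation purely for numerical stability.

```python
import numpy as np

def dense_softmax(features, weights, bias):
    """Flatten a feature map, apply one fully connected layer, then softmax.

    features: array of any shape with N elements in total
    weights:  shape (N, n_classes)
    bias:     shape (n_classes,)
    Returns class probabilities that sum to 1.
    """
    logits = features.reshape(-1) @ weights + bias
    logits = logits - logits.max()      # stabilizes exp() without changing the result
    exp = np.exp(logits)
    return exp / exp.sum()
```

For the 11 land cover categories in this study, `weights` would have 11 columns and the output would be an 11-element probability vector.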

Single-Branch CNN for Feature Extraction
Accurate coastal land cover classification requires a set of well-established and representative features. In this study, to account for complex and fragmented coastal landscapes, we first proposed a single-branch CNN based on both deformable convolutions and multiscale residual blocks (Figure 4).
Figure 4 illustrates that the input of the proposed single-branch CNN is an image patch centered on the labeled pixel with a size of k × k × c, where k is the patch size and c is the number of channels. The proposed network consisted of several convolutional layers, max pooling layers, and deformable multiscale residual blocks; detailed information is listed in Table 3.
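The patch-based input can be sketched as below. The handling of pixels near the image border is not described in the text, so reflect padding is used here purely as an assumption; `extract_patch` is a hypothetical helper name, and k is assumed to be odd so that the labeled pixel sits exactly at the patch center.

```python
import numpy as np

def extract_patch(image, row, col, k=11):
    """Return the k x k x c patch centered on pixel (row, col).

    image: array of shape (height, width, c)
    k:     odd patch size
    Border pixels are handled by reflect padding (an assumption).
    """
    half = k // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    # After padding, original pixel (row, col) sits at (row + half, col + half).
    return padded[row:row + k, col:col + k, :]
```

For a Sentinel-2 image with the ten selected bands, `extract_patch(image, r, c, k=11)` would return an 11 × 11 × 10 array for the labeled pixel at (r, c).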

The deformable multiscale residual block was inspired by both deformable convolution [23,24] and a multiscale residual block [22]. Specifically, the multiscale residual block was borrowed from Bulat et al. [22], which had the merits of extracting hierarchical and multiscale features and improving gradient flow at the same time. By introducing deformable convolution into the multiscale residual block, the receptive field and sampling locations were trained to be adaptive to the shapes and scales of land objects, which enabled extraction of robust and representative features. Figure 5 shows the structure and parameters of the deformable multiscale residual blocks.
The mechanism of deformable convolution is illustrated in Figure 6. The offset field was derived from input feature maps, and the deformable kernel had the same resolution as the current convolutional layer [23]. Both the kernels and offsets were learned simultaneously during the training process. Therefore, the output feature y at location $p_0$ can be formalized as follows:

$$ y(p_0) = \sum_{i} w(p_i) \, x(p_0 + p_i + \Delta p_i) $$

where w refers to the weights of the sampled points, x refers to the input feature map, $p_i$ means the ith location, and $\Delta p_i$ represents the offset to be learned [23,24].
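To make the sampling rule concrete, the following sketch evaluates a deformable convolution at a single location of a single-channel feature map. It assumes a 3 × 3 kernel, bilinear interpolation at fractional sampling positions (as in the deformable convolution literature [23]), and zero values outside the map. The offsets are passed in directly; in the network they would be predicted by an additional convolutional layer. All function names are illustrative.

```python
import numpy as np

def bilinear_sample(x, r, c):
    """Bilinearly interpolate a 2-D map x at fractional location (r, c).

    Locations outside the map contribute zero.
    """
    h, w = x.shape
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < h and 0 <= cc < w:
                weight = (1.0 - abs(r - rr)) * (1.0 - abs(c - cc))
                value += weight * x[rr, cc]
    return value

def deformable_conv_at(x, w, offsets, p0):
    """Compute y(p0) = sum_i w(p_i) * x(p0 + p_i + dp_i) for a 3 x 3 kernel.

    x:       2-D feature map
    w:       (3, 3) kernel weights
    offsets: (3, 3, 2) offsets (d_row, d_col), one per sampling point
    p0:      (row, col) output location
    """
    y = 0.0
    for i, dr in enumerate((-1, 0, 1)):          # regular grid location p_i
        for j, dc in enumerate((-1, 0, 1)):
            d_r, d_c = offsets[i, j]             # learned offset dp_i
            y += w[i, j] * bilinear_sample(x, p0[0] + dr + d_r, p0[1] + dc + d_c)
    return y
```

When all offsets are zero, the result reduces to an ordinary 3 × 3 convolution at $p_0$; nonzero offsets shift each sampling point independently, which is what allows the receptive field to adapt to the shape of a land object.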
Additionally, a series of experiments were conducted to find the optimal patch size k in the range from 9 to 29. The best classification accuracy was achieved when k = 11.

Adaptive Feature Fusion
The sequence of features extracted from each single-source (i.e., both single-date and single-sensor) Sentinel dataset was utilized in the proposed feature fusion module to make the final land cover prediction. As for the fusion method, many previous studies [29,33] simply stacked and concatenated all the input features without considering the importance of each feature. Inspired by squeeze-and-excitation networks (SENets) [25] and our previous work [30], this study proposed a fusion mechanism for feature aggregation of multibranch CNNs, which took the importance of each feature into consideration (Figure 7).
As shown in Figure 7, the feature fusion module was used to recalibrate (or reweight) all the features extracted from each single-branch CNN through a series of squeeze-and-excitation (SE) blocks [25]. First, the input features from each branch were passed through a global average pooling (GAP) layer to generate a channel descriptor. Next, the channel-specific weight was learned with two successive fully connected layers and a sigmoid layer. After all the features from each branch were reweighted, informative features were emphasized and less useful ones were suppressed, which provided a more effective and rational method to achieve feature-level fusion of multitemporal and multisensor Sentinel data.
Finally, all the reweighted features were flattened and concatenated to generate the fused feature vectors. Then, the fused features were fed into a fully connected layer and a softmax layer to calculate conditional probabilities of each land cover category.
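The reweighting and concatenation steps described above can be sketched as follows. The text specifies a GAP layer, two fully connected layers, and a sigmoid; the ReLU between the two fully connected layers and the size of the hidden layer follow the original SENet design [25] and are assumptions here, as are the names `se_reweight` and `fuse_branches`.

```python
import numpy as np

def se_reweight(feat, w1, b1, w2, b2):
    """Squeeze-and-excitation reweighting of one branch.

    feat: branch feature map, shape (h, w, c)
    w1:   (c, hidden), b1: (hidden,)   first fully connected layer
    w2:   (hidden, c), b2: (c,)        second fully connected layer
    """
    z = feat.mean(axis=(0, 1))                           # squeeze: GAP -> (c,)
    hidden = np.maximum(z @ w1 + b1, 0.0)                # FC + ReLU (assumed)
    scale = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))    # FC + sigmoid -> (c,)
    return feat * scale                                  # channel-wise recalibration

def fuse_branches(branch_feats, se_params):
    """Reweight every branch, then flatten and concatenate into one vector."""
    reweighted = [se_reweight(f, *p) for f, p in zip(branch_feats, se_params)]
    return np.concatenate([f.reshape(-1) for f in reweighted])
```

The fused vector would then be passed to the final fully connected and softmax layers. Compared with plain feature stacking, the only difference is the per-channel `scale`, which lets the network down-weight a less informative date or sensor.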

Details of Network Training
Data augmentation was utilized in this study to overcome the limited amount of training data. All the training patches were flipped up and down, flipped left and right, and rotated by 90°, 180°, and 270° to enlarge the training datasets.
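The five augmentations listed above can be written compactly with NumPy. The sketch below returns the original patch together with its augmented copies; whether the original implementation kept the original patch in the same list is an assumption, and `augment` is an illustrative name.

```python
import numpy as np

def augment(patch):
    """Return the original (k, k, c) patch plus its five augmented versions."""
    return [
        patch,
        np.flipud(patch),        # flipped up and down
        np.fliplr(patch),        # flipped left and right
        np.rot90(patch, 1),      # rotated by 90 degrees (in the spatial plane)
        np.rot90(patch, 2),      # rotated by 180 degrees
        np.rot90(patch, 3),      # rotated by 270 degrees
    ]
```

Because the patches are square, all augmented copies keep the k × k × c shape and can be fed to the same network without resizing.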
All the parameters of the MBCNN needed to be trained. Specifically, all the weights were initialized using He initialization [43], while biases were initialized to zero. As for the optimization method, Adam [44] was utilized with a starting learning rate of $10^{-5}$. An early-stop strategy was used to select the best model: only the model with the minimum validation loss was saved.
Focal loss [26] was adopted instead of cross-entropy loss to further boost classification performance. Focal loss played the role of online hard example mining: it down-weighted the loss assigned to well-classified examples and prevented the vast number of easy examples from overwhelming the classifier during training.
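The down-weighting effect of focal loss can be seen in a short sketch. The multiclass form $-(1 - p_t)^{\gamma} \log p_t$ is used, where $p_t$ is the predicted probability of the true class. The focusing parameter is not reported in the text, so γ = 2 is assumed here as a common default, and no class-balancing factor is applied.

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0):
    """Mean multiclass focal loss.

    probs:  (n_samples, n_classes) softmax outputs
    labels: n_samples integer class indices
    gamma:  focusing parameter (assumed value; gamma = 0 gives cross-entropy)
    """
    probs = np.asarray(probs, dtype=float)
    p_t = probs[np.arange(len(labels)), labels]   # probability of the true class
    p_t = np.clip(p_t, 1e-7, 1.0)                 # avoid log(0)
    return float(np.mean(-(1.0 - p_t) ** gamma * np.log(p_t)))
```

With γ = 2, a well-classified sample with $p_t$ = 0.9 has its loss scaled by (1 − 0.9)² = 0.01, whereas a hard sample with $p_t$ = 0.1 is scaled by 0.81, so hard samples dominate the gradient.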
In this study, about 90% of training samples were randomly selected to optimize the parameters of the proposed model. The remaining 10% of training samples were used as a validation set to evaluate classification performance during the training process. The testing set was only used to calculate final overall accuracy and the confusion matrix after the model was well trained.
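A random 90/10 split of the training samples can be sketched as below. The random seed and the function name `split_train_val` are illustrative assumptions; the text does not state whether the split was stratified by class, so a simple unstratified permutation is shown.

```python
import numpy as np

def split_train_val(n_samples, val_fraction=0.1, seed=0):
    """Randomly split sample indices into training and validation subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_val = int(round(n_samples * val_fraction))
    return order[n_val:], order[:n_val]          # (train indices, validation indices)
```

The validation indices returned here would drive the early-stop strategy, while the separate testing samples remain untouched until the final accuracy assessment.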
The proposed MBCNN was trained with the TensorFlow library [45] on the Ubuntu 16.04 operating system with an Intel Core i7-7800 @ 3.5 GHz CPU and an NVIDIA GTX TitanX GPU with 12 GB memory.

Accuracy Assessment
To justify the effectiveness of the proposed method, both visual evaluation and a confusion matrix were utilized in this study. Visual evaluation was used to check obvious classification errors, while a confusion matrix derived from the testing samples was used to quantitatively evaluate classification performance through the following metrics: overall accuracy (OA), producer accuracy (PA), user accuracy (UA), and Kappa coefficient.
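The four metrics can be derived from a confusion matrix as sketched below. The orientation of the matrix (rows as reference classes, columns as predicted classes) is an assumption, since the layout of Table 4 is not reproduced here; `accuracy_metrics` is an illustrative name.

```python
import numpy as np

def accuracy_metrics(cm):
    """Compute OA, per-class PA and UA, and Kappa from a confusion matrix.

    cm[i, j] = number of testing pixels of reference class i predicted as class j
    (orientation assumed).
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    oa = np.trace(cm) / total                    # overall accuracy
    pa = np.diag(cm) / cm.sum(axis=1)            # producer accuracy (per reference class)
    ua = np.diag(cm) / cm.sum(axis=0)            # user accuracy (per predicted class)
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / total ** 2   # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return oa, pa, ua, kappa
```

As a worked example, the two-class matrix [[45, 5], [10, 40]] gives OA = 0.85, chance agreement 0.5, and therefore Kappa = (0.85 − 0.5) / (1 − 0.5) = 0.7.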

Results of Coastal Land Cover Classification
Figure 8 illustrates classification results of the Yellow River Delta using the proposed multibranch CNN and multitemporal, multisensor Sentinel data.From the perspective of visual inspection, the results showed good visual effect, and the spatial distributions of each classified land cover were close to field survey records.Moreover, few obvious omission and commission errors could be found in Figure 8, which also justified the effectiveness of the proposed method.To quantitatively evaluate performance of the proposed method, the confusion matrix, OA, and Kappa coefficient were calculated from the testing samples.The results are shown in Table 4. Table 4 indicated that the proposed multibranch CNN achieved good performance with an OA of 93.78% and a Kappa coefficient of 0.9297.Almost every class demonstrated producer accuracy of more than 89%, except for shrubs, whose PA was only 66.00%.Several shrub pixels were misclassified as forest, grassland, and tidal flat.This was understandable, because the radar backscattering properties of the shrub land cover category were similar to those of the forest category.Meanwhile, shrubs (mainly tamarisks) were sparsely distributed in the coastal wetlands To quantitatively evaluate performance of the proposed method, the confusion matrix, OA, and Kappa coefficient were calculated from the testing samples.The results are shown in Table 4. 
Table 4 indicated that the proposed multibranch CNN achieved good performance with an OA of 93.78% and a Kappa coefficient of 0.9297.Almost every class demonstrated producer accuracy of more than 89%, except for shrubs, whose PA was only 66.00%.Several shrub pixels were misclassified as forest, grassland, and tidal flat.This was understandable, because the radar backscattering properties of the shrub land cover category were similar to those of the forest category.Meanwhile, shrubs (mainly tamarisks) were sparsely distributed in the coastal wetlands surrounded by tidal flats, which caused spectral confusion between shrubs and tidal flat categories.In addition, because of the limited spatial resolution, there were hardly no pure shrub pixels, leading to spectral confusion between shrubs and grassland categories, which could also account for classification errors.
In addition, other classification errors mainly occurred between the forest, grassland, and irrigated land categories as well as between the bare soil and tidal flat categories. This was because of the similarity of spectral and backscattering characteristics between these land cover types.
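The accuracy measures used above (OA, Kappa coefficient, and producer accuracy) can all be derived from a confusion matrix. The following is a minimal sketch of that calculation, assuming that rows hold the reference classes and columns hold the predicted classes. The function name `accuracy_metrics` and the toy 3-class matrix are illustrative only and are not taken from Table 4.

```python
def accuracy_metrics(cm):
    """Return (OA, Kappa, per-class producer accuracy) from a confusion matrix.

    Assumption: cm[i][j] counts test samples of reference class i
    that were predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n_classes))
    row_totals = [sum(row) for row in cm]  # reference totals
    col_totals = [sum(cm[i][j] for i in range(n_classes))
                  for j in range(n_classes)]  # predicted totals

    oa = correct / total
    # Agreement expected by chance, computed from the marginal totals.
    pe = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    kappa = (oa - pe) / (1 - pe)
    # Producer accuracy: correctly classified samples of a class
    # divided by the number of reference samples of that class.
    pa = [cm[i][i] / row_totals[i] if row_totals[i] else float("nan")
          for i in range(n_classes)]
    return oa, kappa, pa


# Toy 3-class matrix (illustrative numbers only).
toy_cm = [
    [45, 3, 2],
    [4, 40, 6],
    [1, 2, 47],
]
oa, kappa, pa = accuracy_metrics(toy_cm)
print(f"OA={oa:.4f}, Kappa={kappa:.4f}, PA={[round(p, 2) for p in pa]}")
```

A low producer accuracy for a single class, as reported for shrubs, lowers the OA only in proportion to the share of test samples belonging to that class, which is why a high OA and a weak class can coexist.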

Impact of Multisensor Data on Classification
As stated earlier, inclusion of both optical and radar data would be expected to improve the accuracy of coastal land cover classification. In this section, single-sensor and multisensor classifications are compared. Specifically, experiments were performed for the following cases: (1) radar-only classification: using only multitemporal radar data from Sentinel-1 for classification; (2) optical-only classification: using only multitemporal optical data from Sentinel-2 for classification; (3) feature-stacking classification: using multitemporal radar and optical data and feature stacking for classification; and (4) proposed MBCNN model: using multitemporal radar and optical data and the proposed MBCNN for classification.
The classification maps for each experiment are illustrated in Figures 9 and 10.
Both Figures 9 and 10 illustrated that inclusion of multisensor data yielded a better classification map with fewer errors when compared with single-sensor classification. Meanwhile, it was difficult to get an accurate classification map by using Sentinel-1 radar data alone. There were many errors among various land cover categories, especially between grassland and irrigated farmland as well as between shrubs and grassland. Nonetheless, using Sentinel-2 optical data alone could achieve a much better classification map. Similar spatial patterns were found among the classification maps yielded by the optical-only and feature-stacking methods and the proposed MBCNN model.
In addition, Figure 10 indicated that the proposed MBCNN could effectively reduce classification errors between forest and shrubs as well as between dry farmland and forest when compared with optical-only and feature-stacking methods.
Table 5 shows detailed class-level classification accuracies (i.e., producer accuracy) for each experiment. It indicated that the proposed MBCNN achieved the highest classification accuracy with an OA of 93.78% and a Kappa of 0.9297, which verified the effectiveness of the proposed model. Radar-only classification had the lowest OA of 64.00%, which was consistent with Figures 9 and 10. The following land cover categories had low accuracies in radar-only classification: salt marsh, shrubs, and bare soil. Meanwhile, optical-only classification demonstrated better performance than radar-only classification. This was mainly because Sentinel-2 could provide distinctive spectral characteristics, which were essential in separating different coastal land cover categories, especially easily confused vegetation types. Table 5 also indicated that the combined use of Sentinel-1 and Sentinel-2 data led to an increase in classification accuracy for almost every coastal land cover category. This was reasonable, because integration of optical and radar features could enhance between-class separability [10,13]. Compared with Sentinel-2 data alone, inclusion of Sentinel-1 data increased OA by 0.96% and 3.24% through feature stacking and the proposed multibranch CNN, respectively.
The adaptive feature fusion method in this study outperformed the feature-stacking method, increasing OA from 91.50% to 93.78%, an improvement of 2.28%. This was because, when features are simply stacked together, the information carried by each feature may not be equally represented [30]. In contrast, a squeeze-and-excitation module can automatically learn the weight of each feature according to its importance, fusing multiple features in a more reasonable and effective way.
In addition, Table 5 indicated that when using SAR data alone, it was difficult for the classification model to separate shrubs from other land cover types, which meant that the image features learned for shrubs were very weak. However, those weak SAR features of shrubs still existed and were enhanced by the adaptive feature fusion method in this paper, which in turn contributed to the accuracy improvement when combined with optical data.
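The difference between feature stacking and squeeze-and-excitation style weighting can be illustrated with a small sketch. This is not the fusion module of this study: the branch features, the single-unit bottleneck, and the fixed weight matrices `w1` and `w2` are hypothetical values chosen for illustration, whereas in a trained network the weights would be learned.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def feature_stacking(branches):
    """Baseline: concatenate branch features without any reweighting."""
    return [x for branch in branches for x in branch]


def se_fusion(branches, w1, w2):
    """SE-style fusion: squeeze, excite, rescale, then concatenate."""
    squeezed = [sum(b) / len(b) for b in branches]    # squeeze: one scalar per branch
    hidden = [relu(h) for h in matvec(w1, squeezed)]  # bottleneck layer
    gates = [sigmoid(g) for g in matvec(w2, hidden)]  # excitation: gate in (0, 1)
    fused = [g * x for g, b in zip(gates, branches) for x in b]
    return fused, gates


# Two hypothetical branches, e.g. one radar and one optical feature vector.
branches = [[0.2, 0.4], [1.0, 3.0]]
w1 = [[1.0, 0.5]]      # hypothetical 1x2 bottleneck weights
w2 = [[1.0], [-1.0]]   # hypothetical 2x1 expansion weights

stacked = feature_stacking(branches)
fused, gates = se_fusion(branches, w1, w2)
print("stacked:", stacked)
print("gates:  ", [round(g, 3) for g in gates])
print("fused:  ", [round(x, 3) for x in fused])
```

With stacking, every branch enters the classifier at its raw magnitude. With gating, each branch is rescaled by a data-dependent factor, which is the mechanism by which weak but informative features can be emphasized.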

Impact of Multitemporal Data on Classification
The role of multitemporal data in coastal land cover classification should also be verified. In this section, we conducted a series of experiments for monotemporal classification. In each single-date experiment, radar data from Sentinel-1 and optical data from Sentinel-2 were involved. The classification maps and overall accuracy for each single-date dataset are shown in Figure 11, Figure 12, and Table 6. Both Figures 11 and 12 show that when compared with single-date classification, the inclusion of multitemporal data improved classification performance. The multitemporal classification map showed fewer obvious mistakes, especially between forest and shrubs, irrigated farmland and grassland, and dry farmland and bare soil. This was because phenological information conveyed by multitemporal data enhanced separability among different vegetation types [8]. This was in accordance with the quantitative evaluation shown in Table 6. By introducing temporal information, classification accuracy was boosted by 1.15%-11.85%, with an average increase of 6.85%, which justified the importance of multitemporal data in coastal land cover classification.
Table 6 also indicated that the classification accuracy for date T1 (April 2018) was notably lower than that of other dates, with an OA of 81.93% and a Kappa of 0.7957. This was also consistent with Figures 11 and 12. This was mainly because most of the vegetation, except for winter wheat, had only started to turn green in April. The differences among vegetation types were not distinct from either the spectral or the backscattering perspective, which resulted in low between-class separability and poor classification performance.
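The reported gains can be cross-checked with simple arithmetic. The sketch below uses only figures quoted in the text (multitemporal OA of 93.78%, T1 OA of 81.93%, and gains of 1.15%-11.85% with a mean of 6.85%). The derived single-date values are implied by those figures and are not read from Table 6; the variable names are illustrative.

```python
# Figures quoted in the text (percentage points of OA).
oa_multitemporal = 93.78
oa_t1 = 81.93
gain_min, gain_max, gain_mean = 1.15, 11.85, 6.85

# Gain over the weakest date (T1) should equal the reported maximum gain.
gain_t1 = round(oa_multitemporal - oa_t1, 2)

# Single-date OAs implied by the reported minimum and mean gains.
best_single_date_oa = round(oa_multitemporal - gain_min, 2)
mean_single_date_oa = round(oa_multitemporal - gain_mean, 2)

print("gain over T1:", gain_t1)
print("implied best single-date OA:", best_single_date_oa)
print("implied mean single-date OA:", mean_single_date_oa)
```

The gain over T1 matches the upper end of the reported range, so T1 is the date that benefits most from adding temporal information.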

Impact of Deformable Convolution on Classification
In contrast with previous land cover classification methods based on deep learning [29,30,[33][34][35][36][37], we introduced deformable convolution to model fragmented coastal landscapes. To better interpret the impact of deformable convolution on classification, a contrast experiment was conducted in this section. In the experiment, all the deformable convolutional layers in the MBCNN were replaced by standard convolutional layers. The accuracy comparison between standard and deformable convolution is shown in Table 7. Table 7 indicated that when compared with standard convolution, introduction of deformable convolution improved the OA from 91.69% to 93.78%, an increase of 2.09%, which verified the effectiveness of deformable convolution. In fact, in complex heterogeneous landscapes such as coastal areas, a big challenge in land cover classification is the variation in the shapes and scales of land objects. Because of its fixed kernel shape, standard convolution could not capture these variations, which resulted in inferior performance. However, by utilizing deformable receptive fields [23,24], which adapt to the shape and scale of objects in the input remote sensing data, deformable convolution extracted more representative features, showing better performance in complex coastal land cover classification when compared with standard convolution.
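The idea of a deformable receptive field can be sketched for a single output location. In the minimal example below, each of the nine sampling points of a 3x3 kernel is shifted by a fractional offset and the input is read by bilinear interpolation. The offsets are fixed by hand here as an assumption for illustration; in a deformable convolutional layer they are predicted from the input by an additional convolution. The function names and the toy image are hypothetical, and this is not the implementation used in this study.

```python
import math


def bilinear(img, y, x):
    """Sample img at fractional (y, x); positions outside the image count as zero."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * img[yy][xx]
    return value


def deformable_conv_at(img, kernel, cy, cx, offsets):
    """3x3 response at (cy, cx); offsets[k] = (dy, dx) shifts the k-th sampling point."""
    out, k = 0.0, 0
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            dy, dx = offsets[k]
            out += kernel[i + 1][j + 1] * bilinear(img, cy + i + dy, cx + j + dx)
            k += 1
    return out


# Toy 4x4 single-band image with a linear ramp, and an averaging kernel.
img = [[4 * r + c for c in range(4)] for r in range(4)]
kernel = [[1 / 9] * 3 for _ in range(3)]

zero_offsets = [(0.0, 0.0)] * 9      # reduces to standard convolution
shifted_offsets = [(0.5, 0.5)] * 9   # hand-picked fractional shift

standard = deformable_conv_at(img, kernel, 1, 1, zero_offsets)
shifted = deformable_conv_at(img, kernel, 1, 1, shifted_offsets)
print("standard:", round(standard, 4), "deformed:", round(shifted, 4))
```

With all offsets set to zero the result equals a standard 3x3 convolution, which shows that standard convolution is the special case in which the sampling grid cannot move.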

Comparison with Machine Learning Methods
Machine learning-based methods, such as the maximum likelihood classifier (MLC), random forest (RF), and support vector machine (SVM), have long been used for land cover mapping in the remote sensing field. To further justify the performance of the proposed method, it should be compared with those widely used machine learning methods. Specifically, for RF [46], we used 200 decision trees with a max depth of 13 and the Gini coefficient [46] as the indicator for feature selection. For SVM [47], we used the radial basis function [47] as the kernel function with a gamma [47] of 0.01 and a penalty coefficient C [47] of 100. The parameters of the RF and SVM classifiers were determined with a grid-search method. For RF, the number of trees was searched between 100 and 300 and the max depth between 3 and 15. For SVM, gamma was searched between 0.001 and 0.1 and C between 20 and 200.
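The grid-search procedure described above can be sketched as an exhaustive loop over parameter combinations. The search ranges below follow the text, but the step sizes are assumptions, since the text does not state them. The scoring function is a toy stand-in that simply peaks at the values reported in the text so that the loop has something to maximise; it is not a measured accuracy, and a real run would train and validate a classifier for each combination instead.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Evaluate every parameter combination; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(target):
    """Placeholder objective (NOT real accuracy): highest at the `target` values."""
    def score(params):
        return -sum(abs(params[k] - target[k]) / abs(target[k]) for k in target)
    return score


# Ranges from the text; step sizes are assumed for illustration.
rf_grid = {"n_trees": list(range(100, 301, 50)),
           "max_depth": list(range(3, 16, 2))}
svm_grid = {"gamma": [0.001, 0.01, 0.1],
            "C": [20, 50, 100, 200]}

best_rf, _ = grid_search(rf_grid, toy_score({"n_trees": 200, "max_depth": 13}))
best_svm, _ = grid_search(svm_grid, toy_score({"gamma": 0.01, "C": 100}))
print("RF: ", best_rf)
print("SVM:", best_svm)
```

The cost of such a search grows with the product of the grid sizes (35 RF combinations and 12 SVM combinations in this sketch), which is why coarse steps are commonly used for a first pass.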
In addition, all the above methods were trained and tested using the same training and testing samples as the proposed multibranch CNN in this study to keep the comparison objective. The results of the accuracy comparisons are listed in Table 8. Table 8 indicated that the traditional machine learning methods showed inferior performance to the proposed method in coastal land cover classification. The proposed multibranch CNN outperformed MLC, RF, and SVM with an increase in OA of 19.13%, 8.80%, and 6.27%, respectively. This was mainly because the proposed deep convolutional neural network could learn high-level and discriminative representations of complex and fragmented coastal landscapes.

Comparison with Other Land Cover Classification Methods
Because the main objective of this study was to propose a deep learning-based method for coastal land cover classification, it was necessary to compare our proposed method with other classification methods (Table 9) to further demonstrate both the merits and limitations of the proposed method. It should be noted that because of the differences in the study area, number of training samples, and classified categories, it was difficult to directly compare these methods based on classification accuracies alone. Therefore, we mainly focused on the merits and shortcomings of each method. Specifically, both Rezaee et al. [34] and Huang et al. [29] achieved good accuracies using CNN-based models and monotemporal, single-sensor data for wetland land cover and urban land use classification. Meanwhile, Mahdianpari et al. [38] investigated well-known deep learning models (e.g., ResNet, DenseNet, InceptionResNet, etc.) for wetland mapping and demonstrated that InceptionResNetV2 showed the best accuracy (96.17%). They concluded that CNNs outperformed traditional machine learning methods (e.g., random forest) in the context of complex heterogeneous landscapes, which was consistent with our findings. However, neither multitemporal nor multisensor datasets were incorporated, meaning that these methods lacked the ability to comprehensively characterize the land surface.
From the perspective of multitemporal classification, Rußwurm et al. [35] utilized recurrent neural networks (RNNs) and multitemporal Sentinel-2 data for land cover classification, and they achieved good performance. They concluded that RNNs were appropriate for modeling the relationships in sequential remote sensing data and that they showed high accuracy in multitemporal classification. Unlike Rußwurm et al. [35], the MBCNN in this study fused temporal features through a feature fusion module that directly learned the importance of each temporal feature to classification performance. Ji et al. [36] utilized a 3D CNN to learn spatio-temporal features for crop classification based on multitemporal optical data, a method that also showed high accuracy. However, when compared with the MBCNN, which was based on two-dimensional (2D) convolution, the 3D CNN had the drawbacks of high computational complexity and gradient vanishing along the depth channel.
In the context of multisensor fusion and classification, Xu et al. [33] adopted a two-branch CNN for urban land use classification based on hyperspectral and light detection and ranging (LiDAR) data. As for the method of data fusion, Xu et al. [33] simply used feature stacking without considering the importance of each feature. Scarpa et al. [37] studied the fusion of Sentinel-1 and Sentinel-2 data based on deep learning. They first stacked all multitemporal Sentinel-1 and Sentinel-2 data and utilized a CNN to extract features from the stacked data. Their fusion method therefore operated more at the data level than at the feature level, which may weaken the robustness of the fused features. Compared with these studies, we constructed a feature-level fusion method that took the importance of each feature into consideration, which could increase the representativeness and robustness of the output features. Moreover, the above studies did not consider variations in the shapes and scales of land objects, which was one of the most important factors limiting land cover classification accuracy. To tackle this issue, deformable convolution, which could extract robust features regardless of shape and scale variations, was introduced in this study.
Overall, Table 9 indicated that the proposed MBCNN could achieve good classification performance when compared with state-of-the-art methods. Additionally, the proposed method could be used for crop type mapping through joint use of Sentinel-1 and Sentinel-2 data for crop growth monitoring and yield estimation on the regional scale [48][49][50][51].

Conclusions
This paper proposed a multibranch convolutional neural network for fusion of multitemporal and multisensor Sentinel data for coastal land cover classification. The proposed neural network leverages a series of single-branch CNNs for feature extraction from single-date and single-sensor Sentinel data. Deformable convolutions and multiscale residual blocks were introduced to account for the variations in shapes and scales of coastal land objects. Features extracted from each branch were then aggregated using an adaptive fusion module to make the final land cover predictions.
The experiments were performed in the Yellow River Delta, which is the largest natural delta in China. The results indicated that the proposed multibranch CNN achieved good performance with an overall accuracy of 93.78% and a Kappa coefficient of 0.9297. The introduction of deformable convolutions increased the OA by 2.09%, which justified its role in modeling complex and fragmented coastal landscapes. Meanwhile, inclusion of multitemporal data improved the OA by 1.15%-11.85%, with an average increase of 6.85%, which justified the importance of temporal information in coastal land cover classification. Moreover, when compared with optical data alone, the inclusion of radar data increased the OA from 90.54% to 93.78%, an improvement of 3.24%, which indicated that the fusion of multisensor Sentinel data could enhance the separability of coastal land cover types. However, radar data alone could not produce an accurate classification result. The proposed adaptive fusion method improved the OA by 2.28% when compared with the feature-stacking method, which also justified its effectiveness in multisource data fusion.
This paper demonstrates that the proposed multibranch CNN can effectively extract and integrate features from multitemporal and multisensor Sentinel-1 and Sentinel-2 remote sensing data, which achieves good performance in coastal land cover classification. In addition, the proposed network architecture can be considered as a general framework for multitemporal and multisensor data fusion. Future work should consider more study cases to further verify the effectiveness of the proposed MBCNN.

Figure 3. An overview of the proposed multibranch convolutional neural network (CNN).

Figure 4. Architecture of the proposed single-branch CNN.

Figure 5. Architecture of the deformable multiscale residual blocks.

Figure 8. Classification map of the Yellow River Delta generated by the multibranch CNN.

Table 1. Classification scheme of the Yellow River Delta.

Table 3. Detailed information of the single-branch CNN.

Table 4. Confusion matrix of the proposed method.

Table 7. Accuracy comparison between standard and deformable convolutions.

Table 8. Accuracy comparison with machine learning methods.

Table 9. Overview of recently published land cover/use classification methods.